PhenoFeatureFinder.omics_analysis
=================================

.. py:module:: PhenoFeatureFinder.omics_analysis


Classes
-------

.. autoapisummary::

   PhenoFeatureFinder.omics_analysis.OmicsAnalysis


Module Contents
---------------

.. py:class:: OmicsAnalysis(metabolome_csv, metabolome_feature_id_col='feature_id')

   A class to streamline the filtering and exploration of a metabolome dataset.

   :param metabolome_csv: A path to a .csv file with the metabolome data (scaled or unscaled).
                          Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples.
   :type metabolome_csv: str
   :param metabolome_feature_id_col: The name of the column that contains the feature identifiers (default is 'feature_id').
                                     Feature identifiers should be unique (not duplicated).
   :type metabolome_feature_id_col: str, optional

   .. attribute:: metabolome

      The metabolome Pandas dataframe imported from the .csv file.

      :type: `pandas.core.frame.DataFrame`, (n_samples, n_features)

   .. attribute:: metabolome_validated

      Is the metabolome dataset validated? Default is False.

      :type: `bool`

   .. attribute:: blank_features_filtered

      Are the features present in blank samples filtered out from the metabolome data? Default is False.

      :type: `bool`

   .. attribute:: filtered_by_percentile_value

      Are the features filtered by percentile value?

      :type: `bool`

   .. attribute:: unreliable_features_filtered

      Are the features not reliably present within one group filtered out from the metabolome data?

      :type: `bool`

   .. attribute:: pca_performed

      Has PCA been performed on the metabolome data? Default is False.

      :type: `bool`

   .. attribute:: exp_variance

      A Pandas dataframe with explained variance per Principal Component.
      The index of the dataframe contains the PC index (PC1, PC2, etc.).
      The column contains the percentage of the explained variance per PC.

      :type: `pandas.core.frame.DataFrame`, (n_pc, 1)

   .. attribute:: metabolome_pca_reduced

      Numpy array with sample coordinates in reduced dimensions.
      The dimension of the numpy array is the minimum of the number of samples and features.

      :type: `numpy.ndarray`, (n_samples, n_pc)

   .. attribute:: sparsity

      Metabolome matrix sparsity.

      :type: float

   .. method:: validate_input_metabolome_df()

      Check if the provided metabolome file is suitable. Sets the attribute metabolome_validated to True.

   .. method:: discard_features_detected_in_blanks()

      Removes features detected in blank samples.

   .. method:: impute_missing_values_with_median()

      Impute missing values with the median value of the feature.

   .. method:: filter_out_unreliable_features()

      Filter out features not reliably detectable in replicates of the same grouping factor.
      For instance, if a feature is detected fewer than 4 times within 4 biological replicates,
      it is discarded with argument nb_times_detected=4.

   .. method:: filter_features_per_group_by_percentile()

      Filter out features whose abundance within the same grouping factor is lower than a certain percentile value.
      For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.

   .. method:: compute_metabolome_sparsity()

      Computes the sparsity percentage of the metabolome matrix
      (percentage of 0 values, e.g. 100% for a matrix full of 0 values).

   .. method:: write_clean_metabolome_to_csv()

      Write the filtered and analysis-ready metabolome data to a .csv file.

   .. rubric:: Notes

   Example of an input metabolome format (from a csv file)

   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+
   | feature_id           | blank_1 | blank_2 | blank_3 | blank_4 | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 | LA1330_3 | LA1330_4 |
   +======================+=========+=========+=========+=========+=======+=======+=======+=======+==========+==========+==========+==========+
   | rt-0.04_mz-241.88396 | 280     | 694     | 502     | 604     | 554   | 678   | 674   | 936   | 824      | 940      | 794      | 828      |
   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+
   | rt-0.05_mz-143.95911 | 1036    | 1566    | 1326    | 1490    | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     | 1730     | 1568     |
   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+
   | rt-0.06_mz-124.96631 | 1308    | 992     | 1060    | 1010    | 742   | 990   | 0     | 888   | 786      | 668      | 762      | 974      |
   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+
   | rt-0.08_mz-553.45905 | 11340   | 12260   | 10962   | 11864   | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    | 11122    | 11260    |
   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+
   | rt-0.08_mz-413.26631 | 984     | 1162    | 1292    | 1104    | 1090  | 1106  | 1290  | 1170  | 1282     | 924      | 1172     | 1062     |
   +----------------------+---------+---------+---------+---------+-------+-------+-------+-------+----------+----------+----------+----------+

   .. rubric:: Example

   >>> met = OmicsAnalysis(
   ...     metabolome_csv='my_metabolome_data.csv',
   ...     metabolome_feature_id_col='feature_id')
   >>> met.validate_input_metabolome_df()
   Metabolome input data validated

   .. seealso::

      :obj:`scikit-learn`


   .. py:attribute:: metabolome_validated
      :value: False


   .. py:attribute:: blank_features_filtered
      :value: False


   .. py:attribute:: filtered_by_percentile_value
      :value: False


   .. py:attribute:: unreliable_features_filtered
      :value: False


   .. py:attribute:: pca_performed
      :value: False


   .. py:attribute:: sparsity
      :value: None


   .. py:attribute:: metabolome


   .. py:method:: validate_input_metabolome_df(metabolome_feature_id_col='feature_id')

      Validates the dataframe containing the feature identifiers, metabolite values and sample names.
      Places the 'metabolome_feature_id_col' column as the index of the validated dataframe.
      The validated metabolome dataframe is stored as the 'validated_metabolome' attribute.

      :param metabolome_feature_id_col: The name of the column that contains the feature identifiers (default is 'feature_id').
                                        Feature identifiers should be unique (not duplicated).
      :type metabolome_feature_id_col: str, optional

      :returns: **self** -- Object with attribute metabolome_validated set to True if tests are passed.
      :rtype: object

      .. rubric:: Notes

      Example of a valid input metabolome dataframe

      +-------------+----------------+----------------+----------------+----------------+
      | feature_id  | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
      +=============+================+================+================+================+
      | metabolite1 | 1246           | 1245           | 12345          | 12458          |
      +-------------+----------------+----------------+----------------+----------------+
      | metabolite2 | 0              | 0              | 0              | 0              |
      +-------------+----------------+----------------+----------------+----------------+
      | metabolite3 | 10             | 0              | 0              | 154            |
      +-------------+----------------+----------------+----------------+----------------+


   .. py:method:: impute_missing_values_with_median(missing_value_str='np.nan')

      Imputes missing values with the median of the column.

      :param missing_value_str: The string that represents missing values in the input dataframe.
                                All occurrences of missing_value_str will be imputed.
                                For pandas dataframes with nullable integer dtypes with missing values,
                                missing_value_str can be set to either np.nan or pd.NA.
      :type missing_value_str: str, optional

      :returns: **self** -- Object with attribute 'metabolome' updated with imputed values.
      :rtype: object


   .. py:method:: discard_features_detected_in_blanks(blank_sample_contains='blank')

      Removes features detected in blank samples.

      Steps:

      1. Sums the abundance of each feature in the blank samples.
      2. Makes a list of features to be discarded (features with a positive summed abundance).
      3. Returns a filtered Pandas dataframe with only features not detected in blank samples.

      :param blank_sample_contains: Column names containing this string will be considered blank samples. Default is 'blank'.
      :type blank_sample_contains: str, optional

      :returns: **metabolome** -- A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.
      :rtype: pandas.core.frame.DataFrame


   .. py:method:: create_density_plot(name_grouping_var='genotype', n_cols=3, nbins=1000)

      For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas.
      This plot helps to see whether some groups have a value distribution different from the rest.
      The percentage is indicated on the y-axis (bar heights sum to 100).

      :param name_grouping_var: The name used when splitting between replicate and main factor.
                                For example "genotype" when splitting MM_rep1 into 'MM' and 'rep1'.
                                Default is 'genotype'.
      :type name_grouping_var: str, optional
      :param n_cols: The number of columns for the final plot.
      :type n_cols: int, optional
      :param nbins: The number of bins to create.
      :type nbins: int, optional

      :returns: Returns the Axes object with the density plots drawn onto it.
      :rtype: matplotlib Axes


   .. py:method:: filter_features_per_group_by_percentile(name_grouping_var='genotype', separator_replicates='_', percentile=50)

      Filter the metabolome dataframe based on a selected percentile threshold.
      Features with peak area values lower than the selected percentile will be discarded.
      The percentile value is calculated per grouping variable.
      For instance, selecting the 50th percentile (median) will discard, in each group,
      the features with a peak area lower than the median.

      :param name_grouping_var: The name of the grouping variable (default is "genotype").
      :type name_grouping_var: str, optional
      :param separator_replicates: The character used to separate the main grouping variable from biological replicates.
                                   Default is "_" (underscore).
      :type separator_replicates: str, optional
      :param percentile: The percentile threshold. Must be between 0 and 100.
      :type percentile: float, optional

      :returns: **self** -- The object with the .metabolome attribute filtered and the filtered_by_percentile_value attribute set to True.
      :rtype: object

      .. rubric:: Example

      >>> met = OmicsAnalysis(
      ...     metabolome_csv='tests/metabolome_test_data.csv',
      ...     metabolome_feature_id_col='feature_id')
      >>> met.validate_input_metabolome_df()
      Metabolome input data validated
      >>> met.discard_features_detected_in_blanks(blank_sample_contains="blank")
      >>> met.metabolome.shape
      (7544, 32)
      >>> met.filter_features_per_group_by_percentile(percentile=90)
      >>> met.metabolome.shape
      (3171, 32)

      .. seealso::

         :obj:`create_density_plot`


   .. py:method:: filter_out_unreliable_features(name_grouping_var='genotype', nb_times_detected=4, separator_replicates='_')

      Removes features not reliably detectable in multiple biological replicates from the same grouping factor.
      Takes a dataframe with feature identifiers in index and samples as columns.

      Step 1: Melt the dataframe and split the sample names to generate the grouping variable.

      Step 2: Count the number of times a metabolite is detected in each group.
      If the number of times detected in a group equals the number of biological replicates, the feature is considered reliable.
      Each feature receives a tag 'reliable' or 'not_reliable'.

      Step 3: Discard the 'not_reliable' features and keep the filtered dataframe.

      :param name_grouping_var: The name used when splitting between replicate and main factor.
                                For example "genotype" when splitting MM_rep1 into 'MM' and 'rep1'.
                                Default is 'genotype'.
      :type name_grouping_var: str, optional
      :param nb_times_detected: Number of times a metabolite should be detected to be considered 'reliable'.
                                Should be equal to the number of biological replicates for a given group of interest (e.g. genotype).
                                Default is 4.
      :type nb_times_detected: int, optional
      :param separator_replicates: The separator to split sample names into a grouping variable (e.g. genotype)
                                   and the biological replicate number (e.g. 1). Default is "_".
      :type separator_replicates: str, optional

      :returns: **metabolome** -- A Pandas dataframe with only features considered as reliable, sample names and their values.
      :rtype: pandas.core.frame.DataFrame

      .. rubric:: Notes

      Input dataframe

      +----------------------+-------+-------+-------+-------+----------+----------+
      | feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
      +======================+=======+=======+=======+=======+==========+==========+
      | rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.06_mz-124.96631 | 0     | 0     | 0     | 888   | 786      | 668      |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |
      +----------------------+-------+-------+-------+-------+----------+----------+

      Output dataframe (rt-0.06_mz-124.96631 is removed because it is detected only once in the MM group: three zeros and one value of 888)

      +----------------------+-------+-------+-------+-------+----------+----------+
      | feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
      +======================+=======+=======+=======+=======+==========+==========+
      | rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |
      +----------------------+-------+-------+-------+-------+----------+----------+


   .. py:method:: write_clean_metabolome_to_csv(path_of_cleaned_csv='./data_for_manuals/filtered_metabolome.csv')

      Verifies that the metabolome dataset has been cleaned up,
      then writes the metabolome data to disk as a comma-separated values file.

      :param path_of_cleaned_csv: The path and filename of the .csv file to save.
                                  Defaults to "./data_for_manuals/filtered_metabolome.csv".
      :type path_of_cleaned_csv: str, optional


   .. py:method:: compute_pca_on_metabolites(scale=True, n_principal_components=10, auto_transpose=True)

      Performs a Principal Component Analysis (PCA) on the metabolome data.
      The PCA returns the transformed coordinates of the samples in a new space.
      It also gives the percentage of variance explained by each Principal Component.

      Assumes that the number of samples is lower than the number of features/metabolites.
      Transposes the metabolite dataframe if n_samples > n_features (this can be turned off with auto_transpose).

      :param scale: Standardize the metabolite values to zero mean and unit variance.
                    Default is True.
      :type scale: `bool`, optional
      :param n_principal_components: Number of principal components to keep in the PCA.
                                     If the number of PCs > min(n_samples, n_features), it is set to min(n_samples, n_features).
                                     Default is to calculate 10 components.
      :type n_principal_components: int, optional
      :param auto_transpose: If n_samples > n_features, performs a transpose of the feature matrix.
                             Default is True (meaning that transposing will occur if n_samples > n_features).
      :type auto_transpose: `bool`, optional

      :returns: **self** -- Object with .exp_variance (dataframe with explained variance per Principal Component),
                .metabolome_pca_reduced (numpy array with samples in reduced dimensions)
                and .pca_performed (boolean set to True).
      :rtype: object


   .. py:method:: create_scree_plot(plot_file_name=None)

      Returns a barplot with the explained variance per Principal Component.
      Has to be preceded by compute_pca_on_metabolites().

      :param plot_file_name: Path to a file where the plot will be saved.
                             For instance 'my_scree_plot.pdf'. Default is None.
      :type plot_file_name: str, optional

      :returns: Returns the Axes object with the scree plot drawn onto it.
                Optionally a saved image of the plot.
      :rtype: matplotlib Axes


   .. py:method:: create_sample_score_plot(pc_x_axis=1, pc_y_axis=2, name_grouping_var='genotype', separator_replicates='_', show_color_legend=True, plot_file_name=None)

      Returns a sample score plot of the samples on PCx vs PCy.
      Samples are colored based on the grouping variable (e.g.
      genotype).

      :param pc_x_axis: Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).
      :type pc_x_axis: int, optional
      :param pc_y_axis: Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).
      :type pc_y_axis: int, optional
      :param name_grouping_var: Name of the variable used to color samples (default is "genotype").
      :type name_grouping_var: str, optional
      :param separator_replicates: String separator that separates grouping factor from biological replicates (default is underscore "_").
      :type separator_replicates: str, optional
      :param show_color_legend: Add legend for hue (default is True).
      :type show_color_legend: bool, optional
      :param plot_file_name: A file name and its path to save the sample score plot (default is None).
                             For instance "mydir/sample_score_plot.pdf".
                             Path is relative to current working directory.
      :type plot_file_name: str, optional

      :returns: Returns the Axes object with the sample score plot drawn onto it.
                Samples are colored by specified grouping variable.
                Optionally a saved image of the plot.
      :rtype: matplotlib Axes


   .. py:method:: compute_metabolome_sparsity()

      Determine the sparsity of the metabolome matrix.
      Formula: number of zero values / number of values * 100.
      The higher the sparsity, the more zero values.

      :returns: **self** -- Object with sparsity attribute filled (sparsity is a float).
      :rtype: object

      .. rubric:: References

      https://stackoverflow.com/questions/38708621/how-to-calculate-percentage-of-sparsity-for-a-numpy-array-matrix


   .. py:method:: plot_features_in_upset_plot(seperator_replicates='_', plot_file_name=None)

      Visualises the presence of features per group in an UpSet plot.
      A feature is considered present in a group if the median>0.

      :param seperator_replicates: The separator to split sample names into a grouping variable (e.g. genotype)
                                   and the biological replicate number (e.g.
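                                   1). Default is "_".
      :type seperator_replicates: str, optional
      :param plot_file_name: A file name and its path to save the UpSet plot (default is None).
                             For instance "mydir/feature_upset_plot.pdf".
                             Path is relative to current working directory.
      :type plot_file_name: str, optional

      :returns: UpSet plot with features presence per group.
      :rtype: Plot

      .. rubric:: Notes

      Input dataframe

      +----------------------+-------+-------+-------+-------+----------+----------+
      | feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
      +======================+=======+=======+=======+=======+==========+==========+
      | rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.06_mz-124.96631 | 0     | 0     | 0     | 888   | 786      | 668      |
      +----------------------+-------+-------+-------+-------+----------+----------+
      | rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |
      +----------------------+-------+-------+-------+-------+----------+----------+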
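
The presence rule stated for the UpSet plot (a feature counts as present in a group when its median across replicates is above zero) can be illustrated with plain pandas on the input dataframe from the Notes. This is a minimal sketch under stated assumptions, not the library's implementation: ``presence_per_group`` is a hypothetical helper, and sample names are assumed to split on the last separator into a group name and a replicate number.

```python
import pandas as pd

# Input dataframe from the Notes: features as rows, samples as columns.
metabolome = pd.DataFrame(
    {
        "MM_1": [554, 1364, 0, 10972],
        "MM_2": [678, 1340, 0, 11190],
        "MM_3": [674, 1692, 0, 12172],
        "MM_4": [936, 1948, 888, 11820],
        "LA1330_1": [824, 1928, 786, 12026],
        "LA1330_2": [940, 1956, 668, 11604],
    },
    index=pd.Index(
        [
            "rt-0.04_mz-241.88396",
            "rt-0.05_mz-143.95911",
            "rt-0.06_mz-124.96631",
            "rt-0.08_mz-553.45905",
        ],
        name="feature_id",
    ),
)


def presence_per_group(df, separator="_"):
    """Hypothetical helper (not part of PhenoFeatureFinder).

    Returns a boolean dataframe (features x groups) that is True where
    the median of a feature within a group is greater than zero.
    """
    # Assumption: the group name is everything before the last separator,
    # e.g. "LA1330_2" -> "LA1330".
    groups = df.columns.str.rsplit(separator, n=1).str[0].to_numpy()
    # Group the samples (rows of the transposed frame) and take the median.
    medians = df.T.groupby(groups).median().T
    return medians > 0


presence = presence_per_group(metabolome)
print(presence)
```

With this data, rt-0.06_mz-124.96631 is flagged as absent in the MM group (three zeros and one 888 give a median of 0) and present in LA1330, while the other three features are present in both groups.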