PhenoFeatureFinder.omics_analysis

Classes

OmicsAnalysis

A class to streamline the filtering and exploration of a metabolome dataset.

Module Contents

class PhenoFeatureFinder.omics_analysis.OmicsAnalysis(metabolome_csv, metabolome_feature_id_col='feature_id')

A class to streamline the filtering and exploration of a metabolome dataset.

Parameters:

metabolome_csv (str) – A path to a .csv file with the metabolome data (scaled or unscaled). Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples
metabolome_feature_id_col (str, optional) – The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (no duplicates).

metabolome

The metabolome Pandas dataframe imported from the .csv file.

Type:: pandas.core.frame.DataFrame, (n_samples, n_features)

metabolome_validated

Is the metabolome dataset validated? Default is False.

Type:: bool

blank_features_filtered

Are the features present in blank samples filtered out from the metabolome data? Default is False.

Type:: bool

filtered_by_percentile_value

Are the features filtered by percentile value?

Type:: bool

unreliable_features_filtered

Are the features not reliably present within one group filtered out from the metabolome data?

Type:: bool

pca_performed

Has PCA been performed on the metabolome data? Default is False.

Type:: bool

exp_variance

A Pandas dataframe with the explained variance per Principal Component. The index of the dataframe contains the PC labels (PC1, PC2, etc.). The single data column contains the percentage of variance explained by each PC.

Type:: pandas.core.frame.DataFrame, (n_pc, 1)

metabolome_pca_reduced

Numpy array with the sample coordinates in reduced dimensions. The number of components (n_pc) is at most the minimum of the number of samples and the number of features.

Type:: numpy.ndarray, (n_samples, n_pc)

sparsity

Metabolome matrix sparsity.

Type:: float

validate_input_metabolome_df(): Check if the provided metabolome file is suitable. Turns attribute metabolome_validated to True.

discard_features_detected_in_blanks(): Removes features only detected in blank samples.

impute_missing_values_with_median(): Impute missing values with the median value of the feature.

filter_out_unreliable_features(): Filter out features not reliably detectable in replicates of the same grouping factor. For instance, if a feature is detected less than 4 times within 4 biological replicates, it is discarded with argument nb_times_detected=4.

filter_features_per_group_by_percentile(): Filter out features whose abundance within the same grouping factor is lower than a certain percentile value. For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.

compute_metabolome_sparsity(): Computes the sparsity percentage of the metabolome matrix (percentage of 0 values e.g. 100% for an matrix full of 0 values)

write_clean_metabolome_to_csv(): Write the filtered and analysis-ready metabolome data to a .csv file.

Notes

Example of an input metabolome format (from a .csv file)

feature_id	blank_1	blank_2	blank_3	blank_4	MM_1	MM_2	MM_3	MM_4	LA1330_1	LA1330_2	LA1330_3	LA1330_4
rt-0.04_mz-241.88396	280	694	502	604	554	678	674	936	824	940	794	828
rt-0.05_mz-143.95911	1036	1566	1326	1490	1364	1340	1692	1948	1928	1956	1730	1568
rt-0.06_mz-124.96631	1308	992	1060	1010	742	990	0	888	786	668	762	974
rt-0.08_mz-553.45905	11340	12260	10962	11864	10972	11190	12172	11820	12026	11604	11122	11260
rt-0.08_mz-413.26631	984	1162	1292	1104	1090	1106	1290	1170	1282	924	1172	1062
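The requirement that feature identifiers be unique can be checked before loading a file. The following is a minimal sketch using only the Python standard library, not the package's own validation code; the helper name find_duplicated_feature_ids is hypothetical, and the sample text reuses two rows and four columns of the table above.

```python
import csv
import io
from collections import Counter

# Two rows and a subset of columns from the example table, as comma-separated text.
csv_text = """feature_id,blank_1,blank_2,MM_1,MM_2
rt-0.04_mz-241.88396,280,694,554,678
rt-0.05_mz-143.95911,1036,1566,1364,1340
"""


def find_duplicated_feature_ids(text, feature_id_col="feature_id"):
    """Return the feature identifiers that occur more than once (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(row[feature_id_col] for row in reader)
    return [feature for feature, n in counts.items() if n > 1]


# An empty list means every feature identifier is unique.
duplicates = find_duplicated_feature_ids(csv_text)
```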

Example

>>> from PhenoFeatureFinder.omics_analysis import OmicsAnalysis
>>> met = OmicsAnalysis(
...     metabolome_csv='my_metabolome_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated

impute_missing_values_with_median()

Impute missing values with the median value of the feature.

Parameters:: missing_value_str (str, optional) – The string that represents missing values in the input dataframe. All occurrences of missing_value_str will be imputed. For pandas dataframes with nullable integer dtypes, missing_value_str can be set to either np.nan or pd.NA.
Returns:: self – The object with the ‘metabolome’ attribute updated with imputed values.
Return type:: object
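As an illustration of median imputation for a single feature, here is a minimal standard-library sketch. It is not the package's implementation: the helper impute_with_median and the use of None as the missing-value marker are assumptions made for this example.

```python
from statistics import median


def impute_with_median(values, missing=None):
    """Replace entries equal to `missing` with the median of the observed values.

    Hypothetical helper; `missing` plays the role of the missing-value marker.
    """
    observed = [v for v in values if v != missing]
    if not observed:
        # Nothing to compute a median from: return the values unchanged.
        return list(values)
    fill = median(observed)
    return [fill if v == missing else v for v in values]


# Peak areas of one feature across four samples, one of them missing.
peak_areas = [742, 990, None, 888]
imputed = impute_with_median(peak_areas)
```

The median is computed from the observed values only, so the missing entry does not influence the value used to fill it.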

discard_features_detected_in_blanks(blank_sample_contains='blank')

Removes features detected in blank samples.

Steps:

Sum the abundance of each feature in the blank samples.

Make a list of features to be discarded (features with a positive summed abundance).

Return a filtered Pandas dataframe with only the features not detected in blank samples.

Parameters:: blank_sample_contains (str, optional) – Columns whose name contains this string will be considered blank samples. Default is ‘blank’.
Returns:: metabolome – A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.
Return type:: pandas.core.frame.DataFrame
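The blank-filtering steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the package's code: the data are held in a nested dictionary instead of a dataframe, and the helper name drop_features_found_in_blanks and the feature names are hypothetical.

```python
def drop_features_found_in_blanks(abundances, blank_sample_contains="blank"):
    """Keep only features whose summed abundance in blank samples is zero.

    `abundances` maps feature_id -> {sample_name: abundance} (hypothetical layout).
    Blank sample columns are removed from the result.
    """
    kept = {}
    for feature_id, samples in abundances.items():
        # Step 1: sum the abundance of the feature in the blank samples.
        blank_total = sum(
            value for name, value in samples.items() if blank_sample_contains in name
        )
        # Step 2: a positive sum marks the feature for removal.
        if blank_total > 0:
            continue
        # Step 3: keep the feature, without the blank sample columns.
        kept[feature_id] = {
            name: value
            for name, value in samples.items()
            if blank_sample_contains not in name
        }
    return kept


abundances = {
    "feature_a": {"blank_1": 0, "blank_2": 0, "MM_1": 554, "MM_2": 678},
    "feature_b": {"blank_1": 280, "blank_2": 694, "MM_1": 1364, "MM_2": 1340},
}
filtered = drop_features_found_in_blanks(abundances)
```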

create_density_plot(name_grouping_var='genotype', n_cols=3, nbins=1000)

For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas. This plot helps to see whether some groups have a value distribution different from the rest. The percentage is indicated on the y-axis (bar heights sum to 100).

Parameters:

name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.
n_cols (int, optional) – The number of columns for the final plot.
nbins (int, optional) – The number of bins to create.

Returns:

Returns the Axes object with the density plots drawn onto it.

Return type:

matplotlib Axes
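To make the percentage scale of the y-axis concrete, the sketch below bins a handful of peak areas and converts the counts to percentages so that the bar heights sum to 100. It is a standard-library illustration only; the helper histogram_percent is hypothetical and the binning used by the plotting backend may differ in detail.

```python
def histogram_percent(values, nbins):
    """Return the percentage of values falling in each of `nbins` equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / nbins
    if width == 0:
        # All values identical: use an arbitrary width so everything lands in bin 0.
        width = 1.0
    counts = [0] * nbins
    for value in values:
        # The maximum value is placed in the last bin rather than one past it.
        index = min(int((value - lowest) / width), nbins - 1)
        counts[index] += 1
    return [100 * count / len(values) for count in counts]


# Peak areas of one group, split into two bins.
heights = histogram_percent([554, 678, 674, 936], nbins=2)
```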

filter_features_per_group_by_percentile(name_grouping_var='genotype', separator_replicates='_', percentile=50)

Filter the metabolome dataframe based on a selected percentile threshold. Features with peak area values lower than the selected percentile will be discarded. The percentile value is calculated per grouping variable.

For instance, selecting the 50th percentile (median) will discard the 50% of features whose peak area is lower than the median in each group.

Parameters:

name_grouping_var (str, optional) – The name of the grouping variable (default is “genotype”)
separator_replicates (str, optional) – The character used to separate the main grouping variable from biological replicates. Default is “_” (underscore).
percentile (float, optional) – The percentile threshold. Must be between 0 and 100.

Returns:

self – The object with the .metabolome attribute filtered and the filtered_by_percentile_value attribute set to True.

Return type:

object

Example

>>> from PhenoFeatureFinder.omics_analysis import OmicsAnalysis
>>> met = OmicsAnalysis(
...     metabolome_csv='tests/metabolome_test_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated
>>> met.discard_features_detected_in_blanks(blank_sample_contains="blank")
>>> met.metabolome.shape
(7544, 32)
>>> met.filter_features_per_group_by_percentile(percentile=90)
>>> met.metabolome.shape
(3171, 32)
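The idea of a per-group percentile threshold can be sketched for a single group as follows. This is a simplified illustration, not the package's implementation. It assumes that each feature is summarised by its median peak area within the group, that the percentile uses linear interpolation, and that features at or above the threshold are kept; how results from several groups are combined is not covered. The helper percentile is hypothetical.

```python
import math
from statistics import median


def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Peak areas of four features in the replicates of one group (MM).
mm_group = {
    "rt-0.04_mz-241.88396": [554, 678, 674, 936],
    "rt-0.05_mz-143.95911": [1364, 1340, 1692, 1948],
    "rt-0.06_mz-124.96631": [0, 0, 0, 888],
    "rt-0.08_mz-553.45905": [10972, 11190, 12172, 11820],
}

# Assumption: one summary value (the median) per feature within the group.
feature_level = {feature: median(areas) for feature, areas in mm_group.items()}

# Threshold at the 50th percentile of those summary values.
threshold = percentile(feature_level.values(), 50)

# Features below the threshold are discarded.
kept = [feature for feature, level in feature_level.items() if level >= threshold]
```

With percentile=50, half of the features in this group fall below the threshold and are dropped.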

filter_out_unreliable_features()

Filter out features not reliably detectable in replicates of the same grouping factor.

Parameters:

name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.
nb_times_detected (int, optional) – Number of times a metabolite should be detected to be considered ‘reliable’. Should be equal to the number of biological replicates for a given group of interest (e.g. genotype). Default is 4.
separator_replicates (str, optional) – The separator used to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.

Returns:

metabolome – A Pandas dataframe with only the features considered reliable, the sample names and their values.

Return type:

pandas.core.frame.DataFrame

Notes

Input dataframe

| feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
|----------------------|-------|-------|-------|-------|----------|----------|
| rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
| rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
| rt-0.06_mz-124.96631 | 0     | 0     | 0     | 888   | 786      | 668      |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |

Output dataframe (rt-0.06_mz-124.96631 is discarded because it is detected only once in the MM group: three 0 values and one value of 888)

| feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
|----------------------|-------|-------|-------|-------|----------|----------|
| rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
| rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |
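The reliability filter illustrated by the two tables can be sketched in plain Python. This is a sketch under stated assumptions, not the package's code: a feature counts as detected when its value is above 0, a feature is kept when at least one group reaches nb_times_detected detections, and sample names are split on the last separator. The helper name keep_reliable_features is hypothetical.

```python
from collections import defaultdict


def keep_reliable_features(table, nb_times_detected=4, separator_replicates="_"):
    """Keep features detected at least `nb_times_detected` times in any one group.

    `table` maps feature_id -> {sample_name: abundance} (hypothetical layout).
    """
    reliable = {}
    for feature_id, samples in table.items():
        detections = defaultdict(int)
        for sample_name, value in samples.items():
            # "MM_1" -> group "MM"; split on the last separator only.
            group = sample_name.rsplit(separator_replicates, 1)[0]
            if value > 0:
                detections[group] += 1
        if any(count >= nb_times_detected for count in detections.values()):
            reliable[feature_id] = samples
    return reliable


columns = ["MM_1", "MM_2", "MM_3", "MM_4", "LA1330_1", "LA1330_2"]
rows = {
    "rt-0.04_mz-241.88396": [554, 678, 674, 936, 824, 940],
    "rt-0.05_mz-143.95911": [1364, 1340, 1692, 1948, 1928, 1956],
    "rt-0.06_mz-124.96631": [0, 0, 0, 888, 786, 668],
    "rt-0.08_mz-553.45905": [10972, 11190, 12172, 11820, 12026, 11604],
}
table = {feature: dict(zip(columns, values)) for feature, values in rows.items()}

reliable = keep_reliable_features(table, nb_times_detected=4)
```

On this input the sketch drops rt-0.06_mz-124.96631, matching the output table above.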

write_clean_metabolome_to_csv(path_of_cleaned_csv='./data_for_manuals/filtered_metabolome.csv')

Verifies that the metabolome dataset has been cleaned up, then writes the metabolome data to disk as a comma-separated values (.csv) file.

Parameters:: path_of_cleaned_csv (str, optional) – The path and filename of the .csv file to save. Defaults to “./data_for_manuals/filtered_metabolome.csv”.

compute_pca_on_metabolites(scale=True, n_principal_components=10, auto_transpose=True)

Performs a Principal Component Analysis (PCA) on the metabolome data.

The PCA returns the transformed coordinates of the samples in a new space, together with the percentage of variance explained by each Principal Component. The method assumes that the number of samples is lower than the number of features/metabolites. If n_samples > n_features, the metabolite dataframe is transposed (this can be turned off with auto_transpose=False).

Parameters:

scale (bool, optional) – Scale (standardize) the metabolite values to zero mean and unit variance. Default is True.
n_principal_components (int, optional) – Number of principal components to keep in the PCA. If the requested number of PCs exceeds min(n_samples, n_features), it is set to min(n_samples, n_features). Default is 10.
auto_transpose (bool, optional) – If n_samples > n_features, performs a transpose of the feature matrix. Default is True (meaning that transposing will occur if n_samples > n_features).

Returns:

self – The object with the following attributes set: .exp_variance, a dataframe with the explained variance per Principal Component; .metabolome_pca_reduced, a numpy array with the samples in reduced dimensions; .pca_performed, a boolean set to True.

Return type:

object
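Two details of the PCA output can be illustrated without running a PCA: the cap on the number of components and the conversion of per-component variances into percentages. The sketch below is a standard-library illustration, not the package's implementation. The helper explained_variance_percent and the variance values are hypothetical, and the percentages are taken relative to the total variance of all components supplied.

```python
def explained_variance_percent(component_variances, n_principal_components,
                               n_samples, n_features):
    """Percentage of variance explained by each retained Principal Component.

    `component_variances` holds the variance captured by each component,
    largest first (hypothetical input).
    """
    # The number of retained components cannot exceed min(n_samples, n_features).
    n_kept = min(n_principal_components, n_samples, n_features)
    total = sum(component_variances)
    return {
        f"PC{index}": 100 * variance / total
        for index, variance in enumerate(component_variances[:n_kept], start=1)
    }


# 4 samples and 500 features: at most 4 components, although 10 are requested.
exp_variance = explained_variance_percent(
    [5.0, 3.0, 2.0, 0.0], n_principal_components=10, n_samples=4, n_features=500
)
```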

create_scree_plot(plot_file_name=None)

Returns a barplot with the explained variance per Principal Component. Has to be preceded by compute_pca_on_metabolites().

Parameters:: plot_file_name (str, optional) – Path to a file where the plot will be saved, for instance ‘my_scree_plot.pdf’. Default is None.
Returns:: Returns the Axes object with the scree plot drawn onto it. Optionally a saved image of the plot.
Return type:: matplotlib Axes

create_sample_score_plot(pc_x_axis=1, pc_y_axis=2, name_grouping_var='genotype', separator_replicates='_', show_color_legend=True, plot_file_name=None)

Returns a sample score plot of the samples on PCx vs PCy. Samples are colored based on the grouping variable (e.g. genotype)

Parameters:

pc_x_axis (int, optional) – Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).
pc_y_axis (int, optional) – Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).
name_grouping_var (str, optional) – Name of the variable used to color samples (default is “genotype”).
separator_replicates (str, optional) – String separator that separates grouping factor from biological replicates (default is underscore “_”).
show_color_legend (bool, optional) – Add legend for hue (default is True).
plot_file_name (str, optional) – A file name and its path to save the sample score plot (default is None). For instance “mydir/sample_score_plot.pdf” Path is relative to current working directory.

Returns:

Returns the Axes object with the sample score plot drawn onto it. Samples are colored by specified grouping variable. Optionally a saved image of the plot.

Return type:

matplotlib Axes

compute_metabolome_sparsity()

Determine the sparsity of the metabolome matrix. Formula: number of zero values / total number of values * 100. The higher the sparsity, the more zero values the matrix contains.

Returns:: self – Object with sparsity attribute filled (sparsity is a float).
Return type:: object

References

https://stackoverflow.com/questions/38708621/how-to-calculate-percentage-of-sparsity-for-a-numpy-array-matrix
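As a worked example, the sketch below computes sparsity as the percentage of 0 values, following the definition in the method summary (100% for a matrix full of 0 values). It uses plain Python lists rather than the package's dataframe, and the helper name sparsity_percent is hypothetical.

```python
def sparsity_percent(matrix):
    """Percentage of entries in a 2D list that are equal to zero."""
    values = [value for row in matrix for value in row]
    zeros = sum(1 for value in values if value == 0)
    return 100 * zeros / len(values)


# Four features measured in six samples; three of the 24 values are zero.
matrix = [
    [554, 678, 674, 936, 824, 940],
    [1364, 1340, 1692, 1948, 1928, 1956],
    [0, 0, 0, 888, 786, 668],
    [10972, 11190, 12172, 11820, 12026, 11604],
]
sparsity = sparsity_percent(matrix)
```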

plot_features_in_upset_plot(seperator_replicates='_', plot_file_name=None)

Visualises the presence of features per group in an UpSet plot. A feature is considered present in a group if its median in that group is greater than 0.

Parameters:

separator_replicates (str, optional) – The separator used to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.
plot_file_name (str, optional) – A file name and its path to save the UpSet plot (default is None). For instance “mydir/feature_upset_plot.pdf”. Path is relative to the current working directory.

Returns:

UpSet plot with the presence of features per group.

Return type:

Plot

Notes

Input dataframe

| feature_id           | MM_1  | MM_2  | MM_3  | MM_4  | LA1330_1 | LA1330_2 |
|----------------------|-------|-------|-------|-------|----------|----------|
| rt-0.04_mz-241.88396 | 554   | 678   | 674   | 936   | 824      | 940      |
| rt-0.05_mz-143.95911 | 1364  | 1340  | 1692  | 1948  | 1928     | 1956     |
| rt-0.06_mz-124.96631 | 0     | 0     | 0     | 888   | 786      | 668      |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026    | 11604    |
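The presence rule (median > 0 within a group) determines which set each feature belongs to in the UpSet plot. The following standard-library sketch applies that rule to the input dataframe above. It is not the package's code: the helper groups_where_present is hypothetical, and sample names are assumed to be split on the last separator.

```python
from collections import defaultdict
from statistics import median


def groups_where_present(samples, separator_replicates="_"):
    """Return the set of groups in which the feature's median value is above 0.

    `samples` maps sample_name -> abundance for one feature (hypothetical layout).
    """
    values_per_group = defaultdict(list)
    for sample_name, value in samples.items():
        group = sample_name.rsplit(separator_replicates, 1)[0]
        values_per_group[group].append(value)
    return {group for group, values in values_per_group.items() if median(values) > 0}


columns = ["MM_1", "MM_2", "MM_3", "MM_4", "LA1330_1", "LA1330_2"]
rows = {
    "rt-0.04_mz-241.88396": [554, 678, 674, 936, 824, 940],
    "rt-0.05_mz-143.95911": [1364, 1340, 1692, 1948, 1928, 1956],
    "rt-0.06_mz-124.96631": [0, 0, 0, 888, 786, 668],
    "rt-0.08_mz-553.45905": [10972, 11190, 12172, 11820, 12026, 11604],
}
presence = {
    feature: groups_where_present(dict(zip(columns, values)))
    for feature, values in rows.items()
}
```

Here rt-0.06_mz-124.96631 has a median of 0 in the MM group, so it is counted as present in LA1330 only, while the other three features are present in both groups.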