PhenoFeatureFinder.omics_analysis
Classes
A class to streamline the filtering and exploration of a metabolome dataset.
Module Contents
- class PhenoFeatureFinder.omics_analysis.OmicsAnalysis(metabolome_csv, metabolome_feature_id_col='feature_id')
A class to streamline the filtering and exploration of a metabolome dataset.
- Parameters:
metabolome_csv (str) – A path to a .csv file with the metabolome data (scaled or unscaled). Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples
metabolome_feature_id_col (str, optional) – The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).
- metabolome
The metabolome Pandas dataframe imported from the .csv file.
- Type:
pandas.core.frame.DataFrame, (n_samples, n_features)
- metabolome_validated
Is the metabolome dataset validated? Default is False.
- Type:
bool
- blank_features_filtered
Are the features present in blank samples filtered out from the metabolome data? Default is False.
- Type:
bool
- filtered_by_percentile_value
Are the features filtered by percentile value?
- Type:
bool
- unreliable_features_filtered
Are the features not reliably present within one group filtered out from the metabolome data?
- Type:
bool
- pca_performed
Has PCA been performed on the metabolome data? Default is False.
- Type:
bool
- exp_variance
A Pandas dataframe with explained variance per Principal Component. The index of the df contains the PC index (PC1, PC2, etc.). The second column contains the percentage of the explained variance per PC.
- Type:
pandas.core.frame.DataFrame, (n_pc, 1)
- metabolome_pca_reduced
Numpy array with sample coordinates in reduced dimensions. The dimension of the numpy array is the minimum of the number of samples and features.
- Type:
numpy.ndarray, (n_samples, n_pc)
- sparsity
Metabolome matrix sparsity.
- Type:
float
- validate_input_metabolome_df()
Check if the provided metabolome file is suitable. Turns attribute metabolome_validated to True.
- discard_features_detected_in_blanks()
Removes features detected in blank samples.
- impute_missing_values_with_median()
Impute missing values with the median value of the feature.
- filter_out_unreliable_features()
Filter out features not reliably detectable in replicates of the same grouping factor. For instance, with argument nb_times_detected=4, a feature detected fewer than 4 times within 4 biological replicates is discarded.
- filter_features_per_group_by_percentile()
Filter out features whose abundance within the same grouping factor is lower than a certain percentile value. For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.
- compute_metabolome_sparsity()
Computes the sparsity percentage of the metabolome matrix (percentage of 0 values, e.g. 100% for a matrix full of 0 values).
- write_clean_metabolome_to_csv()
Write the filtered and analysis-ready metabolome data to a .csv file.
Notes
Example of an input metabolome format (from a .csv file)
| feature_id | blank_1 | blank_2 | blank_3 | blank_4 | MM_1 | MM_2 | MM_3 | MM_4 | LA1330_1 | LA1330_2 | LA1330_3 | LA1330_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rt-0.04_mz-241.88396 | 280 | 694 | 502 | 604 | 554 | 678 | 674 | 936 | 824 | 940 | 794 | 828 |
| rt-0.05_mz-143.95911 | 1036 | 1566 | 1326 | 1490 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 | 1730 | 1568 |
| rt-0.06_mz-124.96631 | 1308 | 992 | 1060 | 1010 | 742 | 990 | 0 | 888 | 786 | 668 | 762 | 974 |
| rt-0.08_mz-553.45905 | 11340 | 12260 | 10962 | 11864 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 | 11122 | 11260 |
| rt-0.08_mz-413.26631 | 984 | 1162 | 1292 | 1104 | 1090 | 1106 | 1290 | 1170 | 1282 | 924 | 1172 | 1062 |
Example
>>> met = OmicsAnalysis(
...     metabolome_csv='my_metabolome_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated
See also
scikit-learn
- metabolome_validated = False
- blank_features_filtered = False
- filtered_by_percentile_value = False
- unreliable_features_filtered = False
- pca_performed = False
- sparsity = None
- metabolome
- validate_input_metabolome_df(metabolome_feature_id_col='feature_id')
Validates the dataframe containing the feature identifiers, metabolite values and sample names. Sets the column named by metabolome_feature_id_col as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute.
- Parameters:
metabolome_feature_id_col (str, optional) – The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).
- Returns:
self – Object with attribute metabolome_validated set to True if tests are passed.
- Return type:
object
Notes
Example of a valid input metabolome dataframe
| feature_id | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
|---|---|---|---|---|
| metabolite1 | 1246 | 1245 | 12345 | 12458 |
| metabolite2 | 0 | 0 | 0 | 0 |
| metabolite3 | 10 | 0 | 0 | 154 |
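The checks described for this method can be sketched in plain Python with the standard library. This is only an illustration under stated assumptions, not the library's implementation: the helper name `load_feature_table`, the nested-dictionary layout and the use of `ValueError` are all hypothetical choices made for the example.

```python
import csv
import io


def load_feature_table(csv_text, feature_id_col="feature_id"):
    """Parse CSV text into {feature_id: {sample_name: value}}.

    Hypothetical helper: raises ValueError if the identifier column is
    missing or if a feature identifier appears more than once.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if feature_id_col not in (reader.fieldnames or []):
        raise ValueError(f"column {feature_id_col!r} not found")

    table = {}
    for row in reader:
        feature_id = row.pop(feature_id_col)
        if feature_id in table:
            raise ValueError(f"duplicated feature identifier: {feature_id}")
        # The remaining columns are samples; convert their values to numbers.
        table[feature_id] = {sample: float(value) for sample, value in row.items()}
    return table


csv_text = """feature_id,genotypeA_rep1,genotypeA_rep2,genotypeA_rep3,genotypeA_rep4
metabolite1,1246,1245,12345,12458
metabolite2,0,0,0,0
metabolite3,10,0,0,154
"""

table = load_feature_table(csv_text)
print(sorted(table))  # the feature identifiers now act as the index
print(table["metabolite3"]["genotypeA_rep4"])
```

Keying the table by feature identifier mirrors the idea of using the identifier column as the dataframe index, which is why duplicated identifiers have to be rejected.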
- impute_missing_values_with_median(missing_value_str='np.nan')
Imputes missing values with the median of the column.
- Parameters:
missing_value_str (str, optional) – The string that represents missing values in the input dataframe. All occurrences of missing_value_str will be imputed. For pandas dataframes with nullable integer dtypes with missing values, missing_value_str can be set to either np.nan or pd.NA. Default is ‘np.nan’.
- Returns:
self – Object with attribute ‘metabolome’ updated with imputed values.
- Return type:
object
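As a rough illustration of median imputation, the sketch below fills the missing entries of a single series of values with the median of the observed ones. It is a plain-Python approximation, not the library's code: the function name `impute_with_median` is hypothetical, and treating both `None` and NaN as missing is an assumption made for the example.

```python
import math
import statistics


def impute_with_median(values):
    """Replace missing entries (None or NaN) with the median of the observed ones."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        return list(values)  # nothing to compute a median from
    median = statistics.median(observed)
    return [median if is_missing(v) else v for v in values]


# Hypothetical series of five measurements, two of them missing.
feature_values = [10.0, float("nan"), 30.0, None, 20.0]
imputed = impute_with_median(feature_values)
print(imputed)
```

The median is computed from the observed values only, so the imputed entries never influence the statistic they are derived from.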
- discard_features_detected_in_blanks(blank_sample_contains='blank')
Removes features detected in blank samples. Steps:
1. Sums the abundance of each feature in the blank samples.
2. Makes a list of features to be discarded (features with a positive summed abundance).
3. Returns a filtered Pandas dataframe with only features not detected in blank samples.
- Parameters:
blank_sample_contains (str, optional) – Column names containing this string will be considered blank samples. Default is ‘blank’.
- Returns:
metabolome – A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.
- Return type:
pandas.core.frame.DataFrame
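The blank-filtering steps can be sketched as follows. This is a dependency-free approximation rather than the library's pandas code: the function name `discard_blank_features`, the nested-dictionary table and the example values are assumptions made for illustration.

```python
def discard_blank_features(table, blank_sample_contains="blank"):
    """Drop features with a positive summed abundance in blank samples.

    `table` maps feature_id -> {sample_name: abundance}. Blank sample
    columns are removed from the features that are kept.
    """
    filtered = {}
    for feature_id, samples in table.items():
        blank_sum = sum(
            value for name, value in samples.items() if blank_sample_contains in name
        )
        if blank_sum > 0:
            continue  # detected in at least one blank sample: discard
        filtered[feature_id] = {
            name: value
            for name, value in samples.items()
            if blank_sample_contains not in name
        }
    return filtered


# Hypothetical data: feature_a is absent from the blanks, feature_b is not.
table = {
    "feature_a": {"blank_1": 0, "blank_2": 0, "MM_1": 554, "MM_2": 678},
    "feature_b": {"blank_1": 280, "blank_2": 0, "MM_1": 824, "MM_2": 940},
}
kept = discard_blank_features(table)
print(kept)
```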
- create_density_plot(name_grouping_var='genotype', n_cols=3, nbins=1000)
For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas. This plot helps to see whether some groups have a value distribution different from the rest. The percentage is indicated on the y-axis (bar heights sum to 100).
- Parameters:
name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.
n_cols (int, optional) – The number of columns for the final plot.
nbins (int, optional) – The number of bins to create.
- Returns:
Returns the Axes object with the density plots drawn onto it.
- Return type:
matplotlib Axes
- filter_features_per_group_by_percentile(name_grouping_var='genotype', separator_replicates='_', percentile=50)
Filters the metabolome dataframe based on a selected percentile threshold. Features with peak area values lower than the selected percentile will be discarded. The percentile value is calculated per grouping variable.
For instance, selecting the 50th percentile (median) will discard the features whose peak area is lower than the median in each group, i.e. about 50% of the features per group.
- Parameters:
name_grouping_var (str, optional) – The name of the grouping variable (default is “genotype”)
separator_replicates (str, optional) – The character used to separate the main grouping variable from biological replicates. Default is “_” (underscore).
percentile (float, optional) – The percentile threshold. Must be between 0 and 100. Default is 50.
- Returns:
self – The object with the .metabolome attribute filtered and the filtered_by_percentile_value set to True.
- Return type:
object
Example
>>> met = OmicsAnalysis(
...     metabolome_csv='tests/metabolome_test_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated
>>> met.discard_features_detected_in_blanks(blank_sample_contains="blank")
>>> met.metabolome.shape
(7544, 32)
>>> met.filter_features_per_group_by_percentile(percentile=90)
>>> met.metabolome.shape
(3171, 32)
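One possible reading of the per-group percentile filter is sketched below in plain Python. Several points are assumptions rather than documented behaviour: each feature is summarised by its median across the replicates of a group, the threshold is the percentile of those medians within the group, and a feature is kept if it reaches the threshold in at least one group. The function names and the example data are hypothetical.

```python
import statistics
from collections import defaultdict


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def filter_by_group_percentile(table, separator="_", q=50):
    # Median abundance of each feature within each group (assumed summary).
    medians = defaultdict(dict)  # group -> {feature_id: median}
    for feature_id, samples in table.items():
        per_group = defaultdict(list)
        for sample, value in samples.items():
            per_group[sample.split(separator)[0]].append(value)
        for group, values in per_group.items():
            medians[group][feature_id] = statistics.median(values)

    # Keep a feature if it reaches the threshold in at least one group (assumed rule).
    kept = set()
    for by_feature in medians.values():
        threshold = percentile(by_feature.values(), q)
        kept |= {f for f, m in by_feature.items() if m >= threshold}
    return {f: samples for f, samples in table.items() if f in kept}


# Hypothetical table with two groups (MM, LA) and two replicates each.
table = {
    "f1": {"MM_1": 10, "MM_2": 20, "LA_1": 5, "LA_2": 15},
    "f2": {"MM_1": 100, "MM_2": 200, "LA_1": 1, "LA_2": 3},
    "f3": {"MM_1": 1, "MM_2": 3, "LA_1": 2, "LA_2": 4},
    "f4": {"MM_1": 50, "MM_2": 70, "LA_1": 300, "LA_2": 500},
}
filtered = filter_by_group_percentile(table, q=75)
print(sorted(filtered))
```

With the 75th percentile, only the most abundant feature of each group passes its group threshold, so `f2` (high in MM) and `f4` (high in LA) remain.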
- filter_out_unreliable_features(name_grouping_var='genotype', nb_times_detected=4, separator_replicates='_')
Removes features not reliably detectable in multiple biological replicates from the same grouping factor.
Takes a dataframe with feature identifiers in index and samples as columns.
Step 1: Melt the dataframe and split the sample names to generate the grouping variable.
Step 2: Count the number of times a metabolite is detected in each group. If the number of times detected in a group equals the number of biological replicates, the feature is considered reliable. Each feature receives a tag ‘reliable’ or ‘not_reliable’.
Step 3: Discard the ‘not_reliable’ features and keep the filtered dataframe.
- Parameters:
name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.
nb_times_detected (int, optional) – Number of times a metabolite should be detected to be considered ‘reliable’. Should be equal to the number of biological replicates for a given group of interest (e.g. genotype). Default is 4.
separator_replicates (str, optional) – The separator used to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.
- Returns:
metabolome – A Pandas dataframe with only features considered as reliable, sample names and their values.
- Return type:
pandas.core.frame.DataFrame
Notes
Input dataframe
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.06_mz-124.96631 | 0 | 0 | 0 | 888 | 786 | 668 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |
Output df (rt-0.06_mz-124.96631 is kicked out because 3x0 and 1x888 in MM groups)
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |
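The three steps above can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the sample names (four MM replicates followed by two LA1330 samples) are guessed, since the Notes table has no header, and a feature is treated as reliable when it reaches `nb_times_detected` non-zero values in at least one group. The function name is hypothetical.

```python
from collections import Counter


def filter_out_unreliable(table, separator="_", nb_times_detected=4):
    """Keep features detected (value > 0) at least nb_times_detected times in a group."""
    reliable = {}
    for feature_id, samples in table.items():
        detections = Counter()
        for sample, value in samples.items():
            group = sample.split(separator)[0]  # "MM_1" -> "MM"
            if value > 0:
                detections[group] += 1
        if any(count >= nb_times_detected for count in detections.values()):
            reliable[feature_id] = samples
    return reliable


# Values from the Notes above; the sample names are assumed for illustration.
sample_names = ["MM_1", "MM_2", "MM_3", "MM_4", "LA1330_1", "LA1330_2"]
rows = {
    "rt-0.04_mz-241.88396": [554, 678, 674, 936, 824, 940],
    "rt-0.05_mz-143.95911": [1364, 1340, 1692, 1948, 1928, 1956],
    "rt-0.06_mz-124.96631": [0, 0, 0, 888, 786, 668],
    "rt-0.08_mz-553.45905": [10972, 11190, 12172, 11820, 12026, 11604],
}
table = {fid: dict(zip(sample_names, values)) for fid, values in rows.items()}
reliable = filter_out_unreliable(table, nb_times_detected=4)
print(list(reliable))
```

Under these assumed sample names, rt-0.06_mz-124.96631 is detected only once in the MM group and is therefore dropped, which matches the output table in the Notes.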
- write_clean_metabolome_to_csv(path_of_cleaned_csv='./data_for_manuals/filtered_metabolome.csv')
Verifies that the metabolome dataset has been cleaned up, then writes the metabolome data to disk as a comma-separated values file.
- Parameters:
path_of_cleaned_csv (str, optional) – The path and filename of the .csv file to save. Default is “./data_for_manuals/filtered_metabolome.csv”.
- compute_pca_on_metabolites(scale=True, n_principal_components=10, auto_transpose=True)
Performs a Principal Component Analysis (PCA) on the metabolome data.
The PCA returns the transformed coordinates of the samples in a new space, together with the percentage of variance explained by each Principal Component. It assumes that the number of samples is smaller than the number of features/metabolites. If n_samples > n_features, the metabolite dataframe is transposed (this can be turned off with auto_transpose).
- Parameters:
scale (bool, optional) – Standardize the metabolite values to zero mean and unit variance. Default is True.
n_principal_components (int, optional) – Number of principal components to keep in the PCA. If this number is greater than min(n_samples, n_features), it is set to min(n_samples, n_features). Default is 10.
auto_transpose (bool, optional) – If n_samples > n_features, performs a transpose of the feature matrix. Default is True (meaning that transposing will occur if n_samples > n_features).
- Returns:
self – Object with .exp_variance (dataframe with explained variance per Principal Component), .metabolome_pca_reduced (dataframe with samples in reduced dimensions) and .pca_performed (boolean set to True).
- Return type:
object
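The parameter handling described for this method (optional transpose, then capping the number of components) can be illustrated with a small bookkeeping sketch. It performs no PCA itself, and the function name `plan_pca` and its tuple return value are hypothetical; only the two rules come from the text above.

```python
def plan_pca(n_rows, n_cols, n_principal_components=10, auto_transpose=True):
    """Return (n_samples, n_features, n_components) following the rules above.

    Rows are treated as samples and columns as features. If there are more
    samples than features and auto_transpose is True, the two are swapped.
    The number of components is capped at min(n_samples, n_features).
    """
    n_samples, n_features = n_rows, n_cols
    if auto_transpose and n_samples > n_features:
        n_samples, n_features = n_features, n_samples
    n_components = min(n_principal_components, n_samples, n_features)
    return n_samples, n_features, n_components


transposed = plan_pca(7544, 32)  # more rows than columns: swapped
capped = plan_pca(8, 5000, n_principal_components=10)  # only 8 samples available
print(transposed)
print(capped)
```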
- create_scree_plot(plot_file_name=None)
Returns a barplot with the explained variance per Principal Component. Has to be preceded by compute_pca_on_metabolites().
- Parameters:
plot_file_name (str, optional) – Path to a file where the plot will be saved, for instance ‘my_scree_plot.pdf’. Default is None.
- Returns:
Returns the Axes object with the scree plot drawn onto it. Optionally a saved image of the plot.
- Return type:
matplotlib Axes
- create_sample_score_plot(pc_x_axis=1, pc_y_axis=2, name_grouping_var='genotype', separator_replicates='_', show_color_legend=True, plot_file_name=None)
Returns a sample score plot of the samples on PCx vs PCy. Samples are colored based on the grouping variable (e.g. genotype)
- Parameters:
pc_x_axis (int, optional) – Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).
pc_y_axis (int, optional) – Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).
name_grouping_var (str, optional) – Name of the variable used to color samples (default is “genotype”).
separator_replicates (str, optional) – String separator that separates grouping factor from biological replicates (default is underscore “_”).
show_color_legend (bool, optional) – Add legend for hue (default is True).
plot_file_name (str, optional) – A file name and its path to save the sample score plot (default is None). For instance “mydir/sample_score_plot.pdf” Path is relative to current working directory.
- Returns:
Returns the Axes object with the sample score plot drawn onto it. Samples are colored by specified grouping variable. Optionally a saved image of the plot.
- Return type:
matplotlib Axes
- compute_metabolome_sparsity()
Determines the sparsity of the metabolome matrix. Formula: number of zero values / total number of values * 100. The higher the sparsity, the more zero values.
- Returns:
self – Object with sparsity attribute filled (sparsity is a float).
- Return type:
object
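A minimal sketch of the sparsity computation, following the description of sparsity as the percentage of 0 values (100% for a matrix full of 0 values). The function name and the list-of-rows matrix layout are assumptions for the example, not the library's API.

```python
def sparsity_percentage(matrix):
    """Percentage of entries equal to zero in a list-of-rows matrix."""
    values = [value for row in matrix for value in row]
    if not values:
        raise ValueError("empty matrix")
    n_zero = sum(1 for value in values if value == 0)
    return 100 * n_zero / len(values)


all_zero = [[0, 0], [0, 0]]
mixed = [[0, 0, 0, 888], [554, 678, 674, 936]]
print(sparsity_percentage(all_zero))  # every value is zero
print(sparsity_percentage(mixed))     # 3 zeros out of 8 values
```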
- plot_features_in_upset_plot(seperator_replicates='_', plot_file_name=None)
Visualises the presence of features per group in an UpSet plot. A feature is considered present in a group if the median>0.
- Parameters:
seperator_replicates (str, optional) – The separator used to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.
plot_file_name (str, optional) – A file name and its path to save the UpSet plot (default is None). For instance “mydir/feature_upset_plot.pdf”. Path is relative to current working directory.
- Returns:
UpSet plot with features presence per group.
- Return type:
Plot
Notes
Input dataframe
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.06_mz-124.96631 | 0 | 0 | 0 | 888 | 786 | 668 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |
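The presence rule (median > 0 within a group) determines which groups each feature belongs to, and that membership is what an UpSet plot displays. The sketch below computes only the membership, without plotting. The function name is hypothetical, and the sample names are assumed because the input table above has no header row.

```python
import statistics
from collections import defaultdict


def group_membership(table, separator="_"):
    """Map each feature to the set of groups where its median abundance is > 0."""
    membership = {}
    for feature_id, samples in table.items():
        per_group = defaultdict(list)
        for sample, value in samples.items():
            per_group[sample.split(separator)[0]].append(value)
        membership[feature_id] = {
            group
            for group, values in per_group.items()
            if statistics.median(values) > 0
        }
    return membership


# Values from the input dataframe above; sample names are assumed for illustration.
sample_names = ["MM_1", "MM_2", "MM_3", "MM_4", "LA1330_1", "LA1330_2"]
rows = {
    "rt-0.04_mz-241.88396": [554, 678, 674, 936, 824, 940],
    "rt-0.06_mz-124.96631": [0, 0, 0, 888, 786, 668],
}
table = {fid: dict(zip(sample_names, values)) for fid, values in rows.items()}
membership = group_membership(table)
print(membership)
```

Under the assumed names, rt-0.06_mz-124.96631 has a median of 0 in the MM group and would therefore appear only in the LA1330 set.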