PhenoFeatureFinder.omics_analysis

Classes

OmicsAnalysis

A class to streamline the filtering and exploration of a metabolome dataset.

Module Contents

class PhenoFeatureFinder.omics_analysis.OmicsAnalysis(metabolome_csv, metabolome_feature_id_col='feature_id')

A class to streamline the filtering and exploration of a metabolome dataset.

Parameters:
  • metabolome_csv (str) – A path to a .csv file with the metabolome data (scaled or unscaled). Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples.

  • metabolome_feature_id_col (str, optional) – The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (not duplicated).

metabolome

The metabolome Pandas dataframe imported from the .csv file.

Type:

pandas.core.frame.DataFrame, (n_samples, n_features)

metabolome_validated

Is the metabolome dataset validated? Default is False.

Type:

bool

blank_features_filtered

Are the features present in blank samples filtered out from the metabolome data? Default is False.

Type:

bool

filtered_by_percentile_value

Are the features filtered by percentile value?

Type:

bool

unreliable_features_filtered

Are the features not reliably present within one group filtered out from the metabolome data?

Type:

bool

pca_performed

Has PCA been performed on the metabolome data? Default is False.

Type:

bool

exp_variance

A Pandas dataframe with explained variance per Principal Component. The index of the dataframe contains the PC index (PC1, PC2, etc.). The single column contains the percentage of the explained variance per PC.

Type:

pandas.core.frame.DataFrame, (n_pc, 1)

metabolome_pca_reduced

Numpy array with sample coordinates in reduced dimensions. The number of components is at most the minimum of the number of samples and features.

Type:

numpy.ndarray, (n_samples, n_pc)

sparsity

Metabolome matrix sparsity.

Type:

float

validate_input_metabolome_df()

Checks whether the provided metabolome file is suitable. Sets the attribute metabolome_validated to True.

discard_features_detected_in_blanks()

Removes features detected in blank samples.

impute_missing_values_with_median()

Impute missing values with the median value of the feature.

filter_out_unreliable_features()

Filter out features not reliably detectable in replicates of the same grouping factor. For instance, with argument nb_times_detected=4, a feature detected fewer than 4 times within 4 biological replicates is discarded.

filter_features_per_group_by_percentile()

Filter out features whose abundance within the same grouping factor is lower than a certain percentile value. For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.

compute_metabolome_sparsity()

Computes the sparsity percentage of the metabolome matrix (percentage of 0 values, e.g. 100% for a matrix full of 0 values).

write_clean_metabolome_to_csv()

Write the filtered and analysis-ready metabolome data to a .csv file.

Notes

Example of a metabolome input format (from a .csv file)

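| feature_id | blank_1 | blank_2 | blank_3 | blank_4 | MM_1 | MM_2 | MM_3 | MM_4 | LA1330_1 | LA1330_2 | LA1330_3 | LA1330_4 |
|------------|---------|---------|---------|---------|------|------|------|------|----------|----------|----------|----------|
| rt-0.04_mz-241.88396 | 280 | 694 | 502 | 604 | 554 | 678 | 674 | 936 | 824 | 940 | 794 | 828 |
| rt-0.05_mz-143.95911 | 1036 | 1566 | 1326 | 1490 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 | 1730 | 1568 |
| rt-0.06_mz-124.96631 | 1308 | 992 | 1060 | 1010 | 742 | 990 | 0 | 888 | 786 | 668 | 762 | 974 |
| rt-0.08_mz-553.45905 | 11340 | 12260 | 10962 | 11864 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 | 11122 | 11260 |
| rt-0.08_mz-413.26631 | 984 | 1162 | 1292 | 1104 | 1090 | 1106 | 1290 | 1170 | 1282 | 924 | 1172 | 1062 |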

Example

>>> from PhenoFeatureFinder.omics_analysis import OmicsAnalysis
>>> met = OmicsAnalysis(
...     metabolome_csv='my_metabolome_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated

See also

scikit-learn

metabolome_validated = False
blank_features_filtered = False
filtered_by_percentile_value = False
unreliable_features_filtered = False
pca_performed = False
sparsity = None
metabolome
validate_input_metabolome_df(metabolome_feature_id_col='feature_id')

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Places the ‘metabolome_feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute.

Parameters:

metabolome_feature_id_col (str, optional) – The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (not duplicated).

Returns:

self – Object with attribute metabolome_validated set to True if tests are passed.

Return type:

object

Notes

Example of a valid input metabolome dataframe

| feature_id  | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
|-------------|----------------|----------------|----------------|----------------|
| metabolite1 | 1246 | 1245 | 12345 | 12458 |
| metabolite2 | 0 | 0 | 0 | 0 |
| metabolite3 | 10 | 0 | 0 | 154 |
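
The individual checks run by this method are not listed here, so the following is only a minimal sketch of the two behaviours the description mentions: the identifier column becomes the index, and identifiers must be unique. The function name validate_feature_ids and the error messages are hypothetical and not part of PhenoFeatureFinder.

```python
import pandas as pd


def validate_feature_ids(df, feature_id_col="feature_id"):
    """Hypothetical helper: check the identifier column and use it as index."""
    if feature_id_col not in df.columns:
        raise ValueError(f"Column '{feature_id_col}' not found in the dataframe")
    if df[feature_id_col].duplicated().any():
        raise ValueError("Feature identifiers must be unique")
    # Feature identifiers become the index; only sample columns remain.
    return df.set_index(feature_id_col)


raw = pd.DataFrame({
    "feature_id": ["metabolite1", "metabolite2", "metabolite3"],
    "genotypeA_rep1": [1246, 0, 10],
    "genotypeA_rep2": [1245, 0, 0],
})
validated = validate_feature_ids(raw)
```
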
impute_missing_values_with_median(missing_value_str='np.nan')

Imputes missing values with the median of the column.

Parameters:

missing_value_str (str, optional) – The string that represents missing values in the input dataframe. All occurrences of missing_value_str will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_value_str can be set to either np.nan or pd.NA.

Returns:

self – Object with attribute ‘metabolome’ updated with imputed values.

Return type:

object
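
As an illustration of column-wise median imputation, here is a minimal pandas sketch. It handles NaN values only and is not the package's own implementation; the helper name impute_with_column_median and the sample names s_1 and s_2 are made up for this example.

```python
import numpy as np
import pandas as pd


def impute_with_column_median(df):
    """Hypothetical helper: replace NaN by the median of each column."""
    # DataFrame.median() skips NaN by default and returns one value per column.
    return df.fillna(df.median())


with_gaps = pd.DataFrame(
    {"s_1": [1.0, np.nan, 3.0], "s_2": [4.0, 6.0, np.nan]},
    index=["f1", "f2", "f3"],
)
imputed = impute_with_column_median(with_gaps)
```

In this toy dataframe the gap in s_1 is filled with 2.0 (median of 1 and 3) and the gap in s_2 with 5.0 (median of 4 and 6).
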

discard_features_detected_in_blanks(blank_sample_contains='blank')

Removes features detected in blank samples. Steps:

  1. Sums the abundance of each feature in the blank samples.

  2. Makes a list of features to be discarded (features with a positive summed abundance).

  3. Returns a filtered Pandas dataframe with only features not detected in blank samples.

Parameters:

blank_sample_contains (str, optional) – Columns whose name contains this string are considered blank samples. Default is ‘blank’.

Returns:

metabolome – A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.

Return type:

pandas.core.frame.DataFrame
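
The three steps above can be sketched with pandas as follows. This is an illustration under the assumption that blank columns are matched by substring; drop_blank_features and the toy data are hypothetical and not taken from the package.

```python
import pandas as pd


def drop_blank_features(df, blank_sample_contains="blank"):
    """Hypothetical helper mirroring the three documented steps."""
    blank_cols = [c for c in df.columns if blank_sample_contains in c]
    # Step 1: sum the abundance of each feature over the blank samples.
    blank_sums = df[blank_cols].sum(axis=1)
    # Step 2: features with a positive summed abundance are discarded.
    to_discard = blank_sums[blank_sums > 0].index
    # Step 3: keep the remaining features and remove the blank columns.
    return df.drop(index=to_discard).drop(columns=blank_cols)


toy = pd.DataFrame(
    {"blank_1": [0, 5], "blank_2": [0, 0], "MM_1": [10, 3], "MM_2": [20, 4]},
    index=pd.Index(["featA", "featB"], name="feature_id"),
)
without_blanks = drop_blank_features(toy)
```
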

create_density_plot(name_grouping_var='genotype', n_cols=3, nbins=1000)

For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas. This plot helps to see whether some groups have a value distribution different from the rest. The percentage is indicated on the y-axis (bar heights sum to 100).

Parameters:
  • name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

  • n_cols (int, optional) – The number of columns for the final plot.

  • nbins (int, optional) – The number of bins to create.

Returns:

Returns the Axes object with the density plots drawn onto it.

Return type:

matplotlib Axes

filter_features_per_group_by_percentile(name_grouping_var='genotype', separator_replicates='_', percentile=50)

Filters the metabolome dataframe based on a selected percentile threshold. Features with peak area values lower than the selected percentile are discarded. The percentile value is calculated per grouping variable.

For instance, selecting the 50th percentile (median) discards the features whose peak area is lower than the median of each group.

Parameters:
  • name_grouping_var (str, optional) – The name of the grouping variable (default is “genotype”)

  • separator_replicates (str, optional) – The character used to separate the main grouping variable from biological replicates. Default is “_” (underscore).

  • percentile (float, optional) – The percentile threshold. Must be between 0 and 100.

Returns:

self – The object with the .metabolome attribute filtered and the filtered_by_percentile_value set to True.

Return type:

object

Example

>>> from PhenoFeatureFinder.omics_analysis import OmicsAnalysis
>>> met = OmicsAnalysis(
...     metabolome_csv='tests/metabolome_test_data.csv',
...     metabolome_feature_id_col='feature_id')
>>> met.validate_input_metabolome_df()
Metabolome input data validated
>>> met.discard_features_detected_in_blanks(blank_sample_contains="blank")
>>> met.metabolome.shape
(7544, 32)
>>> met.filter_features_per_group_by_percentile(percentile=90)
>>> met.metabolome.shape
(3171, 32)
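
The exact rule used to combine groups is not spelled out here, so the sketch below rests on two loudly stated assumptions: the abundance of a feature in a group is taken as the mean over its replicates, and a feature is kept when it reaches the percentile threshold in at least one group. The helper filter_by_group_percentile and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def filter_by_group_percentile(df, percentile=50, separator="_"):
    """Hypothetical helper: per-group percentile filter (assumed rule)."""
    # Derive the group of each sample, e.g. 'MM_1' -> 'MM'.
    groups = df.columns.str.split(separator).str[0].to_numpy()
    # Assumption: mean abundance of each feature within each group.
    group_means = df.T.groupby(groups).mean().T
    # One percentile threshold per group, computed over all features.
    thresholds = group_means.apply(lambda col: np.percentile(col, percentile))
    # Assumption: keep a feature if it passes in at least one group.
    keep = (group_means >= thresholds).any(axis=1)
    return df.loc[keep]


toy = pd.DataFrame(
    {
        "MM_1": [10, 20, 30, 40], "MM_2": [10, 20, 30, 40],
        "LA_1": [50, 2, 3, 4], "LA_2": [50, 2, 3, 4],
    },
    index=pd.Index(["f1", "f2", "f3", "f4"], name="feature_id"),
)
filtered = filter_by_group_percentile(toy, percentile=50)
```

With these toy values the MM threshold is 25 and the LA threshold is 3.5, so only f2 falls below the threshold in every group and is removed.
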
filter_out_unreliable_features(name_grouping_var='genotype', nb_times_detected=4, separator_replicates='_')

Removes features not reliably detectable in multiple biological replicates from the same grouping factor.

Takes a dataframe with feature identifiers in index and samples as columns. Steps:

  1. Melt the dataframe and split the sample names to generate the grouping variable.

  2. Count the number of times a metabolite is detected in each group. If the number of times detected in a group equals the number of biological replicates, the feature is considered reliable. Each feature receives a tag ‘reliable’ or ‘not_reliable’.

  3. Discard the ‘not_reliable’ features and keep the filtered dataframe.

Parameters:
  • name_grouping_var (str, optional) – The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

  • nb_times_detected (int, optional) – Number of times a metabolite should be detected to be considered ‘reliable’. Should be equal to the number of biological replicates for a given group of interest (e.g. genotype). Default is 4.

  • separator_replicates (str, optional) – The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.

Returns:

metabolome – A Pandas dataframe with only features considered as reliable, sample names and their values.

Return type:

pandas.core.frame.DataFrame

Notes

Input dataframe

| feature_id | MM_1 | MM_2 | MM_3 | MM_4 | LA1330_1 | LA1330_2 |
|------------|------|------|------|------|----------|----------|
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.06_mz-124.96631 | 0 | 0 | 0 | 888 | 786 | 668 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |

Output df (rt-0.06_mz-124.96631 is kicked out because 3x0 and 1x888 in MM groups)

| feature_id | MM_1 | MM_2 | MM_3 | MM_4 | LA1330_1 | LA1330_2 |
|------------|------|------|------|------|----------|----------|
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |
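
The melt, count and discard steps can be sketched with pandas on the example above. The sketch assumes that a feature is kept when it is detected (value > 0) at least nb_times_detected times in at least one group, which reproduces the output shown here but may differ from the package's own rule. The helper name keep_reliable_features is hypothetical.

```python
import pandas as pd


def keep_reliable_features(df, nb_times_detected=4, separator="_"):
    """Hypothetical helper: melt, count detections per group, then filter."""
    # Step 1: long format, one row per (feature, sample); derive the group.
    long_df = df.reset_index().melt(
        id_vars="feature_id", var_name="sample", value_name="abundance"
    )
    long_df["group"] = long_df["sample"].str.split(separator).str[0]
    # Step 2: number of detections (abundance > 0) per feature and group.
    detections = (long_df["abundance"] > 0).groupby(
        [long_df["feature_id"], long_df["group"]]
    ).sum()
    # Step 3 (assumed rule): reliable if the threshold is met in any group.
    best = detections.groupby(level="feature_id").max()
    reliable = best[best >= nb_times_detected].index
    return df.loc[df.index.isin(reliable)]


example = pd.DataFrame(
    {
        "MM_1": [554, 1364, 0, 10972],
        "MM_2": [678, 1340, 0, 11190],
        "MM_3": [674, 1692, 0, 12172],
        "MM_4": [936, 1948, 888, 11820],
        "LA1330_1": [824, 1928, 786, 12026],
        "LA1330_2": [940, 1956, 668, 11604],
    },
    index=pd.Index(
        [
            "rt-0.04_mz-241.88396",
            "rt-0.05_mz-143.95911",
            "rt-0.06_mz-124.96631",
            "rt-0.08_mz-553.45905",
        ],
        name="feature_id",
    ),
)
reliable_only = keep_reliable_features(example, nb_times_detected=4)
```
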
write_clean_metabolome_to_csv(path_of_cleaned_csv='./data_for_manuals/filtered_metabolome.csv')

Verifies that the metabolome dataset has been cleaned up, then writes the metabolome data to disk as a comma-separated values (.csv) file.

Parameters:

path_of_cleaned_csv (str, optional) – The path and filename of the .csv file to save. Default is “./data_for_manuals/filtered_metabolome.csv”.

compute_pca_on_metabolites(scale=True, n_principal_components=10, auto_transpose=True)

Performs a Principal Component Analysis (PCA) on the metabolome data.

The PCA returns the transformed coordinates of the samples in a new space. It also gives the percentage of variance explained by each Principal Component. Assumes that the number of samples is smaller than the number of features/metabolites. Performs a transpose of the metabolite dataframe if n_samples > n_features (this can be turned off with auto_transpose).

Parameters:
  • scale (bool, optional) – Scale (standardize) the metabolite values to zero mean and unit variance. Default is True.

  • n_principal_components (int, optional) – Number of principal components to keep in the PCA. If the number of PCs > min(n_samples, n_features), it is set to min(n_samples, n_features). Default is 10.

  • auto_transpose (bool, optional) – If n_samples > n_features, performs a transpose of the feature matrix. Default is True (meaning that transposing will occur if n_samples > n_features).

Returns:

self – Object with the following attributes set:
  • .exp_variance: dataframe with explained variance per Principal Component

  • .metabolome_pca_reduced: dataframe with samples in reduced dimensions

  • .pca_performed: boolean set to True

Return type:

object
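
To illustrate what the two PCA outputs contain, here is a self-contained sketch that computes a PCA through NumPy's singular value decomposition. This is a swapped-in technique chosen to keep the example dependency-free; it is not the package's implementation. The function pca_sketch and the random toy data are assumptions made for this example.

```python
import numpy as np


def pca_sketch(samples_by_features, scale=True, n_principal_components=10):
    """Hypothetical helper: PCA via SVD on a (n_samples, n_features) array."""
    x = np.asarray(samples_by_features, dtype=float)
    x = x - x.mean(axis=0)  # centre each feature
    if scale:
        std = x.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # leave constant features untouched
        x = x / std
    # Cap the number of components at min(n_samples, n_features).
    n_pc = min(n_principal_components, *x.shape)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = (u * s)[:, :n_pc]  # sample coordinates on the PCs
    exp_variance = 100 * s[:n_pc] ** 2 / np.sum(s ** 2)  # percent per PC
    return scores, exp_variance


rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))  # 6 samples, 4 features
scores, exp_variance = pca_sketch(data)
```

Because only 4 features are available, the requested 10 components are reduced to 4, so scores has shape (6, 4) and the explained variance percentages add up to 100.
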

create_scree_plot(plot_file_name=None)

Returns a barplot with the explained variance per Principal Component. Has to be preceded by compute_pca_on_metabolites().

Parameters:

plot_file_name (str, optional) – Path to a file where the plot will be saved. For instance ‘my_scree_plot.pdf’. Default is None.

Returns:

Returns the Axes object with the scree plot drawn onto it. Optionally a saved image of the plot.

Return type:

matplotlib Axes

create_sample_score_plot(pc_x_axis=1, pc_y_axis=2, name_grouping_var='genotype', separator_replicates='_', show_color_legend=True, plot_file_name=None)

Returns a sample score plot of the samples on PCx vs PCy. Samples are colored based on the grouping variable (e.g. genotype)

Parameters:
  • pc_x_axis (int, optional) – Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).

  • pc_y_axis (int, optional) – Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).

  • name_grouping_var (str, optional) – Name of the variable used to color samples (default is “genotype”).

  • separator_replicates (str, optional) – String separator that separates grouping factor from biological replicates (default is underscore “_”).

  • show_color_legend (bool, optional) – Add legend for hue (default is True).

  • plot_file_name (str, optional) – A file name and its path to save the sample score plot (default is None). For instance “mydir/sample_score_plot.pdf”. Path is relative to current working directory.

Returns:

Returns the Axes object with the sample score plot drawn onto it. Samples are colored by specified grouping variable. Optionally a saved image of the plot.

Return type:

matplotlib Axes

compute_metabolome_sparsity()

Determines the sparsity of the metabolome matrix. Formula: number of zero values / number of values * 100. The higher the sparsity, the more zero values.

Returns:

self – Object with sparsity attribute filled (sparsity is a float).

Return type:

object

References

https://stackoverflow.com/questions/38708621/how-to-calculate-percentage-of-sparsity-for-a-numpy-array-matrix
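
A minimal NumPy sketch of a sparsity percentage, defined as in the class description (percentage of zero values, 100% for a matrix full of zeros). The function name sparsity_percentage is hypothetical; the values are the four example features shown elsewhere in this page.

```python
import numpy as np


def sparsity_percentage(matrix):
    """Hypothetical helper: percentage of zero values in a matrix."""
    values = np.asarray(matrix)
    return (1 - np.count_nonzero(values) / values.size) * 100


example = np.array([
    [554, 678, 674, 936, 824, 940],
    [1364, 1340, 1692, 1948, 1928, 1956],
    [0, 0, 0, 888, 786, 668],
    [10972, 11190, 12172, 11820, 12026, 11604],
])
sparsity = sparsity_percentage(example)  # 3 zeros out of 24 values
```
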

plot_features_in_upset_plot(seperator_replicates='_', plot_file_name=None)

Visualises the presence of features per group in an UpSet plot. A feature is considered present in a group if the median>0.

Parameters:
  • seperator_replicates (str, optional) – The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1). Default is “_”.

  • plot_file_name (str, optional) – A file name and its path to save the UpSet plot (default is None). For instance “mydir/feature_upset_plot.pdf”. Path is relative to current working directory.

Returns:

UpSet plot with feature presence per group.

Return type:

Plot

Notes

Input dataframe

| feature_id | MM_1 | MM_2 | MM_3 | MM_4 | LA1330_1 | LA1330_2 |
|------------|------|------|------|------|----------|----------|
| rt-0.04_mz-241.88396 | 554 | 678 | 674 | 936 | 824 | 940 |
| rt-0.05_mz-143.95911 | 1364 | 1340 | 1692 | 1948 | 1928 | 1956 |
| rt-0.06_mz-124.96631 | 0 | 0 | 0 | 888 | 786 | 668 |
| rt-0.08_mz-553.45905 | 10972 | 11190 | 12172 | 11820 | 12026 | 11604 |
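
The presence rule (median > 0 within a group) can be illustrated without any plotting. The sketch below only builds the boolean presence table that an UpSet plot would summarise; the helper name feature_presence_per_group is hypothetical and the plotting step is left out.

```python
import pandas as pd


def feature_presence_per_group(df, separator="_"):
    """Hypothetical helper: True where the group median of a feature is > 0."""
    # Derive the group of each sample, e.g. 'LA1330_2' -> 'LA1330'.
    groups = df.columns.str.split(separator).str[0].to_numpy()
    group_medians = df.T.groupby(groups).median().T
    return group_medians > 0


example = pd.DataFrame(
    {
        "MM_1": [554, 1364, 0, 10972],
        "MM_2": [678, 1340, 0, 11190],
        "MM_3": [674, 1692, 0, 12172],
        "MM_4": [936, 1948, 888, 11820],
        "LA1330_1": [824, 1928, 786, 12026],
        "LA1330_2": [940, 1956, 668, 11604],
    },
    index=pd.Index(
        [
            "rt-0.04_mz-241.88396",
            "rt-0.05_mz-143.95911",
            "rt-0.06_mz-124.96631",
            "rt-0.08_mz-553.45905",
        ],
        name="feature_id",
    ),
)
presence = feature_presence_per_group(example)
```

In this example rt-0.06_mz-124.96631 is absent from MM (median of 0, 0, 0 and 888 is 0) but present in LA1330, while the other three features are present in both groups.
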