Documentation

Version 0.1.3

PhenotypeAnalysis

A class to analyse data from developmental bioassays and group the samples in distinct phenotypic classes.

Parameters

bioassay_csv: string: A path to a .csv file with the bioassay count data.

Attributes

bioassay_csv: The path to the csv file with the input data.
bioassay: A pandas dataframe with the development data.
sample_id: The name of the column that contains the sample identifiers.
cumulative_data: A pandas dataframe with cumulative development data.
max_counts: A pandas dataframe with the maximum number of nymphs developed to or past each developmental stage per group.
survival_data: A pandas dataframe with the number of living nymphs per stage at each time point per group.
order_of_groups: A string with the order in which the groups should be plotted.
absolute_counts: A pandas dataframe with values from max_counts in long format.
max_relative: A pandas dataframe based on absolute_counts, with values relative to values of specified developmental stage.

Methods

reshape_to_wide: Reshapes the dataframe from a long to a wide format to make the data accessible for pre-processing. With the counts of each developmental stage in a seperate columns.
combine_seperately_counted_versions_of_last_recorded_stage: Calculates the total number of nymphs developed to the final developmental stage per sample on each timepoint.
convert_counts_to_cumulative: Calculates the total number of nymphs developed to or past each stage on each timepoint.
correct_cumulative_counts: Inner function for convert_counts_to_cumulative().
create_df_with_max_counts_per_stage: Inner function for convert_counts_to_cumulative().
prepare_for_plotting: Prepare the order in which the groups should be plotted.
plot_counts_per_stage: Plots the counts per nymphal stage in boxplots.
plot_development_over_time_in_fitted_model: Fits a 3 parameter log-logistic curve to the development over time to a specified stage.
ll3: A three parameter log-logistic function.
plot_survival_over_time_in_fitted_model: Fits a 3 parameter log-normal curve to the number of living nymphs over time.
hazard: A three parameter log-normal function.

reshape_to_wide

Reshapes the dataframe from a long to a wide format to make the data accessible for pre-processing. with the counts of each developmental stage in a seperate columns.

Usage

reshape_to_wide(
self, sample_id=’sample_id’, grouping_variable=’genotype’, developmental_stages=’stage’, count_values=’number’, time=’day’)

Parameters

sample_id: string, default=’sample_id’
The name of the column that contains the sample identifiers.

grouping_variable: string, default=’genotype’
The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

developmental_stages: string, default=’stage’
The name of the column that contains the developmental stages that were scored during the bioassay.

count_values: string, default=’numbers’
The name of the column that contains the counts.

time: string, default=’day’
The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

Examples

Example of an input dataframe

sample_id	genotype	day	stage	count
mm_1	mm	5	eggs	45
mm_1	mm	5	first_instar	0
mm_1	mm	5	second_instar	0

Example of a reshaped output dataframe

sample_id	genotype	day	eggs	first_instar	second instar	third_instar
mm_1	mm	5	45	0	0	0
mm_1	mm	9	NA	10	5	0
mm_1	mm	11	NA	15	17	4

combine_seperately_counted_versions_of_last_recorded_stage

Calculates the total number of nymphs developed to the final developmental stage per sample on each timepoint. This is used when nymphs in the (late) final nymph stage were removed after each counting moment and/or when exuviea and last instar stage nymphs were counted seperately. Removal of late last stage nymphs could for example be used to prevent adults from emerging and escaping.

Usage

combine_seperately_counted_versions_of_last_recorded_stage(
self, exuviea=’exuviea’, late_last_stage=’late_fourth_instar’, early_last_stage=’early_fourth_instar’, new_last_stage=’fourth_instar’, seperate_exuviea=True, late_last_stage_removed=True, early_last_stage_kept=True, remove_individual_stage_columns=True)

Parameters

exuviea: string, default=’exuviea’
The name of the column that contains the exuviea counts.

late_last_stage: string, default=’late_fourth_instar’
The name of the column that contains the counts of the last developmental stage recorded in the bioassay.

early_last_stage: string, default=’early_fourth_instar’
The name of the column that contains the counts of the nymphs in early last developmental stage. Is used when nymphs counted in late_last_stage were removed after each counting moment during the bioassay.

new_last_stage: string, default=’fourth_instar’
Name for new column with the returned total final stage data

seperate_exuviea: bool, default=True
If True, sums exuviea and late_last_stage per sample per timepoint. If exuviea were counted seperately from late_last_stage, set to True. If exuviea count was included in late_last_stage, set to False

late_last_stage_removed: bool, default=True
If True, returns the cumulative number of late_last_stage(+exuviea) per sample over time. If nymphs counted in late_last_stage (and exuviea if counted seperately) were removed after each counting moment, set to True. If nymphs counted in late_last_stage (and exuviea if counted seperately) were left on the sample until ending the bioassay, set to False.

early_last_stage_kept: bool, default=True
If True, sums the early and late last stage counts per sample per timepoint If late last stage nymphs were removed after each counting moment, but early last stage nymphs were left on sample, set to True. If early and late last stage nymphs were not counted seperately, set to False

remove_individual_stage_columns: bool, default=True
If True, removes exuviea, late_last_stage, early_last_stage columns from dataframe after returning new_last_stage column.

Examples

Example of an input dataframe

sample_id	genotype	day	eggs	…	third_instar	exuviea	early_fourth_instar	late_fourth_instar
mm_1	mm	5	45	…	0	0	0	0
mm_1	mm	9	NA	…	0	1	5	0
mm_1	mm	11	NA	…	4	0	7	4

Example of an output dataframe

sample_id	genotype	day	eggs	first_instar	second instar	third_instar	fourth_instar
mm_1	mm	5	45	0	0	0	0
mm_1	mm	9	NA	10	5	0	6
mm_1	mm	11	NA	15	17	4	12

convert_counts_to_cumulative

Calculates the total number of nymphs developed to or past each stage on each timepoint. Cumulative counts make the analysis of development over time and the comparison of number of nymphs past a stage easier. If nymphs in the (late) final nymph stage were removed after each counting moment and/or when exuviea and/or early and late last instar stage nymphs were counted seperately, total_last_stage() should be used first.

Usage

convert_counts_to_cumulative(
self, n_developmental_stages=4, sample_id=’sample_id’, eggs=’eggs’, first_stage=’first_instar’, second_stage=’second_instar’, third_stage=’third_instar’, fourth_stage=’fourth_instar’, fifth_stage=’fifth_instar’, sixth_stage=’sixth_instar’)

Parameters

n_developmental_stages: integer, default=4
The number of developmental stages which were recorded seperately. Can range from 2 to 6.

sample_id: string, default=’sample_id’
The name of the column that contains the sample identifiers.

eggs: string, default=’eggs’
The name of the column that contains the counts of the eggs.

first_stage: string, default=’first_instar’
The name of the column that contains the counts of the first developmental stage recorded in the bioassay.

second_stage: string, default=’second_instar’
The name of the column that contains the counts of the second developmental stage recorded in the bioassay.

third_stage: string, default=’third_instar’
The name of the column that contains the counts of the third developmental stage recorded in the bioassay.

fourth_stage: string, default=’fourth_instar’
The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.

fifth_stage: string, default=’fifth_instar’
The name of the column that contains the counts of the fifth developmental stage recorded in the bioassay.

sixth_stage: string, default=’sixth_instar’
The name of the column that contains the counts of the sixth developmental stage recorded in the bioassay.

correct_cumulative_counts

Inner function for convert_counts_to_cumulative(). If nymphs die during the bioassay, they should be included in the cumulative count for the stages it had passed. Otherwise, the cumulative count could go down over time. This function corrects the cumulative count if it is lower than the previous count.

Usage

correct_cumulative_counts(
self, current_stage, grouping_variable)

create_df_with_max_counts_per_stage

Inner function for convert_counts_to_cumulative(). With the maximum number of nymphs developed to or past each developmental stage per plant, making graphs becomes easier.

Usage

create_df_with_max_counts_per_stage(
self, egg_column, last_stage, grouping_variable):

prepare_for_plotting

Prepare the order in which the groups should be plotted.

Usage

prepare_for_plotting( self, order_of_groups)

Parameters

order_of_groups: string
List of the group names in the prefered order for plotting For example: [‘MM’, ‘LA’, ‘PI’]

plot_counts_per_stage

Plots the counts per nymphal stage in boxplots. The nymph counts are given as the absolute number of nymphs that developed to or past each stage at the last timepoint and as a fraction of nymphs that developed to or past each stage at the last timepoint relative to another developmental stage. The other developmental stage to which the data is made relative defaults to the first instar stage, because this represents the number of hatched eggs. This means that in this case only the succes of the development is compared between groups (e.g. genotypes or treatments) and the hatching rate of the eggs is not taken into acount.

The imput dataframe ‘max_counts’ is created with convert_counts_to_cumulative.

Usage

plot_counts_per_stage(
self, grouping_variable=’genotype’, sample_id=’sample_id’, eggs=’eggs’, first_stage=’first_instar’, second_stage=’second_instar’, third_stage=’third_instar’, fourth_stage=’fourth_instar’, absolute_x_axis_label=’genotype’, absolute_y_axis_label=’counts (absolute)’, relative_x_axis_label=’genotype’, relative_y_axis_label=’relative number of nymphs’, make_nymphs_relative_to=’first_instar’)

Parameters

grouping_variable: string, default=’genotype’
The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’
The name of the column that contains the sample identifiers.

eggs: string, default=’eggs’
The name of the column that contains the counts of the eggs.

first_stage: string, default=’first_instar’
The name of the column that contains the counts of the first developmental stage recorded in the bioassay.

second_stage: string, default=’second_instar’
The name of the column that contains the counts of the second developmental stage recorded in the bioassay.

third_stage: string, default=’third_instar’
The name of the column that contains the counts of the third developmental stage recorded in the bioassay.

fourth_stage: string, default=’fourth_instar’
The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.

absolute_x_axis_label: string, default=’genotype’
Label for the x-axis of the boxplots with count data.

absolute_y_axis_label: string, default=’counts (absolute)’
Label for the y-axis of the boxplots with count data.

relative_x_axis_label: string, default=’genotype’
Label for the x-axis of the boxplots with relative development.

relative_y_axis_label: string, default=’relative number of nymphs’
Label for the y-axis of the boxplots with relative development.

make_nymphs_relative_to: string, default=’first_instar’
The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.

Examples

Example of an input dataframe

sample_id	genotype	day	eggs	first_instar	second_instar	third_instar	fourth_instar
mm_1	mm	28	45	34	30	30	29
mm_2	mm	28	50	39	33	28	26
LA_1	LA	28	42	30	25	17	4

plot_development_over_time_in_fitted_model

Fits a 3 parameter log-logistic curve to the development over time to a specified stage. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to asses the goodness of fit for the fitted models for each group (genotype, treatment, etc.). Optimaly, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Usage

plot_development_over_time_in_fitted_model(
self, grouping_variable=’genotype’, sample_id=’sample_id’, time=’day’, x_axis_label=’days after infection’, y_axis_label=’development to 4th instar stage (relative to 1st instars)’, stage_of_`int`erest=’fourth_instar’, use_relative_data=True, make_nymphs_relative_to=’first_instar’, predict_for_n_days=0)

Parameters

grouping_variable: string, default=’genotype’
The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’
The name of the column that contains the sample identifiers.

time: string, default=’day’
The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

x_axis_label: string, default=’days after infection’
Label for the x-axis

y_axis_label: string, default=’development to 4th instar stage (relative to 1st instars)’
Label for the y-axis

stage_of_`int`erest: string, default=’fourth_instar’
The name of the column that contains the data of the developmental stage of `int`erest.

use_relative_data: bool, default=True
If True, the counts for the stage of `int`erest are devided by the stage indicated at ‘make_nymphs_relative_to’. The returned relative rate is used for plotting and curve fitting.

make_nymphs_relative_to: string, default=’first_instar’
The name of the column that contains the counts of the developmental stage which should be used to calculate therelative development to all developmental stages.

predict_for_n_days: int, default=o
Continue model for n days after final count.

Examples

Example of an input dataframe

sample_id	genotype	day	eggs	first_instar	second instar	third_instar	fourth_instar
mm_1	mm	5	45	15	7	0	0
mm_1	mm	9	NA	24	14	6	3
mm_1	mm	11	NA	38	27	16	12

ll3

A three parameter log-logistic function.

Usage

ll3(x,slope,maximum,emt50):

Parameters

slope:
the slope of the curve

maximum:
the maximum value of the curve

emt50:
the EmT50, the timepoint at which 50% of nymphs has developed to the stage of `int`erest

Model

y(x) = maximum/(1+np.exp(slope*(np.log(x)-np.log(emt50))))

plot_survival_over_time_in_fitted_model

Fits a 3 parameter log-normal curve to the number of living nymphs over time. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to asses the goodness of fit for the fitted models for each group (genotype, treatment, etc.). Optimaly, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Usage

plot_survival_over_time_in_fitted_model(
self, grouping_variable=’genotype’, sample_id=’sample_id’, time=’day’, x_axis_label=’days after infection’, y_axis_label=’number of nymphs per plant’, stage_of_`int`erest=’first_instar’, use_relative_data=False, make_nymphs_relative_to=’eggs’, predict_for_n_days=0)

Parameters

grouping_variable: string, default=’genotype’
The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’
The name of the column that contains the sample identifiers.

time: string, default=’day’
The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

x_axis_label: string, default=’days after infection’
Label for the x-axis

y_axis_label: string, default=’development to 4th instar stage (relative to 1st instars)’
Label for the y-axis

stage_of_`int`erest: string, default=’first_instar’
The name of the column that contains the data of the developmental stage of `int`erest.

use_relative_data: bool, default=False
If True, the counts for the stage of `int`erest are devided by the stage indicated at ‘make_nymphs_relative_to’. The returned relative rate is used for plotting and curve fitting.

make_nymphs_relative_to: string, default=’eggs’
The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.

predict_for_n_days: int, default=o
Continue model for n days after final count.

Examples: Example of an input dataframe

sample_id	genotype	day	eggs	first_instar	second instar	third_instar	fourth_instar
mm_1	mm	5	45	15	7	0	0
mm_1	mm	9	NA	24	14	6	3
mm_1	mm	11	NA	38	27	16	12

hazard

A three parameter log-normal function.

Usage

hazard(x,auc,median,shape)

Parameters

auc:
area under the curve

median:
median time point

shape:
shape of the curve

Model

y(x) = (auc*(shape/median)*pow(x/median,shape-1))/(1+pow(x/median,shape))

OmicsAnalysis

A class to streamline the filtering and exploration of a metabolome dataset.

Parameters

metabolome_csv: str: A path to a .csv file with the metabolome data (scaled or unscaled). Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples
metabolome_feature_id_col: str, optional: The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).

Attributes

metabolome: pandas.core.frame.DataFrame, (n_samples, n_features): The metabolome Pandas dataframe imported from the .csv file. metabolome_validated: bool Is the metabolome dataset validated? Default is False.
blank_features_filtered: bool: Are the features present in blank samples filtered out from the metabolome data? Default by False.
filtered_by_percentile_value: bool: Are the features filtered by percentile value?
unreliable_features_filtered: bool: Are the features not reliably present within one group filtered out from the metabolome data?
pca_performed: bool: Has PCA been performed on the metabolome data? Default is False.
exp_variance: pandas.core.frame.DataFrame, (n_pc, 1): A Pandas dataframe with explained variance per Principal Component. The index of the df contains the PC index (PC1, PC2, etc.). The second column contains the percentage of the explained variance per PC.
metabolome_pca_reduced: numpy.ndarray, (n_samples, n_pc): Numpy array with sample coordinates in reduced dimensions. The dimension of the numpy array is the minimum of the number of samples and features.
sparsity: float: Metabolome matrix sparsity.

Methods

validate_input_metabolome_df: Check if the provided metabolome file is suitable. Turns attribute metabolome_validated to True.
discard_features_detected_in_blanks: Removes features only detected in blank samples.
impute_missing_values_with_median: Impute missing values with the median value of the feature.
filter_out_unreliable_features: Filter out features not reliably detectable in replicates of the same grouping factor. For instance, if a feature is detected less than 4 times within 4 biological replicates, it is discarded with argument nb_times_detected=4.
filter_features_per_group_by_percentile: Filter out features whose abundance within the same grouping factor is lower than a certain percentile value. For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.
compute_metabolome_sparsity: Computes the sparsity percentage of the metabolome matrix (percentage of 0 values e.g. 100% for an matrix full of 0 values)
write_clean_metabolome_to_csv: Write the filtered and analysis-ready metabolome data to a .csv file.

Examples

Example of an input metabolome input format (from a csv file)

feature_id	blank_1	blank_2	blank_3	blank_4	MM_1	MM_2	MM_3	MM_4	LA1330_1	LA1330_2	LA1330_3	LA1330_4
rt-0.04_mz-241.88396	280	694	502	604	554	678	674	936	824	940	794	828
rt-0.05_mz-143.95911	1036	1566	1326	1490	1364	1340	1692	1948	1928	1956	1730	1568
rt-0.06_mz-124.96631	1308	992	1060	1010	742	990	0	888	786	668	762	974
rt-0.08_mz-553.45905	11340	12260	10962	11864	10972	11190	12172	11820	12026	11604	11122	11260
rt-0.08_mz-413.26631	984	1162	1292	1104	1090	1106	1290	1170	1282	924	1172	1062

validate_input_metabolome_df

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Will place the ‘feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute.

Usage

validate_input_metabolome_df(
self, metabolome_feature_id_col=’feature_id’)

Parameters

metabolome_feature_id: str, optional
The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).

Returns

self: object
Object with attribute metabolome_validated set to True if tests are passed.

Examples

Example of a valid input metabolome dataframe

feature_id	genotypeA_rep1	genotypeA_rep2	genotypeA_rep3	genotypeA_rep4
metabolite1	1246	1245	12345	12458
metabolite2	0	0	0	0
metabolite3	10	0	0	154

discard_features_detected_in_blanks

Removes features present in blanks. Steps:

Sum the abundance of each feature in the blank samples.
Makes a list of features to be discarded (features with a positive summed abundance).
Returns a filtered Pandas dataframe with only features not detected in blank samples

Usage

discard_features_detected_in_blanks(
self, blank_sample_contains=’blank’)

Parameters

blank_sample_contains: str, optional.
Column names with this name will be considered blank samples. Default is=’blank’

Returns

metabolome: pandas.core.frame.DataFrame
A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.

create_density_plot

For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas. This plot helps to see whether some groups have a value di`str`ibution different from the rest. The percentage is indicated on the y-axis (bar heights sum to 100).

Usage

create_density_plot(
self, name_grouping_var=”genotype”, n_cols=3, nbins=1000)

Parameters

name_grouping_var: str, optional
The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

n_cols: int, optional
The number of columns for the final plot.

nbins: int, optional
The number of bins to create.

Returns

matplotlib Axes
Returns the Axes object with the density plots drawn onto it.

filter_features_per_group_by_percentile

Filter metabolome dataframe based on a selected percentile threshold. Features with a peak area values lower than the selected percentile will be discarded. The percentile value is calculated per grouping variable.

For instance, selecting the 50th percentile (median) will discard 50% of the features with a peak area lower than the median/50th percentile in each group.

Usage

filter_features_per_group_by_percentile( self, name_grouping_var=”genotype”, separator_replicates=”_”, percentile=50)

Parameters

name_grouping_var: str, optional
The name of the grouping variable (default is “genotype”)

separator_replicates: str, optional
The character used to separate the main grouping variable from biological replicates. Default is “_: (underscore)

percentile: float, optional
The percentile threshold. Has to be comprised 0 and 100.

Returns

self: object
The object with the .metabolome attribute filtered and the filtered_by_percentile_value set to True.

See also

Use create_density_plot() method to decide on a suitable percentile value.

filter_out_unreliable_features

Removes features not reliably detectable in multiple biological replicates from the same grouping factor.

Takes a dataframe with feature identifiers in index and samples as columns. Steps:

First melt and split the sample names to generate the grouping variable
Count number of times a metabolite is detected in the groups. If number of times detected in a group = number of biological replicates then it is considered as reliable. Each feature receives a tag ‘reliable’ or ‘not_reliable’
Discard the ‘not_reliable’ features and keep the filtered dataframe.

Usage

filter_out_unreliable_features(
self, name_grouping_var=”genotype”, nb_times_detected=4, separator_replicates=’_’)

Parameters

name_grouping_var: str, optional
The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

nb_times_detected: int, optionaldefault=4
Number of times a metabolite should be detected to be considered ‘reliable’. Should be equal to the number of biological replicates for a given group of `int`erest (e.g. genotype)

separator_replicates: string, default=”_”
The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1)

Returns

metabolome: ndarray
A Pandas dataframe with only features considered as reliable, sample names and their values.

Examples

Example of an input dataframe

Example of an output df (rt-0.06_mz-124.96631 is kicked out because 3x0 and 1x888 in MM groups)

write_clean_metabolome_to_csv

Writes the cleaned metabolome data to the disk as a comma-separated value file.

Usage

write_clean_metabolome_to_csv(self, path_of_cleaned_csv=”./data_for_manuals/filtered_metabolome.csv”):

Parameters

path_of_cleaned_csv: str, optional
The path and filename of the .csv file to save. Default to “./data_for_manuals/filtered_metabolome.csv”

compute_pca_on_metabolites

Performs a Principal Component Analysis (PCA) on the metabolome data.

The PCA analysis will return transformed coordinates of the samples in a new space. It will also give the percentage of variance explained by each Principal Component. Assumes that number of samples < number of features/metabolites Performs a transpose of the metabolite dataframe if n_samples > n_features (this can be turned off with auto_transpose)

Usage

compute_pca_on_metabolites(
self, scale=True, n_principal_components=10, auto_transpose=True)

Parameters

scale: bool, optional
Perform scaling (standardize) the metabolite values to zero mean and unit variance. Default is True.

n_principal_components: int, optional
number of principal components to keep in the PCA analysis. if number of PCs > min(n_samples, n_features) then set to the minimum of (n_samples, n_features) Default is to calculate 10 components.

auto_transpose: bool, optional.
If n_samples > n_features, performs a transpose of the feature matrix. Default is True (meaning that transposing will occur if n_samples > n_features).

Returns

self: object
Object with .exp_variance: dataframe with explained variance per Principal Component .metabolome_pca_reduced: dataframe with samples in reduced dimensions .pca_performed: `bool`ean set to True

impute_missing_values_with_median

Imputes missing values with the median of the column. This is necessary for PCA to work.

Usage

impute_missing_values_with_median(
self, missing_value_str=’np.nan’)

Parameters

missing_value_str: str, optional
The string that represents missing values in the input dataframe. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

Returns

self: object with attribute ‘metabolome’ updated with imputed values.

create_scree_plot

Returns a barplot with the explained variance per Principal Component. Has to be preceded by perform_pca()

Usage

create_scree_plot(
self, plot_file_name=None)

Parameters

plot_file_name: string, default=’None’
Path to a file where the plot will be saved. For instance ‘my_scree_plot.pdf’

Returns

matplotlib Axes
Returns the Axes object with the scree plot drawn onto it. Optionally a saved image of the plot.

create_sample_score_plot

Returns a sample score plot of the samples on PCx vs PCy. Samples are colored based on the grouping variable (e.g. genotype)

Usage

create_sample_score_plot(
self, pc_x_axis=1, pc_y_axis=2, name_grouping_var=’genotype’, separator_replicates=”_”, show_color_legend=True, plot_file_name=None)

Parameters

pc_x_axis: int, optional
Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).

pc_y_axis: int, optional.
Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).

name_grouping_var: str, optional
Name of the variable used to color samples (Default is “genotype”).

separator_replicates: str, optional.
String separator that separates grouping factor from biological replicates (default is underscore “_”).

show_color_legend: bool, optional.
Add legend for hue (default is True).

plot_file_name: str, optional
A file name and its path to save the sample score plot (default is None). For instance “mydir/sample_score_plot.pdf” Path is relative to current working directory.

Returns

matplotlib Axes
Returns the Axes object with the sample score plot drawn onto it. Samples are colored by specified grouping variable. Optionally a saved image of the plot.

compute_metabolome_sparsity

Determine the sparsity of the metabolome matrix. Formula: number of non zero values/number of values * 100 The higher the sparsity, the more zero values

Usage

compute_metabolome_sparsity(self)

Returns

self: object
Object with sparsity attribute filled (sparsity is a float).

https://stackoverflow.com/questions/38708621/how-to-calculate-percentage-of-sparsity-for-a-numpy-array-matrix

plot_features_in_upset_plot

Visuallises the presence of features per group in an UpSet plot. A feature is considered present in a group if the median>0.

Usage

plot_features_in_upset_plot(
self, seperator_replicates=”_”, plot_file_name=None)

Parameters

separator_replicates: string, default=”_”
The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1)

plot_file_name: str, optional
A file name and its path to save the sample score plot (default is None). For instance “mydir/feature_upset_plot.pdf” Path is relative to current working directory.

Returns

Plot:
UpSet plot with features presence per group.

Examples

Example of an input dataframe

FeatureSelection

A class to perform metabolite feature selection using phenotyping and metabolic data.

Perform sanity checks on input dataframes (values above 0, etc.).
Get a baseline performance of a simple Machine Learning Random Forest (“baseline”).
Perform automated Machine Learning model selection using autosklearn.
Using metabolite data, train a model to predict phenotypes. Yields performance metrics (balanced accuracy, precision, recall) on the selected model.
Extracts performance metrics from the best ML model.
Extracts the best metabolite features based on their feature importance and make plots per sample group.

Parameters

metabolome_csv: string

A path to a .csv file with the cleaned up metabolome data (unreliable features filtered out etc.) Use the MetabolomeAnalysis class methods. Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples

phenotype_csv: string

A path to a .csv file with the phenotyping data. Should be two columns at least with:

column 1 containing the sample identifiers

column 2 containing the phenotypic class e.g. ‘resistant’ or ‘sensitive’

metabolome_feature_id_col: string, default=’feature_id’

The name of the column that contains the feature identifiers. Feature identifiers should be unique (=not duplicated).

phenotype_sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers. Sample identifiers should be unique (=not duplicated).

Attributes

metabolome_validated: bool: Is the metabolome file valid for Machine Learning? (default is False)
phenotype_validated: bool: Is the phenotype file valid for Machine Learning? (default is False)
baseline_performance: float: The baseline performance computed with get_baseline_performance() i.e. using a simple Random Forest model. Search for the best ML model using search_best_model() should perform better than this baseline performance.
best_ensemble_models_searched: bool: Is the search for best ensemble model using auto-sklearn already performed? (default is False)
metabolome: pandas.core.frame.DataFrame: The validated metabolome dataframe of shape (n_features, n_samples).
phenotype: pandas.core.frame.DataFrame: A validated phenotype dataframe of shape (n_samples, 1) Sample names in the index and one column named ‘phenotype’ with the sample classes.
baseline_performance: str: Average balanced accuracy score (-/+ standard deviation) of the basic Random Forest model.
best_model: sklearn.pipeline.Pipeline: A scikit-learn pipeline that contains one or more steps. It is the best performing pipeline found by TPOT automated ML search.
pc_importances: pandas.core.frame.DataFrame: A Pandas dataframe that contains Principal Components importances using scikit-learn permutation_importance() Mean of PC importance over n_repeats. Standard deviation over n_repeats. Raw permutation importance scores.
feature_loadings: pandas.core.frame.DataFrame: A Pandas dataframe that contains feature loadings related to Principal Components

Methods

validate_input_metabolome_df(): Validates the dataframe read from the ‘metabolome_csv’ input file.
validate_input_phenotype_df(): Validates the phenotype dataframe read from the ‘phenotype_csv’ input file.
get_baseline_performance(): Fits a basic Random Forest model to get default performance metrics.
search_best_model_with_tpot_and_get_feature_importances(): Search for the best ML pipeline using TPOT genetic programming method. Computes and output performance metrics from the best pipeline. Extracts feature importances using scikit-learn permutation_importance() method.

Examples

Example of an input metabolome .csv file

feature_id	genotypeA_rep1	genotypeA_rep2	genotypeA_rep3	genotypeA_rep4
metabolite1	1246	1245	12345	12458
metabolite2	0	0	0	0
metabolite3	10	0	0	154

Example of an input phenotype .csv file

sample_id	phenotype
genotypeA_rep1	sensitive
genotypeA_rep2	sensitive
genotypeA_rep3	sensitive
genotypeA_rep4	sensitive
genotypeB_rep1	resistant
genotypeB_rep2	resistant

validate_input_metabolome_df

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Will place the ‘feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute

Usage

validate_input_metabolome_df(self)

Returns

self: object
Object with metabolome_validated set to True

Examples: Example of a validated output metabolome dataframe

feature_id	genotypeA_rep1	genotypeA_rep2	genotypeA_rep3	genotypeA_rep4
metabolite1	1246	1245	12345	12458
metabolite2	0	0	0	0
metabolite3	10	0	0	154

validate_input_phenotype_df

Validates the dataframe containing the phenotype classes and the sample identifiers.

Usage

validate_input_phenotype_df(
self, phenotype_class_col=”phenotype”)

Parameters

phenotype_class_col: string, default=”phenotype”
The name of the column to be used

Returns

self: object
Object with phenotype_validated set to True

Examples

Example of a validated phenotype dataframe

sample_id	phenotype
genotypeA_rep1	sensitive
genotypeA_rep2	sensitive
genotypeA_rep3	sensitive
genotypeA_rep4	sensitive
genotypeB_rep1	resistant
genotypeB_rep2	resistant

get_baseline_performance

Takes the phenotype and metabolome dataset and compute a simple Random Forest analysis with default hyperparameters. This will give a base performance for a Machine Learning model that has then to be optimised using autosklearn

k-fold cross-validation is performed to mitigate split effects on small datasets.

get_baseline_performance(: self, kfold=5, train_size=0.8, random_state=123, scoring_metric=’balanced_accuracy’)

Parameters

kfold: int, optional
Cross-validation `str`ategy. Default is to use a 5-fold cross-validation.

train_size: float or int, optional
If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).

random_state: int, optional
Controls both the randomness of the train/test split samples used when building trees (if boot`str`ap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.

scoring_metric: str, optional
A valid scoring value (default=”balanced_accuracy”) To get a complete list, type: >> from sklearn.metrics import SCORERS >> sorted(SCORERS.keys()) balanced accuracy is the average of recall obtained on each class.

Returns

self: object
Object with baseline_performance attribute.

search_best_model_with_tpot_and_compute_pc_importances

Search for the best ML model with TPOT genetic programming methodology and extracts best Principal Components.

A characteristic of metabolomic data is to have a high number of features strongly correlated to each other. This makes it difficult to extract the individual true feature importance. Here, this method implements a dimensionality reduction method (PCA) and the importances of each PC is computed.

A resampling strategy called “cross-validation” will be performed on a subset of the data (training data) to increase the model generalisation performance. Finally, the model performance is tested on the unseen test data subset.

By default, TPOT will make use of a set of preprocessors (e.g. Normalizer, PCA) and algorithms (e.g. RandomForestClassifier) defined in the default config (classifier.py). See: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py

Usage

search_best_model_with_tpot_and_compute_pc_importances(
self, class_of_interest, scoring_metric=’balanced_accuracy’, kfolds=3, train_size=0.8, max_time_mins=5, max_eval_time_mins=1, random_state=123, n_permutations=10, export_best_pipeline=True, path_for_saving_pipeline=”./best_fitting_pipeline.py”)

Parameters

class_of_interest: str
The name of the class of interest also called “positive class”. This class will be used to calculate recall_score and precision_score. Recall score = TP / (TP + FN) with TP: true positives and FN: false negatives. Precision score = TP / (TP + FP) with TP: true positives and FP: false positives.

scoring_metric: str, optional
Function used to evaluate the quality of a given pipeline for the classification problem. Default is ‘balanced accuracy’. The following built-in scoring functions can be used: ‘accuracy’, ‘adjusted_rand_score’, ‘average_precision’, ‘balanced_accuracy’, ‘f1’, ‘f1_macro’, ‘f1_micro’, ‘f1_samples’, ‘f1_weighted’, ‘neg_log_loss’, ‘precision’ etc. (suffixes apply as with ‘f1’), ‘recall’ etc. (suffixes apply as with ‘f1’), ‘jaccard’ etc. (suffixes apply as with ‘f1’), ‘roc_auc’, ‘roc_auc_ovr’, ‘roc_auc_ovo’, ‘roc_auc_ovr_weighted’, ‘roc_auc_ovo_weighted’

kfolds: int, optional
Number of folds for the `str`atified K-Folds cross-validation `str`ategy. Default is 3-fold cross-validation. Has to be comprised between 3 and 10 i.e. 3 <= kfolds =< 10 See https://scikit-learn.org/stable/modules/cross_validation.html

train_size: float or int, optional
If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).

max_time_mins: int, optional
How many minutes TPOT has to optimize the pipeline (in total). Default is 5 minutes. This setting will allow TPOT to run until max_time_mins minutes elapsed and then stop. Try short time `int`ervals (5, 10, 15min) and then see if the model score on the test data improves.

max_eval_time_mins: float, optional
How many minutes TPOT has to evaluate a single pipeline. Default is 1min. This time has to be smaller than the ‘max_time_mins’ setting.

random_state: int, optional
Controls both the randomness of the train/test split samples used when building trees (if boot`str`ap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.

n_permutations: int, optional
Number of permutations used to compute feature importances from the best model using scikit-learn permutation_importance() method. Default is 10 permutations.

export_best_pipeline: bool, optional
If True, the best fitting pipeline is exported as .py file. This allows for reuse of the pipeline on new datasets. Default is True.

path_for_saving_pipeline: str, optional
The path and filename of the best fitting pipeline to save. The name must have a ‘.py’ extension. Default to “./best_fitting_pipeline.py”

Returns

self: object
The object with best model searched and feature importances computed.

Note

Principal Component importances are calculated on the training set. Permutation importances can be computed either on the training set or on a held-out testing or validation set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit. https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance

get_names_of_top_n_features_from_selected_pc

Get the names of features with highest loading scores on selected PC

Takes the matrix of loading scores of shape (n_samples, n_features) and the metabolome dataframe of shape (n_features, n_samples) and extract the names of features. The loadings matrix is available after running the search_best_model_with_tpot_and_compute_pc_importances() method.

Usage

get_names_of_top_n_features_from_selected_pc(
self, selected_pc=1, top_n=5)

Parameters

selected_pc: int, optional
Principal Component to keep. 1-based index (1 selects PC1, 2 selected PC2, etc.) Default is 1.

top_n: int, optional
Number of features to select. The top_n features with the highest absolute loadings will be selected from the selected_pc PC. For instance, the top 5 features from PC1 will be selected with selected_pc=1 and top_n=5. Default is 5.

Returns

A list of feature names.