Documentation

Version 0.1.3

PhenotypeAnalysis

A class to analyse data from developmental bioassays and group the samples in distinct phenotypic classes.

Parameters

bioassay_csv: string

A path to a .csv file with the bioassay count data.

Attributes

bioassay_csv

The path to the csv file with the input data.

bioassay

A pandas dataframe with the development data.

sample_id

The name of the column that contains the sample identifiers.

cumulative_data

A pandas dataframe with cumulative development data.

max_counts

A pandas dataframe with the maximum number of nymphs developed to or past each developmental stage per group.

survival_data

A pandas dataframe with the number of living nymphs per stage at each time point per group.

order_of_groups

A string with the order in which the groups should be plotted.

absolute_counts

A pandas dataframe with values from max_counts in long format.

max_relative

A pandas dataframe based on absolute_counts, with values relative to values of specified developmental stage.

Methods

reshape_to_wide

Reshapes the dataframe from a long to a wide format to make the data accessible for pre-processing. With the counts of each developmental stage in a seperate columns.

combine_seperately_counted_versions_of_last_recorded_stage

Calculates the total number of nymphs developed to the final developmental stage per sample on each timepoint.

convert_counts_to_cumulative

Calculates the total number of nymphs developed to or past each stage on each timepoint.

correct_cumulative_counts

Inner function for convert_counts_to_cumulative().

create_df_with_max_counts_per_stage

Inner function for convert_counts_to_cumulative().

prepare_for_plotting

Prepare the order in which the groups should be plotted.

plot_counts_per_stage

Plots the counts per nymphal stage in boxplots.

plot_development_over_time_in_fitted_model

Fits a 3 parameter log-logistic curve to the development over time to a specified stage.

ll3

A three parameter log-logistic function.

plot_survival_over_time_in_fitted_model

Fits a 3 parameter log-normal curve to the number of living nymphs over time.

hazard

A three parameter log-normal function.

reshape_to_wide

Reshapes the dataframe from a long to a wide format to make the data accessible for pre-processing. with the counts of each developmental stage in a seperate columns.

Usage

reshape_to_wide(

self, sample_id=’sample_id’, grouping_variable=’genotype’, developmental_stages=’stage’, count_values=’number’, time=’day’)

Parameters

sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers.

grouping_variable: string, default=’genotype’

The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

developmental_stages: string, default=’stage’

The name of the column that contains the developmental stages that were scored during the bioassay.

count_values: string, default=’numbers’

The name of the column that contains the counts.

time: string, default=’day’

The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

Examples

Example of an input dataframe

sample_id

genotype

day

stage

count

mm_1

mm

5

eggs

45

mm_1

mm

5

first_instar

0

mm_1

mm

5

second_instar

0

Example of a reshaped output dataframe

sample_id

genotype

day

eggs

first_instar

second instar

third_instar

mm_1

mm

5

45

0

0

0

mm_1

mm

9

NA

10

5

0

mm_1

mm

11

NA

15

17

4

combine_seperately_counted_versions_of_last_recorded_stage

Calculates the total number of nymphs developed to the final developmental stage per sample on each timepoint. This is used when nymphs in the (late) final nymph stage were removed after each counting moment and/or when exuviea and last instar stage nymphs were counted seperately. Removal of late last stage nymphs could for example be used to prevent adults from emerging and escaping.

Usage

combine_seperately_counted_versions_of_last_recorded_stage(

self, exuviea=’exuviea’, late_last_stage=’late_fourth_instar’, early_last_stage=’early_fourth_instar’, new_last_stage=’fourth_instar’, seperate_exuviea=True, late_last_stage_removed=True, early_last_stage_kept=True, remove_individual_stage_columns=True)

Parameters

exuviea: string, default=’exuviea’

The name of the column that contains the exuviea counts.

late_last_stage: string, default=’late_fourth_instar’

The name of the column that contains the counts of the last developmental stage recorded in the bioassay.

early_last_stage: string, default=’early_fourth_instar’

The name of the column that contains the counts of the nymphs in early last developmental stage. Is used when nymphs counted in late_last_stage were removed after each counting moment during the bioassay.

new_last_stage: string, default=’fourth_instar’

Name for new column with the returned total final stage data

seperate_exuviea: bool, default=True

If True, sums exuviea and late_last_stage per sample per timepoint. If exuviea were counted seperately from late_last_stage, set to True. If exuviea count was included in late_last_stage, set to False

late_last_stage_removed: bool, default=True

If True, returns the cumulative number of late_last_stage(+exuviea) per sample over time. If nymphs counted in late_last_stage (and exuviea if counted seperately) were removed after each counting moment, set to True. If nymphs counted in late_last_stage (and exuviea if counted seperately) were left on the sample until ending the bioassay, set to False.

early_last_stage_kept: bool, default=True

If True, sums the early and late last stage counts per sample per timepoint If late last stage nymphs were removed after each counting moment, but early last stage nymphs were left on sample, set to True. If early and late last stage nymphs were not counted seperately, set to False

remove_individual_stage_columns: bool, default=True

If True, removes exuviea, late_last_stage, early_last_stage columns from dataframe after returning new_last_stage column.

Examples

Example of an input dataframe

sample_id

genotype

day

eggs

third_instar

exuviea

early_fourth_instar

late_fourth_instar

mm_1

mm

5

45

0

0

0

0

mm_1

mm

9

NA

0

1

5

0

mm_1

mm

11

NA

4

0

7

4

Example of an output dataframe

sample_id

genotype

day

eggs

first_instar

second instar

third_instar

fourth_instar

mm_1

mm

5

45

0

0

0

0

mm_1

mm

9

NA

10

5

0

6

mm_1

mm

11

NA

15

17

4

12

convert_counts_to_cumulative

Calculates the total number of nymphs developed to or past each stage on each timepoint. Cumulative counts make the analysis of development over time and the comparison of number of nymphs past a stage easier. If nymphs in the (late) final nymph stage were removed after each counting moment and/or when exuviea and/or early and late last instar stage nymphs were counted seperately, total_last_stage() should be used first.

Usage

convert_counts_to_cumulative(

self, n_developmental_stages=4, sample_id=’sample_id’, eggs=’eggs’, first_stage=’first_instar’, second_stage=’second_instar’, third_stage=’third_instar’, fourth_stage=’fourth_instar’, fifth_stage=’fifth_instar’, sixth_stage=’sixth_instar’)

Parameters

n_developmental_stages: integer, default=4

The number of developmental stages which were recorded seperately. Can range from 2 to 6.

sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers.

eggs: string, default=’eggs’

The name of the column that contains the counts of the eggs.

first_stage: string, default=’first_instar’

The name of the column that contains the counts of the first developmental stage recorded in the bioassay.

second_stage: string, default=’second_instar’

The name of the column that contains the counts of the second developmental stage recorded in the bioassay.

third_stage: string, default=’third_instar’

The name of the column that contains the counts of the third developmental stage recorded in the bioassay.

fourth_stage: string, default=’fourth_instar’

The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.

fifth_stage: string, default=’fifth_instar’

The name of the column that contains the counts of the fifth developmental stage recorded in the bioassay.

sixth_stage: string, default=’sixth_instar’

The name of the column that contains the counts of the sixth developmental stage recorded in the bioassay.

correct_cumulative_counts

Inner function for convert_counts_to_cumulative(). If nymphs die during the bioassay, they should be included in the cumulative count for the stages it had passed. Otherwise, the cumulative count could go down over time. This function corrects the cumulative count if it is lower than the previous count.

Usage

correct_cumulative_counts(

self, current_stage, grouping_variable)

create_df_with_max_counts_per_stage

Inner function for convert_counts_to_cumulative(). With the maximum number of nymphs developed to or past each developmental stage per plant, making graphs becomes easier.

Usage

create_df_with_max_counts_per_stage(

self, egg_column, last_stage, grouping_variable):

prepare_for_plotting

Prepare the order in which the groups should be plotted.

Usage

prepare_for_plotting( self, order_of_groups)

Parameters

order_of_groups: string

List of the group names in the prefered order for plotting For example: [‘MM’, ‘LA’, ‘PI’]

plot_counts_per_stage

Plots the counts per nymphal stage in boxplots. The nymph counts are given as the absolute number of nymphs that developed to or past each stage at the last timepoint and as a fraction of nymphs that developed to or past each stage at the last timepoint relative to another developmental stage. The other developmental stage to which the data is made relative defaults to the first instar stage, because this represents the number of hatched eggs. This means that in this case only the succes of the development is compared between groups (e.g. genotypes or treatments) and the hatching rate of the eggs is not taken into acount.

The imput dataframe ‘max_counts’ is created with convert_counts_to_cumulative.

Usage

plot_counts_per_stage(

self, grouping_variable=’genotype’, sample_id=’sample_id’, eggs=’eggs’, first_stage=’first_instar’, second_stage=’second_instar’, third_stage=’third_instar’, fourth_stage=’fourth_instar’, absolute_x_axis_label=’genotype’, absolute_y_axis_label=’counts (absolute)’, relative_x_axis_label=’genotype’, relative_y_axis_label=’relative number of nymphs’, make_nymphs_relative_to=’first_instar’)

Parameters

grouping_variable: string, default=’genotype’

The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers.

eggs: string, default=’eggs’

The name of the column that contains the counts of the eggs.

first_stage: string, default=’first_instar’

The name of the column that contains the counts of the first developmental stage recorded in the bioassay.

second_stage: string, default=’second_instar’

The name of the column that contains the counts of the second developmental stage recorded in the bioassay.

third_stage: string, default=’third_instar’

The name of the column that contains the counts of the third developmental stage recorded in the bioassay.

fourth_stage: string, default=’fourth_instar’

The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.

absolute_x_axis_label: string, default=’genotype’

Label for the x-axis of the boxplots with count data.

absolute_y_axis_label: string, default=’counts (absolute)’

Label for the y-axis of the boxplots with count data.

relative_x_axis_label: string, default=’genotype’

Label for the x-axis of the boxplots with relative development.

relative_y_axis_label: string, default=’relative number of nymphs’

Label for the y-axis of the boxplots with relative development.

make_nymphs_relative_to: string, default=’first_instar’

The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.

Examples

Example of an input dataframe

sample_id

genotype

day

eggs

first_instar

second_instar

third_instar

fourth_instar

mm_1

mm

28

45

34

30

30

29

mm_2

mm

28

50

39

33

28

26

LA_1

LA

28

42

30

25

17

4

plot_development_over_time_in_fitted_model

Fits a 3 parameter log-logistic curve to the development over time to a specified stage. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to asses the goodness of fit for the fitted models for each group (genotype, treatment, etc.). Optimaly, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Usage

plot_development_over_time_in_fitted_model(

self, grouping_variable=’genotype’, sample_id=’sample_id’, time=’day’, x_axis_label=’days after infection’, y_axis_label=’development to 4th instar stage (relative to 1st instars)’, stage_of_`int`erest=’fourth_instar’, use_relative_data=True, make_nymphs_relative_to=’first_instar’, predict_for_n_days=0)

Parameters

grouping_variable: string, default=’genotype’

The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers.

time: string, default=’day’

The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

x_axis_label: string, default=’days after infection’

Label for the x-axis

y_axis_label: string, default=’development to 4th instar stage (relative to 1st instars)’

Label for the y-axis

stage_of_`int`erest: string, default=’fourth_instar’

The name of the column that contains the data of the developmental stage of `int`erest.

use_relative_data: bool, default=True

If True, the counts for the stage of `int`erest are devided by the stage indicated at ‘make_nymphs_relative_to’. The returned relative rate is used for plotting and curve fitting.

make_nymphs_relative_to: string, default=’first_instar’

The name of the column that contains the counts of the developmental stage which should be used to calculate therelative development to all developmental stages.

predict_for_n_days: int, default=o

Continue model for n days after final count.

Examples

Example of an input dataframe

sample_id

genotype

day

eggs

first_instar

second instar

third_instar

fourth_instar

mm_1

mm

5

45

15

7

0

0

mm_1

mm

9

NA

24

14

6

3

mm_1

mm

11

NA

38

27

16

12

ll3

A three parameter log-logistic function.

Usage

ll3(x,slope,maximum,emt50):

Parameters

slope:

the slope of the curve

maximum:

the maximum value of the curve

emt50:

the EmT50, the timepoint at which 50% of nymphs has developed to the stage of `int`erest

Model

y(x) = maximum/(1+np.exp(slope*(np.log(x)-np.log(emt50))))

plot_survival_over_time_in_fitted_model

Fits a 3 parameter log-normal curve to the number of living nymphs over time. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to asses the goodness of fit for the fitted models for each group (genotype, treatment, etc.). Optimaly, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Usage

plot_survival_over_time_in_fitted_model(

self, grouping_variable=’genotype’, sample_id=’sample_id’, time=’day’, x_axis_label=’days after infection’, y_axis_label=’number of nymphs per plant’, stage_of_`int`erest=’first_instar’, use_relative_data=False, make_nymphs_relative_to=’eggs’, predict_for_n_days=0)

Parameters

grouping_variable: string, default=’genotype’

The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments

sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers.

time: string, default=’day’

The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

x_axis_label: string, default=’days after infection’

Label for the x-axis

y_axis_label: string, default=’development to 4th instar stage (relative to 1st instars)’

Label for the y-axis

stage_of_`int`erest: string, default=’first_instar’

The name of the column that contains the data of the developmental stage of `int`erest.

use_relative_data: bool, default=False

If True, the counts for the stage of `int`erest are devided by the stage indicated at ‘make_nymphs_relative_to’. The returned relative rate is used for plotting and curve fitting.

make_nymphs_relative_to: string, default=’eggs’

The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.

predict_for_n_days: int, default=o

Continue model for n days after final count.

Examples

Example of an input dataframe

sample_id

genotype

day

eggs

first_instar

second instar

third_instar

fourth_instar

mm_1

mm

5

45

15

7

0

0

mm_1

mm

9

NA

24

14

6

3

mm_1

mm

11

NA

38

27

16

12

hazard

A three parameter log-normal function.

Usage

hazard(x,auc,median,shape)

Parameters

auc:

area under the curve

median:

median time point

shape:

shape of the curve

Model

y(x) = (auc*(shape/median)*pow(x/median,shape-1))/(1+pow(x/median,shape))


OmicsAnalysis

A class to streamline the filtering and exploration of a metabolome dataset.

Parameters

metabolome_csv: str

A path to a .csv file with the metabolome data (scaled or unscaled). Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples

metabolome_feature_id_col: str, optional

The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).

Attributes

metabolome: pandas.core.frame.DataFrame, (n_samples, n_features)

The metabolome Pandas dataframe imported from the .csv file. metabolome_validated: bool Is the metabolome dataset validated? Default is False.

blank_features_filtered: bool

Are the features present in blank samples filtered out from the metabolome data? Default by False.

filtered_by_percentile_value: bool

Are the features filtered by percentile value?

unreliable_features_filtered: bool

Are the features not reliably present within one group filtered out from the metabolome data?

pca_performed: bool

Has PCA been performed on the metabolome data? Default is False.

exp_variance: pandas.core.frame.DataFrame, (n_pc, 1)

A Pandas dataframe with explained variance per Principal Component. The index of the df contains the PC index (PC1, PC2, etc.). The second column contains the percentage of the explained variance per PC.

metabolome_pca_reduced: numpy.ndarray, (n_samples, n_pc)

Numpy array with sample coordinates in reduced dimensions. The dimension of the numpy array is the minimum of the number of samples and features.

sparsity: float

Metabolome matrix sparsity.

Methods

validate_input_metabolome_df

Check if the provided metabolome file is suitable. Turns attribute metabolome_validated to True.

discard_features_detected_in_blanks

Removes features only detected in blank samples.

impute_missing_values_with_median

Impute missing values with the median value of the feature.

filter_out_unreliable_features

Filter out features not reliably detectable in replicates of the same grouping factor. For instance, if a feature is detected less than 4 times within 4 biological replicates, it is discarded with argument nb_times_detected=4.

filter_features_per_group_by_percentile

Filter out features whose abundance within the same grouping factor is lower than a certain percentile value. For instance, features lower than the 90th percentile within a single group are discarded with argument percentile=90.

compute_metabolome_sparsity

Computes the sparsity percentage of the metabolome matrix (percentage of 0 values e.g. 100% for an matrix full of 0 values)

write_clean_metabolome_to_csv

Write the filtered and analysis-ready metabolome data to a .csv file.

Examples

Example of an input metabolome input format (from a csv file)

feature_id

blank_1

blank_2

blank_3

blank_4

MM_1

MM_2

MM_3

MM_4

LA1330_1

LA1330_2

LA1330_3

LA1330_4

rt-0.04_mz-241.88396

280

694

502

604

554

678

674

936

824

940

794

828

rt-0.05_mz-143.95911

1036

1566

1326

1490

1364

1340

1692

1948

1928

1956

1730

1568

rt-0.06_mz-124.96631

1308

992

1060

1010

742

990

0

888

786

668

762

974

rt-0.08_mz-553.45905

11340

12260

10962

11864

10972

11190

12172

11820

12026

11604

11122

11260

rt-0.08_mz-413.26631

984

1162

1292

1104

1090

1106

1290

1170

1282

924

1172

1062

validate_input_metabolome_df

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Will place the ‘feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute.

Usage

validate_input_metabolome_df(

self, metabolome_feature_id_col=’feature_id’)

Parameters

metabolome_feature_id: str, optional

The name of the column that contains the feature identifiers (default is ‘feature_id’). Feature identifiers should be unique (=not duplicated).

Returns

self: object

Object with attribute metabolome_validated set to True if tests are passed.

Examples

Example of a valid input metabolome dataframe

feature_id

genotypeA_rep1

genotypeA_rep2

genotypeA_rep3

genotypeA_rep4

metabolite1

1246

1245

12345

12458

metabolite2

0

0

0

0

metabolite3

10

0

0

154

discard_features_detected_in_blanks

Removes features present in blanks. Steps:

  1. Sum the abundance of each feature in the blank samples.

  2. Makes a list of features to be discarded (features with a positive summed abundance).

  3. Returns a filtered Pandas dataframe with only features not detected in blank samples

Usage

discard_features_detected_in_blanks(

self, blank_sample_contains=’blank’)

Parameters

blank_sample_contains: str, optional.

Column names with this name will be considered blank samples. Default is=’blank’

Returns

metabolome: pandas.core.frame.DataFrame

A filtered Pandas dataframe without features detected in blank samples and with the blank samples removed.

create_density_plot

For each grouping variable (e.g. genotype), creates a histogram and density plot of all feature peak areas. This plot helps to see whether some groups have a value di`str`ibution different from the rest. The percentage is indicated on the y-axis (bar heights sum to 100).

Usage

create_density_plot(

self, name_grouping_var=”genotype”, n_cols=3, nbins=1000)

Parameters

name_grouping_var: str, optional

The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

n_cols: int, optional

The number of columns for the final plot.

nbins: int, optional

The number of bins to create.

Returns

matplotlib Axes

Returns the Axes object with the density plots drawn onto it.

filter_features_per_group_by_percentile

Filter metabolome dataframe based on a selected percentile threshold. Features with a peak area values lower than the selected percentile will be discarded. The percentile value is calculated per grouping variable.

For instance, selecting the 50th percentile (median) will discard 50% of the features with a peak area lower than the median/50th percentile in each group.

Usage

filter_features_per_group_by_percentile( self, name_grouping_var=”genotype”, separator_replicates=”_”, percentile=50)

Parameters

name_grouping_var: str, optional

The name of the grouping variable (default is “genotype”)

separator_replicates: str, optional

The character used to separate the main grouping variable from biological replicates. Default is “_: (underscore)

percentile: float, optional

The percentile threshold. Has to be comprised 0 and 100.

Returns

self: object

The object with the .metabolome attribute filtered and the filtered_by_percentile_value set to True.

See also

Use create_density_plot() method to decide on a suitable percentile value.

filter_out_unreliable_features

Removes features not reliably detectable in multiple biological replicates from the same grouping factor.

Takes a dataframe with feature identifiers in index and samples as columns. Steps:

  1. First melt and split the sample names to generate the grouping variable

  2. Count number of times a metabolite is detected in the groups. If number of times detected in a group = number of biological replicates then it is considered as reliable. Each feature receives a tag ‘reliable’ or ‘not_reliable’

  3. Discard the ‘not_reliable’ features and keep the filtered dataframe.

Usage

filter_out_unreliable_features(

self, name_grouping_var=”genotype”, nb_times_detected=4, separator_replicates=’_’)

Parameters

name_grouping_var: str, optional

The name used when splitting between replicate and main factor. For example “genotype” when splitting MM_rep1 into ‘MM’ and ‘rep1’. Default is ‘genotype’.

nb_times_detected: int, optionaldefault=4

Number of times a metabolite should be detected to be considered ‘reliable’. Should be equal to the number of biological replicates for a given group of `int`erest (e.g. genotype)

separator_replicates: string, default=”_”

The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1)

Returns

metabolome: ndarray

A Pandas dataframe with only features considered as reliable, sample names and their values.

Examples

Example of an input dataframe

Example of an output df (rt-0.06_mz-124.96631 is kicked out because 3x0 and 1x888 in MM groups)

write_clean_metabolome_to_csv

Writes the cleaned metabolome data to the disk as a comma-separated value file.

Usage

write_clean_metabolome_to_csv(self, path_of_cleaned_csv=”./data_for_manuals/filtered_metabolome.csv”):

Parameters

path_of_cleaned_csv: str, optional

The path and filename of the .csv file to save. Default to “./data_for_manuals/filtered_metabolome.csv”

compute_pca_on_metabolites

Performs a Principal Component Analysis (PCA) on the metabolome data.

The PCA analysis will return transformed coordinates of the samples in a new space. It will also give the percentage of variance explained by each Principal Component. Assumes that number of samples < number of features/metabolites Performs a transpose of the metabolite dataframe if n_samples > n_features (this can be turned off with auto_transpose)

Usage

compute_pca_on_metabolites(

self, scale=True, n_principal_components=10, auto_transpose=True)

Parameters

scale: bool, optional

Perform scaling (standardize) the metabolite values to zero mean and unit variance. Default is True.

n_principal_components: int, optional

number of principal components to keep in the PCA analysis. if number of PCs > min(n_samples, n_features) then set to the minimum of (n_samples, n_features) Default is to calculate 10 components.

auto_transpose: bool, optional.

If n_samples > n_features, performs a transpose of the feature matrix. Default is True (meaning that transposing will occur if n_samples > n_features).

Returns

self: object

Object with .exp_variance: dataframe with explained variance per Principal Component .metabolome_pca_reduced: dataframe with samples in reduced dimensions .pca_performed: `bool`ean set to True

impute_missing_values_with_median

Imputes missing values with the median of the column. This is necessary for PCA to work.

Usage

impute_missing_values_with_median(

self, missing_value_str=’np.nan’)

Parameters

missing_value_str: str, optional

The string that represents missing values in the input dataframe. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

Returns

self: object with attribute ‘metabolome’ updated with imputed values.

create_scree_plot

Returns a barplot with the explained variance per Principal Component. Has to be preceded by perform_pca()

Usage

create_scree_plot(

self, plot_file_name=None)

Parameters

plot_file_name: string, default=’None’

Path to a file where the plot will be saved. For instance ‘my_scree_plot.pdf’

Returns

matplotlib Axes

Returns the Axes object with the scree plot drawn onto it. Optionally a saved image of the plot.

create_sample_score_plot

Returns a sample score plot of the samples on PCx vs PCy. Samples are colored based on the grouping variable (e.g. genotype)

Usage

create_sample_score_plot(

self, pc_x_axis=1, pc_y_axis=2, name_grouping_var=’genotype’, separator_replicates=”_”, show_color_legend=True, plot_file_name=None)

Parameters

pc_x_axis: int, optional

Principal Component to plot on the x-axis (default is 1 so PC1 will be plotted).

pc_y_axis: int, optional.

Principal Component to plot on the y-axis (default is 2 so PC2 will be plotted).

name_grouping_var: str, optional

Name of the variable used to color samples (Default is “genotype”).

separator_replicates: str, optional.

String separator that separates grouping factor from biological replicates (default is underscore “_”).

show_color_legend: bool, optional.

Add legend for hue (default is True).

plot_file_name: str, optional

A file name and its path to save the sample score plot (default is None). For instance “mydir/sample_score_plot.pdf” Path is relative to current working directory.

Returns

matplotlib Axes

Returns the Axes object with the sample score plot drawn onto it. Samples are colored by specified grouping variable. Optionally a saved image of the plot.

compute_metabolome_sparsity

Determine the sparsity of the metabolome matrix. Formula: number of non zero values/number of values * 100 The higher the sparsity, the more zero values

Usage

compute_metabolome_sparsity(self)

Returns

self: object

Object with sparsity attribute filled (sparsity is a float).

https://stackoverflow.com/questions/38708621/how-to-calculate-percentage-of-sparsity-for-a-numpy-array-matrix

plot_features_in_upset_plot

Visuallises the presence of features per group in an UpSet plot. A feature is considered present in a group if the median>0.

Usage

plot_features_in_upset_plot(

self, seperator_replicates=”_”, plot_file_name=None)

Parameters

separator_replicates: string, default=”_”

The separator to split sample names into a grouping variable (e.g. genotype) and the biological replicate number (e.g. 1)

plot_file_name: str, optional

A file name and its path to save the sample score plot (default is None). For instance “mydir/feature_upset_plot.pdf” Path is relative to current working directory.

Returns

Plot:

UpSet plot with features presence per group.

Examples

Example of an input dataframe


FeatureSelection

A class to perform metabolite feature selection using phenotyping and metabolic data.

  • Perform sanity checks on input dataframes (values above 0, etc.).

  • Get a baseline performance of a simple Machine Learning Random Forest (“baseline”).

  • Perform automated Machine Learning model selection using autosklearn.

    Using metabolite data, train a model to predict phenotypes. Yields performance metrics (balanced accuracy, precision, recall) on the selected model.

  • Extracts performance metrics from the best ML model.

  • Extracts the best metabolite features based on their feature importance and make plots per sample group.

Parameters

metabolome_csv: string

A path to a .csv file with the cleaned up metabolome data (unreliable features filtered out etc.) Use the MetabolomeAnalysis class methods. Shape of the dataframe is usually (n_samples, n_features) with n_features >> n_samples

phenotype_csv: string

A path to a .csv file with the phenotyping data. Should be two columns at least with:

  • column 1 containing the sample identifiers

  • column 2 containing the phenotypic class e.g. ‘resistant’ or ‘sensitive’

metabolome_feature_id_col: string, default=’feature_id’

The name of the column that contains the feature identifiers. Feature identifiers should be unique (=not duplicated).

phenotype_sample_id: string, default=’sample_id’

The name of the column that contains the sample identifiers. Sample identifiers should be unique (=not duplicated).

Attributes

metabolome_validated: bool

Is the metabolome file valid for Machine Learning? (default is False)

phenotype_validated: bool

Is the phenotype file valid for Machine Learning? (default is False)

baseline_performance: float

The baseline performance computed with get_baseline_performance() i.e. using a simple Random Forest model. Search for the best ML model using search_best_model() should perform better than this baseline performance.

best_ensemble_models_searched: bool

Is the search for best ensemble model using auto-sklearn already performed? (default is False)

metabolome: pandas.core.frame.DataFrame

The validated metabolome dataframe of shape (n_features, n_samples).

phenotype: pandas.core.frame.DataFrame

A validated phenotype dataframe of shape (n_samples, 1) Sample names in the index and one column named ‘phenotype’ with the sample classes.

baseline_performance: str

Average balanced accuracy score (-/+ standard deviation) of the basic Random Forest model.

best_model: sklearn.pipeline.Pipeline

A scikit-learn pipeline that contains one or more steps. It is the best performing pipeline found by TPOT automated ML search.

pc_importances: pandas.core.frame.DataFrame

A Pandas dataframe that contains Principal Components importances using scikit-learn permutation_importance() Mean of PC importance over n_repeats. Standard deviation over n_repeats. Raw permutation importance scores.

feature_loadings: pandas.core.frame.DataFrame

A Pandas dataframe that contains feature loadings related to Principal Components

Methods

validate_input_metabolome_df()

Validates the dataframe read from the ‘metabolome_csv’ input file.

validate_input_phenotype_df()

Validates the phenotype dataframe read from the ‘phenotype_csv’ input file.

get_baseline_performance()

Fits a basic Random Forest model to get default performance metrics.

search_best_model_with_tpot_and_get_feature_importances()

Search for the best ML pipeline using TPOT genetic programming method. Computes and output performance metrics from the best pipeline. Extracts feature importances using scikit-learn permutation_importance() method.

Examples

Example of an input metabolome .csv file

feature_id

genotypeA_rep1

genotypeA_rep2

genotypeA_rep3

genotypeA_rep4

metabolite1

1246

1245

12345

12458

metabolite2

0

0

0

0

metabolite3

10

0

0

154

Example of an input phenotype .csv file

sample_id

phenotype

genotypeA_rep1

sensitive

genotypeA_rep2

sensitive

genotypeA_rep3

sensitive

genotypeA_rep4

sensitive

genotypeB_rep1

resistant

genotypeB_rep2

resistant

validate_input_metabolome_df

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Will place the ‘feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘validated_metabolome’ attribute

Usage

validate_input_metabolome_df(self)

Returns

self: object

Object with metabolome_validated set to True

Examples

Example of a validated output metabolome dataframe

feature_id

genotypeA_rep1

genotypeA_rep2

genotypeA_rep3

genotypeA_rep4

metabolite1

1246

1245

12345

12458

metabolite2

0

0

0

0

metabolite3

10

0

0

154

validate_input_phenotype_df

Validates the dataframe containing the phenotype classes and the sample identifiers.

Usage

validate_input_phenotype_df(

self, phenotype_class_col=”phenotype”)

Parameters

phenotype_class_col: string, default=”phenotype”

The name of the column to be used

Returns

self: object

Object with phenotype_validated set to True

Examples

Example of a validated phenotype dataframe

sample_id

phenotype

genotypeA_rep1

sensitive

genotypeA_rep2

sensitive

genotypeA_rep3

sensitive

genotypeA_rep4

sensitive

genotypeB_rep1

resistant

genotypeB_rep2

resistant

get_baseline_performance

Takes the phenotype and metabolome dataset and compute a simple Random Forest analysis with default hyperparameters. This will give a base performance for a Machine Learning model that has then to be optimised using autosklearn

k-fold cross-validation is performed to mitigate split effects on small datasets.

get_baseline_performance(

self, kfold=5, train_size=0.8, random_state=123, scoring_metric=’balanced_accuracy’)

Parameters

kfold: int, optional

Cross-validation `str`ategy. Default is to use a 5-fold cross-validation.

train_size: float or int, optional

If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).

random_state: int, optional

Controls both the randomness of the train/test split samples used when building trees (if boot`str`ap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.

scoring_metric: str, optional

A valid scoring value (default=”balanced_accuracy”) To get a complete list, type: >> from sklearn.metrics import SCORERS >> sorted(SCORERS.keys()) balanced accuracy is the average of recall obtained on each class.

Returns

self: object

Object with baseline_performance attribute.

search_best_model_with_tpot_and_compute_pc_importances

Search for the best ML model with TPOT genetic programming methodology and extracts best Principal Components.

A characteristic of metabolomic data is to have a high number of features strongly correlated to each other. This makes it difficult to extract the individual true feature importance. Here, this method implements a dimensionality reduction method (PCA) and the importances of each PC is computed.

A resampling strategy called “cross-validation” will be performed on a subset of the data (training data) to increase the model generalisation performance. Finally, the model performance is tested on the unseen test data subset.

By default, TPOT will make use of a set of preprocessors (e.g. Normalizer, PCA) and algorithms (e.g. RandomForestClassifier) defined in the default config (classifier.py). See: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py

Usage

search_best_model_with_tpot_and_compute_pc_importances(

self, class_of_interest, scoring_metric=’balanced_accuracy’, kfolds=3, train_size=0.8, max_time_mins=5, max_eval_time_mins=1, random_state=123, n_permutations=10, export_best_pipeline=True, path_for_saving_pipeline=”./best_fitting_pipeline.py”)

Parameters

class_of_interest: str

The name of the class of interest also called “positive class”. This class will be used to calculate recall_score and precision_score. Recall score = TP / (TP + FN) with TP: true positives and FN: false negatives. Precision score = TP / (TP + FP) with TP: true positives and FP: false positives.

scoring_metric: str, optional

Function used to evaluate the quality of a given pipeline for the classification problem. Default is ‘balanced accuracy’. The following built-in scoring functions can be used: ‘accuracy’, ‘adjusted_rand_score’, ‘average_precision’, ‘balanced_accuracy’, ‘f1’, ‘f1_macro’, ‘f1_micro’, ‘f1_samples’, ‘f1_weighted’, ‘neg_log_loss’, ‘precision’ etc. (suffixes apply as with ‘f1’), ‘recall’ etc. (suffixes apply as with ‘f1’), ‘jaccard’ etc. (suffixes apply as with ‘f1’), ‘roc_auc’, ‘roc_auc_ovr’, ‘roc_auc_ovo’, ‘roc_auc_ovr_weighted’, ‘roc_auc_ovo_weighted’

kfolds: int, optional

Number of folds for the `str`atified K-Folds cross-validation `str`ategy. Default is 3-fold cross-validation. Has to be comprised between 3 and 10 i.e. 3 <= kfolds =< 10 See https://scikit-learn.org/stable/modules/cross_validation.html

train_size: float or int, optional

If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).

max_time_mins: int, optional

How many minutes TPOT has to optimize the pipeline (in total). Default is 5 minutes. This setting will allow TPOT to run until max_time_mins minutes elapsed and then stop. Try short time `int`ervals (5, 10, 15min) and then see if the model score on the test data improves.

max_eval_time_mins: float, optional

How many minutes TPOT has to evaluate a single pipeline. Default is 1min. This time has to be smaller than the ‘max_time_mins’ setting.

random_state: int, optional

Controls both the randomness of the train/test split samples used when building trees (if boot`str`ap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.

n_permutations: int, optional

Number of permutations used to compute feature importances from the best model using scikit-learn permutation_importance() method. Default is 10 permutations.

export_best_pipeline: bool, optional

If True, the best fitting pipeline is exported as .py file. This allows for reuse of the pipeline on new datasets. Default is True.

path_for_saving_pipeline: str, optional

The path and filename of the best fitting pipeline to save. The name must have a ‘.py’ extension. Default to “./best_fitting_pipeline.py”

Returns

self: object

The object with best model searched and feature importances computed.

Note

Principal Component importances are calculated on the training set. Permutation importances can be computed either on the training set or on a held-out testing or validation set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit. https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance

get_names_of_top_n_features_from_selected_pc

Get the names of features with highest loading scores on selected PC

Takes the matrix of loading scores of shape (n_samples, n_features) and the metabolome dataframe of shape (n_features, n_samples) and extract the names of features. The loadings matrix is available after running the search_best_model_with_tpot_and_compute_pc_importances() method.

Usage

get_names_of_top_n_features_from_selected_pc(

self, selected_pc=1, top_n=5)

Parameters

selected_pc: int, optional

Principal Component to keep. 1-based index (1 selects PC1, 2 selected PC2, etc.) Default is 1.

top_n: int, optional

Number of features to select. The top_n features with the highest absolute loadings will be selected from the selected_pc PC. For instance, the top 5 features from PC1 will be selected with selected_pc=1 and top_n=5. Default is 5.

Returns

A list of feature names.