PhenoFeatureFinder.phenotype_analysis

Classes

PhenotypeAnalysis

A class to analyse data from developmental bioassays and group the samples in distinct phenotypic classes.

Module Contents

class PhenoFeatureFinder.phenotype_analysis.PhenotypeAnalysis(bioassay_csv)

A class to analyse data from developmental bioassays and group the samples in distinct phenotypic classes.

Parameters:

bioassay_csv (string) – The only argument of the constructor method __init__. A path to a .csv file with the bioassay count data. Shape of the dataframe is usually …

metabolome_validated

Is the metabolome dataset validated?

Type: boolean, default=False

phenotype_validated=False

blank_features_filtered=False

unreliable_features_filtered=False

bioassay

bioassay_csv

reshape_to_wide(sample_id='sample_id', grouping_variable='genotype', developmental_stages='stage', count_values='number', time='day')

Reshapes the dataframe from a long to a wide format, with the counts of each developmental stage in a separate column, to make the data accessible for pre-processing.

Parameters:

sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers.
grouping_variable (string, default='genotype') – The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments
developmental_stages (string, default='stage') – The name of the column that contains the developmental stages that were scored during the bioassay.
count_values (string, default='number') – The name of the column that contains the counts.
time (string, default='day') – The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.

Examples

Example of an input dataframe

| sample_id | genotype | day | stage         | number |
|-----------|----------|-----|---------------|--------|
| mm_1      | mm       | 5   | eggs          | 45     |
| mm_1      | mm       | 5   | first_instar  | 0      |
| mm_1      | mm       | 5   | second_instar | 0      |

Example of a reshaped output dataframe

| sample_id | genotype | day | eggs | first_instar | second_instar | third_instar |
|-----------|----------|-----|------|--------------|---------------|--------------|
| mm_1      | mm       | 5   | 45   | 0            | 0             | 0            |
| mm_1      | mm       | 9   | NA   | 10           | 5             | 0            |
| mm_1      | mm       | 11  | NA   | 15           | 17            | 4            |
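The long-to-wide reshaping described above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation; the helper name `reshape_long_to_wide` and the record layout are assumptions made for illustration, using the default column names of reshape_to_wide.

```python
from collections import defaultdict

# Long-format records: one row per sample, timepoint and developmental stage.
long_rows = [
    {"sample_id": "mm_1", "genotype": "mm", "day": 5, "stage": "eggs", "number": 45},
    {"sample_id": "mm_1", "genotype": "mm", "day": 5, "stage": "first_instar", "number": 0},
    {"sample_id": "mm_1", "genotype": "mm", "day": 5, "stage": "second_instar", "number": 0},
]


def reshape_long_to_wide(rows, index=("sample_id", "genotype", "day"),
                         stage_key="stage", count_key="number"):
    """Hypothetical helper: one output row per index combination,
    with the count of every stage stored under the stage name."""
    wide = defaultdict(dict)
    for row in rows:
        key = tuple(row[name] for name in index)
        wide[key][row[stage_key]] = row[count_key]
    # Flatten the grouped counts back into records, keeping first-seen order.
    return [dict(zip(index, key), **counts) for key, counts in wide.items()]


wide_rows = reshape_long_to_wide(long_rows)
print(wide_rows)
```

Each developmental stage becomes its own key (column), so the three long rows for sample mm_1 on day 5 collapse into a single wide row.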

combine_seperately_counted_versions_of_last_recorded_stage(exuviea='exuviea', late_last_stage='late_fourth_instar', early_last_stage='early_fourth_instar', new_last_stage='fourth_instar', seperate_exuviea=True, late_last_stage_removed=True, early_last_stage_kept=True, remove_individual_stage_columns=True)

Calculates the total number of nymphs that developed to the final developmental stage per sample at each timepoint. Use this method when nymphs in the (late) final nymph stage were removed after each counting moment and/or when exuviae and last instar stage nymphs were counted separately. Late last stage nymphs may, for example, be removed to prevent adults from emerging and escaping.

Parameters:

exuviea (string, default='exuviea') – The name of the column that contains the exuviae counts.
late_last_stage (string, default='late_fourth_instar') – The name of the column that contains the counts of the last developmental stage recorded in the bioassay.
early_last_stage (string, default='early_fourth_instar') – The name of the column that contains the counts of the nymphs in the early last developmental stage. Used when nymphs counted in late_last_stage were removed after each counting moment during the bioassay.
new_last_stage (string, default='fourth_instar') – The name of the new column with the returned total final stage data.
seperate_exuviea (boolean, default=True) – If True, sums exuviea and late_last_stage per sample per timepoint. Set to True if exuviae were counted separately from late_last_stage. Set to False if the exuviae count was included in late_last_stage.
late_last_stage_removed (boolean, default=True) – If True, returns the cumulative number of late_last_stage (plus exuviea) per sample over time. Set to True if nymphs counted in late_last_stage (and exuviae, if counted separately) were removed after each counting moment. Set to False if they were left on the sample until the end of the bioassay.
early_last_stage_kept (boolean, default=True) – If True, sums the early and late last stage counts per sample per timepoint. Set to True if late last stage nymphs were removed after each counting moment but early last stage nymphs were left on the sample. Set to False if early and late last stage nymphs were not counted separately.
remove_individual_stage_columns (boolean, default=True) – If True, removes the exuviea, late_last_stage and early_last_stage columns from the dataframe after returning the new_last_stage column.

Example of an input dataframe

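| sample_id | genotype | day | eggs | …   | third_instar | exuviea | early_fourth_instar | late_fourth_instar |
|-----------|----------|-----|------|-----|--------------|---------|---------------------|--------------------|
| mm_1      | mm       | 5   | 45   | …   | 0            | 0       | 0                   | 0                  |
| mm_1      | mm       | 9   | NA   | …   | 0            | 1       | 5                   | 0                  |
| mm_1      | mm       | 11  | NA   | …   | 4            | 0       | 7                   | 4                  |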

Example of an output dataframe

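| sample_id | genotype | day | eggs | first_instar | second_instar | third_instar | fourth_instar |
|-----------|----------|-----|------|--------------|---------------|--------------|---------------|
| mm_1      | mm       | 5   | 45   | 0            | 0             | 0            | 0             |
| mm_1      | mm       | 9   | NA   | 10           | 5             | 0            | 6             |
| mm_1      | mm       | 11  | NA   | 15           | 17            | 4            | 12            |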
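The arithmetic behind the three boolean switches can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's code: the function `combine_last_stage` and its argument names are hypothetical, and the counts are the exuviae, early and late fourth instar values for sample mm_1 on days 5, 9 and 11 as read from the example input.

```python
from itertools import accumulate


def combine_last_stage(exuviae, late, early,
                       separate_exuviae=True,
                       late_removed=True,
                       early_kept=True):
    """Hypothetical helper: total final-stage count per timepoint for one sample.

    All three count arguments are lists ordered by time.
    """
    # Step 1: add separately counted exuviae to the late last stage counts.
    if separate_exuviae:
        late_total = [late_n + exuviae_n for late_n, exuviae_n in zip(late, exuviae)]
    else:
        late_total = list(late)
    # Step 2: removed nymphs are never counted again, so accumulate over time.
    if late_removed:
        late_total = list(accumulate(late_total))
    # Step 3: add the early last stage nymphs that stayed on the sample.
    if early_kept:
        return [late_n + early_n for late_n, early_n in zip(late_total, early)]
    return late_total


exuviae = [0, 1, 0]
early_fourth = [0, 5, 7]
late_fourth = [0, 0, 4]

fourth_instar = combine_last_stage(exuviae, late_fourth, early_fourth)
print(fourth_instar)  # [0, 6, 12]
```

With all switches at their defaults, the sketch reproduces the fourth_instar values 0, 6 and 12 of the example output: on day 11 the cumulative late count plus exuviae is 1 + 4 = 5, and adding the 7 early fourth instars gives 12.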

correct_cumulative_counts(current_stage, grouping_variable)

Inner function for convert_counts_to_cumulative(). If nymphs die during the bioassay, they should still be included in the cumulative count for the stages they had passed; otherwise the cumulative count could go down over time. This function corrects the cumulative count wherever it is lower than the previous count.
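One simple way to express this kind of correction is a running maximum over the time-ordered counts of one sample. The sketch below assumes that reading of the description; the function name `correct_cumulative` and the example counts are made up for illustration, and the package itself may handle missing values or grouping differently.

```python
from itertools import accumulate


def correct_cumulative(counts):
    """Hypothetical helper: replace each value that is lower than an
    earlier value by the highest value seen so far."""
    return list(accumulate(counts, max))


# Made-up cumulative counts of one stage for one sample, ordered by time.
observed = [0, 6, 12, 10, 11, 14]
corrected = correct_cumulative(observed)
print(corrected)  # [0, 6, 12, 12, 12, 14]
```

The dip from 12 to 10 (nymphs that died after reaching the stage) is flattened, so the corrected series never decreases.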

create_df_with_max_counts_per_stage(egg_column, last_stage, grouping_variable)

Inner function for convert_counts_to_cumulative(). Creates a dataframe with the maximum number of nymphs that developed to or past each developmental stage per plant, which makes plotting easier.

convert_counts_to_cumulative(n_developmental_stages=4, sample_id='sample_id', eggs='eggs', first_stage='first_instar', second_stage='second_instar', third_stage='third_instar', fourth_stage='fourth_instar', fifth_stage='fifth_instar', sixth_stage='sixth_instar')

Calculates the total number of nymphs that developed to or past each stage at each timepoint. Cumulative counts make it easier to analyse development over time and to compare the number of nymphs past a given stage. If nymphs in the (late) final nymph stage were removed after each counting moment, and/or if exuviae or early and late last instar stage nymphs were counted separately, combine_seperately_counted_versions_of_last_recorded_stage() should be used first.

Parameters:

n_developmental_stages (integer, default=4) – The number of developmental stages that were recorded separately. Can range from 2 to 6.
sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers.
eggs (string, default='eggs') – The name of the column that contains the counts of the eggs.
first_stage (string, default='first_instar') – The name of the column that contains the counts of the first developmental stage recorded in the bioassay.
second_stage (string, default='second_instar') – The name of the column that contains the counts of the second developmental stage recorded in the bioassay.
third_stage (string, default='third_instar') – The name of the column that contains the counts of the third developmental stage recorded in the bioassay.
fourth_stage (string, default='fourth_instar') – The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.
fifth_stage (string, default='fifth_instar') – The name of the column that contains the counts of the fifth developmental stage recorded in the bioassay.
sixth_stage (string, default='sixth_instar') – The name of the column that contains the counts of the sixth developmental stage recorded in the bioassay.
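The idea of "developed to or past each stage" amounts to a cumulative sum that runs from the last stage back to the first. The following sketch shows that calculation for a single sample at a single timepoint. The helper `to_cumulative`, the stage list and the counts are hypothetical, and eggs are left out because the source does not state how the method treats them.

```python
from itertools import accumulate

STAGES = ["first_instar", "second_instar", "third_instar", "fourth_instar"]


def to_cumulative(row, stages=STAGES):
    """Hypothetical helper: for each stage, count the nymphs that are
    in that stage or in any later stage."""
    counts = [row[stage] for stage in stages]
    # Sum from the last stage backwards, then restore the original order.
    totals = list(accumulate(reversed(counts)))[::-1]
    return dict(zip(stages, totals))


# Made-up counts for one sample at one timepoint.
row = {"first_instar": 10, "second_instar": 5, "third_instar": 0, "fourth_instar": 6}
cumulative = to_cumulative(row)
print(cumulative)
# {'first_instar': 21, 'second_instar': 11, 'third_instar': 6, 'fourth_instar': 6}
```

All 21 nymphs have reached at least the first instar, 11 have reached at least the second, and 6 have reached the third or fourth.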

prepare_for_plotting(order_of_groups)

Prepare the order in which the groups should be plotted.

Parameters:

order_of_groups (list of strings) – The group names in the preferred order for plotting, for example ['MM', 'LA', 'PI'].

plot_counts_per_stage(grouping_variable='genotype', sample_id='sample_id', eggs='eggs', first_stage='first_instar', second_stage='second_instar', third_stage='third_instar', fourth_stage='fourth_instar', absolute_x_axis_label='genotype', absolute_y_axis_label='counts (absolute)', relative_x_axis_label='genotype', relative_y_axis_label='relative number of nymphs', make_nymphs_relative_to='first_instar')

Plots the counts per nymphal stage in boxplots. The nymph counts are given in two ways: as the absolute number of nymphs that developed to or past each stage at the last timepoint, and as the fraction of nymphs that developed to or past each stage at the last timepoint relative to another developmental stage. The reference stage defaults to the first instar stage, because this represents the number of hatched eggs. In that case only the success of the development is compared between groups (e.g. genotypes or treatments), and the hatching rate of the eggs is not taken into account.

The input dataframe 'max_counts' is created with convert_counts_to_cumulative.

Parameters:

grouping_variable (string, default='genotype') – The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments
sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers.
eggs (string, default='eggs') – The name of the column that contains the counts of the eggs.
first_stage (string, default='first_instar') – The name of the column that contains the counts of the first developmental stage recorded in the bioassay.
second_stage (string, default='second_instar') – The name of the column that contains the counts of the second developmental stage recorded in the bioassay.
third_stage (string, default='third_instar') – The name of the column that contains the counts of the third developmental stage recorded in the bioassay.
fourth_stage (string, default='fourth_instar') – The name of the column that contains the counts of the fourth developmental stage recorded in the bioassay.
absolute_x_axis_label (string, default='genotype') – Label for the x-axis of the boxplots with count data.
absolute_y_axis_label (string, default='counts (absolute)') – Label for the y-axis of the boxplots with count data.
relative_x_axis_label (string, default='genotype') – Label for the x-axis of the boxplots with relative development.
relative_y_axis_label (string, default='relative number of nymphs') – Label for the y-axis of the boxplots with relative development.
make_nymphs_relative_to (string, default='first_instar') – The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.

Examples

Example of an input dataframe

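| sample_id | genotype | day | eggs | first_instar | second_instar | third_instar | fourth_instar |
|-----------|----------|-----|------|--------------|---------------|--------------|---------------|
| mm_1      | mm       | 28  | 45   | 34           | 30            | 30           | 29            |
| mm_2      | mm       | 28  | 50   | 39           | 33            | 28           | 26            |
| LA_1      | LA       | 28  | 42   | 30           | 25            | 17           | 4             |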
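To make the relative counts concrete, the sketch below divides the maximum counts of one sample by the count of a reference stage. It only illustrates the calculation described for make_nymphs_relative_to; the helper `make_relative` is hypothetical, and the row uses the counts of sample mm_1 from the example input.

```python
def make_relative(row, stages, relative_to="first_instar"):
    """Hypothetical helper: express each stage count as a fraction of
    the count in the reference stage."""
    reference = row[relative_to]
    if reference == 0:
        # No nymphs in the reference stage: the fraction is undefined.
        return {stage: float("nan") for stage in stages}
    return {stage: row[stage] / reference for stage in stages}


stages = ["first_instar", "second_instar", "third_instar", "fourth_instar"]
row = {"eggs": 45, "first_instar": 34, "second_instar": 30,
       "third_instar": 30, "fourth_instar": 29}

relative = make_relative(row, stages)
for stage in stages:
    print(f"{stage}: {relative[stage]:.2f}")
```

Relative to the 34 hatched first instars, 29 of 34 nymphs (about 0.85) reached the fourth instar. Choosing 'eggs' as the reference would fold the hatching rate into the fraction as well.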

plot_development_over_time_in_fitted_model(grouping_variable='genotype', sample_id='sample_id', time='day', x_axis_label='days after infection', y_axis_label='development to 4th instar stage (relative to 1st instars)', stage_of_interest='fourth_instar', use_relative_data=True, make_nymphs_relative_to='first_instar', predict_for_n_days=0)

Fits a 3-parameter log-logistic curve to the development over time to a specified stage. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to assess the goodness of fit of the fitted models for each group (genotype, treatment, etc.). Optimally, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Parameters:

grouping_variable (string, default='genotype') – The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments
sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers.
time (string, default='day') – The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.
x_axis_label (string, default='days after infection') – Label for the x-axis
y_axis_label (string, default='development to 4th instar stage (relative to 1st instars)') – Label for the y-axis
stage_of_interest (string, default='fourth_instar') – The name of the column that contains the data of the developmental stage of interest.
use_relative_data (boolean, default=True) – If True, the counts for the stage of interest are divided by the counts of the stage indicated at 'make_nymphs_relative_to'. The returned relative rate is used for plotting and curve fitting.
make_nymphs_relative_to (string, default='first_instar') – The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.
predict_for_n_days (integer, default=0) – Continue model for n days after final count.

Examples

Example of an input dataframe

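| sample_id | genotype | day | eggs | first_instar | second_instar | third_instar | fourth_instar |
|-----------|----------|-----|------|--------------|---------------|--------------|---------------|
| mm_1      | mm       | 5   | 45   | 15           | 7             | 0            | 0             |
| mm_1      | mm       | 9   | NA   | 24           | 14            | 6            | 3             |
| mm_1      | mm       | 11  | NA   | 38           | 27            | 16           | 12            |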
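For readers unfamiliar with the curve shape, here is a minimal sketch of one common 3-parameter log-logistic function. The parameterisation (an upper limit, a slope and the time at which half of the upper limit is reached) and the names `log_logistic`, `upper`, `slope` and `t50` are assumptions; the source does not state which form the method fits, and the parameter values below are made up.

```python
def log_logistic(t, upper, slope, t50):
    """One common 3-parameter log-logistic curve (assumed form).

    upper: asymptote approached for large t
    slope: steepness of the increase (positive for an increasing curve)
    t50:   time at which the curve reaches half of `upper`
    """
    if t <= 0:
        return 0.0
    return upper / (1.0 + (t / t50) ** (-slope))


# Made-up parameters: 80% of first instars reach the stage, half of those by day 20.
upper, slope, t50 = 0.8, 5.0, 20.0

for day in (5, 10, 20, 30, 40):
    print(day, round(log_logistic(day, upper, slope, t50), 3))

half_way = log_logistic(t50, upper, slope, t50)  # 0.4, half of the upper limit
```

Evaluating the same function for days beyond the last observation is what a setting such as predict_for_n_days would extend the curve over.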

plot_survival_over_time_in_fitted_model(grouping_variable='genotype', sample_id='sample_id', time='day', x_axis_label='days after infection', y_axis_label='number of nymphs per plant', stage_of_interest='first_instar', use_relative_data=False, make_nymphs_relative_to='eggs', predict_for_n_days=0)

Fits a 3-parameter log-normal curve to the number of living nymphs over time. The fitted curve and the observed datapoints are plotted and returned with the model parameters. The reduced Chi-squared is provided to assess the goodness of fit of the fitted models for each group (genotype, treatment, etc.). Optimally, the reduced Chi-squared should approach the number of observation points per sample. A much larger reduced Chi-squared indicates a bad fit. A much smaller reduced Chi-squared indicates overfitting of the model.

Parameters:

grouping_variable (string, default='genotype') – The name of the column that contains the names of the grouping variables. Examples are genotypes or treatments
sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers.
time (string, default='day') – The name of the column that contains the time at which bioassay scoring was performed. Examples are the date or the number of days after infection.
x_axis_label (string, default='days after infection') – Label for the x-axis
y_axis_label (string, default='number of nymphs per plant') – Label for the y-axis
stage_of_interest (string, default='first_instar') – The name of the column that contains the data of the developmental stage of interest.
use_relative_data (boolean, default=False) – If True, the counts for the stage of interest are divided by the counts of the stage indicated at 'make_nymphs_relative_to'. The returned relative rate is used for plotting and curve fitting.
make_nymphs_relative_to (string, default='eggs') – The name of the column that contains the counts of the developmental stage which should be used to calculate the relative development to all developmental stages.
predict_for_n_days (integer, default=0) – Continue model for n days after final count.

Examples

Example of an input dataframe

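| sample_id | genotype | day | eggs | first_instar | second_instar | third_instar | fourth_instar |
|-----------|----------|-----|------|--------------|---------------|--------------|---------------|
| mm_1      | mm       | 5   | 45   | 15           | 7             | 0            | 0             |
| mm_1      | mm       | 9   | NA   | 24           | 14            | 6            | 3             |
| mm_1      | mm       | 11  | NA   | 38           | 27            | 16           | 12            |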
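Both fitting methods report a reduced Chi-squared. As a reference point, the sketch below computes the textbook definition: the sum of squared (optionally weighted) residuals divided by the degrees of freedom. Whether the package weights the residuals or scales the statistic differently is not stated in the source, so the function `reduced_chi_squared`, its `sigma` argument and the data are assumptions for illustration only.

```python
def reduced_chi_squared(observed, predicted, n_parameters, sigma=1.0):
    """Textbook reduced Chi-squared (assumed definition):
    sum(((observed - predicted) / sigma) ** 2) / (N - n_parameters)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    degrees_of_freedom = len(observed) - n_parameters
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than fitted parameters")
    chi_squared = sum(((obs - pred) / sigma) ** 2
                      for obs, pred in zip(observed, predicted))
    return chi_squared / degrees_of_freedom


# Made-up observations and model predictions for a 3-parameter curve.
observed = [0.0, 0.1, 0.4, 0.7, 0.8]
predicted = [0.0, 0.2, 0.4, 0.6, 0.8]

value = reduced_chi_squared(observed, predicted, n_parameters=3)
print(round(value, 4))
```

With five observations and three fitted parameters there are two degrees of freedom, so the two residuals of 0.1 give a value of about 0.01.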