PhenoFeatureFinder.utils
Functions
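| median_of_ratios_normalisation(_data) | Normalize a dataframe with the median of ratios method. |
| calculate_percentile(df[, my_percentile]) | Compute the q-th percentile of data. |
| compute_metrics_classification(y_predictions, y_trues, positive_class) | Compute a series of metrics for classification tasks. |
| plot_confusion_matrix(y_predictions, y_trues) | Plot confusion matrix. |
| extract_samples_to_condition(df[, name_grouping_var, separator_replicates]) | A utility function to extract the grouping factor (e.g. 'genotype') from sample names. |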
Module Contents
- PhenoFeatureFinder.utils.median_of_ratios_normalisation(_data: pandas.DataFrame) → pandas.DataFrame
Normalize a dataframe with the median of ratios method from DESeq2.
Input data (as a pandas dataframe), e.g.:
        sample1   sample2   sample3
gene1   0.00000   10.0000   4.00000
gene2   2.00000   6.00000   12.0000
gene3   33.5000   55.0000   200.000
Normalized output:
        sample1   sample2   sample3
gene1   0.00000   10.6444   1.57882
gene2   4.76032   6.38664   4.73646
gene3   78.5453   58.5442   78.9410
References
StatQuest: https://www.youtube.com/watch?v=UFB993xufUU
HBC Harvard: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
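The median of ratios idea described in the references above can be sketched in a few lines of pandas. This is an illustrative approximation, not the package's own implementation: the function name `median_of_ratios_sketch` is hypothetical, and excluding rows that contain a zero from the size-factor calculation is an assumption taken from the usual DESeq2 description. Its output is not guaranteed to reproduce the example values shown above.

```python
import numpy as np
import pandas as pd


def median_of_ratios_sketch(counts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of median of ratios normalisation (rows = genes, columns = samples)."""
    # Assumption: genes with a zero in any sample are left out of the size-factor
    # calculation, because their geometric mean would be zero.
    positive = counts[(counts > 0).all(axis=1)]
    log_counts = np.log(positive)
    # Log of the per-gene geometric mean across samples.
    log_geo_means = log_counts.mean(axis=1)
    # Log ratio of each sample to the per-gene reference.
    log_ratios = log_counts.sub(log_geo_means, axis=0)
    # Size factor per sample = median of its ratios.
    size_factors = np.exp(log_ratios.median(axis=0))
    # Divide every column (sample) by its size factor.
    return counts.div(size_factors, axis=1)


counts = pd.DataFrame(
    {"sample1": [10, 20, 30], "sample2": [20, 40, 60]},
    index=["gene1", "gene2", "gene3"],
)
normalized = median_of_ratios_sketch(counts)
```

In this toy input, sample2 is exactly twice sample1, so after normalisation the two columns should coincide.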
- PhenoFeatureFinder.utils.calculate_percentile(df, my_percentile=50)
Compute the q-th percentile of data. Returns the q-th percentile of the array elements.
- Parameters:
my_percentile (float, optional) – Percentile which must be between 0 and 100.
See also
numpy.percentile: https://numpy.org/doc/stable/reference/generated/numpy.percentile.html
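For orientation, the referenced `numpy.percentile` call behaves as below. How `calculate_percentile` applies it to `df` (per column, per row, or over all values) is not stated here, so this sketch only shows the NumPy function on a plain array.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# The 50th percentile is the median of the array.
median = np.percentile(values, 50)

# Other percentiles use linear interpolation by default.
p25 = np.percentile(values, 25)
```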
- PhenoFeatureFinder.utils.compute_metrics_classification(y_predictions, y_trues, positive_class)
Compute a series of metrics for classification tasks
Util function designed to work downstream of the search for the best model. Will compute the following metrics:
balanced accuracy
precision
recall
f1 score
- Parameters:
y_predictions (list) – List of class predictions.
y_trues (list) – List of the true values (from the test set)
positive_class (str) – The name of the positive class for calculation of true positives, true negatives, etc.
- Returns:
model_metrics_df – Dataframe with the balanced accuracy, precision, recall and f1 score calculated.
- Return type:
pandas.core.frame.DataFrame
See also
https://scikit-learn.org/stable/modules/model_evaluation.html
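The four metrics listed above are all available in scikit-learn. The following is a minimal sketch of how they could be collected into a one-row dataframe; the function name `metrics_sketch`, the column names, and the class labels are hypothetical, and the package's own function may organise its output differently.

```python
import pandas as pd
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def metrics_sketch(y_predictions, y_trues, positive_class):
    """Hypothetical helper: collect four classification metrics in a dataframe."""
    return pd.DataFrame(
        {
            "balanced_accuracy": [balanced_accuracy_score(y_trues, y_predictions)],
            # pos_label tells scikit-learn which class counts as "positive".
            "precision": [precision_score(y_trues, y_predictions, pos_label=positive_class)],
            "recall": [recall_score(y_trues, y_predictions, pos_label=positive_class)],
            "f1": [f1_score(y_trues, y_predictions, pos_label=positive_class)],
        }
    )


# Illustrative labels only.
y_trues = ["resistant", "resistant", "susceptible", "susceptible"]
y_predictions = ["resistant", "susceptible", "susceptible", "susceptible"]
metrics_df = metrics_sketch(y_predictions, y_trues, positive_class="resistant")
```

Note that scikit-learn expects the true labels first and the predictions second, which is the reverse of the argument order in the signature documented above.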
- PhenoFeatureFinder.utils.plot_confusion_matrix(y_predictions, y_trues)
Plot confusion matrix
- Parameters:
y_predictions (list) – List of class predictions.
y_trues (list) – List of the true values (from the test set)
positive_class (str) – The name of the positive class for calculation of true positives, true negatives, etc.
- Returns:
model_metrics_df – Dataframe with the balanced accuracy, precision, recall and f1 score calculated.
- Return type:
pandas.core.frame.DataFrame
See also
https://scikit-learn.org/stable/modules/model_evaluation.html
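As a rough illustration of what a confusion matrix plot involves, the sketch below uses scikit-learn's `confusion_matrix` and `ConfusionMatrixDisplay`. Whether `plot_confusion_matrix` is built on these calls is an assumption; the helper name `show_confusion_matrix_sketch` and the class labels are hypothetical.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Illustrative labels only.
y_trues = ["resistant", "resistant", "susceptible", "susceptible"]
y_predictions = ["resistant", "susceptible", "susceptible", "susceptible"]
labels = ["resistant", "susceptible"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_trues, y_predictions, labels=labels)


def show_confusion_matrix_sketch(matrix, display_labels):
    """Hypothetical helper: draw the matrix with matplotlib (must be installed)."""
    display = ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=display_labels)
    display.plot()
    return display
```

Calling `show_confusion_matrix_sketch(cm, labels)` draws the matrix on a new matplotlib figure.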
- PhenoFeatureFinder.utils.extract_samples_to_condition(df, name_grouping_var='genotype', separator_replicates='_')
A utility function to extract the grouping factor (e.g. ‘genotype’) from sample names.
Melts the dataframe (wide to long) and splits the grouping variable from the biological replicates using the specified separator.
- Parameters:
df (pandas.core.frame.DataFrame)
name_grouping_var (str, optional) – Name of the variable used as grouping variable (default is ‘genotype’).
separator_replicates (str, optional) – The separator between the grouping variable and the biological replicates (default is underscore ‘_’).
- Return type:
A dataframe with the correspondence between samples and experimental condition (grouping variable).
Notes
Input dataframe
| feature_id  | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
|-------------|----------------|----------------|----------------|----------------|
| metabolite1 | 1246           | 1245           | 12345          | 12458          |
| metabolite2 | 0              | 0              | 0              | 0              |
| metabolite3 | 10             | 0              | 0              | 154            |
Output dataframe
| sample         | genotype  | replicate |
|----------------|-----------|-----------|
| genotypeA_rep1 | genotypeA | rep1      |
| genotypeA_rep2 | genotypeA | rep2      |
| genotypeA_rep3 | genotypeA | rep3      |
| genotypeA_rep4 | genotypeA | rep4      |
etc.
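The melt-and-split approach described above can be sketched as follows, using the input values from the Notes. This is an assumption-laden illustration rather than the package's code: the function name `samples_to_condition_sketch` is hypothetical, and it assumes the separator occurs once between the grouping variable and the replicate label.

```python
import pandas as pd


def samples_to_condition_sketch(df, name_grouping_var="genotype", separator_replicates="_"):
    """Hypothetical sketch: map sample names to grouping variable and replicate."""
    # Wide to long: one row per (feature, sample) pair.
    melted = df.reset_index().melt(id_vars=df.index.name, var_name="sample")
    # Keep each sample name once, in its original order.
    samples = melted[["sample"]].drop_duplicates().reset_index(drop=True)
    # Split on the first occurrence of the separator only.
    parts = samples["sample"].str.split(separator_replicates, n=1, expand=True)
    samples[name_grouping_var] = parts[0]
    samples["replicate"] = parts[1]
    return samples


df = pd.DataFrame(
    {
        "genotypeA_rep1": [1246, 0, 10],
        "genotypeA_rep2": [1245, 0, 0],
        "genotypeA_rep3": [12345, 0, 0],
        "genotypeA_rep4": [12458, 0, 154],
    },
    index=pd.Index(["metabolite1", "metabolite2", "metabolite3"], name="feature_id"),
)
sample_table = samples_to_condition_sketch(df)
```

Splitting with `n=1` keeps any further separators inside the replicate label, which is one possible design choice when sample names contain more than one underscore.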