PhenoFeatureFinder.feature_selection_using_ml

Attributes

tpot_custom_config

Classes

FeatureSelection

A class to perform metabolite feature selection using phenotyping and metabolic data.

Module Contents

PhenoFeatureFinder.feature_selection_using_ml.tpot_custom_config

class PhenoFeatureFinder.feature_selection_using_ml.FeatureSelection(metabolome_csv, phenotype_csv, metabolome_feature_id_col='feature_id', phenotype_sample_id='sample_id')

A class to perform metabolite feature selection using phenotyping and metabolic data.

Perform sanity checks on input dataframes (values above 0, etc.).
Get a baseline performance of a simple Machine Learning Random Forest (“baseline”).
Perform automated Machine Learning model selection using TPOT.
Using metabolite data, train a model to predict phenotypes. Yields performance metrics (balanced accuracy, precision, recall) on the selected model.
Extract performance metrics from the best ML model.
Extract the best metabolite features based on their feature importance and make plots per sample group.

Parameters:

metabolome_csv (string) – A path to a .csv file with the cleaned-up metabolome data (unreliable features filtered out, etc.), for instance as produced by the MetabolomeAnalysis class methods. The file has one row per feature and one column per sample, as in the example under Notes, with n_features >> n_samples.
phenotype_csv (string) –
A path to a .csv file with the phenotyping data. Should contain at least two columns:
- column 1 containing the sample identifiers
- column 2 containing the phenotypic class e.g. ‘resistant’ or ‘sensitive’
metabolome_feature_id_col (string, default='feature_id') – The name of the column that contains the feature identifiers. Feature identifiers should be unique (=not duplicated).
phenotype_sample_id (string, default='sample_id') – The name of the column that contains the sample identifiers. Sample identifiers should be unique (=not duplicated).

metabolome_validated

Is the metabolome file valid for Machine Learning? (default is False)

Type: bool

phenotype_validated

Is the phenotype file valid for Machine Learning? (default is False)

Type: bool

baseline_performance

The baseline performance computed with get_baseline_performance(), i.e. using a simple Random Forest model. The best ML model found by search_best_model_with_tpot_and_compute_pc_importances() should perform better than this baseline.

Type: float

best_ensemble_models_searched

Has the search for the best ensemble model using TPOT already been performed? (default is False)

Type: bool

metabolome

The validated metabolome dataframe of shape (n_features, n_samples).

Type: pandas.core.frame.DataFrame

phenotype

A validated phenotype dataframe of shape (n_samples, 1), with sample names in the index and one column named ‘phenotype’ holding the sample classes.

Type: pandas.core.frame.DataFrame

baseline_performance

Average balanced accuracy score (-/+ standard deviation) of the basic Random Forest model.

Type: str

best_model

A scikit-learn pipeline that contains one or more steps. It is the best performing pipeline found by TPOT automated ML search.

Type: sklearn.pipeline.Pipeline

pc_importances

A Pandas dataframe that contains Principal Component importances computed with scikit-learn permutation_importance(): the mean PC importance over n_repeats, the standard deviation over n_repeats, and the raw permutation importance scores.

Type: pandas.core.frame.DataFrame

feature_loadings

A Pandas dataframe that contains feature loadings related to Principal Components.

Type: pandas.core.frame.DataFrame

validate_input_metabolome_df(): Validates the dataframe read from the ‘metabolome_csv’ input file.

validate_input_phenotype_df(): Validates the phenotype dataframe read from the ‘phenotype_csv’ input file.

get_baseline_performance(): Fits a basic Random Forest model to get default performance metrics.

search_best_model_with_tpot_and_compute_pc_importances(): Searches for the best ML pipeline using the TPOT genetic programming method. Computes and outputs performance metrics from the best pipeline. Extracts Principal Component importances using the scikit-learn permutation_importance() method.

Notes

Example of an input metabolome .csv file

| feature_id  | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
|-------------|----------------|----------------|----------------|----------------|
| metabolite1 | 1246           | 1245           | 12345          | 12458          |
| metabolite2 | 0              | 0              | 0              | 0              |
| metabolite3 | 10             | 0              | 0              | 154            |

Example of an input phenotype .csv file

| sample_id      | phenotype |
|----------------|-----------|
| genotypeA_rep1 | sensitive |
| genotypeA_rep2 | sensitive |
| genotypeA_rep3 | sensitive |
| genotypeA_rep4 | sensitive |
| genotypeB_rep1 | resistant |
| genotypeB_rep2 | resistant |
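The two layouts above can be reproduced with pandas for a quick trial. The following is a minimal sketch, not part of PhenoFeatureFinder: the variable names are illustrative, and the tables are serialised to in-memory CSV text rather than written to disk.

```python
import pandas as pd

# Metabolome table: one row per feature, one column per sample,
# plus the 'feature_id' column holding unique feature identifiers.
metabolome = pd.DataFrame(
    {
        "feature_id": ["metabolite1", "metabolite2", "metabolite3"],
        "genotypeA_rep1": [1246, 0, 10],
        "genotypeA_rep2": [1245, 0, 0],
        "genotypeA_rep3": [12345, 0, 0],
        "genotypeA_rep4": [12458, 0, 154],
    }
)

# Phenotype table: unique sample identifiers and their phenotypic class.
phenotype = pd.DataFrame(
    {
        "sample_id": [
            "genotypeA_rep1", "genotypeA_rep2", "genotypeA_rep3",
            "genotypeA_rep4", "genotypeB_rep1", "genotypeB_rep2",
        ],
        "phenotype": ["sensitive"] * 4 + ["resistant"] * 2,
    }
)

# CSV text without the pandas index, matching the documented layouts.
# Pass a file path to to_csv() instead to create real input files.
metabolome_csv_text = metabolome.to_csv(index=False)
phenotype_csv_text = phenotype.to_csv(index=False)
```

Writing these tables with `to_csv("some_path.csv", index=False)` would give files in the shape expected by the `metabolome_csv` and `phenotype_csv` parameters.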

metabolome_validated = False

phenotype_validated = False

baseline_performance = None

best_ensemble_models_searched = False

metabolome

phenotype

validate_input_metabolome_df()

Validates the dataframe containing the feature identifiers, metabolite values and sample names. Places the ‘metabolome_feature_id_col’ column as the index of the validated dataframe. The validated metabolome dataframe is stored as the ‘metabolome’ attribute.

Returns: self – Object with metabolome_validated set to True
Return type: object

Notes

Example of a validated output metabolome dataframe (feature_id is the index)

| feature_id  | genotypeA_rep1 | genotypeA_rep2 | genotypeA_rep3 | genotypeA_rep4 |
|-------------|----------------|----------------|----------------|----------------|
| metabolite1 | 1246           | 1245           | 12345          | 12458          |
| metabolite2 | 0              | 0              | 0              | 0              |
| metabolite3 | 10             | 0              | 0              | 154            |
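The kind of checks described for the metabolome table can be sketched with plain pandas. This is an illustration under stated assumptions, not the package's implementation: `sketch_validate_metabolome` is a hypothetical helper, and the specific checks (unique identifiers, numeric and non-negative values) are assumptions based on the class description.

```python
import pandas as pd


def sketch_validate_metabolome(df, feature_id_col="feature_id"):
    """Hypothetical helper (not part of PhenoFeatureFinder)."""
    if feature_id_col not in df.columns:
        raise ValueError(f"Column '{feature_id_col}' not found.")
    if df[feature_id_col].duplicated().any():
        raise ValueError("Feature identifiers must be unique.")

    # Move the feature identifiers to the index: rows are features,
    # columns are samples.
    validated = df.set_index(feature_id_col)

    # Assumed sanity checks: every sample column is numeric and non-negative.
    if not all(pd.api.types.is_numeric_dtype(t) for t in validated.dtypes):
        raise ValueError("All sample columns must be numeric.")
    if (validated < 0).any().any():
        raise ValueError("Metabolite values must not be negative.")
    return validated


raw = pd.DataFrame(
    {
        "feature_id": ["metabolite1", "metabolite2", "metabolite3"],
        "genotypeA_rep1": [1246, 0, 10],
        "genotypeA_rep2": [1245, 0, 0],
        "genotypeA_rep3": [12345, 0, 0],
        "genotypeA_rep4": [12458, 0, 154],
    }
)
validated = sketch_validate_metabolome(raw)
```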

validate_input_phenotype_df(phenotype_class_col='phenotype')

Validates the dataframe containing the phenotype classes and the sample identifiers

Parameters:

phenotype_class_col (string, default='phenotype') – The name of the column that contains the phenotypic classes.

Returns: self – Object with phenotype_validated set to True
Return type: object

Notes

Example of an input phenotype dataframe

| sample_id      | phenotype |
|----------------|-----------|
| genotypeA_rep1 | sensitive |
| genotypeA_rep2 | sensitive |
| genotypeA_rep3 | sensitive |
| genotypeA_rep4 | sensitive |
| genotypeB_rep1 | resistant |
| genotypeB_rep2 | resistant |

Example of a validated output phenotype dataframe (sample_id is the index)

| sample_id      | phenotype |
|----------------|-----------|
| genotypeA_rep1 | sensitive |
| genotypeA_rep2 | sensitive |
| genotypeA_rep3 | sensitive |
| genotypeA_rep4 | sensitive |
| genotypeB_rep1 | resistant |
| genotypeB_rep2 | resistant |
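The transformation from the input phenotype table to the validated one can be sketched as follows. `sketch_validate_phenotype` is a hypothetical helper written for illustration only; the checks it performs are assumptions, and the package may validate more or differently.

```python
import pandas as pd


def sketch_validate_phenotype(df, sample_id_col="sample_id",
                              phenotype_class_col="phenotype"):
    """Hypothetical helper (not part of PhenoFeatureFinder)."""
    for col in (sample_id_col, phenotype_class_col):
        if col not in df.columns:
            raise ValueError(f"Column '{col}' not found.")
    if df[sample_id_col].duplicated().any():
        raise ValueError("Sample identifiers must be unique.")

    # Keep only the class column, index it by sample identifier and
    # name it 'phenotype', giving a dataframe of shape (n_samples, 1).
    validated = df.set_index(sample_id_col)[[phenotype_class_col]]
    return validated.rename(columns={phenotype_class_col: "phenotype"})


raw = pd.DataFrame(
    {
        "sample_id": [
            "genotypeA_rep1", "genotypeA_rep2", "genotypeA_rep3",
            "genotypeA_rep4", "genotypeB_rep1", "genotypeB_rep2",
        ],
        "phenotype": ["sensitive"] * 4 + ["resistant"] * 2,
    }
)
validated = sketch_validate_phenotype(raw)
```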

Example

>>> fs = FeatureSelection(
...     metabolome_csv="clean_metabolome.csv",
...     phenotype_csv="phenotypes_test_data.csv",
...     phenotype_sample_id='sample')
>>> fs.validate_input_phenotype_df()

get_baseline_performance(kfold=5, train_size=0.8, random_state=123, scoring_metric='balanced_accuracy')

Takes the phenotype and metabolome datasets and fits a simple Random Forest model with default hyperparameters. This gives a baseline performance that the Machine Learning model optimised with TPOT should then improve upon.

k-fold cross-validation is performed to mitigate split effects on small datasets.

Parameters:

kfold (int, optional) – Number of folds used for cross-validation. Default is 5 (5-fold cross-validation).
train_size (float or int, optional) – If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).
random_state (int, optional) – Controls the randomness of the train/test split, of the bootstrapping of the samples used when building trees (if bootstrap=True) and of the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See the scikit-learn Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.

scoring_metric (str, optional) – A valid scoring value. Default is 'balanced_accuracy', the average of the recall obtained on each class. To get a complete list of valid values, type:
>>> from sklearn.metrics import SCORERS
>>> sorted(SCORERS.keys())
In scikit-learn 1.3 and later, SCORERS is no longer available; use sklearn.metrics.get_scorer_names() instead.

Returns: self – Object with baseline_performance attribute.
Return type: object

Example

>>> fs = FeatureSelection(
...     metabolome_csv="../tests/clean_metabolome.csv",
...     phenotype_csv="../tests/phenotypes_test_data.csv",
...     phenotype_sample_id='sample')
>>> fs.get_baseline_performance()
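For readers who want to see what such a baseline amounts to, here is a minimal scikit-learn sketch. It is not the package's code: the data are synthetic (`make_classification`), stratifying the split is an assumption, and the formatting of the final string only mimics the description of the baseline_performance attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a metabolome matrix with n_features >> n_samples.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=123
)

# Hold out 20% of the samples (train_size=0.8); stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123, stratify=y
)

# Random Forest with default hyperparameters, scored by 5-fold cross-validation.
clf = RandomForestClassifier(random_state=123)
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="balanced_accuracy")

# Average balanced accuracy -/+ standard deviation, as a string.
baseline = f"{scores.mean():.3f} -/+ {scores.std():.3f}"
print(baseline)
```

Any model found later by the automated search would be expected to score above this figure on the same metric.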

search_best_model_with_tpot_and_compute_pc_importances(class_of_interest, scoring_metric='balanced_accuracy', kfolds=3, train_size=0.8, max_time_mins=5, max_eval_time_mins=1, random_state=123, n_permutations=10, export_best_pipeline=True, path_for_saving_pipeline='./best_fitting_pipeline.py')

Searches for the best ML model with the TPOT genetic programming methodology and extracts the best Principal Components.

Metabolomic data typically contain a high number of features that are strongly correlated with each other, which makes it difficult to extract the true importance of individual features. This method therefore applies a dimensionality reduction method (PCA) and computes the importance of each PC.

A resampling strategy called “cross-validation” will be performed on a subset of the data (training data) to increase the model generalisation performance. Finally, the model performance is tested on the unseen test data subset.

By default, TPOT will make use of a set of preprocessors (e.g. Normalizer, PCA) and algorithms (e.g. RandomForestClassifier) defined in the default config (classifier.py). See: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py
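The overall evaluation scheme (PCA followed by a classifier, stratified cross-validation on the training subset, final scoring on the unseen test subset) can be sketched with scikit-learn alone. This is an illustration, not the method's implementation: a fixed pipeline stands in for the TPOT search, the data are synthetic, and the scaler and the number of components are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with many more features than samples.
X, y = make_classification(
    n_samples=80, n_features=300, n_informative=15, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=123
)

# Fixed pipeline used here in place of the pipeline TPOT would search for.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=123),
    RandomForestClassifier(random_state=123),
)

# Stratified 3-fold cross-validation on the training data only.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
cv_scores = cross_val_score(
    pipeline, X_train, y_train, cv=cv, scoring="balanced_accuracy"
)

# Refit on the full training subset, then score on the unseen test subset.
pipeline.fit(X_train, y_train)
test_score = balanced_accuracy_score(y_test, pipeline.predict(X_test))
print(cv_scores.mean(), test_score)
```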

Parameters:

class_of_interest (str) – The name of the class of interest also called “positive class”. This class will be used to calculate recall_score and precision_score. Recall score = TP / (TP + FN) with TP: true positives and FN: false negatives. Precision score = TP / (TP + FP) with TP: true positives and FP: false positives.
scoring_metric (str, optional) –
Function used to evaluate the quality of a given pipeline for the classification problem. Default is 'balanced_accuracy'. The following built-in scoring functions can be used:

'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'precision' etc. (suffixes apply as with 'f1'), 'recall' etc. (suffixes apply as with 'f1'), 'jaccard' etc. (suffixes apply as with 'f1'), 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted'
kfolds (int, optional) – Number of folds for the stratified K-Folds cross-validation strategy. Default is 3-fold cross-validation. Must be between 3 and 10, i.e. 3 <= kfolds <= 10. See https://scikit-learn.org/stable/modules/cross_validation.html
train_size (float or int, optional) – If float, should be between 0.5 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. Default is 0.8 (80% of the data used for training).
max_time_mins (int, optional) – How many minutes TPOT has to optimize the pipeline (in total). Default is 5 minutes. This setting will allow TPOT to run until max_time_mins minutes elapsed and then stop. Try short time intervals (5, 10, 15min) and then see if the model score on the test data improves.
max_eval_time_mins (float, optional) – How many minutes TPOT has to evaluate a single pipeline. Default is 1 minute. This time has to be smaller than the ‘max_time_mins’ setting.
random_state (int, optional) – Controls the randomness of the train/test split, of the bootstrapping of the samples used when building trees (if bootstrap=True) and of the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See the scikit-learn Glossary for details. You can change this value several times to see how it affects the best ensemble model performance. Default is 123.
n_permutations (int, optional) – Number of permutations used to compute feature importances from the best model using scikit-learn permutation_importance() method. Default is 10 permutations.
export_best_pipeline (bool, optional) – If True, the best fitting pipeline is exported as .py file. This allows for reuse of the pipeline on new datasets. Default is True.
path_for_saving_pipeline (str, optional) – The path and filename of the best fitting pipeline to save. The name must have a ‘.py’ extension. Defaults to “./best_fitting_pipeline.py”.
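As a worked example of how class_of_interest enters the recall and precision formulas given above, the sketch below uses scikit-learn's `pos_label` argument on a small hand-made set of labels. The labels are invented for illustration; whether the method calls these functions in exactly this way is an assumption.

```python
from sklearn.metrics import precision_score, recall_score

class_of_interest = "resistant"  # the "positive class"

y_true = ["resistant", "resistant", "resistant",
          "sensitive", "sensitive", "sensitive"]
y_pred = ["resistant", "resistant", "sensitive",
          "resistant", "sensitive", "sensitive"]

# For the 'resistant' class: TP = 2, FN = 1, FP = 1.
recall = recall_score(y_true, y_pred, pos_label=class_of_interest)
precision = precision_score(y_true, y_pred, pos_label=class_of_interest)

print(recall)     # TP / (TP + FN) = 2 / 3
print(precision)  # TP / (TP + FP) = 2 / 3
```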

Returns:

self – The object with best model searched and feature importances computed.

Return type:

object

Notes

Principal Component importances are calculated on the training set. Permutation importances can be computed either on the training set or on a held-out testing or validation set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit. See https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance
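The difference between training-set and held-out importances can be explored with a short scikit-learn sketch. It is illustrative only: the data are synthetic, the PCA and classifier are fixed rather than found by TPOT, and `importance_table` is a hypothetical helper whose column names are not those of the pc_importances attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=80, n_features=300, n_informative=15, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=123
)

# Project the features onto 5 Principal Components, fitted on training data only.
pca = PCA(n_components=5, random_state=123).fit(X_train)
pcs_train = pca.transform(X_train)
pcs_test = pca.transform(X_test)

clf = RandomForestClassifier(random_state=123).fit(pcs_train, y_train)


def importance_table(pcs, labels, n_permutations=10):
    """Hypothetical helper: mean and std of the permutation importance per PC."""
    result = permutation_importance(
        clf, pcs, labels,
        scoring="balanced_accuracy",
        n_repeats=n_permutations,
        random_state=123,
    )
    names = [f"PC{i + 1}" for i in range(pcs.shape[1])]
    return pd.DataFrame(
        {"mean": result.importances_mean, "std": result.importances_std},
        index=names,
    )


train_importances = importance_table(pcs_train, y_train)
test_importances = importance_table(pcs_test, y_test)
print(train_importances.join(test_importances, lsuffix="_train", rsuffix="_test"))
```

A PC that ranks high in the training table but near zero in the held-out table is a candidate for the overfitting behaviour described above.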

get_names_of_top_n_features_from_selected_pc(selected_pc=1, top_n=5)

Get the names of features with highest loading scores on selected PC

Takes the matrix of loading scores of shape (n_samples, n_features) and the metabolome dataframe of shape (n_features, n_samples) and extracts the names of features. The loadings matrix is available after running the search_best_model_with_tpot_and_compute_pc_importances() method.

Parameters:

selected_pc (int, optional) – Principal Component to keep. 1-based index (1 selects PC1, 2 selects PC2, etc.). Default is 1.
top_n (int, optional) – Number of features to select. The top_n features with the highest absolute loadings will be selected from the selected_pc PC. For instance, the top 5 features from PC1 will be selected with selected_pc=1 and top_n=5. Default is 5.

Returns: A list of feature names.
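The selection rule (largest absolute loadings on one PC, addressed with a 1-based index) can be sketched with pandas. `sketch_top_n_features` is a hypothetical helper, and the layout of the loadings table used here (one row per PC, one column per feature) is an assumption made for the example.

```python
import pandas as pd


def sketch_top_n_features(loadings, selected_pc=1, top_n=5):
    """Hypothetical helper (not part of PhenoFeatureFinder).

    loadings: DataFrame with one row per PC and one column per feature.
    """
    if not 1 <= selected_pc <= loadings.shape[0]:
        raise ValueError("selected_pc is out of range.")
    # Convert the 1-based PC number to a 0-based row position.
    pc_loadings = loadings.iloc[selected_pc - 1]
    # Rank features by absolute loading, largest first.
    ranked = pc_loadings.abs().sort_values(ascending=False)
    return ranked.head(top_n).index.tolist()


loadings = pd.DataFrame(
    [[0.10, -0.80, 0.30, 0.50],
     [0.70, 0.05, -0.60, 0.20]],
    index=["PC1", "PC2"],
    columns=["metabolite1", "metabolite2", "metabolite3", "metabolite4"],
)

print(sketch_top_n_features(loadings, selected_pc=1, top_n=2))
# ['metabolite2', 'metabolite4']
print(sketch_top_n_features(loadings, selected_pc=2, top_n=2))
# ['metabolite1', 'metabolite3']
```

Ranking on the absolute value means that strongly negative loadings, such as metabolite2 on PC1 here, are selected alongside strongly positive ones.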