PhenoFeatureFinder.feature_selection_using_ml
=============================================

.. py:module:: PhenoFeatureFinder.feature_selection_using_ml


Attributes
----------

.. autoapisummary::

   PhenoFeatureFinder.feature_selection_using_ml.tpot_custom_config


Classes
-------

.. autoapisummary::

   PhenoFeatureFinder.feature_selection_using_ml.FeatureSelection


Module Contents
---------------

.. py:data:: tpot_custom_config

.. py:class:: FeatureSelection(metabolome_csv, phenotype_csv, metabolome_feature_id_col='feature_id', phenotype_sample_id='sample_id')

   A class to perform metabolite feature selection using phenotyping and metabolic data.

   - Performs sanity checks on the input dataframes (values above 0, etc.).
   - Gets the baseline performance of a simple Machine Learning Random Forest model ("baseline").
   - Performs automated Machine Learning model selection using TPOT.
     Using metabolite data, trains a model to predict phenotypes.
     Yields performance metrics (balanced accuracy, precision, recall) for the selected model.
   - Extracts performance metrics from the best ML model.
   - Extracts the best metabolite features based on their feature importance and makes plots per sample group.

   :param metabolome_csv: A path to a .csv file with the cleaned-up metabolome data (unreliable features filtered out, etc.).
                          Use the MetabolomeAnalysis class methods.
                          The shape of the dataframe is usually (n_features, n_samples) with n_features >> n_samples.
   :type metabolome_csv: string
   :param phenotype_csv: A path to a .csv file with the phenotyping data.
                         Should contain at least two columns:

                         - column 1 containing the sample identifiers
                         - column 2 containing the phenotypic class, e.g. 'resistant' or 'sensitive'
   :type phenotype_csv: string
   :param metabolome_feature_id_col: The name of the column that contains the feature identifiers.
                                     Feature identifiers should be unique (not duplicated).
   :type metabolome_feature_id_col: string, default='feature_id'
   :param phenotype_sample_id: The name of the column that contains the sample identifiers.
                               Sample identifiers should be unique (not duplicated).
   :type phenotype_sample_id: string, default='sample_id'

   .. attribute:: metabolome_validated

      Is the metabolome file valid for Machine Learning? (default is False)

      :type: bool

   .. attribute:: phenotype_validated

      Is the phenotype file valid for Machine Learning? (default is False)

      :type: bool

   .. attribute:: baseline_performance

      The baseline performance computed with get_baseline_performance(), i.e. using a simple Random Forest model.
      The best ML model found by search_best_model_with_tpot_and_compute_pc_importances() should perform better than this baseline.

      :type: float

   .. attribute:: best_ensemble_models_searched

      Has the search for the best ensemble model with TPOT already been performed? (default is False)

      :type: bool

   .. attribute:: metabolome

      The validated metabolome dataframe of shape (n_features, n_samples).

      :type: pandas.core.frame.DataFrame

   .. attribute:: phenotype

      A validated phenotype dataframe of shape (n_samples, 1).
      Sample names are in the index and one column named 'phenotype' holds the sample classes.

      :type: pandas.core.frame.DataFrame

   .. attribute:: baseline_performance

      Average balanced accuracy score (+/- standard deviation) of the basic Random Forest model.

      :type: str

   .. attribute:: best_model

      A scikit-learn pipeline that contains one or more steps.
      It is the best performing pipeline found by the TPOT automated ML search.

      :type: sklearn.pipeline.Pipeline

   .. attribute:: pc_importances

      A pandas dataframe that contains the Principal Component importances computed with scikit-learn permutation_importance():
      the mean of PC importance over n_repeats, the standard deviation over n_repeats, and the raw permutation importance scores.

      :type: pandas.core.frame.DataFrame

   .. attribute:: feature_loadings

      A pandas dataframe that contains the feature loadings related to the Principal Components.

      :type: pandas.core.frame.DataFrame

   .. method:: validate_input_metabolome_df()

      Validates the dataframe read from the 'metabolome_csv' input file.

   .. method:: validate_input_phenotype_df()

      Validates the phenotype dataframe read from the 'phenotype_csv' input file.

   .. method:: get_baseline_performance()

      Fits a basic Random Forest model to get default performance metrics.

   .. method:: search_best_model_with_tpot_and_compute_pc_importances()

      Searches for the best ML pipeline using the TPOT genetic programming method.
      Computes and outputs performance metrics from the best pipeline.
      Extracts feature importances using the scikit-learn permutation_importance() method.

   .. rubric:: Notes

   Example of an input metabolome .csv file

   ===========  ==============  ==============  ==============  ==============
   feature_id   genotypeA_rep1  genotypeA_rep2  genotypeA_rep3  genotypeA_rep4
   ===========  ==============  ==============  ==============  ==============
   metabolite1  1246            1245            12345           12458
   metabolite2  0               0               0               0
   metabolite3  10              0               0               154
   ===========  ==============  ==============  ==============  ==============

   Example of an input phenotype .csv file

   ==============  =========
   sample_id       phenotype
   ==============  =========
   genotypeA_rep1  sensitive
   genotypeA_rep2  sensitive
   genotypeA_rep3  sensitive
   genotypeA_rep4  sensitive
   genotypeB_rep1  resistant
   genotypeB_rep2  resistant
   ==============  =========

   .. py:attribute:: metabolome_validated
      :value: False

   .. py:attribute:: phenotype_validated
      :value: False

   .. py:attribute:: baseline_performance
      :value: None

   .. py:attribute:: best_ensemble_models_searched
      :value: False

   .. py:attribute:: metabolome

   .. py:attribute:: phenotype

   .. py:method:: validate_input_metabolome_df()

      Validates the dataframe containing the feature identifiers, metabolite values and sample names.
      Places the 'metabolome_feature_id_col' column as the index of the validated dataframe.
      The validated metabolome dataframe is stored as the 'metabolome' attribute.

      :returns: **self** -- Object with metabolome_validated set to True
      :rtype: object

      .. rubric:: Notes

      Example of a validated output metabolome dataframe (feature_id is the index)

      ===========  ==============  ==============  ==============  ==============
      feature_id   genotypeA_rep1  genotypeA_rep2  genotypeA_rep3  genotypeA_rep4
      ===========  ==============  ==============  ==============  ==============
      metabolite1  1246            1245            12345           12458
      metabolite2  0               0               0               0
      metabolite3  10              0               0               154
      ===========  ==============  ==============  ==============  ==============

   .. py:method:: validate_input_phenotype_df(phenotype_class_col='phenotype')

      Validates the dataframe containing the phenotype classes and the sample identifiers.

      :param phenotype_class_col: The name of the column to be used.
      :type phenotype_class_col: string, default="phenotype"

      :returns: **self** -- Object with phenotype_validated set to True
      :rtype: object

      .. rubric:: Notes

      Example of an input phenotype dataframe

      ==============  =========
      sample_id       phenotype
      ==============  =========
      genotypeA_rep1  sensitive
      genotypeA_rep2  sensitive
      genotypeA_rep3  sensitive
      genotypeA_rep4  sensitive
      genotypeB_rep1  resistant
      genotypeB_rep2  resistant
      ==============  =========

      Example of a validated output phenotype dataframe (sample_id is the index)

      ==============  =========
      sample_id       phenotype
      ==============  =========
      genotypeA_rep1  sensitive
      genotypeA_rep2  sensitive
      genotypeA_rep3  sensitive
      genotypeA_rep4  sensitive
      genotypeB_rep1  resistant
      genotypeB_rep2  resistant
      ==============  =========

      .. rubric:: Example

      >>> fs = FeatureSelection(
      ...     metabolome_csv="clean_metabolome.csv",
      ...     phenotype_csv="phenotypes_test_data.csv",
      ...     phenotype_sample_id='sample')
      >>> fs.validate_input_phenotype_df()

   .. py:method:: get_baseline_performance(kfold=5, train_size=0.8, random_state=123, scoring_metric='balanced_accuracy')

      Takes the phenotype and metabolome datasets and computes a simple Random Forest analysis with default hyperparameters.
      This gives a baseline performance for a Machine Learning model that then has to be optimised using TPOT.
      K-fold cross-validation is performed to mitigate split effects on small datasets.

      :param kfold: Cross-validation strategy. Default is to use a 5-fold cross-validation.
      :type kfold: int, optional
      :param train_size: If float, should be between 0.5 and 1.0 and represents the proportion of the dataset to include in the train split.
                         If int, represents the absolute number of train samples.
                         If None, the value is automatically set to the complement of the test size.
                         Default is 0.8 (80% of the data used for training).
      :type train_size: float or int, optional
      :param random_state: Controls the randomness of the train/test split, of the bootstrapping of the samples used when building trees (if bootstrap=True),
                           and of the sampling of the features to consider when looking for the best split at each node (if max_features < n_features).
                           See the scikit-learn Glossary for details.
                           You can change this value several times to see how it affects the best ensemble model performance.
                           Default is 123.
      :type random_state: int, optional
      :param scoring_metric: A valid scoring value (default="balanced_accuracy").
                             Balanced accuracy is the average of recall obtained on each class.
                             To get a complete list, type:

                             >>> from sklearn.metrics import SCORERS
                             >>> sorted(SCORERS.keys())
      :type scoring_metric: str, optional

      :returns: **self** -- Object with baseline_performance attribute.
      :rtype: object

      .. rubric:: Example

      >>> fs = FeatureSelection(
      ...     metabolome_csv="../tests/clean_metabolome.csv",
      ...     phenotype_csv="../tests/phenotypes_test_data.csv",
      ...     phenotype_sample_id='sample')
      >>> fs.get_baseline_performance()

   .. py:method:: search_best_model_with_tpot_and_compute_pc_importances(class_of_interest, scoring_metric='balanced_accuracy', kfolds=3, train_size=0.8, max_time_mins=5, max_eval_time_mins=1, random_state=123, n_permutations=10, export_best_pipeline=True, path_for_saving_pipeline='./best_fitting_pipeline.py')

      Searches for the best ML model with the TPOT genetic programming methodology and extracts the best Principal Components.

      A characteristic of metabolomic data is a high number of features that are strongly correlated with each other.
      This makes it difficult to extract the true importance of individual features.
      This method therefore applies a dimensionality reduction method (PCA) and computes the importance of each PC.

      A resampling strategy called "cross-validation" is performed on a subset of the data (training data) to increase the model generalisation performance.
      Finally, the model performance is tested on the unseen test data subset.

      By default, TPOT will make use of a set of preprocessors (e.g. Normalizer, PCA) and algorithms (e.g. RandomForestClassifier) defined in the default config (classifier.py).
      See: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py

      :param class_of_interest: The name of the class of interest, also called the "positive class".
                                This class will be used to calculate recall_score and precision_score.
                                Recall score = TP / (TP + FN) with TP: true positives and FN: false negatives.
                                Precision score = TP / (TP + FP) with TP: true positives and FP: false positives.
      :type class_of_interest: str
      :param scoring_metric: Function used to evaluate the quality of a given pipeline for the classification problem.
                             Default is 'balanced_accuracy'.
                             The following built-in scoring functions can be used:
                             'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',
                             'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss',
                             'precision' etc. (suffixes apply as with 'f1'),
                             'recall' etc. (suffixes apply as with 'f1'),
                             'jaccard' etc. (suffixes apply as with 'f1'),
                             'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted'
      :type scoring_metric: str, optional
      :param kfolds: Number of folds for the stratified K-Folds cross-validation strategy.
                     Default is 3-fold cross-validation.
                     Must be between 3 and 10, i.e. 3 <= kfolds <= 10.
                     See https://scikit-learn.org/stable/modules/cross_validation.html
      :type kfolds: int, optional
      :param train_size: If float, should be between 0.5 and 1.0 and represents the proportion of the dataset to include in the train split.
                         If int, represents the absolute number of train samples.
                         If None, the value is automatically set to the complement of the test size.
                         Default is 0.8 (80% of the data used for training).
      :type train_size: float or int, optional
      :param max_time_mins: How many minutes TPOT has to optimize the pipeline (in total). Default is 5 minutes.
                            TPOT runs until max_time_mins minutes have elapsed and then stops.
                            Try short time intervals (5, 10, 15 min) and then see if the model score on the test data improves.
      :type max_time_mins: int, optional
      :param max_eval_time_mins: How many minutes TPOT has to evaluate a single pipeline. Default is 1 minute.
                                 This time has to be smaller than the 'max_time_mins' setting.
      :type max_eval_time_mins: float, optional
      :param random_state: Controls the randomness of the train/test split, of the bootstrapping of the samples used when building trees (if bootstrap=True),
                           and of the sampling of the features to consider when looking for the best split at each node (if max_features < n_features).
                           See the scikit-learn Glossary for details.
                           You can change this value several times to see how it affects the best ensemble model performance.
                           Default is 123.
      :type random_state: int, optional
      :param n_permutations: Number of permutations used to compute feature importances from the best model using the scikit-learn permutation_importance() method.
                             Default is 10 permutations.
      :type n_permutations: int, optional
      :param export_best_pipeline: If True, the best fitting pipeline is exported as a .py file.
                                   This allows for reuse of the pipeline on new datasets.
                                   Default is True.
      :type export_best_pipeline: `bool`, optional
      :param path_for_saving_pipeline: The path and filename of the best fitting pipeline to save.
                                       The name must have a '.py' extension.
                                       Defaults to "./best_fitting_pipeline.py".
      :type path_for_saving_pipeline: str, optional

      :returns: **self** -- The object with best model searched and feature importances computed.
      :rtype: object

      .. rubric:: Notes

      Principal Component importances are calculated on the training set.

      Permutation importances can be computed either on the training set or on a held-out testing or validation set.
      Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model.
      Features that are important on the training set but not on the held-out set might cause the model to overfit.
      https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance

   .. py:method:: get_names_of_top_n_features_from_selected_pc(selected_pc=1, top_n=5)

      Gets the names of the features with the highest loading scores on the selected PC.

      Takes the matrix of loading scores of shape (n_samples, n_features) and the metabolome dataframe of shape (n_features, n_samples)
      and extracts the names of the features.
      The loadings matrix is available after running the search_best_model_with_tpot_and_compute_pc_importances() method.

      :param selected_pc: Principal Component to keep. 1-based index (1 selects PC1, 2 selects PC2, etc.).
                          Default is 1.
      :type selected_pc: int, optional
      :param top_n: Number of features to select.
                    The top_n features with the highest absolute loadings will be selected from the selected_pc PC.
                    For instance, the top 5 features from PC1 will be selected with selected_pc=1 and top_n=5.
                    Default is 5.
      :type top_n: int, optional

      :returns: A list of feature names.
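
The selection rule described for ``selected_pc`` and ``top_n`` can be sketched as follows. This is an illustration only and not the library's implementation: the helper ``top_n_features_from_pc`` and the toy numbers are hypothetical, and it assumes a loadings table with feature names in the index and one column per Principal Component (ordered PC1, PC2, ...), which may differ from how ``feature_loadings`` is actually stored.

```python
import pandas as pd


def top_n_features_from_pc(loadings, selected_pc=1, top_n=5):
    """Return the names of the top_n features with the highest absolute loading on one PC.

    Hypothetical helper: assumes `loadings` has feature names as index
    and one column per Principal Component, ordered PC1, PC2, ...
    """
    if not 1 <= selected_pc <= loadings.shape[1]:
        raise ValueError("selected_pc must be between 1 and the number of PCs")
    # selected_pc is 1-based, DataFrame column positions are 0-based
    pc_loadings = loadings.iloc[:, selected_pc - 1]
    # Rank by absolute value so strong negative loadings count as much as positive ones
    ranked = pc_loadings.abs().sort_values(ascending=False)
    return ranked.index[:top_n].tolist()


# Toy loadings for three metabolites on two PCs (made-up numbers)
toy_loadings = pd.DataFrame(
    {"PC1": [0.10, -0.90, 0.40], "PC2": [0.70, 0.05, -0.20]},
    index=["metabolite1", "metabolite2", "metabolite3"],
)

top_pc1 = top_n_features_from_pc(toy_loadings, selected_pc=1, top_n=2)
print(top_pc1)  # ['metabolite2', 'metabolite3']
```

Ranking on the absolute value matters here: ``metabolite2`` has a negative loading on PC1 but is still the feature that contributes most to that component.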