MendezComparisonBinaryML

Table of Contents:

Overview of the Repository
Table 1: Train (with IB 95% CI) and Test (with OOB 95% CI) AUC for the following datasets and methods.
Table 2: The download link for the following datasets and methods.
Table 3: Machine Learning Methods
Table 4: Datasets

Overview of the Repository

This repository contains the supplementary information for the journal article, “A Comparative Evaluation of the Generalised Predictive Ability of Eight Machine Learning Algorithms across Ten Clinical Metabolomics Data Sets for Binary Classification”. Generalised predictive ability is evaluated using the area under the receiver operating characteristic (ROC) curve (AUC), based on a Train (2/3 of the data) and Test (1/3 of the data) split combined with bootstrap resampling of the data for each model. The bootstrap resampling yields a population of models (n=100), each trained on approximately 2/3 of the data and tested on the unused data (approximately 1/3). From this population, 95% confidence intervals (CI) are calculated for the training data (in-bag, IB) and the test data (out-of-bag, OOB). Both Train (with IB 95% CI) and Test (with OOB 95% CI) AUC are calculated and stored for each model (refer to Table 1). The Test (with OOB 95% CI) AUC is used as an unbiased estimator of generalised predictive ability.
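
The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the notebooks: the function names (`auc`, `percentile`, `bootstrap_auc`, `fit_centroid`, `predict_centroid`), the toy centroid model, the synthetic data, and the use of simple percentile intervals are all assumptions made for the example. It uses only the Python standard library.

```python
import random
from statistics import mean


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_centroid(X, y):
    """Toy stand-in model (assumption): direction from class-0 to class-1 centroid."""
    dims = range(len(X[0]))
    c1 = [mean(row[j] for row, t in zip(X, y) if t == 1) for j in dims]
    c0 = [mean(row[j] for row, t in zip(X, y) if t == 0) for j in dims]
    return [a - b for a, b in zip(c1, c0)]


def predict_centroid(w, X):
    """Score each row by its projection onto the centroid direction."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def bootstrap_auc(X, y, fit, predict, n_boot=100, seed=0):
    """Return ((IB low, IB high), (OOB low, OOB high)) 95% percentile intervals."""
    rng = random.Random(seed)
    n = len(y)
    ib, oob = [], []
    while len(ib) < n_boot:
        in_bag = [rng.randrange(n) for _ in range(n)]  # n rows, with replacement
        chosen = set(in_bag)
        out_bag = [i for i in range(n) if i not in chosen]
        y_ib = [y[i] for i in in_bag]
        y_oob = [y[i] for i in out_bag]
        if len(set(y_ib)) < 2 or len(set(y_oob)) < 2:
            continue  # AUC is undefined unless both classes are present
        model = fit([X[i] for i in in_bag], y_ib)
        ib.append(auc(y_ib, predict(model, [X[i] for i in in_bag])))
        oob.append(auc(y_oob, predict(model, [X[i] for i in out_bag])))
    ib_ci = (percentile(ib, 2.5), percentile(ib, 97.5))
    oob_ci = (percentile(oob, 2.5), percentile(oob, 97.5))
    return ib_ci, oob_ci


# Synthetic two-class data: one informative feature, one noise feature
data_rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[data_rng.gauss(label, 1.0), data_rng.gauss(0.0, 1.0)] for label in y]

ib_ci, oob_ci = bootstrap_auc(X, y, fit_centroid, predict_centroid)
print("IB 95% CI:", ib_ci)
print("OOB 95% CI:", oob_ci)
```

Sampling n rows with replacement leaves about 63% of the distinct samples in-bag and about 37% out-of-bag on average, which is where the approximate 2/3 and 1/3 proportions come from.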

The eight machine learning methods evaluated in this study are: partial least squares discriminant analysis (PLS-DA), principal component regression (PCR), principal component logistic regression (PCLR), random forest (RF), linear kernel support vector machine (SVM-Lin), radial basis function kernel support vector machine (SVM-RBF), and a linear and a non-linear two-layer artificial neural network (ANN). The linear two-layer ANN (ANN-LS) has a hidden layer (layer 1) consisting of multiple neurons (n = 2 to 6) with a linear activation, and an output layer (layer 2) consisting of a single neuron with a logistic (sigmoidal) activation function. The non-linear two-layer ANN (ANN-SS) has a hidden layer (layer 1) consisting of multiple neurons (n = 2 to 6) with a sigmoidal activation, and an output layer (layer 2) consisting of a single neuron with a logistic (sigmoidal) activation function. For information on each ML method refer to Table 3.
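
To make the difference between the two ANN architectures concrete, the forward pass can be sketched in plain Python. This is an illustrative sketch only (the notebooks use Keras): the function names, the random weights, the choice of 3 hidden neurons, and the input vector are assumptions for the example. It also checks that, with a linear hidden layer, the network output equals that of a single logistic unit acting on a linear combination of the inputs.

```python
import math
import random


def sigmoid(z):
    """Logistic (sigmoidal) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    """Linear (identity) activation."""
    return z


def forward(x, W1, b1, w2, b2, hidden_activation):
    """Two-layer network: hidden layer, then one sigmoidal output neuron."""
    hidden = [hidden_activation(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


# Hypothetical weights: 4 input features, 3 hidden neurons (within the 2 to 6 range)
rng = random.Random(0)
n_features, n_hidden = 4, 3
W1 = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b2 = rng.uniform(-1, 1)
x = [0.5, -1.2, 0.3, 2.0]

p_ls = forward(x, W1, b1, w2, b2, linear)   # ANN-LS: linear hidden, sigmoid output
p_ss = forward(x, W1, b1, w2, b2, sigmoid)  # ANN-SS: sigmoid hidden, sigmoid output

# With a linear hidden layer, the two layers fold into one weight vector and bias
w_eff = [sum(w2[k] * W1[k][j] for k in range(n_hidden)) for j in range(n_features)]
b_eff = sum(w2[k] * b1[k] for k in range(n_hidden)) + b2
p_collapsed = sigmoid(sum(w * xi for w, xi in zip(w_eff, x)) + b_eff)

print(p_ls, p_ss, p_collapsed)
```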

The datasets used for the comparative evaluation of 8 ML methods were identified using the following criteria: data were of clinical origin, data were previously published, data publicly available at either MetaboLights or Metabolomics Workbench data repositories, metabolite data available in a form amenable for direct modelling, experimental data (e.g. Clinical Outcome) available in a form amenable for direct modelling, a clear binary outcome with the number of samples in each class reasonably balanced, data representative of the three primary metabolomics technologies (NMR, LC-MS, GC-MS), data representative of multiple biofluids (e.g. blood, urine, faeces) and a range of sample sizes (from less than 50 to more than 500). For information on each dataset refer to Table 4.

The computational workflow for each model and dataset includes the following steps: import packages, load data and peak sheet, extract X and Y, hyperparameter optimisation, build model and evaluate, bootstrap evaluation, and export results. All workflows were implemented using the Python programming language in the form of interactive Jupyter notebooks. All data and notebooks are publicly available on this GitHub repository. All notebooks can be viewed in static html format by following the steps under Table 1. All notebooks can be downloaded and re-run by following the steps under Table 2. For information regarding each step in the workflow, refer to the top of each notebook.

Table 1: Train (with IB 95% CI) and Test (with OOB 95% CI) AUC for the following datasets and methods.

To view the static notebooks (html format) that are included in the results table above, click the corresponding square. For example, to open the MTBLS90 dataset and PLS-DA method click the area indicated by the green arrow. Alternatively, to open the ST000496 dataset and ANN-LS method click the area indicated by the red arrow. Note: if there are issues, try maximising the webpage.

Table 2: The download link for the following datasets and methods.

All workflows in this manuscript can be downloaded and run locally, or run as an interactive session in the cloud using Binder. To run these workflows locally, including downloading and installing Python, refer to our tutorial review "Toward Collaborative Open Data Science in Metabolomics using Jupyter Notebooks and Cloud Computing" (doi: 10.1007/s11306-019-1588-0). The repository can be downloaded by clicking "Download .zip" at the top of the page. In the table below, we provide direct links to download each workflow for a given method and dataset. To open this table in Binder, please click the link below:

Launch the workflows in the cloud:

To create a virtual environment, and run workflows locally:

Note: If you are using Windows, you need to install git using the following: Git for Windows

Open Terminal on Linux/MacOS or Command Prompt on Windows
Enter the following into the console (one line at a time)

git clone https://github.com/cimcb/MetabComparisonBinaryML
cd MetabComparisonBinaryML
conda env create -f environment.yml
conda activate MetabComparisonBinaryML
jupyter notebook

DATASET	MTBLS90	MTBLS92	MTBLS136	MTBLS161	MTBLS404	MTBLS547	ST000369	ST000496	ST001000	ST001047
PLATFORM	LC-MS	LC-MS	LC-MS	NMR	LC-MS	LC-MS	GC-MS	GC-MS	LC-MS	NMR
SAMPLE TYPE	Plasma	Plasma	Serum	Serum	Urine	Caecal Content	Serum	Saliva	Stool	Urine
SAMPLE SIZE	968 (485/483)	253 (142/111)	668 (337/331)	59 (34/25)	184 (101/83)	97 (46/51)	80 (49/31)	100 (50/50)	121 (68/53)	83 (43/40)
NO. OF PEAKS	189	138	689	29	120	42	181	69	757	149
PLS-DA	PLSDA_MTBLS90.ipynb	PLSDA_MTBLS92.ipynb	PLSDA_MTBLS136.ipynb	PLSDA_MTBLS161.ipynb	PLSDA_MTBLS404.ipynb	PLSDA_MTBLS547.ipynb	PLSDA_ST000369.ipynb	PLSDA_ST000496.ipynb	PLSDA_ST001000.ipynb	PLSDA_ST001047.ipynb
PCR	PCR_MTBLS90.ipynb	PCR_MTBLS92.ipynb	PCR_MTBLS136.ipynb	PCR_MTBLS161.ipynb	PCR_MTBLS404.ipynb	PCR_MTBLS547.ipynb	PCR_ST000369.ipynb	PCR_ST000496.ipynb	PCR_ST001000.ipynb	PCR_ST001047.ipynb
PCLR	PCLR_MTBLS90.ipynb	PCLR_MTBLS92.ipynb	PCLR_MTBLS136.ipynb	PCLR_MTBLS161.ipynb	PCLR_MTBLS404.ipynb	PCLR_MTBLS547.ipynb	PCLR_ST000369.ipynb	PCLR_ST000496.ipynb	PCLR_ST001000.ipynb	PCLR_ST001047.ipynb
SVM-Lin	SVMLin_MTBLS90.ipynb	SVMLin_MTBLS92.ipynb	SVMLin_MTBLS136.ipynb	SVMLin_MTBLS161.ipynb	SVMLin_MTBLS404.ipynb	SVMLin_MTBLS547.ipynb	SVMLin_ST000369.ipynb	SVMLin_ST000496.ipynb	SVMLin_ST001000.ipynb	SVMLin_ST001047.ipynb
SVM-RBF	SVMRBF_MTBLS90.ipynb	SVMRBF_MTBLS92.ipynb	SVMRBF_MTBLS136.ipynb	SVMRBF_MTBLS161.ipynb	SVMRBF_MTBLS404.ipynb	SVMRBF_MTBLS547.ipynb	SVMRBF_ST000369.ipynb	SVMRBF_ST000496.ipynb	SVMRBF_ST001000.ipynb	SVMRBF_ST001047.ipynb
RF	RF_MTBLS90.ipynb	RF_MTBLS92.ipynb	RF_MTBLS136.ipynb	RF_MTBLS161.ipynb	RF_MTBLS404.ipynb	RF_MTBLS547.ipynb	RF_ST000369.ipynb	RF_ST000496.ipynb	RF_ST001000.ipynb	RF_ST001047.ipynb
ANN-LS	ANNLinSig_MTBLS90.ipynb	ANNLinSig_MTBLS92.ipynb	ANNLinSig_MTBLS136.ipynb	ANNLinSig_MTBLS161.ipynb	ANNLinSig_MTBLS404.ipynb	ANNLinSig_MTBLS547.ipynb	ANNLinSig_ST000369.ipynb	ANNLinSig_ST000496.ipynb	ANNLinSig_ST001000.ipynb	ANNLinSig_ST001047.ipynb
ANN-SS	ANNSigSig_MTBLS90.ipynb	ANNSigSig_MTBLS92.ipynb	ANNSigSig_MTBLS136.ipynb	ANNSigSig_MTBLS161.ipynb	ANNSigSig_MTBLS404.ipynb	ANNSigSig_MTBLS547.ipynb	ANNSigSig_ST000369.ipynb	ANNSigSig_ST000496.ipynb	ANNSigSig_ST001000.ipynb	ANNSigSig_ST001047.ipynb

Table 3: Machine Learning Methods

The 8 machine learning methods compared in this study are: partial least squares discriminant analysis (PLS-DA), principal component regression (PCR), principal component logistic regression (PCLR), random forest (RF), linear-kernel support vector machine (SVM-Lin), radial basis function-kernel support vector machine (SVM-RBF), 2-layer (linear activation, sigmoid activation) artificial neural network (ANN-LS), and 2-layer (sigmoid activation, sigmoid activation) artificial neural network (ANN-SS). A summary of each method is shown below:

Method	Summary
PLS-DA	Partial least squares discriminant analysis was implemented using the SIMPLS algorithm. Refer to De Jong (1993) for details on the SIMPLS algorithm.
PCR	Principal component regression is a two-stage algorithm combining principal component analysis (PCA) and multiple linear regression (MLR), where the first N principal component scores act as the independent variables of the MLR, and the binary classification is the dependent variable. The value of N is chosen by the user. PCA was implemented using PCA, and MLR using Linear Regression, both from scikit-learn.
PCLR	Principal component logistic regression is a two-stage algorithm combining principal component analysis (PCA) and logistic regression, where the first N principal component scores act as the independent variables of the logistic regression, and the binary classification is the dependent variable. The value of N is chosen by the user. PCA was implemented using PCA, and logistic regression using Logistic Regression, both from scikit-learn.
RF	Random forest was implemented using Random Forest Classifier from scikit-learn.
SVM-Lin	Linear kernel support vector machine was implemented using Support Vector Classifier from scikit-learn.
SVM-RBF	Radial basis function kernel support vector machine was implemented using Support Vector Classifier from scikit-learn.
ANN-LS	2-layer artificial neural network with layer 1 consisting of multiple neurons (n = 2 to 6) with a linear activation, and layer 2 (output layer) consisting of a single neuron with a sigmoidal activation function. The ANN was implemented using Keras with a TensorFlow backend.
ANN-SS	2-layer artificial neural network with layer 1 consisting of multiple neurons (n = 2 to 6) with a sigmoidal activation, and layer 2 (output layer) consisting of a single neuron with a sigmoidal activation function. The ANN was implemented using Keras with a TensorFlow backend.
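
As an illustration of the two-stage PCR and PCLR methods summarised in the table, the sketch below chains scikit-learn's `PCA` with `LinearRegression` or `LogisticRegression` and scores each on a held-out third of the data. It is a hedged example rather than the notebooks' code: the synthetic data from `make_classification`, the value `n_components = 3`, the random seeds, and the omission of any scaling or hyperparameter optimisation are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a metabolomics data table (assumption)
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=0)

# Train on 2/3 of the samples, test on the remaining 1/3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

n_components = 3  # N, the number of principal components, chosen by the user

# PCR: PCA scores feed a multiple linear regression on the 0/1 class label
pcr = make_pipeline(PCA(n_components=n_components), LinearRegression())
pcr.fit(X_train, y_train)
pcr_scores = pcr.predict(X_test)  # continuous scores, used for ranking only

# PCLR: PCA scores feed a logistic regression
pclr = make_pipeline(PCA(n_components=n_components), LogisticRegression())
pclr.fit(X_train, y_train)
pclr_scores = pclr.predict_proba(X_test)[:, 1]  # probability of class 1

pcr_auc = roc_auc_score(y_test, pcr_scores)
pclr_auc = roc_auc_score(y_test, pclr_scores)
print("PCR test AUC:", pcr_auc)
print("PCLR test AUC:", pclr_auc)
```

Because AUC depends only on the ranking of the scores, the unbounded PCR predictions can be passed to `roc_auc_score` directly without thresholding.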

Table 4: Datasets

The 10 open-access datasets below were obtained from the MetaboLights and Metabolomics Workbench data repositories. These datasets were selected to represent a cross-section of popular analytical platforms (NMR, GC- and LC-MS), sample types (caecum content, saliva, serum, stool, plasma, and urine), sample sizes (from 59 to 1005), and number of peaks (from 29 to 2000). A summary and link for each dataset is shown below:

Dataset	Summary
MTBLS90	A plasma LC-MS dataset consisting of 189 named metabolites. This was a large prospective epidemiological study of men and women at age 70 living in Uppsala, Sweden. For the purpose of this study, we compare males (Class=1; n=485) and females (Class=0; n=483) in a binary discriminant analysis.
MTBLS92	A plasma LC-MS dataset consisting of 138 named metabolites. The primary outcome for this paper was before and after neoadjuvant chemotherapy in breast cancer patients. For the purpose of this study, we compare before (Class=1; n=142) and after (Class=0; n=111) neoadjuvant chemotherapy in a binary discriminant analysis.
MTBLS136	A serum LC-MS dataset consisting of 949 named metabolites. The primary outcome for this paper was estrogen-only (E; n=332) vs. estrogen plus progestin (E+P; n=337) vs. non-users of post-menopausal hormone therapy regimes (Control; n=667). For the purpose of this study, we compare only the E vs. E+P in a binary discriminant analysis.
MTBLS161	A serum NMR dataset consisting of 29 named metabolites. The primary outcome for this paper was myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS; n=34) vs. healthy control (HC; n=25), for both serum and urine samples. For the purpose of this study, we compare ME/CFS (Class=1) vs. HC (Class=0) using the serum dataset in a binary discriminant analysis.
MTBLS404	A urine LC-MS dataset consisting of 120 named metabolites. This paper was an analysis of the variations of the human adult urinary metabolome with age, body mass index, and gender. For the purpose of this study, we compare males (Class=1; n=101) and females (Class=0; n=83) in a binary discriminant analysis.
MTBLS547	A caecal content (of mice) LC-MS dataset consisting of 42 named metabolites. The primary outcome for this paper was high fat diet (HFD; n=46) vs. healthy control (HC; n=51). For the purpose of this study, we compare HFD (Class=1) vs. HC (Class=0) in a binary discriminant analysis.
ST000369	A serum and plasma GC-MS dataset consisting of 181 named metabolites. The primary outcome for this paper was Adenocarcinoma Lung Cancer vs. Healthy Control using two independent case-control studies. For the purpose of this publication, we compare the serum dataset for the first set of independent case-control studies (ADC1): Adenocarcinoma Lung Cancer (Class=1; n=49) vs. Healthy Control (Class=0; n=31).
ST000496	A saliva GC-MS dataset consisting of 69 named metabolites. The primary outcome for this paper was to assess the correlation of periodontal inflamed surface area (PISA) and salivary metabolites, before and after debridement. For the purpose of this publication, we compare the secondary outcome in a binary discriminant analysis: before debridement (Class=0; n=50) and after debridement (Class=1; n=50).
ST001000	A stool LC-MS dataset (4 modes: HILICpos, HILICneg, C18neg, C8pos) consisting of >8,000 measured metabolite features. The primary outcome for this paper was healthy controls vs. inflammatory bowel disease (IBD), which includes Crohn’s disease (CD) and ulcerative colitis (UC). For the purpose of this publication, we compare UC (Class=1; n=68) and CD (Class=0; n=53) using the C18neg dataset in a binary discriminant analysis.
ST001047	A urine NMR dataset consisting of 149 named metabolites. The primary outcome for this paper was Gastric Cancer (GC; n=43) vs. Benign Tumor (BN; n=40) vs. Healthy Control (HE; n=40). For the purpose of this publication, we compare only the GC vs. HE samples in a binary discriminant analysis.