Overview of the Repository

This repository contains the supplementary information for the journal article “A Comparative Evaluation of the Generalised Predictive Ability of Eight Machine Learning Algorithms across Ten Clinical Metabolomics Data Sets for Binary Classification”. Generalised predictive ability is evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) on a Train (2/3 of the data) and Test (1/3 of the data) set, combined with bootstrap resampling of the data for each model. The bootstrap produces a population of models (n=100), each trained on approximately 2/3 of the data and tested on the unused data (approximately 1/3). From this population, 95% confidence intervals (CI) are calculated for the training data (in-bag, IB) and the test data (out-of-bag, OOB). Both Train (with IB 95% CI) and Test (with OOB 95% CI) AUC are calculated and stored for each model (refer to Table 1). The Test (with OOB 95% CI) AUC is used as an unbiased estimator of generalised predictive ability.
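
The bootstrap evaluation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `auc`, `fit_centroid_scorer`, `percentile` and `bootstrap_auc` are hypothetical helper names, the nearest-centroid scorer is only a stand-in for any of the eight models, the data are synthetic, and the percentile method for the 95% CI is an assumption. Sampling n items with replacement leaves roughly 37% (about 1/e) of the samples out of each bootstrap, which is where the "approximately 2/3 in-bag, 1/3 out-of-bag" split comes from.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(X, y):
    """Stand-in model: higher score means closer to the class-1 mean."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    m0 = mean([x for x, c in zip(X, y) if c == 0])
    m1 = mean([x for x, c in zip(X, y) if c == 1])
    return lambda x: dist2(x, m0) - dist2(x, m1)


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    v = sorted(values)
    pos = (len(v) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)


def bootstrap_auc(X, y, n_boot=100, seed=0):
    """Return ((IB low, IB high), (OOB low, OOB high)) 95% intervals for AUC."""
    rng = random.Random(seed)
    n = len(y)
    ib, oob = [], []
    while len(ib) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]      # in-bag (with replacement)
        in_bag = set(idx)
        out = [i for i in range(n) if i not in in_bag]  # out-of-bag (unused samples)
        y_ib = [y[i] for i in idx]
        y_oob = [y[i] for i in out]
        if len(set(y_ib)) < 2 or len(set(y_oob)) < 2:
            continue  # AUC needs both classes; draw another resample
        score = fit_centroid_scorer([X[i] for i in idx], y_ib)
        ib.append(auc(y_ib, [score(X[i]) for i in idx]))
        oob.append(auc(y_oob, [score(X[i]) for i in out]))

    def ci(v):
        return percentile(v, 2.5), percentile(v, 97.5)

    return ci(ib), ci(oob)


# Synthetic two-class data: the first feature separates the classes.
data_rng = random.Random(1)
y = [0] * 30 + [1] * 30
X = [[data_rng.gauss(c, 1.0), data_rng.gauss(0, 1.0)] for c in y]
ib_ci, oob_ci = bootstrap_auc(X, y)
print("IB 95% CI:", ib_ci, "OOB 95% CI:", oob_ci)
```

Because each model has already seen its in-bag samples, the IB interval is usually optimistic, which is why the OOB interval is the one to read as an estimate of performance on unseen data.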

The eight machine learning methods evaluated in this study are: partial least squares discriminant analysis (PLS-DA), principal component regression (PCR), principal component logistic regression (PCLR), random forest (RF), linear kernel support vector machine (SVM-Lin), radial basis function kernel support vector machine (SVM-RBF), and a linear and a non-linear two-layer artificial neural network (ANN). The linear two-layer ANN (ANN-LS) has a hidden layer (layer 1) of multiple neurons (n = 2 to 6) with a linear activation, and an output layer (layer 2) of a single neuron with a logistic (sigmoidal) activation function. The non-linear two-layer ANN (ANN-SS) has a hidden layer (layer 1) of multiple neurons (n = 2 to 6) with a sigmoidal activation, and an output layer (layer 2) of a single neuron with a logistic (sigmoidal) activation function. For information on each ML method refer to Table 3.
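
To make the two ANN architectures concrete, the forward pass of each can be written in a few lines. This is an illustrative sketch in plain Python with made-up weights; the function name `ann_forward` and its arguments are hypothetical, and the repository's models are built with Keras rather than with this code. The only difference between ANN-LS and ANN-SS is the hidden-layer activation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ann_forward(x, W1, b1, w2, b2, hidden="linear"):
    """Two-layer network: hidden layer (linear or sigmoid), then one sigmoid output neuron."""
    act = sigmoid if hidden == "sigmoid" else (lambda z: z)
    # Layer 1: one weighted sum per hidden neuron, passed through the hidden activation.
    h = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # Layer 2: a single output neuron with a logistic activation.
    return sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)


# Illustrative (assumed) weights: 3 input peaks, 2 hidden neurons.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
w2 = [1.0, -1.0]
b2 = 0.0
x = [1.0, 2.0, 3.0]

p_ls = ann_forward(x, W1, b1, w2, b2, hidden="linear")   # ANN-LS
p_ss = ann_forward(x, W1, b1, w2, b2, hidden="sigmoid")  # ANN-SS
print(p_ls, p_ss)
```

With a linear hidden layer, the two layers compose into a single linear function of the input before the final sigmoid, so ANN-LS can only represent a linear decision boundary. The sigmoidal hidden layer in ANN-SS is what allows a non-linear one.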

The datasets used for the comparative evaluation of the eight ML methods were identified using the following criteria: data were of clinical origin; data were previously published; data were publicly available at either the MetaboLights or Metabolomics Workbench data repositories; metabolite data were available in a form amenable to direct modelling; experimental data (e.g. clinical outcome) were available in a form amenable to direct modelling; there was a clear binary outcome with the number of samples in each class reasonably balanced; data were representative of the three primary metabolomics technologies (NMR, LC-MS, GC-MS); and data were representative of multiple biofluids (e.g. blood, urine, faeces) and a range of sample sizes (from fewer than 50 to more than 500). For information on each dataset refer to Table 4.

The computational workflow for each model and dataset includes the following steps: import packages, load data and peak sheet, extract X and Y, hyperparameter optimisation, build model and evaluate, bootstrap evaluation, and export results. All workflows were implemented using the Python programming language in the form of interactive Jupyter notebooks. All data and notebooks are publicly available on this GitHub repository. All notebooks can be viewed in static html format by following the steps under Table 1. All notebooks can be downloaded and re-run by following the steps under Table 2. For information regarding each step in the workflow, refer to the top of each notebook.
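
The "load data and peak sheet" and "extract X and Y" steps can be illustrated with a small stand-alone sketch. The column names (`SampleID`, `Class`, `Name`, `Label`, `M1`, `M2`), the in-memory tables, and the helper `extract_xy` are hypothetical stand-ins for a data sheet (one row per sample) and a peak sheet (one row per metabolite). The notebooks themselves document the actual file layout and loading functions.

```python
# Hypothetical data sheet: one row per sample, with the outcome and one column per peak.
data_table = [
    {"SampleID": "S1", "Class": 1, "M1": 10.2, "M2": 3.1},
    {"SampleID": "S2", "Class": 0, "M1": 8.7, "M2": 4.0},
    {"SampleID": "S3", "Class": 1, "M1": 11.5, "M2": 2.8},
]

# Hypothetical peak sheet: one row per peak, naming the data-sheet columns to model.
peak_table = [
    {"Name": "M1", "Label": "Metabolite A"},
    {"Name": "M2", "Label": "Metabolite B"},
]


def extract_xy(data_table, peak_table, outcome="Class"):
    """Build the X matrix (samples x peaks) and the binary Y vector."""
    peak_names = [row["Name"] for row in peak_table]
    X = [[float(sample[name]) for name in peak_names] for sample in data_table]
    Y = [int(sample[outcome]) for sample in data_table]
    if set(Y) - {0, 1}:
        raise ValueError("outcome must be binary (0/1)")
    return X, Y


X, Y = extract_xy(data_table, peak_table)
print(len(X), "samples x", len(X[0]), "peaks; classes:", Y)
```

Keeping the list of modelled columns in the peak sheet, rather than hard-coding it, means the same extraction step works unchanged across datasets with different numbers of peaks.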




Table 1: Train (with IB 95% CI) and Test (with OOB 95% CI) AUC for the following datasets and methods.

To view the static notebooks (html format) that are included in the results table above, click the corresponding square. For example, to open the MTBLS90 dataset and PLS-DA method click the area indicated by the green arrow. Alternatively, to open the ST000496 dataset and ANN-LS method click the area indicated by the red arrow. Note: if there are issues, try maximising the webpage.

[Table 1 grid: one "html" link to the static notebook for each method and dataset]



Table 2: The download link for the following datasets and methods.

All workflows in this manuscript can be downloaded and run locally, or run as an interactive session in the cloud using Binder. To run these workflows locally, including downloading and installing Python, refer to our tutorial review "Toward Collaborative Open Data Science in Metabolomics using Jupyter Notebooks and Cloud Computing" (doi: 10.1007/s11306-019-1588-0). The repository can be downloaded by clicking "Download .zip" at the top of the page. In the table below, we provide direct links to download each workflow for a given method and dataset. To open the workflows in Binder, click the link below:

Launch the workflows in the cloud: Binder

To create a virtual environment, and run workflows locally:

Note: If you are using Windows, you need to install git using the following: Git for Windows

  1. Open Terminal on Linux/MacOS or Command Prompt on Windows
  2. Enter the following into the console (one line at a time)
git clone https://github.com/cimcb/MetabComparisonBinaryML
cd MetabComparisonBinaryML
conda env create -f environment.yml
conda activate MetabComparisonBinaryML
jupyter notebook


| DATASET | MTBLS90 | MTBLS92 | MTBLS136 | MTBLS161 | MTBLS404 | MTBLS547 | ST000369 | ST000496 | ST001000 | ST001047 |
|---|---|---|---|---|---|---|---|---|---|---|
| PLATFORM | LC-MS | LC-MS | LC-MS | NMR | LC-MS | LC-MS | GC-MS | GC-MS | LC-MS | LC-MS |
| SAMPLE TYPE | Plasma | Plasma | Serum | Serum | Urine | Caecal Content | Serum | Saliva | Stool | Urine |
| SAMPLE SIZE | 968 (485/483) | 253 (142/111) | 668 (337/331) | 59 (34/25) | 184 (101/83) | 97 (46/51) | 80 (49/31) | 100 (50/50) | 121 (68/53) | 83 (43/40) |
| NO. OF PEAKS | 189 | 138 | 689 | 29 | 120 | 42 | 181 | 69 | 757 | 149 |
| PLS-DA | PLSDA_MTBLS90.ipynb | PLSDA_MTBLS92.ipynb | PLSDA_MTBLS136.ipynb | PLSDA_MTBLS161.ipynb | PLSDA_MTBLS404.ipynb | PLSDA_MTBLS547.ipynb | PLSDA_ST000369.ipynb | PLSDA_ST000496.ipynb | PLSDA_ST001000.ipynb | PLSDA_ST001047.ipynb |
| PCR | PCR_MTBLS90.ipynb | PCR_MTBLS92.ipynb | PCR_MTBLS136.ipynb | PCR_MTBLS161.ipynb | PCR_MTBLS404.ipynb | PCR_MTBLS547.ipynb | PCR_ST000369.ipynb | PCR_ST000496.ipynb | PCR_ST001000.ipynb | PCR_ST001047.ipynb |
| PCLR | PCLR_MTBLS90.ipynb | PCLR_MTBLS92.ipynb | PCLR_MTBLS136.ipynb | PCLR_MTBLS161.ipynb | PCLR_MTBLS404.ipynb | PCLR_MTBLS547.ipynb | PCLR_ST000369.ipynb | PCLR_ST000496.ipynb | PCLR_ST001000.ipynb | PCLR_ST001047.ipynb |
| SVM-Lin | SVMLin_MTBLS90.ipynb | SVMLin_MTBLS92.ipynb | SVMLin_MTBLS136.ipynb | SVMLin_MTBLS161.ipynb | SVMLin_MTBLS404.ipynb | SVMLin_MTBLS547.ipynb | SVMLin_ST000369.ipynb | SVMLin_ST000496.ipynb | SVMLin_ST001000.ipynb | SVMLin_ST001047.ipynb |
| SVM-RBF | SVMRBF_MTBLS90.ipynb | SVMRBF_MTBLS92.ipynb | SVMRBF_MTBLS136.ipynb | SVMRBF_MTBLS161.ipynb | SVMRBF_MTBLS404.ipynb | SVMRBF_MTBLS547.ipynb | SVMRBF_ST000369.ipynb | SVMRBF_ST000496.ipynb | SVMRBF_ST001000.ipynb | SVMRBF_ST001047.ipynb |
| RF | RF_MTBLS90.ipynb | RF_MTBLS92.ipynb | RF_MTBLS136.ipynb | RF_MTBLS161.ipynb | RF_MTBLS404.ipynb | RF_MTBLS547.ipynb | RF_ST000369.ipynb | RF_ST000496.ipynb | RF_ST001000.ipynb | RF_ST001047.ipynb |
| ANN-LS | ANNLinSig_MTBLS90.ipynb | ANNLinSig_MTBLS92.ipynb | ANNLinSig_MTBLS136.ipynb | ANNLinSig_MTBLS161.ipynb | ANNLinSig_MTBLS404.ipynb | ANNLinSig_MTBLS547.ipynb | ANNLinSig_ST000369.ipynb | ANNLinSig_ST000496.ipynb | ANNLinSig_ST001000.ipynb | ANNLinSig_ST001047.ipynb |
| ANN-SS | ANNSigSig_MTBLS90.ipynb | ANNSigSig_MTBLS92.ipynb | ANNSigSig_MTBLS136.ipynb | ANNSigSig_MTBLS161.ipynb | ANNSigSig_MTBLS404.ipynb | ANNSigSig_MTBLS547.ipynb | ANNSigSig_ST000369.ipynb | ANNSigSig_ST000496.ipynb | ANNSigSig_ST001000.ipynb | ANNSigSig_ST001047.ipynb |



Table 3: Machine Learning Methods

The 8 machine learning methods compared in this study are: partial least squares discriminant analysis (PLS-DA), principal component regression (PCR), principal component logistic regression (PCLR), random forest (RF), linear-kernel support vector machine (SVM-Lin), radial basis function-kernel support vector machine (SVM-RBF), two-layer (linear activation, sigmoid activation) artificial neural network (ANN-LS), and two-layer (sigmoid activation, sigmoid activation) artificial neural network (ANN-SS). A summary of each method is shown below:

| Method | Summary |
|---|---|
| PLS-DA | Partial least squares discriminant analysis was implemented using the SIMPLS algorithm. Refer to De Jong (1993) for details on the SIMPLS algorithm. |
| PCR | Principal component regression is a two-stage algorithm combining principal component analysis (PCA) and multiple linear regression (MLR), where the first N principal component scores act as the independent variables of the MLR, and the binary classification is the dependent variable. The value of N is chosen by the user. PCA was implemented using PCA, and MLR using Linear Regression, both from scikit-learn. |
| PCLR | Principal component logistic regression is a two-stage algorithm combining principal component analysis (PCA) and logistic regression, where the first N principal component scores act as the independent variables of the logistic regression, and the binary classification is the dependent variable. The value of N is chosen by the user. PCA was implemented using PCA, and logistic regression using Logistic Regression, both from scikit-learn. |
| RF | Random forest was implemented using Random Forest Classifier from scikit-learn. |
| SVM-Lin | Linear kernel support vector machine was implemented using Support Vector Classifier from scikit-learn. |
| SVM-RBF | Radial basis function kernel support vector machine was implemented using Support Vector Classifier from scikit-learn. |
| ANN-LS | Two-layer artificial neural network with layer 1 consisting of multiple neurons (n = 2 to 6) with a linear activation, and layer 2 (output layer) consisting of a single neuron with a sigmoidal activation function. ANN was implemented using Keras with a TensorFlow backend. |
| ANN-SS | Two-layer artificial neural network with layer 1 consisting of multiple neurons (n = 2 to 6) with a sigmoidal activation, and layer 2 (output layer) consisting of a single neuron with a sigmoidal activation function. ANN was implemented using Keras with a TensorFlow backend. |
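
As a rough illustration of the two-stage PCR and PCLR methods, the sketch below chains PCA with a regression step using scikit-learn pipelines. It is not the repository's code: the synthetic data, the choice of `n_components = 2`, and the variable names are assumptions made for the example, and the AUC printed here is a training AUC only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): 80 samples, 10 peaks, first 3 peaks shifted in class 1.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 10))
X[:, :3] += y[:, None] * 2.0

n_components = 2  # N, the number of principal components, is chosen by the user

# PCR: PCA scores feed a multiple linear regression on the 0/1 outcome.
pcr = make_pipeline(PCA(n_components=n_components), LinearRegression())
# PCLR: PCA scores feed a logistic regression.
pclr = make_pipeline(PCA(n_components=n_components), LogisticRegression())

pcr.fit(X, y)
pclr.fit(X, y)

pcr_scores = pcr.predict(X)                 # continuous predictions used as scores
pclr_scores = pclr.predict_proba(X)[:, 1]   # predicted probability of class 1

pcr_auc = roc_auc_score(y, pcr_scores)
pclr_auc = roc_auc_score(y, pclr_scores)
print(pcr_auc, pclr_auc)
```

Wrapping both stages in one pipeline means the PCA loadings are re-estimated from whatever data `fit` receives, so a resampled training set never borrows information from the samples held out of it.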



Table 4: Datasets

The 10 open-access datasets below were obtained from the MetaboLights and Metabolomics Workbench data repositories. These datasets were selected to represent a cross-section of popular analytical platforms (NMR, GC- and LC-MS), sample types (caecum content, saliva, serum, stool, plasma, and urine), sample sizes (from 59 to 1005), and number of peaks (from 29 to 2000). A summary and link for each dataset is shown below:

| Dataset | Summary |
|---|---|
| MTBLS90 | A plasma LC-MS dataset consisting of 189 named metabolites. This was a large prospective epidemiological study of men and women at age 70 living in Uppsala, Sweden. For the purpose of this study, we compare males (Class=1; n=485) and females (Class=0; n=483) in a binary discriminant analysis. |
| MTBLS92 | A plasma LC-MS dataset consisting of 138 named metabolites. The primary outcome for this paper was before and after neoadjuvant chemotherapy in breast cancer patients. For the purpose of this study, we compare before (Class=1; n=142) and after (Class=0; n=111) neoadjuvant chemotherapy in a binary discriminant analysis. |
| MTBLS136 | A serum LC-MS dataset consisting of 949 named metabolites. The primary outcome for this paper was estrogen-only (E; n=332) vs. estrogen plus progestin (E+P; n=337) vs. non-users of post-menopausal hormone therapy regimes (Control; n=667). For the purpose of this study, we compare only the E vs. E+P in a binary discriminant analysis. |
| MTBLS161 | A serum NMR dataset consisting of 29 named metabolites. The primary outcome for this paper was myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS; n=34) vs. healthy control (HC; n=25), for both serum and urine samples. For the purpose of this study, we compare ME/CFS (Class=1) vs. HC (Class=0) using the serum dataset in a binary discriminant analysis. |
| MTBLS404 | A urine LC-MS dataset consisting of 120 named metabolites. This paper was an analysis of the variations of the human adult urinary metabolome with age, body mass index, and gender. For the purpose of this study, we compare males (Class=1; n=101) and females (Class=0; n=83) in a binary discriminant analysis. |
| MTBLS547 | A caecal content (of mice) LC-MS dataset consisting of 42 named metabolites. The primary outcome for this paper was high fat diet (HFD; n=46) vs. healthy control (HC; n=51). For the purpose of this study, we compare HFD (Class=1) vs. HC (Class=0) in a binary discriminant analysis. |
| ST000369 | A serum and plasma GC-MS dataset consisting of 181 named metabolites. The primary outcome for this paper was Adenocarcinoma Lung Cancer vs. Healthy Control using two independent case-control studies. For the purpose of this publication, we compare the serum dataset for the first set of independent case-control studies (ADC1): Adenocarcinoma Lung Cancer (Class=1; n=49) vs. Healthy Control (Class=0; n=31). |
| ST000496 | A saliva GC-MS dataset consisting of 69 named metabolites. The primary outcome for this paper was to assess the correlation of periodontal inflamed surface area (PISA) and salivary metabolites, before and after debridement. For the purpose of this publication, we compare the secondary outcome in a binary discriminant analysis: before debridement (Class=0; n=50) and after debridement (Class=1; n=50). |
| ST001000 | A stool LC-MS dataset (4 modes: HILICpos, HILICneg, C18neg, C8pos) consisting of >8,000 measured metabolite features. The primary outcome for this paper was healthy controls vs. inflammatory bowel disease (IBD), which includes Crohn’s disease (CD) and ulcerative colitis (UC). For the purpose of this publication, we compare UC (Class=1; n=68) and CD (Class=0; n=53) using the C18neg dataset in a binary discriminant analysis. |
| ST001047 | A urine NMR dataset consisting of 149 named metabolites. The primary outcome for this paper was Gastric Cancer (GC; n=43) vs. Benign Tumor (BN; n=40) vs. Healthy Control (HE; n=40). For the purpose of this publication, we compare only the GC vs. HE samples in a binary discriminant analysis. |
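
Several of the datasets above have more than two groups in the original study, and only two are kept for binary classification. A minimal sketch of that filtering and 0/1 encoding step follows. The `Group` field, the sample records, and the helper name `to_binary` are hypothetical, and coding GC as Class=1 and HE as Class=0 is an assumption made for the example rather than something stated for ST001047.

```python
def to_binary(samples, positive, negative, key="Group"):
    """Keep only two groups and encode them as Class 1 (positive) and Class 0 (negative)."""
    kept = [s for s in samples if s[key] in (positive, negative)]
    return [dict(s, Class=1 if s[key] == positive else 0) for s in kept]


# Hypothetical records using the ST001047 group labels (GC, BN, HE).
samples = [
    {"SampleID": "a", "Group": "GC"},
    {"SampleID": "b", "Group": "BN"},
    {"SampleID": "c", "Group": "HE"},
    {"SampleID": "d", "Group": "GC"},
]

# Assumed coding for illustration: GC = 1, HE = 0; BN samples are dropped.
binary = to_binary(samples, positive="GC", negative="HE")
counts = {c: sum(s["Class"] == c for s in binary) for c in (0, 1)}
print(counts)
```

Checking the resulting class counts against the sample sizes listed in Table 2 is a quick way to confirm that the intended groups were kept.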