The study used in this tutorial has been previously published by Chan et al. (2016), and the deconvolved and annotated data file has been deposited at the Metabolomics Workbench data repository (Study ID: ST001047). The data can be accessed directly via its project DOI: 10.21228/M8B10B. This workflow requires data to be formatted as a Microsoft Excel file, using the Tidy Data framework (i.e. each column is a variable, and each row is an observation). As such, the Excel file contains a Data Sheet and a Peak Sheet. The Data Sheet contains all the metabolite concentrations and metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The Peak Sheet contains all the metadata pertaining to each measured metabolite (requiring the inclusion of the columns: Idx, Name, and Label). Please inspect the Excel file ST001047.xlsx used in this workflow before proceeding.
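The required layout can be checked before running the workflow. The following is a minimal sketch, assuming the two sheets have already been read into pandas DataFrames; the miniature tables and the helper `check_tables` are hypothetical and are not part of `cimcb`. The check that every peak `Name` appears as a column in the Data Sheet is inferred from how the peak list is used to select columns later in this tutorial.

```python
import pandas as pd

# Hypothetical miniature tables that mirror the required columns (values are made up).
data_sheet = pd.DataFrame({
    "Idx": [1, 2, 3],
    "SampleID": ["S1", "S2", "S3"],
    "Class": ["GC", "HE", "BN"],
    "M1": [10.2, 8.7, None],   # a missing concentration is allowed
    "M2": [0.5, 0.9, 0.7],
})
peak_sheet = pd.DataFrame({
    "Idx": [1, 2],
    "Name": ["M1", "M2"],
    "Label": ["metabolite A", "metabolite B"],
})


def check_tables(data, peak):
    """Return a list of layout problems; an empty list means the layout looks valid."""
    problems = []
    for col in ("Idx", "SampleID", "Class"):
        if col not in data.columns:
            problems.append(f"Data sheet is missing column '{col}'")
    for col in ("Idx", "Name", "Label"):
        if col not in peak.columns:
            problems.append(f"Peak sheet is missing column '{col}'")
    # Each peak name should correspond to one metabolite column in the Data sheet.
    if "Name" in peak.columns:
        for name in peak["Name"]:
            if name not in data.columns:
                problems.append(f"Peak '{name}' has no column in the Data sheet")
    return problems


problems = check_tables(data_sheet, peak_sheet)
print(problems)
```

An empty list indicates that both sheets carry the columns this workflow expects.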
This is a urine NMR data set consisting of 149 named metabolites. The primary outcome for this paper was Gastric Cancer (GC; n=43) vs Benign Tumor (BN; n=40) vs Healthy Control (HE; n=40). For the purposes of this publication we compare only the GC vs HE samples in a binary discriminant analysis.
This Jupyter Notebook implements the complete workflow for creating, optimising, and evaluating a two-layer artificial neural network, with Layer 1 consisting of multiple neurons (n = 2 to 6) with a linear activation function, and Layer 2 (output layer) consisting of a single neuron with a sigmoidal activation function (ANN-LS). The ANN was implemented using Keras with a Theano backend.
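The two-layer architecture can be illustrated with a forward pass written in plain NumPy. This is only a sketch under stated assumptions: the weights are random rather than trained, the function names are hypothetical, and it does not reproduce the Keras implementation used by `cimcb`. The hidden activation is left as an argument so that a linear (identity) hidden layer and a sigmoidal hidden layer can both be tried.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W1, b1, w2, b2, hidden_activation=lambda z: z):
    """Forward pass: hidden layer (Layer 1) followed by one sigmoid output neuron (Layer 2)."""
    H = hidden_activation(X @ W1 + b1)   # shape (n_samples, n_neurons)
    return sigmoid(H @ w2 + b2)          # shape (n_samples,)


rng = np.random.default_rng(0)
n_samples, n_features, n_neurons = 5, 4, 3

X = rng.normal(size=(n_samples, n_features))               # stand-in for scaled metabolite data
W1 = rng.normal(scale=0.1, size=(n_features, n_neurons))   # untrained hidden-layer weights
b1 = np.zeros(n_neurons)
w2 = rng.normal(scale=0.1, size=n_neurons)                 # untrained output-layer weights
b2 = 0.0

y_linear = forward(X, W1, b1, w2, b2)                               # linear hidden layer
y_sigmoid = forward(X, W1, b1, w2, b2, hidden_activation=sigmoid)   # sigmoidal hidden layer
print(y_linear.shape, y_sigmoid.shape)
```

Because the output neuron is sigmoidal, every prediction lies strictly between 0 and 1 and can be read as a score for the positive class.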
Please refer to the 'cimcb' package documentation for further details regarding this specific implementation: https://cimcb.github.io/cimcb

The model has the following hyperparameters:

- `learning_rate`: the parameter that controls the step size used when updating the weights (default=0.01)
- `n_neurons`: the number of neurons in the hidden layer (default=2)
- `epochs`: the number of iterations in the model training (default=100)
- `momentum`: a value that alters the learning rate schedule by increasing the learning rate when the error cost gradient continues in the same direction (default=0.5)
- `decay`: a value that alters the learning rate schedule by decreasing the learning rate after each epoch/iteration (default=0)
- `loss`: the function used to calculate the error of the model during the model training process known as backpropagation (default='binary_crossentropy')

Preliminary analysis indicated, for the metabolomics data sets used in this study, that varying the hyperparameters `momentum` and `decay` had little impact on performance, thus they were kept constant at their default values. Additionally, it was observed that fixing the number of `epochs` to 400 proved effective across most of the data sets. Thus hyperparameter optimisation was reduced to a grid search across `n_neurons = [2,3,4,5,6]` and `learning_rate = [0.0001,0.001,0.01,0.1,1]`. After the number of neurons is chosen, the learning rate is fine-tuned as appropriate using a linear search.
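To make the roles of the learning rate, momentum, and decay concrete, the sketch below applies gradient descent with classical momentum and a time-based decay to a toy quadratic objective. The update rule shown is one common formulation and is an assumption here; the exact rule used by Keras and `cimcb` may differ, and the function `sgd_momentum` is hypothetical.

```python
import numpy as np


def sgd_momentum(grad_fn, w0, learning_rate=0.01, momentum=0.5, decay=0.0, epochs=100):
    """Gradient descent with classical momentum and time-based learning rate decay."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for epoch in range(epochs):
        # Assumed decay schedule: the step size shrinks as the epoch count grows.
        lr_t = learning_rate / (1.0 + decay * epoch)
        # Momentum carries over part of the previous step, so consecutive
        # gradients that point in the same direction produce larger steps.
        velocity = momentum * velocity - lr_t * grad_fn(w)
        w = w + velocity
    return w


def grad(w):
    """Gradient of the toy objective f(w) = sum((w - 3)^2), minimised at w = 3."""
    return 2.0 * (w - 3.0)


w_final = sgd_momentum(grad, w0=[0.0, 10.0], learning_rate=0.1, momentum=0.5, epochs=100)
print(w_final)
```

With `decay=0` the step size stays constant, which matches the default setting described above.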
The required packages and modules are imported (`numpy`, `pandas`, and `cimcb`).
The Excel file is loaded into two tables: `DataTable` and `PeakTable`.

We reduce `DataTable` to include only those observations needed for the binary comparison and create a new table: `DataTable2`. We define one column of the data table to be the "outcome" variable `Outcomes`, and convert the class labels in this column to a binary outcome vector `Y`, where `1` is the positive outcome, and `0` the negative outcome (e.g. case=1 & control=0). A new variable `peaklist` is created to hold the names (M1...Mn) of the metabolites to be used in the discriminant analysis. To create an independent dataset for evaluation, the scikit-learn `train_test_split()` function is used. The data is split into two-thirds training (`DataTrain` and `YTrain`) and one-third test (`DataTest` and `YTest`). The metabolite data corresponding to `peaklist` is extracted from `DataTrain` and placed in a matrix `XTrain`. The `XTrain` matrix is log-transformed and auto-scaled, with missing values imputed using k-nearest neighbours (k=3). Then the metabolite data corresponding to `peaklist` is extracted from `DataTest` and placed in a matrix `XTest`. The `XTest` matrix is log-transformed and auto-scaled (using mu and sigma from `XTrain`), with missing values imputed using k-nearest neighbours (k=3).
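The key point of this preprocessing is that the scaling parameters are estimated on the training data only and then reused for the test data, so no information leaks from the test set. The sketch below shows this with NumPy on simulated data. The helper names are hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption about how auto-scaling is defined, and the k-nearest neighbours imputation step is deliberately left out.

```python
import numpy as np


def autoscale_fit(X):
    """Estimate column means and standard deviations, ignoring missing values."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0, ddof=1)   # assumption: sample standard deviation
    return mu, sigma


def autoscale_apply(X, mu, sigma):
    """Centre and scale each column with previously estimated parameters."""
    return (X - mu) / sigma


rng = np.random.default_rng(1)
X_train = rng.lognormal(mean=2.0, sigma=0.5, size=(20, 3))   # simulated concentrations
X_test = rng.lognormal(mean=2.0, sigma=0.5, size=(10, 3))
X_train[0, 1] = np.nan   # a missing value, left in place for a later imputation step

# Log-transform both sets, then scale the test set with the TRAINING mu and sigma.
X_train_log = np.log(X_train)
X_test_log = np.log(X_test)
mu, sigma = autoscale_fit(X_train_log)
X_train_scaled = autoscale_apply(X_train_log, mu, sigma)
X_test_scaled = autoscale_apply(X_test_log, mu, sigma)

print(np.nanmean(X_train_scaled, axis=0))
print(np.nanstd(X_train_scaled, axis=0, ddof=1))
```

The scaled training columns have zero mean and unit variance by construction, whereas the scaled test columns generally do not, which is expected.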
We use the helper function `cb.cross_val.KFold()` to carry out 5-fold cross-validation of a set of ANN models (ANN-LS) configured with different values for learning rate (0.001 to 1) and number of neurons (2 to 6). This helper function is generally applicable, and the values being passed to it are:

- The model class, `cb.model.NN_LinearSigmoid`.
- The metabolite matrix, `XTrainKnn`, and binary outcome vector, `YTrain`.
- A dictionary, `param_dict`, describing key:value pairs where the key is a parameter that is passed to the model, and the value is a list of values to be passed to that parameter.
- The number of folds, `folds`, and the number of Monte Carlo repetitions of the k-fold CV, `n_mc`.

When `cv.run()` followed by `cv.plot(metric='r2q2')` is run, the predictive ability of the multiple models across the hyperparameter grid search (`n_neurons` vs. `learning_rate`) is displayed in the form of heatmaps representing the parametric performance values $R^2$, $Q^2$ and $|R^2 - Q^2|$. These heatmaps are interactively linked to a scatter plot of $|R^2 - Q^2|$ vs. $Q^2$ and line plots of $R^2$ & $Q^2$ vs. `n_neurons` and `learning_rate`. If the function `cv.plot(metric='auc')` is run, the predictive ability of the models is presented as measures of the area under the ROC curve, $AUC(full)$ & $AUC(cv)$, as a nonparametric alternative to $R^2$ & $Q^2$. These multiple plots are used to aid in selecting the optimal hyperparameter values.

We then use `cb.model.NN_LinearSigmoid()` to build an ANN-LS model using the optimal hyperparameter values determined in step 4. The model is trained on the training dataset, `XTrainKnn`, and tested on the independent test dataset, `XTestKnn`. Next, the trained model's `.evaluate()` method is used to visualise model performance for both the training and independent test dataset using: a violin plot showing the distributions of negative and positive responses as violin and box-whisker plots; a probability density function plot for each response type; and a ROC curve that displays the curve for the training dataset (green) and test dataset (yellow).
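The performance measures used in the cross-validation plots can be sketched as follows. This is a minimal illustration on made-up predictions, assuming the usual definitions: $R^2$ is the coefficient of determination computed from predictions of a model fitted to all training data, $Q^2$ is the same quantity computed from cross-validated predictions, and AUC is the probability that a positive sample is scored higher than a negative one. The exact definitions in `cimcb` may differ, and the function names are hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison (ties count as one half)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


y = np.array([0, 0, 1, 1])
y_pred_full = np.array([0.1, 0.4, 0.35, 0.8])   # made-up predictions from a model fitted to all data
y_pred_cv = np.array([0.2, 0.6, 0.3, 0.7])      # made-up held-out (cross-validated) predictions

r2 = r_squared(y, y_pred_full)
q2 = r_squared(y, y_pred_cv)
auc_full = auc(y, y_pred_full)
auc_cv = auc(y, y_pred_cv)
print(r2, q2, abs(r2 - q2), auc_full, auc_cv)
```

A large gap $|R^2 - Q^2|$ suggests overfitting, which is why the heatmaps report it alongside $Q^2$.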
We use the helper function `cb.bootstrap.Per()` with 100 bootstrapped models. This generates a population of 100 model predictions for both the training set (in-bag prediction, IB) and the holdout test set (out-of-bag, OOB) from the full dataset, with the metabolite matrix, `XBootKnn`, and binary outcome vector, `Y`. These predictions are visualised with a box-violin and probability density function plot for the aggregate model. The ROC curve displays the curve for the training dataset (green) and test dataset (yellow) from section 5 with 95% confidence intervals (light green band = IB & light yellow band = OOB).
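The in-bag and out-of-bag split behind bootstrap resampling can be sketched with NumPy. This is not the `cb.bootstrap.Per()` implementation: the data are simulated, a least-squares linear fit stands in for the neural network, the helper names are hypothetical, and the use of a percentile interval for the 95% confidence limits is an assumption.

```python
import numpy as np


def bootstrap_indices(n, rng):
    """Draw n indices with replacement (in-bag) and return the unused ones (out-of-bag)."""
    ib = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), ib)
    return ib, oob


def percentile_ci(values, level=95):
    """Percentile confidence interval of a bootstrap distribution."""
    alpha = (100 - level) / 2
    return np.percentile(values, [alpha, 100 - alpha])


def fit_predict(X_fit, y_fit, X_new):
    """Stand-in model: least-squares linear fit with an intercept."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef


rng = np.random.default_rng(42)
n = 80
X = rng.normal(size=(n, 2))                                    # simulated scaled data
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)     # simulated binary outcome

ib_err, oob_err = [], []
for _ in range(100):                                           # 100 bootstrapped models
    ib, oob = bootstrap_indices(n, rng)
    pred_ib = fit_predict(X[ib], y[ib], X[ib])
    pred_oob = fit_predict(X[ib], y[ib], X[oob])
    ib_err.append(np.mean((pred_ib > 0.5) != y[ib]))           # in-bag misclassification rate
    oob_err.append(np.mean((pred_oob > 0.5) != y[oob]))        # out-of-bag misclassification rate

ib_ci = percentile_ci(ib_err)
oob_ci = percentile_ci(oob_err)
print(ib_ci, oob_ci)
```

Each bootstrap sample leaves out roughly a third of the observations, and those out-of-bag observations act as a holdout set for that particular model.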
import numpy as np
import pandas as pd
import cimcb as cb
from sklearn.model_selection import train_test_split
print('All packages successfully loaded')
home = 'data/'
file = 'ST001047.xlsx'
DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak')
# Clean PeakTable and Extract PeakList
RSD = PeakTable['QC_RSD']
PercMiss = PeakTable['Perc_missing']
PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]
PeakList = PeakTableClean['Name']
# Select Subset of Data (Class "GC" or "HE" only)
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]
# Create a Binary Y Vector
Outcomes = DataTable2['Class']
Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]
Y = np.array(Y)
# Split Data into Train (2/3rd) and Test (1/3rd)
DataTrain, DataTest, YTrain, YTest = train_test_split(DataTable2, Y, test_size=1/3, stratify=Y, random_state=40)
# Extract Train Data
XTrain = DataTrain[PeakList]
XTrainLog = np.log(XTrain)
XTrainScale, mu, sigma = cb.utils.scale(XTrainLog, method='auto', return_mu_sigma=True)
XTrainKnn = cb.utils.knnimpute(XTrainScale, k=3)
# Extract Test Data
XTest = DataTest[PeakList]
XTestLog = np.log(XTest)
XTestScale = cb.utils.scale(XTestLog, method='auto', mu=mu, sigma=sigma)
XTestKnn = cb.utils.knnimpute(XTestScale, k=3)
# Parameter Dictionary
lr = [0.001,0.005,0.01,0.05,0.1,1]
neurons = [2, 3, 4, 5, 6]
param_dict = dict(learning_rate=lr,
                  n_neurons=neurons,
                  epochs=400,
                  momentum=0.5,
                  decay=0,
                  loss='binary_crossentropy')
# Initialise
cv = cb.cross_val.KFold(model=cb.model.NN_LinearSigmoid,
                        X=XTrainKnn,
                        Y=YTrain,
                        param_dict=param_dict,
                        folds=5,
                        n_mc=10)
# Run and Plot
cv.run()
cv.plot(metric='auc', color_beta=[5,5,5])
cv.plot(metric='r2q2', color_beta=[5,5,5])
# Parameter Dictionary
lr = [0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009,0.01]
param_dict = dict(learning_rate=lr,
                  n_neurons=3,
                  epochs=400,
                  momentum=0.5,
                  decay=0,
                  loss='binary_crossentropy')
# Initialise
cv = cb.cross_val.KFold(model=cb.model.NN_LinearSigmoid,
                        X=XTrainKnn,
                        Y=YTrain,
                        param_dict=param_dict,
                        folds=5,
                        n_mc=10)
# Run and Plot
cv.run()
cv.plot(metric='auc')
cv.plot(metric='r2q2')
# Build Model
model = cb.model.NN_LinearSigmoid(learning_rate=0.002,
                                  n_neurons=3,
                                  epochs=400,
                                  momentum=0.5,
                                  decay=0,
                                  loss='binary_crossentropy')
YPredTrain = model.train(XTrainKnn, YTrain)
YPredTest = model.test(XTestKnn)
# Put YTrain and YPredTrain in a List
EvalTrain = [YTrain, YPredTrain]
# Put YTest and YPredTest in a List
EvalTest = [YTest, YPredTest]
# Evaluate Model (include Test Dataset)
model.evaluate(testset=EvalTest)
# Extract X Data
XBoot = DataTable2[PeakList]
XBootLog = np.log(XBoot)
XBootScale = cb.utils.scale(XBootLog, method='auto')
XBootKnn = cb.utils.knnimpute(XBootScale, k=3)
YPredBoot = model.train(XBootKnn, Y)
# Build Bootstrap Models
bootmodel = cb.bootstrap.Per(model, bootnum=100)
bootmodel.run()
# Bootstrap Evaluate Model (include Test Dataset)
bootmodel.evaluate(trainset=EvalTrain, testset=EvalTest)
home = 'results/'
file = 'ANNLinSig_ST001047.xlsx'
bootmodel.save_results(home + file)