The study used in this tutorial was previously published by Sakanaka et al. (2017), and the deconvolved and annotated data file was deposited at the Metabolomics Workbench data repository (Study ID: ST000496). The data can be accessed directly via its project DOI: 10.21228/M8TW22. This workflow requires the data to be formatted as a Microsoft Excel file following the Tidy Data framework (i.e. each column is a variable and each row is an observation). As such, the Excel file contains a Data Sheet and a Peak Sheet. The Data Sheet contains all the metabolite concentrations and the metadata associated with each observation (requiring the inclusion of the columns Idx, SampleID, and Class). The Peak Sheet contains all the metadata pertaining to each measured metabolite (requiring the inclusion of the columns Idx, Name, and Label). Please inspect the Excel file ST000496.xlsx used in this workflow before proceeding.
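To verify this layout before running the workflow, both sheets can be inspected with pandas. A minimal sketch (assuming the file is stored at data/ST000496.xlsx, as in the code below):

import pandas as pd

# Load both sheets of the tutorial Excel file
data_sheet = pd.read_excel('data/ST000496.xlsx', sheet_name='Data')
peak_sheet = pd.read_excel('data/ST000496.xlsx', sheet_name='Peak')

# The Data Sheet must include Idx, SampleID, and Class;
# the Peak Sheet must include Idx, Name, and Label
print(data_sheet.columns.tolist())
print(peak_sheet.columns.tolist())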
This is a saliva GC-MS dataset consisting of 69 named metabolites. The primary outcome of the original study was to assess the correlation between periodontal inflamed surface area (PISA) and salivary metabolites, before and after debridement. For the purpose of this publication, we address the secondary outcome: a binary discriminant analysis comparing samples before debridement (Class=0; n=50) and after debridement (Class=1; n=50).
This Jupyter Notebook implements the complete workflow for creating, optimising, and evaluating a linear-kernel support vector machine (SVM-Lin) model. The SVM is implemented using the Support Vector Classifier (SVC) from scikit-learn.
Please refer to the 'cimcb' package documentation for further details regarding this specific implementation: https://cimcb.github.io/cimcb

The two hyperparameters of interest for SVM-Lin are:
- kernel: specifies the type of kernel used in the algorithm. For SVM-Lin, this needs to be set to 'linear'.
- C: the cost parameter that determines how soft or hard (i.e. strict) the hyperplane margin is (default=1).
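Since cimcb's SVM wraps the scikit-learn Support Vector Classifier, these two hyperparameters map directly onto sklearn.svm.SVC. A minimal illustrative sketch using synthetic data (the workflow below uses the cimcb wrapper instead):

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: 20 observations x 5 variables, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.repeat([0, 1], 10)
X[y == 1] += 1.0  # shift the positive class so the groups differ

# kernel='linear' gives SVM-Lin; larger C = harder (stricter) margin
clf = SVC(kernel='linear', C=1, probability=True)
clf.fit(X, y)
print(clf.predict_proba(X)[:3])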
The workflow proceeds in six steps.

1. Import packages: the required Python packages are imported (numpy, pandas, and cimcb), along with the train_test_split() function from scikit-learn.
2. Load data and peak sheets: the Excel file is loaded using the helper function cb.utils.load_dataXL(), which returns two DataFrames: DataTable and PeakTable.
3. Extract X and Y: DataTable is reduced to include only those observations needed for the binary comparison, creating a new table, DataTable2. We define one column of the data table to be the "outcome" variable, Outcomes, and convert the class labels in this column to a binary outcome vector, Y, where 1 is the positive outcome and 0 the negative outcome (e.g. case=1 & control=0). A new variable, PeakList, is created to hold the names (M1...Mn) of the metabolites to be used in the discriminant analysis. To create an independent dataset for model evaluation, scikit-learn's train_test_split() function is used. The data are split into two-thirds training (DataTrain and YTrain) and one-third test (DataTest and YTest). The metabolite data corresponding to PeakList are extracted from DataTrain and placed in a matrix, XTrain. The XTrain matrix is log-transformed and auto-scaled, with missing values imputed using k-nearest neighbours (k=3). The metabolite data corresponding to PeakList are then extracted from DataTest and placed in a matrix, XTest. The XTest matrix is log-transformed and auto-scaled (using mu and sigma from XTrain), with missing values imputed using k-nearest neighbours (k=3).
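For clarity, auto-scaling standardises each (log-transformed) metabolite to zero mean and unit variance using training-set statistics, $x'_{ij} = (x_{ij} - \mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of metabolite $j$ in XTrain. Reusing the same $\mu_j$ and $\sigma_j$ for XTest projects the test data into the training-data space and avoids information leakage from test to train.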
4. Hyperparameter optimisation: the helper function cb.cross_val.KFold() is used to carry out 5-fold cross-validation of a set of SVM-Lin models (kernel='linear') configured with different values of C (0.0001 to 10). This helper function is generally applicable, and the values passed to it are:
- the model, cb.model.SVM;
- the metabolite matrix, XTrainKnn, and the binary outcome vector, YTrain;
- a dictionary, param_dict, describing key:value pairs where each key is a parameter passed to the model and each value is a list of values to be passed to that parameter;
- the number of cross-validation folds, folds, and the number of Monte Carlo repetitions of the k-fold CV, n_mc.

When cv.run() is executed, the results are displayed as two plots of $R^2$ and $Q^2$ statistics and two equivalent plots of $AUC_{full}$ and $AUC_{cv}$ statistics: (a) ($R^2 - Q^2$) vs. $Q^2$, or ($AUC_{full} - AUC_{cv}$) vs. $AUC_{cv}$; and (b) $R^2$ and $Q^2$, or $AUC_{full}$ and $AUC_{cv}$, against C. These plots are used to aid in selecting the optimal value of C.
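As a reminder of the convention used in these plots, $R^2$ is the coefficient of determination computed on the training predictions, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, and $Q^2$ is the same statistic computed on the cross-validated (out-of-fold) predictions $\hat{y}_{i,cv}$; $AUC_{full}$ and $AUC_{cv}$ are the analogous full-data and cross-validated areas under the ROC curve. A widening gap ($R^2 - Q^2$), or ($AUC_{full} - AUC_{cv}$), indicates increasing overfitting, which is why plot (a) helps locate the optimal C.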
5. Build model and evaluate: cb.model.SVM() is used to build an SVM-Lin model using the optimal hyperparameter value determined in step 4. The model is trained on the training dataset, XTrainKnn, and tested on the independent test dataset, XTestKnn. Next, the trained model's .evaluate() method is used to visualise model performance for both the training and independent test datasets using: a violin plot showing the distributions of negative and positive responses as violin and box-whisker plots; a probability density function plot for each response type; and a ROC curve that displays the curve for the training dataset (green) and the test dataset (yellow).
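The ROC curves from .evaluate() can be cross-checked by computing the area under the curve directly. A minimal sketch with scikit-learn, assuming YPredTest holds the predicted scores returned by model.test() in the code below:

from sklearn.metrics import roc_auc_score

# AUC of the independent test set predictions
print(roc_auc_score(YTest, YPredTest))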
6. Bootstrap evaluation: model robustness is assessed using cb.bootstrap.Per() with 100 bootstrapped models. This generates a population of 100 model predictions for both the training set (in-bag prediction; IB) and the holdout test set (out-of-bag; OOB), resampled from the full dataset, using the metabolite matrix, XBootKnn, and the binary outcome vector, Y. These predictions are visualised with box-violin and probability density function plots for the aggregate model. The ROC plot displays the curves for the training dataset (green) and the test dataset (yellow) from step 5, with 95% confidence intervals (light green band = IB; light yellow band = OOB).
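The percentile bootstrap idea behind cb.bootstrap.Per() can be sketched with plain numpy and scikit-learn (a simplified illustration of the resampling logic, not the cimcb implementation): each resample trains a model on the in-bag observations and predicts the out-of-bag observations, and percentiles of the resulting prediction population provide the 95% confidence bands.

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for XBootKnn and Y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.repeat([0, 1], 50)
X[y == 1] += 1.0

n = len(y)
oob_preds = [[] for _ in range(n)]
for b in range(100):
    inbag = rng.integers(0, n, size=n)       # sample with replacement (IB)
    oob = np.setdiff1d(np.arange(n), inbag)  # unsampled observations (OOB)
    clf = SVC(kernel='linear', C=0.03, probability=True).fit(X[inbag], y[inbag])
    for i, p in zip(oob, clf.predict_proba(X[oob])[:, 1]):
        oob_preds[i].append(p)

# 95% confidence interval of the OOB predicted probability per observation
ci = [np.percentile(p, [2.5, 97.5]) for p in oob_preds if p]
print(ci[:3])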
import numpy as np
import pandas as pd
import cimcb as cb
from sklearn.model_selection import train_test_split
print('All packages successfully loaded')
home = 'data/'
file = 'ST000496.xlsx'
DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak')
# Extract PeakList
PeakList = PeakTable['Name']
# Select Subset of Data
DataTable2 = DataTable[(DataTable.Class == 1) | (DataTable.Class == 0)]
# Create a Binary Y Vector
Outcomes = DataTable2['Class']
Y = Outcomes.values
# Split Data into Train (2/3) and Test (1/3)
DataTrain, DataTest, YTrain, YTest = train_test_split(DataTable2, Y, test_size=1/3, stratify=Y, random_state=14)
# Extract Train Data
XTrain = DataTrain[PeakList]
XTrainLog = np.log(XTrain)
XTrainScale, mu, sigma = cb.utils.scale(XTrainLog, method='auto', return_mu_sigma=True)
XTrainKnn = cb.utils.knnimpute(XTrainScale, k=3)
# Extract Test Data
XTest = DataTest[PeakList]
XTestLog = np.log(XTest)
XTestScale = cb.utils.scale(XTestLog, method='auto', mu=mu, sigma=sigma)
XTestKnn = cb.utils.knnimpute(XTestScale, k=3)
# Parameter Dictionary
C_range = [0.0001, 0.001, 0.01, 0.1, 1, 10]
param_dict = dict(C=C_range, kernel="linear")
# Initialise
cv = cb.cross_val.KFold(model=cb.model.SVM,
X=XTrainKnn,
Y=YTrain,
param_dict=param_dict,
folds=5,
n_mc=10)
# Run and Plot
cv.run()
cv.plot(metric='auc')
cv.plot(metric='r2q2')
# Parameter Dictionary
C_range = [0.005,0.01,0.02,0.03,0.04,0.05,0.06,0.08,0.1]
param_dict = dict(C=C_range, kernel="linear")
# Initialise
cv = cb.cross_val.KFold(model=cb.model.SVM,
X=XTrainKnn,
Y=YTrain,
param_dict=param_dict,
folds=5,
n_mc=10)
# Run and Plot
cv.run()
cv.plot(metric='auc')
cv.plot(metric='r2q2')
# Build Model
model = cb.model.SVM(C=0.03, kernel="linear")
YPredTrain = model.train(XTrainKnn, YTrain)
YPredTest = model.test(XTestKnn)
# Put YTrain and YPredTrain in a List
EvalTrain = [YTrain, YPredTrain]
# Put YTest and YPredTest in a List
EvalTest = [YTest, YPredTest]
# Evaluate Model (include Test Dataset)
model.evaluate(testset=EvalTest)
# Extract X Data
XBoot = DataTable2[PeakList]
XBootLog = np.log(XBoot)
XBootScale = cb.utils.scale(XBootLog, method='auto')
XBootKnn = cb.utils.knnimpute(XBootScale, k=3)
# Re-train the Model on the Full Dataset
YPredBoot = model.train(XBootKnn, Y)
# Build Bootstrap Models
bootmodel = cb.bootstrap.Per(model, bootnum=100)
bootmodel.run()
# Bootstrap Evaluate Model (include Train and Test Datasets)
bootmodel.evaluate(trainset=EvalTrain, testset=EvalTest)
# Export Results to Excel
home = 'results/'
file = 'SVMLin_ST000496.xlsx'
bootmodel.save_results(home + file)