{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": false
   },
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "    <font color='red'>To begin: Click anywhere in this cell and press <kbd>Run</kbd> on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells. A <i>text cell</i> and a <i>code cell</i>. When you <kbd>Run</kbd> a text cell (<i>we are in a text cell now</i>), you advance to the next cell without executing any code. When you <kbd>Run</kbd> a code cell (<i>identified by <span style=\"font-family: courier; color:black; background-color:white;\">In[ ]:</span> to the left of the cell</i>) you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press <kbd>Run</kbd> again. Repeat this process until the end of the notebook. <b>NOTE:</b> All the cells in this notebook can be automatically executed sequentially by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart and Run All</kbd>. Should anything crash then restart the Jupyter Kernal by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart</kbd>, and start again from the top.\n",
    "        \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "<img src=\"https://github.com/CIMCB/MetabComparisonBinaryML/blob/master/cimcb_logo.png?raw=true\" width=\"180px\" align=\"right\" style=\"padding: 20px\">\n",
    "\n",
    "<h1> SVMLin_MTBLS136 </h1>\n",
    "\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<p style=\"text-align: justify\"> The study used in this tutorial has been previously published by  <a href=\"https://europepmc.org/abstract/MED/30830410\">Stevens et al. (2018)</a>, and the deconvolved and annotated data file deposited at the Metabolights data repository. The data can be accessed directly via its study ID: <a href=\"https://www.ebi.ac.uk/metabolights/MTBLS136\">MTBLS136</a>. This workflow requires data to be formatted as a Microsoft Excel file, using the Tidy Data framework (i.e. each column is a variable, and row is an observation). As such, the Excel file contains a Data Sheet and Peak Sheet. The Data Sheet contains all the metabolite concentrations and metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The Peak Sheet contains all the metadata pertaining to each measured metabolite (requiring the inclusion of the columns: Idx, Name, and Label). Please inspect the Excel file <a href=\"https://github.com/CIMCB/MetabComparisonBinaryML/blob/master/dynamic/data/MTBLS136.xlsx?raw=true\">MTBLS136.xlsx</a> used in this workflow before proceeding.</p>\n",
    "\n",
    "<p style=\"text-align: justify\">This is a serum LC-MS dataset consisting of 949 named metabolites. The primary outcome for this paper was estrogen-only (E; n=332) vs. estrogen plus progestin (E+P; n=337) vs. non-users of post-menopausal hormone therapy regimes (Control; n=667). For the purpose of this study, we compare only the E vs. E+P in a binary discriminant analysis.</p>\n",
    "<br>\n",
    "\n",
    "\n",
    "</ol> \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "    \n",
    "<h1> SVM-Lin Workflow </h1>\n",
    "<br>\n",
    "\n",
    "<p style=\"text-align: justify\">This Jupyter Notebook implements the complete workflow for creating, optimising, and evaluating a linear kernel support vector machine (SVM-Lin) model. <b style=\"text-align: justify\"> SVM was implemented using <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html\">Support Vector Classifier</a> from scikit-learn.</b></p>\n",
    "    \n",
    "<i style=\"text-align: justify\"> Please refer to the 'cimcb' package documentation for further details regarding this specific implementation: <a href=\"https://cimcb.github.io/cimcb\">https://cimcb.github.io/cimcb</a></i><br>\n",
    " \n",
    "<br>\n",
    "\n",
    "<b style=\"text-align: justify\"> SVM uses the following Hyperparameter(s):</b>\n",
    "<ul style=\"list-style-type: square;\">\n",
    "    <li><code>kernel</code>: specifies the type of kernel used in the algorithm. For SVM-Lin, this needs to be set to 'linear'.</li>\n",
    "    <li><code>C</code>: the cost parameter that determines how soft or hard (i.e. strict) the hyperplane margin is (default=1)</li>\n",
    "    \n",
    "</ul>\n",
    "<i style=\"text-align: justify\">The purpose of each hyperparameter is explained here: <a href=\"http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf\">Hsu et al. (2003)</a></i>\n",
    " \n",
    "<br>\n",
    "<br>\n",
    "\n",
    "<b style=\"text-align: justify\"> The notebook workflow is broken into the following steps:</b>\n",
    "\n",
    "<ol>\n",
    "    <li><b><i>Import Packages</i></b>: First, the Python packages required for this workflow need to be imported (<a href=\"http://www.numpy.org/\"><code>numpy</code></a>, <a href=\"https://pandas.pydata.org/\"><code>pandas</code></a>, and <a href=\"https://cimcb.github.io/cimcb\"><code>cimcb</code></a>).\n",
    "</li>\n",
    "    <li><b><i>Load Data & Peak Sheet:</i></b> From the Excel spreadsheet, import the Data and Peak spreadsheets and create two respective <a href=\"https://pandas.pydata.org/\">Pandas</a> tables: <code>DataTable</code> and <code>PeakTable</code>.</li>\n",
    "    <li><b><i>Extract X & Y:</i></b> Next, we reduce the data in <code>DataTable</code> to include only those observations needed for the binary comparison and create a new table: <code>DataTable2</code>. We define one column of the data table to be the \"outcome\" variable <code>Outcomes</code>, and convert the class labels in this column to a binary outcome vector <code>Y</code>, where <code>1</code> is the positive outcome, and <code>0</code> the negative outcome (eg. case=1 & control=0). A new variable <code>peaklist</code> is created to hold the names (M1...Mn) of the metabolites to be used in the discriminant analysis. To create an independent dataset to evaluate, <a href=\"https://scikit-learn.org/stable/\">scikit-learn</a> module's <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\"><code>train_test_split()</code></a> function is used. The data is split into 2/3rd training (<code>DataTrain</code> and <code>YTrain</code>), and 1/3rd test (<code>DataTest</code> and <code>YTest</code>). The metabolite data corresponding to <code>peaklist</code> is extracted from <code>DataTrain</code> and placed in a matrix <code>XTrain</code>. The <code>XTrain</code> matrix is log-transformed and auto-scaled, with missing values imputed using k-nearest neighbours (k=3). Then the metabolite data corresponding to <code>peaklist</code> is extracted from <code>DataTest</code> and placed in a matrix <code>XTest</code>. The <code>XTest</code> matrix is log-transformed and auto-scaled (using mu and sigma from <code>XTrain</code>), with missing values imputed using k-nearest neighbours (k=3).\n",
    "    <li><b><i>Hyperparameter Optimisation:</i></b> Here, we use the helper function <code>cb.cross_val.KFold()</code> to carry out 5-fold cross-validation of a set of SVM models (kernel='linear') configured with different values for C (0.0001 to 10). This helper function is generally applicable, and the values being passed to it are: \n",
    "    <ul>\n",
    "    <li>The class of model to be created by the function, <code>cb.model.SVM</code>.</li>\n",
    "        <li>The metabolite matrix, <code>XTknn</code>, and binary outcome vector, <code>Y</code>.</li>\n",
    "        <li>A dictionary, <code>param_dict</code>, describing key:value pairs where the key is a parameter that is passed to the model, and the value is a list of values to be passed to that parameter.</li>\n",
    "        <li>The number of folds in the cross-validation, <code>folds</code>, and the number of monte carlo repetitions of the k-fold CV, <code>n_mc</code>.</li></ul>\n",
    "When <code>cv.run()</code> is executed the results are displayed as 2 plots of $R^2$ and $Q^2$ statistics and 2 equivalent plots of AUCfull and AUCcv statistics: (a) ($R^2 - Q^2$) vs.  $Q^2$ or (AUCfull - AUCcv) vs. AUCcv, and (b) ($R^2$ and $Q^2$) or (AUCfull and AUCcv) against C. These plots are used to aid in selecting the optimal C value.\n",
    "    </li>\n",
    "    <li><b><i>Build Model & Evaluate:</i></b> Here, we use the class <code>cb.model.SVM()</code> to building a SVM-Lin model using the optimal hyperparameter values determined in step 4. The model is trained on the training dataset, <code>XTrainKnn</code>, and tested on the independent test dataset, <code>XTestKnn</code>. Next, the trained model's <code>.evaluate()</code> method is used to visualise model performance for both the training and independent test dataset using: a <a href=\"https://www.data-to-viz.com/graph/violin.html\">violin plot</a> showing the distributions of negative and positive responses as violin and box-whisker plots; a <a href=\"https://books.google.com.au/books?id=7WBMrZ9umRYC\">probability density function</a> plot for each response type, and a <a href=\"https://doi.org/10.1007/s11306-012-0482-9\">ROC curve</a> that displays the curve for the training dataset (green) and test dataset (yellow).\n",
    "   <li><b><i>Bootstrap Evaluation:</i></b> Finally, to create an estimate of the robustness and a measure of generalised predictive ability of this model we perform  <a href=\"https://link.springer.com/article/10.1007%2FBF00058655\">bootstrap aggregation</a> (Bagging) using the helper function <code>cb.bootstrap.Per()</code> with 100 boostrapped models. This generates a population of 100 model predictions for both the training set (in-bag prediction - IB) and the holdout test set (out-of-bag - OOB) from the full dataset, with the metabolite matrix, <code>XBootKnn</code>, and binary outcome vector, <code>Y</code>. These predictions are visualised with a box-violin and probability density function plot for the aggregate model. The ROC curve displays the curve for the training dataset (green) and test dataset (yellow) from section 5 with 95% confidence intervals (light green band = IB & light yellow band = OOB).\n",
    "  <li><b><i>Export Results:</i></b> Exporting the model evaluation results as an Excel spreadsheet.</li>\n",
    "</ol> \n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": true
   },
   "source": [
    "### 1. Import Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cimcb as cb\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "print('All packages successfully loaded')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Load Data & Peak Sheet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "home = 'data/' \n",
    "file = 'MTBLS136.xlsx'  \n",
    "\n",
    "DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Extract X & Y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean PeakTable and Extract PeakList\n",
    "PercMiss = PeakTable['Perc_missing']  \n",
    "PeakTableClean = PeakTable[(PercMiss < 20)] \n",
    "PeakList = PeakTableClean['Name']  \n",
    "\n",
    "# Select Subset of Data\n",
    "DataTable2 = DataTable[(DataTable.Class == 1) | (DataTable.Class == 0)]\n",
    "\n",
    "# Create a Binary Y Vector \n",
    "Outcomes = DataTable2['Class']\n",
    "Y = Outcomes.values \n",
    "\n",
    "# Split Data into Train (2/3) and Test (1/3)\n",
    "DataTrain, DataTest, YTrain, YTest = train_test_split(DataTable2, Y, test_size=1/3, stratify=Y, random_state=85)\n",
    "\n",
    "# Extract Train Data \n",
    "XTrain = DataTrain[PeakList]                                    \n",
    "XTrainLog = np.log(XTrain)                                          \n",
    "XTrainScale, mu, sigma = cb.utils.scale(XTrainLog, method='auto', return_mu_sigma=True)              \n",
    "XTrainKnn = cb.utils.knnimpute(XTrainScale, k=3)    \n",
    "\n",
    "# Extract Test Data\n",
    "XTest = DataTest[PeakList]                                     \n",
    "XTestLog = np.log(XTest)                                          \n",
    "XTestScale = cb.utils.scale(XTestLog, method='auto', mu=mu, sigma=sigma) \n",
    "XTestKnn = cb.utils.knnimpute(XTestScale, k=3)                                                                                                                 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Hyperparameter Optimisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Parameter Dictionary\n",
    "C_range = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,  0.1]\n",
    "param_dict = dict(C=C_range, kernel=\"linear\")               \n",
    "\n",
    "# Initialise\n",
    "cv = cb.cross_val.KFold(model=cb.model.SVM,                      \n",
    "                        X=XTrainKnn,                                 \n",
    "                        Y=YTrain,                               \n",
    "                        param_dict=param_dict,                   \n",
    "                        folds=5,\n",
    "                        n_mc=10)                            \n",
    "\n",
    "# Run and Plot\n",
    "cv.run()  \n",
    "cv.plot(metric='auc')\n",
    "cv.plot(metric='r2q2')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameter Dictionary\n",
    "#C_range = [1e-6,1e-5,1e-4,5e-4,1e-3,5e-3,1e-2,1e-1]\n",
    "C_range = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001]\n",
    "param_dict = dict(C=C_range, kernel=\"linear\")               \n",
    "\n",
    "# Initialise\n",
    "cv = cb.cross_val.KFold(model=cb.model.SVM,                      \n",
    "                        X=XTrainKnn,                                 \n",
    "                        Y=YTrain,                               \n",
    "                        param_dict=param_dict,                   \n",
    "                        folds=5,\n",
    "                        n_mc=10)                                 \n",
    "\n",
    "# Run and Plot\n",
    "cv.run()  \n",
    "cv.plot(metric='auc')\n",
    "cv.plot(metric='r2q2')   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Build Model & Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Build Model\n",
    "model = cb.model.SVM(C=0.0005, kernel=\"linear\")     \n",
    "YPredTrain = model.train(XTrainKnn, YTrain)\n",
    "YPredTest = model.test(XTestKnn)\n",
    "\n",
    "# Put YTrain and YPredTrain in a List\n",
    "EvalTrain = [YTrain, YPredTrain]\n",
    "\n",
    "# Put YTest and YPrestTest in a List\n",
    "EvalTest = [YTest, YPredTest]\n",
    "\n",
    "# Evaluate Model (include Test Dataset)\n",
    "model.evaluate(testset=EvalTest) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Bootstrap Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract X Data\n",
    "XBoot = DataTable2[PeakList]\n",
    "XBootLog = np.log(XBoot)\n",
    "XBootScale = cb.utils.scale(XBootLog, method='auto')\n",
    "XBootKnn = cb.utils.knnimpute(XBootScale, k=3)\n",
    "YPredBoot = model.train(XBootKnn, Y)\n",
    "\n",
    "# Build Boostrap Models\n",
    "bootmodel = cb.bootstrap.Per(model, bootnum=100) \n",
    "bootmodel.run()\n",
    "\n",
    "# Boostrap Evaluate Model (include Test Dataset)\n",
    "bootmodel.evaluate(trainset=EvalTrain, testset=EvalTest)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Export Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "home = 'results/'\n",
    "file = 'SVMLin_MTBLS136.xlsx'\n",
    "\n",
    "bootmodel.save_results(home + file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  },
  "toc-autonumbering": false,
  "toc-showmarkdowntxt": false
 },
 "nbformat": 4,
 "nbformat_minor": 2
}