{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": false
   },
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "    <font color='red'>To begin: Click anywhere in this cell and press <kbd>Run</kbd> on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells. A <i>text cell</i> and a <i>code cell</i>. When you <kbd>Run</kbd> a text cell (<i>we are in a text cell now</i>), you advance to the next cell without executing any code. When you <kbd>Run</kbd> a code cell (<i>identified by <span style=\"font-family: courier; color:black; background-color:white;\">In[ ]:</span> to the left of the cell</i>) you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press <kbd>Run</kbd> again. Repeat this process until the end of the notebook. <b>NOTE:</b> All the cells in this notebook can be automatically executed sequentially by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart and Run All</kbd>. Should anything crash then restart the Jupyter Kernal by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart</kbd>, and start again from the top.\n",
    "        \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "<img src=\"https://github.com/CIMCB/MetabComparisonBinaryML/blob/master/cimcb_logo.png?raw=true\" width=\"180px\" align=\"right\" style=\"padding: 20px\">\n",
    "\n",
    "\n",
    "<h1> RF_ST001047 </h1>\n",
    "\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<p style=\"text-align: justify\"> The study used in this tutorial has been previously published by  <a href=\"https://www.nature.com/articles/bjc2015414\">Chan et al. (2016)</a>, and the deconvolved and annotated data file deposited at the Metabolomics Workbench data repository (Study ID: ST001047). The data can be accessed directly via its project DOI: <a href=\"http://dx.doi.org/DOI:10.21228/M8B10B\">10.21228/M8B10B</a>. This workflow requires data to be formatted as a Microsoft Excel file, using the Tidy Data framework (i.e. each column is a variable, and row is an observation). As such, the Excel file contains a Data Sheet and Peak Sheet. The Data Sheet contains all the metabolite concentrations and metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The Peak Sheet contains all the metadata pertaining to each measured metabolite (requiring the inclusion of the columns: Idx, Name, and Label). Please inspect the Excel file <a href=\"https://github.com/CIMCB/MetabComparisonBinaryML/blob/master/dynamic/data/ST001047.xlsx?raw=true\">ST001047.xlsx</a> used in this workflow before proceeding.</p>\n",
    "\n",
    "<p style=\"text-align: justify\">This is a urine NMR data set consisting of 149 named metabolites. The primary outcome for this paper was the urine was Gastric Cancer (GC; n=43) v Benign Tumor (BN; n=40) v Healthy Control (HE; n=40). For the purposes of this publication we compare only the GC vs HE samples in a binary discrimiant analysis.</p>\n",
    "\n",
    "\n",
    "<br>\n",
    "\n",
    "\n",
    "</ol> \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;\">\n",
    "    \n",
    "<h1> RF Workflow </h1>\n",
    "<br>\n",
    "<p style=\"text-align: justify\">This Jupyter Notebook implements the complete workflow for creating, optimising, and evaluating a random forest (RF) model. <b style=\"text-align: justify\"> RF was implemented using <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\">Random Forest Classifier</a> from scikit-learn.</b></p>\n",
    "    \n",
    "<i style=\"text-align: justify\"> Please refer to the 'cimcb' package documentation for further details regarding this specific implementation: <a href=\"https://cimcb.github.io/cimcb\">https://cimcb.github.io/cimcb</a></i><br>\n",
    " \n",
    "<br>\n",
    "\n",
    "<b style=\"text-align: justify\"> RF uses the following Hyperparameter(s):</b>\n",
    "<ul style=\"list-style-type: square;\">\n",
    "    <li><code>max_depth</code>: the maximum depth allowed for each tree (default = None)</li>\n",
    "    <li><code>min_samples_leaf</code>: the minimum number of samples required at each leaf node after a split in a tree. This value can either be an integer or a fraction of the total number of samples (default = 1)</li>\n",
    "    <li><code>n_estimators</code>: the number of trees in the forest (default = 100)</li>\n",
    "    <li><code>max_features</code>: the number of features considered at each split in a tree (default = square-root(n_features))</li>\n",
    "    <li><code>criterion</code>: the function used to measure the quality of the split in a tree. This is 'gini' for Gini impurity or 'entropy' for information gain (default = 'gini')</li>\n",
    "    <li><code>min_samples_split</code>: the minimum number of samples required for a split in a tree. This value can either be an integer or a fraction of the total number of samples (default = 2)</li>\n",
    "    <li><code>max_leaf_nodes</code>: the maximum number of leaf nodes in a tree (default = None)</li> \n",
    "</ul>\n",
    "<i style=\"text-align: justify\">The purpose of each hyperparameter is explained here:  <a href=\"https://doi.org/10.1002/widm.1301\">Probst et al. (2019)</a></i>\n",
    " \n",
    "<br>\n",
    "<br>\n",
    "\n",
    "<p style=\"text-align: justify\">Preliminary analysis indicated that varying hyperparameters <code>n_estimators</code>, <code>max_features</code>, <code>min_samples_split</code>, and <code>max_leaf_nodes</code> had little impact on performance (low sensitivity) for the metabolomics data sets used in this study, thus they were kept constant at their default values. Therefore, hyperparameter optimisation was reduced to a grid search across <code>depth = [1,2,3,4,5,6,7,8,9,10]</code> and <code>min_samples_leaf = [0.01,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5]</code>.</p>\n",
    "\n",
    "<br>\n",
    "\n",
    "<b style=\"text-align: justify\"> The notebook workflow is broken into the following steps:</b>\n",
    "\n",
    "<ol>\n",
    "    <li><b><i>Import Packages</i></b>: First, the Python packages required for this workflow need to be imported (<a href=\"http://www.numpy.org/\"><code>numpy</code></a>, <a href=\"https://pandas.pydata.org/\"><code>pandas</code></a>, and <a href=\"https://cimcb.github.io/cimcb\"><code>cimcb</code></a>).\n",
    "</li>\n",
    "    <li><b><i>Load Data & Peak Sheet:</i></b> From the Excel spreadsheet, import the Data and Peak spreadsheets and create two respective <a href=\"https://pandas.pydata.org/\">Pandas</a> tables: <code>DataTable</code> and <code>PeakTable</code>.</li>\n",
    "    <li><b><i>Extract X & Y:</i></b> Next, we reduce the data in <code>DataTable</code> to include only those observations needed for the binary comparison and create a new table: <code>DataTable2</code>. We define one column of the data table to be the \"outcome\" variable <code>Outcomes</code>, and convert the class labels in this column to a binary outcome vector <code>Y</code>, where <code>1</code> is the positive outcome, and <code>0</code> the negative outcome (eg. case=1 & control=0). A new variable <code>peaklist</code> is created to hold the names (M1...Mn) of the metabolites to be used in the discriminant analysis. To create an independent dataset to evaluate, <a href=\"https://scikit-learn.org/stable/\">scikit-learn</a> module's <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\"><code>train_test_split()</code></a> function is used. The data is split into 2/3rd training (<code>DataTrain</code> and <code>YTrain</code>), and 1/3rd test (<code>DataTest</code> and <code>YTest</code>). The metabolite data corresponding to <code>peaklist</code> is extracted from <code>DataTrain</code> and placed in a matrix <code>XTrain</code>. The <code>XTrain</code> matrix is log-transformed and auto-scaled, with missing values imputed using k-nearest neighbours (k=3). Then the metabolite data corresponding to <code>peaklist</code> is extracted from <code>DataTest</code> and placed in a matrix <code>XTest</code>. The <code>XTest</code> matrix is log-transformed and auto-scaled (using mu and sigma from <code>XTrain</code>), with missing values imputed using k-nearest neighbours (k=3).\n",
    "    <li><b><i>Hyperparameter Optimisation:</i></b> Here, we use the helper function <code>cb.cross_val.KFold()</code> to carry out 5-fold cross-validation of a set of RF models configured with different numbers of maximum depths (1 to 10) and minimum sample leaf as a fraction (0 to 0.5). This helper function is generally applicable, and the values being passed to it are: \n",
    "    <ul>\n",
    "    <li>The class of model to be created by the function, <code>cb.model.RF</code>.</li>\n",
    "        <li>The metabolite matrix, <code>XTknn</code>, and binary outcome vector, <code>Y</code>.</li>\n",
    "        <li>A dictionary, <code>param_dict</code>, describing key:value pairs where the key is a parameter that is passed to the model, and the value is a list of values to be passed to that parameter.</li>\n",
    "        <li>The number of folds in the cross-validation, <code>folds</code>, and the number of Monte Carlo repetitions of the k-fold CV, <code>n_mc</code>.</li></ul>\n",
    "When <code>cv.run()</code> followed by <code>cv.plot(metric='r2q2')</code> are run the predictive ability of the multiple models across the hyperparameter grid search (<code>sample leaf</code> vs. <code>max depth</code>) are displayed in the form of heatmaps representing the parametric performance values $R^2$, $Q^2$ and $|R^2 - Q^2|$. These heatmaps are interactively linked to a scatter plot of $|R^2 - Q^2|$ vs. $Q^2$ and line plots of $R^2$ & $Q^2$ vs <code>sample leaf</code> and <code>max depth</code>. If the function <code>cv.plot(metric='auc')</code> is run the predictive ability of the models is presented as measures of the area under the ROC curve, $AUC(full)$ & $AUC(cv)$, as a nonparametric alternative to $R^2$ & $Q^2$. These multiple plots are used to aid in selecting the optimal hyperparameter values.</li>\n",
    "    <li><b><i>Build Model & Evaluate:</i></b> Here, we use the class <code>cb.model.RF()</code> to building a RF model using the optimal hyperparameter values determined in step 4. The model is trained on the training dataset, <code>XTrainKnn</code>, and tested on the independent test dataset, <code>XTestKnn</code>. Next, the trained model's <code>.evaluate()</code> method is used to visualise model performance for both the training and independent test dataset using: a <a href=\"https://www.data-to-viz.com/graph/violin.html\">violin plot</a> showing the distributions of negative and positive responses as violin and box-whisker plots; a <a href=\"https://books.google.com.au/books?id=7WBMrZ9umRYC\">probability density function</a> plot for each response type, and a <a href=\"https://doi.org/10.1007/s11306-012-0482-9\">ROC curve</a> that displays the curve for the training dataset (green) and test dataset (yellow).\n",
    "   <li><b><i>Bootstrap Evaluation:</i></b> Finally, to create an estimate of the robustness and a measure of generalised predictive ability of this model we perform  <a href=\"https://link.springer.com/article/10.1007%2FBF00058655\">bootstrap aggregation</a> (Bagging) using the helper function <code>cb.bootstrap.Per()</code> with 100 boostrapped models. This generates a population of 100 model predictions for both the training set (in-bag prediction - IB) and the holdout test set (out-of-bag - OOB) from the full dataset, with the metabolite matrix, <code>XBootKnn</code>, and binary outcome vector, <code>Y</code>. These predictions are visualised with a box-violin and probability density function plot for the aggregate model. The ROC curve displays the curve for the training dataset (green) and test dataset (yellow) from section 5 with 95% confidence intervals (light green band = IB & light yellow band = OOB).\n",
    "  <li><b><i>Export Results:</i></b> Exporting the model evaluation results as an Excel spreadsheet.</li>\n",
    "</ol> \n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": true
   },
   "source": [
    "### 1. Import Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cimcb as cb\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "print('All packages successfully loaded')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Load Data & Peak Sheet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "home = 'data/' \n",
    "file = 'ST001047.xlsx' \n",
    "\n",
    "DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Extract X & Y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean PeakTable and Extract PeakList\n",
    "RSD = PeakTable['QC_RSD']   \n",
    "PercMiss = PeakTable['Perc_missing']  \n",
    "PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]   \n",
    "PeakList = PeakTableClean['Name']  \n",
    "\n",
    "# Select Subset of Data (Class \"GC\" or \"HE\" only)\n",
    "DataTable2 = DataTable[(DataTable.Class == \"GC\") | (DataTable.Class == \"HE\")]\n",
    "\n",
    "# Create a Binary Y Vector \n",
    "Outcomes = DataTable2['Class']                                  \n",
    "Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]         \n",
    "Y = np.array(Y)   \n",
    "\n",
    "# Split Data into Train (2/3rd) and Test (1/3rd)\n",
    "DataTrain, DataTest, YTrain, YTest = train_test_split(DataTable2, Y, test_size=1/3, stratify=Y, random_state=40)\n",
    "\n",
    "# Extract Train Data                          \n",
    "XTrain = DataTrain[PeakList]                                    \n",
    "XTrainLog = np.log(XTrain)                                          \n",
    "XTrainScale, mu, sigma = cb.utils.scale(XTrainLog, method='auto', return_mu_sigma=True)              \n",
    "XTrainKnn = cb.utils.knnimpute(XTrainScale, k=3)    \n",
    "\n",
    "# Extract Test Data\n",
    "XTest = DataTest[PeakList]                                    \n",
    "XTestLog = np.log(XTest)                                          \n",
    "XTestScale = cb.utils.scale(XTestLog, method='auto', mu=mu, sigma=sigma)           \n",
    "XTestKnn = cb.utils.knnimpute(XTestScale, k=3)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Hyperparameters Optimisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Parameter Dictionary\n",
    "depth = list(range(1,11))\n",
    "leaf_asfraction = [0.01,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5]\n",
    "\n",
    "param_dict = dict(max_depth=depth,\n",
    "                  min_samples_leaf=leaf_asfraction,\n",
    "                  max_features='sqrt',\n",
    "                  criterion='gini',\n",
    "                  min_samples_split=2,\n",
    "                  max_leaf_nodes=None,\n",
    "                  n_estimators=100)\n",
    "\n",
    "# Initialise\n",
    "cv = cb.cross_val.KFold(model=cb.model.RF,                      \n",
    "                        X=XTrainKnn,                                 \n",
    "                        Y=YTrain,                               \n",
    "                        param_dict=param_dict,                   \n",
    "                        folds=5,\n",
    "                        n_mc=10)       \n",
    "                              \n",
    "\n",
    "# Run and Plot\n",
    "cv.run()  \n",
    "cv.plot(metric='auc', color_beta=[5,5,3])\n",
    "cv.plot(metric='r2q2', color_beta=[5,5,3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Build Model & Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Build Model\n",
    "model = cb.model.RF(max_depth=6,\n",
    "                    min_samples_leaf=0.15,\n",
    "                    max_features='sqrt',\n",
    "                    criterion='gini',\n",
    "                    min_samples_split=2,\n",
    "                    max_leaf_nodes=None,\n",
    "                    n_estimators=100) \n",
    "YPredTrain = model.train(XTrainKnn, YTrain)\n",
    "YPredTest = model.test(XTestKnn)\n",
    "\n",
    "# Put YTrain and YPredTrain in a List\n",
    "EvalTrain = [YTrain, YPredTrain]\n",
    "\n",
    "# Put YTest and YPrestTest in a List\n",
    "EvalTest = [YTest, YPredTest]\n",
    "\n",
    "# Evaluate Model (include Test Dataset\n",
    "model.evaluate(testset=EvalTest) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Bootstrap Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract X Data\n",
    "XBoot = DataTable2[PeakList]\n",
    "XBootLog = np.log(XBoot)\n",
    "XBootScale = cb.utils.scale(XBootLog, method='auto')\n",
    "XBootKnn = cb.utils.knnimpute(XBootScale, k=3)\n",
    "YPredBoot = model.train(XBootKnn, Y)\n",
    "\n",
    "# Build Boostrap Models\n",
    "bootmodel = cb.bootstrap.Per(model, bootnum=100) \n",
    "bootmodel.run()\n",
    "\n",
    "# Boostrap Evaluate Model (include Test Dataset)\n",
    "bootmodel.evaluate(trainset=EvalTrain, testset=EvalTest)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Export Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "home = 'results/'\n",
    "file = 'RF_ST001047.xlsx'\n",
    "\n",
    "bootmodel.save_results(home + file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  },
  "toc-autonumbering": false,
  "toc-showmarkdowntxt": false
 },
 "nbformat": 4,
 "nbformat_minor": 2
}