|
|
Line 1: |
Line 1: |
|
| |
|
| == Bagging ==
| |
| Chapter 8 of the book.
| |
|
| |
| * Bootstrap mean is approximately a posterior average.
| |
| * Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
| |
| :<math>\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).</math>
| |
|
| |
| [https://statcompute.wordpress.com/2016/01/02/where-bagging-might-work-better-than-boosting/ Where Bagging Might Work Better Than Boosting]
| |
|
| |
| [https://freakonometrics.hypotheses.org/52777 CLASSIFICATION FROM SCRATCH, BAGGING AND FORESTS 10/8]
| |
|
| |
| == Boosting ==
| |
| * Ch8.2 Bagging, Random Forests and Boosting of [http://www-bcf.usc.edu/~gareth/ISL/ An Introduction to Statistical Learning] and the [http://www-bcf.usc.edu/~gareth/ISL/Chapter%208%20Lab.txt code].
| |
| * [http://freakonometrics.hypotheses.org/19874 An Attempt To Understand Boosting Algorithm]
| |
| * [http://cran.r-project.org/web/packages/gbm/index.html gbm] package. An implementation of extensions to Freund and Schapire's '''AdaBoost algorithm''' and Friedman's '''gradient boosting machine'''. Includes regression methods for least squares, absolute loss, t-distribution loss, [http://mathewanalytics.com/2015/11/13/applied-statistical-theory-quantile-regression/ quantile regression], logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart).
| |
| * https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
| |
| * http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and [http://www.is.uni-freiburg.de/ressourcen/business-analytics/homework_ensemblelearning_questions.pdf exercise]
| |
| * [https://freakonometrics.hypotheses.org/52782 Classification from scratch]
| |
|
| |
| === AdaBoost ===
| |
| AdaBoost.M1 by Freund and Schapire (1997):
| |
|
| |
| The error rate on the training sample is
| |
| <math>
| |
| \bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)),
| |
| </math>
| |
|
| |
| Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers <math>G_m(x), m=1,2,\dots,M.</math>
| |
|
| |
| The predictions from all of them are combined through a weighted majority vote to produce the final prediction:
| |
| <math>
| |
| G(x) = sign[\sum_{m=1}^M \alpha_m G_m(x)].
| |
| </math>
| |
| Here <math> \alpha_1,\alpha_2,\dots,\alpha_M</math> are computed by the boosting algorithm and weight the contribution of each respective <math>G_m(x)</math>. Their effect is to give higher influence to the more accurate classifiers in the sequence.
| |
|
| |
| === Dropout regularization ===
| |
| [https://statcompute.wordpress.com/2017/08/20/dart-dropout-regularization-in-boosting-ensembles/ DART: Dropout Regularization in Boosting Ensembles]
| |
|
| |
| === Gradient boosting ===
| |
| * https://en.wikipedia.org/wiki/Gradient_boosting
| |
| * [https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ Machine Learning Basics - Gradient Boosting & XGBoost]
| |
| * [http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/ Gradient Boosting Essentials in R Using XGBOOST]
| |
|
| |
| == Gradient descent ==
| |
| Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function ([https://en.wikipedia.org/wiki/Gradient_descent Wikipedia]).
| |
|
| |
| * [https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ An Introduction to Gradient Descent and Linear Regression] Easy to understand based on simple linear regression. Code is provided too.
| |
| * [http://gradientdescending.com/applying-gradient-descent-primer-refresher/ Applying gradient descent – primer / refresher]
| |
| * [http://sebastianruder.com/optimizing-gradient-descent/index.html An overview of Gradient descent optimization algorithms]
| |
| * [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ A Complete Tutorial on Ridge and Lasso Regression in Python]
| |
| * How to choose the learning rate?
| |
| ** [http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html Machine learning] from Andrew Ng
| |
| ** http://scikit-learn.org/stable/modules/sgd.html
| |
| * R packages
| |
| ** https://cran.r-project.org/web/packages/gradDescent/index.html, https://www.rdocumentation.org/packages/gradDescent/versions/2.0
| |
| ** https://cran.r-project.org/web/packages/sgd/index.html
| |
|
| |
| The error function from a simple linear regression looks like
| |
| : <math>
| |
| \begin{align}
| |
| Err(m,b) &= \frac{1}{N}\sum_{i=1}^n (y_i - (m x_i + b))^2, \\
| |
| \end{align}
| |
| </math>
| |
|
| |
| We compute the gradient first for each parameters.
| |
| : <math>
| |
| \begin{align}
| |
| \frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\
| |
| \frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b))
| |
| \end{align}
| |
| </math>
| |
|
| |
| The gradient descent algorithm uses an iterative method to update the estimates using a tuning parameter called '''learning rate'''.
| |
| <pre>
| |
| new_m &= m_current - (learningRate * m_gradient)
| |
| new_b &= b_current - (learningRate * b_gradient)
| |
| </pre>
| |
|
| |
| After each iteration, derivative is closer to zero. [http://blog.hackerearth.com/gradient-descent-algorithm-linear-regression Coding in R] for the simple linear regression.
| |
|
| |
| === Gradient descent vs Newton's method ===
| |
| * [https://stackoverflow.com/a/12066869 What is the difference between Gradient Descent and Newton's Gradient Descent?]
| |
| * [http://www.santanupattanayak.com/2017/12/19/newtons-method-vs-gradient-descent-method-in-tacking-saddle-points-in-non-convex-optimization/ Newton's Method vs Gradient Descent Method in tacking saddle points in Non-Convex Optimization]
| |
| * [https://dinh-hung-tu.github.io/gradient-descent-vs-newton-method/ Gradient Descent vs Newton Method]
| |
|
| |
| == Classification and Regression Trees (CART) ==
| |
| === Construction of the tree classifier ===
| |
| * Node proportion
| |
| :<math> p(1|t) + \dots + p(6|t) =1 </math> where <math>p(j|t)</math> define the node proportions (class proportion of class ''j'' on node ''t''. Here we assume there are 6 classes.
| |
| * Impurity of node t
| |
| :<math>i(t)</math> is a nonnegative function <math>\phi</math> of the <math>p(1|t), \dots, p(6|t)</math> such that <math> \phi(1/6,1/6,\dots,1/6)</math> = maximumm <math>\phi(1,0,\dots,0)=0, \phi(0,1,0,\dots,0)=0, \dots, \phi(0,0,0,0,0,1)=0</math>. That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
| |
| * Gini index of impurity
| |
| :<math>i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t).</math>
| |
| * Goodness of the split s on node t
| |
| :<math>\Delta i(s, t) = i(t) -p_Li(t_L) - p_Ri(t_R). </math> where <math>p_R</math> are the proportion of the cases in t go into the left node <math>t_L</math> and a proportion <math>p_R</math> go into right node <math>t_R</math>.
| |
| A tree was grown in the following way: At the root node <math>t_1</math>, a search was made through all candidate splits to find that split <math>s^*</math> which gave the largest decrease in impurity;
| |
| :<math>\Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).</math>
| |
| * Class character of a terminal node was determined by the plurality rule. Specifically, if <math>p(j_0|t)=\max_j p(j|t)</math>, then ''t'' was designated as a class <math>j_0</math> terminal node.
| |
|
| |
| === R packages ===
| |
| * [http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf rpart]
| |
| * http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html
| |
|
| |
| == Partially additive (generalized) linear model trees ==
| |
| * https://eeecon.uibk.ac.at/~zeileis/news/palmtree/
| |
| * https://cran.r-project.org/web/packages/palmtree/index.html
| |
|
| |
| == Supervised Classification, Logistic and Multinomial ==
| |
| * http://freakonometrics.hypotheses.org/19230
| |
|
| |
| == Variable selection ==
| |
| === Review ===
| |
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al 2018.
| |
|
| |
| === Variable selection and variable importance plot ===
| |
| * http://freakonometrics.hypotheses.org/19835
| |
|
| |
| === Variable selection and cross-validation ===
| |
| * http://freakonometrics.hypotheses.org/19925
| |
| * http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
| |
|
| |
| === Mallow ''C<sub>p</sub>'' ===
| |
| Mallows's ''C<sub>p</sub>'' addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the '''mean squared prediction error (MSPE)'''.
| |
| :<math>
| |
| E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2,
| |
| </math>
| |
| The ''C<sub>p</sub>'' statistic is defined as
| |
| :<math> C_p={SSE_p \over S^2} - N + 2P. </math>
| |
|
| |
| * https://en.wikipedia.org/wiki/Mallows%27s_Cp
| |
| * Used in Yuan & Lin (2006) group lasso. The degrees of freedom is estimated by the bootstrap or perturbation methods. Their paper mentioned the performance is comparable with that of 5-fold CV but is computationally much faster.
| |
|
| |
| === Variable selection for mode regression ===
| |
| http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of applied statistics ,June 2017
| |
|
| |
| == Neural network ==
| |
| * [http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r Build your own neural network in R]
| |
| * (Video) [https://youtu.be/ntKn5TPHHAk 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code] from the Coding Train. The book [http://natureofcode.com/book/chapter-10-neural-networks/ THE NATURE OF CODE] by DANIEL SHIFFMAN
| |
| * [https://freakonometrics.hypotheses.org/52774 CLASSIFICATION FROM SCRATCH, NEURAL NETS]. The ROCR package was used to produce the ROC curve.
| |
|
| |
| == Support vector machine (SVM) ==
| |
| * [https://statcompute.wordpress.com/2016/03/19/improve-svm-tuning-through-parallelism/ Improve SVM tuning through parallelism] by using the '''foreach''' and '''doParallel''' packages.
| |
|
| |
| == Quadratic Discriminant Analysis (qda), KNN ==
| |
| [https://datarvalue.blogspot.com/2017/05/machine-learning-stock-market-data-part_16.html Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN]
| |
|
| |
| == [https://en.wikipedia.org/wiki/Regularization_(mathematics) Regularization] ==
| |
| Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting
| |