Bagging
Chapter 8 (Model Inference and Averaging) of The Elements of Statistical Learning.
- Bootstrap mean is approximately a posterior average.
- Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
- \(\displaystyle \hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).\)
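A minimal R sketch of this average, using rpart regression trees on simulated data; the data, B = 100, and the grid of prediction points are illustrative assumptions.

# Minimal sketch of bagging with regression trees (rpart); data are simulated
# for illustration only.
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

B <- 100                              # number of bootstrap samples
newx <- data.frame(x = seq(0, 10, length.out = 50))
preds <- matrix(NA, nrow(newx), B)

for (b in 1:B) {
  idx <- sample(n, replace = TRUE)    # draw a bootstrap sample
  fit <- rpart(y ~ x, data = dat[idx, ])
  preds[, b] <- predict(fit, newdata = newx)
}

# Bagging estimate: average the B bootstrap predictions
f_bag <- rowMeans(preds)
head(f_bag)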
Boosting
- Ch 8.2 (Bagging, Random Forests and Boosting) of An Introduction to Statistical Learning, and the accompanying code.
- An Attempt To Understand Boosting Algorithm
- gbm package. An implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart). A minimal usage sketch follows this list.
- https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
- http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and exercise
- Classification from scratch
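A minimal gbm usage sketch, as mentioned above; the simulated data and the tuning values (n.trees = 1000, shrinkage = 0.01, interaction.depth = 2) are illustrative assumptions, not recommendations.

# Minimal gbm usage sketch on simulated binary data.
library(gbm)

set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 - 0.5 * x2))
dat <- data.frame(y, x1, x2)

fit <- gbm(y ~ x1 + x2,
           data = dat,
           distribution = "bernoulli",   # logistic loss
           n.trees = 1000,
           shrinkage = 0.01,             # learning rate
           interaction.depth = 2,
           cv.folds = 5)

best <- gbm.perf(fit, method = "cv")     # CV-selected number of trees
phat <- predict(fit, newdata = dat, n.trees = best, type = "response")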
AdaBoost
AdaBoost.M1 by Freund and Schapire (1997):
The error rate on the training sample is \(\displaystyle \bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)), \) and a weak classifier is one whose error rate is only slightly better than random guessing.
Boosting sequentially applies the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \(\displaystyle G_m(x), m=1,2,\dots,M.\)
The predictions from all of them are combined through a weighted majority vote to produce the final prediction: \(\displaystyle G(x) = \operatorname{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]. \) Here \(\displaystyle \alpha_1,\alpha_2,\dots,\alpha_M\) are computed by the boosting algorithm and weight the contribution of each respective \(\displaystyle G_m(x)\). Their effect is to give higher influence to the more accurate classifiers in the sequence.
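A from-scratch sketch of AdaBoost.M1 using rpart stumps as the weak classifiers; the simulated data and M = 50 rounds are illustrative assumptions.

# From-scratch AdaBoost.M1 sketch with decision stumps (rpart, maxdepth = 1).
library(rpart)

set.seed(1)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- ifelse(x1^2 + x2^2 > 1.5, 1, -1)          # labels coded as +/-1
dat <- data.frame(x1, x2, y = factor(y))

M <- 50
w <- rep(1 / n, n)                               # observation weights
alpha <- numeric(M)
stumps <- vector("list", M)

for (m in 1:M) {
  stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1))
  pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
  err  <- sum(w * (pred != y)) / sum(w)          # weighted error rate
  alpha[m] <- log((1 - err) / err)               # classifier weight
  w <- w * exp(alpha[m] * (pred != y))           # up-weight misclassified cases
  w <- w / sum(w)
}

# Final prediction: sign of the weighted majority vote
score <- Reduce(`+`, lapply(1:M, function(m) {
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
}))
G <- sign(score)
mean(G == y)                                     # training accuracy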
Dropout regularization
DART: Dropout Regularization in Boosting Ensembles
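A hedged sketch of DART-style dropout via xgboost's "dart" booster; the simulated data and the parameter values (eta, max_depth, rate_drop, skip_drop) are illustrative assumptions.

# DART booster in xgboost: trees are randomly dropped during boosting rounds.
library(xgboost)

set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(booster   = "dart",
               objective = "binary:logistic",
               eta       = 0.1,
               max_depth = 3,
               rate_drop = 0.1,   # fraction of trees dropped each round
               skip_drop = 0.5)   # probability of skipping dropout in a round

fit <- xgb.train(params = params, data = dtrain, nrounds = 200)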
Gradient boosting
Gradient descent
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function (Wikipedia).
- An Introduction to Gradient Descent and Linear Regression Easy to understand based on simple linear regression. Code is provided too.
- Applying gradient descent – primer / refresher
- An overview of Gradient descent optimization algorithms
- A Complete Tutorial on Ridge and Lasso Regression in Python
- How to choose the learning rate?
- Machine learning from Andrew Ng
- http://scikit-learn.org/stable/modules/sgd.html
- R packages
The error function for a simple linear regression looks like
- \(\displaystyle Err(m,b) = \frac{1}{n}\sum_{i=1}^n (y_i - (m x_i + b))^2. \)
We first compute the gradient with respect to each parameter:
- \(\displaystyle \begin{align} \frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\ \frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b)) \end{align} \)
The gradient descent algorithm then updates the estimates iteratively, using a tuning parameter called the learning rate:
new_m = m_current - (learningRate * m_gradient)
new_b = b_current - (learningRate * b_gradient)
With a suitably small learning rate, the gradient moves closer to zero after each iteration. A short R implementation for the simple linear regression case is sketched below.
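This sketch implements the updates above for simple linear regression; the simulated data, learning rate (0.01) and iteration count (5000) are illustrative choices.

# Gradient descent for simple linear regression.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2.5 * x + 1 + rnorm(n)

m <- 0; b <- 0                 # initial estimates
rate <- 0.01                   # learning rate
for (iter in 1:5000) {
  resid  <- y - (m * x + b)
  m_grad <- -(2 / n) * sum(x * resid)
  b_grad <- -(2 / n) * sum(resid)
  m <- m - rate * m_grad
  b <- b - rate * b_grad
}
c(m = m, b = b)                # compare with coef(lm(y ~ x))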
Gradient descent vs Newton's method
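As a quick reminder of the distinction (standard update rules, stated here for reference): gradient descent takes steps along the negative gradient with a learning rate \(\displaystyle \eta\), \(\displaystyle x_{k+1} = x_k - \eta \nabla f(x_k),\) while Newton's method rescales the step by the inverse Hessian, \(\displaystyle x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),\) trading cheaper first-order iterations for the faster local convergence of a second-order method.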
Classification and Regression Trees (CART)
Construction of the tree classifier
- Node proportions
- \(\displaystyle p(1|t) + \dots + p(6|t) = 1, \) where \(\displaystyle p(j|t)\) denotes the node proportion, i.e. the proportion of class j cases in node t. Here we assume there are 6 classes.
- Impurity of node t
- \(\displaystyle i(t)\) is a nonnegative function \(\displaystyle \phi\) of \(\displaystyle p(1|t), \dots, p(6|t)\) such that \(\displaystyle \phi(1/6,1/6,\dots,1/6)\) is the maximum and \(\displaystyle \phi(1,0,\dots,0) = \phi(0,1,0,\dots,0) = \dots = \phi(0,0,0,0,0,1) = 0\). That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
- Impurity measures: entropy and the Gini index
- The entropy impurity is \(\displaystyle i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t),\) and the Gini index of impurity is \(\displaystyle i(t) = 1 - \sum_{j=1}^6 p(j|t)^2;\) both satisfy the conditions above.
- Goodness of the split s on node t
- \(\displaystyle \Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R), \) where \(\displaystyle p_L\) is the proportion of the cases in t that go into the left node \(\displaystyle t_L\) and \(\displaystyle p_R\) is the proportion that go into the right node \(\displaystyle t_R\).
A tree was grown in the following way: At the root node \(\displaystyle t_1\), a search was made through all candidate splits to find that split \(\displaystyle s^*\) which gave the largest decrease in impurity;
- \(\displaystyle \Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).\)
- Class character of a terminal node was determined by the plurality rule. Specifically, if \(\displaystyle p(j_0|t)=\max_j p(j|t)\), then t was designated as a class \(\displaystyle j_0\) terminal node.
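A small R sketch of these quantities, computing the node proportions, the entropy impurity above, and the goodness of split \(\displaystyle \Delta i(s,t)\) for a made-up node with 6 classes and an arbitrary candidate split x <= 0.

# Node proportions, entropy impurity and goodness of split for a made-up node.
impurity <- function(labels, classes = 1:6) {
  p <- table(factor(labels, levels = classes)) / length(labels)
  p <- p[p > 0]
  -sum(p * log(p))                       # i(t) = -sum_j p(j|t) log p(j|t)
}

set.seed(1)
cls <- sample(1:6, 100, replace = TRUE)  # class of each case in node t
x   <- rnorm(100)                        # a candidate split variable
s   <- 0                                 # candidate split point: x <= 0

left  <- cls[x <= s]
right <- cls[x >  s]
pL <- length(left)  / length(cls)
pR <- length(right) / length(cls)

delta_i <- impurity(cls) - pL * impurity(left) - pR * impurity(right)
delta_i                                  # goodness of the split s on node t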
R packages
Partially additive (generalized) linear model trees
Supervised Classification, Logistic and Multinomial
Variable selection
Review
Variable selection – A review and recommendations for the practicing statistician by Heinze et al 2018.
Variable selection and variable importance plot
Variable selection and cross-validation
- http://freakonometrics.hypotheses.org/19925
- http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
Mallows's Cp
Mallows's Cp addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the mean squared prediction error (MSPE).
- \(\displaystyle E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2, \)
The Cp statistic is defined as
- \(\displaystyle C_p={SSE_p \over S^2} - N + 2P, \) where \(\displaystyle SSE_p\) is the error sum of squares for the model with P parameters (including the intercept), \(\displaystyle S^2\) is the residual mean square from the full model, and N is the sample size. A short R sketch after this list illustrates the computation.
- https://en.wikipedia.org/wiki/Mallows%27s_Cp
- Used in Yuan & Lin (2006) for the group lasso. The degrees of freedom are estimated by bootstrap or perturbation methods. Their paper notes that the performance is comparable to that of 5-fold CV but is computationally much faster.
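The R sketch referred to above, computing Cp for candidate submodels of a simulated full model; the data are illustrative, and \(\displaystyle S^2\) is taken from the full model as in the definition.

# Mallows's Cp for candidate submodels of a simulated full model.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)         # x3 is pure noise

full <- lm(y ~ x1 + x2 + x3)
S2   <- summary(full)$sigma^2            # residual mean square of full model

Cp <- function(fit) {
  SSEp <- sum(residuals(fit)^2)
  P    <- length(coef(fit))              # number of parameters incl. intercept
  SSEp / S2 - n + 2 * P
}

Cp(lm(y ~ x1))            # underfits: Cp much larger than P expected
Cp(lm(y ~ x1 + x2))       # close to P = 3 expected
Cp(lm(y ~ x1 + x2 + x3))  # equals P = 4 by construction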
Variable selection for mode regression
http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of Applied Statistics, June 2017.
Neural network
- Build your own neural network in R
- (Video) 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code from the Coding Train. The book The Nature of Code by Daniel Shiffman.
- Classification from scratch, neural nets. The ROCR package was used to produce the ROC curve.
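In the spirit of the "from scratch" links above, a tiny perceptron sketch in R; the linearly separable data, the learning rate, and the number of epochs are made-up assumptions.

# Tiny perceptron trained with the classic perceptron update rule.
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))      # add a bias column
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)        # separable labels in {-1, 1}

w <- rep(0, 3)                                 # weights (incl. bias)
eta <- 0.1                                     # learning rate
for (epoch in 1:20) {
  for (i in 1:n) {
    if (y[i] * sum(w * X[i, ]) <= 0) {         # misclassified (or on boundary)
      w <- w + eta * y[i] * X[i, ]             # perceptron update
    }
  }
}
mean(sign(X %*% w) == y)                       # training accuracy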
Support vector machine (SVM)
- Improve SVM tuning through parallelism by using the foreach and doParallel packages.
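A hedged sketch of that idea using e1071, foreach and doParallel; the iris data and the (cost, gamma) grid are illustrative assumptions.

# Parallel grid search over SVM tuning parameters with foreach/doParallel.
library(e1071)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

grid <- expand.grid(cost = 10^(-1:2), gamma = 10^(-2:0))

cv_err <- foreach(i = seq_len(nrow(grid)), .combine = c,
                  .packages = "e1071") %dopar% {
  fit <- svm(Species ~ ., data = iris, kernel = "radial",
             cost = grid$cost[i], gamma = grid$gamma[i], cross = 5)
  100 - fit$tot.accuracy                # 5-fold CV error (%)
}

stopImplicitCluster()
grid[which.min(cv_err), ]               # best (cost, gamma) on this grid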
Quadratic Discriminant Analysis (qda), KNN
Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN
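A small sketch of QDA (MASS::qda) and KNN (class::knn) in R, using the iris data rather than the stock market data from the linked post; the train/test split and k = 3 are arbitrary choices.

# QDA and KNN on a random train/test split of iris.
library(MASS)
library(class)

set.seed(1)
train <- sample(nrow(iris), 100)

# Quadratic discriminant analysis
qfit  <- qda(Species ~ ., data = iris, subset = train)
qpred <- predict(qfit, iris[-train, ])$class
mean(qpred == iris$Species[-train])            # QDA test accuracy

# K-nearest neighbours with k = 3
kpred <- knn(train = iris[train, 1:4], test = iris[-train, 1:4],
             cl = iris$Species[train], k = 3)
mean(kpred == iris$Species[-train])            # KNN test accuracy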
Regularization
Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting (Wikipedia).
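For example, ridge and lasso regression add an L2 or L1 penalty to the least-squares criterion; a glmnet sketch on simulated data, for illustration only.

# Ridge (alpha = 0) and lasso (alpha = 1) as penalized least squares via glmnet.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))  # only 3 true signals

ridge <- cv.glmnet(X, y, alpha = 0)   # L2 penalty: coefficients shrunk toward 0
lasso <- cv.glmnet(X, y, alpha = 1)   # L1 penalty: some coefficients set to 0

coef(lasso, s = "lambda.min")         # many coefficients are exactly zero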