Bagging
Chapter 8 (Model Inference and Averaging) of The Elements of Statistical Learning.
- Bootstrap mean is approximately a posterior average.
- Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
- \(\displaystyle \hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).\)
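A minimal R sketch of this average, using rpart regression trees on simulated data; the data, B = 100, and the grid of prediction points are illustrative assumptions.

# Minimal sketch of bagging with regression trees (rpart); data are simulated
# for illustration only.
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

B <- 100                              # number of bootstrap samples
newx <- data.frame(x = seq(0, 10, length.out = 50))
preds <- matrix(NA, nrow(newx), B)

for (b in 1:B) {
  idx <- sample(n, replace = TRUE)    # draw a bootstrap sample
  fit <- rpart(y ~ x, data = dat[idx, ])
  preds[, b] <- predict(fit, newdata = newx)
}

# Bagging estimate: average the B bootstrap predictions
f_bag <- rowMeans(preds)
head(f_bag)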
Boosting
- Ch 8.2 (Bagging, Random Forests and Boosting) of An Introduction to Statistical Learning, and the accompanying code.
- An Attempt To Understand Boosting Algorithm
- gbm package. An implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart). A minimal usage sketch follows this list.
- https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
- http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and exercise
- Classification from scratch
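A minimal gbm usage sketch, as mentioned above; the simulated data and the tuning values (n.trees = 1000, shrinkage = 0.01, interaction.depth = 2) are illustrative assumptions, not recommendations.

# Minimal gbm usage sketch on simulated binary data.
library(gbm)

set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 - 0.5 * x2))
dat <- data.frame(y, x1, x2)

fit <- gbm(y ~ x1 + x2,
           data = dat,
           distribution = "bernoulli",   # logistic loss
           n.trees = 1000,
           shrinkage = 0.01,             # learning rate
           interaction.depth = 2,
           cv.folds = 5)

best <- gbm.perf(fit, method = "cv")     # CV-selected number of trees
phat <- predict(fit, newdata = dat, n.trees = best, type = "response")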
AdaBoost
AdaBoost.M1 by Freund and Schapire (1997):
The error rate on the training sample is \(\displaystyle \bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)), \) and a weak classifier is one whose error rate is only slightly better than random guessing.
Boosting sequentially applies the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \(\displaystyle G_m(x), m=1,2,\dots,M.\)
The predictions from all of them are combined through a weighted majority vote to produce the final prediction: \(\displaystyle G(x) = \operatorname{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]. \) Here \(\displaystyle \alpha_1,\alpha_2,\dots,\alpha_M\) are computed by the boosting algorithm and weight the contribution of each respective \(\displaystyle G_m(x)\). Their effect is to give higher influence to the more accurate classifiers in the sequence.
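A from-scratch sketch of AdaBoost.M1 using rpart stumps as the weak classifiers; the simulated data and M = 50 rounds are illustrative assumptions.

# From-scratch AdaBoost.M1 sketch with decision stumps (rpart, maxdepth = 1).
library(rpart)

set.seed(1)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- ifelse(x1^2 + x2^2 > 1.5, 1, -1)          # labels coded as +/-1
dat <- data.frame(x1, x2, y = factor(y))

M <- 50
w <- rep(1 / n, n)                               # observation weights
alpha <- numeric(M)
stumps <- vector("list", M)

for (m in 1:M) {
  stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1))
  pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
  err  <- sum(w * (pred != y)) / sum(w)          # weighted error rate
  alpha[m] <- log((1 - err) / err)               # classifier weight
  w <- w * exp(alpha[m] * (pred != y))           # up-weight misclassified cases
  w <- w / sum(w)
}

# Final prediction: sign of the weighted majority vote
score <- Reduce(`+`, lapply(1:M, function(m) {
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
}))
G <- sign(score)
mean(G == y)                                     # training accuracy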
Dropout regularization
DART: Dropout Regularization in Boosting Ensembles
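A hedged sketch of DART-style dropout via xgboost's "dart" booster; the simulated data and the parameter values (eta, max_depth, rate_drop, skip_drop) are illustrative assumptions.

# DART booster in xgboost: trees are randomly dropped during boosting rounds.
library(xgboost)

set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(booster   = "dart",
               objective = "binary:logistic",
               eta       = 0.1,
               max_depth = 3,
               rate_drop = 0.1,   # fraction of trees dropped each round
               skip_drop = 0.5)   # probability of skipping dropout in a round

fit <- xgb.train(params = params, data = dtrain, nrounds = 200)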
Gradient boosting
Gradient descent
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function (Wikipedia).
- An Introduction to Gradient Descent and Linear Regression Easy to understand based on simple linear regression. Code is provided too.
- Applying gradient descent – primer / refresher
- An overview of Gradient descent optimization algorithms
- A Complete Tutorial on Ridge and Lasso Regression in Python
- How to choose the learning rate?
- Machine learning from Andrew Ng
- http://scikit-learn.org/stable/modules/sgd.html
- R packages
The error function for a simple linear regression looks like
- \(\displaystyle Err(m,b) = \frac{1}{n}\sum_{i=1}^n (y_i - (m x_i + b))^2. \)
We first compute the gradient with respect to each parameter:
- \(\displaystyle \begin{align} \frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\ \frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b)) \end{align} \)
The gradient descent algorithm then updates the estimates iteratively, using a tuning parameter called the learning rate:
new_m = m_current - (learningRate * m_gradient)
new_b = b_current - (learningRate * b_gradient)
With a suitably small learning rate, the gradient moves closer to zero after each iteration. A short R implementation for the simple linear regression case is sketched below.
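This sketch implements the updates above for simple linear regression; the simulated data, learning rate (0.01) and iteration count (5000) are illustrative choices.

# Gradient descent for simple linear regression.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2.5 * x + 1 + rnorm(n)

m <- 0; b <- 0                 # initial estimates
rate <- 0.01                   # learning rate
for (iter in 1:5000) {
  resid  <- y - (m * x + b)
  m_grad <- -(2 / n) * sum(x * resid)
  b_grad <- -(2 / n) * sum(resid)
  m <- m - rate * m_grad
  b <- b - rate * b_grad
}
c(m = m, b = b)                # compare with coef(lm(y ~ x))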
Gradient descent vs Newton's method
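As a quick reminder of the distinction (standard update rules, stated here for reference): gradient descent takes steps along the negative gradient with a learning rate \(\displaystyle \eta\), \(\displaystyle x_{k+1} = x_k - \eta \nabla f(x_k),\) while Newton's method rescales the step by the inverse Hessian, \(\displaystyle x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),\) trading cheaper first-order iterations for the faster local convergence of a second-order method.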
Classification and Regression Trees (CART)
Construction of the tree classifier
- Node proportions
- \(\displaystyle p(1|t) + \dots + p(6|t) = 1, \) where \(\displaystyle p(j|t)\) denotes the node proportion, i.e. the proportion of class j cases in node t. Here we assume there are 6 classes.
- Impurity of node t
- \(\displaystyle i(t)\) is a nonnegative function \(\displaystyle \phi\) of \(\displaystyle p(1|t), \dots, p(6|t)\) such that \(\displaystyle \phi(1/6,1/6,\dots,1/6)\) is the maximum and \(\displaystyle \phi(1,0,\dots,0) = \phi(0,1,0,\dots,0) = \dots = \phi(0,0,0,0,0,1) = 0\). That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
- Impurity measures: entropy and the Gini index
- The entropy impurity is \(\displaystyle i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t),\) and the Gini index of impurity is \(\displaystyle i(t) = 1 - \sum_{j=1}^6 p(j|t)^2;\) both satisfy the conditions above.
- Goodness of the split s on node t
- \(\displaystyle \Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R), \) where \(\displaystyle p_L\) is the proportion of the cases in t that go into the left node \(\displaystyle t_L\) and \(\displaystyle p_R\) is the proportion that go into the right node \(\displaystyle t_R\).
A tree was grown in the following way: At the root node \(\displaystyle t_1\), a search was made through all candidate splits to find that split \(\displaystyle s^*\) which gave the largest decrease in impurity;
- \(\displaystyle \Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).\)
- Class character of a terminal node was determined by the plurality rule. Specifically, if \(\displaystyle p(j_0|t)=\max_j p(j|t)\), then t was designated as a class \(\displaystyle j_0\) terminal node.
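A small R sketch of these quantities, computing the node proportions, the entropy impurity above, and the goodness of split \(\displaystyle \Delta i(s,t)\) for a made-up node with 6 classes and an arbitrary candidate split x <= 0.

# Node proportions, entropy impurity and goodness of split for a made-up node.
impurity <- function(labels, classes = 1:6) {
  p <- table(factor(labels, levels = classes)) / length(labels)
  p <- p[p > 0]
  -sum(p * log(p))                       # i(t) = -sum_j p(j|t) log p(j|t)
}

set.seed(1)
cls <- sample(1:6, 100, replace = TRUE)  # class of each case in node t
x   <- rnorm(100)                        # a candidate split variable
s   <- 0                                 # candidate split point: x <= 0

left  <- cls[x <= s]
right <- cls[x >  s]
pL <- length(left)  / length(cls)
pR <- length(right) / length(cls)

delta_i <- impurity(cls) - pL * impurity(left) - pR * impurity(right)
delta_i                                  # goodness of the split s on node t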
R packages
Partially additive (generalized) linear model trees
Supervised Classification, Logistic and Multinomial
Variable selection
Review
Variable selection – A review and recommendations for the practicing statistician by Heinze et al 2018.
Variable selection and variable importance plot
Variable selection and cross-validation
- http://freakonometrics.hypotheses.org/19925
- http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
Mallows's Cp
Mallows's Cp addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the mean squared prediction error (MSPE).
- \(\displaystyle E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2, \)
The Cp statistic is defined as
- \(\displaystyle C_p={SSE_p \over S^2} - N + 2P, \) where \(\displaystyle SSE_p\) is the error sum of squares for the model with P parameters (including the intercept), \(\displaystyle S^2\) is the residual mean square from the full model, and N is the sample size. A short R sketch after this list illustrates the computation.
- https://en.wikipedia.org/wiki/Mallows%27s_Cp
- Used in Yuan & Lin (2006) for the group lasso. The degrees of freedom are estimated by bootstrap or perturbation methods. Their paper notes that the performance is comparable to that of 5-fold CV but is computationally much faster.
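The R sketch referred to above, computing Cp for candidate submodels of a simulated full model; the data are illustrative, and \(\displaystyle S^2\) is taken from the full model as in the definition.

# Mallows's Cp for candidate submodels of a simulated full model.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)         # x3 is pure noise

full <- lm(y ~ x1 + x2 + x3)
S2   <- summary(full)$sigma^2            # residual mean square of full model

Cp <- function(fit) {
  SSEp <- sum(residuals(fit)^2)
  P    <- length(coef(fit))              # number of parameters incl. intercept
  SSEp / S2 - n + 2 * P
}

Cp(lm(y ~ x1))            # underfits: Cp much larger than P expected
Cp(lm(y ~ x1 + x2))       # close to P = 3 expected
Cp(lm(y ~ x1 + x2 + x3))  # equals P = 4 by construction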
Variable selection for mode regression
http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of Applied Statistics, June 2017.
Neural network
- Build your own neural network in R
- (Video) 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code from the Coding Train. The book The Nature of Code by Daniel Shiffman.
- Classification from scratch, neural nets. The ROCR package was used to produce the ROC curve.
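In the spirit of the "from scratch" links above, a tiny perceptron sketch in R; the linearly separable data, the learning rate, and the number of epochs are made-up assumptions.

# Tiny perceptron trained with the classic perceptron update rule.
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))      # add a bias column
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)        # separable labels in {-1, 1}

w <- rep(0, 3)                                 # weights (incl. bias)
eta <- 0.1                                     # learning rate
for (epoch in 1:20) {
  for (i in 1:n) {
    if (y[i] * sum(w * X[i, ]) <= 0) {         # misclassified (or on boundary)
      w <- w + eta * y[i] * X[i, ]             # perceptron update
    }
  }
}
mean(sign(X %*% w) == y)                       # training accuracy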
Support vector machine (SVM)
- Improve SVM tuning through parallelism by using the foreach and doParallel packages.
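A hedged sketch of that idea using e1071, foreach and doParallel; the iris data and the (cost, gamma) grid are illustrative assumptions.

# Parallel grid search over SVM tuning parameters with foreach/doParallel.
library(e1071)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

grid <- expand.grid(cost = 10^(-1:2), gamma = 10^(-2:0))

cv_err <- foreach(i = seq_len(nrow(grid)), .combine = c,
                  .packages = "e1071") %dopar% {
  fit <- svm(Species ~ ., data = iris, kernel = "radial",
             cost = grid$cost[i], gamma = grid$gamma[i], cross = 5)
  100 - fit$tot.accuracy                # 5-fold CV error (%)
}

stopImplicitCluster()
grid[which.min(cv_err), ]               # best (cost, gamma) on this grid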
Quadratic Discriminant Analysis (qda), KNN
Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN
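A small sketch of QDA (MASS::qda) and KNN (class::knn) in R, using the iris data rather than the stock market data from the linked post; the train/test split and k = 3 are arbitrary choices.

# QDA and KNN on a random train/test split of iris.
library(MASS)
library(class)

set.seed(1)
train <- sample(nrow(iris), 100)

# Quadratic discriminant analysis
qfit  <- qda(Species ~ ., data = iris, subset = train)
qpred <- predict(qfit, iris[-train, ])$class
mean(qpred == iris$Species[-train])            # QDA test accuracy

# K-nearest neighbours with k = 3
kpred <- knn(train = iris[train, 1:4], test = iris[-train, 1:4],
             cl = iris$Species[train], k = 3)
mean(kpred == iris$Species[-train])            # KNN test accuracy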
Regularization
Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting (Wikipedia).
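For example, ridge and lasso regression add an L2 or L1 penalty to the least-squares criterion; a glmnet sketch on simulated data, for illustration only.

# Ridge (alpha = 0) and lasso (alpha = 1) as penalized least squares via glmnet.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))  # only 3 true signals

ridge <- cv.glmnet(X, y, alpha = 0)   # L2 penalty: coefficients shrunk toward 0
lasso <- cv.glmnet(X, y, alpha = 1)   # L1 penalty: some coefficients set to 0

coef(lasso, s = "lambda.min")         # many coefficients are exactly zero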