Testmathj

Bagging

Chapter 8 of the book.

  • Bootstrap mean is approximately a posterior average.
  • Bootstrap aggregation, or bagging, averages the prediction over a collection of bootstrap samples, thereby reducing its variance (see the R sketch below). The bagging estimate is defined by
\(\displaystyle \hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).\)
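
A minimal R sketch of this estimate, assuming rpart trees as the base learner, B = 100 bootstrap samples, and mtcars purely as illustrative data:

library(rpart)

B <- 100
n <- nrow(mtcars)

# Fit one tree per bootstrap sample: f^{*b}
fits <- lapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)          # bootstrap sample of the rows
  rpart(mpg ~ ., data = mtcars[idx, ])
})

# Bagging estimate: average the B predictions at each x
preds <- sapply(fits, predict, newdata = mtcars)   # n x B matrix of f^{*b}(x)
f_bag <- rowMeans(preds)

Because each tree is fit to a different bootstrap sample, averaging reduces the variance of the prediction without changing its bias much.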

Where Bagging Might Work Better Than Boosting

Classification from scratch, bagging and forests 10/8

Boosting

AdaBoost

AdaBoost.M1 by Freund and Schapire (1997):

The error rate on the training sample is \(\displaystyle \overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)). \)

Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \(\displaystyle G_m(x), m=1,2,\dots,M.\)

The predictions from all of them are combined through a weighted majority vote to produce the final prediction: \(\displaystyle G(x) = \operatorname{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]. \) Here \(\displaystyle \alpha_1,\alpha_2,\dots,\alpha_M\) are computed by the boosting algorithm and weight the contribution of each respective \(\displaystyle G_m(x)\). Their effect is to give higher influence to the more accurate classifiers in the sequence.
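
A minimal R sketch of AdaBoost.M1, assuming rpart stumps as the weak learner, class labels coded as -1/+1, and an illustrative number of rounds M; the function names here are made up for this example:

library(rpart)

adaboost_m1 <- function(X, y, M = 50) {
  n <- length(y)
  w <- rep(1 / n, n)                         # observation weights w_i
  fits <- vector("list", M)
  alpha <- numeric(M)
  df <- data.frame(X, y = factor(y))
  for (m in 1:M) {
    fits[[m]] <- rpart(y ~ ., data = df, weights = w,
                       control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(fits[[m]], df, type = "class")))
    err  <- sum(w * (pred != y)) / sum(w)    # weighted training error of G_m
    err  <- min(max(err, 1e-10), 1 - 1e-10)  # keep alpha_m finite in this sketch
    alpha[m] <- log((1 - err) / err)         # classifier weight alpha_m
    w <- w * exp(alpha[m] * (pred != y))     # up-weight misclassified observations
    w <- w / sum(w)
  }
  list(fits = fits, alpha = alpha)
}

# Weighted majority vote: G(x) = sign(sum_m alpha_m G_m(x))
predict_adaboost <- function(model, Xnew) {
  score <- rowSums(mapply(function(fit, a) {
    a * as.numeric(as.character(predict(fit, data.frame(Xnew), type = "class")))
  }, model$fits, model$alpha))
  sign(score)
}

The weight update up-weights the observations misclassified by \(\displaystyle G_m\), so the next stump concentrates on the harder cases, and \(\displaystyle \alpha_m\) is larger for the more accurate stumps.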

Dropout regularization

DART: Dropout Regularization in Boosting Ensembles

Gradient boosting

Gradient descent

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function (Wikipedia).

The error function for simple linear regression is

\(\displaystyle Err(m,b) = \frac{1}{n}\sum_{i=1}^n \big(y_i - (m x_i + b)\big)^2. \)

We first compute the gradient with respect to each parameter:

\(\displaystyle \begin{align} \frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\ \frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b)) \end{align} \)

Gradient descent then iteratively updates the estimates using a tuning parameter called the learning rate:

new_m = m_current - learningRate * m_gradient
new_b = b_current - learningRate * b_gradient

After each iteration, the gradient moves closer to zero. Below is a sketch of this in R for simple linear regression.
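
This is a minimal sketch of the update rules above; the simulated data, learning rate, and iteration count are illustrative choices.

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.3)       # true intercept 2, slope 3

m <- 0; b <- 0                              # initial estimates
learningRate <- 0.1
n <- length(y)

for (iter in 1:5000) {
  resid      <- y - (m * x + b)
  m_gradient <- (2 / n) * sum(-x * resid)   # dErr/dm
  b_gradient <- (2 / n) * sum(-resid)       # dErr/db
  m <- m - learningRate * m_gradient
  b <- b - learningRate * b_gradient
}

c(m = m, b = b)                             # close to coef(lm(y ~ x))

The loop stops after a fixed number of iterations; in practice one would also monitor the change in Err(m, b).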

Gradient descent vs Newton's method

Classification and Regression Trees (CART)

Construction of the tree classifier

  • Node proportion
\(\displaystyle p(1|t) + \dots + p(6|t) = 1, \) where \(\displaystyle p(j|t)\) denotes the node proportion (the proportion of class j cases in node t). Here we assume there are 6 classes.
  • Impurity of node t
\(\displaystyle i(t)\) is a nonnegative function \(\displaystyle \phi\) of \(\displaystyle p(1|t), \dots, p(6|t)\) such that \(\displaystyle \phi(1/6,1/6,\dots,1/6)\) is the maximum and \(\displaystyle \phi(1,0,\dots,0) = \phi(0,1,0,\dots,0) = \dots = \phi(0,0,0,0,0,1) = 0\). That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
  • Gini index of impurity
\(\displaystyle i(t) = \sum_{j \neq k} p(j|t)\,p(k|t) = 1 - \sum_{j=1}^6 p(j|t)^2.\) (The entropy \(\displaystyle -\sum_{j=1}^6 p(j|t) \log p(j|t)\) is a common alternative impurity measure.)
  • Goodness of the split s on node t
\(\displaystyle \Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R), \) where \(\displaystyle p_L\) is the proportion of the cases in t that go into the left child node \(\displaystyle t_L\) and \(\displaystyle p_R\) is the proportion that go into the right child node \(\displaystyle t_R\).

A tree was grown in the following way: At the root node \(\displaystyle t_1\), a search was made through all candidate splits to find that split \(\displaystyle s^*\) which gave the largest decrease in impurity;

\(\displaystyle \Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).\)
  • Class character of a terminal node was determined by the plurality rule. Specifically, if \(\displaystyle p(j_0|t)=\max_j p(j|t)\), then t was designated as a class \(\displaystyle j_0\) terminal node.
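
A small R sketch of the node-proportion, impurity, and goodness-of-split formulas above, using the Gini index; the iris split shown is only illustrative:

node_prop <- function(y) table(y) / length(y)     # p(j|t)

gini <- function(y) {                             # i(t) = 1 - sum_j p(j|t)^2
  p <- node_prop(y)
  1 - sum(p^2)
}

goodness <- function(y, go_left) {                # Delta i(s, t)
  pL <- mean(go_left); pR <- 1 - pL
  gini(y) - pL * gini(y[go_left]) - pR * gini(y[!go_left])
}

# Example: decrease in impurity for the root split Petal.Length < 2.5
goodness(iris$Species, iris$Petal.Length < 2.5)

Growing a tree amounts to evaluating this goodness for every candidate split at a node and keeping the split with the largest decrease in impurity.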

R packages

Partially additive (generalized) linear model trees

Supervised Classification, Logistic and Multinomial

Variable selection

Review

Variable selection – A review and recommendations for the practicing statistician by Heinze et al. (2018).

Variable selection and variable importance plot

Variable selection and cross-validation

Mallow Cp

Mallows's Cp addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the mean squared prediction error (MSPE).

\(\displaystyle E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2. \)

The Cp statistic is defined as

\(\displaystyle C_p={SSE_p \over S^2} - N + 2P, \) where \(\displaystyle SSE_p\) is the error sum of squares for the model with P regression coefficients (including the intercept), \(\displaystyle S^2\) is the residual mean square from the regression on the complete set of predictors, and N is the sample size.
  • https://en.wikipedia.org/wiki/Mallows%27s_Cp
  • Used in Yuan & Lin (2006) group lasso. The degrees of freedom is estimated by the bootstrap or perturbation methods. Their paper mentioned the performance is comparable with that of 5-fold CV but is computationally much faster.
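
A minimal R sketch of the Cp formula above, assuming \(\displaystyle S^2\) is taken from the full linear model and mtcars is only illustrative data:

full <- lm(mpg ~ ., data = mtcars)
S2   <- summary(full)$sigma^2          # residual mean square of the full model
N    <- nrow(mtcars)

cp_of <- function(vars) {
  fit  <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  SSEp <- sum(residuals(fit)^2)        # error sum of squares of the submodel
  P    <- length(coef(fit))            # number of coefficients incl. intercept
  SSEp / S2 - N + 2 * P
}

# Example: Cp for two candidate submodels
cp_of(c("wt", "hp"))
cp_of(c("wt", "hp", "qsec"))

Submodels with Cp close to P (and small) are the usual candidates, since a large Cp signals a poorly fitting, overfit-prone choice of predictors.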

Variable selection for mode regression

http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 (Chen & Zhou, Journal of Applied Statistics, June 2017)

Neural network

Support vector machine (SVM)

Quadratic Discriminant Analysis (qda), KNN

Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN

Regularization

Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
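
As one concrete example, a minimal R sketch of L2 (ridge) regularization, assuming the glmnet package is available and using mtcars purely as illustrative data; the ridge penalty is the extra information that stabilizes the fit:

library(glmnet)

x <- as.matrix(mtcars[, -1])          # predictors
y <- mtcars$mpg

cvfit <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge (L2) penalty
coef(cvfit, s = "lambda.min")         # coefficients shrunk toward zero

Larger values of the penalty lambda shrink the coefficients more strongly, trading a little bias for a reduction in variance.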