== Bagging ==
Chapter 8 of the book.
* Bootstrap mean is approximately a posterior average.
* Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
:<math>\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).</math>
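A minimal R sketch of this bagging average, using regression trees from the '''rpart''' package as the base learner; the simulated data and the value of <math>B</math> are illustrative.
<pre>
# Bagging: average tree predictions over B bootstrap samples
library(rpart)

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

B <- 100
preds <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)        # bootstrap sample
  fit <- rpart(y ~ x, data = dat[idx, ])  # base learner fitted to the b-th sample
  predict(fit, newdata = dat)             # predictions on the original x's
})
f_bag <- rowMeans(preds)                  # (1/B) * sum_b f^{*b}(x)
</pre>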
[https://statcompute.wordpress.com/2016/01/02/where-bagging-might-work-better-than-boosting/ Where Bagging Might Work Better Than Boosting]
[https://freakonometrics.hypotheses.org/52777 CLASSIFICATION FROM SCRATCH, BAGGING AND FORESTS 10/8]
== Boosting ==
* Ch8.2 Bagging, Random Forests and Boosting of [http://www-bcf.usc.edu/~gareth/ISL/ An Introduction to Statistical Learning] and the [http://www-bcf.usc.edu/~gareth/ISL/Chapter%208%20Lab.txt code].
* [http://freakonometrics.hypotheses.org/19874 An Attempt To Understand Boosting Algorithm]
* [http://cran.r-project.org/web/packages/gbm/index.html gbm] package. An implementation of extensions to Freund and Schapire's '''AdaBoost algorithm''' and Friedman's '''gradient boosting machine'''. Includes regression methods for least squares, absolute loss, t-distribution loss, [http://mathewanalytics.com/2015/11/13/applied-statistical-theory-quantile-regression/ quantile regression], logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart). A minimal usage sketch is given after this list.
* https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
* http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and [http://www.is.uni-freiburg.de/ressourcen/business-analytics/homework_ensemblelearning_questions.pdf exercise]
* [https://freakonometrics.hypotheses.org/52782 Classification from scratch]
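A minimal usage sketch for the '''gbm''' package on a simulated binary outcome; the data, formula, and tuning values below are illustrative choices rather than recommendations.
<pre>
# Gradient boosting with gbm, Bernoulli (logistic) loss
library(gbm)

set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x1 - df$x2))

fit <- gbm(y ~ x1 + x2, data = df,
           distribution = "bernoulli",   # logistic loss for 0/1 outcomes
           n.trees = 500, interaction.depth = 2,
           shrinkage = 0.05, cv.folds = 5)

best <- gbm.perf(fit, method = "cv")     # number of trees chosen by CV
p    <- predict(fit, newdata = df, n.trees = best, type = "response")
summary(fit)                             # relative variable influence
</pre>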
=== AdaBoost ===
AdaBoost.M1 by Freund and Schapire (1997):

The error rate on the training sample is
<math>
\overline{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)).
</math>

Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers <math>G_m(x), m=1,2,\dots,M.</math>
The predictions from all of them are combined through a weighted majority vote to produce the final prediction:
<math>
G(x) = \operatorname{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right].
</math>
Here <math>\alpha_1,\alpha_2,\dots,\alpha_M</math> are computed by the boosting algorithm and weight the contribution of each respective <math>G_m(x)</math>. Their effect is to give higher influence to the more accurate classifiers in the sequence.
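A compact R sketch of AdaBoost.M1 using decision stumps fitted with '''rpart''' as the weak learner; labels are coded as &plusmn;1, all names and settings are illustrative, and the sketch assumes each weighted error rate stays strictly between 0 and 1/2.
<pre>
## AdaBoost.M1 with decision stumps (rpart) as the weak learner; y must be -1/+1
library(rpart)

adaboost_m1 <- function(X, y, M = 25) {
  N <- nrow(X)
  w <- rep(1 / N, N)                       # initial observation weights
  stumps <- vector("list", M)
  alpha  <- numeric(M)
  dat <- data.frame(X, y = factor(y))
  for (m in 1:M) {
    fit <- rpart(y ~ ., data = dat, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1, cp = 0,
                                         minsplit = 2, minbucket = 1))
    pred <- as.numeric(as.character(predict(fit, dat, type = "class")))
    err  <- sum(w * (pred != y)) / sum(w)  # weighted training error of G_m
    alpha[m] <- log((1 - err) / err)       # weight of G_m in the final vote
    w <- w * exp(alpha[m] * (pred != y))   # up-weight misclassified cases
    stumps[[m]] <- fit
  }
  list(stumps = stumps, alpha = alpha)
}

predict_adaboost <- function(model, X) {
  dat <- data.frame(X)
  score <- rowSums(sapply(seq_along(model$stumps), function(m) {
    model$alpha[m] *
      as.numeric(as.character(predict(model$stumps[[m]], dat, type = "class")))
  }))
  sign(score)                              # weighted majority vote G(x)
}

## Toy example
set.seed(1)
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- ifelse(X$x1 + X$x2 > 0, 1, -1)
fit <- adaboost_m1(X, y, M = 25)
mean(predict_adaboost(fit, X) == y)        # training accuracy
</pre>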
=== Dropout regularization ===
[https://statcompute.wordpress.com/2017/08/20/dart-dropout-regularization-in-boosting-ensembles/ DART: Dropout Regularization in Boosting Ensembles]

=== Gradient boosting ===
* https://en.wikipedia.org/wiki/Gradient_boosting
* [https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ Machine Learning Basics - Gradient Boosting & XGBoost]
* [http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/ Gradient Boosting Essentials in R Using XGBOOST]
== Gradient descent ==
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function ([https://en.wikipedia.org/wiki/Gradient_descent Wikipedia]).
* [https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ An Introduction to Gradient Descent and Linear Regression]. An easy-to-follow introduction based on simple linear regression; code is provided too.
* [http://gradientdescending.com/applying-gradient-descent-primer-refresher/ Applying gradient descent – primer / refresher]
* [http://sebastianruder.com/optimizing-gradient-descent/index.html An overview of Gradient descent optimization algorithms]
* [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ A Complete Tutorial on Ridge and Lasso Regression in Python]
* How to choose the learning rate?
** [http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html Machine learning] from Andrew Ng
** http://scikit-learn.org/stable/modules/sgd.html
* R packages
** https://cran.r-project.org/web/packages/gradDescent/index.html, https://www.rdocumentation.org/packages/gradDescent/versions/2.0
** https://cran.r-project.org/web/packages/sgd/index.html
The error function for simple linear regression is
: <math>
Err(m,b) = \frac{1}{n}\sum_{i=1}^n (y_i - (m x_i + b))^2.
</math>
We first compute the gradient with respect to each parameter:
: <math>
\begin{align}
\frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\
\frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b)).
\end{align}
</math>
Gradient descent updates the estimates iteratively, using a tuning parameter called the '''learning rate''':
<pre>
new_m = m_current - (learningRate * m_gradient)
new_b = b_current - (learningRate * b_gradient)
</pre>
After each iteration, the gradient moves closer to zero. See [http://blog.hackerearth.com/gradient-descent-algorithm-linear-regression Coding in R] for the simple linear regression.
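A self-contained R version of these updates for the simple linear regression above; the simulated data, learning rate, and number of iterations are illustrative choices.
<pre>
# Gradient descent for Err(m, b) = (1/n) * sum (y - (m*x + b))^2
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2.5 * x + 1 + rnorm(n)

m <- 0; b <- 0                           # starting values
learningRate <- 0.01

for (iter in 1:1000) {
  resid  <- y - (m * x + b)
  m_grad <- (2 / n) * sum(-x * resid)    # dErr/dm
  b_grad <- (2 / n) * sum(-resid)        # dErr/db
  m <- m - learningRate * m_grad
  b <- b - learningRate * b_grad
}

c(m = m, b = b)                          # compare with coef(lm(y ~ x))
</pre>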
=== Gradient descent vs Newton's method ===
* [https://stackoverflow.com/a/12066869 What is the difference between Gradient Descent and Newton's Gradient Descent?]
* [http://www.santanupattanayak.com/2017/12/19/newtons-method-vs-gradient-descent-method-in-tacking-saddle-points-in-non-convex-optimization/ Newton's Method vs Gradient Descent Method in tackling saddle points in Non-Convex Optimization]
* [https://dinh-hung-tu.github.io/gradient-descent-vs-newton-method/ Gradient Descent vs Newton Method]
== Classification and Regression Trees (CART) ==
=== Construction of the tree classifier ===
* Node proportion
:<math> p(1|t) + \dots + p(6|t) =1, </math> where <math>p(j|t)</math> denotes the node proportion (the proportion of class ''j'' cases in node ''t''). Here we assume there are 6 classes.
* Impurity of node ''t''
:<math>i(t)</math> is a nonnegative function <math>\phi</math> of <math>p(1|t), \dots, p(6|t)</math> such that <math>\phi(1/6,1/6,\dots,1/6)</math> is the maximum and <math>\phi(1,0,\dots,0)=\phi(0,1,0,\dots,0)=\dots=\phi(0,0,0,0,0,1)=0</math>. That is, the node impurity is largest when all classes are equally mixed together in the node, and smallest when the node contains only one class.
* Impurity measures: the entropy and the Gini index
:<math>i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t) \qquad \text{(entropy)},</math>
:<math>i(t) = 1 - \sum_{j=1}^6 p(j|t)^2 \qquad \text{(Gini)}.</math>
* Goodness of the split ''s'' on node ''t''
:<math>\Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R), </math> where <math>p_L</math> is the proportion of the cases in ''t'' that go into the left node <math>t_L</math> and <math>p_R</math> is the proportion that goes into the right node <math>t_R</math>.
A tree was grown in the following way: At the root node <math>t_1</math>, a search was made through all candidate splits to find that split <math>s^*</math> which gave the largest decrease in impurity:
:<math>\Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).</math>
* The class character of a terminal node was determined by the plurality rule. Specifically, if <math>p(j_0|t)=\max_j p(j|t)</math>, then ''t'' was designated as a class <math>j_0</math> terminal node.
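A small R illustration of the impurity and goodness-of-split quantities defined above; the impurity functions follow the formulas, while the class labels and the candidate split are simulated for illustration.
<pre>
# Node impurity and the decrease in impurity for a candidate split
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

node_impurity <- function(cls, impurity = gini) {
  impurity(prop.table(table(cls)))              # p(1|t), ..., p(J|t)
}

goodness_of_split <- function(cls, go_left, impurity = gini) {
  pL <- mean(go_left); pR <- 1 - pL
  node_impurity(cls, impurity) -
    pL * node_impurity(cls[go_left],  impurity) -
    pR * node_impurity(cls[!go_left], impurity)  # Delta i(s, t)
}

# Illustrative node with 3 classes and a candidate split "x < 0"
set.seed(1)
x   <- rnorm(30)
cls <- sample(1:3, 30, replace = TRUE)
goodness_of_split(cls, go_left = x < 0)
</pre>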
=== R packages ===
* [http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf rpart]
* http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html

== Partially additive (generalized) linear model trees ==
* https://eeecon.uibk.ac.at/~zeileis/news/palmtree/
* https://cran.r-project.org/web/packages/palmtree/index.html

== Supervised Classification, Logistic and Multinomial ==
* http://freakonometrics.hypotheses.org/19230
== Variable selection ==

=== Review ===
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al., 2018.

=== Variable selection and variable importance plot ===
* http://freakonometrics.hypotheses.org/19835

=== Variable selection and cross-validation ===
* http://freakonometrics.hypotheses.org/19925
* http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
=== Mallows's ''C<sub>p</sub>'' ===
Mallows's ''C<sub>p</sub>'' addresses the issue of overfitting. The ''C<sub>p</sub>'' statistic calculated on a sample of data estimates the '''mean squared prediction error (MSPE)''',
:<math>
E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2.
</math>
The ''C<sub>p</sub>'' statistic is defined as
:<math> C_p={SSE_p \over S^2} - N + 2P. </math>
* https://en.wikipedia.org/wiki/Mallows%27s_Cp
* Used in the Yuan & Lin (2006) group lasso, where the degrees of freedom is estimated by bootstrap or perturbation methods. Their paper notes that its performance is comparable to that of 5-fold CV while being computationally much faster.
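A small numerical illustration of the ''C<sub>p</sub>'' formula in R, with ''S''<sup>2</sup> taken from the full model; the simulated data and the candidate models are illustrative.
<pre>
# Mallows's Cp: SSE_p / S^2 - N + 2P, with S^2 from the full model
set.seed(1)
N <- 100
X <- data.frame(x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N))
y <- 1 + 2 * X$x1 - X$x2 + rnorm(N)            # x3 is pure noise

full <- lm(y ~ x1 + x2 + x3, data = X)
S2   <- summary(full)$sigma^2                  # estimate of sigma^2

cp <- function(fit) {
  SSE <- sum(resid(fit)^2)
  P   <- length(coef(fit))                     # number of parameters incl. intercept
  SSE / S2 - N + 2 * P
}

cp(lm(y ~ x1,           data = X))
cp(lm(y ~ x1 + x2,      data = X))             # should be close to its P
cp(lm(y ~ x1 + x2 + x3, data = X))
</pre>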
=== Variable selection for mode regression ===
[http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of Applied Statistics, June 2017]
== Neural network ==
* [http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r Build your own neural network in R]
* (Video) [https://youtu.be/ntKn5TPHHAk 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code] from the Coding Train. The book [http://natureofcode.com/book/chapter-10-neural-networks/ The Nature of Code] by Daniel Shiffman.
* [https://freakonometrics.hypotheses.org/52774 CLASSIFICATION FROM SCRATCH, NEURAL NETS]. The ROCR package was used to produce the ROC curve.

== Support vector machine (SVM) ==
* [https://statcompute.wordpress.com/2016/03/19/improve-svm-tuning-through-parallelism/ Improve SVM tuning through parallelism] by using the '''foreach''' and '''doParallel''' packages.

== Quadratic Discriminant Analysis (qda), KNN ==
[https://datarvalue.blogspot.com/2017/05/machine-learning-stock-market-data-part_16.html Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN]
== [https://en.wikipedia.org/wiki/Regularization_(mathematics) Regularization] ==
Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
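As a concrete illustration, a minimal ridge-regression sketch using the '''glmnet''' package; the simulated data are illustrative, and setting alpha = 0 selects the ridge (L2) penalty.
<pre>
# Ridge regression: penalize coefficient size to reduce overfitting
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

cvfit <- cv.glmnet(x, y, alpha = 0)        # alpha = 0 gives the ridge penalty
coef(cvfit, s = "lambda.min")              # coefficients at the CV-chosen lambda
</pre>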