Testmathj: Difference between revisions

Line 1:

~~== Bagging ==~~

~~Chapter 8 of the book.~~

* Bootstrap mean is approximately a posterior average.

* Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by

~~:<math>\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).</math>~~

~~[https://statcompute.wordpress.com/2016/01/02/where-bagging-might-work-better-than-boosting/ Where Bagging Might Work Better Than Boosting]~~

~~[https://freakonometrics.hypotheses.org/52777 CLASSIFICATION FROM SCRATCH, BAGGING AND FORESTS 10/8]~~

~~== Boosting ==~~

* Ch8.2 Bagging, Random Forests and Boosting of [http://www-bcf.usc.edu/~gareth/ISL/ An Introduction to Statistical Learning] and the [http://www-bcf.usc.edu/~gareth/ISL/Chapter%208%20Lab.txt code].

* [http://freakonometrics.hypotheses.org/19874 An Attempt To Understand Boosting Algorithm]

* [http://cran.r-project.org/web/packages/gbm/index.html gbm] package. An implementation of extensions to Freund and Schapire's '''AdaBoost algorithm''' and Friedman's '''gradient boosting machine'''. Includes regression methods for least squares, absolute loss, t-distribution loss, [http://mathewanalytics.com/2015/11/13/applied-statistical-theory-quantile-regression/ quantile regression], logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart).

* https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf

* http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and [http://www.is.uni-freiburg.de/ressourcen/business-analytics/homework_ensemblelearning_questions.pdf exercise]

* [https://freakonometrics.hypotheses.org/52782 Classification from scratch]

~~=== AdaBoost ===~~

~~AdaBoost.M1 by Freund and Schapire (1997):~~

~~The error rate on the training sample is~~

~~<math>~~

~~\bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)),~~

~~</math>~~

~~Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers <math>G_m(x), m=1,2,\dots,M.</math>~~

~~The predictions from all of them are combined through a weighted majority vote to produce the final prediction:~~

~~<math>~~

~~G(x) = sign[\sum_{m=1}^M \alpha_m G_m(x)].~~

~~</math>~~

Here <math> \alpha_1,\alpha_2,\dots,\alpha_M</math> are computed by the boosting algorithm and weight the contribution of each respective <math>G_m(x)</math>. Their effect is to give higher influence to the more accurate classifiers in the sequence.

~~=== Dropout regularization ===~~

~~[https://statcompute.wordpress.com/2017/08/20/dart-dropout-regularization-in-boosting-ensembles/ DART: Dropout Regularization in Boosting Ensembles]~~

~~=== Gradient boosting ===~~

* https://en.wikipedia.org/wiki/Gradient_boosting

* [https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ Machine Learning Basics - Gradient Boosting & XGBoost]

* [http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/ Gradient Boosting Essentials in R Using XGBOOST]

~~== Gradient descent ==~~

~~Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function ([https://en.wikipedia.org/wiki/Gradient_descent Wikipedia]).~~

* [https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ An Introduction to Gradient Descent and Linear Regression] Easy to understand based on simple linear regression. Code is provided too.

* [http://gradientdescending.com/applying-gradient-descent-primer-refresher/ Applying gradient descent – primer / refresher]

* [http://sebastianruder.com/optimizing-gradient-descent/index.html An overview of Gradient descent optimization algorithms]

* [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ A Complete Tutorial on Ridge and Lasso Regression in Python]

* How to choose the learning rate?

** [http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html Machine learning] from Andrew Ng

** http://scikit-learn.org/stable/modules/sgd.html

* R packages

** https://cran.r-project.org/web/packages/gradDescent/index.html, https://www.rdocumentation.org/packages/gradDescent/versions/2.0

** https://cran.r-project.org/web/packages/sgd/index.html

~~The error function from a simple linear regression looks like~~

~~: <math>~~

~~\begin{align}~~

~~Err(m,b) &= \frac{1}{N}\sum_{i=1}^n (y_i - (m x_i + b))^2, \\~~

~~\end{align}~~

~~</math>~~

~~We compute the gradient first for each parameters.~~

~~: <math>~~

~~\begin{align}~~

~~\frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\~~

~~\frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b))~~

~~\end{align}~~

~~</math>~~

~~The gradient descent algorithm uses an iterative method to update the estimates using a tuning parameter called '''learning rate'''.~~

~~<pre>~~

~~new_m &= m_current - (learningRate * m_gradient)~~

~~new_b &= b_current - (learningRate * b_gradient)~~

~~</pre>~~

~~After each iteration, derivative is closer to zero. [http://blog.hackerearth.com/gradient-descent-algorithm-linear-regression Coding in R] for the simple linear regression.~~

~~=== Gradient descent vs Newton's method ===~~

* [https://stackoverflow.com/a/12066869 What is the difference between Gradient Descent and Newton's Gradient Descent?]

* [http://www.santanupattanayak.com/2017/12/19/newtons-method-vs-gradient-descent-method-in-tacking-saddle-points-in-non-convex-optimization/ Newton's Method vs Gradient Descent Method in tacking saddle points in Non-Convex Optimization]

* [https://dinh-hung-tu.github.io/gradient-descent-vs-newton-method/ Gradient Descent vs Newton Method]

~~== Classification and Regression Trees (CART) ==~~

~~=== Construction of the tree classifier ===~~

* Node proportion

~~:<math> p(1|t) + \dots + p(6|t) =1 </math> where <math>p(j|t)</math> define the node proportions (class proportion of class ''j'' on node ''t''. Here we assume there are 6 classes.~~

* Impurity of node t

:<math>i(t)</math> is a nonnegative function <math>\phi</math> of the <math>p(1|t), \dots, p(6|t)</math> such that <math> \phi(1/6,1/6,\dots,1/6)</math> = maximumm <math>\phi(1,0,\dots,0)=0, \phi(0,1,0,\dots,0)=0, \dots, \phi(0,0,0,0,0,1)=0</math>. That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.

* Gini index of impurity

~~:<math>i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t).</math>~~

* Goodness of the split s on node t

:<math>\Delta i(s, t) = i(t) -p_Li(t_L) - p_Ri(t_R). </math> where <math>p_R</math> are the proportion of the cases in t go into the left node <math>t_L</math> and a proportion <math>p_R</math> go into right node <math>t_R</math>.

A tree was grown in the following way: At the root node <math>t_1</math>, a search was made through all candidate splits to find that split <math>s^*</math> which gave the largest decrease in impurity;

~~:<math>\Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).</math>~~

* Class character of a terminal node was determined by the plurality rule. Specifically, if <math>p(j_0|t)=\max_j p(j|t)</math>, then ''t'' was designated as a class <math>j_0</math> terminal node.

~~=== R packages ===~~

* [http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf rpart]

* http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html

~~== Partially additive (generalized) linear model trees ==~~

* https://eeecon.uibk.ac.at/~zeileis/news/palmtree/

* https://cran.r-project.org/web/packages/palmtree/index.html

~~== Supervised Classification, Logistic and Multinomial ==~~

* http://freakonometrics.hypotheses.org/19230

~~== Variable selection ==~~

~~=== Review ===~~

~~[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al 2018.~~

~~=== Variable selection and variable importance plot ===~~

* http://freakonometrics.hypotheses.org/19835

~~=== Variable selection and cross-validation ===~~

* http://freakonometrics.hypotheses.org/19925

* http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/

~~=== Mallow ''Cp'' ===~~

~~Mallows's ''Cp'' addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the '''mean squared prediction error (MSPE)'''.~~

~~:<math>~~

~~E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2,~~

~~</math>~~

~~The ''Cp'' statistic is defined as~~

~~:<math> C_p={SSE_p \over S^2} - N + 2P. </math>~~

* https://en.wikipedia.org/wiki/Mallows%27s_Cp

* Used in Yuan & Lin (2006) group lasso. The degrees of freedom is estimated by the bootstrap or perturbation methods. Their paper mentioned the performance is comparable with that of 5-fold CV but is computationally much faster.

~~=== Variable selection for mode regression ===~~

~~http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of applied statistics ,June 2017~~

~~== Neural network ==~~

* [http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r Build your own neural network in R]

* (Video) [https://youtu.be/ntKn5TPHHAk 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code] from the Coding Train. The book [http://natureofcode.com/book/chapter-10-neural-networks/ THE NATURE OF CODE] by DANIEL SHIFFMAN

* [https://freakonometrics.hypotheses.org/52774 CLASSIFICATION FROM SCRATCH, NEURAL NETS]. The ROCR package was used to produce the ROC curve.

~~== Support vector machine (SVM) ==~~

* [https://statcompute.wordpress.com/2016/03/19/improve-svm-tuning-through-parallelism/ Improve SVM tuning through parallelism] by using the '''foreach''' and '''doParallel''' packages.

~~== Quadratic Discriminant Analysis (qda), KNN ==~~

~~[https://datarvalue.blogspot.com/2017/05/machine-learning-stock-market-data-part_16.html Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN]~~

~~== [https://en.wikipedia.org/wiki/Regularization_(mathematics) Regularization] ==~~

~~Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting~~

@@ Line 1: / Line 1: @@
-== Bagging ==
-Chapter 8 of the book.
-* Bootstrap mean is approximately a posterior average.
-* Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
-:<math>\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).</math>
-[https://statcompute.wordpress.com/2016/01/02/where-bagging-might-work-better-than-boosting/ Where Bagging Might Work Better Than Boosting]
-[https://freakonometrics.hypotheses.org/52777 CLASSIFICATION FROM SCRATCH, BAGGING AND FORESTS 10/8]
-== Boosting ==
-* Ch8.2 Bagging, Random Forests and Boosting of [http://www-bcf.usc.edu/~gareth/ISL/ An Introduction to Statistical Learning] and the [http://www-bcf.usc.edu/~gareth/ISL/Chapter%208%20Lab.txt code].
-* [http://freakonometrics.hypotheses.org/19874 An Attempt To Understand Boosting Algorithm]
-* [http://cran.r-project.org/web/packages/gbm/index.html gbm] package. An implementation of extensions to Freund and Schapire's '''AdaBoost algorithm''' and Friedman's '''gradient boosting machine'''. Includes regression methods for least squares, absolute loss, t-distribution loss, [http://mathewanalytics.com/2015/11/13/applied-statistical-theory-quantile-regression/ quantile regression], logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart).
-* https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
-* http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and [http://www.is.uni-freiburg.de/ressourcen/business-analytics/homework_ensemblelearning_questions.pdf exercise]
-* [https://freakonometrics.hypotheses.org/52782 Classification from scratch]
-=== AdaBoost ===
-AdaBoost.M1 by Freund and Schapire (1997):
-The error rate on the training sample is
-<math>
-\bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)),
-</math>
-Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers <math>G_m(x), m=1,2,\dots,M.</math>
-The predictions from all of them are combined through a weighted majority vote to produce the final prediction:
-<math>
-G(x) = sign[\sum_{m=1}^M \alpha_m G_m(x)].
-</math>
-Here <math> \alpha_1,\alpha_2,\dots,\alpha_M</math> are computed by the boosting algorithm and weight the contribution of each respective <math>G_m(x)</math>. Their effect is to give higher influence to the more accurate classifiers in the sequence.
-=== Dropout regularization ===
-[https://statcompute.wordpress.com/2017/08/20/dart-dropout-regularization-in-boosting-ensembles/ DART: Dropout Regularization in Boosting Ensembles]
-=== Gradient boosting ===
-* https://en.wikipedia.org/wiki/Gradient_boosting
-* [https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ Machine Learning Basics - Gradient Boosting & XGBoost]
-* [http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/ Gradient Boosting Essentials in R Using XGBOOST]
-== Gradient descent ==
-Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function ([https://en.wikipedia.org/wiki/Gradient_descent Wikipedia]).
-* [https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ An Introduction to Gradient Descent and Linear Regression] Easy to understand based on simple linear regression. Code is provided too.
-* [http://gradientdescending.com/applying-gradient-descent-primer-refresher/ Applying gradient descent – primer / refresher]
-* [http://sebastianruder.com/optimizing-gradient-descent/index.html An overview of Gradient descent optimization algorithms]
-* [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ A Complete Tutorial on Ridge and Lasso Regression in Python]
-* How to choose the learning rate?
-** [http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html Machine learning] from Andrew Ng
-** http://scikit-learn.org/stable/modules/sgd.html
-* R packages
-** https://cran.r-project.org/web/packages/gradDescent/index.html, https://www.rdocumentation.org/packages/gradDescent/versions/2.0
-** https://cran.r-project.org/web/packages/sgd/index.html
-The error function from a simple linear regression looks like
-: <math>
-\begin{align}
-Err(m,b) &= \frac{1}{N}\sum_{i=1}^n (y_i - (m x_i + b))^2, \\
-\end{align}
-</math>
-We compute the gradient first for each parameters.
-: <math>
-\begin{align}
-\frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\
-\frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b))
-\end{align}
-</math>
-The gradient descent algorithm uses an iterative method to update the estimates using a tuning parameter called '''learning rate'''.
-<pre>
-new_m &= m_current - (learningRate * m_gradient)
-new_b &= b_current - (learningRate * b_gradient)
-</pre>
-After each iteration, derivative is closer to zero. [http://blog.hackerearth.com/gradient-descent-algorithm-linear-regression Coding in R] for the simple linear regression.
-=== Gradient descent vs Newton's method ===
-* [https://stackoverflow.com/a/12066869 What is the difference between Gradient Descent and Newton's Gradient Descent?]
-* [http://www.santanupattanayak.com/2017/12/19/newtons-method-vs-gradient-descent-method-in-tacking-saddle-points-in-non-convex-optimization/ Newton's Method vs Gradient Descent Method in tacking saddle points in Non-Convex Optimization]
-* [https://dinh-hung-tu.github.io/gradient-descent-vs-newton-method/ Gradient Descent vs Newton Method]
-== Classification and Regression Trees (CART) ==
-=== Construction of the tree classifier ===
-* Node proportion
-:<math> p(1|t) + \dots + p(6|t) =1 </math> where <math>p(j|t)</math> define the node proportions (class proportion of class ''j'' on node ''t''. Here we assume there are 6 classes.
-* Impurity of node t
-:<math>i(t)</math> is a nonnegative function <math>\phi</math> of the <math>p(1|t), \dots, p(6|t)</math> such that <math> \phi(1/6,1/6,\dots,1/6)</math> = maximumm <math>\phi(1,0,\dots,0)=0, \phi(0,1,0,\dots,0)=0, \dots, \phi(0,0,0,0,0,1)=0</math>. That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
-* Gini index of impurity
-:<math>i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t).</math>
-* Goodness of the split s on node t
-:<math>\Delta i(s, t) = i(t) -p_Li(t_L) - p_Ri(t_R). </math> where <math>p_R</math> are the proportion of the cases in t go into the left node <math>t_L</math> and a proportion <math>p_R</math> go into right node <math>t_R</math>.
-A tree was grown in the following way: At the root node <math>t_1</math>, a search was made through all candidate splits to find that split <math>s^*</math> which gave the largest decrease in impurity;
-:<math>\Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).</math>
-* Class character of a terminal node was determined by the plurality rule. Specifically, if <math>p(j_0|t)=\max_j p(j|t)</math>, then ''t'' was designated as a class <math>j_0</math> terminal node.
-=== R packages ===
-* [http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf rpart]
-* http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html
-== Partially additive (generalized) linear model trees ==
-* https://eeecon.uibk.ac.at/~zeileis/news/palmtree/
-* https://cran.r-project.org/web/packages/palmtree/index.html
-== Supervised Classification, Logistic and Multinomial ==
-* http://freakonometrics.hypotheses.org/19230
-== Variable selection ==
-=== Review ===
-[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al 2018.
-=== Variable selection and variable importance plot ===
-* http://freakonometrics.hypotheses.org/19835
-=== Variable selection and cross-validation ===
-* http://freakonometrics.hypotheses.org/19925
-* http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
-=== Mallow ''C<sub>p</sub>'' ===
-Mallows's ''C<sub>p</sub>'' addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the '''mean squared prediction error (MSPE)'''.
-:<math>
-E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2,
-</math>
-The ''C<sub>p</sub>'' statistic is defined as
-:<math> C_p={SSE_p \over S^2} - N + 2P. </math>
-* https://en.wikipedia.org/wiki/Mallows%27s_Cp
-* Used in Yuan & Lin (2006) group lasso. The degrees of freedom is estimated by the bootstrap or perturbation methods. Their paper mentioned the performance is comparable with that of 5-fold CV but is computationally much faster.
-=== Variable selection for mode regression ===
-http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of applied statistics ,June 2017
-== Neural network ==
-* [http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r Build your own neural network in R]
-* (Video) [https://youtu.be/ntKn5TPHHAk 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code] from the Coding Train. The book [http://natureofcode.com/book/chapter-10-neural-networks/ THE NATURE OF CODE] by DANIEL SHIFFMAN
-* [https://freakonometrics.hypotheses.org/52774 CLASSIFICATION FROM SCRATCH, NEURAL NETS]. The ROCR package was used to produce the ROC curve.
-== Support vector machine (SVM) ==
-* [https://statcompute.wordpress.com/2016/03/19/improve-svm-tuning-through-parallelism/ Improve SVM tuning through parallelism] by using the '''foreach''' and '''doParallel''' packages.
-== Quadratic Discriminant Analysis (qda), KNN ==
-[https://datarvalue.blogspot.com/2017/05/machine-learning-stock-market-data-part_16.html Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN]
-== [https://en.wikipedia.org/wiki/Regularization_(mathematics) Regularization] ==
-Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting