Testmathj: Difference between revisions

From David's Wiki
No edit summary
(Blanked the page)
Tag: Blanking
Line 1: Line 1:


== Bagging ==
Chapter 8 of the book.
* Bootstrap mean is approximately a posterior average.
* Bootstrap aggregation or bagging average: Average the prediction over a collection of bootstrap samples, thereby reducing its variance. The bagging estimate is defined by
:<math>\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x).</math>
[https://statcompute.wordpress.com/2016/01/02/where-bagging-might-work-better-than-boosting/ Where Bagging Might Work Better Than Boosting]
[https://freakonometrics.hypotheses.org/52777 CLASSIFICATION FROM SCRATCH, BAGGING AND FORESTS 10/8]
== Boosting ==
* Ch8.2 Bagging, Random Forests and Boosting of [http://www-bcf.usc.edu/~gareth/ISL/ An Introduction to Statistical Learning] and the [http://www-bcf.usc.edu/~gareth/ISL/Chapter%208%20Lab.txt code].
* [http://freakonometrics.hypotheses.org/19874 An Attempt To Understand Boosting Algorithm]
* [http://cran.r-project.org/web/packages/gbm/index.html gbm] package. An implementation of extensions to Freund and Schapire's '''AdaBoost algorithm''' and Friedman's '''gradient boosting machine'''. Includes regression methods for least squares, absolute loss, t-distribution loss, [http://mathewanalytics.com/2015/11/13/applied-statistical-theory-quantile-regression/ quantile regression], logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart).
* https://www.biostat.wisc.edu/~kendzior/STAT877/illustration.pdf
* http://www.is.uni-freiburg.de/ressourcen/business-analytics/10_ensemblelearning.pdf and [http://www.is.uni-freiburg.de/ressourcen/business-analytics/homework_ensemblelearning_questions.pdf exercise]
* [https://freakonometrics.hypotheses.org/52782 Classification from scratch]
=== AdaBoost ===
AdaBoost.M1 by Freund and Schapire (1997):
The error rate on the training sample is
<math>
\bar{err} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)),
</math>
Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers <math>G_m(x), m=1,2,\dots,M.</math>
The predictions from all of them are combined through a weighted majority vote to produce the final prediction:
<math>
G(x) = sign[\sum_{m=1}^M \alpha_m G_m(x)].
</math>
Here <math> \alpha_1,\alpha_2,\dots,\alpha_M</math> are computed by the boosting algorithm and weight the contribution of each respective <math>G_m(x)</math>. Their effect is to give higher influence to the more accurate classifiers in the sequence.
=== Dropout regularization ===
[https://statcompute.wordpress.com/2017/08/20/dart-dropout-regularization-in-boosting-ensembles/ DART: Dropout Regularization in Boosting Ensembles]
=== Gradient boosting ===
* https://en.wikipedia.org/wiki/Gradient_boosting
* [https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ Machine Learning Basics - Gradient Boosting & XGBoost]
* [http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/ Gradient Boosting Essentials in R Using XGBOOST]
== Gradient descent ==
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function ([https://en.wikipedia.org/wiki/Gradient_descent Wikipedia]).
* [https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ An Introduction to Gradient Descent and Linear Regression] Easy to understand based on simple linear regression. Code is provided too.
* [http://gradientdescending.com/applying-gradient-descent-primer-refresher/ Applying gradient descent – primer / refresher]
* [http://sebastianruder.com/optimizing-gradient-descent/index.html An overview of Gradient descent optimization algorithms]
* [https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ A Complete Tutorial on Ridge and Lasso Regression in Python]
* How to choose the learning rate?
** [http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html Machine learning] from Andrew Ng
** http://scikit-learn.org/stable/modules/sgd.html
* R packages
** https://cran.r-project.org/web/packages/gradDescent/index.html, https://www.rdocumentation.org/packages/gradDescent/versions/2.0
** https://cran.r-project.org/web/packages/sgd/index.html
The error function from a simple linear regression looks like
: <math>
\begin{align}
Err(m,b) &= \frac{1}{N}\sum_{i=1}^n (y_i - (m x_i + b))^2, \\
\end{align}
</math>
We compute the gradient first for each parameters.
: <math>
\begin{align}
\frac{\partial Err}{\partial m} &= \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (m x_i + b)), \\
\frac{\partial Err}{\partial b} &= \frac{2}{n} \sum_{i=1}^n -(y_i - (m x_i + b))
\end{align}
</math>
The gradient descent algorithm uses an iterative method to update the estimates using a tuning parameter called '''learning rate'''.
<pre>
new_m &= m_current - (learningRate * m_gradient)
new_b &= b_current - (learningRate * b_gradient)
</pre>
After each iteration, derivative is closer to zero. [http://blog.hackerearth.com/gradient-descent-algorithm-linear-regression Coding in R] for the simple linear regression.
=== Gradient descent vs Newton's method ===
* [https://stackoverflow.com/a/12066869 What is the difference between Gradient Descent and Newton's Gradient Descent?]
* [http://www.santanupattanayak.com/2017/12/19/newtons-method-vs-gradient-descent-method-in-tacking-saddle-points-in-non-convex-optimization/ Newton's Method vs Gradient Descent Method in tacking saddle points in Non-Convex Optimization]
* [https://dinh-hung-tu.github.io/gradient-descent-vs-newton-method/ Gradient Descent vs Newton Method]
== Classification and Regression Trees (CART) ==
=== Construction of the tree classifier ===
* Node proportion
:<math> p(1|t) + \dots + p(6|t) =1 </math> where <math>p(j|t)</math> define the node proportions (class proportion of class ''j'' on node ''t''. Here we assume there are 6 classes.
* Impurity of node t
:<math>i(t)</math> is a nonnegative function <math>\phi</math> of the <math>p(1|t), \dots, p(6|t)</math> such that <math> \phi(1/6,1/6,\dots,1/6)</math> = maximumm <math>\phi(1,0,\dots,0)=0, \phi(0,1,0,\dots,0)=0, \dots, \phi(0,0,0,0,0,1)=0</math>. That is, the node impurity is largest when all classes are equally mixed together in it, and smallest when the node contains only one class.
* Gini index of impurity
:<math>i(t) = - \sum_{j=1}^6 p(j|t) \log p(j|t).</math>
* Goodness of the split s on node t
:<math>\Delta i(s, t) = i(t) -p_Li(t_L) - p_Ri(t_R). </math> where <math>p_R</math> are the proportion of the cases in t go into the left node <math>t_L</math> and a proportion <math>p_R</math> go into right node <math>t_R</math>.
A tree was grown in the following way: At the root node <math>t_1</math>, a search was made through all candidate splits to find that split <math>s^*</math> which gave the largest decrease in impurity;
:<math>\Delta i(s^*, t_1) = \max_{s} \Delta i(s, t_1).</math>
* Class character of a terminal node was determined by the plurality rule. Specifically, if <math>p(j_0|t)=\max_j p(j|t)</math>, then ''t'' was designated as a class <math>j_0</math> terminal node.
=== R packages ===
* [http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf rpart]
* http://exploringdatablog.blogspot.com/2013/04/classification-tree-models.html
== Partially additive (generalized) linear model trees ==
* https://eeecon.uibk.ac.at/~zeileis/news/palmtree/
* https://cran.r-project.org/web/packages/palmtree/index.html
== Supervised Classification, Logistic and Multinomial ==
* http://freakonometrics.hypotheses.org/19230
== Variable selection ==
=== Review ===
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al 2018.
=== Variable selection and variable importance plot ===
* http://freakonometrics.hypotheses.org/19835
=== Variable selection and cross-validation ===
* http://freakonometrics.hypotheses.org/19925
* http://ellisp.github.io/blog/2016/06/05/bootstrap-cv-strategies/
=== Mallow ''C<sub>p</sub>'' ===
Mallows's ''C<sub>p</sub>'' addresses the issue of overfitting. The Cp statistic calculated on a sample of data estimates the '''mean squared prediction error (MSPE)'''.
:<math>
E\sum_j (\hat{Y}_j - E(Y_j\mid X_j))^2/\sigma^2,
</math>
The ''C<sub>p</sub>'' statistic is defined as
:<math> C_p={SSE_p \over S^2} - N + 2P. </math>
* https://en.wikipedia.org/wiki/Mallows%27s_Cp
* Used in Yuan & Lin (2006) group lasso. The degrees of freedom is estimated by the bootstrap or perturbation methods. Their paper mentioned the performance is comparable with that of 5-fold CV but is computationally much faster.
=== Variable selection for mode regression ===
http://www.tandfonline.com/doi/full/10.1080/02664763.2017.1342781 Chen & Zhou, Journal of applied statistics ,June 2017
== Neural network ==
* [http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r Build your own neural network in R]
* (Video) [https://youtu.be/ntKn5TPHHAk 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code] from the Coding Train. The book [http://natureofcode.com/book/chapter-10-neural-networks/ THE NATURE OF CODE] by DANIEL SHIFFMAN
* [https://freakonometrics.hypotheses.org/52774 CLASSIFICATION FROM SCRATCH, NEURAL NETS]. The ROCR package was used to produce the ROC curve.
== Support vector machine (SVM) ==
* [https://statcompute.wordpress.com/2016/03/19/improve-svm-tuning-through-parallelism/ Improve SVM tuning through parallelism] by using the '''foreach''' and '''doParallel''' packages.
== Quadratic Discriminant Analysis (qda), KNN ==
[https://datarvalue.blogspot.com/2017/05/machine-learning-stock-market-data-part_16.html Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN]
== [https://en.wikipedia.org/wiki/Regularization_(mathematics) Regularization] ==
Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting

Revision as of 22:39, 7 September 2019