Testmathj: Difference between revisions

From David's Wiki
No edit summary
No edit summary
Line 1: Line 1:
** The larger the <math>\lambda</math>, more coefficients are becoming zeros (think about '''coefficient path''' plots) and thus the simpler (more '''regularized''') the model.  
 
** If <math>\lambda</math> becomes zero, it reduces to the regular regression and if <math>\lambda</math> becomes infinity, the coefficients become zeros.
[[:File:Glmnetplot.svg]]  [[:File:Glmnet path.svg]] [[:File:Glmnet l1norm.svg]]
** In terms of the bias-variance tradeoff, the larger the <math>\lambda</math>, the higher the bias and the lower the variance of the coefficient estimators.
: <syntaxhighlight lang='rsplus'>
set.seed(1010)
n=1000;p=100
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
beta=rnorm(nzc)
fx= x[,seq(nzc)] %*% beta
eps=rnorm(n)*5
y=drop(fx+eps)
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)
 
## Full lasso
set.seed(999)
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)
plot(cv.full)  # cross-validation curve and two lambda's
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44
 
## Ridge Regression to create the Adaptive Weights Vector
set.seed(999)
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)
                  [, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
## Adaptive Lasso using the 'penalty.factor' argument
set.seed(999)
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
# defautl type.measure="deviance" for logistic regression
plot(cv.lasso)
log(cv.lasso$lambda.min) # -2.995375
log(cv.lasso$lambda.1se) # -0.7625655
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34
</syntaxhighlight>
* A list of potential lambdas: see [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin Linear Regression] case. The λ sequence is determined by '''lambda.max''' and '''lambda.min.ratio'''. The latter (default is ifelse(nobs<nvars,0.01,0.0001)) is the ratio of smallest value of the generated λ sequence (say ''lambda.min'') to  ''lambda.max''. The program then generated ''nlambda'' values linear on the log scale from ''lambda.max'' down to ''lambda.min''. ''lambda.max'' is not given, but easily computed from the input x and y; it is the smallest value for ''lambda'' such that all the coefficients are zero.
* [https://privefl.github.io/blog/choosing-hyper-parameters-in-penalized-regression/ Choosing hyper-parameters (α and λ) in penalized regression] by Florian Privé
* [https://stats.stackexchange.com/questions/70249/feature-selection-model-with-glmnet-on-methylation-data-pn lambda.min vs lambda.1se]
** The '''lambda.1se''' represents the value of λ in the search that was simpler than the best model ('''lambda.min'''), but which has error within 1 standard error of the best model. In other words, using the value of ''lambda.1se'' as the selected value for λ results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error given the uncertainty in the k-fold CV estimate of the error of the best model.
** The '''lambda.min''' option refers to value of λ at the lowest CV error. The error at this value of λ is the average of the errors over the k folds and hence this estimate of the error is uncertain.
* https://www.rdocumentation.org/packages/glmnet/versions/2.0-10/topics/glmnet
* [http://blog.revolutionanalytics.com/2016/11/glmnetutils.html glmnetUtils: quality of life enhancements for elastic net regression with glmnet]
* Mixing parameter: alpha=1 is the '''lasso penalty''', and alpha=0 the '''ridge penalty''' and anything between 0–1 is '''Elastic net'''.
** RIdge regression uses Euclidean distance/L2-norm as the penalty. It won't remove any variables.
** Lasso uses L1-norm as the penalty. Some of the coefficients may be shrunk exactly to zero.
* [https://www.quora.com/In-ridge-regression-and-lasso-what-is-lambda In ridge regression and lasso, what is lambda?]
** Lambda is a penalty coefficient. Large lambda will shrink the coefficients.
** cv.glment()$lambda.1se gives the most regularized model such that error is within one standard error of the minimum
* cv.glmnet() has a logical parameter '''parallel''' which is useful if a cluster or cores have been previously allocated
* [http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf Ridge regression and the LASSO]
* Standard error/Confidence interval
** [https://www.reddit.com/r/statistics/comments/1vg8k0/standard_errors_in_glmnet/ Standard Errors in GLMNET] and [https://stackoverflow.com/questions/39750965/confidence-intervals-for-ridge-regression Confidence intervals for Ridge regression]
** '''[https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf#page=18 Why SEs are not meaningful and are usually not provided in penalized regression?]'''
**# Hint:  standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods.
**# '''Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias.'''
**# The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
**# Any bootstrap-based calculations can only give an assessment of the variance of the estimates.
**# Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
** [https://stats.stackexchange.com/tags/glmnet/hot Hottest glmnet questions from stackexchange].
** [https://stats.stackexchange.com/questions/91462/standard-errors-for-lasso-prediction-using-r Standard errors for lasso prediction]. There might not be a consensus on a statistically valid method of calculating standard errors for the lasso predictions.
** [https://www4.stat.ncsu.edu/~lu/programcodes.html Code] for Adaptive-Lasso for Cox's proportional hazards model by Zhang & Lu (2007). This can compute the SE of estimates. The weights are originally based on the maximizers of the log partial likelihood. However, the beta may not be estimable in cases such as high-dimensional gene data, or the beta may be unstable if strong collinearity exists among covariates. In such cases, robust estimators such as ridge regression estimators would be used to determine the weights.
* LASSO vs Least angle regression
** https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
** [http://web.stanford.edu/~hastie/TALKS/larstalk.pdf Least Angle Regression, Forward Stagewise and the Lasso]
** https://www.quora.com/What-is-Least-Angle-Regression-and-when-should-it-be-used
** [http://statweb.stanford.edu/~tibs/lasso/simple.html A simple explanation of the Lasso and Least Angle Regression]
** https://stats.stackexchange.com/questions/4663/least-angle-regression-vs-lasso
** https://cran.r-project.org/web/packages/lars/index.html
* '''Oracle property''' and '''adaptive lasso'''
** [http://www.stat.wisc.edu/~shao/stat992/fan-li2001.pdf Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties], Fan & Li (2001) JASA
** [http://ricardoscr.github.io/how-to-adaptive-lasso.html Adaptive Lasso: What it is and how to implement in R]. Adaptive lasso weeks to minimize <math> RSS(\beta) + \lambda \sum_1^p \hat{\omega}_j |\beta_j| </math> where <math>\lambda</math> is the tuning parameter, <math>\hat{\omega}_j = \frac{1}{(|\hat{\beta}_j^{ini}|)^\gamma}</math> is the adaptive weights vector and <math>\hat{\beta}_j^{ini}</math> is an initial estimate of the coefficients obtained through ridge regression. '''Adaptive Lasso ends up penalizing more those coefficients with lower initial estimates.''' <math>\gamma</math> is a positive constant for adjustment of the adaptive weight vector, and the authors suggest the possible values of 0.5, 1 and 2.
** When n goes to infinity, <math>\hat{\omega}_j |\beta_j|  </math> converges to <math>I(\beta_j \neq 0) </math>. So the adaptive Lasso procedure can be regarded as an automatic implementation of best-subset selection in some asymptotic sense.
** [https://stats.stackexchange.com/questions/229142/what-is-the-oracle-property-of-an-estimator What is the oracle property of an estimator?] An oracle estimator must be consistent in 1) '''variable selection''' and 2) '''consistent parameter estimation'''.
** (Linear regression) The adaptive lasso and its oracle properties Zou (2006, JASA)
** (Cox model) Adaptive-LASSO for Cox's proportional hazard model by Zhang and Lu (2007, Biometrika)
**[https://insightr.wordpress.com/2017/06/14/when-the-lasso-fails/ When the LASSO fails???]. Adaptive lasso is used to demonstrate its usefulness.
* [https://statisticaloddsandends.wordpress.com/2018/11/13/a-deep-dive-into-glmnet-penalty-factor/ A deep dive into glmnet: penalty.factor], [https://statisticaloddsandends.wordpress.com/2018/11/15/a-deep-dive-into-glmnet-standardize/ standardize], [https://statisticaloddsandends.wordpress.com/2019/01/09/a-deep-dive-into-glmnet-offset/ offset]
** Lambda sequence is not affected by the "penalty.factor"
** How "penalty.factor" used by the objective function may need to be corrected
* Some issues:
** With group of highly correlated features, Lasso tends to select amongst them arbitrarily.
** Often empirically ridge has better predictive performance than lasso but lasso leads to sparser solution
** Elastic-net (Zou & Hastie '05) aims to address these issues: hybrid between Lasso and ridge regression, uses L1 and L2 penalties.
* [https://statcompute.wordpress.com/2019/02/23/gradient-free-optimization-for-glmnet-parameters/ Gradient-Free Optimization for GLMNET Parameters]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2656-1 Gsslasso Cox]: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information, Tang et al BMC Bioinformatics 2019
 
=== Lasso logistic regression ===
https://freakonometrics.hypotheses.org/52894
 
=== Lagrange Multipliers ===
[https://medium.com/@andrew.chamberlain/a-simple-explanation-of-why-lagrange-multipliers-works-253e2cdcbf74 A Simple Explanation of Why Lagrange Multipliers Works]
 
=== How to solve lasso/convex optimization ===
* [https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf Convex Optimization] by Boyd S, Vandenberghe L, Cambridge 2004. It is cited by Zhang & Lu (2007). The '''interior point algorithm''' can be used to solve the optimization problem in adaptive lasso.
* Review of '''gradient descent''':
** Finding maximum: <math>w^{(t+1)} = w^{(t)} + \eta \frac{dg(w)}{dw}</math>, where <math>\eta</math> is stepsize.
** Finding minimum: <math>w^{(t+1)} = w^{(t)} - \eta \frac{dg(w)}{dw}</math>.
** [https://stackoverflow.com/questions/12066761/what-is-the-difference-between-gradient-descent-and-newtons-gradient-descent What is the difference between Gradient Descent and Newton's Gradient Descent?] Newton's method requires <math>g''(w)</math>, more smoothness of g(.).
** Finding minimum for multiple variables ('''gradient descent'''): <math>w^{(t+1)} = w^{(t)} - \eta \Delta g(w^{(t)})</math>. For the least squares problem, <math>g(w) = RSS(w)</math>.
** Finding minimum for multiple variables in the least squares problem (minimize <math>RSS(w)</math>):  <math>\text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = w_j^{(t)} - \eta \; \text{partial}(j)</math>
** Finding minimum for multiple variables in the ridge regression problem (minimize <math>RSS(w)+\lambda ||w||_2^2=(y-Hw)'(y-Hw)+\lambda w'w</math>): <math>\text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)</math>. Compared to the closed form approach: <math>\hat{w} = (H'H + \lambda I)^{-1}H'y</math> where 1. the inverse exists even N<D as long as <math>\lambda > 0</math> and 2. the complexity of inverse is <math>O(D^3)</math>, D is the dimension of the covariates.
* '''Cyclical coordinate descent''' was used ([https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf#page=1 vignette]) in the glmnet package. See also '''[https://en.wikipedia.org/wiki/Coordinate_descent coordinate descent]'''. The reason we call it 'descent' is because we want to 'minimize' an objective function.
** <math>\hat{w}_j = \min_w g(\hat{w}_1, \cdots, \hat{w}_{j-1},w, \hat{w}_{j+1}, \cdots, \hat{w}_D)</math>
** See [https://www.jstatsoft.org/article/view/v033i01 paper] on JSS 2010. The Cox PHM case also uses the cyclical coordinate descent method; see the [https://www.jstatsoft.org/article/view/v039i05 paper] on JSS 2011.
** Coursera's [https://www.coursera.org/learn/ml-regression/lecture/rb179/feature-selection-lasso-and-nearest-neighbor-regression Machine learning course 2: Regression] at 1:42. [http://web.stanford.edu/~hastie/TALKS/CD.pdf#page=12 Soft-thresholding] the coefficients is the key for the L1 penalty. The range for the thresholding is controlled by <math>\lambda</math>. Note to view the videos and all materials in coursera we can enroll to audit the course without starting a trial.
** No step size is required as in gradient descent.
** [https://sandipanweb.wordpress.com/2017/05/04/implementing-lasso-regression-with-coordinate-descent-and-the-sub-gradient-of-the-l1-penalty-with-soft-thresholding/ Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python]
** Coordinate descent in the least squares problem: <math>\frac{\partial}{\partial w_j} RSS(w)= -2 \rho_j + 2 w_j</math>; i.e. <math>\hat{w}_j = \rho_j</math>.
** Coordinate descent in the Lasso problem (for normalized features): <math>
\hat{w}_j =
\begin{cases}
\rho_j + \lambda/2, & \text{if }\rho_j < -\lambda/2 \\
0, & \text{if } -\lambda/2 \le \rho_j \le \lambda/2\\
\rho_j- \lambda/2, & \text{if }\rho_j > \lambda/2
\end{cases}
</math>

Revision as of 22:43, 7 September 2019

File:Glmnetplot.svg File:Glmnet path.svg File:Glmnet l1norm.svg

set.seed(1010)
n=1000;p=100
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
beta=rnorm(nzc)
fx= x[,seq(nzc)] %*% beta
eps=rnorm(n)*5
y=drop(fx+eps)
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)

## Full lasso
set.seed(999)
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)
plot(cv.full)  # cross-validation curve and two lambda's
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44

## Ridge Regression to create the Adaptive Weights Vector
set.seed(999)
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)
                   [, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
## Adaptive Lasso using the 'penalty.factor' argument
set.seed(999)
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
# defautl type.measure="deviance" for logistic regression
plot(cv.lasso)
log(cv.lasso$lambda.min) # -2.995375
log(cv.lasso$lambda.1se) # -0.7625655
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34

Lasso logistic regression

https://freakonometrics.hypotheses.org/52894

Lagrange Multipliers

A Simple Explanation of Why Lagrange Multipliers Works

How to solve lasso/convex optimization

  • Convex Optimization by Boyd S, Vandenberghe L, Cambridge 2004. It is cited by Zhang & Lu (2007). The interior point algorithm can be used to solve the optimization problem in adaptive lasso.
  • Review of gradient descent:
    • Finding maximum: \(\displaystyle w^{(t+1)} = w^{(t)} + \eta \frac{dg(w)}{dw}\), where \(\displaystyle \eta\) is stepsize.
    • Finding minimum: \(\displaystyle w^{(t+1)} = w^{(t)} - \eta \frac{dg(w)}{dw}\).
    • What is the difference between Gradient Descent and Newton's Gradient Descent? Newton's method requires \(\displaystyle g''(w)\), more smoothness of g(.).
    • Finding minimum for multiple variables (gradient descent): \(\displaystyle w^{(t+1)} = w^{(t)} - \eta \Delta g(w^{(t)})\). For the least squares problem, \(\displaystyle g(w) = RSS(w)\).
    • Finding minimum for multiple variables in the least squares problem (minimize \(\displaystyle RSS(w)\)): \(\displaystyle \text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = w_j^{(t)} - \eta \; \text{partial}(j)\)
    • Finding minimum for multiple variables in the ridge regression problem (minimize \(\displaystyle RSS(w)+\lambda ||w||_2^2=(y-Hw)'(y-Hw)+\lambda w'w\)): \(\displaystyle \text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)\). Compared to the closed form approach: \(\displaystyle \hat{w} = (H'H + \lambda I)^{-1}H'y\) where 1. the inverse exists even N<D as long as \(\displaystyle \lambda \gt 0\) and 2. the complexity of inverse is \(\displaystyle O(D^3)\), D is the dimension of the covariates.
  • Cyclical coordinate descent was used (vignette) in the glmnet package. See also coordinate descent. The reason we call it 'descent' is because we want to 'minimize' an objective function.
    • \(\displaystyle \hat{w}_j = \min_w g(\hat{w}_1, \cdots, \hat{w}_{j-1},w, \hat{w}_{j+1}, \cdots, \hat{w}_D)\)
    • See paper on JSS 2010. The Cox PHM case also uses the cyclical coordinate descent method; see the paper on JSS 2011.
    • Coursera's Machine learning course 2: Regression at 1:42. Soft-thresholding the coefficients is the key for the L1 penalty. The range for the thresholding is controlled by \(\displaystyle \lambda\). Note to view the videos and all materials in coursera we can enroll to audit the course without starting a trial.
    • No step size is required as in gradient descent.
    • Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python
    • Coordinate descent in the least squares problem: \(\displaystyle \frac{\partial}{\partial w_j} RSS(w)= -2 \rho_j + 2 w_j\); i.e. \(\displaystyle \hat{w}_j = \rho_j\).
    • Coordinate descent in the Lasso problem (for normalized features): \(\displaystyle \hat{w}_j = \begin{cases} \rho_j + \lambda/2, & \text{if }\rho_j \lt -\lambda/2 \\ 0, & \text{if } -\lambda/2 \le \rho_j \le \lambda/2\\ \rho_j- \lambda/2, & \text{if }\rho_j \gt \lambda/2 \end{cases} \)