Testmathj: Difference between revisions

Revision as of 22:43, 7 September 2019

Figures: Glmnetplot.svg, Glmnet path.svg, Glmnet l1norm.svg

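library(glmnet)
library(doParallel)
registerDoParallel(cores = 2)  # parallel=TRUE below needs a registered foreach backend
set.seed(1010)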
n=1000;p=100
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
beta=rnorm(nzc)
fx= x[,seq(nzc)] %*% beta
eps=rnorm(n)*5
y=drop(fx+eps)
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)

## Full lasso
set.seed(999)
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)
plot(cv.full)  # cross-validation curve and two lambda's
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44

## Ridge Regression to create the Adaptive Weights Vector
set.seed(999)
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)
                   [, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
## Adaptive Lasso using the 'penalty.factor' argument
set.seed(999)
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
# default type.measure="deviance" for logistic regression
plot(cv.lasso)
log(cv.lasso$lambda.min) # -2.995375
log(cv.lasso$lambda.1se) # -0.7625655
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34

A list of potential lambdas: see the Linear Regression case. The λ sequence is determined by lambda.max and lambda.min.ratio. The latter (default is ifelse(nobs<nvars,0.01,0.0001)) is the ratio of the smallest value of the generated λ sequence (say lambda.min; this is the end of the path, not the cross-validated cv.glmnet()$lambda.min) to lambda.max. The program then generates nlambda values, equally spaced on the log scale, from lambda.max down to lambda.min. lambda.max is not supplied by the user but is easily computed from the input x and y; it is the smallest value of lambda for which all the coefficients are zero.
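The construction of the λ sequence can be sketched in base R. This is a minimal illustration, not glmnet's own code: it assumes the Gaussian lasso case (alpha = 1) with columns standardized using the 1/n variance convention, for which lambda.max is commonly written as max_j |x_j'y| / n. The function name lambda_path and the demo data are hypothetical.

```r
# Sketch of a lambda path for a Gaussian lasso (alpha = 1).
# Assumption: columns of x are centered and scaled with the 1/n variance convention.
lambda_path <- function(x, y, nlambda = 100,
                        lambda.min.ratio = ifelse(nrow(x) < ncol(x), 0.01, 0.0001)) {
  n  <- nrow(x)
  xc <- sweep(x, 2, colMeans(x))                     # center each column
  xs <- sweep(xc, 2, sqrt(colSums(xc^2) / n), "/")   # scale each column to unit 1/n variance
  yc <- y - mean(y)                                  # center the response
  lambda_max <- max(abs(crossprod(xs, yc))) / n      # smallest lambda giving all-zero coefficients
  # nlambda values, equally spaced on the log scale, from lambda_max downwards
  exp(seq(log(lambda_max), log(lambda_max * lambda.min.ratio), length.out = nlambda))
}

set.seed(1)
x_demo <- matrix(rnorm(50 * 4), 50, 4)               # made-up data for illustration
y_demo <- drop(x_demo %*% c(2, -1, 0, 0) + rnorm(50))
lams <- lambda_path(x_demo, y_demo, nlambda = 5)
print(lams)
```

The first element is lambda.max and the last is lambda.max * lambda.min.ratio; consecutive values have a constant ratio.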
Choosing hyper-parameters (α and λ) in penalized regression by Florian Privé
lambda.min vs lambda.1se
- lambda.1se is the largest value of λ in the search (hence the simplest model) whose CV error is within 1 standard error of the error of the best model (lambda.min). In other words, using lambda.1se as the selected value for λ results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error, given the uncertainty in the k-fold CV estimate of the error of the best model.
- The lambda.min option refers to the value of λ at the lowest CV error. The error at this value of λ is the average of the errors over the k folds, and hence this estimate of the error is uncertain.
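The two selection rules above can be written out on toy numbers. This is a minimal sketch: pick_lambdas is a hypothetical helper, and the lambda, cvm (mean CV error) and cvsd (its standard error) vectors are made up; with a real fit they would come from the corresponding components of a cv.glmnet object.

```r
# Sketch of the lambda.min / lambda.1se rules on made-up cross-validation results.
pick_lambdas <- function(lambda, cvm, cvsd) {
  i_min      <- which.min(cvm)               # index of the lowest mean CV error
  lambda_min <- lambda[i_min]
  threshold  <- cvm[i_min] + cvsd[i_min]     # minimum error plus one standard error
  lambda_1se <- max(lambda[cvm <= threshold])  # most regularized model within the threshold
  list(lambda.min = lambda_min, lambda.1se = lambda_1se)
}

lambda <- c(1, 0.5, 0.25, 0.125, 0.0625)     # hypothetical lambda path
cvm    <- c(2.0, 1.5, 1.2, 1.1, 1.15)        # hypothetical mean CV errors
cvsd   <- c(0.2, 0.2, 0.15, 0.15, 0.15)      # hypothetical standard errors
sel <- pick_lambdas(lambda, cvm, cvsd)
print(sel)   # lambda.min = 0.125, lambda.1se = 0.25
```

Here the minimum error 1.1 occurs at λ = 0.125, the threshold is 1.1 + 0.15 = 1.25, and the largest λ with error at most 1.25 is 0.25.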
https://www.rdocumentation.org/packages/glmnet/versions/2.0-10/topics/glmnet
glmnetUtils: quality of life enhancements for elastic net regression with glmnet
Mixing parameter: alpha=1 is the lasso penalty, alpha=0 is the ridge penalty, and anything strictly between 0 and 1 is the elastic net.
- Ridge regression uses the squared Euclidean (L2) norm of the coefficients as the penalty. It shrinks coefficients but won't remove any variables.
- Lasso uses L1-norm as the penalty. Some of the coefficients may be shrunk exactly to zero.
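For reference, the way alpha mixes the two penalties can be written out for the Gaussian case, in the form given in the glmnet documentation. The symbols N (number of observations), $\beta_0$ (intercept) and $x_i$ (covariate vector of observation i) are notation introduced here.

```latex
\min_{\beta_0,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{T}\beta\bigr)^2
  \;+\; \lambda\Bigl[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting $\alpha=1$ leaves only the L1 term (lasso) and $\alpha=0$ leaves only the squared L2 term (ridge).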
In ridge regression and lasso, what is lambda?
- Lambda is a penalty coefficient. The larger lambda is, the more the coefficients are shrunk toward zero.
- cv.glmnet()$lambda.1se gives the most regularized model such that error is within one standard error of the minimum
cv.glmnet() has a logical parameter parallel, which is useful only if a parallel backend (a cluster or cores) has been registered beforehand
Ridge regression and the LASSO
Standard error/Confidence interval
- Standard Errors in GLMNET and Confidence intervals for Ridge regression
- Why SEs are not meaningful and are usually not provided in penalized regression?
  1. Hint: standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods.
  2. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias.
  3. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
  4. Any bootstrap-based calculations can only give an assessment of the variance of the estimates.
  5. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
- Hottest glmnet questions from stackexchange.
- Standard errors for lasso prediction. There might not be a consensus on a statistically valid method of calculating standard errors for the lasso predictions.
- Code for Adaptive-Lasso for Cox's proportional hazards model by Zhang & Lu (2007). This can compute the SE of estimates. The weights are originally based on the maximizers of the log partial likelihood. However, the beta may not be estimable in cases such as high-dimensional gene data, or the beta may be unstable if strong collinearity exists among covariates. In such cases, robust estimators such as ridge regression estimators would be used to determine the weights.
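The argument above about why standard errors are not meaningful for penalized estimates rests on the usual decomposition of mean squared error, written here for a single coefficient estimate $\hat{\beta}_j$ of the true value $\beta_j$:

```latex
\mathrm{MSE}(\hat{\beta}_j)
  = E\bigl[(\hat{\beta}_j-\beta_j)^2\bigr]
  = \underbrace{\mathrm{Var}(\hat{\beta}_j)}_{\text{what a bootstrap can assess}}
  \;+\; \underbrace{\bigl(E[\hat{\beta}_j]-\beta_j\bigr)^2}_{\text{squared bias}}
```

Penalization trades a smaller first term for a larger second term, so a standard error, which reflects only the first term, can badly understate the total error.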
LASSO vs Least angle regression
Oracle property and adaptive lasso
- Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties, Fan & Li (2001) JASA
- Adaptive Lasso: What it is and how to implement in R. Adaptive lasso seeks to minimize $\displaystyle RSS(\beta) + \lambda \sum_1^p \hat{\omega}_j |\beta_j| $ where $\displaystyle \lambda$ is the tuning parameter, $\displaystyle \hat{\omega}_j = \frac{1}{(|\hat{\beta}_j^{ini}|)^\gamma}$ is the adaptive weights vector and $\displaystyle \hat{\beta}_j^{ini}$ is an initial estimate of the coefficients obtained through ridge regression. Adaptive Lasso ends up penalizing more heavily those coefficients with smaller initial estimates. $\displaystyle \gamma$ is a positive constant for adjustment of the adaptive weight vector, and the authors suggest the possible values of 0.5, 1 and 2.
- When n goes to infinity, $\displaystyle \hat{\omega}_j |\beta_j| $ converges to $\displaystyle I(\beta_j \neq 0) $. So the adaptive Lasso procedure can be regarded as an automatic implementation of best-subset selection in some asymptotic sense.
- What is the oracle property of an estimator? An oracle estimator must be consistent in 1) variable selection and 2) parameter estimation.
- (Linear regression) The adaptive lasso and its oracle properties Zou (2006, JASA)
- (Cox model) Adaptive-LASSO for Cox's proportional hazard model by Zhang and Lu (2007, Biometrika)
- When the LASSO fails???. Adaptive lasso is used to demonstrate its usefulness.
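The adaptive weights $\hat{\omega}_j = 1/|\hat{\beta}_j^{ini}|^\gamma$ described above can be illustrated in base R. This is a minimal sketch under stated assumptions: it uses the closed-form ridge estimate for a linear model (not the logistic ridge fit used in the glmnet example on this page), and adaptive_weights, ridge_lambda and the demo data are hypothetical names and values.

```r
# Sketch: adaptive lasso weights from a ridge initial estimate (linear model case).
adaptive_weights <- function(x, y, ridge_lambda = 1, gamma = 1) {
  xc <- scale(x, center = TRUE, scale = FALSE)   # center columns so no intercept is needed
  yc <- y - mean(y)
  # closed-form ridge estimate used as the initial estimate beta_ini
  beta_ini <- drop(solve(crossprod(xc) + ridge_lambda * diag(ncol(xc)),
                         crossprod(xc, yc)))
  1 / abs(beta_ini)^gamma                        # small initial estimates get large penalties
}

set.seed(2)
x_demo <- matrix(rnorm(200 * 4), 200, 4)         # made-up data: two signals, two noise columns
y_demo <- drop(x_demo %*% c(3, 0.5, 0, 0) + rnorm(200))
w1 <- adaptive_weights(x_demo, y_demo, gamma = 1)
w2 <- adaptive_weights(x_demo, y_demo, gamma = 2)
print(round(w1, 2))
```

The resulting vector is what would be passed as penalty.factor; the column with the largest true coefficient receives the smallest weight, and gamma = 2 squares the gamma = 1 weights.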
A deep dive into glmnet: penalty.factor, standardize, offset
- Lambda sequence is not affected by the "penalty.factor"
- The description of how "penalty.factor" is used by the objective function may need to be corrected
Some issues:
- With a group of highly correlated features, lasso tends to select among them arbitrarily.
- Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution.
- Elastic-net (Zou & Hastie '05) aims to address these issues: hybrid between Lasso and ridge regression, uses L1 and L2 penalties.
Gradient-Free Optimization for GLMNET Parameters
Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information, Tang et al BMC Bioinformatics 2019

Lasso logistic regression

https://freakonometrics.hypotheses.org/52894

Lagrange Multipliers

A Simple Explanation of Why Lagrange Multipliers Works

How to solve lasso/convex optimization

Convex Optimization by Boyd S, Vandenberghe L, Cambridge 2004. It is cited by Zhang & Lu (2007). The interior point algorithm can be used to solve the optimization problem in adaptive lasso.
Review of gradient descent:
- Finding maximum: $\displaystyle w^{(t+1)} = w^{(t)} + \eta \frac{dg(w)}{dw}$, where $\displaystyle \eta$ is stepsize.
- Finding minimum: $\displaystyle w^{(t+1)} = w^{(t)} - \eta \frac{dg(w)}{dw}$.
- What is the difference between Gradient Descent and Newton's Gradient Descent? Newton's method also uses the second derivative $\displaystyle g''(w)$, so it requires more smoothness of g(.).
- Finding minimum for multiple variables (gradient descent): $\displaystyle w^{(t+1)} = w^{(t)} - \eta \nabla g(w^{(t)})$. For the least squares problem, $\displaystyle g(w) = RSS(w)$.
- Finding minimum for multiple variables in the least squares problem (minimize $\displaystyle RSS(w)$): $\displaystyle \text{partial}(j) = -2\sum_i h_j(x_i)(y_i - \hat{y}_i(w^{(t)})), \quad w_j^{(t+1)} = w_j^{(t)} - \eta \; \text{partial}(j)$
- Finding minimum for multiple variables in the ridge regression problem (minimize $\displaystyle RSS(w)+\lambda ||w||_2^2=(y-Hw)'(y-Hw)+\lambda w'w$): $\displaystyle \text{partial}(j) = -2\sum_i h_j(x_i)(y_i - \hat{y}_i(w^{(t)})), \quad w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)$. Compared to the closed form approach: $\displaystyle \hat{w} = (H'H + \lambda I)^{-1}H'y$ where 1. the inverse exists even when N<D as long as $\displaystyle \lambda \gt 0$ and 2. the complexity of the inverse is $\displaystyle O(D^3)$, where D is the dimension of the covariates.
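The ridge update $w_j^{(t+1)} = (1-2\eta\lambda) w_j^{(t)} - \eta\,\text{partial}(j)$ can be checked numerically against the closed form $(H'H + \lambda I)^{-1}H'y$. The following is a minimal base R sketch on made-up data; ridge_gd is a hypothetical helper, and the step size is chosen from the largest eigenvalue of $H'H$ so that the iteration converges (an assumption of this sketch, not something prescribed above).

```r
# Sketch: gradient descent for ridge regression, compared with the closed form.
ridge_gd <- function(H, y, lambda, eta, n_iter = 2000) {
  w <- rep(0, ncol(H))
  for (t in seq_len(n_iter)) {
    resid   <- y - drop(H %*% w)                     # y_i - yhat_i(w^(t))
    partial <- -2 * drop(crossprod(H, resid))        # partial(j) for all j at once
    w <- (1 - 2 * eta * lambda) * w - eta * partial  # ridge update from the text
  }
  w
}

set.seed(3)
H <- matrix(rnorm(50 * 3), 50, 3)                    # made-up feature matrix
y <- drop(H %*% c(1, -2, 0.5) + rnorm(50, sd = 0.3))
lambda <- 2
eta <- 0.5 / (max(eigen(crossprod(H))$values) + lambda)  # conservative step size
w_gd     <- ridge_gd(H, y, lambda, eta)
w_closed <- drop(solve(crossprod(H) + lambda * diag(ncol(H)), crossprod(H, y)))
print(rbind(w_gd, w_closed))
```

Setting lambda = 0 in the same function gives plain least squares gradient descent, since the factor $(1-2\eta\lambda)$ becomes 1.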
Cyclical coordinate descent is used (vignette) in the glmnet package. See also coordinate descent. It is called 'descent' because the goal is to minimize an objective function.
- $\displaystyle \hat{w}_j = \arg\min_w g(\hat{w}_1, \cdots, \hat{w}_{j-1},w, \hat{w}_{j+1}, \cdots, \hat{w}_D)$
- See paper on JSS 2010. The Cox PHM case also uses the cyclical coordinate descent method; see the paper on JSS 2011.
- Coursera's Machine learning course 2: Regression at 1:42. Soft-thresholding the coefficients is the key for the L1 penalty. The range for the thresholding is controlled by $\displaystyle \lambda$. Note: to view the videos and all materials on Coursera, one can enroll to audit the course without starting a trial.
- Unlike gradient descent, no step size is required.
- Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python
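- Coordinate descent in the least squares problem (for normalized features, $\displaystyle \sum_i h_j(x_i)^2 = 1$): $\displaystyle \frac{\partial}{\partial w_j} RSS(w)= -2 \rho_j + 2 w_j$, where $\displaystyle \rho_j = \sum_i h_j(x_i)(y_i - \hat{y}_i(w_{-j}))$ is computed from the residual that leaves out feature j; setting the derivative to zero gives $\displaystyle \hat{w}_j = \rho_j$.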
- Coordinate descent in the Lasso problem (for normalized features): $\displaystyle \hat{w}_j = \begin{cases} \rho_j + \lambda/2, & \text{if }\rho_j \lt -\lambda/2 \\ 0, & \text{if } -\lambda/2 \le \rho_j \le \lambda/2\\ \rho_j- \lambda/2, & \text{if }\rho_j \gt \lambda/2 \end{cases} $
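The soft-thresholding update can be turned into a short cyclical coordinate descent loop. This is a minimal base R sketch for the objective $RSS(w) + \lambda ||w||_1$ with features normalized to unit Euclidean norm, where $\rho_j = \sum_i h_j(x_i)(y_i - \hat{y}_i(w_{-j}))$ is computed from the residual that excludes feature j. The names soft_threshold and lasso_cd, the fixed number of sweeps, and the demo data are assumptions of this sketch; glmnet's own implementation differs in scaling and in many optimizations.

```r
# Sketch: cyclical coordinate descent for the lasso with normalized features.
soft_threshold <- function(rho, lambda) {
  sign(rho) * pmax(abs(rho) - lambda / 2, 0)   # the three-case formula in one line
}

lasso_cd <- function(H, y, lambda, n_sweeps = 200) {
  w <- rep(0, ncol(H))
  for (sweep_id in seq_len(n_sweeps)) {
    for (j in seq_along(w)) {
      r_j   <- y - drop(H[, -j, drop = FALSE] %*% w[-j])  # residual without feature j
      rho_j <- sum(H[, j] * r_j)
      w[j]  <- soft_threshold(rho_j, lambda)              # no step size needed
    }
  }
  w
}

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)                       # made-up data
H <- sweep(X, 2, sqrt(colSums(X^2)), "/")                 # normalize columns to unit norm
y <- drop(H %*% c(3, -2, 0, 0, 0) + rnorm(100, sd = 0.1))

w_ols <- lasso_cd(H, y, lambda = 0)                       # lambda = 0: least squares
w_mid <- lasso_cd(H, y, lambda = 1)                       # some shrinkage
w_big <- lasso_cd(H, y, lambda = 2 * max(abs(crossprod(H, y))))  # all coefficients zero
print(rbind(w_ols, w_mid, w_big))
```

With lambda = 0 the update reduces to $\hat{w}_j = \rho_j$ and the loop converges to the least squares solution; once $\lambda/2 \ge \max_j |\rho_j|$ at $w = 0$, every coefficient stays at zero.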