Testmathj


Figures: glmnet cross-validation plot (Glmnetplot.svg), coefficient path plot (Glmnet path.svg), and L1-norm plot (Glmnet l1norm.svg).

library(glmnet)
library(doParallel)   # parallel=TRUE below needs a registered parallel backend (doParallel is one option)
registerDoParallel(cores = 2)

set.seed(1010)
n=1000;p=100
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
beta=rnorm(nzc)
fx= x[,seq(nzc)] %*% beta
eps=rnorm(n)*5
y=drop(fx+eps)
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)

## Full lasso
set.seed(999)
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)
plot(cv.full)  # cross-validation curve with the two reference lambdas (lambda.min and lambda.1se)
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44

## Ridge Regression to create the Adaptive Weights Vector
set.seed(999)
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
wt <- 1 / abs(coef(cv.ridge, s = cv.ridge$lambda.min)[, 1][2:(ncol(x) + 1)])^1  ## Using gamma = 1; exclude the intercept
## Adaptive Lasso using the 'penalty.factor' argument
set.seed(999)
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
# default type.measure = "deviance" for logistic regression
plot(cv.lasso)
log(cv.lasso$lambda.min) # -2.995375
log(cv.lasso$lambda.1se) # -0.7625655
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34
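
A small follow-up sketch (assuming cv.lasso from the block above is still in the workspace): the variables selected at lambda.min can be read off the coefficient matrix.

## Sketch: list the predictors selected by the adaptive lasso at lambda.min
sel <- coef(cv.lasso, s = "lambda.min")                    # sparse column matrix of coefficients
selected <- setdiff(rownames(sel)[sel[, 1] != 0], "(Intercept)")
selected           # names of the nonzero predictors
length(selected)   # compare with the count above (which also includes a nonzero intercept)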

Lasso logistic regression

https://freakonometrics.hypotheses.org/52894

Lagrange Multipliers

A Simple Explanation of Why Lagrange Multipliers Works
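
In the lasso context, the multiplier is what links the constrained and penalized formulations: the constrained problem \(\displaystyle \min_w RSS(w)\) subject to \(\displaystyle ||w||_1 \le t\) has Lagrangian \(\displaystyle RSS(w) + \lambda(||w||_1 - t)\), and for a suitable \(\displaystyle \lambda \ge 0\) (which depends on \(\displaystyle t\)) its minimizer in \(\displaystyle w\) is the same as that of the familiar penalized objective \(\displaystyle RSS(w) + \lambda ||w||_1\).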

How to solve lasso/convex optimization

  • Convex Optimization by S. Boyd and L. Vandenberghe, Cambridge University Press, 2004. It is cited by Zhang & Lu (2007). The interior-point algorithm can be used to solve the optimization problem in the adaptive lasso.
  • Review of gradient descent:
    • Finding a maximum: \(\displaystyle w^{(t+1)} = w^{(t)} + \eta \frac{dg(w)}{dw}\), where \(\displaystyle \eta\) is the step size.
    • Finding a minimum: \(\displaystyle w^{(t+1)} = w^{(t)} - \eta \frac{dg(w)}{dw}\).
    • What is the difference between gradient descent and Newton's method? Newton's method requires the second derivative \(\displaystyle g''(w)\), i.e. more smoothness of \(\displaystyle g(\cdot)\).
    • Finding a minimum over multiple variables (gradient descent): \(\displaystyle w^{(t+1)} = w^{(t)} - \eta \nabla g(w^{(t)})\). For the least squares problem, \(\displaystyle g(w) = RSS(w)\).
    • Finding a minimum over multiple variables in the least squares problem (minimize \(\displaystyle RSS(w)\)): \(\displaystyle \text{partial}(j) = -2\sum_i h_j(x_i)(y_i - \hat{y}_i(w^{(t)})), \quad w_j^{(t+1)} = w_j^{(t)} - \eta \; \text{partial}(j)\).
    • Finding a minimum over multiple variables in the ridge regression problem (minimize \(\displaystyle RSS(w)+\lambda ||w||_2^2=(y-Hw)'(y-Hw)+\lambda w'w\)): \(\displaystyle \text{partial}(j) = -2\sum_i h_j(x_i)(y_i - \hat{y}_i(w^{(t)})), \quad w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)\). A small R sketch of this update appears after this list. Compare with the closed-form approach \(\displaystyle \hat{w} = (H'H + \lambda I)^{-1}H'y\), where (1) the inverse exists even when \(\displaystyle N \lt D\) as long as \(\displaystyle \lambda \gt 0\), and (2) computing the inverse costs \(\displaystyle O(D^3)\), with \(\displaystyle D\) the dimension of the covariates.
  • Cyclical coordinate descent was used (vignette) in the glmnet package. See also coordinate descent. It is called 'descent' because the objective function is being minimized.
    • \(\displaystyle \hat{w}_j = \arg\min_w g(\hat{w}_1, \cdots, \hat{w}_{j-1},w, \hat{w}_{j+1}, \cdots, \hat{w}_D)\)
    • See the paper in JSS 2010. The Cox proportional hazards case also uses the cyclical coordinate descent method; see the paper in JSS 2011.
    • Coursera's Machine Learning course 2: Regression (at 1:42). Soft-thresholding the coefficients is the key to handling the L1 penalty; the thresholding range is controlled by \(\displaystyle \lambda\). A coordinate-descent sketch using this soft-thresholding appears after this list. Note that to view the videos and all materials on Coursera, you can enroll to audit the course without starting a trial.
    • No step size is required as in gradient descent.
    • Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python
    • Coordinate descent in the least squares problem (for normalized features): \(\displaystyle \frac{\partial}{\partial w_j} RSS(w)= -2 \rho_j + 2 w_j\), where \(\displaystyle \rho_j = \sum_i h_j(x_i)\left(y_i - \hat{y}_i^{(-j)}(w)\right)\) is computed with feature \(\displaystyle j\) left out of the prediction; setting the derivative to zero gives \(\displaystyle \hat{w}_j = \rho_j\).
    • Coordinate descent in the Lasso problem (for normalized features): \(\displaystyle \hat{w}_j = \begin{cases} \rho_j + \lambda/2, & \text{if }\rho_j \lt -\lambda/2 \\ 0, & \text{if } -\lambda/2 \le \rho_j \le \lambda/2\\ \rho_j- \lambda/2, & \text{if }\rho_j \gt \lambda/2 \end{cases} \)
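
To make the gradient-descent bullets above concrete, here is a minimal R sketch of the ridge update \(\displaystyle w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)\), run on the simulated x and y from the first code block and checked against the closed form \(\displaystyle (H'H + \lambda I)^{-1}H'y\). The step size, the number of iterations, and lambda = 10 are illustrative choices, not values from the text.

## Gradient descent for ridge regression (sketch); x and y come from the
## simulation block at the top; eta, iters and lambda are illustrative choices
ridge_gd <- function(H, y, lambda, eta = 1e-4, iters = 5000) {
  w <- rep(0, ncol(H))
  for (it in seq_len(iters)) {
    yhat <- drop(H %*% w)
    partial <- -2 * drop(t(H) %*% (y - yhat))    # partial(j) = -2 sum_i h_j(x_i)(y_i - yhat_i)
    w <- (1 - 2 * eta * lambda) * w - eta * partial
  }
  w
}
w_gd <- ridge_gd(x, y, lambda = 10)
w_cf <- solve(t(x) %*% x + 10 * diag(ncol(x)), t(x) %*% y)   # closed form (H'H + lambda I)^{-1} H'y
max(abs(w_gd - drop(w_cf)))   # should be close to 0 when eta and iters are adequate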
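
And here is a minimal R sketch of cyclical coordinate descent with the soft-thresholding update from the last bullet. This is only an illustration of the update rule, not the algorithm glmnet implements: lambda and n_iter are illustrative, the features are rescaled to unit norm as the update assumes, and no intercept is fit.

## Cyclical coordinate descent for the lasso with soft-thresholding (sketch);
## x and y come from the simulation block at the top
soft_threshold <- function(rho, lambda) {
  if (rho < -lambda / 2) {
    rho + lambda / 2
  } else if (rho > lambda / 2) {
    rho - lambda / 2
  } else {
    0
  }
}
lasso_cd <- function(H, y, lambda, n_iter = 50) {
  norms <- sqrt(colSums(H^2))
  Hn <- sweep(H, 2, norms, "/")                       # normalize features so that ||h_j|| = 1
  w <- rep(0, ncol(Hn))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(Hn))) {
      r_j   <- y - Hn[, -j, drop = FALSE] %*% w[-j]   # partial residual with feature j excluded
      rho_j <- sum(Hn[, j] * r_j)
      w[j]  <- soft_threshold(rho_j, lambda)          # the soft-thresholding update above
    }
  }
  w / norms                                           # coefficients on the original feature scale
}
w_lasso <- lasso_cd(x, y, lambda = 50)
sum(w_lasso != 0)   # number of coefficients kept at this value of lambda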