Testmathj: Difference between revisions

From David's Wiki
No edit summary
No edit summary
Line 1: Line 1:


== Use ==
## Full lasso
When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most '''information gain'''. For example,  
set.seed(999)
* entropy (without any class)=.94,
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)
* entropy(var 1) = .69,  
plot(cv.full) # cross-validation curve and two lambda's
* entropy(var 2)=.91,  
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
* entropy(var 3)=.725.  
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44


Why is picking the attribute with the most information gain beneficial? It ''reduces'' entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.
## Ridge Regression to create the Adaptive Weights Vector
 
set.seed(999)
Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
 
wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)
== Related ==
                  [, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See [http://en.wikipedia.org/wiki/Decision_tree_learning wikipedia page] about decision tree learning.
## Adaptive Lasso using the 'penalty.factor' argument
 
set.seed(999)
= Ensembles =
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
* Combining classifiers. Pro: better classification performance. Con: time consuming.
# defautl type.measure="deviance" for logistic regression
* Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/
plot(cv.lasso)
* [http://www.win-vector.com/blog/2019/07/common-ensemble-models-can-be-biased/ Common Ensemble Models can be Biased]
log(cv.lasso$lambda.min) # -2.995375
 
log(cv.lasso$lambda.1se) # -0.7625655
== Bagging ==
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34
Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.
 
=== Random forest ===
'''Variance importance''': if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.
 
Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).
 
Random forest has advantages of easier to run in parallel and suitable for small n large p problems.
 
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2264-5 Random forest versus logistic regression: a large-scale benchmark experiment] by Raphael Couronné, BMC Bioinformatics 2018
 
[https://github.com/suiji/arborist Arborist]: Parallelized, Extensible Random Forests
 
[https://academic.oup.com/bioinformatics/article-abstract/35/15/2701/5250706?redirectedFrom=fulltext On what to permute in test-based approaches for variable importance measures in Random Forests]
 
== Boosting ==
Instead of selecting data points randomly with the boostrap, it favors the misclassified points.
 
Algorithm:
* Initialize the weights
* Repeat
** resample with respect to weights
** retrain the model
** recompute weights
 
Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.
 
== Time series ==
[https://petolau.github.io/Ensemble-of-trees-for-forecasting-time-series/ Ensemble learning for time series forecasting in R]
 
= p-values =
==  p-values ==
* Prob(Data | H0)
* https://en.wikipedia.org/wiki/P-value
* [https://amstat.tandfonline.com/toc/utas20/73/sup1 Statistical Inference in the 21st Century: A World Beyond p < 0.05] The American Statistician, 2019
* [https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/ THE ASA SAYS NO TO P-VALUES] The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
* [http://www.r-statistics.com/2016/03/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement/ It’s not the p-values’ fault]
* [https://stablemarkets.wordpress.com/2016/05/21/exploring-p-values-with-simulations-in-r/ Exploring P-values with Simulations in R] from Stable Markets.
* p-value and [https://en.wikipedia.org/wiki/Effect_size effect size]. http://journals.sagepub.com/doi/full/10.1177/1745691614553988
 
== Distribution of p values in medical abstracts ==
* http://www.ncbi.nlm.nih.gov/pubmed/26608725
* [https://github.com/jtleek/tidypvals An R package with several million published p-values in tidy data sets] by Jeff Leek.
 
== nominal p-value and Empirical p-values ==
* Nominal p-values are based on asymptotic null distributions
* Empirical p-values are computed from simulations/permutations
 
== (nominal) alpha level ==
Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.
* http://www.translationdirectory.com/glossaries/glossary033.htm
* http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf
 
== Normality assumption ==
[https://www.biorxiv.org/content/early/2018/12/20/498931 Violating the normality assumption may be the lesser of two evils]
 
= T-statistic =
See [[T-test#T-statistic|T-statistic]].
 
= ANOVA =
See [[T-test#ANOVA|ANOVA]].
 
= [https://en.wikipedia.org/wiki/Goodness_of_fit Goodness of fit] =
== [https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test Chi-square tests] ==
* [http://freakonometrics.hypotheses.org/20531 An application of chi-square tests]
 
== Fitting distribution ==
[https://magesblog.com/post/2011-12-01-fitting-distributions-with-r/ Fitting distributions with R]
 
= Contingency Tables =
== [https://en.wikipedia.org/wiki/Odds_ratio Odds ratio and Risk ratio] ==
The ratio of the odds of an event occurring in one group to the odds of it occurring in another group
<pre>
        drawn  | not drawn |
-------------------------------------
white |  A      |  B      | Wh
-------------------------------------
black |  C      |  D      | Bk
</pre>
* Odds Ratio = (A / C) / (B / D) = (AD) / (BC)
* Risk Ratio = (A / Wh) / (C / Bk)
 
== Hypergeometric, One-tailed Fisher exact test ==
* https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the ''GO category'' than expected by chance?)
* https://en.wikipedia.org/wiki/Hypergeometric_distribution. '' In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing '''k''' or more successes from the population in '''n''' total draws. In a test for under-representation, the p-value is the probability of randomly drawing '''k''' or fewer successes.''
* http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution
* Two sided hypergeometric test
** http://stats.stackexchange.com/questions/155189/how-to-make-a-two-tailed-hypergeometric-test
** http://stats.stackexchange.com/questions/140107/p-value-in-a-two-tail-test-with-asymmetric-null-distribution
** http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests
* https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution.
* https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k).
<pre>
        drawn  | not drawn |
-------------------------------------
white |  x      |          | m
-------------------------------------
black |  k-x    |          | n
-------------------------------------
      |  k      |          | m+n
</pre>
 
For example, k=100, m=100, m+n=1000,
<syntaxhighlight lang='rsplus'>
> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)
[1] 0.4160339
> a <- dhyper(0:10, 100, 10^3-100, 100)
> cumsum(rev(a))
  [1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116
  [8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101  9.027724e-99  4.175767e-96
[15]  1.644702e-93  5.572070e-91  1.638079e-88  4.210963e-86  9.530281e-84  1.910424e-81  3.410345e-79
[22]  5.447786e-77  7.821658e-75  1.013356e-72  1.189000e-70  1.267638e-68  1.231736e-66  1.093852e-64
[29]  8.900857e-63  6.652193e-61  4.576232e-59  2.903632e-57  1.702481e-55  9.240350e-54  4.650130e-52
[36]  2.173043e-50  9.442985e-49  3.820823e-47  1.441257e-45  5.074077e-44  1.669028e-42  5.134399e-41
[43]  1.478542e-39  3.989016e-38  1.009089e-36  2.395206e-35  5.338260e-34  1.117816e-32  2.200410e-31
[50]  4.074043e-30  7.098105e-29  1.164233e-27  1.798390e-26  2.617103e-25  3.589044e-24  4.639451e-23
[57]  5.654244e-22  6.497925e-21  7.042397e-20  7.198582e-19  6.940175e-18  6.310859e-17  5.412268e-16
[64]  4.377256e-15  3.338067e-14  2.399811e-13  1.626091e-12  1.038184e-11  6.243346e-11  3.535115e-10
[71]  1.883810e-09  9.442711e-09  4.449741e-08  1.970041e-07  8.188671e-07  3.193112e-06  1.167109e-05
[78]  3.994913e-05  1.279299e-04  3.828641e-04  1.069633e-03  2.786293e-03  6.759071e-03  1.525017e-02
[85]  3.196401e-02  6.216690e-02  1.120899e-01  1.872547e-01  2.898395e-01  4.160339e-01  5.550192e-01
[92]  6.909666e-01  8.079129e-01  8.953150e-01  9.511926e-01  9.811343e-01  9.942110e-01  9.986807e-01
[99]  9.998018e-01  9.999853e-01  1.000000e+00
 
# Density plot
plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')
</syntaxhighlight>
[[:File:Dhyper.svg]]
 
Moreover,
<pre>
  1 - phyper(q=10, m, n, k)
= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k)
= 1 - sum(a[1:11]) # R's index starts from 1.
</pre>
 
Another example is the data from [https://david.ncifcrf.gov/helps/functional_annotation.html#fisher the functional annotation tool] in DAVID.
<pre>
              | gene list | not gene list |
-------------------------------------------------------
pathway        |  3  (q)  |              | 40 (m)
-------------------------------------------------------
not in pathway |  297      |              | 29960 (n)
-------------------------------------------------------
              |  300 (k)  |              | 30000
</pre>
The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.
 
== [https://en.wikipedia.org/wiki/Fisher%27s_exact_test Fisher's exact test] ==
Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.
<syntaxhighlight lang='rsplus'>
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) #  alternative = "two.sided" by default
 
        Fisher's Exact Test for Count Data
 
data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  1.488738 23.966741
sample estimates:
odds ratio
  7.564602
 
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")
 
        Fisher's Exact Test for Count Data
 
data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.973  Inf
sample estimates:
odds ratio
  7.564602
 
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")
 
        Fisher's Exact Test for Count Data
 
data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.9991
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
  0.00000 20.90259
sample estimates:
odds ratio
  7.564602
</syntaxhighlight>
 
From the documentation of [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html fisher.test]
<pre>
Usage:
    fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
                control = list(), or = 1, alternative = "two.sided",
                conf.int = TRUE, conf.level = 0.95,
                simulate.p.value = FALSE, B = 2000)
</pre>
* For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.
* For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
* The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.
* Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
 
== Chi-square independence test ==
[https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Exploring the underlying theory of the chi-square test through simulation - part 2]
 
== [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis GSEA] ==
Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states
 
* https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html
* http://software.broadinstitute.org/gsea/index.jsp
* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017
* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package
 
Two categories of GSEA procedures:
* Competitive:  compare genes in the test set relative to all other genes.
* Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by [http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/MeV/manual/gsea.html Jiang & Gentleman Bioinformatics 2007]
 
== McNemar’s test on paired nominal data ==
https://en.wikipedia.org/wiki/McNemar%27s_test
 
= Confidence vs Credibility Intervals =
http://freakonometrics.hypotheses.org/18117
 
= Power analysis/Sample Size determination =
See [[Power|Power]].
 
= Common covariance/correlation structures =
See [https://onlinecourses.science.psu.edu/stat502/node/228 psu.edu]. Assume covariance <math>\Sigma = (\sigma_{ij})_{p\times p} </math>
 
* Diagonal structure: <math>\sigma_{ij} = 0</math> if <math>i \neq j</math>.
* Compound symmetry: <math>\sigma_{ij} = \rho</math> if <math>i \neq j</math>.
* First-order autoregressive AR(1) structure: <math>\sigma_{ij} = \rho^{|i - j|}</math>. <syntaxhighlight lang='rsplus'>
rho <- .8
p <- 5
blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))
</syntaxhighlight>
* Banded matrix: <math>\sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0</math> and <math>\sigma_{ij}=0</math> for <math>|i-j| \ge 3</math>.
* Spatial Power
* Unstructured Covariance
* [https://en.wikipedia.org/wiki/Toeplitz_matrix Toeplitz structure]
 
To create blocks of correlation matrix, use the "%x%" operator. See [https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/kronecker kronecker()].
<syntaxhighlight lang='rsplus'>
covMat <- diag(n.blocks) %x% blockMat
</syntaxhighlight>
 
= Counter/Special Examples =
== Correlated does not imply independence ==
Suppose X is a normally-distributed random variable with zero mean.  Let Y = X^2.  Clearly X and Y are not independent: if you know X, you also know Y.  And if you know Y, you know the absolute value of X.
 
The covariance of X and Y is
<pre>
  Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3)
          = 0,
</pre>
because the distribution of X is symmetric around zero. Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet
have (linear) correlation r(X,Y) = 0.
 
This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.
 
== Spearman vs Pearson correlation ==
Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation
 
<pre>
x=(1:100); 
y=exp(x);                       
cor(x,y, method='spearman') # 1
cor(x,y, method='pearson')  # .25
</pre>
 
== Spearman vs Wilcoxon ==
By [http://www.talkstats.com/threads/wilcoxon-signed-rank-test-or-spearmans-rho.42395/ this post]
* Wilcoxon used to compare categorical versus non-normal continuous variable
* Spearman's rho used to compare two continuous (including '''ordinal''') variables that one or both aren't normally distributed
 
== Spearman vs [https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Kendall correlation] ==
* Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the '''ordinal''' association between two measured quantities.
* [https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho Kendall Tau or Spearman's rho?]
 
== [http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe quartet] ==
 
Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.
 
[[:File:Anscombe quartet 3.svg]]
 
== The real meaning of spurious correlations ==
https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/
<syntaxhighlight lang='rsplus'>
library(ggplot2)
set.seed(123)
spurious_data <- data.frame(x = rnorm(500, 10, 1),
                            y = rnorm(500, 10, 1),
                            z = rnorm(500, 30, 3))
cor(spurious_data$x, spurious_data$y)
# [1] -0.05943856
spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) +
theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")
 
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.4517972
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
theme_bw() + geom_smooth(method = "lm") +
scale_color_gradientn(colours = c("red", "white", "blue")) +
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")
 
spurious_data$z <- rnorm(500, 30, 6)
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.8424597
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
theme_bw() + geom_smooth(method = "lm") +
scale_color_gradientn(colours = c("red", "white", "blue")) +
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")
</syntaxhighlight>
</syntaxhighlight>
* A list of potential lambdas: see [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin Linear Regression] case. The λ sequence is determined by '''lambda.max''' and '''lambda.min.ratio'''. The latter (default is ifelse(nobs<nvars,0.01,0.0001)) is the ratio of smallest value of the generated λ sequence (say ''lambda.min'') to  ''lambda.max''. The program then generated ''nlambda'' values linear on the log scale from ''lambda.max'' down to ''lambda.min''. ''lambda.max'' is not given, but easily computed from the input x and y; it is the smallest value for ''lambda'' such that all the coefficients are zero.
* [https://privefl.github.io/blog/choosing-hyper-parameters-in-penalized-regression/ Choosing hyper-parameters (α and λ) in penalized regression] by Florian Privé
* [https://stats.stackexchange.com/questions/70249/feature-selection-model-with-glmnet-on-methylation-data-pn lambda.min vs lambda.1se]
** The '''lambda.1se''' represents the value of λ in the search that was simpler than the best model ('''lambda.min'''), but which has error within 1 standard error of the best model. In other words, using the value of ''lambda.1se'' as the selected value for λ results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error given the uncertainty in the k-fold CV estimate of the error of the best model.
** The '''lambda.min''' option refers to value of λ at the lowest CV error. The error at this value of λ is the average of the errors over the k folds and hence this estimate of the error is uncertain.
* https://www.rdocumentation.org/packages/glmnet/versions/2.0-10/topics/glmnet
* [http://blog.revolutionanalytics.com/2016/11/glmnetutils.html glmnetUtils: quality of life enhancements for elastic net regression with glmnet]
* Mixing parameter: alpha=1 is the '''lasso penalty''', and alpha=0 the '''ridge penalty''' and anything between 0–1 is '''Elastic net'''.
** RIdge regression uses Euclidean distance/L2-norm as the penalty. It won't remove any variables.
** Lasso uses L1-norm as the penalty. Some of the coefficients may be shrunk exactly to zero.
* [https://www.quora.com/In-ridge-regression-and-lasso-what-is-lambda In ridge regression and lasso, what is lambda?]
** Lambda is a penalty coefficient. Large lambda will shrink the coefficients.
** cv.glment()$lambda.1se gives the most regularized model such that error is within one standard error of the minimum
* cv.glmnet() has a logical parameter '''parallel''' which is useful if a cluster or cores have been previously allocated
* [http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf Ridge regression and the LASSO]
* Standard error/Confidence interval
** [https://www.reddit.com/r/statistics/comments/1vg8k0/standard_errors_in_glmnet/ Standard Errors in GLMNET] and [https://stackoverflow.com/questions/39750965/confidence-intervals-for-ridge-regression Confidence intervals for Ridge regression]
** '''[https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf#page=18 Why SEs are not meaningful and are usually not provided in penalized regression?]'''
**# Hint:  standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods.
**# '''Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias.'''
**# The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
**# Any bootstrap-based calculations can only give an assessment of the variance of the estimates.
**# Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
** [https://stats.stackexchange.com/tags/glmnet/hot Hottest glmnet questions from stackexchange].
** [https://stats.stackexchange.com/questions/91462/standard-errors-for-lasso-prediction-using-r Standard errors for lasso prediction]. There might not be a consensus on a statistically valid method of calculating standard errors for the lasso predictions.
** [https://www4.stat.ncsu.edu/~lu/programcodes.html Code] for Adaptive-Lasso for Cox's proportional hazards model by Zhang & Lu (2007). This can compute the SE of estimates. The weights are originally based on the maximizers of the log partial likelihood. However, the beta may not be estimable in cases such as high-dimensional gene data, or the beta may be unstable if strong collinearity exists among covariates. In such cases, robust estimators such as ridge regression estimators would be used to determine the weights.
* LASSO vs Least angle regression
** https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
** [http://web.stanford.edu/~hastie/TALKS/larstalk.pdf Least Angle Regression, Forward Stagewise and the Lasso]
** https://www.quora.com/What-is-Least-Angle-Regression-and-when-should-it-be-used
** [http://statweb.stanford.edu/~tibs/lasso/simple.html A simple explanation of the Lasso and Least Angle Regression]
** https://stats.stackexchange.com/questions/4663/least-angle-regression-vs-lasso
** https://cran.r-project.org/web/packages/lars/index.html
* '''Oracle property''' and '''adaptive lasso'''
** [http://www.stat.wisc.edu/~shao/stat992/fan-li2001.pdf Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties], Fan & Li (2001) JASA
** [http://ricardoscr.github.io/how-to-adaptive-lasso.html Adaptive Lasso: What it is and how to implement in R]. Adaptive lasso weeks to minimize <math> RSS(\beta) + \lambda \sum_1^p \hat{\omega}_j |\beta_j| </math> where <math>\lambda</math> is the tuning parameter, <math>\hat{\omega}_j = \frac{1}{(|\hat{\beta}_j^{ini}|)^\gamma}</math> is the adaptive weights vector and <math>\hat{\beta}_j^{ini}</math> is an initial estimate of the coefficients obtained through ridge regression. '''Adaptive Lasso ends up penalizing more those coefficients with lower initial estimates.''' <math>\gamma</math> is a positive constant for adjustment of the adaptive weight vector, and the authors suggest the possible values of 0.5, 1 and 2.
** When n goes to infinity, <math>\hat{\omega}_j |\beta_j|  </math> converges to <math>I(\beta_j \neq 0) </math>. So the adaptive Lasso procedure can be regarded as an automatic implementation of best-subset selection in some asymptotic sense.
** [https://stats.stackexchange.com/questions/229142/what-is-the-oracle-property-of-an-estimator What is the oracle property of an estimator?] An oracle estimator must be consistent in 1) '''variable selection''' and 2) '''consistent parameter estimation'''.
** (Linear regression) The adaptive lasso and its oracle properties Zou (2006, JASA)
** (Cox model) Adaptive-LASSO for Cox's proportional hazard model by Zhang and Lu (2007, Biometrika)
**[https://insightr.wordpress.com/2017/06/14/when-the-lasso-fails/ When the LASSO fails???]. Adaptive lasso is used to demonstrate its usefulness.
* [https://statisticaloddsandends.wordpress.com/2018/11/13/a-deep-dive-into-glmnet-penalty-factor/ A deep dive into glmnet: penalty.factor], [https://statisticaloddsandends.wordpress.com/2018/11/15/a-deep-dive-into-glmnet-standardize/ standardize], [https://statisticaloddsandends.wordpress.com/2019/01/09/a-deep-dive-into-glmnet-offset/ offset]
** Lambda sequence is not affected by the "penalty.factor"
** How "penalty.factor" used by the objective function may need to be corrected
* Some issues:
** With group of highly correlated features, Lasso tends to select amongst them arbitrarily.
** Often empirically ridge has better predictive performance than lasso but lasso leads to sparser solution
** Elastic-net (Zou & Hastie '05) aims to address these issues: hybrid between Lasso and ridge regression, uses L1 and L2 penalties.
* [https://statcompute.wordpress.com/2019/02/23/gradient-free-optimization-for-glmnet-parameters/ Gradient-Free Optimization for GLMNET Parameters]
* [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2656-1 Gsslasso Cox]: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information, Tang et al BMC Bioinformatics 2019


= Time series =
=== Lasso logistic regression ===
* [http://ellisp.github.io/blog/2016/12/07/arima-prediction-intervals Why time series forecasts prediction intervals aren't as good as we'd hope]
https://freakonometrics.hypotheses.org/52894
 
== Structural change ==
[https://datascienceplus.com/structural-changes-in-global-warming/ Structural Changes in Global Warming]
 
== AR(1) processes and random walks ==
[https://fdabl.github.io/r/Spurious-Correlation.html Spurious correlations and random walks]
 
= Measurement Error model =
* [https://en.wikipedia.org/wiki/Errors-in-variables_models Errors-in-variables models/errors-in-variables models or measurement error models]
* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13112 Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models] Nghiem 2019
 
= Dictionary =
* '''Prognosis''' is the probability that an event or diagnosis will result in a particular outcome.
** For example, on the paper [http://clincancerres.aacrjournals.org/content/18/21/6065.figures-only Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine] by Matsui 2012, the prognostic score .1 (0.9) represents a '''good (poor)''' prognosis.
** Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See [https://en.wikipedia.org/wiki/Survival_rate Survival rate] in wikipedia.
 
= Data =
== Eleven quick tips for finding research data ==
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038
 
= Books =
* [https://leanpub.com/biostatmethods Methods in Biostatistics with R] ($)
* [http://web.stanford.edu/class/bios221/book/ Modern Statistics for Modern Biology] (free)
* Principles of Applied Statistics, by David Cox & Christl Donnelly
* [https://www.amazon.com/Freedman-Robert-Pisani-Statistics-Hardcover/dp/B004QNRMDK/ Statistics] by David Freedman,Robert Pisani, Roger Purves
* [https://onlinelibrary.wiley.com/topic/browse/000113 Wiley Online Library -> Statistics] (Access by NIH Library)
* [https://web.stanford.edu/~hastie/CASI/ Computer Age Statistical Inference: Algorithms, Evidence and Data Science] by Efron and Hastie 2016


= Social =
=== Lagrange Multipliers ===
== JSM ==
[https://medium.com/@andrew.chamberlain/a-simple-explanation-of-why-lagrange-multipliers-works-253e2cdcbf74 A Simple Explanation of Why Lagrange Multipliers Works]
* 2019
** [https://minecr.shinyapps.io/jsm2019-schedule/ JSM 2019] and the [http://www.citizen-statistician.org/2019/07/shiny-for-jsm-2019/ post].
** [https://rviews.rstudio.com/2019/07/19/an-r-users-guide-to-jsm-2019/ An R Users Guide to JSM 2019]


== Following ==
=== How to solve lasso/convex optimization ===
* [http://jtleek.com/ Jeff Leek], https://twitter.com/jtleek
* [https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf Convex Optimization] by Boyd S, Vandenberghe L, Cambridge 2004. It is cited by Zhang & Lu (2007). The '''interior point algorithm''' can be used to solve the optimization problem in adaptive lasso.
* Revolutions, http://blog.revolutionanalytics.com/  
* Review of '''gradient descent''':
* RStudio Blog, https://blog.rstudio.com/
** Finding maximum: <math>w^{(t+1)} = w^{(t)} + \eta \frac{dg(w)}{dw}</math>, where <math>\eta</math> is stepsize.
* Sean Davis, https://twitter.com/seandavis12, https://github.com/seandavi
** Finding minimum: <math>w^{(t+1)} = w^{(t)} - \eta \frac{dg(w)}{dw}</math>.
* [http://stephenturner.us/post/ Stephen Turner], https://twitter.com/genetics_blog
** [https://stackoverflow.com/questions/12066761/what-is-the-difference-between-gradient-descent-and-newtons-gradient-descent What is the difference between Gradient Descent and Newton's Gradient Descent?] Newton's method requires <math>g''(w)</math>, more smoothness of g(.).
** Finding minimum for multiple variables ('''gradient descent'''): <math>w^{(t+1)} = w^{(t)} - \eta \Delta g(w^{(t)})</math>. For the least squares problem, <math>g(w) = RSS(w)</math>.
** Finding minimum for multiple variables in the least squares problem (minimize <math>RSS(w)</math>):  <math>\text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = w_j^{(t)} - \eta \; \text{partial}(j)</math>
** Finding minimum for multiple variables in the ridge regression problem (minimize <math>RSS(w)+\lambda ||w||_2^2=(y-Hw)'(y-Hw)+\lambda w'w</math>): <math>\text{partial}(j) = -2\sum h_j(x_i)(y_i - \hat{y}_i(w^{(t)}), w_j^{(t+1)} = (1-2\eta \lambda) w_j^{(t)} - \eta \; \text{partial}(j)</math>. Compared to the closed form approach: <math>\hat{w} = (H'H + \lambda I)^{-1}H'y</math> where 1. the inverse exists even N<D as long as <math>\lambda > 0</math> and 2. the complexity of inverse is <math>O(D^3)</math>, D is the dimension of the covariates.
* '''Cyclical coordinate descent''' was used ([https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf#page=1 vignette]) in the glmnet package. See also '''[https://en.wikipedia.org/wiki/Coordinate_descent coordinate descent]'''. The reason we call it 'descent' is because we want to 'minimize' an objective function.
** <math>\hat{w}_j = \min_w g(\hat{w}_1, \cdots, \hat{w}_{j-1},w, \hat{w}_{j+1}, \cdots, \hat{w}_D)</math>
** See [https://www.jstatsoft.org/article/view/v033i01 paper] on JSS 2010. The Cox PHM case also uses the cyclical coordinate descent method; see the [https://www.jstatsoft.org/article/view/v039i05 paper] on JSS 2011.
** Coursera's [https://www.coursera.org/learn/ml-regression/lecture/rb179/feature-selection-lasso-and-nearest-neighbor-regression Machine learning course 2: Regression] at 1:42. [http://web.stanford.edu/~hastie/TALKS/CD.pdf#page=12 Soft-thresholding] the coefficients is the key for the L1 penalty. The range for the thresholding is controlled by <math>\lambda</math>. Note to view the videos and all materials in coursera we can enroll to audit the course without starting a trial.
** No step size is required as in gradient descent.
** [https://sandipanweb.wordpress.com/2017/05/04/implementing-lasso-regression-with-coordinate-descent-and-the-sub-gradient-of-the-l1-penalty-with-soft-thresholding/ Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python]
** Coordinate descent in the least squares problem: <math>\frac{\partial}{\partial w_j} RSS(w)= -2 \rho_j + 2 w_j</math>; i.e. <math>\hat{w}_j = \rho_j</math>.
** Coordinate descent in the Lasso problem (for normalized features): <math>
\hat{w}_j =
\begin{cases}
\rho_j + \lambda/2, & \text{if }\rho_j < -\lambda/2 \\
0, & \text{if } -\lambda/2 \le \rho_j \le \lambda/2\\
\rho_j- \lambda/2, & \text{if }\rho_j > \lambda/2
\end{cases}
</math>
** Choosing <math>\lambda</math> via cross validation tends to favor less sparse solutions and thus smaller <math>\lambda</math> then optimal choice for feature selection. See "Machine learning: a probabilistic perspective", Murphy 2012.
* Classical: Least angle regression (LARS) Efron et al 2004.
* [https://www.mathworks.com/help/stats/lasso.html?s_tid=gn_loc_drop Alternating Direction Method of Multipliers (ADMM)]. Boyd, 2011. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning. Vol. 3, No. 1, 2010, pp. 1–122.
** https://stanford.edu/~boyd/papers/pdf/admm_slides.pdf
** [https://cran.r-project.org/web/packages/ADMM/ ADMM] package
** [https://www.quora.com/Convex-Optimization-Whats-the-advantage-of-alternating-direction-method-of-multipliers-ADMM-and-whats-the-use-case-for-this-type-of-method-compared-against-classic-gradient-descent-or-conjugate-gradient-descent-method What's the advantage of alternating direction method of multipliers (ADMM), and what's the use case for this type of method compared against classic gradient descent or conjugate gradient descent method?]
* [https://math.stackexchange.com/questions/771585/convexity-of-lasso If some variables in design matrix are correlated, then LASSO is convex or not?]
* Tibshirani. [http://www.jstor.org/stable/2346178 Regression shrinkage and selection via the lasso] (free). JRSS B 1996.
* [http://www.econ.uiuc.edu/~roger/research/conopt/coptr.pdf Convex Optimization in R] by Koenker & Mizera 2014.
* [https://web.stanford.edu/~hastie/Papers/pathwise.pdf Pathwise coordinate optimization] by Friedman et al 2007.
* [http://web.stanford.edu/~hastie/StatLearnSparsity/ Statistical learning with sparsity: the Lasso and generalizations] T. Hastie, R. Tibshirani, and M. Wainwright, 2015 (book)
* Element of Statistical Learning (book)
* https://youtu.be/A5I1G1MfUmA StatsLearning Lect8h 110913
* Fu's (1998) shooting algorithm for Lasso ([http://web.stanford.edu/~hastie/TALKS/CD.pdf#page=11 mentioned] in the history of coordinate descent) and Zhang & Lu's (2007) modified shooting algorithm for adaptive Lasso.
* [https://www.cs.ubc.ca/~murphyk/MLbook/ Machine Learning: a Probabilistic Perspective] Choosing <math>\lambda</math> via cross validation tends to favor less sparse solutions and thus smaller <math>\lambda</math> than optimal choice for feature selection.

Revision as of 22:36, 7 September 2019

    1. Full lasso

set.seed(999) cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE) plot(cv.full) # cross-validation curve and two lambda's plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot plot(glmnet(x, ly, family='binomial', alpha=1)) # L1 norm plot log(cv.full$lambda.min) # -4.546394 log(cv.full$lambda.1se) # -3.61605 sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44

    1. Ridge Regression to create the Adaptive Weights Vector

set.seed(999) cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE) wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)

                  [, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
    1. Adaptive Lasso using the 'penalty.factor' argument

set.seed(999) cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)

  1. defautl type.measure="deviance" for logistic regression

plot(cv.lasso) log(cv.lasso$lambda.min) # -2.995375 log(cv.lasso$lambda.1se) # -0.7625655 sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34 </syntaxhighlight>

Lasso logistic regression

https://freakonometrics.hypotheses.org/52894

Lagrange Multipliers

A Simple Explanation of Why Lagrange Multipliers Works

How to solve lasso/convex optimization