Testmathj: Difference between revisions

(21 intermediate revisions by the same user not shown)

Line 1:

~~== Use ==~~

When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most '''information gain'''. For example,

* entropy (without any class)=.94,

* entropy(var 1) = .69,

* entropy(var 2)=.91,

* entropy(var 3)=.725.

~~We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).~~

Why is picking the attribute with the most information gain beneficial? It ''reduces'' entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.

Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.

~~== Related ==~~

~~In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See [http://en.wikipedia.org/wiki/Decision_tree_learning wikipedia page] about decision tree learning.~~

~~= Ensembles =~~

* Combining classifiers. Pro: better classification performance. Con: time consuming.

* Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/

* [http://www.win-vector.com/blog/2019/07/common-ensemble-models-can-be-biased/ Common Ensemble Models can be Biased]

~~== Bagging ==~~

Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.

~~=== Random forest ===~~

~~'''Variance importance''': if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.~~

~~Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).~~

~~Random forest has advantages of easier to run in parallel and suitable for small n large p problems.~~

[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2264-5 Random forest versus logistic regression: a large-scale benchmark experiment] by Raphael Couronné, BMC Bioinformatics 2018

~~[https://github.com/suiji/arborist Arborist]: Parallelized, Extensible Random Forests~~

~~[https://academic.oup.com/bioinformatics/article-abstract/35/15/2701/5250706?redirectedFrom=fulltext On what to permute in test-based approaches for variable importance measures in Random Forests]~~

~~== Boosting ==~~

~~Instead of selecting data points randomly with the boostrap, it favors the misclassified points.~~

~~Algorithm:~~

* Initialize the weights

* Repeat

** resample with respect to weights

** retrain the model

** recompute weights

~~Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.~~

~~== Time series ==~~

~~[https://petolau.github.io/Ensemble-of-trees-for-forecasting-time-series/ Ensemble learning for time series forecasting in R]~~

~~= p-values =~~

~~== p-values ==~~

* Prob(Data | H0)

* https://en.wikipedia.org/wiki/P-value

* [https://amstat.tandfonline.com/toc/utas20/73/sup1 Statistical Inference in the 21st Century: A World Beyond p < 0.05] The American Statistician, 2019

* [https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/ THE ASA SAYS NO TO P-VALUES] The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.

* [http://www.r-statistics.com/2016/03/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement/ It’s not the p-values’ fault]

* [https://stablemarkets.wordpress.com/2016/05/21/exploring-p-values-with-simulations-in-r/ Exploring P-values with Simulations in R] from Stable Markets.

* p-value and [https://en.wikipedia.org/wiki/Effect_size effect size]. http://journals.sagepub.com/doi/full/10.1177/1745691614553988

~~== Distribution of p values in medical abstracts ==~~

* http://www.ncbi.nlm.nih.gov/pubmed/26608725

* [https://github.com/jtleek/tidypvals An R package with several million published p-values in tidy data sets] by Jeff Leek.

~~== nominal p-value and Empirical p-values ==~~

* Nominal p-values are based on asymptotic null distributions

* Empirical p-values are computed from simulations/permutations

~~== (nominal) alpha level ==~~

Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.

* http://www.translationdirectory.com/glossaries/glossary033.htm

* http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf

~~== Normality assumption ==~~

~~[https://www.biorxiv.org/content/early/2018/12/20/498931 Violating the normality assumption may be the lesser of two evils]~~

~~= T-statistic =~~

~~See [[T-test#T-statistic|T-statistic]].~~

~~= ANOVA =~~

~~See [[T-test#ANOVA|ANOVA]].~~

~~= [https://en.wikipedia.org/wiki/Goodness_of_fit Goodness of fit] =~~

~~== [https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test Chi-square tests] ==~~

* [http://freakonometrics.hypotheses.org/20531 An application of chi-square tests]

~~== Fitting distribution ==~~

~~[https://magesblog.com/post/2011-12-01-fitting-distributions-with-r/ Fitting distributions with R]~~

~~= Contingency Tables =~~

~~== [https://en.wikipedia.org/wiki/Odds_ratio Odds ratio and Risk ratio] ==~~

~~The ratio of the odds of an event occurring in one group to the odds of it occurring in another group~~

~~<pre>~~

~~drawn | not drawn |~~

~~-------------------------------------~~

~~white | A | B | Wh~~

~~-------------------------------------~~

~~black | C | D | Bk~~

~~</pre>~~

* Odds Ratio = (A / C) / (B / D) = (AD) / (BC)

* Risk Ratio = (A / Wh) / (C / Bk)

~~== Hypergeometric, One-tailed Fisher exact test ==~~

* https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the ''GO category'' than expected by chance?)

* https://en.wikipedia.org/wiki/Hypergeometric_distribution. '' In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing '''k''' or more successes from the population in '''n''' total draws. In a test for under-representation, the p-value is the probability of randomly drawing '''k''' or fewer successes.''

* http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution

* Two sided hypergeometric test

** http://stats.stackexchange.com/questions/155189/how-to-make-a-two-tailed-hypergeometric-test

** http://stats.stackexchange.com/questions/140107/p-value-in-a-two-tail-test-with-asymmetric-null-distribution

** http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests

* https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution.

* https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k).

~~<pre>~~

~~drawn | not drawn |~~

~~-------------------------------------~~

~~white | x | | m~~

~~-------------------------------------~~

~~black | k-x | | n~~

~~-------------------------------------~~

~~| k | | m+n~~

~~</pre>~~

~~For example, k=100, m=100, m+n=1000,~~

~~<syntaxhighlight lang='rsplus'>~~

~~> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)~~

~~[1] 0.4160339~~

~~> a <- dhyper(0:10, 100, 10^3-100, 100)~~

~~> cumsum(rev(a))~~

~~[1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116~~

~~[8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101 9.027724e-99 4.175767e-96~~

~~[15] 1.644702e-93 5.572070e-91 1.638079e-88 4.210963e-86 9.530281e-84 1.910424e-81 3.410345e-79~~

~~[22] 5.447786e-77 7.821658e-75 1.013356e-72 1.189000e-70 1.267638e-68 1.231736e-66 1.093852e-64~~

~~[29] 8.900857e-63 6.652193e-61 4.576232e-59 2.903632e-57 1.702481e-55 9.240350e-54 4.650130e-52~~

~~[36] 2.173043e-50 9.442985e-49 3.820823e-47 1.441257e-45 5.074077e-44 1.669028e-42 5.134399e-41~~

~~[43] 1.478542e-39 3.989016e-38 1.009089e-36 2.395206e-35 5.338260e-34 1.117816e-32 2.200410e-31~~

~~[50] 4.074043e-30 7.098105e-29 1.164233e-27 1.798390e-26 2.617103e-25 3.589044e-24 4.639451e-23~~

~~[57] 5.654244e-22 6.497925e-21 7.042397e-20 7.198582e-19 6.940175e-18 6.310859e-17 5.412268e-16~~

~~[64] 4.377256e-15 3.338067e-14 2.399811e-13 1.626091e-12 1.038184e-11 6.243346e-11 3.535115e-10~~

~~[71] 1.883810e-09 9.442711e-09 4.449741e-08 1.970041e-07 8.188671e-07 3.193112e-06 1.167109e-05~~

~~[78] 3.994913e-05 1.279299e-04 3.828641e-04 1.069633e-03 2.786293e-03 6.759071e-03 1.525017e-02~~

~~[85] 3.196401e-02 6.216690e-02 1.120899e-01 1.872547e-01 2.898395e-01 4.160339e-01 5.550192e-01~~

~~[92] 6.909666e-01 8.079129e-01 8.953150e-01 9.511926e-01 9.811343e-01 9.942110e-01 9.986807e-01~~

~~[99] 9.998018e-01 9.999853e-01 1.000000e+00~~

~~# Density plot~~

~~plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')~~

~~</syntaxhighlight>~~

~~[[:File:Dhyper.svg]]~~

~~Moreover,~~

~~<pre>~~

~~1 - phyper(q=10, m, n, k)~~

~~= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k)~~

~~= 1 - sum(a[1:11]) # R's index starts from 1.~~

~~</pre>~~

~~Another example is the data from [https://david.ncifcrf.gov/helps/functional_annotation.html#fisher the functional annotation tool] in DAVID.~~

~~<pre>~~

~~| gene list | not gene list |~~

~~-------------------------------------------------------~~

~~pathway | 3 (q) | | 40 (m)~~

~~-------------------------------------------------------~~

~~not in pathway | 297 | | 29960 (n)~~

~~-------------------------------------------------------~~

~~| 300 (k) | | 30000~~

~~</pre>~~

~~The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.~~

~~== [https://en.wikipedia.org/wiki/Fisher%27s_exact_test Fisher's exact test] ==~~

~~Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.~~

~~<syntaxhighlight lang='rsplus'>~~

~~> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) # alternative = "two.sided" by default~~

~~Fisher's Exact Test for Count Data~~

~~data: matrix(c(3, 40, 297, 29960), nr = 2)~~

~~p-value = 0.008853~~

~~alternative hypothesis: true odds ratio is not equal to 1~~

~~95 percent confidence interval:~~

~~1.488738 23.966741~~

~~sample estimates:~~

~~odds ratio~~

~~7.564602~~

~~> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")~~

~~Fisher's Exact Test for Count Data~~

~~data: matrix(c(3, 40, 297, 29960), nr = 2)~~

~~p-value = 0.008853~~

~~alternative hypothesis: true odds ratio is greater than 1~~

~~95 percent confidence interval:~~

~~1.973 Inf~~

~~sample estimates:~~

~~odds ratio~~

~~7.564602~~

~~> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")~~

~~Fisher's Exact Test for Count Data~~

~~data: matrix(c(3, 40, 297, 29960), nr = 2)~~

~~p-value = 0.9991~~

~~alternative hypothesis: true odds ratio is less than 1~~

~~95 percent confidence interval:~~

~~0.00000 20.90259~~

~~sample estimates:~~

~~odds ratio~~

~~7.564602~~

~~</syntaxhighlight>~~

~~From the documentation of [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html fisher.test]~~

~~<pre>~~

~~Usage:~~

~~fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,~~

~~control = list(), or = 1, alternative = "two.sided",~~

~~conf.int = TRUE, conf.level = 0.95,~~

~~simulate.p.value = FALSE, B = 2000)~~

~~</pre>~~

* For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.

* For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.

* The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.

* Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.

~~== Chi-square independence test ==~~

~~[https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Exploring the underlying theory of the chi-square test through simulation - part 2]~~

~~== [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis GSEA] ==~~

~~Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states~~

* https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html

* http://software.broadinstitute.org/gsea/index.jsp

* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017

* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package

~~Two categories of GSEA procedures:~~

* Competitive: compare genes in the test set relative to all other genes.

* Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by [http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/MeV/manual/gsea.html Jiang & Gentleman Bioinformatics 2007]

~~== McNemar’s test on paired nominal data ==~~

~~https://en.wikipedia.org/wiki/McNemar%27s_test~~

~~= Confidence vs Credibility Intervals =~~

~~http://freakonometrics.hypotheses.org/18117~~

~~= Power analysis/Sample Size determination =~~

~~See [[Power|Power]].~~

~~= Common covariance/correlation structures =~~

~~See [https://onlinecourses.science.psu.edu/stat502/node/228 psu.edu]. Assume covariance <math>\Sigma = (\sigma_{ij})_{p\times p} </math>~~

* Diagonal structure: <math>\~~sigma_{ij} = 0</math> if <math>i \neq j</math>.~~

* Compound symmetry: <math>\sigma_{ij} = \rho</math> if <math>i \neq j</math>.

* First-order autoregressive AR(1) structure: <math>\sigma_{ij} = \rho^{|~~i - j|}</math>. <syntaxhighlight lang='rsplus'>~~

~~rho <- .8~~

~~p <- 5~~

~~blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))~~

~~</syntaxhighlight>~~

* Banded matrix: <math>\~~sigma_~~{ii}~~=1, \sigma_~~{~~i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2~~}~~=\sigma_{i+2,i} \neq 0</math> and <math>~~\~~sigma_{ij}=0</math> for <math>|i-j~~| ~~\ge 3~~</math>.

* Spatial Power

* Unstructured Covariance

* [https://en.wikipedia.org/wiki/Toeplitz_matrix Toeplitz structure]

~~To create blocks of correlation matrix, use the "%x%" operator. See [https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/kronecker kronecker()].~~

~~<syntaxhighlight lang='rsplus'>~~

~~covMat <- diag(n.blocks) %x% blockMat~~

~~</syntaxhighlight>~~

~~= Counter/Special Examples =~~

~~== Correlated does not imply independence ==~~

Suppose X is a normally-distributed random variable with zero mean. Let Y = X^2. Clearly X and Y are not independent: if you know X, you also know Y. And if you know Y, you know the absolute value of X.

~~The covariance of X and Y is~~

~~<pre>~~

~~Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3)~~

~~= 0,~~

~~</pre>~~

~~because the distribution of X is symmetric around zero. Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet~~

~~have (linear) correlation r(X,Y) = 0.~~

~~This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.~~

~~== Spearman vs Pearson correlation ==~~

~~Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation~~

~~<pre>~~

~~x=(1:100);~~

~~y=exp(x);~~

~~cor(x,y, method='spearman') # 1~~

~~cor(x,y, method='pearson') # .25~~

~~</pre>~~

~~== Spearman vs Wilcoxon ==~~

~~By [http://www.talkstats.com/threads/wilcoxon-signed-rank-test-or-spearmans-rho.42395/ this post]~~

* Wilcoxon used to compare categorical versus non-normal continuous variable

* Spearman's rho used to compare two continuous (including '''ordinal''') variables that one or both aren't normally distributed

~~== Spearman vs [https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Kendall correlation] ==~~

* Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the '''ordinal''' association between two measured quantities.

* [https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho Kendall Tau or Spearman's rho?]

~~== [http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe quartet] ==~~

~~Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.~~

~~[[:File:Anscombe quartet 3.svg]]~~

~~== The real meaning of spurious correlations ==~~

~~https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/~~

~~<syntaxhighlight lang='rsplus'>~~

~~library(ggplot2)~~

~~set.seed(123)~~

~~spurious_data <- data.frame(x = rnorm(500, 10, 1),~~

~~y = rnorm(500, 10, 1),~~

~~z = rnorm(500, 30, 3))~~

~~cor(spurious_data$x, spurious_data$y)~~

~~# [1] -0.05943856~~

~~spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) +~~

~~theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")~~

~~cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)~~

~~# [1] 0.4517972~~

~~spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +~~

~~theme_bw() + geom_smooth(method = "lm") +~~

~~scale_color_gradientn(colours = c("red", "white", "blue")) +~~

~~labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")~~

~~spurious_data$z <- rnorm(500, 30, 6)~~

~~cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)~~

~~# [1] 0.8424597~~

~~spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +~~

~~theme_bw() + geom_smooth(method = "lm") +~~

~~scale_color_gradientn(colours = c("red", "white", "blue")) +~~

~~labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")~~

~~</syntaxhighlight>~~

~~= Time series =~~

* [http://ellisp.github.io/blog/2016/12/07/arima-prediction-intervals Why time series forecasts prediction intervals aren't as good as we'd hope]

~~== Structural change ==~~

~~[https://datascienceplus.com/structural-changes-in-global-warming/ Structural Changes in Global Warming]~~

~~== AR(1) processes and random walks ==~~

~~[https://fdabl.github.io/r/Spurious-Correlation.html Spurious correlations and random walks]~~

~~= Measurement Error model =~~

* [https://en.wikipedia.org/wiki/Errors-in-variables_models Errors-in-variables models/errors-in-variables models or measurement error models]

* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13112 Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models] Nghiem 2019

~~= Dictionary =~~

* '''Prognosis''' is the probability that an event or diagnosis will result in a particular outcome.

** For example, on the paper [http://clincancerres.aacrjournals.org/content/18/21/6065.figures-only Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine] by Matsui 2012, the prognostic score .1 (0.9) represents a '''good (poor)''' prognosis.

** Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See [https://en.wikipedia.org/wiki/Survival_rate Survival rate] in wikipedia.

~~= Data =~~

~~== Eleven quick tips for finding research data ==~~

~~http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038~~

~~= Books =~~

* [https://leanpub.com/biostatmethods Methods in Biostatistics with R] ($)

* [http://web.stanford.edu/class/bios221/book/ Modern Statistics for Modern Biology] (free)

* Principles of Applied Statistics, by David Cox & Christl Donnelly

* [https://www.amazon.com/Freedman-Robert-Pisani-Statistics-Hardcover/dp/B004QNRMDK/ Statistics] by David Freedman,Robert Pisani, Roger Purves

* [https://onlinelibrary.wiley.com/topic/browse/000113 Wiley Online Library -> Statistics] (Access by NIH Library)

* [https://web.stanford.edu/~hastie/CASI/ Computer Age Statistical Inference: Algorithms, Evidence and Data Science] by Efron and Hastie 2016

~~= Social =~~

~~== JSM ==~~

* 2019

** [https://minecr.shinyapps.io/jsm2019-schedule/ JSM 2019] and the [http://www.citizen-statistician.org/2019/07/shiny-for-jsm-2019/ post].

** [https://rviews.rstudio.com/2019/07/19/an-r-users-guide-to-jsm-2019/ An R Users Guide to JSM 2019]

~~== Following ==~~

* [http://jtleek.com/ Jeff Leek], https://twitter.com/jtleek

* Revolutions, http://blog.revolutionanalytics.com/

* RStudio Blog, https://blog.rstudio.com/

* Sean Davis, https://twitter.com/seandavis12, https://github.com/seandavi

* [http://stephenturner.us/post/ Stephen Turner], https://twitter.com/genetics_blog

@@ Line 1: / Line 1: @@
-== Use ==
+<math>\| \frac{c}{d} \|</math>
-When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most '''information gain'''. For example,
-* entropy (without any class)=.94,
-* entropy(var 1) = .69,
-* entropy(var 2)=.91,
-* entropy(var 3)=.725.
-We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).
-Why is picking the attribute with the most information gain beneficial? It ''reduces'' entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.
-Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.
-== Related ==
-In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See [http://en.wikipedia.org/wiki/Decision_tree_learning wikipedia page] about decision tree learning.
-= Ensembles =
-* Combining classifiers. Pro: better classification performance. Con: time consuming.
-* Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/
-* [http://www.win-vector.com/blog/2019/07/common-ensemble-models-can-be-biased/ Common Ensemble Models can be Biased]
-== Bagging ==
-Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.
-=== Random forest ===
-'''Variance importance''': if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.
-Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).
-Random forest has advantages of easier to run in parallel and suitable for small n large p problems.
-[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2264-5 Random forest versus logistic regression: a large-scale benchmark experiment] by Raphael Couronné, BMC Bioinformatics 2018
-[https://github.com/suiji/arborist Arborist]: Parallelized, Extensible Random Forests
-[https://academic.oup.com/bioinformatics/article-abstract/35/15/2701/5250706?redirectedFrom=fulltext On what to permute in test-based approaches for variable importance measures in Random Forests]
-== Boosting ==
-Instead of selecting data points randomly with the boostrap, it favors the misclassified points.
-Algorithm:
-* Initialize the weights
-* Repeat
-** resample with respect to weights
-** retrain the model
-** recompute weights
-Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.
-== Time series ==
-[https://petolau.github.io/Ensemble-of-trees-for-forecasting-time-series/ Ensemble learning for time series forecasting in R]
-= p-values =
-==  p-values ==
-* Prob(Data | H0)
-* https://en.wikipedia.org/wiki/P-value
-* [https://amstat.tandfonline.com/toc/utas20/73/sup1 Statistical Inference in the 21st Century: A World Beyond p < 0.05] The American Statistician, 2019
-* [https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/ THE ASA SAYS NO TO P-VALUES] The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
-* [http://www.r-statistics.com/2016/03/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement/ It’s not the p-values’ fault]
-* [https://stablemarkets.wordpress.com/2016/05/21/exploring-p-values-with-simulations-in-r/ Exploring P-values with Simulations in R] from Stable Markets.
-* p-value and [https://en.wikipedia.org/wiki/Effect_size effect size]. http://journals.sagepub.com/doi/full/10.1177/1745691614553988
-== Distribution of p values in medical abstracts ==
-* http://www.ncbi.nlm.nih.gov/pubmed/26608725
-* [https://github.com/jtleek/tidypvals An R package with several million published p-values in tidy data sets] by Jeff Leek.
-== nominal p-value and Empirical p-values ==
-* Nominal p-values are based on asymptotic null distributions
-* Empirical p-values are computed from simulations/permutations
-== (nominal) alpha level ==
-Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.
-* http://www.translationdirectory.com/glossaries/glossary033.htm
-* http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf
-== Normality assumption ==
-[https://www.biorxiv.org/content/early/2018/12/20/498931 Violating the normality assumption may be the lesser of two evils]
-= T-statistic =
-See [[T-test#T-statistic|T-statistic]].
-= ANOVA =
-See [[T-test#ANOVA|ANOVA]].
-= [https://en.wikipedia.org/wiki/Goodness_of_fit Goodness of fit] =
-== [https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test Chi-square tests] ==
-* [http://freakonometrics.hypotheses.org/20531 An application of chi-square tests]
-== Fitting distribution ==
-[https://magesblog.com/post/2011-12-01-fitting-distributions-with-r/ Fitting distributions with R]
-= Contingency Tables =
-== [https://en.wikipedia.org/wiki/Odds_ratio Odds ratio and Risk ratio] ==
-The ratio of the odds of an event occurring in one group to the odds of it occurring in another group
-<pre>
-         drawn   | not drawn |
--------------------------------------
-white |   A      |   B       | Wh
--------------------------------------
-black |   C      |   D       | Bk
-</pre>
-* Odds Ratio = (A / C) / (B / D) = (AD) / (BC)
-* Risk Ratio = (A / Wh) / (C / Bk)
-== Hypergeometric, One-tailed Fisher exact test ==
-* https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the ''GO category'' than expected by chance?)
-* https://en.wikipedia.org/wiki/Hypergeometric_distribution. '' In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing '''k''' or more successes from the population in '''n''' total draws. In a test for under-representation, the p-value is the probability of randomly drawing '''k''' or fewer successes.''
-* http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution
-* Two sided hypergeometric test
-** http://stats.stackexchange.com/questions/155189/how-to-make-a-two-tailed-hypergeometric-test
-** http://stats.stackexchange.com/questions/140107/p-value-in-a-two-tail-test-with-asymmetric-null-distribution
-** http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests
-* https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution.
-* https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k).
-<pre>
-         drawn   | not drawn |
--------------------------------------
-white |   x      |           | m
--------------------------------------
-black |  k-x     |           | n
--------------------------------------
-      |   k      |           | m+n
-</pre>
-For example, k=100, m=100, m+n=1000,
-<syntaxhighlight lang='rsplus'>
-> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)
-[1] 0.4160339
-> a <- dhyper(0:10, 100, 10^3-100, 100)
-> cumsum(rev(a))
-  [1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116
-  [8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101  9.027724e-99  4.175767e-96
- [15]  1.644702e-93  5.572070e-91  1.638079e-88  4.210963e-86  9.530281e-84  1.910424e-81  3.410345e-79
- [22]  5.447786e-77  7.821658e-75  1.013356e-72  1.189000e-70  1.267638e-68  1.231736e-66  1.093852e-64
- [29]  8.900857e-63  6.652193e-61  4.576232e-59  2.903632e-57  1.702481e-55  9.240350e-54  4.650130e-52
- [36]  2.173043e-50  9.442985e-49  3.820823e-47  1.441257e-45  5.074077e-44  1.669028e-42  5.134399e-41
- [43]  1.478542e-39  3.989016e-38  1.009089e-36  2.395206e-35  5.338260e-34  1.117816e-32  2.200410e-31
- [50]  4.074043e-30  7.098105e-29  1.164233e-27  1.798390e-26  2.617103e-25  3.589044e-24  4.639451e-23
- [57]  5.654244e-22  6.497925e-21  7.042397e-20  7.198582e-19  6.940175e-18  6.310859e-17  5.412268e-16
- [64]  4.377256e-15  3.338067e-14  2.399811e-13  1.626091e-12  1.038184e-11  6.243346e-11  3.535115e-10
- [71]  1.883810e-09  9.442711e-09  4.449741e-08  1.970041e-07  8.188671e-07  3.193112e-06  1.167109e-05
- [78]  3.994913e-05  1.279299e-04  3.828641e-04  1.069633e-03  2.786293e-03  6.759071e-03  1.525017e-02
- [85]  3.196401e-02  6.216690e-02  1.120899e-01  1.872547e-01  2.898395e-01  4.160339e-01  5.550192e-01
- [92]  6.909666e-01  8.079129e-01  8.953150e-01  9.511926e-01  9.811343e-01  9.942110e-01  9.986807e-01
- [99]  9.998018e-01  9.999853e-01  1.000000e+00
-# Density plot
-plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')
-</syntaxhighlight>
-[[:File:Dhyper.svg]]
-Moreover,
-<pre>
-- phyper(q=10, m, n, k)
-= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k)
-= 1 - sum(a[1:11]) # R's index starts from 1.
-</pre>
-Another example is the data from [https://david.ncifcrf.gov/helps/functional_annotation.html#fisher the functional annotation tool] in DAVID.
-<pre>
-               | gene list | not gene list |
--------------------------------------------------------
-pathway        |   3  (q)  |               | 40 (m)
--------------------------------------------------------
-not in pathway |  297      |               | 29960 (n)
--------------------------------------------------------
-               |  300 (k)  |               | 30000
-</pre>
-The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.
-== [https://en.wikipedia.org/wiki/Fisher%27s_exact_test Fisher's exact test] ==
-Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.
-<syntaxhighlight lang='rsplus'>
-> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) #  alternative = "two.sided" by default
-        Fisher's Exact Test for Count Data
-data:  matrix(c(3, 40, 297, 29960), nr = 2)
-p-value = 0.008853
-alternative hypothesis: true odds ratio is not equal to 1
-percent confidence interval:
-.488738 23.966741
-sample estimates:
-odds ratio
-.564602
-> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")
-        Fisher's Exact Test for Count Data
-data:  matrix(c(3, 40, 297, 29960), nr = 2)
-p-value = 0.008853
-alternative hypothesis: true odds ratio is greater than 1
-percent confidence interval:
-.973   Inf
-sample estimates:
-odds ratio
-.564602
-> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")
-        Fisher's Exact Test for Count Data
-data:  matrix(c(3, 40, 297, 29960), nr = 2)
-p-value = 0.9991
-alternative hypothesis: true odds ratio is less than 1
-percent confidence interval:
-.00000 20.90259
-sample estimates:
-odds ratio
-.564602
-</syntaxhighlight>
-From the documentation of [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html fisher.test]
-<pre>
-Usage:
-     fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
-                 control = list(), or = 1, alternative = "two.sided",
-                 conf.int = TRUE, conf.level = 0.95,
-                 simulate.p.value = FALSE, B = 2000)
-</pre>
-* For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.
-* For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
-* The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.
-* Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
-== Chi-square independence test ==
-[https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Exploring the underlying theory of the chi-square test through simulation - part 2]
-== [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis GSEA] ==
-Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states
-* https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html
-* http://software.broadinstitute.org/gsea/index.jsp
-* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017
-* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package
-Two categories of GSEA procedures:
-* Competitive:  compare genes in the test set relative to all other genes.
-* Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by [http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/MeV/manual/gsea.html Jiang & Gentleman Bioinformatics 2007]
-== McNemar’s test on paired nominal data ==
-https://en.wikipedia.org/wiki/McNemar%27s_test
-= Confidence vs Credibility Intervals =
-http://freakonometrics.hypotheses.org/18117
-= Power analysis/Sample Size determination =
-See [[Power|Power]].
-= Common covariance/correlation structures =
-See [https://onlinecourses.science.psu.edu/stat502/node/228 psu.edu]. Assume covariance <math>\Sigma = (\sigma_{ij})_{p\times p} </math>
-* Diagonal structure: <math>\sigma_{ij} = 0</math> if <math>i \neq j</math>.
-* Compound symmetry: <math>\sigma_{ij} = \rho</math> if <math>i \neq j</math>.
-* First-order autoregressive AR(1) structure: <math>\sigma_{ij} = \rho^{|i - j|}</math>. <syntaxhighlight lang='rsplus'>
-rho <- .8
-p <- 5
-blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))
-</syntaxhighlight>
-* Banded matrix: <math>\sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0</math> and <math>\sigma_{ij}=0</math> for <math>|i-j| \ge 3</math>.
-* Spatial Power
-* Unstructured Covariance
-* [https://en.wikipedia.org/wiki/Toeplitz_matrix Toeplitz structure]
-To create blocks of correlation matrix, use the "%x%" operator. See [https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/kronecker kronecker()].
-<syntaxhighlight lang='rsplus'>
-covMat <- diag(n.blocks) %x% blockMat
-</syntaxhighlight>
-= Counter/Special Examples =
-== Correlated does not imply independence ==
-Suppose X is a normally-distributed random variable with zero mean.  Let Y = X^2.  Clearly X and Y are not independent: if you know X, you also know Y.  And if you know Y, you know the absolute value of X.
-The covariance of X and Y is
-<pre>
-  Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3)
-           = 0,
-</pre>
-because the distribution of X is symmetric around zero.  Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet
-have (linear) correlation r(X,Y) = 0.
-This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.
-== Spearman vs Pearson correlation ==
-Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation
-<pre>
-x=(1:100);
-y=exp(x);
-cor(x,y, method='spearman') # 1
-cor(x,y, method='pearson')  # .25
-</pre>
-== Spearman vs Wilcoxon ==
-By [http://www.talkstats.com/threads/wilcoxon-signed-rank-test-or-spearmans-rho.42395/ this post]
-* Wilcoxon used to compare categorical versus non-normal continuous variable
-* Spearman's rho used to compare two continuous (including '''ordinal''') variables that one or both aren't normally distributed
-== Spearman vs [https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Kendall correlation] ==
-* Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the '''ordinal''' association between two measured quantities.
-* [https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho Kendall Tau or Spearman's rho?]
-== [http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe quartet] ==
-Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.
-[[:File:Anscombe quartet 3.svg]]
-== The real meaning of spurious correlations ==
-https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/
-<syntaxhighlight lang='rsplus'>
-library(ggplot2)
-set.seed(123)
-spurious_data <- data.frame(x = rnorm(500, 10, 1),
-                            y = rnorm(500, 10, 1),
-                            z = rnorm(500, 30, 3))
-cor(spurious_data$x, spurious_data$y)
-# [1] -0.05943856
-spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) +
-theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")
-cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
-# [1] 0.4517972
-spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
- theme_bw() + geom_smooth(method = "lm") +
-scale_color_gradientn(colours = c("red", "white", "blue")) +
-labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")
-spurious_data$z <- rnorm(500, 30, 6)
-cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
-# [1] 0.8424597
-spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
-theme_bw() + geom_smooth(method = "lm") +
-scale_color_gradientn(colours = c("red", "white", "blue")) +
-labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")
-</syntaxhighlight>
-= Time series =
-* [http://ellisp.github.io/blog/2016/12/07/arima-prediction-intervals Why time series forecasts prediction intervals aren't as good as we'd hope]
-== Structural change ==
-[https://datascienceplus.com/structural-changes-in-global-warming/ Structural Changes in Global Warming]
-== AR(1) processes and random walks ==
-[https://fdabl.github.io/r/Spurious-Correlation.html Spurious correlations and random walks]
-= Measurement Error model =
-* [https://en.wikipedia.org/wiki/Errors-in-variables_models Errors-in-variables models/errors-in-variables models or measurement error models]
-* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13112 Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models] Nghiem 2019
-= Dictionary =
-* '''Prognosis''' is the probability that an event or diagnosis will result in a particular outcome.
-** For example, on the paper [http://clincancerres.aacrjournals.org/content/18/21/6065.figures-only Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine] by Matsui 2012, the prognostic score .1 (0.9) represents a '''good (poor)''' prognosis.
-** Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See [https://en.wikipedia.org/wiki/Survival_rate Survival rate] in wikipedia.
-= Data =
-== Eleven quick tips for finding research data ==
-http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038
-= Books =
-* [https://leanpub.com/biostatmethods Methods in Biostatistics with R] ($)
-* [http://web.stanford.edu/class/bios221/book/ Modern Statistics for Modern Biology] (free)
-* Principles of Applied Statistics, by David Cox & Christl Donnelly
-* [https://www.amazon.com/Freedman-Robert-Pisani-Statistics-Hardcover/dp/B004QNRMDK/ Statistics] by David Freedman,Robert Pisani, Roger Purves
-* [https://onlinelibrary.wiley.com/topic/browse/000113 Wiley Online Library -> Statistics] (Access by NIH Library)
-* [https://web.stanford.edu/~hastie/CASI/ Computer Age Statistical Inference: Algorithms, Evidence and Data Science] by Efron and Hastie 2016
-= Social =
-== JSM ==
-* 2019
-** [https://minecr.shinyapps.io/jsm2019-schedule/ JSM 2019] and the [http://www.citizen-statistician.org/2019/07/shiny-for-jsm-2019/ post].
-** [https://rviews.rstudio.com/2019/07/19/an-r-users-guide-to-jsm-2019/ An R Users Guide to JSM 2019]
-== Following ==
-* [http://jtleek.com/ Jeff Leek], https://twitter.com/jtleek
-* Revolutions, http://blog.revolutionanalytics.com/
-* RStudio Blog, https://blog.rstudio.com/
-* Sean Davis, https://twitter.com/seandavis12, https://github.com/seandavi
-* [http://stephenturner.us/post/ Stephen Turner], https://twitter.com/genetics_blog

Testmathj: Difference between revisions

Latest revision as of 23:07, 7 September 2019