Testmathj: Difference between revisions
No edit summary |
No edit summary |
||
Line 1: | Line 1: | ||
= | == Use == | ||
When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most '''information gain'''. For example, | |||
* entropy (without any class)=.94, | |||
* entropy(var 1) = .69, | |||
* entropy(var 2)=.91, | |||
* entropy(var 3)=.725. | |||
We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725). | |||
Why is picking the attribute with the most information gain beneficial? It ''reduces'' entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability. | |||
Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2. | |||
= | == Related == | ||
In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See [http://en.wikipedia.org/wiki/Decision_tree_learning wikipedia page] about decision tree learning. | |||
= Ensembles = | |||
* Combining classifiers. Pro: better classification performance. Con: time consuming. | |||
* Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/ | |||
* [http://www.win-vector.com/blog/2019/07/common-ensemble-models-can-be-biased/ Common Ensemble Models can be Biased] | |||
== Bagging == | |||
Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models. | |||
=== Random forest === | |||
'''Variance importance''': if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important. | |||
Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance). | |||
Random forest has advantages of easier to run in parallel and suitable for small n large p problems. | |||
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2264-5 Random forest versus logistic regression: a large-scale benchmark experiment] by Raphael Couronné, BMC Bioinformatics 2018 | |||
[https://github.com/suiji/arborist Arborist]: Parallelized, Extensible Random Forests | |||
[ | [https://academic.oup.com/bioinformatics/article-abstract/35/15/2701/5250706?redirectedFrom=fulltext On what to permute in test-based approaches for variable importance measures in Random Forests] | ||
== | == Boosting == | ||
Instead of selecting data points randomly with the boostrap, it favors the misclassified points. | |||
Algorithm: | |||
* Initialize the weights | |||
* Repeat | |||
** resample with respect to weights | |||
** retrain the model | |||
** recompute weights | |||
Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large. | |||
== Time series == | |||
[https://petolau.github.io/Ensemble-of-trees-for-forecasting-time-series/ Ensemble learning for time series forecasting in R] | |||
= p-values = | |||
== p-values == | |||
* Prob(Data | H0) | |||
* https://en.wikipedia.org/wiki/P-value | |||
* [https://amstat.tandfonline.com/toc/utas20/73/sup1 Statistical Inference in the 21st Century: A World Beyond p < 0.05] The American Statistician, 2019 | |||
* [https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/ THE ASA SAYS NO TO P-VALUES] The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important. | |||
* [http://www.r-statistics.com/2016/03/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement/ It’s not the p-values’ fault] | |||
* [https://stablemarkets.wordpress.com/2016/05/21/exploring-p-values-with-simulations-in-r/ Exploring P-values with Simulations in R] from Stable Markets. | |||
* p-value and [https://en.wikipedia.org/wiki/Effect_size effect size]. http://journals.sagepub.com/doi/full/10.1177/1745691614553988 | |||
== Distribution of p values in medical abstracts == | |||
* http://www.ncbi.nlm.nih.gov/pubmed/26608725 | |||
* [https://github.com/jtleek/tidypvals An R package with several million published p-values in tidy data sets] by Jeff Leek. | |||
== nominal p-value and Empirical p-values == | |||
* Nominal p-values are based on asymptotic null distributions | |||
* Empirical p-values are computed from simulations/permutations | |||
== (nominal) alpha level == | |||
Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL. | |||
* http://www.translationdirectory.com/glossaries/glossary033.htm | |||
* http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf | |||
== Normality assumption == | |||
[https://www.biorxiv.org/content/early/2018/12/20/498931 Violating the normality assumption may be the lesser of two evils] | |||
= T-statistic = | |||
See [[T-test#T-statistic|T-statistic]]. | |||
= ANOVA = | |||
See [[T-test#ANOVA|ANOVA]]. | |||
= [https://en.wikipedia.org/wiki/Goodness_of_fit Goodness of fit] = | |||
== [https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test Chi-square tests] == | |||
* [http://freakonometrics.hypotheses.org/20531 An application of chi-square tests] | |||
== Fitting distribution == | |||
[https://magesblog.com/post/2011-12-01-fitting-distributions-with-r/ Fitting distributions with R] | |||
- | = Contingency Tables = | ||
== [https://en.wikipedia.org/wiki/Odds_ratio Odds ratio and Risk ratio] == | |||
The ratio of the odds of an event occurring in one group to the odds of it occurring in another group | |||
<pre> | |||
drawn | not drawn | | |||
</ | ------------------------------------- | ||
white | A | B | Wh | |||
------------------------------------- | |||
black | C | D | Bk | |||
</pre> | |||
* Odds Ratio = (A / C) / (B / D) = (AD) / (BC) | |||
* Risk Ratio = (A / Wh) / (C / Bk) | |||
= | == Hypergeometric, One-tailed Fisher exact test == | ||
* | * https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the ''GO category'' than expected by chance?) | ||
* https://en.wikipedia.org/wiki/Hypergeometric_distribution. '' In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing '''k''' or more successes from the population in '''n''' total draws. In a test for under-representation, the p-value is the probability of randomly drawing '''k''' or fewer successes.'' | |||
* http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution | |||
* Two sided hypergeometric test | |||
* https://en.wikipedia.org/wiki/ | ** http://stats.stackexchange.com/questions/155189/how-to-make-a-two-tailed-hypergeometric-test | ||
** http://stats.stackexchange.com/questions/140107/p-value-in-a-two-tail-test-with-asymmetric-null-distribution | |||
* | ** http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests | ||
* https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution. | |||
* https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k). | |||
<pre> | |||
* | drawn | not drawn | | ||
* | ------------------------------------- | ||
* | white | x | | m | ||
------------------------------------- | |||
black | k-x | | n | |||
* http:// | ------------------------------------- | ||
* http:// | | k | | m+n | ||
</pre> | |||
* | |||
* | |||
For example, k=100, m=100, m+n=1000, | |||
<syntaxhighlight lang='rsplus'> | <syntaxhighlight lang='rsplus'> | ||
> 1 - phyper(10, 100, 10^3-100, 100, log.p=F) | |||
[1] 0.4160339 | |||
> a <- dhyper(0:10, 100, 10^3-100, 100) | |||
> cumsum(rev(a)) | |||
[1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116 | |||
[8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101 9.027724e-99 4.175767e-96 | |||
[15] 1.644702e-93 5.572070e-91 1.638079e-88 4.210963e-86 9.530281e-84 1.910424e-81 3.410345e-79 | |||
[22] 5.447786e-77 7.821658e-75 1.013356e-72 1.189000e-70 1.267638e-68 1.231736e-66 1.093852e-64 | |||
[29] 8.900857e-63 6.652193e-61 4.576232e-59 2.903632e-57 1.702481e-55 9.240350e-54 4.650130e-52 | |||
[36] 2.173043e-50 9.442985e-49 3.820823e-47 1.441257e-45 5.074077e-44 1.669028e-42 5.134399e-41 | |||
[43] 1.478542e-39 3.989016e-38 1.009089e-36 2.395206e-35 5.338260e-34 1.117816e-32 2.200410e-31 | |||
[50] 4.074043e-30 7.098105e-29 1.164233e-27 1.798390e-26 2.617103e-25 3.589044e-24 4.639451e-23 | |||
[57] 5.654244e-22 6.497925e-21 7.042397e-20 7.198582e-19 6.940175e-18 6.310859e-17 5.412268e-16 | |||
[64] 4.377256e-15 3.338067e-14 2.399811e-13 1.626091e-12 1.038184e-11 6.243346e-11 3.535115e-10 | |||
[71] 1.883810e-09 9.442711e-09 4.449741e-08 1.970041e-07 8.188671e-07 3.193112e-06 1.167109e-05 | |||
[78] 3.994913e-05 1.279299e-04 3.828641e-04 1.069633e-03 2.786293e-03 6.759071e-03 1.525017e-02 | |||
[85] 3.196401e-02 6.216690e-02 1.120899e-01 1.872547e-01 2.898395e-01 4.160339e-01 5.550192e-01 | |||
[92] 6.909666e-01 8.079129e-01 8.953150e-01 9.511926e-01 9.811343e-01 9.942110e-01 9.986807e-01 | |||
[99] 9.998018e-01 9.999853e-01 1.000000e+00 | |||
# | # Density plot | ||
plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h') | |||
</syntaxhighlight> | </syntaxhighlight> | ||
[[:File:Dhyper.svg]] | |||
= | Moreover, | ||
[ | <pre> | ||
1 - phyper(q=10, m, n, k) | |||
= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k) | |||
= 1 - sum(a[1:11]) # R's index starts from 1. | |||
</pre> | |||
Another example is the data from [https://david.ncifcrf.gov/helps/functional_annotation.html#fisher the functional annotation tool] in DAVID. | |||
<pre> | |||
| gene list | not gene list | | |||
------------------------------------------------------- | |||
pathway | 3 (q) | | 40 (m) | |||
------------------------------------------------------- | |||
not in pathway | 297 | | 29960 (n) | |||
------------------------------------------------------- | |||
| 300 (k) | | 30000 | |||
</pre> | |||
The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074. | |||
== [https://en.wikipedia.org/wiki/Fisher%27s_exact_test Fisher's exact test] == | |||
Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables. | |||
<syntaxhighlight lang='rsplus'> | <syntaxhighlight lang='rsplus'> | ||
> | > fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) # alternative = "two.sided" by default | ||
Fisher's Exact Test for Count Data | |||
data: matrix(c(3, 40, 297, 29960), nr = 2) | |||
p-value = 0.008853 | |||
alternative hypothesis: true odds ratio is not equal to 1 | |||
95 percent confidence interval: | |||
1.488738 23.966741 | |||
sample estimates: | |||
odds ratio | |||
7.564602 | |||
== | > fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater") | ||
Fisher's Exact Test for Count Data | |||
data: matrix(c(3, 40, 297, 29960), nr = 2) | |||
p-value = 0.008853 | |||
alternative hypothesis: true odds ratio is greater than 1 | |||
95 percent confidence interval: | |||
1.973 Inf | |||
sample estimates: | |||
odds ratio | |||
7.564602 | |||
== | > fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less") | ||
Fisher's Exact Test for Count Data | |||
data: matrix(c(3, 40, 297, 29960), nr = 2) | |||
p-value = 0.9991 | |||
alternative hypothesis: true odds ratio is less than 1 | |||
95 percent confidence interval: | |||
0.00000 20.90259 | |||
sample estimates: | |||
odds ratio | |||
7.564602 | |||
= | |||
</syntaxhighlight> | </syntaxhighlight> | ||
From the documentation of [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html fisher.test] | |||
<pre> | |||
Usage: | |||
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE, | |||
control = list(), or = 1, alternative = "two.sided", | |||
conf.int = TRUE, conf.level = 0.95, | |||
< | simulate.p.value = FALSE, B = 2000) | ||
</pre> | |||
* For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution. | |||
* For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one. | |||
* The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’. | |||
* Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities. | |||
= | |||
* | |||
* | |||
* | |||
* | |||
== | == Chi-square independence test == | ||
[https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Exploring the underlying theory of the chi-square test through simulation - part 2] | |||
== [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis GSEA] == | |||
Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states | |||
* https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html | |||
* http://software.broadinstitute.org/gsea/index.jsp | |||
* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017 | |||
* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package | |||
Two categories of GSEA procedures: | |||
* Competitive: compare genes in the test set relative to all other genes. | |||
* Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by [http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/MeV/manual/gsea.html Jiang & Gentleman Bioinformatics 2007] | |||
== | == McNemar’s test on paired nominal data == | ||
https://en.wikipedia.org/wiki/McNemar%27s_test | |||
= Confidence vs Credibility Intervals = | |||
http://freakonometrics.hypotheses.org/18117 | |||
= | = Power analysis/Sample Size determination = | ||
See [[Power|Power]]. | |||
= | = Common covariance/correlation structures = | ||
See [https://onlinecourses.science.psu.edu/stat502/node/228 psu.edu]. Assume covariance <math>\Sigma = (\sigma_{ij})_{p\times p} </math> | |||
* Diagonal structure: <math>\sigma_{ij} = 0</math> if <math>i \neq j</math>. | |||
* Compound symmetry: <math>\sigma_{ij} = \rho</math> if <math>i \neq j</math>. | |||
* First-order autoregressive AR(1) structure: <math>\sigma_{ij} = \rho^{|i - j|}</math>. <syntaxhighlight lang='rsplus'> | |||
= | rho <- .8 | ||
p <- 5 | |||
blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p)) | |||
* | |||
= | |||
* | |||
<syntaxhighlight lang='rsplus'> | |||
</syntaxhighlight> | </syntaxhighlight> | ||
* Banded matrix: <math>\sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0</math> and <math>\sigma_{ij}=0</math> for <math>|i-j| \ge 3</math>. | |||
* Spatial Power | |||
* Unstructured Covariance | |||
* [https://en.wikipedia.org/wiki/Toeplitz_matrix Toeplitz structure] | |||
To create blocks of correlation matrix, use the "%x%" operator. See [https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/kronecker kronecker()]. | |||
<syntaxhighlight lang='rsplus'> | <syntaxhighlight lang='rsplus'> | ||
covMat <- diag(n.blocks) %x% blockMat | |||
</syntaxhighlight> | </syntaxhighlight> | ||
== | = Counter/Special Examples = | ||
== Correlated does not imply independence == | |||
Suppose X is a normally-distributed random variable with zero mean. Let Y = X^2. Clearly X and Y are not independent: if you know X, you also know Y. And if you know Y, you know the absolute value of X. | |||
= | The covariance of X and Y is | ||
<pre> | |||
Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3) | |||
= 0, | |||
</pre> | |||
because the distribution of X is symmetric around zero. Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet | |||
have (linear) correlation r(X,Y) = 0. | |||
This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X. | |||
== | == Spearman vs Pearson correlation == | ||
Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation | |||
<pre> | |||
x=(1:100); | |||
y=exp(x); | |||
cor(x,y, method='spearman') # 1 | |||
cor(x,y, method='pearson') # .25 | |||
</pre> | |||
== | == Spearman vs Wilcoxon == | ||
By [http://www.talkstats.com/threads/wilcoxon-signed-rank-test-or-spearmans-rho.42395/ this post] | |||
* Wilcoxon used to compare categorical versus non-normal continuous variable | |||
* Spearman's rho used to compare two continuous (including '''ordinal''') variables that one or both aren't normally distributed | |||
== | == Spearman vs [https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Kendall correlation] == | ||
* Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the '''ordinal''' association between two measured quantities. | |||
* | * [https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho Kendall Tau or Spearman's rho?] | ||
= | == [http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe quartet] == | ||
Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression. | |||
[[:File:Anscombe quartet 3.svg]] | |||
= | == The real meaning of spurious correlations == | ||
https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/ | |||
https:// | |||
<syntaxhighlight lang='rsplus'> | <syntaxhighlight lang='rsplus'> | ||
library(ggplot2) | |||
set.seed(123) | |||
spurious_data <- data.frame(x = rnorm(500, 10, 1), | |||
y = rnorm(500, 10, 1), | |||
z = rnorm(500, 30, 3)) | |||
cor(spurious_data$x, spurious_data$y) | |||
# [1] -0.05943856 | |||
spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) + | |||
theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)") | |||
== | cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z) | ||
# [1] 0.4517972 | |||
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) + | |||
theme_bw() + geom_smooth(method = "lm") + | |||
scale_color_gradientn(colours = c("red", "white", "blue")) + | |||
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)") | |||
spurious_data$z <- rnorm(500, 30, 6) | |||
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z) | |||
# [1] 0.8424597 | |||
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) + | |||
theme_bw() + geom_smooth(method = "lm") + | |||
scale_color_gradientn(colours = c("red", "white", "blue")) + | |||
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)") | |||
</syntaxhighlight> | |||
= Time series = | |||
* [http://ellisp.github.io/blog/2016/12/07/arima-prediction-intervals Why time series forecasts prediction intervals aren't as good as we'd hope] | |||
== Structural change == | |||
[https://datascienceplus.com/structural-changes-in-global-warming/ Structural Changes in Global Warming] | |||
== | |||
[https:// | |||
= | == AR(1) processes and random walks == | ||
[https://fdabl.github.io/r/Spurious-Correlation.html Spurious correlations and random walks] | |||
* [https:// | = Measurement Error model = | ||
* [ | * [https://en.wikipedia.org/wiki/Errors-in-variables_models Errors-in-variables models/errors-in-variables models or measurement error models] | ||
* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13112 Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models] Nghiem 2019 | |||
= | = Dictionary = | ||
* | * '''Prognosis''' is the probability that an event or diagnosis will result in a particular outcome. | ||
* | ** For example, on the paper [http://clincancerres.aacrjournals.org/content/18/21/6065.figures-only Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine] by Matsui 2012, the prognostic score .1 (0.9) represents a '''good (poor)''' prognosis. | ||
* | ** Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See [https://en.wikipedia.org/wiki/Survival_rate Survival rate] in wikipedia. | ||
* | |||
* | |||
= Data = | |||
== Eleven quick tips for finding research data == | |||
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038 | |||
= Books = | |||
* [https://leanpub.com/biostatmethods Methods in Biostatistics with R] ($) | |||
* [http://web.stanford.edu/class/bios221/book/ Modern Statistics for Modern Biology] (free) | |||
* Principles of Applied Statistics, by David Cox & Christl Donnelly | |||
* [https://www.amazon.com/Freedman-Robert-Pisani-Statistics-Hardcover/dp/B004QNRMDK/ Statistics] by David Freedman,Robert Pisani, Roger Purves | |||
* [https://onlinelibrary.wiley.com/topic/browse/000113 Wiley Online Library -> Statistics] (Access by NIH Library) | |||
* [https://web.stanford.edu/~hastie/CASI/ Computer Age Statistical Inference: Algorithms, Evidence and Data Science] by Efron and Hastie 2016 | |||
* [https:// | |||
* | |||
= | = Social = | ||
* | == JSM == | ||
* [ | * 2019 | ||
** [https://minecr.shinyapps.io/jsm2019-schedule/ JSM 2019] and the [http://www.citizen-statistician.org/2019/07/shiny-for-jsm-2019/ post]. | |||
* [https:// | ** [https://rviews.rstudio.com/2019/07/19/an-r-users-guide-to-jsm-2019/ An R Users Guide to JSM 2019] | ||
== | == Following == | ||
* https:// | * [http://jtleek.com/ Jeff Leek], https://twitter.com/jtleek | ||
* | * Revolutions, http://blog.revolutionanalytics.com/ | ||
* [http:// | * RStudio Blog, https://blog.rstudio.com/ | ||
* Sean Davis, https://twitter.com/seandavis12, https://github.com/seandavi | |||
* [http://stephenturner.us/post/ Stephen Turner], https://twitter.com/genetics_blog |
Revision as of 22:35, 7 September 2019
Use
When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most information gain. For example,
- entropy (without any class)=.94,
- entropy(var 1) = .69,
- entropy(var 2)=.91,
- entropy(var 3)=.725.
We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).
Why is picking the attribute with the most information gain beneficial? It reduces entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.
Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.
Related
In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See wikipedia page about decision tree learning.
Ensembles
- Combining classifiers. Pro: better classification performance. Con: time consuming.
- Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/
- Common Ensemble Models can be Biased
Bagging
Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.
Random forest
Variance importance: if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.
Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).
Random forest has advantages of easier to run in parallel and suitable for small n large p problems.
Random forest versus logistic regression: a large-scale benchmark experiment by Raphael Couronné, BMC Bioinformatics 2018
Arborist: Parallelized, Extensible Random Forests
On what to permute in test-based approaches for variable importance measures in Random Forests
Boosting
Instead of selecting data points randomly with the boostrap, it favors the misclassified points.
Algorithm:
- Initialize the weights
- Repeat
- resample with respect to weights
- retrain the model
- recompute weights
Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.
Time series
Ensemble learning for time series forecasting in R
p-values
p-values
- Prob(Data | H0)
- https://en.wikipedia.org/wiki/P-value
- Statistical Inference in the 21st Century: A World Beyond p < 0.05 The American Statistician, 2019
- THE ASA SAYS NO TO P-VALUES The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
- It’s not the p-values’ fault
- Exploring P-values with Simulations in R from Stable Markets.
- p-value and effect size. http://journals.sagepub.com/doi/full/10.1177/1745691614553988
Distribution of p values in medical abstracts
- http://www.ncbi.nlm.nih.gov/pubmed/26608725
- An R package with several million published p-values in tidy data sets by Jeff Leek.
nominal p-value and Empirical p-values
- Nominal p-values are based on asymptotic null distributions
- Empirical p-values are computed from simulations/permutations
(nominal) alpha level
Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.
- http://www.translationdirectory.com/glossaries/glossary033.htm
- http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf
Normality assumption
Violating the normality assumption may be the lesser of two evils
T-statistic
See T-statistic.
ANOVA
See ANOVA.
Goodness of fit
Chi-square tests
Fitting distribution
Contingency Tables
Odds ratio and Risk ratio
The ratio of the odds of an event occurring in one group to the odds of it occurring in another group
drawn | not drawn | ------------------------------------- white | A | B | Wh ------------------------------------- black | C | D | Bk
- Odds Ratio = (A / C) / (B / D) = (AD) / (BC)
- Risk Ratio = (A / Wh) / (C / Bk)
Hypergeometric, One-tailed Fisher exact test
- https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the GO category than expected by chance?)
- https://en.wikipedia.org/wiki/Hypergeometric_distribution. In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing k or more successes from the population in n total draws. In a test for under-representation, the p-value is the probability of randomly drawing k or fewer successes.
- http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution
- Two sided hypergeometric test
- https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution.
- https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k).
drawn | not drawn | ------------------------------------- white | x | | m ------------------------------------- black | k-x | | n ------------------------------------- | k | | m+n
For example, k=100, m=100, m+n=1000,
> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)
[1] 0.4160339
> a <- dhyper(0:10, 100, 10^3-100, 100)
> cumsum(rev(a))
[1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116
[8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101 9.027724e-99 4.175767e-96
[15] 1.644702e-93 5.572070e-91 1.638079e-88 4.210963e-86 9.530281e-84 1.910424e-81 3.410345e-79
[22] 5.447786e-77 7.821658e-75 1.013356e-72 1.189000e-70 1.267638e-68 1.231736e-66 1.093852e-64
[29] 8.900857e-63 6.652193e-61 4.576232e-59 2.903632e-57 1.702481e-55 9.240350e-54 4.650130e-52
[36] 2.173043e-50 9.442985e-49 3.820823e-47 1.441257e-45 5.074077e-44 1.669028e-42 5.134399e-41
[43] 1.478542e-39 3.989016e-38 1.009089e-36 2.395206e-35 5.338260e-34 1.117816e-32 2.200410e-31
[50] 4.074043e-30 7.098105e-29 1.164233e-27 1.798390e-26 2.617103e-25 3.589044e-24 4.639451e-23
[57] 5.654244e-22 6.497925e-21 7.042397e-20 7.198582e-19 6.940175e-18 6.310859e-17 5.412268e-16
[64] 4.377256e-15 3.338067e-14 2.399811e-13 1.626091e-12 1.038184e-11 6.243346e-11 3.535115e-10
[71] 1.883810e-09 9.442711e-09 4.449741e-08 1.970041e-07 8.188671e-07 3.193112e-06 1.167109e-05
[78] 3.994913e-05 1.279299e-04 3.828641e-04 1.069633e-03 2.786293e-03 6.759071e-03 1.525017e-02
[85] 3.196401e-02 6.216690e-02 1.120899e-01 1.872547e-01 2.898395e-01 4.160339e-01 5.550192e-01
[92] 6.909666e-01 8.079129e-01 8.953150e-01 9.511926e-01 9.811343e-01 9.942110e-01 9.986807e-01
[99] 9.998018e-01 9.999853e-01 1.000000e+00
# Density plot
plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')
Moreover,
1 - phyper(q=10, m, n, k) = 1 - sum_{x=0}^{x=10} phyper(x, m, n, k) = 1 - sum(a[1:11]) # R's index starts from 1.
Another example is the data from the functional annotation tool in DAVID.
| gene list | not gene list | ------------------------------------------------------- pathway | 3 (q) | | 40 (m) ------------------------------------------------------- not in pathway | 297 | | 29960 (n) ------------------------------------------------------- | 300 (k) | | 30000
The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.
Fisher's exact test
Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) # alternative = "two.sided" by default
Fisher's Exact Test for Count Data
data: matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.488738 23.966741
sample estimates:
odds ratio
7.564602
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")
Fisher's Exact Test for Count Data
data: matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.973 Inf
sample estimates:
odds ratio
7.564602
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")
Fisher's Exact Test for Count Data
data: matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.9991
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.00000 20.90259
sample estimates:
odds ratio
7.564602
From the documentation of fisher.test
Usage: fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE, control = list(), or = 1, alternative = "two.sided", conf.int = TRUE, conf.level = 0.95, simulate.p.value = FALSE, B = 2000)
- For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.
- For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
- The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.
- Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
Chi-square independence test
Exploring the underlying theory of the chi-square test through simulation - part 2
GSEA
Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states
- https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html
- http://software.broadinstitute.org/gsea/index.jsp
- Statistical power of gene-set enrichment analysis is a function of gene set correlation structure by SWANSON 2017
- Towards a gold standard for benchmarking gene set enrichment analysis, GSEABenchmarkeR package
Two categories of GSEA procedures:
- Competitive: compare genes in the test set relative to all other genes.
- Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by Jiang & Gentleman Bioinformatics 2007
McNemar’s test on paired nominal data
https://en.wikipedia.org/wiki/McNemar%27s_test
Confidence vs Credibility Intervals
http://freakonometrics.hypotheses.org/18117
Power analysis/Sample Size determination
See Power.
Common covariance/correlation structures
See psu.edu. Assume covariance \(\displaystyle \Sigma = (\sigma_{ij})_{p\times p} \)
- Diagonal structure: \(\displaystyle \sigma_{ij} = 0\) if \(\displaystyle i \neq j\).
- Compound symmetry: \(\displaystyle \sigma_{ij} = \rho\) if \(\displaystyle i \neq j\).
- First-order autoregressive AR(1) structure: \(\displaystyle \sigma_{ij} = \rho^{|i - j|}\).
rho <- .8 p <- 5 blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))
- Banded matrix: \(\displaystyle \sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0\) and \(\displaystyle \sigma_{ij}=0\) for \(\displaystyle |i-j| \ge 3\).
- Spatial Power
- Unstructured Covariance
- Toeplitz structure
To create blocks of correlation matrix, use the "%x%" operator. See kronecker().
covMat <- diag(n.blocks) %x% blockMat
Counter/Special Examples
Suppose X is a normally-distributed random variable with zero mean. Let Y = X^2. Clearly X and Y are not independent: if you know X, you also know Y. And if you know Y, you know the absolute value of X.
The covariance of X and Y is
Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3) = 0,
because the distribution of X is symmetric around zero. Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet have (linear) correlation r(X,Y) = 0.
This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.
Spearman vs Pearson correlation
Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation
x=(1:100);<br /> y=exp(x); <br /> cor(x,y, method='spearman') # 1 cor(x,y, method='pearson') # .25
Spearman vs Wilcoxon
By this post
- Wilcoxon used to compare categorical versus non-normal continuous variable
- Spearman's rho used to compare two continuous (including ordinal) variables that one or both aren't normally distributed
Spearman vs Kendall correlation
- Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities.
- Kendall Tau or Spearman's rho?
Anscombe quartet
Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.
The real meaning of spurious correlations
https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/
library(ggplot2)
set.seed(123)
spurious_data <- data.frame(x = rnorm(500, 10, 1),
y = rnorm(500, 10, 1),
z = rnorm(500, 30, 3))
cor(spurious_data$x, spurious_data$y)
# [1] -0.05943856
spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) +
theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.4517972
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
theme_bw() + geom_smooth(method = "lm") +
scale_color_gradientn(colours = c("red", "white", "blue")) +
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")
spurious_data$z <- rnorm(500, 30, 6)
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.8424597
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
theme_bw() + geom_smooth(method = "lm") +
scale_color_gradientn(colours = c("red", "white", "blue")) +
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")
Time series
Structural change
Structural Changes in Global Warming
AR(1) processes and random walks
Spurious correlations and random walks
Measurement Error model
- Errors-in-variables models/errors-in-variables models or measurement error models
- Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models Nghiem 2019
Dictionary
- Prognosis is the probability that an event or diagnosis will result in a particular outcome.
- For example, on the paper Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine by Matsui 2012, the prognostic score .1 (0.9) represents a good (poor) prognosis.
- Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See Survival rate in wikipedia.
Data
Eleven quick tips for finding research data
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038
Books
- Methods in Biostatistics with R ($)
- Modern Statistics for Modern Biology (free)
- Principles of Applied Statistics, by David Cox & Christl Donnelly
- Statistics by David Freedman,Robert Pisani, Roger Purves
- Wiley Online Library -> Statistics (Access by NIH Library)
- Computer Age Statistical Inference: Algorithms, Evidence and Data Science by Efron and Hastie 2016
Social
JSM
- 2019
- JSM 2019 and the post.
- An R Users Guide to JSM 2019