Testmathj: Difference between revisions

From David's Wiki
No edit summary
No edit summary
Line 1: Line 1:
= Statisticians =
* [https://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] (1857-1936): chi-square, p-value, PCA
* [https://en.wikipedia.org/wiki/William_Sealy_Gosset William Sealy Gosset] (1876-1937): Student's t
* [https://en.wikipedia.org/wiki/Ronald_Fisher Ronald Fisher] (1890-1962): ANOVA
* [https://en.wikipedia.org/wiki/Egon_Pearson Egon Pearson] (1895-1980): son of Karl Pearson
* [https://en.wikipedia.org/wiki/Jerzy_Neyman Jerzy Neyman] (1894-1981): type 1 error


= Statistics for biologists =
== Use ==
http://www.nature.com/collections/qghhqm
When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most '''information gain'''. For example,
* entropy (without any class)=.94,
* entropy(var 1) = .69,
* entropy(var 2)=.91,
* entropy(var 3)=.725.
We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).  


= Transform sample values to their percentiles =
Why is picking the attribute with the most information gain beneficial? It ''reduces'' entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.
https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column
<syntaxhighlight lang='bash'>
set.seed(1234)
x <- rnorm(10)
x
# [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
# [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378
ecdf(x)(x)
# [1] 0.2 0.7 1.0 0.1 0.8 0.9 0.4 0.6 0.5 0.3


rank(x)
Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.
# [1]  2  7 10  1  8  9  4  6  5  3
</syntaxhighlight>


= Box(Box and whisker) plot in R =
== Related ==
See
In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See [http://en.wikipedia.org/wiki/Decision_tree_learning wikipedia page] about decision tree learning.
* https://en.wikipedia.org/wiki/Box_plot
* https://owi.usgs.gov/blog/boxplots/ (ggplot2 is used)
* https://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
* [https://en.wikipedia.org/wiki/Quartile Quartile] from Wikipedia. The quartiles returned from R are the same as the method defined by Method 2 described in Wikipedia.


An example for a graphical explanation.
= Ensembles =
<syntaxhighlight lang='rsplus'>
* Combining classifiers. Pro: better classification performance. Con: time consuming.
> x=c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)
* Comic http://flowingdata.com/2017/09/05/xkcd-ensemble-model/
> summary(x)
* [http://www.win-vector.com/blog/2019/07/common-ensemble-models-can-be-biased/ Common Ensemble Models can be Biased]
  Min. 1st Qu. Median    Mean 3rd Qu.   Max.  
      0      2      4      6      7      20
> sort(x)
[1] 0  1  1  3  3  4  5  6  8 15 20
> boxplot(x, col = 'grey')


# https://en.wikipedia.org/wiki/Quartile#Example_1
== Bagging ==
> summary(c(6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49))
Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.00  25.50  40.00  33.18  42.50  49.00
</syntaxhighlight>
[[:File:Boxplot.svg]]


* The lower and upper edges of box is determined by the first and 3rd '''quartiles''' (2 and 7 in the above example).
=== Random forest ===
** 2 = median(c(0,  1,  1,  3,  3,  4)) = (1+3)/2
'''Variance importance''': if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.
** 7 = median(c(4,  5,  6,  8, 15, 20)) = (6+8)/2
** IQR = 7 - 2 = 5
* The thick dark horizon line is the '''median''' (4 in the example).
* '''Outliers''' are defined by (the empty circles in the plot)
** Observations larger than 3rd quartile + 1.5 * IQR (7+1.5*5=14.5) and  
** smaller than 1st quartile - 1.5 * IQR (2-1.5*5=-5.5). 
** Note that ''the cutoffs are not shown in the Box plot''.
* Whisker (defined using the cutoffs used to define outliers)
** '''Upper whisker''' is defined by '''the largest "data" below 3rd quartile + 1.5 * IQR''' (8 in this example), and
** '''Lower whisker''' is defined by '''the smallest "data" greater than 1st quartile - 1.5 * IQR''' (0 in this example).
** See another example below where we can see the whiskers fall on observations.


Note the [http://en.wikipedia.org/wiki/Box_plot wikipedia] lists several possible definitions of a whisker. R uses the 2nd method (Tukey boxplot) to define whiskers.
Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).


== Create boxplots from a list object ==
Random forest has advantages of easier to run in parallel and suitable for small n large p problems.
Normally we use a vector to create a single boxplot or a formula on a data to create boxplots.  


But we can also use [https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/split split()] to create a list and then make boxplots.
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2264-5 Random forest versus logistic regression: a large-scale benchmark experiment] by Raphael Couronné, BMC Bioinformatics 2018


== Dot-box plot ==
[https://github.com/suiji/arborist Arborist]: Parallelized, Extensible Random Forests
* http://civilstat.com/2012/09/the-grammar-of-graphics-notes-on-first-reading/
* http://www.r-graph-gallery.com/89-box-and-scatter-plot-with-ggplot2/
* http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization
* [https://designdatadecisions.wordpress.com/2015/06/09/graphs-in-r-overlaying-data-summaries-in-dotplots/ Graphs in R – Overlaying Data Summaries in Dotplots]. Note that for some reason, the boxplot will cover the dots when we save the plot to an svg or a png file. So an alternative solution is to change the order <syntaxhighlight lang='rsplus'>
par(cex.main=0.9,cex.lab=0.8,font.lab=2,cex.axis=0.8,font.axis=2,col.axis="grey50")
boxplot(weight ~ feed, data = chickwts, range=0, whisklty = 0, staplelty = 0)
par(new = TRUE)
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method="stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")
</syntaxhighlight>


[[:File:Boxdot.svg]]
[https://academic.oup.com/bioinformatics/article-abstract/35/15/2701/5250706?redirectedFrom=fulltext On what to permute in test-based approaches for variable importance measures in Random Forests]


== Other boxplots ==
== Boosting ==
Instead of selecting data points randomly with the boostrap, it favors the misclassified points.


[[:File:Lotsboxplot.png]]
Algorithm:
* Initialize the weights
* Repeat
** resample with respect to weights
** retrain the model
** recompute weights


= stem and leaf plot =
Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.
[https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/stem.html stem()]. See [http://www.r-tutor.com/elementary-statistics/quantitative-data/stem-and-leaf-plot R Tutorial].


Note that stem plot is useful when there are outliers.
== Time series ==
<syntaxhighlight lang='rsplus'>
[https://petolau.github.io/Ensemble-of-trees-for-forecasting-time-series/ Ensemble learning for time series forecasting in R]
> stem(x)


  The decimal point is 10 digit(s) to the right of the |
= p-values =
==  p-values ==
* Prob(Data | H0)
* https://en.wikipedia.org/wiki/P-value
* [https://amstat.tandfonline.com/toc/utas20/73/sup1 Statistical Inference in the 21st Century: A World Beyond p < 0.05] The American Statistician, 2019
* [https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/ THE ASA SAYS NO TO P-VALUES] The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis. We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
* [http://www.r-statistics.com/2016/03/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement/ It’s not the p-values’ fault]
* [https://stablemarkets.wordpress.com/2016/05/21/exploring-p-values-with-simulations-in-r/ Exploring P-values with Simulations in R] from Stable Markets.
* p-value and [https://en.wikipedia.org/wiki/Effect_size effect size]. http://journals.sagepub.com/doi/full/10.1177/1745691614553988


  0 | 00000000000000000000000000000000000000000000000000000000000000000000+419
== Distribution of p values in medical abstracts ==
  1 |
* http://www.ncbi.nlm.nih.gov/pubmed/26608725
  2 |
* [https://github.com/jtleek/tidypvals An R package with several million published p-values in tidy data sets] by Jeff Leek.
  3 |
  4 |
  5 |
  6 |
  7 |
  8 |
  9 |
  10 |
  11 |
  12 | 9


> max(x)
== nominal p-value and Empirical p-values ==
[1] 129243100275
* Nominal p-values are based on asymptotic null distributions
> max(x)/1e10
* Empirical p-values are computed from simulations/permutations
[1] 12.92431


> stem(y)
== (nominal) alpha level ==
Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.
* http://www.translationdirectory.com/glossaries/glossary033.htm
* http://courses.washington.edu/p209s07/lecturenotes/Week%205_Monday%20overheads.pdf


  The decimal point is at the |
== Normality assumption ==
[https://www.biorxiv.org/content/early/2018/12/20/498931 Violating the normality assumption may be the lesser of two evils]


  0 | 014478
= T-statistic =
  1 | 0
See [[T-test#T-statistic|T-statistic]].
  2 | 1
  3 | 9
  4 | 8


> y
= ANOVA =
[1] 3.8667356428 0.0001762708 0.7993462430 0.4181079732 0.9541728562
See [[T-test#ANOVA|ANOVA]].
[6] 4.7791262101 0.6899313108 2.1381289177 0.0541736818 0.3868776083


> set.seed(1234)
= [https://en.wikipedia.org/wiki/Goodness_of_fit Goodness of fit] =
> z <- rnorm(10)*10
== [https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test Chi-square tests] ==
> z
* [http://freakonometrics.hypotheses.org/20531 An application of chi-square tests]
[1] -12.070657  2.774292  10.844412 -23.456977  4.291247  5.060559
[7]  -5.747400  -5.466319  -5.644520  -8.900378
> stem(z)


  The decimal point is 1 digit(s) to the right of the |
== Fitting distribution ==
[https://magesblog.com/post/2011-12-01-fitting-distributions-with-r/ Fitting distributions with R]


   -2 | 3
= Contingency Tables =
   -1 | 2
== [https://en.wikipedia.org/wiki/Odds_ratio Odds ratio and Risk ratio] ==
  -0 | 9665
The ratio of the odds of an event occurring in one group to the odds of it occurring in another group
  0 | 345
<pre>
  1 | 1
        drawn   | not drawn |
</syntaxhighlight>
-------------------------------------
white |  A      B      | Wh
-------------------------------------
black |   C      |   D      | Bk
</pre>
* Odds Ratio = (A / C) / (B / D) = (AD) / (BC)
* Risk Ratio = (A / Wh) / (C / Bk)


= Box-Cox transformation =
== Hypergeometric, One-tailed Fisher exact test ==
* [https://en.wikipedia.org/wiki/Power_transform#Box%E2%80%93Cox_transformation Power transformation]
* https://www.bioconductor.org/help/course-materials/2009/SeattleApr09/gsea/ (Are interesting features over-represented? or are selected genes more often in the ''GO category'' than expected by chance?)
* [http://denishaine.wordpress.com/2013/03/11/veterinary-epidemiologic-research-linear-regression-part-3-box-cox-and-matrix-representation/ Finding transformation for normal distribution]
* https://en.wikipedia.org/wiki/Hypergeometric_distribution. '' In a test for over-representation of successes in the sample, the hypergeometric p-value is calculated as the probability of randomly drawing '''k''' or more successes from the population in '''n''' total draws. In a test for under-representation, the p-value is the probability of randomly drawing '''k''' or fewer successes.''
 
* http://stats.stackexchange.com/questions/62235/one-tailed-fishers-exact-test-and-the-hypergeometric-distribution
= the Holy Trinity (LRT, Wald, Score tests) =
* Two sided hypergeometric test
* https://en.wikipedia.org/wiki/Likelihood_function which includes '''profile likelihood''' and '''partial likelihood'''
** http://stats.stackexchange.com/questions/155189/how-to-make-a-two-tailed-hypergeometric-test
* [http://data.princeton.edu/wws509/notes/a1.pdf Review of the likelihood theory]
** http://stats.stackexchange.com/questions/140107/p-value-in-a-two-tail-test-with-asymmetric-null-distribution
* [http://www.tandfonline.com/doi/full/10.1080/00031305.2014.955212#abstract?ai=rv&mi=3be122&af=R The “Three Plus One” Likelihood-Based Test Statistics: Unified Geometrical and Graphical Interpretations]
** http://stats.stackexchange.com/questions/19195/explaining-two-tailed-tests
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5969114/ Variable selection – A review and recommendations for the practicing statistician] by Heinze et al 2018.
* https://www.biostars.org/p/90662/ When computing the p-value (tail probability), consider to use 1 - Prob(observed -1) instead of 1 - Prob(observed) for discrete distribution.
** Score test is step-up. Score test is typically used in forward steps to screen covariates currently not included in a model for their ability to improve model.
* https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Hypergeometric.html p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k).
** Wald test is step-down. Wald test starts at the full model. It evaluate the significance of a variable by comparing the ratio of its estimate and its standard error with an appropriate t distribution (for linear models) or standard normal distribution (for logistic or Cox regression).
<pre>
** Likelihood ratio tests provide the best control over nuisance parameters by maximizing the likelihood over them both in H0 model and H1 model. In particular, if several coefficients are being tested simultaneously, LRTs for model comparison are preferred over Wald or score tests.
        drawn  | not drawn |
* R packages
-------------------------------------
** lmtest package, [https://www.rdocumentation.org/packages/lmtest/versions/0.9-37/topics/waldtest waldtest()] and [https://www.rdocumentation.org/packages/lmtest/versions/0.9-37/topics/lrtest lrtest()].
white |  x      |          | m
 
-------------------------------------
= Don't invert that matrix  =
black |  k-x    |          | n
* http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/
-------------------------------------
* http://civilstat.com/2015/07/dont-invert-that-matrix-why-and-how/
      |  k      |          | m+n
 
</pre>
== Different matrix decompositions/factorizations ==
 
* [https://en.wikipedia.org/wiki/QR_decomposition QR decomposition], [https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/qr qr()]
* [https://en.wikipedia.org/wiki/LU_decomposition LU decomposition], [https://www.rdocumentation.org/packages/Matrix/versions/1.2-14/topics/lu lu()] from the 'Matrix' package
* [https://en.wikipedia.org/wiki/Cholesky_decomposition Cholesky decomposition], [https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/chol chol()]
* [https://en.wikipedia.org/wiki/Singular-value_decomposition Singular value decomposition], [https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/svd svd()]


For example, k=100, m=100, m+n=1000,
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='rsplus'>
set.seed(1234)
> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)
x <- matrix(rnorm(10*2), nr= 10)
[1] 0.4160339
cmat <- cov(x); cmat
> a <- dhyper(0:10, 100, 10^3-100, 100)
# [,1]       [,2]
> cumsum(rev(a))
# [1,0.9915928 -0.1862983
  [1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116
# [2,] -0.1862983 1.1392095
  [8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101  9.027724e-99  4.175767e-96
 
[15] 1.644702e-93  5.572070e-91  1.638079e-88  4.210963e-86  9.530281e-84  1.910424e-81  3.410345e-79
# cholesky decom
[22]  5.447786e-77  7.821658e-75  1.013356e-72  1.189000e-70  1.267638e-68  1.231736e-66  1.093852e-64
d1 <- chol(cmat)
[298.900857e-63  6.652193e-61  4.576232e-59  2.903632e-57  1.702481e-55  9.240350e-54  4.650130e-52
t(d1) %*% d1 # equal to cmat
[36]  2.173043e-50  9.442985e-49  3.820823e-47  1.441257e-45  5.074077e-44  1.669028e-42  5.134399e-41
d1 # upper triangle
[43] 1.478542e-39  3.989016e-38  1.009089e-36  2.395206e-35  5.338260e-34 1.117816e-32  2.200410e-31
# [,1]      [,2]
[50]  4.074043e-30  7.098105e-29  1.164233e-27  1.798390e-26  2.617103e-25  3.589044e-24  4.639451e-23
# [1,] 0.9957875 -0.1870864
[57]  5.654244e-22  6.497925e-21  7.042397e-20  7.198582e-19  6.940175e-18  6.310859e-17  5.412268e-16
# [2,] 0.0000000 1.0508131
[64]  4.377256e-15  3.338067e-14  2.399811e-13  1.626091e-12  1.038184e-11  6.243346e-11  3.535115e-10
  [71]  1.883810e-09  9.442711e-09  4.449741e-08  1.970041e-07  8.188671e-07  3.193112e-06  1.167109e-05
  [78]  3.994913e-05  1.279299e-04  3.828641e-04  1.069633e-03  2.786293e-03  6.759071e-03  1.525017e-02
[85]  3.196401e-02  6.216690e-02  1.120899e-01  1.872547e-01  2.898395e-01  4.160339e-01  5.550192e-01
[92] 6.909666e-01  8.079129e-01  8.953150e-01  9.511926e-01  9.811343e-01  9.942110e-01  9.986807e-01
[99] 9.998018e-01  9.999853e-01 1.000000e+00


# svd
# Density plot
d2 <- svd(cmat)
plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')
d2$u %*% diag(d2$d) %*% t(d2$v) # equal to cmat
d2$u %*% diag(sqrt(d2$d))
# [,1]      [,2]
# [1,] -0.6322816 0.7692937
# [2,]  0.9305953 0.5226872
</syntaxhighlight>
</syntaxhighlight>
[[:File:Dhyper.svg]]


= Linear Regression =
Moreover,
[https://leanpub.com/regmods Regression Models for Data Science in R] by Brian Caffo
<pre>
  1 - phyper(q=10, m, n, k)
= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k)
= 1 - sum(a[1:11]) # R's index starts from 1.
</pre>


Comic https://xkcd.com/1725/
Another example is the data from [https://david.ncifcrf.gov/helps/functional_annotation.html#fisher the functional annotation tool] in DAVID.
 
<pre>
== Different models (in R) ==
              | gene list | not gene list |
http://www.quantide.com/raccoon-ch-1-introduction-to-linear-models-with-r/
-------------------------------------------------------
 
pathway        |  3  (q)  |              | 40 (m)
== dummy.coef.lm() in R ==
-------------------------------------------------------
Extracts coefficients in terms of the original levels of the coefficients rather than the coded variables.
not in pathway |  297      |              | 29960 (n)
 
-------------------------------------------------------
== model.matrix, design matrix ==
              |  300 (k)  |              | 30000
[https://github.com/csoneson/ExploreModelMatrix ExploreModelMatrix]: Explore design matrices interactively with R/Shiny
</pre>
 
The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.
== Contrasts in linear regression ==
* Page 147 of Modern Applied Statistics with S (4th ed)
* https://biologyforfun.wordpress.com/2015/01/13/using-and-interpreting-different-contrasts-in-linear-models-in-r/ This explains the meanings of 'treatment', 'helmert' and 'sum' contrasts.
* [http://rstudio-pubs-static.s3.amazonaws.com/65059_586f394d8eb84f84b1baaf56ffb6b47f.html A (sort of) Complete Guide to Contrasts in R] by Rose Maier <syntaxhighlight lang='rsplus'>
mat


##      constant NLvMH  NvL  MvH
== [https://en.wikipedia.org/wiki/Fisher%27s_exact_test Fisher's exact test] ==
## [1,]        1  -0.5  0.5  0.0
Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.
## [2,]       1  -0.5 -0.5  0.0
## [3,]        1  0.5  0.0  0.5
## [4,]        1  0.5  0.0 -0.5
mat <- mat[ , -1]
 
model7 <- lm(y ~ dose, data=data, contrasts=list(dose=mat) )
summary(model7)
 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  118.578      1.076 110.187  < 2e-16 ***
## doseNLvMH      3.179      2.152  1.477  0.14215   
## doseNvL      -8.723      3.044  -2.866  0.00489 **
## doseMvH      13.232      3.044  4.347 2.84e-05 ***
 
# double check your contrasts
attributes(model7$qr$qr)$contrasts
## $dose
##      NLvMH  NvL  MvH
## None  -0.5  0.5  0.0
## Low  -0.5 -0.5  0.0
## Med    0.5  0.0  0.5
## High  0.5  0.0 -0.5
 
library(dplyr)
dose.means <- summarize(group_by(data, dose), y.mean=mean(y))
dose.means
## Source: local data frame [4 x 2]
##
##  dose  y.mean
## 1 None 112.6267
## 2  Low 121.3500
## 3  Med 126.7839
## 4 High 113.5517
 
# The coefficient estimate for the first contrast (3.18) equals the average of
# the last two groups (126.78 + 113.55 /2 = 120.17) minus the average of
# the first two groups (112.63 + 121.35 /2 = 116.99).
</syntaxhighlight>
 
== Multicollinearity ==
* [https://datascienceplus.com/multicollinearity-in-r/ Multicollinearity in R]
* [https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/alias alias]: Find Aliases (Dependencies) In A Model
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='rsplus'>
> op <- options(contrasts = c("contr.helmert", "contr.poly"))
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) #  alternative = "two.sided" by default
> npk.aov <- aov(yield ~ block + N*P*K, npk)
> alias(npk.aov)
Model :
yield ~ block + N * P * K


Complete :
        Fisher's Exact Test for Count Data
        (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1 N1:K1 P1:K1
N1:P1:K1    0          1    1/3    1/6  -3/10  -1/5      0    0    0    0    0    0


> options(op)
data:  matrix(c(3, 40, 297, 29960), nr = 2)
</syntaxhighlight>
p-value = 0.008853
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  1.488738 23.966741
sample estimates:
odds ratio
  7.564602


== Exposure ==
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")
https://en.mimi.hu/mathematics/exposure_variable.html


Independent variable = predictor = explanatory = exposure variable
        Fisher's Exact Test for Count Data


== Confounders, confounding ==
data:  matrix(c(3, 40, 297, 29960), nr = 2)
* https://en.wikipedia.org/wiki/Confounding
p-value = 0.008853
** [https://academic.oup.com/jamia/article/21/2/308/723853 A method for controlling complex confounding effects in the detection of adverse drug reactions using electronic health records]. It provides a rule to identify a confounder.
alternative hypothesis: true odds ratio is greater than 1
* http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html (R example)
95 percent confidence interval:
* [http://www.cantab.net/users/filimon/cursoFCDEF/will/logistic_confound.pdf Logistic Regression: Confounding and Colinearity]
1.973  Inf
* [https://stats.stackexchange.com/questions/192591/identifying-a-confounder?rq=1 Identifying a confounder]
sample estimates:
* [https://stats.stackexchange.com/questions/38326/is-it-possible-to-have-a-variable-that-acts-as-both-an-effect-modifier-and-a-con Is it possible to have a variable that acts as both an effect modifier and a confounder?]
odds ratio
* [https://stats.stackexchange.com/questions/34644/which-test-to-use-to-check-if-a-possible-confounder-impacts-a-0-1-result Which test to use to check if a possible confounder impacts a 0 / 1 result?]
  7.564602
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1700-9 Addressing confounding artifacts in reconstruction of gene co-expression networks] Parsana 2019


== Causal inference ==
> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")
* https://en.wikipedia.org/wiki/Causal_inference
* [http://www.rebeccabarter.com/blog/2017-07-05-confounding/ Confounding in causal inference: what is it, and what to do about it?]


== Confidence interval vs prediction interval ==
        Fisher's Exact Test for Count Data
Confidence intervals tell you about how well you have determined the mean E(Y). Prediction intervals tell you where you can expect to see the next data point sampled. That is, CI is computed using Var(E(Y|X)) and PI is computed using Var(E(Y|X) + e).


* http://www.graphpad.com/support/faqid/1506/
data: matrix(c(3, 40, 297, 29960), nr = 2)
* http://en.wikipedia.org/wiki/Prediction_interval
p-value = 0.9991
* http://robjhyndman.com/hyndsight/intervals/
alternative hypothesis: true odds ratio is less than 1
* https://stat.duke.edu/courses/Fall13/sta101/slides/unit7lec3H.pdf
95 percent confidence interval:
* https://datascienceplus.com/prediction-interval-the-wider-sister-of-confidence-interval/
   0.00000 20.90259
 
sample estimates:
== Heteroskedasticity ==
odds ratio
[http://www.brodrigues.co/blog/2018-07-08-rob_stderr/ Dealing with heteroskedasticity; regression with robust standard errors using R]
   7.564602
 
== Linear regression with Map Reduce ==
https://freakonometrics.hypotheses.org/53269
 
== Relationship between multiple variables ==
[https://statisticaloddsandends.wordpress.com/2019/08/24/visualizing-the-relationship-between-multiple-variables/ Visualizing the relationship between multiple variables]
 
= Quantile regression =
* https://en.wikipedia.org/wiki/Quantile_regression
* [https://insightr.wordpress.com/2019/08/13/basic-quantile-regression/ Basic Quantile Regression]
 
= Non- and semi-parametric regression =
* [https://mathewanalytics.com/2018/03/05/semiparametric-regression-in-r/ Semiparametric Regression in R]
* https://socialsciences.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html
 
== Mean squared error ==
* [https://www.statworx.com/de/blog/simulating-the-bias-variance-tradeoff-in-r/ Simulating the bias-variance tradeoff in R]
* [https://alemorales.info/post/variance-estimators/ Estimating variance: should I use n or n - 1? The answer is not what you think]
 
== Splines ==
* https://en.wikipedia.org/wiki/B-spline
* [https://www.r-bloggers.com/cubic-and-smoothing-splines-in-r/ Cubic and Smoothing Splines in R]. '''bs()''' is for cubic spline and '''smooth.spline()''' is for smoothing spline.
* [https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/ Can we use B-splines to generate non-linear data?]
* [https://stats.stackexchange.com/questions/29400/spline-fitting-in-r-how-to-force-passing-two-data-points How to force passing two data points?] ([https://cran.r-project.org/web/packages/cobs/index.html cobs] package)
* https://www.rdocumentation.org/packages/cobs/versions/1.3-3/topics/cobs
 
== k-Nearest neighbor regression ==
* k-NN regression in practice: boundary problem, discontinuities problem.
* Weighted k-NN regression: want weight to be small when distance is large. Common choices - weight = kernel(xi, x)
 
== Kernel regression ==
* Instead of weighting NN, weight ALL points. Nadaraya-Watson kernel weighted average:
<math>\hat{y}_q = \sum c_{qi} y_i/\sum c_{qi} = \frac{\sum \text{Kernel}_\lambda(\text{distance}(x_i, x_q))*y_i}{\sum \text{Kernel}_\lambda(\text{distance}(x_i, x_q))} </math>.
* Choice of bandwidth <math>\lambda</math> for bias, variance trade-off. Small <math>\lambda</math> is over-fitting. Large <math>\lambda</math> can get an over-smoothed fit. '''Cross-validation'''.
* Kernel regression leads to locally constant fit.
* Issues with high dimensions, data scarcity and computational complexity.
 
= Principal component analysis =
See [[PCA|PCA]].
 
= Partial Least Squares (PLS) =
* https://en.wikipedia.org/wiki/Partial_least_squares_regression. The general underlying model of multivariate PLS is
:<math>X = T P^\mathrm{T} + E</math>
:<math>Y = U Q^\mathrm{T} + F</math>
where {{mvar|X}} is an <math>n \times m</math> matrix of predictors, {{mvar|Y}} is an <math>n \times p</math> matrix of responses; {{mvar|T}} and {{mvar|U}} are <math>n \times l</math> matrices that are, respectively, '''projections''' of {{mvar|X}} (the X '''score''', ''component'' or '''factor matrix''') and projections of {{mvar|Y}} (the ''Y scores''); {{mvar|P}} and {{mvar|Q}} are, respectively, <math>m \times l</math> and <math>p \times l</math> orthogonal '''loading matrices'''; and matrices {{mvar|E}} and {{mvar|F}} are the error terms, assumed to be independent and identically distributed random normal variables. The decompositions of {{mvar|X}} and {{mvar|Y}} are made so as to maximise the '''covariance''' between {{mvar|T}} and {{mvar|U}} (projection matrices).
* [https://www.gokhanciflikli.com/post/learning-brexit/ Supervised vs. Unsupervised Learning: Exploring Brexit with PLS and PCA]
* [https://cran.r-project.org/web/packages/pls/index.html pls] R package
* [https://cran.r-project.org/web/packages/plsRcox/index.html plsRcox] R package (archived). See [[R#install_a_tar.gz_.28e.g._an_archived_package.29_from_a_local_directory|here]] for the installation.
 
[https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print12.pdf#page=101 PLS, PCR (principal components regression) and ridge regression tend to behave similarly]. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps.
 
= High dimension =
[https://projecteuclid.org/euclid.aos/1547197242 Partial least squares prediction in high-dimensional regression] Cook and Forzani, 2019
 
== Feature selection ==
https://en.wikipedia.org/wiki/Feature_selection
 
= [https://en.wikipedia.org/wiki/Independent_component_analysis Independent component analysis] =
ICA is another dimensionality reduction method.
 
== ICA vs PCA ==
 
== ICS vs FA ==
 
= [https://en.wikipedia.org/wiki/Correspondence_analysis Correspondence analysis] =
https://francoishusson.wordpress.com/2017/07/18/multiple-correspondence-analysis-with-factominer/ and the book [https://www.crcpress.com/Exploratory-Multivariate-Analysis-by-Example-Using-R-Second-Edition/Husson-Le-Pages/p/book/9781138196346?tab=rev Exploratory Multivariate Analysis by Example Using R]
 
= t-SNE =
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
 
* https://distill.pub/2016/misread-tsne/
* https://lvdmaaten.github.io/tsne/
* Application to [http://amp.pharm.mssm.edu/archs4/data.html ARCHS4]
* [https://www.codeproject.com/tips/788739/visualization-of-high-dimensional-data-using-t-sne Visualization of High Dimensional Data using t-SNE with R]
* http://blog.thegrandlocus.com/2018/08/a-tutorial-on-t-sne-1
* [https://intobioinformatics.wordpress.com/2019/05/30/quick-and-easy-t-sne-analysis-in-r/ Quick and easy t-SNE analysis in R]
 
= Visualize the random effects =
http://www.quantumforest.com/2012/11/more-sense-of-random-effects/
 
= [https://en.wikipedia.org/wiki/Calibration_(statistics) Calibration] =
* [https://stats.stackexchange.com/questions/43053/how-to-determine-calibration-accuracy-uncertainty-of-a-linear-regression How to determine calibration accuracy/uncertainty of a linear regression?]
* [https://chem.libretexts.org/Textbook_Maps/Analytical_Chemistry/Book%3A_Analytical_Chemistry_2.0_(Harvey)/05_Standardizing_Analytical_Methods/5.4%3A_Linear_Regression_and_Calibration_Curves Linear Regression and Calibration Curves]
* [https://www.webdepot.umontreal.ca/Usagers/sauves/MonDepotPublic/CHM%203103/LCGC%20Eur%20Burke%202001%20-%202%20de%204.pdf Regression and calibration] Shaun Burke
* [https://cran.r-project.org/web/packages/calibrate calibrate] package
* [https://diagnprognres.biomedcentral.com/articles/10.1186/s41512-018-0029-2 The index of prediction accuracy: an intuitive measure useful for evaluating risk prediction models] by Kattan and Gerds 2018. The following code demonstrates Figure 2. <syntaxhighlight lang='rsplus'>
# Odds ratio =1 and calibrated model
set.seed(666)
x = rnorm(1000)         
z1 = 1 + 0*x       
pr1 = 1/(1+exp(-z1))
y1 = rbinom(1000,1,pr1) 
mean(y1) # .724, marginal prevalence of the outcome
dat1 <- data.frame(x=x, y=y1)
newdat1 <- data.frame(x=rnorm(1000), y=rbinom(1000, 1, pr1))
 
# Odds ratio =1 and severely miscalibrated model
set.seed(666)
x = rnorm(1000)         
z2 =  -2 + 0*x       
pr2 = 1/(1+exp(-z2)) 
y2 = rbinom(1000,1,pr2) 
mean(y2) # .12
dat2 <- data.frame(x=x, y=y2)
newdat2 <- data.frame(x=rnorm(1000), y=rbinom(1000, 1, pr2))
 
library(riskRegression)
lrfit1 <- glm(y ~ x, data = dat1, family = 'binomial')
IPA(lrfit1, newdata = newdat1)
#    Variable    Brier          IPA    IPA.gain
# 1 Null model 0.1984710  0.000000e+00 -0.003160010
# 2 Full model 0.1990982 -3.160010e-03  0.000000000
# 3          x 0.1984800 -4.534668e-05 -0.003114664
1 - 0.1990982/0.1984710
# [1] -0.003160159
 
lrfit2 <- glm(y ~ x, family = 'binomial')
IPA(lrfit2, newdata = newdat1)
#    Variable    Brier      IPA    IPA.gain
# 1 Null model 0.1984710  0.000000 -1.859333763
# 2 Full model 0.5674948 -1.859334  0.000000000
# 3          x 0.5669200 -1.856437 -0.002896299
1 - 0.5674948/0.1984710
# [1] -1.859334
</syntaxhighlight> From the simulated data, we see IPA = -3.16e-3 for a calibrated model and IPA = -1.86 for a severely miscalibrated model.
 
= ROC curve =
See [[ROC|ROC]].
 
= [https://en.wikipedia.org/wiki/Net_reclassification_improvement NRI] (Net reclassification improvement) =
 
= Maximum likelihood =
[http://stats.stackexchange.com/questions/622/what-is-the-difference-between-a-partial-likelihood-profile-likelihood-and-marg Difference of partial likelihood, profile likelihood and marginal likelihood]
 
= Generalized Linear Model =
Lectures from a course in [http://people.stat.sfu.ca/~raltman/stat851.html Simon Fraser University Statistics].
 
[https://petolau.github.io/Analyzing-double-seasonal-time-series-with-GAM-in-R/ Doing magic and analyzing seasonal time series with GAM (Generalized Additive Model) in R]
 
== Link function ==
[http://www.win-vector.com/blog/2019/07/link-functions-versus-data-transforms/ Link Functions versus Data Transforms]
 
== Quasi Likelihood ==
Quasi-likelihood is like log-likelihood. The quasi-score function (first derivative of quasi-likelihood function) is the estimating equation.
 
* [http://www.stat.uchicago.edu/~pmcc/pubs/paper6.pdf Original paper] by Peter McCullagh.
* [http://people.stat.sfu.ca/~raltman/stat851/851L20.pdf Lecture 20] from SFU.
* [http://courses.washington.edu/b571/lectures/notes131-181.pdf U. Washington] and  [http://faculty.washington.edu/heagerty/Courses/b571/handouts/OverdispQL.pdf another lecture] focuses on overdispersion.
* [http://www.maths.usyd.edu.au/u/jchan/GLM/QuasiLikelihood.pdf This lecture] contains a table of quasi likelihood from common distributions.
 
== Plot ==
https://strengejacke.wordpress.com/2015/02/05/sjplot-package-and-related-online-manuals-updated-rstats-ggplot/
 
== [https://en.wikipedia.org/wiki/Deviance_(statistics) Deviance], stats::deviance() and glmnet::deviance.glmnet() from R ==
* '''It is a generalization of the idea of using the sum of squares of residuals (RSS) in ordinary least squares''' to cases where model-fitting is achieved by maximum likelihood. See [https://stats.stackexchange.com/questions/6581/what-is-deviance-specifically-in-cart-rpart What is Deviance? (specifically in CART/rpart)] to manually compute deviance and compare it with the returned value of the '''deviance()''' function from a linear regression. Summary: deviance() = RSS in linear models.
* https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/deviance
* Likelihood ratio tests and the deviance http://data.princeton.edu/wws509/notes/a2.pdf#page=6
* Deviance(y,muhat) = 2*(loglik_saturated - loglik_proposed)
* [https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r Interpreting Residual and Null Deviance in GLM R]
** Null Deviance = 2(LL(Saturated Model) - LL(Null Model)) on df = df_Sat - df_Null. The '''null deviance''' shows how well the response variable is predicted by a model that includes only the intercept (grand mean).
** '''Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model)) = <math>2(LL(y|y) - LL(\hat{\mu}|y))</math>, df = df_Sat - df_Proposed=n-p'''. ==> deviance() has returned.
** Null deviance > Residual deviance. Null deviance df = n-1. Residual deviance df = n-p.
<syntaxhighlight lang='rsplus'>
## an example with offsets from Venables & Ripley (2002, p.189)
utils::data(anorexia, package = "MASS")
 
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
                family = gaussian, data = anorexia)
summary(anorex.1)
 
# Call:
#   glm(formula = Postwt ~ Prewt + Treat + offset(Prewt), family = gaussian,
#      data = anorexia)
#
# Deviance Residuals:
#  Min        1Q    Median        3Q      Max 
# -14.1083  -4.2773  -0.5484    5.4838  15.2922 
#
# Coefficients:
#  Estimate Std. Error t value Pr(>|t|)   
# (Intercept)  49.7711    13.3910  3.717 0.000410 ***
#  Prewt        -0.5655    0.1612  -3.509 0.000803 ***
#  TreatCont    -4.0971    1.8935  -2.164 0.033999 * 
#  TreatFT      4.5631    2.1333  2.139 0.036035 * 
#  ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 48.69504)
#
# Null deviance: 4525.4  on 71  degrees of freedom
# Residual deviance: 3311.3  on 68  degrees of freedom
# AIC: 489.97
#
# Number of Fisher Scoring iterations: 2
 
deviance(anorex.1)
# [1] 3311.263
</syntaxhighlight>
* In glmnet package. The deviance is defined to be 2*(loglike_sat - loglike), where loglike_sat is the log-likelihood for the saturated model (a model with a free parameter per observation). Null deviance is defined to be 2*(loglike_sat -loglike(Null)); The NULL model refers to the intercept model, except for the Cox, where it is the 0 model. Hence dev.ratio=1-deviance/nulldev, and this deviance method returns (1-dev.ratio)*nulldev.
** [https://stats.stackexchange.com/questions/134694/what-deviance-is-glmnet-using-to-compare-values-of-lambda What deviance is glmnet using to compare values of λ?]
<syntaxhighlight lang='rsplus'>
x=matrix(rnorm(100*2),100,2)
y=rnorm(100)
fit1=glmnet(x,y)
deviance(fit1)  # one for each lambda
#  [1] 98.83277 98.53893 98.29499 98.09246 97.92432 97.78472 97.66883
#  [8] 97.57261 97.49273 97.41327 97.29855 97.20332 97.12425 97.05861
# ...
# [57] 96.73772 96.73770
fit2 <- glmnet(x, y, lambda=.1) # fix lambda
deviance(fit2)
# [1] 98.10212
deviance(glm(y ~ x))
# [1] 96.73762
sum(residuals(glm(y ~ x))^2)
# [1] 96.73762
</syntaxhighlight>
</syntaxhighlight>


== Saturated model ==
From the documentation of [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/fisher.test.html fisher.test]
* The saturated model always has n parameters where n is the sample size.
<pre>
* [https://stats.stackexchange.com/questions/114073/logistic-regression-how-to-obtain-a-saturated-model Logistic Regression : How to obtain a saturated model]
Usage:
 
    fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
= Simulate data =
                control = list(), or = 1, alternative = "two.sided",
== Density plot ==
                conf.int = TRUE, conf.level = 0.95,
<syntaxhighlight lang='rsplus'>
                simulate.p.value = FALSE, B = 2000)
# plot a Weibull distribution with shape and scale
</pre>
func <- function(x) dweibull(x, shape = 1, scale = 3.38)
* For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.
curve(func, .1, 10)
* For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
 
* The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.
func <- function(x) dweibull(x, shape = 1.1, scale = 3.38)
* Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
curve(func, .1, 10)
</syntaxhighlight>
 
The shape parameter plays a role on the shape of the density function and the failure rate.
 
* Shape <=1: density is convex, not a hat shape.  
* Shape =1: failure rate (hazard function) is constant. [https://en.wikipedia.org/wiki/Exponential_distribution Exponential distribution].
* Shape >1: failure rate increases with time
 
== Simulate data from a specified density ==
* http://stackoverflow.com/questions/16134786/simulate-data-from-non-standard-density-function
 
== Signal to noise ratio ==
* https://en.wikipedia.org/wiki/Signal-to-noise_ratio
* https://stats.stackexchange.com/questions/31158/how-to-simulate-signal-noise-ratio
: <math>\frac{\sigma^2_{signal}}{\sigma^2_{noise}} = \frac{Var(f(X))}{Var(e)} </math> if Y = f(X) + e
* Page 401 of ESLII (https://web.stanford.edu/~hastie/ElemStatLearn//) 12th print.
 
Some examples of signal to noise ratio
* ESLII_print12.pdf: .64, 5, 4
* Yuan and Lin 2006: 1.8, 3
* [https://academic.oup.com/biostatistics/article/19/3/263/4093306#123138354 A framework for estimating and testing qualitative interactions with applications to predictive biomarkers] Roth, Biostatistics, 2018
 
== Effect size, Cohen's d and volcano plot ==
* https://en.wikipedia.org/wiki/Effect_size (See also the estimation by the [[#Two_sample_test_assuming_equal_variance|pooled sd]])
 
: <math>\theta = \frac{\mu_1 - \mu_2} \sigma,</math>
 
* [https://learningstatisticswithr.com/book/hypothesistesting.html#effectsize Effect size, sample size and power] from Learning statistics with R: A tutorial for psychology students and other beginners.
* [https://en.wikipedia.org/wiki/Effect_size#t-test_for_mean_difference_between_two_independent_groups t-statistic and Cohen's d] for the case of mean difference between two independent groups
* [http://www.win-vector.com/blog/2019/06/cohens-d-for-experimental-planning/ Cohen’s D for Experimental Planning]
* [https://en.wikipedia.org/wiki/Volcano_plot_(statistics) Volcano plot]
** Y-axis: -log(p)
** X-axis: log2 fold change OR effect size (Cohen's D). [https://twitter.com/biobenkj/status/1072141825568329728 An example] from RNA-Seq data.
 
= Multiple comparisons =
* If you perform experiments over and over, you's bound to find something. So significance level must be adjusted down when performing multiple hypothesis tests.
* http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf
* Book 'Multiple Comparison Using R' by Bretz, Hothorn and Westfall, 2011.
* [http://varianceexplained.org/statistics/interpreting-pvalue-histogram/ Plot a histogram of p-values], a post from varianceexplained.org. The anti-conservative histogram (tail on the RHS) is what we have typically seen in e.g. microarray gene expression data.
* [http://statistic-on-air.blogspot.com/2015/01/adjustment-for-multiple-comparison.html Comparison of different ways of multiple-comparison] in R.
 
Take an example, Suppose 550 out of 10,000 genes are significant at .05 level
# P-value < .05 ==> Expect .05*10,000=500 false positives
# False discovery rate < .05 ==> Expect .05*550 =27.5 false positives
# Family wise error rate < .05 ==> The probablity of at least 1 false positive <.05
 
According to [https://www.cancer.org/cancer/cancer-basics/lifetime-probability-of-developing-or-dying-from-cancer.html Lifetime Risk of Developing or Dying From Cancer], there is a 39.7% risk of developing a cancer for male during his lifetime (in other words, 1 out of every 2.52 men in US will develop some kind of cancer during his lifetime) and 37.6% for female. So the probability of getting at least one cancer patient in a 3-generation family is 1-.6**3 - .63**3 = 0.95.
 
== False Discovery Rate ==
* https://en.wikipedia.org/wiki/False_discovery_rate
* Paper [http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf Definition] by Benjamini and Hochberg in JRSS B 1995.
* A [http://xkcd.com/882/ comic]
* P-value vs false discovery rate vs family wise error rate. See [http://jtleek.com/talks 10 statistics tip] or [http://www.biostat.jhsph.edu/~jleek/teaching/2011/genomics/mt140688.pdf#page=14 Statistics for Genomics (140.688)] from Jeff Leek. Suppose 550 out of 10,000 genes are significant at .05 level
** P-value < .05 implies expecting .05*10000 = 500 false positives
** False discovery rate < .05 implies expecting .05*550 = 27.5 false positives
** Family wise error rate (P (# of false positives ≥ 1)) < .05. See [https://riffyn.com/riffyn-blog/2017/10/29/family-wise-error-rate Understanding Family-Wise Error Rate]
* [http://www.pnas.org/content/100/16/9440.full Statistical significance for genomewide studies] by Storey and Tibshirani.
* [http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/ What’s the probability that a significant p-value indicates a true effect?]
* http://onetipperday.sterding.com/2015/12/my-note-on-multiple-testing.html
* [https://www.biorxiv.org/content/early/2018/10/31/458786 A practical guide to methods controlling false discoveries in computational biology] by Korthauer, et al 2018, [https://rdcu.be/bFEt2 BMC Genome Biology] 2019
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz191/5380770 onlineFDR]: an R package to control the false discovery rate for growing data repositories
 
Suppose <math>p_1 \leq p_2 \leq ... \leq p_n</math>. Then
: <math>
\text{FDR}_i = \text{min}(1, n* p_i/i)
</math>.
So if the number of tests (<math>n</math>) is large and/or the original p value (<math>p_i</math>) is large, then FDR can hit the value 1.
 
However, the simple formula above does not guarantee the monotonicity property from the FDR. So the calculation in R is more complicated. See [https://stackoverflow.com/questions/29992944/how-does-r-calculate-the-false-discovery-rate How Does R Calculate the False Discovery Rate].
 
Below is the histograms of p-values and FDR (BH adjusted) from a real data (Pomeroy in BRB-ArrayTools).
 
[[:File:Hist bh.svg]]
 
And the next is a scatterplot w/ histograms on the margins from a null data.
 
[[:File:Scatterhist.svg]]
 
== q-value ==
q-value is defined as the minimum FDR that can be attained when calling that '''feature''' significant (i.e., expected proportion of false positives incurred when calling that feature significant).
 
If gene X has a q-value of 0.013 it means that 1.3% of genes that show p-values at least as small as gene X are false positives.


== SAM/Significance Analysis of Microarrays ==
== Chi-square independence test ==
The percentile option is used to define the number of falsely called genes based on 'B' permutations. If we use the 90-th percentile, the number of significant genes will be less than if we use the 50-th percentile/median.
[https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Exploring the underlying theory of the chi-square test through simulation - part 2]


In BRCA dataset, using the 90-th percentile will get 29 genes vs 183 genes if we use median.
== [https://en.wikipedia.org/wiki/Gene_set_enrichment_analysis GSEA] ==
Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states


== Multivariate permutation test ==
* https://www.bioconductor.org/help/course-materials/2015/SeattleApr2015/E_GeneSetEnrichment.html
In BRCA dataset, using 80% confidence gives 116 genes vs 237 genes if we use 50% confidence (assuming maximum proportion of false discoveries is 10%). The method is published on [http://www.sciencedirect.com/science/article/pii/S0378375803002118 EL Korn, JF Troendle, LM McShane and R Simon, ''Controlling the number of false discoveries: Application to high dimensional genomic data'', Journal of Statistical Planning and Inference, vol 124, 379-398 (2004)].
* http://software.broadinstitute.org/gsea/index.jsp
* [http://www.biorxiv.org/content/biorxiv/early/2017/09/08/186288.full.pdf Statistical power of gene-set enrichment analysis is a function of gene set correlation structure] by SWANSON 2017
* [https://www.biorxiv.org/content/10.1101/674267v1 Towards a gold standard for benchmarking gene set enrichment analysis], [http://bioconductor.org/packages/release/bioc/html/GSEABenchmarkeR.html GSEABenchmarkeR] package


== String Permutations Algorithm ==
Two categories of GSEA procedures:
https://youtu.be/nYFd7VHKyWQ
* Competitive:  compare genes in the test set relative to all other genes.
* Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by [http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/MeV/manual/gsea.html Jiang & Gentleman Bioinformatics 2007]


== Empirical Bayes Normal Means Problem with Correlated Noise ==
== McNemar’s test on paired nominal data ==
[https://arxiv.org/abs/1812.07488 Solving the Empirical Bayes Normal Means Problem with Correlated Noise] Sun 2018
https://en.wikipedia.org/wiki/McNemar%27s_test


The package [https://github.com/LSun/cashr cashr] and the [https://github.com/LSun/cashr_paper source code of the paper]
= Confidence vs Credibility Intervals =
http://freakonometrics.hypotheses.org/18117


= Bayes =
= Power analysis/Sample Size determination =
== Bayes factor ==
See [[Power|Power]].
* http://www.nicebread.de/what-does-a-bayes-factor-feel-like/


== Empirical Bayes method ==
= Common covariance/correlation structures =
* http://en.wikipedia.org/wiki/Empirical_Bayes_method
See [https://onlinecourses.science.psu.edu/stat502/node/228 psu.edu]. Assume covariance <math>\Sigma = (\sigma_{ij})_{p\times p} </math>
* [http://varianceexplained.org/r/empirical-bayes-book/ Introduction to Empirical Bayes: Examples from Baseball Statistics]


== Naive Bayes classifier ==
* Diagonal structure: <math>\sigma_{ij} = 0</math> if <math>i \neq j</math>.
[http://r-posts.com/understanding-naive-bayes-classifier-using-r/ Understanding Naïve Bayes Classifier Using R]
* Compound symmetry: <math>\sigma_{ij} = \rho</math> if <math>i \neq j</math>.
 
* First-order autoregressive AR(1) structure: <math>\sigma_{ij} = \rho^{|i - j|}</math>. <syntaxhighlight lang='rsplus'>
== MCMC ==
rho <- .8
[https://stablemarkets.wordpress.com/2018/03/16/speeding-up-metropolis-hastings-with-rcpp/ Speeding up Metropolis-Hastings with Rcpp]
p <- 5
 
blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))
= offset() function =
* An '''offset''' is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient.
* https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/offset
 
== Offset in Poisson regression ==
* http://rfunction.com/archives/223
* https://stats.stackexchange.com/questions/11182/when-to-use-an-offset-in-a-poisson-regression
 
# We need to model '''rates''' instead of '''counts'''
# More generally, you use offsets because the '''units''' of observation are different in some dimension (different populations, different geographic sizes) and the outcome is proportional to that dimension.
 
An example from [http://rfunction.com/archives/223 here]
<syntaxhighlight lang='rsplus'>
<- c(15,  7, 36,  4, 16, 12, 41, 15)
<- c(4949, 3534, 12210, 344, 6178, 4883, 11256, 7125)
x1 <- c(-0.1, 0, 0.2, 0, 1, 1.1, 1.1, 1)
x2 <- c(2.2, 1.5, 4.5, 7.2, 4.5, 3.2, 9.1, 5.2)
 
glm(Y ~ offset(log(N)) + (x1 + x2), family=poisson) # two variables
# Coefficients:
# (Intercept)          x1          x2
#    -6.172      -0.380        0.109
#
# Degrees of Freedom: 7 Total (i.e. Null);  5 Residual
# Null Deviance:     10.56
# Residual Deviance: 4.559 AIC: 46.69
glm(Y ~ offset(log(N)) + I(x1+x2), family=poisson)  # one variable
# Coefficients:
# (Intercept)   I(x1 + x2)
#  -6.12652      0.04746
#
# Degrees of Freedom: 7 Total (i.e. Null);  6 Residual
# Null Deviance:     10.56
# Residual Deviance: 8.001 AIC: 48.13
</syntaxhighlight>
</syntaxhighlight>
* Banded matrix: <math>\sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0</math> and <math>\sigma_{ij}=0</math> for <math>|i-j| \ge 3</math>.
* Spatial Power
* Unstructured Covariance
* [https://en.wikipedia.org/wiki/Toeplitz_matrix Toeplitz structure]


== Offset in Cox regression ==
To create blocks of correlation matrix, use the "%x%" operator. See [https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/kronecker kronecker()].
An example from [https://github.com/cran/biospear/blob/master/R/PCAlasso.R biospear::PCAlasso()]
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='rsplus'>
coxph(Surv(time, status) ~ offset(off.All), data = data)
covMat <- diag(n.blocks) %x% blockMat
# Call:  coxph(formula = Surv(time, status) ~ offset(off.All), data = data)
#
# Null model
#  log likelihood= -2391.736
#  n= 500
 
# versus without using offset()
coxph(Surv(time, status) ~ off.All, data = data)
# Call:
# coxph(formula = Surv(time, status) ~ off.All, data = data)
#
#          coef exp(coef) se(coef)    z    p
# off.All 0.485    1.624    0.658 0.74 0.46
#
# Likelihood ratio test=0.54  on 1 df, p=0.5
# n= 500, number of events= 438
coxph(Surv(time, status) ~ off.All, data = data)$loglik
# [1] -2391.702 -2391.430    # initial coef estimate, final coef
</syntaxhighlight>
</syntaxhighlight>


== Offset in linear regression ==
= Counter/Special Examples =
* https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/lm
== Correlated does not imply independence ==
* https://stackoverflow.com/questions/16920628/use-of-offset-in-lm-regression-r
Suppose X is a normally-distributed random variable with zero mean. Let Y = X^2. Clearly X and Y are not independent: if you know X, you also know Y. And if you know Y, you know the absolute value of X.


= Overdispersion =
The covariance of X and Y is
https://en.wikipedia.org/wiki/Overdispersion
<pre>
  Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3)
          = 0,
</pre>
because the distribution of X is symmetric around zero.  Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet
have (linear) correlation r(X,Y) = 0.


Var(Y) = phi * E(Y). If phi > 1, then it is overdispersion relative to Poisson. If phi <1, we have under-dispersion (rare).
This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.


== Heterogeneity ==
== Spearman vs Pearson correlation ==
The Poisson model fit is not good; residual deviance/df >> 1. The lack of fit maybe due to missing data, covariates or overdispersion.
Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation


Subjects within each covariate combination still differ greatly.
<pre>
 
x=(1:100); 
*https://onlinecourses.science.psu.edu/stat504/node/169.
y=exp(x);                       
* https://onlinecourses.science.psu.edu/stat504/node/162
cor(x,y, method='spearman') # 1
 
cor(x,y, method='pearson')  # .25
Consider Quasi-Poisson or negative binomial.
</pre>
 
== Test of overdispersion or underdispersion in Poisson models ==
https://stats.stackexchange.com/questions/66586/is-there-a-test-to-determine-whether-glm-overdispersion-is-significant


== Negative Binomial ==
== Spearman vs Wilcoxon ==
The mean of the Poisson distribution can itself be thought of as a random variable drawn from the gamma distribution thereby introducing an additional free parameter.
By [http://www.talkstats.com/threads/wilcoxon-signed-rank-test-or-spearmans-rho.42395/ this post]
* Wilcoxon used to compare categorical versus non-normal continuous variable
* Spearman's rho used to compare two continuous (including '''ordinal''') variables that one or both aren't normally distributed


== Binomial ==
== Spearman vs [https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Kendall correlation] ==
* [https://www.rdatagen.net/post/overdispersed-binomial-data/ Generating and modeling over-dispersed binomial data]
* Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the '''ordinal''' association between two measured quantities.
* [https://cran.r-project.org/web/packages/simstudy/index.html simstudy] package. The final data sets can represent data from '''randomized control trials''', '''repeated measure (longitudinal) designs''', and cluster randomized trials. Missingness can be generated using various mechanisms (MCAR, MAR, NMAR). [https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/ Analyzing a binary outcome arising out of within-cluster, pair-matched randomization].
* [https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho Kendall Tau or Spearman's rho?]


= Count data =
== [http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe quartet] ==
== Zero counts ==
* [https://doi.org/10.1080/00031305.2018.1444673 A Method to Handle Zero Counts in the Multinomial Model]


== Bias ==
Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.  
[https://amstat.tandfonline.com/doi/full/10.1080/00031305.2018.1564699 Bias in Small-Sample Inference With Count-Data Models] Blackburn 2019


= Survival data analysis =
[[:File:Anscombe quartet 3.svg]]
See [[Survival_data|Survival data analysis]]


= Logistic regression =
== The real meaning of spurious correlations ==
 
https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/
== Simulate binary data from the logistic model ==
https://stats.stackexchange.com/questions/46523/how-to-simulate-artificial-data-for-logistic-regression
<syntaxhighlight lang='rsplus'>
<syntaxhighlight lang='rsplus'>
set.seed(666)
library(ggplot2)
x1 = rnorm(1000)          # some continuous variables
x2 = rnorm(1000)
z = 1 + 2*x1 + 3*x2        # linear combination with a bias
pr = 1/(1+exp(-z))        # pass through an inv-logit function
y = rbinom(1000,1,pr)      # bernoulli response variable
   
   
#now feed it to glm:
set.seed(123)
df = data.frame(y=y,x1=x1,x2=x2)
spurious_data <- data.frame(x = rnorm(500, 10, 1),
glm( y~x1+x2,data=df,family="binomial")
                            y = rnorm(500, 10, 1),
</syntaxhighlight>
                            z = rnorm(500, 30, 3))
cor(spurious_data$x, spurious_data$y)
# [1] -0.05943856
spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) +
theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")


== Building a Logistic Regression model from scratch ==
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
https://www.analyticsvidhya.com/blog/2015/10/basics-logistic-regression
# [1] 0.4517972
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
theme_bw() + geom_smooth(method = "lm") +
scale_color_gradientn(colours = c("red", "white", "blue")) +
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")


== Odds ratio ==
spurious_data$z <- rnorm(500, 30, 6)
Calculate the odds ratio from the coefficient estimates; see [https://stats.stackexchange.com/questions/8661/logistic-regression-in-r-odds-ratio this post].
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
<syntaxhighlight lang='rsplus'>
# [1] 0.8424597
require(MASS)
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
N  <- 100              # generate some data
theme_bw() + geom_smooth(method = "lm") +
X1 <- rnorm(N, 175, 7)
scale_color_gradientn(colours = c("red", "white", "blue")) +
X2 <- rnorm(N,  30, 8)
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")
X3 <- abs(rnorm(N, 60, 30))
</syntaxhighlight>
Y  <- 0.5*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 12)


# dichotomize Y and do logistic regression
= Time series =
Yfac  <- cut(Y, breaks=c(-Inf, median(Y), Inf), labels=c("lo", "hi"))
* [http://ellisp.github.io/blog/2016/12/07/arima-prediction-intervals Why time series forecasts prediction intervals aren't as good as we'd hope]
glmFit <- glm(Yfac ~ X1 + X2 + X3, family=binomial(link="logit"))


exp(cbind(coef(glmFit), confint(glmFit))) 
== Structural change ==
</syntaxhighlight>
[https://datascienceplus.com/structural-changes-in-global-warming/ Structural Changes in Global Warming]
 
== AUC ==
[https://hopstat.wordpress.com/2014/12/19/a-small-introduction-to-the-rocr-package/ A small introduction to the ROCR package]
<pre>
      predict.glm()            ROCR::prediction()    ROCR::performance()
glmobj ------------> predictTest -----------------> ROCPPred ---------> AUC
newdata                labels
</pre>


= Medical applications =
== AR(1) processes and random walks ==
== Subgroup analysis ==
[https://fdabl.github.io/r/Spurious-Correlation.html Spurious correlations and random walks]
Other related keywords: recursive partitioning, randomized clinical trials (RCT)


* [https://www.rdatagen.net/post/sub-group-analysis-in-rct/ Thinking about different ways to analyze sub-groups in an RCT]
= Measurement Error model =
* [http://onlinelibrary.wiley.com/doi/10.1002/sim.7064/full Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials] I Lipkovich, A Dmitrienko - Statistics in medicine, 2017
* [https://en.wikipedia.org/wiki/Errors-in-variables_models Errors-in-variables models/errors-in-variables models or measurement error models]
* Personalized medicine:Four perspectives of tailored medicine SJ Ruberg, L Shen - Statistics in Biopharmaceutical Research, 2015
* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13112 Simulation‐‐Selection‐‐Extrapolation: Estimation in High‐‐Dimensional Errors‐‐in‐‐Variables Models] Nghiem 2019
* Berger, J. O., Wang, X., and Shen, L. (2014), “A Bayesian Approach to Subgroup Identification,” Journal of Biopharmaceutical Statistics, 24, 110–129.
* [https://rpsychologist.com/treatment-response-subgroup Change over time is not "treatment response"]


== Interaction analysis ==
= Dictionary =
* Goal: '''assessing the predictiveness of biomarkers''' by testing their '''interaction (strength) with the treatment'''.  
* '''Prognosis''' is the probability that an event or diagnosis will result in a particular outcome.
* [https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.7608 Evaluation of biomarkers for treatment selection usingindividual participant data from multiple clinical trials] Kang et al 2018
** For example, on the paper [http://clincancerres.aacrjournals.org/content/18/21/6065.figures-only Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine] by Matsui 2012, the prognostic score .1 (0.9) represents a '''good (poor)''' prognosis.
* http://www.stat.purdue.edu/~ghobbs/STAT_512/Lecture_Notes/ANOVA/Topic_27.pdf#page=15. For survival data, y-axis is the survival time and B1=treatment, B2=control and X-axis is treatment-effect modifying score. But as seen on [http://www.stat.purdue.edu/~ghobbs/STAT_512/Lecture_Notes/ANOVA/Topic_27.pdf#page=16 page16], the effects may not be separated.
** Prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis. See [https://en.wikipedia.org/wiki/Survival_rate Survival rate] in wikipedia.
* [http://onlinelibrary.wiley.com/doi/10.1002/bimj.201500234/full Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces] N Ternès, F Rotolo, G Heinze, S Michiels - Biometrical Journal, 2017
* [https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.6564 Designing a study to evaluate the benefitof a biomarker for selectingpatient treatment] Janes 2015
* [https://onlinelibrary.wiley.com/doi/epdf/10.1002/pst.1728 A visualization method measuring theperformance of biomarkers for guidingtreatment decisions] Yang et al 2015. Predictiveness curves were used a lot.
* [https://onlinelibrary.wiley.com/doi/epdf/10.1111/biom.12191 Combining Biomarkers to Optimize Patient TreatmentRecommendations] Kang et al 2014. Several simulations are conducted.
* [https://www.ncbi.nlm.nih.gov/pubmed/24695044 An approach to evaluating and comparing biomarkers for patient treatment selection] Janes et al 2014
* [http://journals.sagepub.com/doi/pdf/10.1177/0272989X13493147 A Framework for Evaluating Markers Used to Select Patient Treatment] Janes et al 2014
* Tian, L., Alizaden, A. A., Gentles, A. J., and Tibshirani, R. (2014) “A Simple Method for Detecting Interactions Between a Treatment and a Large Number of Covariates,” and the [https://books.google.com/books?hl=en&lr=&id=2gG3CgAAQBAJ&oi=fnd&pg=PA79&ots=y5LqF3vk-T&sig=r2oaOxf9gcjK-1bvFHVyfvwscP8#v=onepage&q&f=true book chapter].
* [https://biostats.bepress.com/cgi/viewcontent.cgi?article=1228&context=uwbiostat Statistical Methods for Evaluating and Comparing Biomarkers for Patient Treatment Selection] Janes et al 2013
* [https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1541-0420.2011.01722.x Assessing Treatment-Selection Markers using a Potential Outcomes Framework] Huang et al 2012
* [https://biostats.bepress.com/cgi/viewcontent.cgi?article=1223&context=uwbiostat Methods for Evaluating Prediction Performance of Biomarkers and Tests] Pepe et al 2012
* Measuring the performance of markers for guiding treatment decisions by Janes, et al 2011. <syntaxhighlight lang='rsplus'>
cf <- c(2, 1, .5, 0)
f1 <- function(x) { z <- cf[1] + cf[3] + (cf[2]+cf[4])*x; 1/ (1 + exp(-z)) }
f0 <- function(x) { z <- cf[1] + cf[2]*x; 1/ (1 + exp(-z)) }
par(mfrow=c(1,3))
curve(f1, -3, 3, col = 'red', ylim = c(0, 1),
      ylab = '5-year DFS Rate', xlab = 'Marker A/D Value',
      main = 'Predictiveness Curve', lwd = 2)
curve(f0, -3, 3, col = 'black', ylim = c(0, 1),
      xlab = '', ylab = '', lwd = 2, add = TRUE)
legend(.5, .4, c("control", "treatment"),
      col = c("black", "red"), lwd = 2)


cf <- c(.1, 1, -.1, .5)
= Data =
curve(f1, -3, 3, col = 'red', ylim = c(0, 1),
== Eleven quick tips for finding research data ==
      ylab = '5-year DFS Rate', xlab = 'Marker G Value',
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038
      main = 'Predictiveness Curve', lwd = 2)
curve(f0, -3, 3, col = 'black', ylim = c(0, 1),
      xlab = '', ylab = '', lwd = 2, add = TRUE)
legend(.5, .4, c("control", "treatment"),
      col = c("black", "red"), lwd = 2)
abline(v= - cf[3]/cf[4], lty = 2)


cf <- c(1, -1, 1, 2)
= Books =
curve(f1, -3, 3, col = 'red', ylim = c(0, 1),
* [https://leanpub.com/biostatmethods Methods in Biostatistics with R] ($)
      ylab = '5-year DFS Rate', xlab = 'Marker B Value',
* [http://web.stanford.edu/class/bios221/book/ Modern Statistics for Modern Biology] (free)
      main = 'Predictiveness Curve', lwd = 2)
* Principles of Applied Statistics, by David Cox & Christl Donnelly
curve(f0, -3, 3, col = 'black', ylim = c(0, 1),
* [https://www.amazon.com/Freedman-Robert-Pisani-Statistics-Hardcover/dp/B004QNRMDK/ Statistics] by David Freedman,Robert Pisani, Roger Purves
      xlab = '', ylab = '', lwd = 2, add = TRUE)
* [https://onlinelibrary.wiley.com/topic/browse/000113 Wiley Online Library -> Statistics] (Access by NIH Library)
legend(.5, .85, c("control", "treatment"),
* [https://web.stanford.edu/~hastie/CASI/ Computer Age Statistical Inference: Algorithms, Evidence and Data Science] by Efron and Hastie 2016
      col = c("black", "red"), lwd = 2)
abline(v= - cf[3]/cf[4], lty = 2)
</syntaxhighlight> [[:File:PredcurveLogit.svg]]
* [https://www.degruyter.com/downloadpdf/j/ijb.2014.10.issue-1/ijb-2012-0052/ijb-2012-0052.pdf An Approach to Evaluating and Comparing Biomarkers for Patient Treatment Selection] The International Journal of Biostatistics by Janes, 2014. Y-axis is risk given marker, not P(T > t0|X). Good details.
* Gunter, L., Zhu, J., and Murphy, S. (2011), “Variable Selection for Qualitative Interactions in Personalized Medicine While Controlling the Family-Wise Error Rate,” Journal of Biopharmaceutical Statistics, 21, 1063–1078.


= Statistical Learning =
= Social =
* [http://statweb.stanford.edu/~tibs/ElemStatLearn/ Elements of Statistical Learning] Book homepage
== JSM ==
* [http://heather.cs.ucdavis.edu/draftregclass.pdf From Linear Models to Machine Learning] by Norman Matloff
* 2019
* [http://www.kdnuggets.com/2017/04/10-free-must-read-books-machine-learning-data-science.html 10 Free Must-Read Books for Machine Learning and Data Science]
** [https://minecr.shinyapps.io/jsm2019-schedule/ JSM 2019] and the [http://www.citizen-statistician.org/2019/07/shiny-for-jsm-2019/ post].
* [https://towardsdatascience.com/the-10-statistical-techniques-data-scientists-need-to-master-1ef6dbd531f7 10 Statistical Techniques Data Scientists Need to Master]
** [https://rviews.rstudio.com/2019/07/19/an-r-users-guide-to-jsm-2019/ An R Users Guide to JSM 2019]
*# Linear regression
*# Classification: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis
*# Resampling methods: Bootstrapping and Cross-Validation
*# Subset selection: Best-Subset Selection, Forward Stepwise Selection, Backward Stepwise Selection, Hybrid Methods
*# Shrinkage/regularization: Ridge regression, Lasso
*# Dimension reduction: Principal Components Regression, Partial least squares
*# Nonlinear models: Piecewise function, Spline, generalized additive model
*# Tree-based methods: Bagging, Boosting, Random Forest
*# Support vector machine
*# Unsupervised learning: PCA, k-means, Hierarchical
* [https://www.listendata.com/2018/03/regression-analysis.html?m=1 15 Types of Regression you should know]


== LDA (Fisher's linear discriminant), QDA ==
== Following ==
* https://en.wikipedia.org/wiki/Linear_discriminant_analysis
* [http://jtleek.com/ Jeff Leek], https://twitter.com/jtleek
* [https://datascienceplus.com/how-to-perform-logistic-regression-lda-qda-in-r/ How to perform Logistic Regression, LDA, & QDA in R]
* Revolutions, http://blog.revolutionanalytics.com/
* [http://r-posts.com/discriminant-analysis-statistics-all-the-way/ Discriminant Analysis: Statistics All The Way]
* RStudio Blog, https://blog.rstudio.com/
* [https://onlinelibrary.wiley.com/doi/10.1111/biom.13065 Multiclass linear discriminant analysis with ultrahigh‐dimensional features] Li 2019
* Sean Davis, https://twitter.com/seandavis12, https://github.com/seandavi
* [http://stephenturner.us/post/ Stephen Turner], https://twitter.com/genetics_blog

Revision as of 22:35, 7 September 2019

Use

When entropy was applied to the variable selection, we want to select a class variable which gives a largest entropy difference between without any class variable (compute entropy using response only) and with that class variable (entropy is computed by adding entropy in each class level) because this variable is most discriminative and it gives most information gain. For example,

  • entropy (without any class)=.94,
  • entropy(var 1) = .69,
  • entropy(var 2)=.91,
  • entropy(var 3)=.725.

We will choose variable 1 since it gives the largest gain (.94 - .69) compared to the other variables (.94 -.91, .94 -.725).

Why is picking the attribute with the most information gain beneficial? It reduces entropy, which increases predictability. A decrease in entropy signifies an decrease in unpredictability, which also means an increase in predictability.

Consider a split of a continuous variable. Where should we cut the continuous variable to create a binary partition with the highest gain? Suppose cut point c1 creates an entropy .9 and another cut point c2 creates an entropy .1. We should choose c2.

Related

In addition to information gain, gini (dʒiːni) index is another metric used in decision tree. See wikipedia page about decision tree learning.

Ensembles

Bagging

Draw N bootstrap samples and summary the results (averaging for regression problem, majority vote for classification problem). Decrease variance without changing bias. Not help much with underfit or high bias models.

Random forest

Variance importance: if you scramble the values of a variable, and the accuracy of your tree does not change much, then the variable is not very important.

Why is it useful to compute variance importance? So the model's predictions are easier to interpret (not improve the prediction performance).

Random forest has advantages of easier to run in parallel and suitable for small n large p problems.

Random forest versus logistic regression: a large-scale benchmark experiment by Raphael Couronné, BMC Bioinformatics 2018

Arborist: Parallelized, Extensible Random Forests

On what to permute in test-based approaches for variable importance measures in Random Forests

Boosting

Instead of selecting data points randomly with the boostrap, it favors the misclassified points.

Algorithm:

  • Initialize the weights
  • Repeat
    • resample with respect to weights
    • retrain the model
    • recompute weights

Since boosting requires computation in iterative and bagging can be run in parallel, bagging has an advantage over boosting when the data is very large.

Time series

Ensemble learning for time series forecasting in R

p-values

p-values

Distribution of p values in medical abstracts

nominal p-value and Empirical p-values

  • Nominal p-values are based on asymptotic null distributions
  • Empirical p-values are computed from simulations/permutations

(nominal) alpha level

Conventional methodology for statistical testing is, in advance of undertaking the test, to set a NOMINAL ALPHA CRITERION LEVEL (often 0.05). The outcome is classified as showing STATISTICAL SIGNIFICANCE if the actual ALPHA (probability of the outcome under the null hypothesis) is no greater than this NOMINAL ALPHA CRITERION LEVEL.

Normality assumption

Violating the normality assumption may be the lesser of two evils

T-statistic

See T-statistic.

ANOVA

See ANOVA.

Goodness of fit

Chi-square tests

Fitting distribution

Fitting distributions with R

Contingency Tables

Odds ratio and Risk ratio

The ratio of the odds of an event occurring in one group to the odds of it occurring in another group

         drawn   | not drawn | 
-------------------------------------
white |   A      |   B       | Wh
-------------------------------------
black |   C      |   D       | Bk
  • Odds Ratio = (A / C) / (B / D) = (AD) / (BC)
  • Risk Ratio = (A / Wh) / (C / Bk)

Hypergeometric, One-tailed Fisher exact test

         drawn   | not drawn | 
-------------------------------------
white |   x      |           | m
-------------------------------------
black |  k-x     |           | n
-------------------------------------
      |   k      |           | m+n

For example, k=100, m=100, m+n=1000,

> 1 - phyper(10, 100, 10^3-100, 100, log.p=F)
[1] 0.4160339
> a <- dhyper(0:10, 100, 10^3-100, 100)
> cumsum(rev(a))
  [1] 1.566158e-140 1.409558e-135 3.136408e-131 3.067025e-127 1.668004e-123 5.739613e-120 1.355765e-116
  [8] 2.325536e-113 3.018276e-110 3.058586e-107 2.480543e-104 1.642534e-101  9.027724e-99  4.175767e-96
 [15]  1.644702e-93  5.572070e-91  1.638079e-88  4.210963e-86  9.530281e-84  1.910424e-81  3.410345e-79
 [22]  5.447786e-77  7.821658e-75  1.013356e-72  1.189000e-70  1.267638e-68  1.231736e-66  1.093852e-64
 [29]  8.900857e-63  6.652193e-61  4.576232e-59  2.903632e-57  1.702481e-55  9.240350e-54  4.650130e-52
 [36]  2.173043e-50  9.442985e-49  3.820823e-47  1.441257e-45  5.074077e-44  1.669028e-42  5.134399e-41
 [43]  1.478542e-39  3.989016e-38  1.009089e-36  2.395206e-35  5.338260e-34  1.117816e-32  2.200410e-31
 [50]  4.074043e-30  7.098105e-29  1.164233e-27  1.798390e-26  2.617103e-25  3.589044e-24  4.639451e-23
 [57]  5.654244e-22  6.497925e-21  7.042397e-20  7.198582e-19  6.940175e-18  6.310859e-17  5.412268e-16
 [64]  4.377256e-15  3.338067e-14  2.399811e-13  1.626091e-12  1.038184e-11  6.243346e-11  3.535115e-10
 [71]  1.883810e-09  9.442711e-09  4.449741e-08  1.970041e-07  8.188671e-07  3.193112e-06  1.167109e-05
 [78]  3.994913e-05  1.279299e-04  3.828641e-04  1.069633e-03  2.786293e-03  6.759071e-03  1.525017e-02
 [85]  3.196401e-02  6.216690e-02  1.120899e-01  1.872547e-01  2.898395e-01  4.160339e-01  5.550192e-01
 [92]  6.909666e-01  8.079129e-01  8.953150e-01  9.511926e-01  9.811343e-01  9.942110e-01  9.986807e-01
 [99]  9.998018e-01  9.999853e-01  1.000000e+00

# Density plot
plot(0:100, dhyper(0:100, 100, 10^3-100, 100), type='h')

File:Dhyper.svg

Moreover,

  1 - phyper(q=10, m, n, k) 
= 1 - sum_{x=0}^{x=10} phyper(x, m, n, k)
= 1 - sum(a[1:11]) # R's index starts from 1.

Another example is the data from the functional annotation tool in DAVID.

               | gene list | not gene list | 
-------------------------------------------------------
pathway        |   3  (q)  |               | 40 (m)
-------------------------------------------------------
not in pathway |  297      |               | 29960 (n)
-------------------------------------------------------
               |  300 (k)  |               | 30000

The one-tailed p-value from the hypergeometric test is calculated as 1 - phyper(3-1, 40, 29960, 300) = 0.0074.

Fisher's exact test

Following the above example from the DAVID website, the following R command calculates the Fisher exact test for independence in 2x2 contingency tables.

> fisher.test(matrix(c(3, 40, 297, 29960), nr=2)) #  alternative = "two.sided" by default

        Fisher's Exact Test for Count Data

data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  1.488738 23.966741
sample estimates:
odds ratio
  7.564602

> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.008853
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 1.973   Inf
sample estimates:
odds ratio
  7.564602

> fisher.test(matrix(c(3, 40, 297, 29960), nr=2), alternative="less")

        Fisher's Exact Test for Count Data

data:  matrix(c(3, 40, 297, 29960), nr = 2)
p-value = 0.9991
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
  0.00000 20.90259
sample estimates:
odds ratio
  7.564602

From the documentation of fisher.test

Usage:
     fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
                 control = list(), or = 1, alternative = "two.sided",
                 conf.int = TRUE, conf.level = 0.95,
                 simulate.p.value = FALSE, B = 2000)
  • For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution.
  • For 2 by 2 tables, the null of conditional independence is equivalent to the hypothesis that the odds ratio equals one.
  • The alternative for a one-sided test is based on the odds ratio, so ‘alternative = "greater"’ is a test of the odds ratio being bigger than ‘or’.
  • Two-sided tests are based on the probabilities of the tables, and take as ‘more extreme’ all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.

Chi-square independence test

Exploring the underlying theory of the chi-square test through simulation - part 2

GSEA

Determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states

Two categories of GSEA procedures:

  • Competitive: compare genes in the test set relative to all other genes.
  • Self-contained: whether the gene-set is more DE than one were to expect under the null of no association between two phenotype conditions (without reference to other genes in the genome). For example the method by Jiang & Gentleman Bioinformatics 2007

McNemar’s test on paired nominal data

https://en.wikipedia.org/wiki/McNemar%27s_test

Confidence vs Credibility Intervals

http://freakonometrics.hypotheses.org/18117

Power analysis/Sample Size determination

See Power.

Common covariance/correlation structures

See psu.edu. Assume covariance \(\displaystyle \Sigma = (\sigma_{ij})_{p\times p} \)

  • Diagonal structure: \(\displaystyle \sigma_{ij} = 0\) if \(\displaystyle i \neq j\).
  • Compound symmetry: \(\displaystyle \sigma_{ij} = \rho\) if \(\displaystyle i \neq j\).
  • First-order autoregressive AR(1) structure: \(\displaystyle \sigma_{ij} = \rho^{|i - j|}\).
    rho <- .8
    p <- 5
    blockMat <- rho ^ abs(matrix(1:p, p, p, byrow=T) - matrix(1:p, p, p))
  • Banded matrix: \(\displaystyle \sigma_{ii}=1, \sigma_{i,i+1}=\sigma_{i+1,i} \neq 0, \sigma_{i,i+2}=\sigma_{i+2,i} \neq 0\) and \(\displaystyle \sigma_{ij}=0\) for \(\displaystyle |i-j| \ge 3\).
  • Spatial Power
  • Unstructured Covariance
  • Toeplitz structure

To create blocks of correlation matrix, use the "%x%" operator. See kronecker().

covMat <- diag(n.blocks) %x% blockMat

Counter/Special Examples

Correlated does not imply independence

Suppose X is a normally-distributed random variable with zero mean. Let Y = X^2. Clearly X and Y are not independent: if you know X, you also know Y. And if you know Y, you know the absolute value of X.

The covariance of X and Y is

  Cov(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0*E(Y) = E(X^3)
           = 0, 

because the distribution of X is symmetric around zero. Thus the correlation r(X,Y) = Cov(X,Y)/Sqrt[Var(X)Var(Y)] = 0, and we have a situation where the variables are not independent, yet have (linear) correlation r(X,Y) = 0.

This example shows how a linear correlation coefficient does not encapsulate anything about the quadratic dependence of Y upon X.

Spearman vs Pearson correlation

Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship. https://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation

x=(1:100);<br />
y=exp(x);                      <br />
cor(x,y, method='spearman') # 1
cor(x,y, method='pearson')  # .25

Spearman vs Wilcoxon

By this post

  • Wilcoxon used to compare categorical versus non-normal continuous variable
  • Spearman's rho used to compare two continuous (including ordinal) variables that one or both aren't normally distributed

Spearman vs Kendall correlation

  • Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities.
  • Kendall Tau or Spearman's rho?

Anscombe quartet

Four datasets have almost same properties: same mean in X, same mean in Y, same variance in X, (almost) same variance in Y, same correlation in X and Y, same linear regression.

File:Anscombe quartet 3.svg

The real meaning of spurious correlations

https://nsaunders.wordpress.com/2017/02/03/the-real-meaning-of-spurious-correlations/

library(ggplot2)
 
set.seed(123)
spurious_data <- data.frame(x = rnorm(500, 10, 1),
                            y = rnorm(500, 10, 1),
                            z = rnorm(500, 30, 3))
cor(spurious_data$x, spurious_data$y)
# [1] -0.05943856
spurious_data %>% ggplot(aes(x, y)) + geom_point(alpha = 0.3) + 
theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")

cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.4517972
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) +
 theme_bw() + geom_smooth(method = "lm") + 
scale_color_gradientn(colours = c("red", "white", "blue")) + 
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")

spurious_data$z <- rnorm(500, 30, 6)
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.8424597
spurious_data %>% ggplot(aes(x/z, y/z)) + geom_point(aes(color = z), alpha = 0.5) + 
theme_bw() + geom_smooth(method = "lm") + 
scale_color_gradientn(colours = c("red", "white", "blue")) + 
labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")

Time series

Structural change

Structural Changes in Global Warming

AR(1) processes and random walks

Spurious correlations and random walks

Measurement Error model

Dictionary

Data

Eleven quick tips for finding research data

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006038

Books

Social

JSM

Following