Calculus-based Probability

This is content covered in STAT410 and STAT700 at UMD.

==Basics==
* <math>P(S) = 1</math> where <math>S</math> is your sample space
* For mutually exclusive events <math>E_1, E_2, ...</math>, <math>P\left(\bigcup_{i=1}^\infty E_i\right) = \sum_{i=1}^\infty P(E_i)</math>
===Monotonicity===
* For all events <math>A</math> and <math>B</math>, <math>A \subset B \implies P(A) \leq P(B)</math>
{{hidden | Proof |
Since <math>A \subset B</math>, we can write <math>B = A \cup (B \setminus A)</math> as a disjoint union, so <math>P(B) = P(A) + P(B \setminus A) \geq P(A)</math>.
}}
===Conditional Probability===
<math>P(A|B)</math> is the probability of event A given event B.<br>
Mathematically, this is defined as <math>P(A|B) = P(A,B) / P(B)</math>.<br>
Note that this can also be written as <math>P(A|B)P(B) = P(A, B)</math>.
With some additional substitution, we get '''Bayes' Theorem''':
<math display="block">
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
</math>
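As a worked example (with made-up illustrative numbers), suppose a disease has prevalence <math>P(D) = 0.01</math>, a test detects it with probability <math>P(T|D) = 0.9</math>, and it gives a false positive with probability <math>P(T|D^c) = 0.05</math>. Expanding the denominator with the law of total probability:
<math display="block">
P(D|T) = \frac{P(T|D)P(D)}{P(T|D)P(D) + P(T|D^c)P(D^c)} = \frac{0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.05 \cdot 0.99} = \frac{0.009}{0.0585} \approx 0.154
</math>
So even after a positive test, the probability of actually having the disease is only about 15%.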
==Random Variables==
A random variable is a function from the sample space to the real numbers; rather than having one fixed value, it takes values according to a probability distribution.
===PMF, PDF, CDF===
For discrete distributions, we call <math>p_{X}(x)=P(X=x)</math> the probability mass function (PMF).<br>
For continuous distributions, we have the probability density function (PDF) <math>f(x)</math>.<br>
The cumulative distribution function (CDF) is <math>F(x) = P(X \leq x)</math>.<br>
The CDF is the prefix sum of the PMF or the integral of the PDF. Likewise, the PDF is the derivative of the CDF.
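For example, for an exponential distribution with rate <math>\lambda</math> (chosen purely for illustration), the PDF and CDF are related exactly as described above:
<math display="block">
f(x) = \lambda e^{-\lambda x}, \quad F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}, \quad \frac{d}{dx}F(x) = \lambda e^{-\lambda x} = f(x) \qquad (x \geq 0)
</math>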
===Joint Random Variables===
Two random variables are independent iff <math>f_{X,Y}(x,y) = f_X(x) f_Y(y)</math>.<br>
In general, the marginal distribution of <math>X</math> is <math>f_X(x) = \int f_{X,Y}(x,y) dy</math>.
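As a small worked example with an illustrative joint density, take <math>f_{X,Y}(x,y) = x + y</math> on <math>[0,1]^2</math>:
<math display="block">
f_X(x) = \int_0^1 (x+y)\,dy = x + \tfrac{1}{2}, \qquad f_Y(y) = y + \tfrac{1}{2}
</math>
Since <math>f_X(x) f_Y(y) = (x + \tfrac{1}{2})(y + \tfrac{1}{2}) \neq x + y</math> in general, <math>X</math> and <math>Y</math> are not independent.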
===Change of variables===
Let <math>g</math> be a monotonic increasing function and <math>Y = g(X)</math>.<br>
Then <math>F_Y(y) = P(Y \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y))</math>.<br>
And <math>f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(g^{-1}(y)) = f_X(g^{-1}(y)) \frac{d}{dy}g^{-1}(y)</math><br>
Hence:
<math display="block">
f_Y(y) = f_X(g^{-1}(y)) \frac{d}{dy} g^{-1}(y)
</math>
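For example, take <math>X \sim N(0,1)</math> and <math>Y = g(X) = e^X</math> (this choice just recovers the standard log-normal distribution). Here <math>g^{-1}(y) = \ln y</math> and <math>\frac{d}{dy} g^{-1}(y) = \frac{1}{y}</math>, so the formula gives
<math display="block">
f_Y(y) = f_X(\ln y) \frac{1}{y} = \frac{1}{y\sqrt{2\pi}} e^{-(\ln y)^2/2}, \qquad y > 0
</math>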
==Expectation and Variance==
* <math>E(X) = \sum_S xp(x)</math> or <math>\int_S xp(x)dx</math>
* <math>Var(X) = E[(X-E(X))^2] = E(X^2) - (E(X))^2</math>
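For example, for <math>X \sim \operatorname{Uniform}(0,1)</math>:
<math display="block">
E(X) = \int_0^1 x\,dx = \tfrac{1}{2}, \quad E(X^2) = \int_0^1 x^2\,dx = \tfrac{1}{3}, \quad Var(X) = \tfrac{1}{3} - \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{12}
</math>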
===Total Expectation===
<math>E_{X}(X) = E_{Y}(E_{X|Y}(X|Y))</math><br>
Dr. Xu refers to this as the smooth property.
{{hidden | Proof |
<math>
\begin{aligned}
E(X) &= \int_S x p(x)dx \\
&= \int_x x \int_y p(x,y)dy dx \\
&= \int_x x \int_y p(x|y)p(y)dy dx \\
&= \int_y \int_x x p(x|y)dx \, p(y)dy \\
&= \int_y E_{X|Y}(X|Y=y) p(y)dy = E_{Y}(E_{X|Y}(X|Y))
\end{aligned}
</math>
}}
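As a quick illustrative example, let <math>Y \sim Poisson(\lambda)</math> and <math>X | Y \sim Binomial(Y, p)</math>. Then
<math display="block">
E(X) = E_{Y}(E_{X|Y}(X|Y)) = E_{Y}(pY) = p\lambda
</math>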
===Total Variance===
<math>Var(Y) = E(Var(Y|X)) + Var(E(Y|X))</math><br>
This one is not used as often on tests as total expectation.
{{hidden | Proof |
<math>
\begin{aligned}
Var(Y) &= E(Y^2) - E(Y)^2 \\
&= E(E(Y^2|X)) - E(E(Y|X))^2\\
&= E(Var(Y|X) + E(Y|X)^2) - E(E(Y|X))^2\\
&= E(Var(Y|X)) + E(E(Y|X)^2) - E(E(Y|X))^2\\
&= E(Var(Y|X)) + Var(E(Y|X))
\end{aligned}
</math>
}}
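Continuing the illustrative example from total expectation (<math>Y \sim Poisson(\lambda)</math> and <math>X | Y \sim Binomial(Y, p)</math>):
<math display="block">
Var(X) = E(Var(X|Y)) + Var(E(X|Y)) = E(Yp(1-p)) + Var(pY) = \lambda p(1-p) + p^2 \lambda = \lambda p
</math>
which matches the fact that a thinned Poisson is again Poisson, here with rate <math>\lambda p</math>.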
===Sample Mean and Variance===
The sample mean is <math>\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i</math>.<br>
The unbiased sample variance is <math>S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2</math>.
====Student's Theorem====
Let <math>X_1,...,X_n</math> be iid from <math>N(\mu, \sigma^2)</math>.<br>
Then the following results about the sample mean <math>\bar{X}</math>
and the unbiased sample variance <math>S^2</math> hold:
* <math>\bar{X}</math> and <math>S^2</math> are independent
* <math>\bar{X} \sim N(\mu, \sigma^2 / n)</math>
* <math>(n-1)S^2 / \sigma^2 \sim \chi^2(n-1)</math>
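A standard consequence of these three facts (together with the T-distribution characterization given below) is that the studentized mean follows a t-distribution:
<math display="block">
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} \sim t(n-1)
</math>
since the numerator is <math>N(0,1)</math> and the denominator is the square root of an independent <math>\chi^2(n-1)</math> variable divided by its degrees of freedom.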
===Jensen's Inequality===
{{main | Wikipedia: Jensen's inequality}}
Let <math>g</math> be a convex function (e.g. one whose second derivative is non-negative).
Then <math>g(E(X)) \leq E(g(X))</math>.
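For example, <math>g(x) = x^2</math> is convex, so
<math display="block">
E(X)^2 = g(E(X)) \leq E(g(X)) = E(X^2) \implies Var(X) = E(X^2) - E(X)^2 \geq 0
</math>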
==Moments and Moment Generating Functions==
===Definitions===
{{main | Wikipedia: Moment (mathematics) | Wikipedia: Central moment | Wikipedia: Moment-generating function}}
* <math>E(X^n)</math> is the n-th moment
* <math>E((X-\mu)^n)</math> is the n-th central moment
* <math>E(((X-\mu) / \sigma)^n)</math> is the n-th standardized moment
Expectation is the first moment and variance is the second central moment.<br>
Additionally, ''skew'' is the third standardized moment and ''kurtosis'' is the fourth standardized moment.
===Moment Generating Functions===
To compute moments, we can use a moment generating function (MGF):
<math display="block">M_X(t) = E(e^{tX})</math>
With the MGF, we can get the n-th moment by taking n derivatives and setting <math display="inline">t=0</math>: <math display="inline">E(X^n) = M_X^{(n)}(0)</math>.
; Notes
* The MGF, if it exists in a neighborhood of <math>t=0</math>, uniquely determines the distribution.
* For independent <math>X</math> and <math>Y</math>, the MGF of <math>X+Y</math> is <math>M_{X+Y}(t) = E(e^{t(X+Y)})=E(e^{tX})E(e^{tY}) = M_X(t) M_Y(t)</math>
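As a worked example, for <math>X \sim \operatorname{Exp}(\lambda)</math> (chosen for illustration):
<math display="block">
M_X(t) = \int_0^\infty e^{tx} \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t} \quad (t < \lambda), \qquad
E(X) = M_X'(0) = \frac{1}{\lambda}, \qquad
E(X^2) = M_X''(0) = \frac{2}{\lambda^2}
</math>
so <math>Var(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}</math>.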
===Characteristic function===
The characteristic function <math>\varphi_X(t) = E(e^{itX})</math> always exists and also uniquely determines the distribution.
==Convergence==
{{main | Wikipedia: Convergence of random variables}}
There are 4 common types of convergence.
===Almost Surely===
* <math>P(\lim_{i \to \infty} X_i = X) = 1</math>
==Delta Method==
{{main | Wikipedia:Delta method}}
Suppose <math>\sqrt{n}(X_n - \theta) \xrightarrow{D} N(0, \sigma^2)</math>.<br>
Let <math>g</math> be a function such that <math>g'</math> exists and <math>g'(\theta) \neq 0</math>.<br>
Then <math>\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2 g'(\theta)^2)</math><br>
Multivariate:<br>
<math>\sqrt{n}(B - \beta) \xrightarrow{D} N(0, \Sigma) \implies \sqrt{n}(h(B)-h(\beta)) \xrightarrow{D} N(0, h'(\beta)^T \Sigma h'(\beta))</math>
;Notes
* You can think of this like the Mean Value Theorem for random variables.
** <math>(g(X_n) - g(\theta)) \approx g'(\theta)(X_n - \theta)</math>
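For example, if <math>X_1, ..., X_n</math> are iid with mean <math>\theta \neq 0</math> and variance <math>\sigma^2</math>, the CLT gives <math>\sqrt{n}(\bar{X}_n - \theta) \xrightarrow{D} N(0, \sigma^2)</math>. Taking <math>g(x) = x^2</math> (so <math>g'(\theta) = 2\theta</math>), the delta method yields
<math display="block">
\sqrt{n}\left(\bar{X}_n^2 - \theta^2\right) \xrightarrow{D} N(0, 4\theta^2\sigma^2)
</math>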
==Order Statistics==
Consider iid random variables <math>X_1, ..., X_n</math>.<br>
Then the order statistics are <math>X_{(1)}, ..., X_{(n)}</math> where <math>X_{(i)}</math> represents the i-th smallest value.
===Min and Max===
The easiest to reason about are the minimum and maximum order statistics:
<math display="block">P(X_{(1)} \leq x) = P(\min(X_i) \leq x) = 1 - P(X_1 > x, ..., X_n > x)</math>
<math display="block">P(X_{(n)} \leq x) = P(\max(X_i) \leq x) = P(X_1 \leq x, ..., X_n \leq x)</math>
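For example, if <math>X_1, ..., X_n \sim \operatorname{Exp}(\lambda)</math> are iid, then
<math display="block">
P(X_{(1)} > x) = P(X_1 > x, ..., X_n > x) = \left(e^{-\lambda x}\right)^n = e^{-n\lambda x}
</math>
so <math>X_{(1)} \sim \operatorname{Exp}(n\lambda)</math>, which matches the exponential-distribution facts listed below.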
===Joint PDF===
If <math>X_i</math> has pdf <math>f</math>, the joint pdf of <math>X_{(1)}, ..., X_{(n)}</math> is:
<math display="block">
g(x_1, ..., x_n) = n! \, f(x_1) \cdots f(x_n) \quad \text{for } x_1 < x_2 < ... < x_n
</math>
since there are <math>n!</math> orderings of the underlying sample that map to the same ordered values.
===Individual PDF===
<math display="block">
f_{X_{(i)}}(x) = \frac{n!}{(i-1)!(n-i)!} F(x)^{i-1} f(x) [1-F(x)]^{n-i}
</math>
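For example, for <math>X_1, ..., X_n \sim \operatorname{Uniform}(0,1)</math> iid, <math>F(x) = x</math> and <math>f(x) = 1</math>, so
<math display="block">
f_{X_{(i)}}(x) = \frac{n!}{(i-1)!(n-i)!} x^{i-1} (1-x)^{n-i}, \qquad 0 \leq x \leq 1
</math>
i.e. <math>X_{(i)} \sim \operatorname{Beta}(i, n-i+1)</math>.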
==Inequalities and Limit Theorems==
===Markov's Inequality===
* For a non-negative random variable <math>X</math> and <math>a > 0</math>, <math>P(X \geq a) \leq \frac{E(X)}{a}</math>
{{hidden | Proof |
<math>
\begin{aligned}
E(X)
&= \int_{0}^{\infty}xf(x)dx \\
&= \int_{0}^{a}xf(x)dx + \int_{a}^{\infty}xf(x)dx\\
&\geq \int_{a}^{\infty}xf(x)dx\\
&\geq \int_{a}^{\infty}af(x)dx\\
&= a \int_{a}^{\infty}f(x)dx\\
&= a \cdot P(X \geq a)\\
\implies& P(X \geq a) \leq \frac{E(X)}{a}
\end{aligned}
</math>
}}
===Chebyshev's Inequality===
* <math>P(|X - \mu| \geq k \sigma) \leq \frac{1}{k^2}</math>
* Equivalently, <math>P(|X - \mu| \geq k) \leq \frac{E((X-\mu)^2)}{k^2} = \frac{\sigma^2}{k^2}</math>
{{hidden | Proof |
Apply Markov's inequality:<br>
Let <math>Y = |X - \mu|</math><br>
Then:<br>
<math>
\begin{aligned}
P(|X - \mu| \geq k) &= P(Y \geq k) \\
&= P(Y^2 \geq k^2) \\
&\leq \frac{E(Y^2)}{k^2} \\
&= \frac{E((X - \mu)^2)}{k^2} = \frac{\sigma^2}{k^2}
\end{aligned}
</math>
Substituting <math>k \to k\sigma</math> gives the first form.
}}
* Usually used to prove convergence in probability
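For example, applying Chebyshev to the sample mean gives a minimal sketch of the weak law of large numbers: for iid <math>X_i</math> with mean <math>\mu</math> and variance <math>\sigma^2</math>, <math>Var(\bar{X}_n) = \sigma^2/n</math>, so for any <math>\epsilon > 0</math>
<math display="block">
P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0
</math>
i.e. <math>\bar{X}_n</math> converges to <math>\mu</math> in probability.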
===Law of Large Numbers===
* The sample mean converges to the population mean almost surely.
==Properties and Relationships between distributions==
{{main | Wikipedia: Relationships among probability distributions}}
;This is important for exams.
===Poisson Distribution===
* If <math>X_i \sim Poisson(\lambda_i)</math> are independent then <math>\sum X_i \sim Poisson(\sum \lambda_i)</math>
===Normal Distribution===
* If <math>X_1 \sim N(\mu_1, \sigma_1^2)</math> and <math>X_2 \sim N(\mu_2, \sigma_2^2)</math> are independent then <math>\lambda_1 X_1 + \lambda_2 X_2 \sim N(\lambda_1 \mu_1 + \lambda_2 \mu_2, \lambda_1^2 \sigma_1^2 + \lambda_2^2 \sigma_2^2)</math> for any <math>\lambda_1, \lambda_2 \in \mathbb{R}</math>
===Exponential Distribution===
* <math>\operatorname{Exp}(\lambda)</math> is equivalent to <math>\Gamma(1, 1/\lambda)</math>
** Note that some conventions use the rate instead of the scale for the second parameter of the gamma, in which case it would be <math>\Gamma(1, \lambda)</math>
* If <math>\epsilon_i \sim \operatorname{Exp}(\lambda_i)</math> are independent then <math>\min_i\{\epsilon_i\} \sim \operatorname{Exp}(\sum \lambda_i)</math>
* Note that the maximum is not exponentially distributed
** However, if <math>X_1, ..., X_n \sim \operatorname{Exp}(1)</math> are iid then <math>Z_n = n e^{-\max_i\{X_i\}} \xrightarrow{D} \operatorname{Exp}(1)</math>
===Gamma Distribution===
Note that exponential distributions are also Gamma distributions.
* If <math>X \sim \Gamma(k, \theta)</math> and <math>c > 0</math> then <math>cX \sim \Gamma(k, c\theta)</math>
===T-distribution===
* The ratio of a standard normal to the square root of a normalized Chi-sq random variable yields a T-distribution.
** If <math>Z \sim N(0,1)</math> and <math>V \sim \chi^2(v)</math> are independent then <math>\frac{Z}{\sqrt{V/v}} \sim \text{t-dist}(v)</math>
===Chi-Sq Distribution===
* The ratio of two normalized Chi-sq random variables is an F-distribution
** If <math>X \sim \chi^2_{d_1}</math> and <math>Y \sim \chi^2_{d_2}</math> are independent then <math>\frac{X/d_1}{Y/d_2} \sim F(d_1,d_2)</math>
* If <math>Z_1,...,Z_k \sim N(0,1)</math> are iid then <math>Z_1^2 + ... + Z_k^2 \sim \chi^2(k)</math>
* If <math>X_i \sim \chi^2(k_i)</math> are independent then <math>X_1 + ... + X_n \sim \chi^2(k_1 +...+ k_n)</math>
* <math>\chi^2(k)</math> is equivalent to <math>\Gamma(k/2, 2)</math>
===F Distribution===
Too many to list. See [[Wikipedia: F-distribution]].
Most important are the Chi-sq and T distributions:
* If <math>X \sim \chi^2_{d_1}</math> and <math>Y \sim \chi^2_{d_2}</math> are independent then <math>\frac{X/d_1}{Y/d_2} \sim F(d_1,d_2)</math>
* If <math>X \sim t_{(n)}</math> then <math>X^2 \sim F(1, n)</math> and <math>X^{-2} \sim F(n, 1)</math>
==Textbooks==
* [https://smile.amazon.com/dp/032179477X Sheldon Ross' A First Course in Probability]
* [https://smile.amazon.com/dp/0321795431 Hogg and Craig's Mathematical Statistics]
* [https://smile.amazon.com/dp/0534243126 Casella and Berger's Statistical Inference]