Machine Learning
Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.
==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum_i (h_{\theta}(x^{(i)}) - y^{(i)})^2</math><br>
If our model is linear regression <math>h(x)=w^Tx</math> then this is convex.<br>
{{hidden|Proof|
<math>
\begin{aligned}
\nabla_{w} J(w) &= \nabla_{w} \sum_i (w^Tx^{(i)} - y^{(i)})^2\\
&= 2\sum_i (w^T x^{(i)} - y^{(i)})x^{(i)} \\
\implies \nabla_{w}^2 J(w) &= \nabla_{w} 2\sum_i (w^T x^{(i)} - y^{(i)})x^{(i)}\\
&= 2 \sum_i x^{(i)}(x^{(i)})^T
\end{aligned}
</math><br>
so the Hessian is positive semi-definite.
}}
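As a quick numerical sanity check of the result above (a minimal sketch; the randomly generated design matrix is an assumption for illustration), the Hessian <math>2\sum_i x^{(i)}(x^{(i)})^T = 2X^TX</math> can be built explicitly and its eigenvalues checked to be non-negative:
<pre>
import numpy as np

# Hypothetical data: 50 examples, 3 features (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Hessian of J(w) = sum_i (w^T x_i - y_i)^2 is 2 * X^T X.
H = 2 * X.T @ X

# All eigenvalues are >= 0 (up to floating point error), i.e. the Hessian is PSD.
print(np.linalg.eigvalsh(H))
</pre>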
===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum_i [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* This is the negative of the sum of the log probabilities of picking the correct class (i.e. <math>p</math> if <math>y=1</math> or <math>1-p</math> if <math>y=0</math>), so minimizing it maximizes the likelihood.
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex.
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\nabla_\theta J(\theta) = -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]
</math><br>
<math>
= -\sum [(y^{(i)})\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]
</math><br>
<math>
\implies \nabla^2_\theta J(\theta) = \nabla_\theta \left(-\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\right)
</math><br>
<math>
= \sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
</math><br>
which is a positive semi-definite matrix.
}}
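The gradient derived above, <math>\nabla_\theta J(\theta) = -\sum_i (y^{(i)} - g(\theta^T x^{(i)}))x^{(i)}</math>, is easy to sanity-check numerically. Below is a minimal sketch (the random data and finite-difference comparison are assumptions for illustration):
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -sum_i [y_i log(p_i) + (1 - y_i) log(1 - p_i)]
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(theta, X, y):
    # -sum_i (y_i - g(theta^T x_i)) x_i, from the derivation above
    return -X.T @ (y - sigmoid(X @ theta))

# Hypothetical data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# Compare the analytic gradient to a central finite difference in coordinate 0.
eps = 1e-6
e0 = np.array([1.0, 0.0, 0.0])
fd = (cross_entropy(theta + eps * e0, X, y) - cross_entropy(theta - eps * e0, X, y)) / (2 * eps)
print(grad(theta, X, y)[0], fd)
</pre>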
===Hinge Loss===
==Optimization==
<pre>
update using above gradient
</pre>
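A minimal Python sketch of a gradient-descent loop of the shape described above (the <code>loss_grad</code> function, data, and fixed learning rate are assumptions for illustration):
<pre>
import numpy as np

def gradient_descent(loss_grad, theta0, X, y, lr=0.1, num_steps=100):
    # loss_grad(theta, X, y) is a hypothetical function returning the gradient at theta.
    theta = theta0.copy()
    for _ in range(num_steps):
        g = loss_grad(theta, X, y)  # compute gradient on the data (or a minibatch)
        theta = theta - lr * g      # update using above gradient
    return theta

# Example usage with the gradient of the mean squared error: (2/n) X^T (X w - y).
mse_grad = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
print(gradient_descent(mse_grad, np.zeros(3), X, y))
</pre>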
;Batch Size
* [https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
===Coordinate Block Descent===
===Learning Rate===
===Hessian===
Some optimization methods may rely on the Hessian.<br>
;How to calculate <math>Hv</math>
Use finite differencing with directional derivatives, where <math>g(\theta)</math> denotes the gradient (a sketch is given below):<br>
*<math> H_{\theta} \mathbf{v} = \lim_{\epsilon \rightarrow 0} \frac{g(\theta + \epsilon \mathbf{v}) - g(\theta)}{\epsilon}</math>
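A minimal sketch of this finite-difference Hessian-vector product (the quadratic test function and <code>grad_fn</code> are assumptions for illustration):
<pre>
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    # H v is approximated by (g(theta + eps * v) - g(theta)) / eps
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

# Test on f(x) = 0.5 x^T A x, whose gradient is A x and whose Hessian is A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_fn = lambda x: A @ x
v = np.array([1.0, -1.0])
print(hessian_vector_product(grad_fn, np.zeros(2), v))  # approximately A @ v
</pre>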
;How to calculate <math>H_{\theta}^{-1}v</math>?
Calculate <math>H</math>. Then use gradient descent to minimize <math>\frac{1}{2}x^T H x - v^T x</math> (a sketch follows the list below).<br>
By first-order optimality, the minimum occurs when the gradient is zero.<br>
* <math>\nabla_{x} (\frac{1}{2}x^T H x - v^T x) = \frac{1}{2}(H + H^T)x - v = Hx - v</math> since the Hessian <math>H</math> is symmetric
* <math>\implies x^* = H^{-1}v</math>
* Using [https://math.stackexchange.com/questions/222894/how-to-take-the-gradient-of-the-quadratic-form Gradient of quadratic form]
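A minimal sketch of this idea, using plain gradient descent on the quadratic (the symmetric positive-definite <math>H</math>, step size, and iteration count are assumptions for illustration):
<pre>
import numpy as np

def inverse_hessian_vector_product(H, v, lr=0.1, num_steps=500):
    # Minimize 0.5 x^T H x - v^T x; its gradient is H x - v,
    # which vanishes at x* = H^{-1} v.
    x = np.zeros_like(v)
    for _ in range(num_steps):
        x = x - lr * (H @ x - v)
    return x

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # symmetric positive definite (assumption)
v = np.array([1.0, 2.0])
print(inverse_hessian_vector_product(H, v))
print(np.linalg.solve(H, v))  # reference answer for comparison
</pre>
In practice a solver such as conjugate gradients is typically used instead of plain gradient descent, but the idea is the same.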
==SVM==
[https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
This is a linear classifier, like the perceptron, except the goal is not just to classify the data correctly but also to maximize the margin.<br>
;Geometric Margin
The geometric margin is the actual distance from the point to the hyperplane (a small computation is sketched below).<br>
<math>\gamma^{(i)} = y^{(i)}((\frac{w}{\Vert w \Vert})^Tx^{(i)}+\frac{b}{\Vert w \Vert})</math><br>
;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane so <math>(\frac{w}{\Vert w \Vert})^Tx^{(i)}</math> is the length of the projection of <math>x^{(i)}</math> onto our normal vector.
: This is the distance to our hyperplane.
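A small sketch computing the geometric margin of each training point for a given <math>(w, b)</math> (the toy data and parameters are assumptions for illustration):
<pre>
import numpy as np

def geometric_margins(w, b, X, y):
    # y_i * ((w / ||w||)^T x_i + b / ||w||) for each example x_i
    norm = np.linalg.norm(w)
    return y * (X @ w + b) / norm

# Toy linearly separable data (assumption for illustration).
X = np.array([[2.0, 2.0],
              [3.0, 3.0],
              [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 1.0])
b = -1.0
print(geometric_margins(w, b, X, y))  # all positive since (w, b) separates the data
</pre>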
===Lagrangians===
The goal for SVM is to maximize the margin:<br>
<math>
\begin{aligned}
\max_{\hat{\gamma}, w, b} &\frac{\hat{\gamma}}{\Vert w \Vert}\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq \hat{\gamma} \quad \forall i
\end{aligned}
</math><br>
which is equivalent to, by setting <math>\hat{\gamma}=1</math>,<br>
<math>
\begin{aligned}
\min_{w, b} &\Vert w \Vert ^2\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq 1 \quad \forall i
\end{aligned}
</math><br><br>
In general, given an optimization in the (primal) form:<br>
<math>
\begin{aligned}
\min_w & f(w)\\
\text{s.t. }& g_i(w) \leq 0 \quad \forall i\\
& h_i(w) = 0
\end{aligned}
</math><br>
we can rewrite the optimization as <br>
<math>
\min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
</math><br>
where <math>\mathcal{L}(w, \alpha, \beta) = f(w) + \sum \alpha_i g_i(w) + \sum \beta_i h_i(w)</math> is called the Lagrangian.<br>
Since <math>\max \min \leq \min \max</math>,<br>
we have:<br>
<math>
\max_{\alpha, \beta \mid \alpha \geq 0}\min_{w} \mathcal{L}(w, \alpha, \beta) \leq \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
</math><br>
The left-hand side is called the dual problem.<br>
If the solution to the dual problem satisfies some conditions called the KKT conditions, then it is also a solution to the original (primal) problem.
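As a small worked example of the machinery above (a toy problem chosen for illustration, not the SVM itself), consider minimizing <math>f(w)=w^2</math> subject to the single inequality constraint <math>g(w) = 1 - w \leq 0</math>:<br>
<math>
\begin{aligned}
\mathcal{L}(w, \alpha) &= w^2 + \alpha(1 - w)\\
\frac{\partial \mathcal{L}}{\partial w} = 2w - \alpha = 0 &\implies w = \frac{\alpha}{2}\\
\max_{\alpha \geq 0} \mathcal{L}\left(\tfrac{\alpha}{2}, \alpha\right) &= \max_{\alpha \geq 0} \left(\alpha - \frac{\alpha^2}{4}\right) \implies \alpha^* = 2,\ w^* = 1
\end{aligned}
</math><br>
The dual optimum matches the primal optimum (<math>w^*=1</math>, <math>f(w^*)=1</math>) and complementary slackness holds since <math>\alpha^*(1-w^*)=0</math>.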
===Kernel Trick===
If the data are not linearly separable, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>.
\(\DeclareMathOperator{\sign}{sign}\)
Suppose our model is <math>\hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle</math>.
In this case, our model is a linear combination of the training labels \(y_i\), where <math>\langle x_i, z \rangle</math> represents a similarity between \(z\) and \(x_i\).
Since we only use <math>\langle x, z\rangle</math>, we only need <math>\phi(x)^T\phi(z)</math> to simulate a non-linear preprocessing of the data.
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.
Ideally, our kernel function should be computable without having to compute the actual <math>\phi</math> or the full dot product.
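A minimal sketch of making predictions with a kernel in place of the inner product (the RBF kernel choice, toy data, and per-example weights <math>w_i</math> are assumptions for illustration; the weights could come, for example, from a kernel perceptron or the SVM dual):
<pre>
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2); computable without an explicit phi
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(X_train, y_train, weights, z, kernel=rbf_kernel):
    # y_hat = sign( sum_i w_i * y_i * K(x_i, z) )
    s = sum(w * y * kernel(x, z) for w, y, x in zip(weights, y_train, X_train))
    return np.sign(s)

# Hypothetical weights and toy data (assumptions for illustration).
X_train = np.array([[0.0, 0.0],
                    [1.0, 1.0]])
y_train = np.array([-1.0, 1.0])
weights = np.array([1.0, 1.0])
print(predict(X_train, y_train, weights, np.array([0.9, 0.9])))  # closer to the +1 example
</pre>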
====Identifying if a function is a kernel====
Basic check:
Since the kernel is an inner product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, in particular symmetry <math>K(x,z)=K(z,x)</math>; otherwise it is not a kernel.
====Mercer's Theorem====
Let our kernel function be <math>K(z,x)</math>.
Then <math>K</math> is a valid kernel if and only if for any finite set of points <math>\{\mathbf{x}^{(1)},...,\mathbf{x}^{(n)}\}</math>, the kernel matrix <math>\mathbf{K}</math> defined by <math>K_{ij} = K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})</math> is symmetric positive semi-definite.
{{hidden|Proof|
If <math>K</math> is a kernel then <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math>.<br>
Symmetric: <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}</math><br>
Positive semi-definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then<br>
<math>
\mathbf{v}^T \mathbf{K} \mathbf{v}= \mathbf{v}^T [\sum_j K_{ij}v_j]
= \sum_i \sum_j v_{i}K_{ij}v_{j}
</math><br>
<math>
= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))^2
\geq 0
</math>
}}
====Common Kernels====
; Polynomial Kernel
* See [[Wikipedia:Polynomial kernel]]
* \(K(x,z) = (c+x^Tz)^d\)
* For \(d=2\), we have
\[
\begin{aligned}
(1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\
&= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\
&= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\
&= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2
\end{aligned}
\]
* The dimension of the feature map associated with this kernel is exponential in \(d\) (a numeric check of the \(d=2\) expansion is sketched after this list)
** There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms
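A quick numeric check of the <math>d=2</math> expansion above, comparing <math>(1+x^Tz)^2</math> with the dot product of the explicit feature maps (the random test vectors are an assumption for illustration):
<pre>
import numpy as np

def phi(x):
    # Explicit feature map matching the expansion above:
    # 1, sqrt(2)*x_i, sqrt(2)*x_i*x_j (i < j), and x_i^2
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(n)]
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    feats += [x[i] ** 2 for i in range(n)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print((1 + x @ z) ** 2)  # kernel evaluated directly
print(phi(x) @ phi(z))   # same value via the explicit feature map
</pre>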
==Learning Theory==
===Uniform Convergence===
If for every hypothesis <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
Then if a training set is <math>\epsilon / 2</math>-representative,<br>
<math>
L_D(h_s) \leq L_S(h_s) + \frac{\epsilon}{2} \leq L_S(h_D^*) + \frac{\epsilon}{2} \leq L_D(h_D^*) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_D(h_D^*) + \epsilon
</math><br>
where <math>h_s</math> is the empirical risk minimizer and <math>h_D^*</math> is the hypothesis with the lowest true loss.
===VC Dimension===
[https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/10notes.pdf Some slides]
====Shattering====
A model <math>f</math> parameterized by <math>\theta</math> is said to shatter a set of points <math>\{x_1, ..., x_n\}</math> if for every possible set of binary labellings <math>\{y_1,...,y_n\}</math> there exists <math>\theta</math> such that <math>f</math> makes no errors.
====Definition====
Intuitively, the VC dimension of a hypothesis class measures how complex its models can be.<br>
Stolen from wikipedia:<br>
The VC dimension of a model <math>f</math> is the maximum number of points that can be arranged so that <math>f</math> shatters them.
* To show <math>VCdim \leq n</math>, prove that no training set of size <math>n+1</math> can be shattered by our hypothesis class.
* The number of parameters does not necessarily correspond to the VC dimension.
*: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter
====Theory====
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the hypothesis class <math>\mathcal{H}</math> is finite, we have with probability <math>1-\delta</math>
* <math>|L_D(h) - L_S(h)| < \sqrt{
\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}
</math>
: where <math>m</math> is the size of the sample.
For all <math>h \in \mathcal{H}</math>,
* <math>|L_D(h) - L_S(h)| < K_1 \sqrt{
\frac{VCdim + K_2 \log(2/\delta)}{2m}}
</math>
: for some constants <math>K_1, K_2</math>
====Growth Function====
The growth function is the maximum number of ways <math>m</math> examples can be labelled using hypotheses from <math>\mathcal{H}</math>:
* <math>\tau_H(m) = \max_{|C| = m} |H_C|</math>
*: where <math>H_C = \{(h(c_1),...,h(c_m)) \mid h \in H\}</math> is the restriction of <math>H</math> to the points of <math>C</math>
;Notes
* If <math>m \leq VCdim(H)</math>, then <math>\tau_H(m) = 2^m</math>
====Sauer's Lemma====
[https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf Reference]<br>
Beyond the VC dimension, the growth function grows only polynomially:
* If <math>VCdim(H)\leq d < \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}</math>
* Also if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>.
===Bias-Variance Tradeoff===
* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math> and <math>L_s(h)</math> be the empirical loss of hypothesis <math>h</math>.
** Here <math>D</math> is the true distribution and <math>s</math> is the training sample.
* <math>L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]</math>
** where <math>h_s^*</math> minimizes the empirical loss and <math>h_D^*</math> minimizes the true loss over the hypothesis class
* The term <math>L_D(h_D^*)</math> is called the bias (or approximation error).
* The term <math>[L_D(h_s^*) - L_D(h_D^*)]</math> is called the variance (or estimation error).
* A larger hypothesis class will have smaller bias but larger variance.
* Large variance corresponds to overfitting; large bias corresponds to underfitting.
===Rademacher Complexity===
====Definition====
The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.<br>
In this case, we see how well we can fit random noise.<br>
Given a set <math>A \subset \mathbb{R}^m</math>, the Rademacher complexity is:<br>
<math>R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]</math><br>
where each <math>\sigma_i</math> is drawn uniformly from <math>\{-1, 1\}</math>.<br>
Given a sample <math>S=\{z_1,...,z_m\}</math> and a function class <math>F</math>, the empirical Rademacher complexity is:<br>
<math>R(F \circ S)</math><br>
where <math>F \circ S = \{(f(z_1),...,f(z_m)) \mid f \in F\}</math><br>
;Notes
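* A Monte Carlo estimate of <math>R(A)</math> for a small finite set <math>A</math> can be sketched as follows (the hand-picked <math>A</math> and number of samples are assumptions for illustration):
<pre>
import numpy as np

def empirical_rademacher(A, num_samples=10000, seed=0):
    # A has shape (num_vectors, m); estimate (1/m) E_sigma[ sup_{a in A} sum_i sigma_i a_i ]
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    total = 0.0
    for _ in range(num_samples):
        sigma = rng.choice([-1.0, 1.0], size=m)  # uniform Rademacher signs
        total += np.max(A @ sigma)               # sup over the finite set A
    return total / (num_samples * m)

# Hand-picked finite set A (assumption for illustration).
A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [-1.0, -1.0, -1.0]])
print(empirical_rademacher(A))
</pre>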
===Concentration Bounds===
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
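A quick simulation sketch comparing the empirical deviation frequency with the Hoeffding bound (the uniform random variables on <math>[0,1]</math> and the chosen <math>n</math>, <math>t</math> are assumptions for illustration):
<pre>
import numpy as np

# X_1, ..., X_n i.i.d. uniform on [0, 1], so (a, b) = (0, 1) and E[X-bar] = 0.5.
n, t, trials = 100, 0.1, 20000
rng = np.random.default_rng(0)
samples = rng.random(size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = np.mean(deviations >= t)
bound = 2 * np.exp(-2 * n * t ** 2 / (1 - 0) ** 2)
print(empirical, bound)  # the empirical frequency should not exceed the bound
</pre>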
==See Also==
* [[Supervised Learning]]
* [[Unsupervised Learning]]
* [[Deep Learning]]