Machine Learning


Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.
==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
If our model is linear regression <math>h(x)=w^Tx</math>, then this is convex.<br>
{{hidden|Proof|
<math>
\begin{aligned}
\nabla_{w} J(w) &= \nabla_{w} \sum (w^Tx^{(i)} - y^{(i)})^2\\
&= 2\sum (w^T x^{(i)} - y^{(i)})x^{(i)} \\
\implies \nabla_{w}^2 J(w) &= \nabla_{w} 2\sum (w^T x^{(i)} - y^{(i)})x^{(i)}\\
&= 2 \sum x^{(i)}(x^{(i)})^T
\end{aligned}
</math><br>
so the Hessian is positive semi-definite.
}}
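To make this concrete, here is a minimal NumPy sketch of the squared-error loss and its gradient for linear regression; the names <code>X</code>, <code>y</code>, and <code>w</code> are illustrative.
<pre>
import numpy as np

def squared_error(w, X, y):
    """J(w) = sum_i (w^T x_i - y_i)^2 for linear regression."""
    residual = X @ w - y
    return np.sum(residual ** 2)

def squared_error_grad(w, X, y):
    """Gradient: 2 * sum_i (w^T x_i - y_i) x_i."""
    residual = X @ w - y
    return 2 * X.T @ residual

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(squared_error(np.zeros(3), X, y), squared_error_grad(np.zeros(3), X, y))
</pre>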
===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
* This is the negative of the sum of the log probabilities of picking the correct class (i.e. <math>p</math> if <math>y=1</math> or <math>1-p</math> if <math>y=0</math>).
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math>, then this is convex.
 
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{aligned}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)}) g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left( -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}] \right)\\
&= \sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
\end{aligned}
</math><br>
which is a PSD matrix.
}}
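Similarly, here is a minimal NumPy sketch of the cross entropy loss and its gradient for logistic regression; the names <code>X</code>, <code>y</code>, and <code>theta</code> are illustrative.
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    """J(theta) = -sum_i [y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i))]."""
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(theta, X, y):
    """Gradient: -sum_i (y_i - g(theta^T x_i)) x_i, matching the derivation above."""
    p = sigmoid(X @ theta)
    return -X.T @ (y - p)

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
print(cross_entropy(np.zeros(3), X, y), cross_entropy_grad(np.zeros(3), X, y))
</pre>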
 
===Hinge Loss===


==Optimization==
<pre>
initialize parameters
loop until convergence:
     compute gradient of the loss
     update using above gradient
</pre>
;Batch Size
* [https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
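A minimal NumPy sketch of minibatch gradient descent, to make the roles of the batch size and learning rate concrete; the loss here is the squared error from above, and all names and hyperparameter values are illustrative.
<pre>
import numpy as np

def sgd(grad_fn, w0, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: each step uses the gradient of the loss on a random minibatch."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad_fn(w, X[batch], y[batch])
    return w

# Toy usage with the batch-averaged squared-error gradient 2/|B| * X_B^T (X_B w - y_B)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
print(sgd(grad, np.zeros(3), X, y, lr=0.05, batch_size=16, epochs=50))
</pre>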
===Coordinate Block Descent===


===Learning Rate===
===Hessian===
Some optimization methods may rely on the Hessian.<br>
;How to calculate <math>Hv</math><br>
Use finite differencing with directional derivatives:<br>
*<math> H_{\theta} \mathbf{v} = \lim_{\epsilon \rightarrow 0} \frac{g(\theta + \epsilon \mathbf{v}) - g(\theta)}{\epsilon}</math>
;How to calculate <math>H_{\theta}^{-1}v</math>?
Calculate H. Then use gradient descent to minimize <math>\frac{1}{2}x^T H x - v^T x</math>.<br>
By first order optimality, the minimum occurs when the gradient is zero.<br>
* <math>\nabla_{x} (\frac{1}{2}x^T H x - v^T x) = (\frac{1}{2}H + \frac{1}{2}H^T)x - v = Hx - v</math> (since <math>H</math> is symmetric)
* <math>\implies x^* = H^{-1}v</math>
* Using [https://math.stackexchange.com/questions/222894/how-to-take-the-gradient-of-the-quadratic-form Gradient of quadratic form]
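A small NumPy sketch of the finite-difference Hessian-vector product described above, checked against an explicit Hessian on a toy quadratic; the function names are illustrative.
<pre>
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-5):
    """Approximate H(theta) @ v via (g(theta + eps*v) - g(theta)) / eps."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

# Toy check on f(theta) = 0.5 * theta^T A theta, whose gradient is A theta and Hessian is A
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T  # make it symmetric PSD
grad = lambda t: A @ t
theta, v = rng.normal(size=4), rng.normal(size=4)
print(np.allclose(hvp_finite_diff(grad, theta, v), A @ v, atol=1e-3))
</pre>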
==SVM==
[https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
This is a linear classifier, the same as a perceptron, except the goal is not just to classify our data properly but also to maximize the margin.<br>


;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane, so <math>(\frac{w}{\Vert w \Vert})^Tx^{(i)}</math> is the length of the projection of <math>x^{(i)}</math> onto our normal vector.
: This is the distance to our hyperplane.
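A quick NumPy check of this projection/distance computation for a toy hyperplane; the bias term <code>b</code> and the specific numbers are illustrative additions of this sketch.
<pre>
import numpy as np

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane {z : w^T z + b = 0}."""
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5
b = -5.0
x = np.array([2.0, 1.0])
print(signed_distance(w, b, x))  # (3*2 + 4*1 - 5) / 5 = 1.0
</pre>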


\end{aligned}
</math><br>
which is equivalent to, by setting <math>\hat{\gamma}=1</math>,<br>
<math>
\begin{aligned}

</math><br>
where <math>\mathcal{L}(w, \alpha, \beta) = f(w) + \sum \alpha_i g_i(w) + \sum \beta_i h_i(w)</math> is called the Lagrangian.<br>
Since <math>\max \min \leq \min \max</math>,<br>
we have:<br>
<math>
\max_{\alpha, \beta \mid \alpha \geq 0}\min_{w} \mathcal{L}(w, \alpha, \beta) \leq \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
</math><br>
The left term is called the dual problem.<br>
If the solution to the dual problem satisfies some conditions called the KKT conditions, then it is also the solution to the original problem.


===Kernel Trick===
If the data is not linearly separable in its original space, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
\(\DeclareMathOperator{\sign}{sign}\)
Suppose our model is <math>\hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle</math>.
In this case, our model is a linear combination of the training labels \(y\), where <math>\langle x_i, z \rangle</math> represents a similarity between \(z\) and \(x_i\).
Since we only use <math>\langle x, z\rangle</math>, we only need <math>\phi(x)^T\phi(z)</math> to simulate a non-linear preprocessing of the data.

A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.
Ideally, our kernel function should be computable without having to compute <math>\phi</math> explicitly or the full dot product.
 
====Identifying if a function is a kernel====
Basic check:
Since the kernel is an inner-product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>; otherwise it is not a kernel.
 
====Mercer's Theorem====
Let our kernel function be <math>K(z,x)</math>.

Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then<br>
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k (\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
&\geq 0
\end{aligned}
</math>
}}
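A small NumPy sketch illustrating this property empirically: build the Gram matrix of a candidate kernel on random points and check that it is symmetric with non-negative eigenvalues. The RBF kernel used here is just an illustrative choice of a valid kernel.
<pre>
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2), a standard valid kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix

print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # eigenvalues >= 0 up to numerical error
</pre>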
====Common Kernels====
; Polynomial Kernel
* See [[Wikipedia:Polynomial kernel]]
* \(K(x,z) = (c+x^Tz)^d\)
* For \(d=2\) and \(c=1\), we have (see the numerical check after this list)
\[
\begin{align}
(1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\
&= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\
&= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\
&= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2
\end{align}
\]
* The dimension of the feature map associated with this kernel is exponential in d
** There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms
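A quick NumPy check that, for \(d=2\) and \(c=1\), the kernel value matches an explicit feature map; the ordering of features in <code>phi</code> below is one possible illustrative choice.
<pre>
import numpy as np
from itertools import combinations

def poly_kernel(x, z, c=1.0, d=2):
    return (c + x @ z) ** d

def phi(x):
    """Explicit feature map for (1 + x^T z)^2: constant, linear, cross, and square terms."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, cross, x ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(poly_kernel(x, z), phi(x) @ phi(z)))  # True
</pre>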


==Learning Theory==
===Uniform Convergence===
If for all hypotheses <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
Then if a training set is <math>\epsilon / 2</math>-representative,<br>
<math>
L_D(h_S) \leq L_S(h_S) + \epsilon/2 \leq L_S(h) + \epsilon/2 \leq L_D(h) + \epsilon \quad \forall h \in \mathcal{H}
</math><br>
so the hypothesis returned by empirical risk minimization is within <math>\epsilon</math> of the best hypothesis in the class.
===VC Dimension===
[https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/10notes.pdf Some slides]
====Shattering====
A model <math>f</math> parameterized by <math>\theta</math> is said to shatter a set of points <math>\{x_1, ..., x_n\}</math> if for every possible set of binary labellings <math>\{y_1,...,y_n\}</math> there exists <math>\theta</math> such that <math>f</math> makes no errors.
 
====Definition====
Intuitively, the VC dimension measures how complex a hypothesis class is.<br>
Stolen from Wikipedia:<br>
The VC dimension of a model <math>f</math> is the maximum number of points that can be arranged so that <math>f</math> shatters them.
* To show <math>VCdim \leq n</math>, prove that no training set of size <math>n+1</math> can be shattered by our hypothesis class.
* The number of parameters does not necessarily correspond to the VC dimension.
*: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter
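As a toy illustration of shattering, the sketch below checks by brute force whether 1-D threshold classifiers <math>h(x) = \sign(x - \theta)</math> can realize every labelling of a given point set; this hypothesis class and the helper names are illustrative.
<pre>
import numpy as np
from itertools import product

def can_shatter_thresholds(points):
    """Check whether {h(x) = sign(x - t)} realizes every +/-1 labelling of the points."""
    points = np.sort(np.asarray(points, dtype=float))
    thresholds = np.concatenate([[points[0] - 1], (points[:-1] + points[1:]) / 2, [points[-1] + 1]])
    for labels in product([-1, 1], repeat=len(points)):
        if not any(np.array_equal(np.sign(points - t), labels) for t in thresholds):
            return False
    return True

print(can_shatter_thresholds([0.0]))       # True: a single point can always be shattered
print(can_shatter_thresholds([0.0, 1.0]))  # False: the labelling (+1, -1) is impossible, so VCdim = 1
</pre>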


====Theory====
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the Hypothesis class <math>\mathcal{H}</math> is finite, we have with probability <math>1-\delta</math>
* <math>|L_D(h) - L_S(h)| < \sqrt{
\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}
</math>
: where <math>m</math> is the size of the sample.
For all <math>h \in \mathcal{H}</math>,
* <math>|L_D(h) - L_S(h)| < K_1 \sqrt{
\frac{VCdim + K_2 \log(2/\delta)}{2m}}
</math>
: for some constants <math>K_1, K_2</math>
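To make the finite-class bound above concrete, here is a short Python sketch that evaluates <math>\sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}</math> for a few sample sizes; the values of <math>|\mathcal{H}|</math> and <math>\delta</math> are illustrative.
<pre>
import math

def finite_class_bound(h_size, delta, m):
    """Generalization-gap bound for a finite hypothesis class, holding w.p. at least 1 - delta."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2 * m))

for m in (100, 1_000, 10_000):
    print(m, finite_class_bound(h_size=10_000, delta=0.05, m=m))
</pre>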
====Growth Function====
The growth function is the maximum number of ways <math>m</math> examples can be labelled using hypotheses from <math>\mathcal{H}</math>
* <math>\tau_H(m) = \max_{|C| = m} |H_C|</math>
;Notes
* If <math>m \leq VCdim(H)</math>, then <math>\tau_H(m) = 2^m</math>
====Sauer's Lemma====
[https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf Reference]<br>
Beyond the VC dimension, the growth function grows only polynomially in <math>m</math>:
* If <math>VCdim(H)\leq d < \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}</math>
* Also if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>.
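A short Python sketch comparing <math>2^m</math>, the Sauer bound <math>\sum_{i=0}^{d}\binom{m}{i}</math>, and <math>(em/d)^d</math>; the value of <math>d</math> is an arbitrary illustrative choice.
<pre>
import math

def sauer_bound(m, d):
    """sum_{i=0}^{d} C(m, i), an upper bound on the growth function when VCdim(H) <= d."""
    return sum(math.comb(m, i) for i in range(d + 1))

d = 3
for m in (2, 5, 10, 50):
    poly = (math.e * m / d) ** d if m > d + 1 else None
    print(m, 2 ** m, sauer_bound(m, d), poly)
</pre>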


===Bias-Variance Tradeoff===
* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math> and <math>L_S(h)</math> be the empirical loss of hypothesis <math>h</math>.
** Here <math>D</math> is the true distribution and <math>S</math> is the training sample.
* <math>L_D(h_S^*) = L_D(h_D^*) + [L_D(h_S^*) - L_D(h_D^*)]</math>
* The term <math>L_D(h_D^*)</math> is called the bias.
* The term <math>[L_D(h_S^*) - L_D(h_D^*)]</math> is called the variance.
* A larger hypothesis class will have smaller bias but larger variance.
* Overfitting vs. underfitting
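A rough NumPy sketch of the tradeoff: fit low- and high-degree polynomials to noisy samples of a fixed target and compare average train/test error; the target function, noise level, degrees, and sample sizes are all arbitrary illustrative choices.
<pre>
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.sin(2 * np.pi * x)

def avg_errors(degree, n_train=30, trials=200):
    """Average squared train/test error of a degree-d polynomial fit over resampled training sets."""
    x_test = np.linspace(0, 1, 100)
    train_err = test_err = 0.0
    for _ in range(trials):
        x = rng.uniform(0, 1, n_train)
        y = target(x) + 0.2 * rng.normal(size=n_train)
        coef = np.polyfit(x, y, degree)
        train_err += np.mean((np.polyval(coef, x) - y) ** 2)
        test_err += np.mean((np.polyval(coef, x_test) - target(x_test)) ** 2)
    return train_err / trials, test_err / trials

for degree in (1, 3, 9):
    print(degree, avg_errors(degree))  # low degree: underfits (bias); high degree: overfits (variance)
</pre>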
===Rademacher Complexity===
====Definition====
The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.<br>
In this case, we see how well we can fit random noise.<br>
Given a set <math>A \subset \mathbb{R}^m</math>, the Rademacher complexity is:<br>
<math>R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]</math><br>
where each <math>\sigma_i</math> is drawn i.i.d. uniformly from <math>\{-1, 1\}</math>.<br>
Given a sample <math>S=\{z_1,...,z_m\}</math> and a function class <math>F</math>, the empirical Rademacher complexity is:<br>
<math>R(F \circ S)</math><br>
where <math>F \circ S = \{(f(z_1),...,f(z_m)) \mid f \in F\}</math><br>
;Notes
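* The sketch below gives a Monte Carlo estimate of the empirical Rademacher complexity for a small finite function class; the class and sample size are illustrative.
<pre>
import numpy as np

def empirical_rademacher(F_on_S, n_draws=10_000, seed=0):
    """Monte Carlo estimate of (1/m) E_sigma[ sup_{a in F∘S} sum_i sigma_i a_i ]."""
    rng = np.random.default_rng(seed)
    A = np.asarray(F_on_S, dtype=float)   # shape (num_functions, m)
    m = A.shape[1]
    sigma = rng.choice([-1, 1], size=(n_draws, m))
    return np.mean(np.max(sigma @ A.T, axis=1)) / m

# F∘S for a toy class of 3 functions evaluated on m = 5 sample points
F_on_S = [[1, 1, 1, 1, 1],
          [-1, -1, -1, -1, -1],
          [1, -1, 1, -1, 1]]
print(empirical_rademacher(F_on_S))
</pre>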
===Concentration Bounds===
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent and bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
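A quick NumPy simulation comparing the empirical deviation frequency with the Hoeffding bound for Bernoulli samples (bounded in <math>[0,1]</math>); the sample size, threshold <math>t</math>, and success probability are illustrative.
<pre>
import numpy as np

rng = np.random.default_rng(0)
n, t, p, trials = 100, 0.1, 0.3, 20_000

X = rng.random((trials, n)) < p          # Bernoulli(p) samples, bounded in [0, 1]
deviations = np.abs(X.mean(axis=1) - p)

empirical = np.mean(deviations >= t)
hoeffding = 2 * np.exp(-2 * n * t ** 2)  # with (b - a) = 1
print(empirical, hoeffding)              # the empirical frequency should not exceed the bound
</pre>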
==See Also==
* [[Supervised Learning]]
* [[Unsupervised Learning]]
* [[Deep Learning]]