Machine Learning: Difference between revisions

 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
Machine Learning
Machine Learning
Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.


==Loss functions==
==Loss functions==
Line 21: Line 23:
The cross entropy loss is
The cross entropy loss is
* <math>J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]</math>
* <math>J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]</math>
;Notes
;Notes
* This is the sum of the log probabilities of picking the correct class (i.e. p if y=1 or 1-p if y=0).
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex


Line 89: Line 95:


==SVM==
==SVM==
[http://cs229.stanford.edu/notes/cs229-notes3.pdf Andrew Ng Notes]<br>
[https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
Support Vector Machine<br>
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.<br>
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.<br>
Line 106: Line 112:


;Notes
;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane so <math>\frac{w}{\Vert w \Vert})^Tx^{(i)}</math> is the length of the projection of x onto our normal vector.  
* <math>\mathbf{w}</math> is the normal vector of our hyperplane so \((\frac{w}{\Vert w \Vert})^Tx^{(i)}\) is the length of the projection of x onto our normal vector.  
: This is the distance to our hyperplane.
: This is the distance to our hyperplane.


Line 149: Line 155:
In this case, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
In this case, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
If our original model and training only used <math>\langle x, z\rangle</math> then we only need <math>\phi(x)^T\phi(z)</math><br>
 
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math><br>
\(\DeclareMathOperator{\sign}{sign}\)
Suppose our model is <math>\hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle</math>. 
In this case, our model is a linear combination of the training data \(y\) where <math>\langle x_i, z \rangle</math> represents a similarity between \(z\) and \(x_i\). 
Since we only use <math>\langle x, z\rangle</math> then we only need <math>\phi(x)^T\phi(z)</math> to simulate a non-linear processing of the data.
 
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.
Ideally, our kernel function should be able to be computed without having to compute the actual <math>\phi</math> and the full dot product.
 
====Identifying if a function is a kernel====
====Identifying if a function is a kernel====
Basic check:
Basic check:
Since the kernel is an inner-product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>, otherwise it is not a kernel.<br>
Since the kernel is an inner-product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>, otherwise it is not a kernel.
 
====Mercer's Theorem====
====Mercer's Theorem====
Let our kernel function be <math>K(z,x)</math>.
Let our kernel function be <math>K(z,x)</math>.
Line 186: Line 200:
====Common Kernels====
====Common Kernels====
; Polynomial Kernel
; Polynomial Kernel
* See [[wikipedia:Polynomial kernel]]
* See [[Wikipedia:Polynomial kernel]]
* <math>K(x,z) = (c+x^Tz)^d</math>
* \(K(x,z) = (c+x^Tz)^d\)
* For <math>d=2</math>
* For \(d=2\), we have  
** we have <math>(1+x^Tz)^2 = 1 + 2(x^Tz) + (x^Tz)^2</math>
\[
** <math>= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)</math>
\begin{align}
** <math>= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2</math>
(1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\
** <math>= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2</math>
&= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\
&= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\
&= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2
\end{align}
\]
* The dimension of the feature map associated with this kernel is exponential in d
* The dimension of the feature map associated with this kernel is exponential in d
** There are <math>1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)</math> terms
** There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms


==Learning Theory==
==Learning Theory==
Line 227: Line 245:
* To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
* To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
* Number of parameters do not necessarily correspond to VC dimension.  
* Number of parameters do not necessarily correspond to VC dimension.  
: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter
*: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter


====Theory====
====Theory====
Line 254: Line 272:


===Bias-Variance Tradeoff===
===Bias-Variance Tradeoff===
* Let <math>L_D(h)</math> be the true loss of hypothesis h
* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math> and <math>L_s(h)</math> be the empirical loss of hypothesis <math>h</math>.
: and <math>L_S(h)</math> be the true loss of hypothesis h
** Here D is the true distribution and s is the training sample.
* <math>L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]</math>
* <math>L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]</math>
* The term <math>L_D(h_D^*)</math> is called the bias
* The term <math>L_D(h_D^*)</math> is called the bias
Line 278: Line 296:
Let <math>X_1,...,X_n</math> be bounded in (a,b)<br>
Let <math>X_1,...,X_n</math> be bounded in (a,b)<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
==See Also==
* [[Supervised Learning]]
* [[Unsupervised Learning]]
* [[Deep Learning]]