Machine Learning
Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.


==Loss functions==
The cross-entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* This is the negative of the sum of the log probabilities of picking the correct class (i.e. p if y=1 or 1-p if y=0).
* If our model is <math>h_\theta(x^{(i)}) = g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math>, then this loss is convex in <math>\theta</math>.
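The following is a minimal numerical sketch (not from the course notes) of this loss and the gradient derived in the proof below, assuming NumPy and a made-up toy dataset:
<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    # g(z) = e^z / (1 + e^z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # J(theta) = -sum[ y*log(h) + (1-y)*log(1-h) ],  h = g(theta^T x)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # From the derivation below: grad J = -sum[(y - g(theta^T x)) x] = X^T (h - y)
    h = sigmoid(X @ theta)
    return X.T @ (h - y)

# Hypothetical toy data: 4 points in R^2 with a bias column
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7],
              [1.0, 0.1, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.zeros(3)

# A few steps of gradient descent on the (convex) loss
for _ in range(100):
    theta -= 0.1 * gradient(theta, X, y)
print(cross_entropy_loss(theta, X, y))
</syntaxhighlight>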
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{aligned}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)}) g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left(-\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\right)\\
&= \sum g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
\end{aligned}
</math><br>
which is a positive semi-definite matrix.


===Learning Rate===
===Hessian===
Some optimization methods may rely on the Hessian.<br>
;How to calculate <math>Hv</math><br>
Use finite differencing with directional derivatives, where <math>g(\theta) = \nabla_\theta J(\theta)</math> denotes the gradient:<br>
*<math> H_{\theta} \mathbf{v} = \lim_{\epsilon \rightarrow 0} \frac{g(\theta + \epsilon \mathbf{v}) - g(\theta)}{\epsilon}</math>
;How to calculate <math>H_{\theta}^{-1}v</math>?
Use gradient descent to minimize <math>\frac{1}{2}x^T H x - v^T x</math>, computing <math>Hx</math> as above at each step.<br>
By first-order optimality, the minimum occurs when the gradient is zero.<br>
* <math>\nabla_{x} (\frac{1}{2}x^T H x - v^T x) = (\frac{1}{2}H + \frac{1}{2}H^T)x - v = Hx - v</math> since <math>H</math> is symmetric
* <math>\implies x^* = H^{-1}v</math>
* Using [https://math.stackexchange.com/questions/222894/how-to-take-the-gradient-of-the-quadratic-form Gradient of quadratic form]
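Below is a rough sketch of both computations, assuming NumPy; <code>grad</code> stands for any function that returns <math>\nabla_\theta J(\theta)</math>, and plain gradient descent is used in place of a more typical conjugate-gradient solver:
<syntaxhighlight lang="python">
import numpy as np

def hvp(grad, theta, v, eps=1e-5):
    # Finite-difference directional derivative of the gradient:
    # H v ~= (g(theta + eps*v) - g(theta)) / eps
    return (grad(theta + eps * v) - grad(theta)) / eps

def inverse_hvp(grad, theta, v, lr=0.01, steps=1000):
    # Minimize f(x) = 1/2 x^T H x - v^T x with gradient descent.
    # grad f(x) = H x - v, so the minimizer satisfies H x* = v, i.e. x* = H^{-1} v.
    x = np.zeros_like(v)
    for _ in range(steps):
        x -= lr * (hvp(grad, theta, x) - v)
    return x

# Hypothetical example: J(theta) = 1/2 theta^T A theta with a fixed PSD matrix A,
# so the true Hessian is A and inverse_hvp should approximately solve A x = v.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda theta: A @ theta
theta = np.array([0.5, -0.3])
v = np.array([1.0, 1.0])

print(hvp(grad, theta, v))          # ~= A v
print(inverse_hvp(grad, theta, v))  # ~= np.linalg.solve(A, v)
</syntaxhighlight>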
==SVM==
[https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
This is a linear classifier, like the perceptron, except the goal is not just to classify our data correctly but also to maximize the margin.<br>


;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane, so <math>\left(\frac{w}{\Vert w \Vert}\right)^Tx^{(i)}</math> is the length of the projection of <math>x^{(i)}</math> onto our normal vector.
: This is the distance to our hyperplane (see the sketch below).
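A minimal numerical sketch of this projection, assuming NumPy and a hyperplane through the origin <math>w^Tx=0</math> with a made-up normal vector:
<syntaxhighlight lang="python">
import numpy as np

def distance_to_hyperplane(w, x):
    # Project x onto the unit normal w / ||w||; for the hyperplane w^T x = 0
    # this projection length is the (signed) distance from x to the hyperplane.
    return (w / np.linalg.norm(w)) @ x

w = np.array([3.0, 4.0])   # hypothetical normal vector, ||w|| = 5
x = np.array([2.0, 1.0])

print(distance_to_hyperplane(w, x))  # (3*2 + 4*1) / 5 = 2.0
</syntaxhighlight>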


In this case, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
\(\DeclareMathOperator{\sign}{sign}\)
Suppose our model is <math>\hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle</math>. 
In this case, our model is a linear combination of the training labels \(y_i\), where <math>\langle x_i, z \rangle</math> represents a similarity between \(z\) and \(x_i\).
Since the model only uses <math>\langle x, z\rangle</math>, we only need <math>\phi(x)^T\phi(z)</math> to simulate a non-linear preprocessing of the data.
 
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.
Ideally, the kernel can be computed without explicitly computing <math>\phi</math> or the full dot product.
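As an illustration, here is a toy sketch of the prediction rule above, assuming NumPy and made-up weights and training points; any valid kernel (such as the polynomial kernel discussed below) can be plugged in for the inner product:
<syntaxhighlight lang="python">
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = <x, z>, i.e. no feature transformation at all
    return x @ z

def poly_kernel(x, z, c=1.0, d=2):
    # K(x, z) = (c + x^T z)^d, the polynomial kernel discussed below
    return (c + x @ z) ** d

def predict(z, X, y, w, kernel):
    # y_hat = sign( sum_i w_i y_i K(x_i, z) )
    score = sum(w_i * y_i * kernel(x_i, z) for w_i, y_i, x_i in zip(w, y, X))
    return np.sign(score)

# Hypothetical training points, labels in {-1, +1}, and learned weights
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.5, 0.5])

z = np.array([0.2, 0.9])
print(predict(z, X, y, w, linear_kernel))
print(predict(z, X, y, w, poly_kernel))
</syntaxhighlight>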
 
====Identifying if a function is a kernel====
Basic check:
Since the kernel is an inner product between <math>\phi(x)</math> and <math>\phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>; otherwise it is not a kernel.
 
====Mercer's Theorem====
Let our kernel function be <math>K(z,x)</math>.
Positive semi-definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then<br>
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \left(\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)})\right)^2\\
&\geq 0
\end{aligned}
</math>
}}
====Common Kernels====
; Polynomial Kernel
* See [[Wikipedia:Polynomial kernel]]
* \(K(x,z) = (c+x^Tz)^d\)
* For \(d=2\), we have
\[
\begin{align}
(1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\
&= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\
&= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\
&= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2
\end{align}
\]
* The dimension of the feature map associated with this kernel is exponential in d
** There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms
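A quick numerical check of the <math>d=2</math> expansion above, assuming NumPy; <code>phi</code> is the explicit feature map read off from the last line of the derivation:
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def phi(x):
    # Explicit feature map for K(x, z) = (1 + x^T z)^2, following the expansion above
    n = len(x)
    features = [1.0]
    features += [np.sqrt(2) * x[i] for i in range(n)]
    features += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(n), 2)]
    features += [x[i] ** 2 for i in range(n)]
    return np.array(features)

x = np.array([0.3, -1.2, 2.0])
z = np.array([1.5, 0.4, -0.8])

lhs = (1 + x @ z) ** 2       # kernel: O(n) work
rhs = phi(x) @ phi(z)        # explicit feature map: O(n^2) features for d = 2
print(np.isclose(lhs, rhs))  # True
</syntaxhighlight>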


==Learning Theory==
===Uniform Convergence===
If for all hypotheses <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
Then if a training set is <math>\epsilon / 2</math>-representative,<br>
<math>
L_D(h_S) \leq L_S(h_S) + \epsilon/2 \leq L_S(h) + \epsilon/2 \leq L_D(h) + \epsilon \quad \forall h \in \mathcal{H}
</math>
* To show <math>VCdim \leq n</math>, prove that no training set of size <math>n+1</math> can be shattered by our hypothesis class.
* The number of parameters does not necessarily correspond to the VC dimension.
*: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter.


====Theory====
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the hypothesis class <math>\mathcal{H}</math> is finite, we have with probability <math>1-\delta</math>
* <math>|L_D(h) - L_S(h)| < \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}</math>
A similar bound, with <math>\log|\mathcal{H}|</math> replaced by the VC dimension, holds for infinite hypothesis classes
: for some constants <math>K_1, K_2</math>
====Growth Function====
The growth function is the maximum number of ways <math>m</math> examples can be labelled using hypotheses from <math>\mathcal{H}</math>
* <math>\tau_H(m) = \max_{|C| = m} |H_C|</math>
;Notes
* If <math>m \leq VCdim(H)</math>, then <math>\tau_H(m) = 2^m</math>
====Sauer's Lemma====
[https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf Reference]<br>
Beyond the VC dimension, the growth function grows only polynomially in <math>m</math> (see the numeric check below):
* If <math>VCdim(H) \leq d < \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}</math>
* Also if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>.
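For concreteness, a small numerical comparison of the two bounds for a hypothetical class with <math>VCdim(H)=3</math> and <math>m=20</math> samples:
<syntaxhighlight lang="python">
from math import comb, e

# Hypothetical VC dimension d and sample size m
d, m = 3, 20

sauer_bound = sum(comb(m, i) for i in range(d + 1))  # sum_{i=0}^{d} C(m, i) = 1351
poly_bound = (e * m / d) ** d                        # (em/d)^d, approximately 5951

# Both are far below the 2^m = 1048576 labelings an unrestricted class could produce
print(sauer_bound, poly_bound, 2 ** m)
</syntaxhighlight>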


===Bias-Variance Tradeoff===
* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math> and <math>L_S(h)</math> be the empirical loss of hypothesis <math>h</math>.
** Here <math>D</math> is the true distribution and <math>S</math> is the training sample.
* <math>L_D(h_S^*) = L_D(h_D^*) + [L_D(h_S^*) - L_D(h_D^*)]</math>
* The term <math>L_D(h_D^*)</math> is called the bias
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
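A quick empirical sanity check of the bound, assuming NumPy, with <math>X_i \sim \text{Uniform}(0,1)</math> so that <math>[a,b]=[0,1]</math>:
<syntaxhighlight lang="python">
import numpy as np

# Empirical check of Hoeffding's inequality for X_i ~ Uniform(0, 1), i.e. [a, b] = [0, 1]
rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000

samples = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)    # |X_bar - E[X_bar]| with E[X_bar] = 1/2

empirical = np.mean(deviations >= t)               # observed frequency of large deviations
bound = 2 * np.exp(-2 * n * t**2 / (1 - 0) ** 2)   # 2 exp(-2 n t^2 / (b-a)^2), about 0.271

print(empirical, bound)  # the empirical frequency stays below the Hoeffding bound
</syntaxhighlight>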
==See Also==
* [[Supervised Learning]]
* [[Unsupervised Learning]]
* [[Deep Learning]]