Machine Learning
==Hyperparameters==
===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]


==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
If our model is linear regression <math>h(x)=w^tx</math> then this is convex.<br>
{{hidden|Proof|
We show that the Hessian is positive semi-definite.<br>
<math>\nabla^2_\theta J(\theta) = \nabla^2_\theta \sum (\theta^T x^{(i)} - y^{(i)})^2 = \sum 2 x^{(i)} (x^{(i)})^T</math><br>
which is a PSD matrix.
}}
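A minimal numerical sketch of this loss and its gradient, assuming NumPy and a linear model <math>h_\theta(x)=\theta^T x</math> (the function names here are illustrative, not from any particular library):
<pre>
import numpy as np

def squared_error_loss(theta, X, y):
    """Sum of squared errors J(theta) for a linear model h(x) = theta^T x."""
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for each sample
    return np.sum(residuals ** 2)

def squared_error_grad(theta, X, y):
    """Gradient of J(theta): 2 * sum (theta^T x^(i) - y^(i)) x^(i)."""
    return 2 * X.T @ (X @ theta - y)

# Toy check: the gradient vanishes at the least-squares solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(squared_error_loss(theta_star, X, y))
print(np.linalg.norm(squared_error_grad(theta_star, X, y)))  # ~0
</pre>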


===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\nabla_\theta J(\theta) = -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]
</math><br>
<math>
= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
\implies \nabla^2_\theta J(\theta) = \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
= \sum_i g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T
</math><br>
which is a PSD matrix
}}
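The last line of the derivation, <math>\nabla_\theta J(\theta) = -\sum (y^{(i)} - g(\theta^t x^{(i)}))x^{(i)}</math>, translates directly into code. A small sketch assuming NumPy (names are illustrative), with a finite-difference check of the gradient:
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    """J(theta) = -sum[ y log g(theta^T x) + (1-y) log(1 - g(theta^T x)) ]."""
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(theta, X, y):
    """Gradient from the derivation above: -sum (y - g(theta^T x)) x."""
    return -X.T @ (y - sigmoid(X @ theta))

# Finite-difference check of the gradient on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)
eps = 1e-6
num_grad = np.array([
    (cross_entropy_loss(theta + eps * e, X, y) -
     cross_entropy_loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, cross_entropy_grad(theta, X, y), atol=1e-4))  # True
</pre>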
===Hinge Loss===


     update using above gradient
</pre>
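Written out concretely, the loop above might look as follows. This is only a sketch of minibatch SGD for the cross-entropy gradient derived earlier, assuming NumPy; the learning rate, batch size, and epoch count are arbitrary illustrative values.
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Minibatch SGD on the cross-entropy loss with a linear model."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # sample a minibatch
            grad = -X[idx].T @ (y[idx] - sigmoid(X[idx] @ theta))
            theta -= lr * grad / len(idx)          # update using above gradient
    return theta

# Example usage on toy separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print(sgd(X, y))
</pre>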
===Coordinate Block Descent===


\end{aligned}
</math><br>
which is equivalent to, by setting <math>\hat{\gamma}=1</math>,<br>
<math>
\begin{aligned}
\min_{w,b}\ &\frac{1}{2}\Vert w \Vert^2\\
\text{s.t. } & y^{(i)}(w^T x^{(i)} + b) \geq 1 \quad \forall i
\end{aligned}
</math>
Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then <br>
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \left(\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)})\right)^2\\
&\geq 0
\end{aligned}
</math>
}}
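A quick numerical illustration of this fact, assuming NumPy (the feature map below is an arbitrary example, not from the text): build the Gram matrix <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math> and check that its eigenvalues are non-negative, which is equivalent to <math>\mathbf{v}^T \mathbf{K} \mathbf{v} \geq 0</math> for all <math>\mathbf{v}</math>.
<pre>
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                 # 10 points in R^3

def phi(x):
    """An arbitrary explicit feature map (for illustration only)."""
    return np.concatenate([x, x ** 2, [1.0]])

Phi = np.array([phi(x) for x in X])          # rows are phi(x^(i))
K = Phi @ Phi.T                              # K_ij = phi(x^(i))^T phi(x^(j))

eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)               # True: K is PSD (up to round-off)
</pre>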
====Common Kernels====
; Polynomial Kernel
* See [[wikipedia:Polynomial kernel]]
* <math>K(x,z) = (c+x^Tz)^d</math>
* For <math>d=2</math>
** we have <math>(1+x^Tz)^2 = 1 + 2(x^Tz) + (x^Tz)^2</math>
** <math>= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)</math>
** <math>= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2</math>
** <math>= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2</math>
* The dimension of the feature map associated with this kernel is exponential in d
** There are <math>1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)</math> terms
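The <math>d=2</math> expansion above can be checked numerically: the kernel value <math>(1+x^Tz)^2</math> equals the inner product of the explicit feature maps. A sketch assuming NumPy (<code>phi2</code> is an illustrative name, not a standard function):
<pre>
import numpy as np
from itertools import combinations

def poly_kernel(x, z, c=1.0, d=2):
    return (c + x @ z) ** d

def phi2(x):
    """Explicit feature map for (1 + x^T z)^2, matching the expansion above."""
    return np.concatenate([
        [1.0],
        np.sqrt(2) * x,                                   # sqrt(2) x_i terms
        [np.sqrt(2) * x[i] * x[j]
         for i, j in combinations(range(len(x)), 2)],     # sqrt(2) x_i x_j, i < j
        x ** 2,                                           # x_i^2 terms
    ])

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, -1.0, 3.0])
print(poly_kernel(x, z), phi2(x) @ phi2(z))  # both equal (1 + x^T z)^2
</pre>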


==Learning Theory==
===VC Dimension===
[https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/10notes.pdf Some slides]
====Shattering====
A model <math>f</math> parameterized by <math>\theta</math> is said to shatter a set of points <math>\{x_1, ..., x_n\}</math> if, for every possible set of binary labellings <math>\{y_1,...,y_n\}</math>, there exists a <math>\theta</math> such that <math>f</math> makes no errors.
 
====Definition====
Intuitively, the VC dimension of a hypothesis class measures how complex its models can be.<br>


====Theory====
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the hypothesis class <math>\mathcal{H}</math> is finite, with probability at least <math>1-\delta</math>, for all <math>h \in \mathcal{H}</math> we have
* <math>|L_D(h) - L_S(h)| < \sqrt{
\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}
</math>
: where <math>m</math> is the size of the sample.
For all <math>h \in \mathcal{H}</math>, with probability at least <math>1-\delta</math>,
* <math>|L_D(h) - L_S(h)| < K_1 \sqrt{
\frac{VCdim + K_2 \log(2/\delta)}{2m}}
</math>
: for some constants <math>K_1, K_2</math>
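These bounds are easy to evaluate numerically to get a feel for the sample sizes involved. A sketch assuming NumPy; since the constants <math>K_1, K_2</math> are unspecified above, they are set to 1 purely for illustration.
<pre>
import numpy as np

def finite_class_bound(H_size, m, delta):
    """Uniform bound for a finite hypothesis class."""
    return np.sqrt((np.log(H_size) + np.log(1 / delta)) / (2 * m))

def vc_bound(vcdim, m, delta, K1=1.0, K2=1.0):
    """VC-dimension bound; K1 and K2 set to 1 for illustration only."""
    return K1 * np.sqrt((vcdim + K2 * np.log(2 / delta)) / (2 * m))

for m in [100, 1_000, 10_000]:
    print(m,
          finite_class_bound(H_size=1e6, m=m, delta=0.05),
          vc_bound(vcdim=10, m=m, delta=0.05))
</pre>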
====Growth Function====
The growth function is the maximum number of ways <math>m</math> examples can be labelled using hypotheses from <math>\mathcal{H}</math>
* <math>\tau_H(m) = \max_{|C| = m} |H_C|</math>
;Notes
* If <math>m \leq VCdim(H)</math>, then <math>\tau_H(m) = 2^m</math>
====Sauer's Lemma====
[https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf Reference]<br>
Beyond the VC dimension, the growth function grows only polynomially:
* If <math>VCdim(H)\leq d < \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}</math>
* Also if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>.
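A small numeric check of the lemma using only Python's standard library: compare <math>2^m</math> and the two upper bounds for a few values of <math>m</math> and <math>d</math>.
<pre>
from math import comb, e

def sauer_bound(m, d):
    """sum_{i=0}^{d} C(m, i)"""
    return sum(comb(m, i) for i in range(d + 1))

def poly_bound(m, d):
    """(e*m/d)^d, valid for m > d + 1"""
    return (e * m / d) ** d

for m, d in [(10, 3), (50, 3), (100, 5)]:
    print(m, d, 2 ** m, sauer_bound(m, d), poly_bound(m, d))
    assert sauer_bound(m, d) <= poly_bound(m, d)
</pre>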


===Bias-Variance Tradeoff===
===Rademacher Complexity===
====Definition====
The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.<br>
In this case, we see how well we can fit random noise.<br>
Given a set <math>A \subset \mathbb{R}^m</math> the Rademacher complexity is:<br>
<math>R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]</math><br>
where each <math>\sigma_i</math> is drawn independently from a discrete uniform distribution on <math>\{-1, 1\}</math><br>
Given a sample <math>S=\{z_1,...,z_m\}</math> and a function class <math>F</math>, the empirical Rademacher complexity is:<br>
<math>R(F \circ S)</math><br>
where <math>F \circ S = \{(f(z_1),...,f(z_m)) \mid f \in F\}</math><br>
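The expectation over <math>\sigma</math> can be estimated by Monte Carlo. A sketch assuming NumPy, for a finite function class represented simply by the matrix of its values <math>F \circ S</math> on the sample (the function and variable names are illustrative):
<pre>
import numpy as np

def empirical_rademacher(values, n_trials=10_000, seed=0):
    """Estimate R(A) = (1/m) E_sigma[ sup_{a in A} sum_i sigma_i a_i ].

    `values` is a (num_functions, m) matrix whose rows are the vectors
    (f(z_1), ..., f(z_m)) for each f in the class, i.e. the set F o S.
    """
    rng = np.random.default_rng(seed)
    m = values.shape[1]
    sigmas = rng.choice([-1.0, 1.0], size=(n_trials, m))  # Rademacher draws
    sups = (sigmas @ values.T).max(axis=1)                 # sup over the class
    return sups.mean() / m

# Example: 20 random {-1,+1}-valued "hypotheses" evaluated on m = 50 points.
rng = np.random.default_rng(1)
F_S = rng.choice([-1.0, 1.0], size=(20, 50))
print(empirical_rademacher(F_S))
</pre>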
===Concentration Bounds===
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
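A quick simulation makes the bound concrete, assuming NumPy and using Bernoulli(0.5) variables bounded in <math>[0,1]</math> as an example:
<pre>
import numpy as np

# Compare the empirical tail probability P(|X_bar - E X_bar| >= t)
# with Hoeffding's bound 2 exp(-2 n t^2 / (b-a)^2) for Bernoulli(0.5) samples.
rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000
a, b = 0.0, 1.0

samples = rng.random(size=(trials, n)) < 0.5   # Bernoulli(0.5), values in {0, 1}
deviations = np.abs(samples.mean(axis=1) - 0.5)
empirical = (deviations >= t).mean()
bound = 2 * np.exp(-2 * n * t ** 2 / (b - a) ** 2)
print(empirical, bound)   # the empirical probability should not exceed the bound
</pre>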