Machine Learning: Difference between revisions

From David's Wiki
 
(41 intermediate revisions by the same user not shown)
Line 1: Line 1:
Machine Learning
Machine Learning
Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.


==Loss functions==
==Loss functions==
Line 21: Line 23:
The cross entropy loss is
The cross entropy loss is
* <math>J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]</math>
* <math>J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]</math>
;Notes
;Notes
* This is the sum of the log probabilities of picking the correct class (i.e. p if y=1 or 1-p if y=0).
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
{{hidden | Proof |
{{hidden | Proof |
We show that the Hessian is positive semi definite.<br>
<math>
<math>
\begin{aligned}
\nabla_\theta J(\theta) = -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\
</math><br>
&= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\
<math>
&= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\
= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]
&= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\
</math><br>
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
<math>
\implies \nabla^2_\theta J(\theta) &= \nabla_{\theta} -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]
&= \sum g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T \quad
</math><br>
\end{aligned}
<math>
= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
\implies \nabla^2_\theta J(\theta) = \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]
</math><br>
<math>
= \sum_i g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T
</math><br>
</math><br>
which is a PSD matrix
which is a PSD matrix
Line 62: Line 80:


===Learning Rate===
===Learning Rate===
===Hessian===
Some optimization methods may rely on the Hessian.<br>
;How to calculate <math>Hv</math><br>
Use finite differencing with directional derivatives:<br>
*<math> H_{\theta} \mathbf{v} = \lim_{\epsilon \rightarrow 0} \frac{g(\theta + \epsilon \mathbf{v}) - g(\theta)}{\epsilon}</math>
;How to calculate <math>H_{\theta}^{-1}v</math>?
Calculate H. Then use gradient descent to minimize <math>\frac{1}{2}x^T H x - v^T x</math>.<br>
By first order optimality, the minimum occurs when the gradient is zero.<br>
* <math>\nabla_{x} (\frac{1}{2}x^T H x - v^T x) = (\frac{1}{2}H + \frac{1}{2}H^T)x - v^T = Hx - v^T</math>
* <math>\implies x^* = H^{-1}v</math>
* Using [https://math.stackexchange.com/questions/222894/how-to-take-the-gradient-of-the-quadratic-form Gradient of quadratic form]
==SVM==
==SVM==
[http://cs229.stanford.edu/notes/cs229-notes3.pdf Andrew Ng Notes]<br>
[https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
Support Vector Machine<br>
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.<br>
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.<br>
Line 80: Line 112:


;Notes
;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane so <math>\frac{w}{\Vert w \Vert})^Tx^{(i)}</math> is the length of the projection of x onto our normal vector.  
* <math>\mathbf{w}</math> is the normal vector of our hyperplane so \((\frac{w}{\Vert w \Vert})^Tx^{(i)}\) is the length of the projection of x onto our normal vector.  
: This is the distance to our hyperplane.
: This is the distance to our hyperplane.


Line 123: Line 155:
In this case, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
In this case, one way to get around this is to perform a non-linear preprocessing of the data <math>\phi(x)</math>.<br>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
For example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>
If our original model and training only used <math>\langle x, z\rangle</math> then we only need <math>\phi(x)^T\phi(z)</math><br>
 
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math><br>
\(\DeclareMathOperator{\sign}{sign}\)
Suppose our model is <math>\hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle</math>. 
In this case, our model is a linear combination of the training data \(y\) where <math>\langle x_i, z \rangle</math> represents a similarity between \(z\) and \(x_i\). 
Since we only use <math>\langle x, z\rangle</math> then we only need <math>\phi(x)^T\phi(z)</math> to simulate a non-linear processing of the data.
 
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.
Ideally, our kernel function should be able to be computed without having to compute the actual <math>\phi</math> and the full dot product.
 
====Identifying if a function is a kernel====
====Identifying if a function is a kernel====
Basic check:
Basic check:
Since the kernel is an inner-product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>, otherwise it is not a kernel.<br>
Since the kernel is an inner-product between <math>\phi(x), \phi(z)</math>, it should satisfy the axioms of inner products, namely <math>K(x,z)=K(z,x)</math>, otherwise it is not a kernel.
 
====Mercer's Theorem====
====Mercer's Theorem====
Let our kernel function be <math>K(z,x)</math>.
Let our kernel function be <math>K(z,x)</math>.
Line 135: Line 175:
Positive Definite:<br>
Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then<br>
Then <br>
<math>
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}= \mathbf{v}^T [\sum_j K_{ij}v_j]
\mathbf{v}^T \mathbf{K} \mathbf{v}
= \sum_i \sum_j v_{i}K_{ij}v_{j}
&= v^T [\sum_j K_{ij}v_j]\\
</math><br>
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
<math>
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
</math><br>
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
<math>
&= \sum_k \sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}
&= \sum_k (\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
</math><br>
&\geq 0
<math>
\end{aligned}
= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_k \sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}
</math><br>
<math>
= \sum_k (\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}))^2
\geq 0
</math>
</math>
}}
}}
====Common Kernels====
; Polynomial Kernel
* See [[Wikipedia:Polynomial kernel]]
* \(K(x,z) = (c+x^Tz)^d\)
* For \(d=2\), we have
\[
\begin{align}
(1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\
&= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\
&= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\
&= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2
\end{align}
\]
* The dimension of the feature map associated with this kernel is exponential in d
** There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms


==Learning Theory==
==Learning Theory==
Line 158: Line 221:
===Uniform Convergence===
===Uniform Convergence===
If for all hypothesis <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
If for all hypothesis <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
Then  
Then if a training set is <math>\epsilon / 2</math>-representative,<br>
<math>
<math>
L_D(h_s)  
L_D(h_s)  
Line 182: Line 245:
* To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
* To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
* Number of parameters do not necessarily correspond to VC dimension.  
* Number of parameters do not necessarily correspond to VC dimension.  
: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter
*: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with one parameter


====Theory====
====Theory====
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the Hypothesis class <math>\mathcal{H}</math> is finite, we have
In the case where the Hypothesis class <math>\mathcal{H}</math> is finite, we have with probability <math>1-\delta</math>
* <math>|L_D(h) - L_S(h)| < \sqrt{
* <math>|L_D(h) - L_S(h)| < \sqrt{
\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}
\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}
Line 196: Line 259:
</math>
</math>
: for some constants <math>K_1, K_2</math>
: for some constants <math>K_1, K_2</math>
====Growth Function====
The growth function is maximum number of ways <math>m</math> examples can be labelled using hypotheses from <math>\mathcal{H}</math>
* <math>\tau_H(m) = \max_{|C| = m} |H_C|</math>
;Notes
* If <math>m \leq VCdim(H)</math>, then <math>\tau_H(m) = 2^m</math>
====Sauer's Lemma====
[https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf Reference]<br>
After the VCdim, the growth function grows as a polynomial
* If <math>VCdim(H)\leq d \leq \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{n}{i}</math>
* Also if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>.


===Bias-Variance Tradeoff===
===Bias-Variance Tradeoff===
* Let <math>L_D(h)</math> be the true loss of hypothesis h
* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math> and <math>L_s(h)</math> be the empirical loss of hypothesis <math>h</math>.
: and <math>L_S(h)</math> be the true loss of hypothesis h
** Here D is the true distribution and s is the training sample.
* <math>L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]</math>
* <math>L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]</math>
* The term <math>L_D(h_D^*)</math> is called the bias
* The term <math>L_D(h_D^*)</math> is called the bias
Line 222: Line 296:
Let <math>X_1,...,X_n</math> be bounded in (a,b)<br>
Let <math>X_1,...,X_n</math> be bounded in (a,b)<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
==See Also==
* [[Supervised Learning]]
* [[Unsupervised Learning]]
* [[Deep Learning]]

Latest revision as of 17:23, 19 April 2024

Machine Learning

Here are some selected topics from the ML classes CMSC422 and CMSC726 at UMD.

Loss functions

(Mean) Squared Error

The squared error is:
\(\displaystyle J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2\)
If our model is linear regression \(\displaystyle h(x)=w^tx\) then this is convex.

Proof

\(\displaystyle \begin{aligned} \nabla_{w} J(w) &= \nabla_{w} \sum (w^tx^{(i)} - y^{(i)})^2\\ &= 2\sum (w^t x^{(i)} - y^{(i)})x \\ \implies \nabla_{w}^2 J(w) &= \nabla 2\sum (w^T x^{(i)} - y^{(i)})x^{(i)}\\ &= 2 \sum x^{(i)}(x^{(i)})^T \end{aligned} \)
so the hessian is positive semi-definite

Cross Entropy

The cross entropy loss is

  • \(\displaystyle J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]\)


Notes
  • This is the sum of the log probabilities of picking the correct class (i.e. p if y=1 or 1-p if y=0).
  • If our model is \(\displaystyle g(\theta^Tx^{(i)})\) where \(\displaystyle g(x)\) is the sigmoid function \(\displaystyle \frac{e^x}{1+e^x}\) then this is convex
Proof

We show that the Hessian is positive semi definite.
\(\displaystyle \nabla_\theta J(\theta) = -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))] \)
\(\displaystyle = -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}] \)
\(\displaystyle = -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}] \)
\(\displaystyle = -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}] \)
\(\displaystyle = -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}] \)
\(\displaystyle \implies \nabla^2_\theta J(\theta) = \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}] \)
\(\displaystyle = \sum_i g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T \)
which is a PSD matrix

Hinge Loss

Optimization

Gradient Descent

Also known as direction of steepest gradient.
To minimize a loss function, just take steps in the opposite direction of the gradient.

Stochastic Gradient Descent

Oftentimes with large amounts of data, you can't take the gradient of all your data at once. In this case, we use batches of data.
We can take batches by getting a random order of indices without replacement.

for epoch=1 to n
  generate batches
  for i=1 to m
    take gradient w.r.t batch i
    update using above gradient
Batch Size

Coordinate Block Descent

Learning Rate

Hessian

Some optimization methods may rely on the Hessian.

How to calculate \(\displaystyle Hv\)

Use finite differencing with directional derivatives:

  • \(\displaystyle H_{\theta} \mathbf{v} = \lim_{\epsilon \rightarrow 0} \frac{g(\theta + \epsilon \mathbf{v}) - g(\theta)}{\epsilon}\)
How to calculate \(\displaystyle H_{\theta}^{-1}v\)?

Calculate H. Then use gradient descent to minimize \(\displaystyle \frac{1}{2}x^T H x - v^T x\).
By first order optimality, the minimum occurs when the gradient is zero.

  • \(\displaystyle \nabla_{x} (\frac{1}{2}x^T H x - v^T x) = (\frac{1}{2}H + \frac{1}{2}H^T)x - v^T = Hx - v^T\)
  • \(\displaystyle \implies x^* = H^{-1}v\)
  • Using Gradient of quadratic form

SVM

Andrew Ng Notes
Support Vector Machine
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.
\(\displaystyle h_{w,b}(x) = g(W*x+b)\) where \(\displaystyle g(x) = I[x\gt =0]-I[x\lt 0]\) is the sign function.

Margins

The margin denoted by \(\displaystyle \gamma\) is the distance between our classifier and the closest point.

Functional Margin

The margin corresponding to one example is:
\(\displaystyle \hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)}+b)\) The margin for our entire sample is the smallest margin per example.

Geometric Margin

The geometric margin is the actual distance.
\(\displaystyle \hat{\gamma}^{(i)} = y^{(i)}((\frac{w}{\Vert w \Vert})^Tx^{(i)}+\frac{b}{\Vert w \Vert})\)

Notes
  • \(\displaystyle \mathbf{w}\) is the normal vector of our hyperplane so \((\frac{w}{\Vert w \Vert})^Tx^{(i)}\) is the length of the projection of x onto our normal vector.
This is the distance to our hyperplane.

Lagrangians

The goal for svm is to maximize the margin:
\(\displaystyle \begin{aligned} \max_{\hat{\gamma}, w, b} &\frac{\hat{\gamma}}{\Vert w \Vert}\\ \text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq \hat{\gamma} \quad \forall i \end{aligned} \)
which is equivalent to, by setting \(\displaystyle \hat{\gamma}=1\),
\(\displaystyle \begin{aligned} \min_{\gamma, w, b} &\Vert w \Vert ^2\\ \text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq 1 \quad \forall i \end{aligned} \)

In general, given an optimization in the (primal) form:
\(\displaystyle \begin{aligned} \min_w & f(w)\\ \text{s.t. }& h_i(w) \leq 0 \quad \forall i\\ & g_i(w) = 0 \end{aligned} \)
we can rewrite the optimization as
\(\displaystyle \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta) \)
where \(\displaystyle \mathcal{L}(w, \alpha, \beta) = f(w) + \sum \alpha_i g_i(w) + \sum \beta_i h_i(w)\) is called the lagrangian.
Since \(\displaystyle \max \min \leq \min \max\),
we have:
\(\displaystyle \max_{\alpha, \beta \mid \alpha \geq 0}\min_{w} \mathcal{L}(w, \alpha, \beta) \leq \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta) \)
The left term is called the dual problem.
If the solution to the dual problem satisfy some conditions called the KKT conditions, then it is also the solution to the original problem.

Kernel Trick

Oftentimes, using linear classifiers such as perceptron and SVM may fail to classify data for which the true decision boundary is non-linear.
In this case, one way to get around this is to perform a non-linear preprocessing of the data \(\displaystyle \phi(x)\).
For example \(\displaystyle \phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}\)

\(\DeclareMathOperator{\sign}{sign}\) Suppose our model is \(\displaystyle \hat{y}=\sign \sum_{i=1}^{n} w_i y_i \langle x_i, z \rangle\).
In this case, our model is a linear combination of the training data \(y\) where \(\displaystyle \langle x_i, z \rangle\) represents a similarity between \(z\) and \(x_i\).
Since we only use \(\displaystyle \langle x, z\rangle\) then we only need \(\displaystyle \phi(x)^T\phi(z)\) to simulate a non-linear processing of the data.

A kernel \(\displaystyle K(x,z)\) is a function that can be expressed as \(\displaystyle K(x,z)=\phi(x)^T\phi(z)\) for some function \(\displaystyle \phi\). Ideally, our kernel function should be able to be computed without having to compute the actual \(\displaystyle \phi\) and the full dot product.

Identifying if a function is a kernel

Basic check: Since the kernel is an inner-product between \(\displaystyle \phi(x), \phi(z)\), it should satisfy the axioms of inner products, namely \(\displaystyle K(x,z)=K(z,x)\), otherwise it is not a kernel.

Mercer's Theorem

Let our kernel function be \(\displaystyle K(z,x)\). Then for any sample S, the corresponding matrix \(\displaystyle \mathbf{K}\) where \(\displaystyle K_{ij} = K(x^{(i)},x^{(j)})\) is symmetric positive definite.

Proof

Symmetry: \(\displaystyle K_{ij} = K(\mathbf{x}^{(i)},\mathbf{x}^{(j)} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}\)
Positive Definite:
Let \(\displaystyle \mathbf{v} \in \mathbb{R}^n\).
Then
\(\displaystyle \mathbf{v}^T \mathbf{K} \mathbf{v}= \mathbf{v}^T [\sum_j K_{ij}v_j] = \sum_i \sum_j v_{i}K_{ij}v_{j} \)
\(\displaystyle = \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j} \)
\(\displaystyle = \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j} \)
\(\displaystyle = \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j} \)
\(\displaystyle = \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j} \)
\(\displaystyle = \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))^2 \geq 0 \)

Common Kernels

Polynomial Kernel

\[ \begin{align} (1+x^Tz)^2 &= 1 + 2(x^Tz) + (x^Tz)^2\\ &= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)\\ &= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2\\ &= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2 \end{align} \]

  • The dimension of the feature map associated with this kernel is exponential in d
    • There are \(1+n+\binom{n}{2} + ... + \binom{n}{d} = O(n^d)\) terms

Learning Theory

PAC Learning

Probably Approximately Correct (PAC)
A hypothesis class \(\displaystyle H\) is PAC learnable if given \(\displaystyle 0 \lt \epsilon, \delta \lt 1\), there is some function \(\displaystyle m(\epsilon, \delta)\) polynomial in \(\displaystyle 1/\epsilon, 1/\delta\) such that if we have a sample size \(\displaystyle \geq m(\epsilon, \delta)\) then with probability \(\displaystyle 1-\delta\) the hypothesis we will learn will have an average error \(\displaystyle \leq \epsilon\).

Uniform Convergence

If for all hypothesis \(\displaystyle h\), \(\displaystyle |L_S(h)-L_D(h)| \leq \epsilon\), then the training set \(\displaystyle S\) is called \(\displaystyle \epsilon\)-representative.
Then if a training set is \(\displaystyle \epsilon / 2\)-representative,
\(\displaystyle L_D(h_s) \leq L_S(h_S) + \epsilon / 2 \leq L_S(h_D) + \epsilon / 2 \leq L_D(h_D) + \epsilon \).
A hypothesis class \(\displaystyle H\) has uniform convergence if there exists \(\displaystyle m^{UC}(\epsilon, \delta)\) such that for every \(\displaystyle \epsilon, \delta\), if we draw a sample \(\displaystyle S\) then with probability \(\displaystyle 1-\delta\), \(\displaystyle S\) is \(\displaystyle \epsilon\)-representative.

VC dimension

Wikipedia Page
Some slides

Shattering

A model \(\displaystyle f\) parameterized by \(\displaystyle \theta\) is said to shatter a set of points \(\displaystyle \{x_1, ..., x_n\}\) if for every possible set of binary labellings \(\displaystyle \{y_1,...,y_n\}\) there exists \(\displaystyle \theta\) such that \(\displaystyle f\) makes no errors.

Definition

Intuitively, the VC dimension of a hypothesis set is how complex of a model it is.
Stolen from wikipedia:
The VC dimension of a model \(\displaystyle f\) is the maximum number of points that can be arranged so that \(\displaystyle f\) shatters them. More formally, it is the maximum cardinal \(\displaystyle D\) such that some data point set of cardinality \(\displaystyle D\) can be shattered by \(\displaystyle f\).

Notes
  • To show VCdim is at least n, find a training set of size n that can be shattered by our hypothesis class.
  • To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
  • Number of parameters do not necessarily correspond to VC dimension.
    \(\displaystyle H=\{h(x)=\sin(\theta x)\}\) has infinite VC dimension with one parameter

Theory

Reference
In the case where the Hypothesis class \(\displaystyle \mathcal{H}\) is finite, we have with probability \(\displaystyle 1-\delta\)

  • \(\displaystyle |L_D(h) - L_S(h)| \lt \sqrt{ \frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}} \)
where \(\displaystyle m\) is the size of the sample.

For all h in H,

  • \(\displaystyle |L_D(h) - L_S(h)| \lt K_1 \sqrt{ \frac{VCdim + K_2 \log(2/\delta)}{2m}} \)
for some constants \(\displaystyle K_1, K_2\)

Growth Function

The growth function is maximum number of ways \(\displaystyle m\) examples can be labelled using hypotheses from \(\displaystyle \mathcal{H}\)

  • \(\displaystyle \tau_H(m) = \max_{|C| = m} |H_C|\)
Notes
  • If \(\displaystyle m \leq VCdim(H)\), then \(\displaystyle \tau_H(m) = 2^m\)

Sauer's Lemma

Reference
After the VCdim, the growth function grows as a polynomial

  • If \(\displaystyle VCdim(H)\leq d \leq \infty\) then \(\displaystyle \tau_H(m) \leq \sum_{i=0}^{d} \binom{n}{i}\)
  • Also if \(\displaystyle m \gt d+1\) then \(\displaystyle \tau_H(m) \leq \left(\frac{em}{d}\right)^d\).

Bias-Variance Tradeoff

  • Let \(\displaystyle L_D(h)\) be the true loss of hypothesis \(\displaystyle h\) and \(\displaystyle L_s(h)\) be the empirical loss of hypothesis \(\displaystyle h\).
    • Here D is the true distribution and s is the training sample.
  • \(\displaystyle L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]\)
  • The term \(\displaystyle L_D(h_D^*)\) is called the bias
  • The term \(\displaystyle [L_D(h_s^*) - L_D(h_D^*)]\) is called variance.
  • Larger hypothesis class will get smaller bias but larger variance.
  • Overfitting vs. underfitting

Rademacher Complexity

Definition

The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.
In this case, we see how well we can fit random noise.
Given a set \(\displaystyle A \subset \mathbb{R}^m\) the Rademacher complexity is:
\(\displaystyle R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]\)
where each \(\displaystyle \sigma_i\) are from a discrete uniform distribution \(\displaystyle \{-1, 1\}\)
Given a sample \(\displaystyle S=\{z_1,...,z_n\}\) and a function class \(\displaystyle F\), the empirical rademacher complexity is:
\(\displaystyle R(F \circ S)\)
where \(\displaystyle F \circ S = \{(f(z_1),...,f(z_n)) \mid f \in F\}\)

Notes

Concentration Bounds

Hoeffding's inequality

Let \(\displaystyle X_1,...,X_n\) be bounded in (a,b)
Then \(\displaystyle P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})\)

See Also