Machine Learning

Hyperparameters

Batch Size

A medium post empirically evaluating the effect of batch_size

Optimization

Gradient Descent

Also known as direction of steepest gradient.
To minimize a loss function, just take steps in the opposite direction of the gradient.

Stochastic Gradient Descent

Oftentimes with large amounts of data, you can't take the gradient of all your data at once. In this case, we use batches of data.
We can take batches by getting a random order of indices without replacement.

for epoch=1 to n
  generate batches
  for i=1 to m
    take gradient w.r.t batch i
    update using above gradient

Coordinate Block Descent

Learning Rate

SVM

Kernel Trick

Oftentimes, using linear classifiers such as perceptron and SVM may fail to classify data for which the true decision boundary is non-linear.
In this case, one way to get around this is to perform a non-linear preprocessing of the data \(\displaystyle \phi(x)\).
For example \(\displaystyle \phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}\) If our original model and training only used \(\displaystyle \langle x, z\rangle\) then we only need \(\displaystyle \phi(x)^T\phi(z)\)
A kernel \(\displaystyle K(x,z)\) is a function that can be expressed as \(\displaystyle K(x,z)=\phi(x)^T\phi(z)\) for some function \(\displaystyle \phi\)

Identifying if a function is a kernel

Basic check: Since the kernel is an inner-product between \(\displaystyle \phi(x), \phi(z)\), it should satisfy the axioms of inner products, namely \(\displaystyle K(x,z)=K(z,x)\), otherwise it is not a kernel.

Mercer's Theorem

Let our kernel function be \(\displaystyle K(z,x)\). Then for any sample S, the corresponding matrix \(\displaystyle \mathbf{K}\) where \(\displaystyle K_{ij} = K(x^{(i)},x^{(j)}\) is symmetric positive definite.

Proof

Symmetry: \(\displaystyle K_{ij} = K(\mathbf{x}^{(i)},\mathbf{x}^{(j)} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}\)
Positive Definite:
Let \(\displaystyle \mathbf{v} \in \mathbb{R}^n\).
Then \(\displaystyle \begin{aligned} \mathbf{v}^T \mathbf{K} \mathbf{v} &= v^T [\sum_j K_{ij}v_j]\\ &= \sum_i \sum_j v_{i}K_{ij}v_{j}\\ &= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\ &= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\ &= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\ &= \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\ &= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\ &\geq 0 \end{aligned} \)

Learning Theory

PAC Learning

Probably Approximately Correct (PAC)
A hypothesis class \(\displaystyle H\) is PAC learnable if given \(\displaystyle 0 \lt \epsilon, \delta \lt 1\), there is some function \(\displaystyle m(\epsilon, \delta)\) polynomial in \(\displaystyle 1/\epsilon, 1/\delta\) such that if we have a sample size \(\displaystyle \geq m(\epsilon, \delta)\) then with probability \(\displaystyle 1-\delta\) the hypothesis we will learn will have an average error \(\displaystyle \leq \epsilon\).

Uniform Convergence

If for all hypothesis \(\displaystyle h\), \(\displaystyle |L_S(h)-L_D(h)| \leq \epsilon\), then the training set \(\displaystyle S\) is called \(\displaystyle \epsilon\)-representative.
Then \(\displaystyle L_D(h_s) \leq L_S(h_S) + \epsilon / 2 \leq L_S(h_D) + \epsilon / 2 \leq L_D(h_D) + \epsilon \).
A hypothesis class \(\displaystyle H\) has uniform convergence if there exists \(\displaystyle m^{UC}(\epsilon, \delta)\) such that for every \(\displaystyle \epsilon, \delta\), if we draw a sample \(\displaystyle S\) then with probability \(\displaystyle 1-\delta\), \(\displaystyle S\) is \(\displaystyle \epsilon\)-representative.

VC dimension

Wikipedia Page
Some slides

Shattering

A model \(\displaystyle f\) parameterized by \(\displaystyle \theta\) is said to shatter a set of points \(\displaystyle \{x_1, ..., x_n\}\) if there exists \(\displaystyle \theta\) such that \(\displaystyle f\) makes no errors.

Definition

Stolen from wikipedia:
The VC dimension of a model \(\displaystyle f\) is the maximum number of points that can be arranged so that \(\displaystyle f\) shatters them. More formally, it is the maximum cardinal \(\displaystyle D\) such that some data point set of cardinality \(\displaystyle D\) can be shattered by \(\displaystyle f\).

Notes

To show VCdim is at least n, find a training set of size n that can be shattered by our hypothesis class.
To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
Number of parameters do not necessarily correspond to VC dimension.

\(\displaystyle H=\{h(x)=\sin(\theta x)\}\) has infinite VC dimension with one parameter

Theory

For all h in H,

\(\displaystyle |L_D(h) - L_S(h)| \lt K_1 \sqrt{ \frac{VCdim + K_2 log(2/\delta)}{2m}} \)

Bias-Variance Tradeoff

Let \(\displaystyle L_D(h)\) be the true loss of hypothesis h

and \(\displaystyle L_S(h)\) be the true loss of hypothesis h

\(\displaystyle L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]\)
The term \(\displaystyle L_D(h_D^*)\)
The term \(\displaystyle [L_D(h_s^*) - L_D(h_D^*)]\) is called variance.
Larger hypothesis class will get smaller bias but larger variance.
Overfitting vs. underfitting