Machine Learning
Machine Learning
Loss functisons
(Mean) Squared Error
The squared error is:
\(\displaystyle J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2\)
If our model is linear regression \(\displaystyle h(x)=w^tx\) then this is convex.
\(\displaystyle
\begin{aligned}
\nabla_{w} J(w) &= \nabla_{w} \sum (w^tx^{(i)} - y^{(i)})^2\\
&= 2\sum (w^t x^{(i)} - y^{(i)})x \\
\implies \nabla_{w}^2 J(w) &= \nabla 2\sum (w^T x^{(i)} - y^{(i)})x^{(i)}\\
&= 2 \sum x^{(i)}(x^{(i)})^T
\end{aligned}
\)
so the hessian is positive semi-definite
Cross Entropy
The cross entropy loss is
- \(\displaystyle J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]\)
- Notes
- If our model is \(\displaystyle g(\theta^Tx^{(i)})\) where \(\displaystyle g(x)\) is the sigmoid function \(\displaystyle \frac{e^x}{1+e^x}\) then this is convex
\(\displaystyle
\begin{aligned}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
&= \sum g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T
\end{aligned}
\)
which is a PSD matrix
Hinge Loss
Optimization
Gradient Descent
Also known as direction of steepest gradient.
To minimize a loss function, just take steps in the opposite direction of the gradient.
Stochastic Gradient Descent
Oftentimes with large amounts of data, you can't take the gradient of all your data at once.
In this case, we use batches of data.
We can take batches by getting a random order of indices without replacement.
for epoch=1 to n generate batches for i=1 to m take gradient w.r.t batch i update using above gradient
- Batch Size
Coordinate Block Descent
Learning Rate
SVM
Andrew Ng Notes
Support Vector Machine
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.
\(\displaystyle h_{w,b}(x) = g(W*x+b)\) where \(\displaystyle g(x) = I[x\gt =0]-I[x\lt 0]\) is the sign function.
Margins
The margin denoted by \(\displaystyle \gamma\) is the distance between our classifier and the closest point.
- Functional Margin
The margin corresponding to one example is:
\(\displaystyle \hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)}+b)\)
The margin for our entire sample is the smallest margin per example.
- Geometric Margin
The geometric margin is the actual distance.
\(\displaystyle \hat{\gamma}^{(i)} = y^{(i)}((\frac{w}{\Vert w \Vert})^Tx^{(i)}+\frac{b}{\Vert w \Vert})\)
- Notes
- \(\displaystyle \mathbf{w}\) is the normal vector of our hyperplane so \(\displaystyle \frac{w}{\Vert w \Vert})^Tx^{(i)}\) is the length of the projection of x onto our normal vector.
- This is the distance to our hyperplane.
Lagrangians
The goal for svm is to maximize the margin:
\(\displaystyle
\begin{aligned}
\max_{\hat{\gamma}, w, b} &\frac{\hat{\gamma}}{\Vert w \Vert}\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq \hat{\gamma} \quad \forall i
\end{aligned}
\)
which is equivalent to, by setting \(\displaystyle \hat{\gamma}=1\),
\(\displaystyle
\begin{aligned}
\min_{\gamma, w, b} &\Vert w \Vert ^2\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq 1 \quad \forall i
\end{aligned}
\)
In general, given an optimization in the (primal) form:
\(\displaystyle
\begin{aligned}
\min_w & f(w)\\
\text{s.t. }& h_i(w) \leq 0 \quad \forall i\\
& g_i(w) = 0
\end{aligned}
\)
we can rewrite the optimization as
\(\displaystyle
\min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
\)
where \(\displaystyle \mathcal{L}(w, \alpha, \beta) = f(w) + \sum \alpha_i g_i(w) + \sum \beta_i h_i(w)\) is called the lagrangian.
Since \(\displaystyle \max \min \leq \min \max\),
we have:
\(\displaystyle
\max_{\alpha, \beta \mid \alpha \geq 0}\min_{w} \mathcal{L}(w, \alpha, \beta) \leq \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
\)
The left term is called the dual problem.
If the solution to the dual problem satisfy some conditions called the KKT conditions, then it is also the solution to the original problem.
Kernel Trick
Oftentimes, using linear classifiers such as perceptron and SVM may fail to classify data for which the true decision boundary is non-linear.
In this case, one way to get around this is to perform a non-linear preprocessing of the data \(\displaystyle \phi(x)\).
For example \(\displaystyle \phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}\)
If our original model and training only used \(\displaystyle \langle x, z\rangle\) then we only need \(\displaystyle \phi(x)^T\phi(z)\)
A kernel \(\displaystyle K(x,z)\) is a function that can be expressed as \(\displaystyle K(x,z)=\phi(x)^T\phi(z)\) for some function \(\displaystyle \phi\)
Identifying if a function is a kernel
Basic check:
Since the kernel is an inner-product between \(\displaystyle \phi(x), \phi(z)\), it should satisfy the axioms of inner products, namely \(\displaystyle K(x,z)=K(z,x)\), otherwise it is not a kernel.
Mercer's Theorem
Let our kernel function be \(\displaystyle K(z,x)\). Then for any sample S, the corresponding matrix \(\displaystyle \mathbf{K}\) where \(\displaystyle K_{ij} = K(x^{(i)},x^{(j)})\) is symmetric positive definite.
Symmetry: \(\displaystyle K_{ij} = K(\mathbf{x}^{(i)},\mathbf{x}^{(j)} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}\)
Positive Definite:
Let \(\displaystyle \mathbf{v} \in \mathbb{R}^n\).
Then
\(\displaystyle
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= v^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
&\geq 0
\end{aligned}
\)
Learning Theory
PAC Learning
Probably Approximately Correct (PAC)
A hypothesis class \(\displaystyle H\) is PAC learnable if given \(\displaystyle 0 \lt \epsilon, \delta \lt 1\), there is some function \(\displaystyle m(\epsilon, \delta)\) polynomial in \(\displaystyle 1/\epsilon, 1/\delta\) such that if we have a sample size \(\displaystyle \geq m(\epsilon, \delta)\) then with probability \(\displaystyle 1-\delta\) the hypothesis we will learn will have an average error \(\displaystyle \leq \epsilon\).
Uniform Convergence
If for all hypothesis \(\displaystyle h\), \(\displaystyle |L_S(h)-L_D(h)| \leq \epsilon\), then the training set \(\displaystyle S\) is called \(\displaystyle \epsilon\)-representative.
Then
\(\displaystyle
L_D(h_s)
\leq L_S(h_S) + \epsilon / 2
\leq L_S(h_D) + \epsilon / 2
\leq L_D(h_D) + \epsilon
\).
A hypothesis class \(\displaystyle H\) has uniform convergence if there exists \(\displaystyle m^{UC}(\epsilon, \delta)\) such that for every \(\displaystyle \epsilon, \delta\), if we draw a sample \(\displaystyle S\) then with probability \(\displaystyle 1-\delta\), \(\displaystyle S\) is \(\displaystyle \epsilon\)-representative.
VC dimension
Shattering
A model \(\displaystyle f\) parameterized by \(\displaystyle \theta\) is said to shatter a set of points \(\displaystyle \{x_1, ..., x_n\}\) if for every possible set of binary labellings \(\displaystyle \{y_1,...,y_n\}\) there exists \(\displaystyle \theta\) such that \(\displaystyle f\) makes no errors.
Definition
Intuitively, the VC dimension of a hypothesis set is how complex of a model it is.
Stolen from wikipedia:
The VC dimension of a model \(\displaystyle f\) is the maximum number of points that can be arranged so that \(\displaystyle f\) shatters them.
More formally, it is the maximum cardinal \(\displaystyle D\) such that some data point set of cardinality \(\displaystyle D\) can be shattered by \(\displaystyle f\).
- Notes
- To show VCdim is at least n, find a training set of size n that can be shattered by our hypothesis class.
- To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
- Number of parameters do not necessarily correspond to VC dimension.
- \(\displaystyle H=\{h(x)=\sin(\theta x)\}\) has infinite VC dimension with one parameter
Theory
Reference
In the case where the Hypothesis class \(\displaystyle \mathcal{H}\) is finite, we have
- \(\displaystyle |L_D(h) - L_S(h)| \lt \sqrt{ \frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}} \)
- where \(\displaystyle m\) is the size of the sample.
For all h in H,
- \(\displaystyle |L_D(h) - L_S(h)| \lt K_1 \sqrt{ \frac{VCdim + K_2 \log(2/\delta)}{2m}} \)
- for some constants \(\displaystyle K_1, K_2\)
Bias-Variance Tradeoff
- Let \(\displaystyle L_D(h)\) be the true loss of hypothesis h
- and \(\displaystyle L_S(h)\) be the true loss of hypothesis h
- \(\displaystyle L_D(h_s^*) = L_D(h_D^*) + [L_D(h_s^*) - L_D(h_D^*)]\)
- The term \(\displaystyle L_D(h_D^*)\) is called the bias
- The term \(\displaystyle [L_D(h_s^*) - L_D(h_D^*)]\) is called variance.
- Larger hypothesis class will get smaller bias but larger variance.
- Overfitting vs. underfitting
Rademacher Complexity
Definition
The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.
In this case, we see how well we can fit random noise.
Given a set \(\displaystyle A \subset \mathbb{R}^m\) the Rademacher complexity is:
\(\displaystyle R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]\)
where each \(\displaystyle \sigma_i\) are from a discrete uniform distribution \(\displaystyle \{-1, 1\}\)
Given a sample \(\displaystyle S=\{z_1,...,z_n\}\) and a function class \(\displaystyle F\), the empirical rademacher complexity is:
\(\displaystyle R(F \circ S)\)
where \(\displaystyle F \circ S = \{(f(z_1),...,f(z_n)) \mid f \in F\}\)
- Notes
Concentration Bounds
Hoeffding's inequality
Let \(\displaystyle X_1,...,X_n\) be bounded in (a,b)
Then \(\displaystyle P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})\)