# Machine Learning

Machine Learning

## Loss functions

### (Mean) Squared Error

The squared error is:
${\displaystyle J(\theta )=\sum |h_{\theta }(x^{(i)})-y^{(i)}|^{2}}$
If our model is linear regression ${\displaystyle h(x)=w^{t}x}$ then this is convex.

Proof

{\displaystyle {\begin{aligned}\nabla _{w}J(w)&=\nabla _{w}\sum (w^{t}x^{(i)}-y^{(i)})^{2}\\&=2\sum (w^{t}x^{(i)}-y^{(i)})x\\\implies \nabla _{w}^{2}J(w)&=\nabla 2\sum (w^{T}x^{(i)}-y^{(i)})x^{(i)}\\&=2\sum x^{(i)}(x^{(i)})^{T}\end{aligned}}}
so the hessian is positive semi-definite

### Cross Entropy

The cross entropy loss is

• ${\displaystyle J(\theta )=\sum [(y^{(i)})\log(h_{\theta }(x))+(1-y^{(i)})\log(1-h_{\theta }(x))]}$
Notes
• If our model is ${\displaystyle g(\theta ^{T}x^{(i)})}$ where ${\displaystyle g(x)}$ is the sigmoid function ${\displaystyle {\frac {e^{x}}{1+e^{x}}}}$ then this is convex
Proof

which is a PSD matrix

## Optimization

Also known as direction of steepest gradient.
To minimize a loss function, just take steps in the opposite direction of the gradient.

Oftentimes with large amounts of data, you can't take the gradient of all your data at once. In this case, we use batches of data.
We can take batches by getting a random order of indices without replacement.

for epoch=1 to n
generate batches
for i=1 to m

Batch Size

## SVM

Andrew Ng Notes
Support Vector Machine
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.
${\displaystyle h_{w,b}(x)=g(W*x+b)}$ where ${\displaystyle g(x)=I[x>=0]-I[x<0]}$ is the sign function.

### Margins

The margin denoted by ${\displaystyle \gamma }$ is the distance between our classifier and the closest point.

Functional Margin

The margin corresponding to one example is:
${\displaystyle {\hat {\gamma }}^{(i)}=y^{(i)}(w^{T}x^{(i)}+b)}$ The margin for our entire sample is the smallest margin per example.

Geometric Margin

The geometric margin is the actual distance.
${\displaystyle {\hat {\gamma }}^{(i)}=y^{(i)}(({\frac {w}{\Vert w\Vert }})^{T}x^{(i)}+{\frac {b}{\Vert w\Vert }})}$

Notes
• ${\displaystyle \mathbf {w} }$ is the normal vector of our hyperplane so ${\displaystyle {\frac {w}{\Vert w\Vert }})^{T}x^{(i)}}$ is the length of the projection of x onto our normal vector.
This is the distance to our hyperplane.

### Lagrangians

The goal for svm is to maximize the margin:
{\displaystyle {\begin{aligned}\max _{{\hat {\gamma }},w,b}&{\frac {\hat {\gamma }}{\Vert w\Vert }}\\{\text{s.t. }}&y^{(i)}(w^{T}x^{(i)}+b)\geq {\hat {\gamma }}\quad \forall i\end{aligned}}}
which is equivalent to, by setting ${\displaystyle {\hat {\gamma }}=1}$,
{\displaystyle {\begin{aligned}\min _{\gamma ,w,b}&\Vert w\Vert ^{2}\\{\text{s.t. }}&y^{(i)}(w^{T}x^{(i)}+b)\geq 1\quad \forall i\end{aligned}}}

In general, given an optimization in the (primal) form:
{\displaystyle {\begin{aligned}\min _{w}&f(w)\\{\text{s.t. }}&h_{i}(w)\leq 0\quad \forall i\\&g_{i}(w)=0\end{aligned}}}
we can rewrite the optimization as
${\displaystyle \min _{w}\max _{\alpha ,\beta \mid \alpha \geq 0}{\mathcal {L}}(w,\alpha ,\beta )}$
where ${\displaystyle {\mathcal {L}}(w,\alpha ,\beta )=f(w)+\sum \alpha _{i}g_{i}(w)+\sum \beta _{i}h_{i}(w)}$ is called the lagrangian.
Since ${\displaystyle \max \min \leq \min \max }$,
we have:
${\displaystyle \max _{\alpha ,\beta \mid \alpha \geq 0}\min _{w}{\mathcal {L}}(w,\alpha ,\beta )\leq \min _{w}\max _{\alpha ,\beta \mid \alpha \geq 0}{\mathcal {L}}(w,\alpha ,\beta )}$
The left term is called the dual problem.
If the solution to the dual problem satisfy some conditions called the KKT conditions, then it is also the solution to the original problem.

### Kernel Trick

Oftentimes, using linear classifiers such as perceptron and SVM may fail to classify data for which the true decision boundary is non-linear.
In this case, one way to get around this is to perform a non-linear preprocessing of the data ${\displaystyle \phi (x)}$.
For example ${\displaystyle \phi (x)={\begin{bmatrix}x\\x^{2}\\x^{3}\end{bmatrix}}}$ If our original model and training only used ${\displaystyle \langle x,z\rangle }$ then we only need ${\displaystyle \phi (x)^{T}\phi (z)}$
A kernel ${\displaystyle K(x,z)}$ is a function that can be expressed as ${\displaystyle K(x,z)=\phi (x)^{T}\phi (z)}$ for some function ${\displaystyle \phi }$

#### Identifying if a function is a kernel

Basic check: Since the kernel is an inner-product between ${\displaystyle \phi (x),\phi (z)}$, it should satisfy the axioms of inner products, namely ${\displaystyle K(x,z)=K(z,x)}$, otherwise it is not a kernel.

#### Mercer's Theorem

Let our kernel function be ${\displaystyle K(z,x)}$. Then for any sample S, the corresponding matrix ${\displaystyle \mathbf {K} }$ where ${\displaystyle K_{ij}=K(x^{(i)},x^{(j)})}$ is symmetric positive definite.

Proof

Symmetry: ${\displaystyle K_{ij}=K(\mathbf {x} ^{(i)},\mathbf {x} ^{(j)}=\phi (\mathbf {x} ^{(i)})^{T}\phi (\mathbf {x} ^{(j)})=\phi (\mathbf {x} ^{(j)})^{T}\phi (\mathbf {x} ^{(i)})=K_{ji}}$
Positive Definite:
Let ${\displaystyle \mathbf {v} \in \mathbb {R} ^{n}}$.
Then
${\displaystyle H}$