* Data is linearly separable
* No bias term (b=0)
Why do we pick \(w^*\) to be the max-margin solution? To achieve better generalization.
We consider linear functions \(f_w(x) = w^t x\), \(f_w: \mathbb{R}^d \to \mathbb{R}\), with data \(\{(x_i, y_i)\}_{i=1}^{N}\), \(x_i \in \mathbb{R}^d\), and \(y_i \in \{-1, 1\}\).
We can use one of the following loss functions:
* hinge loss: \(l(u)=\max(1-u, 0)\)
* logistic loss: \(l(u) = \log(1+e^{-u})\)
* exp loss: \(l(u)=e^{-u}\)
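As a quick numerical check, the three losses above can be sketched in NumPy (the margin values below are arbitrary illustrations, not from the notes):

```python
import numpy as np

# Each loss takes a margin u = y * w^t x; smaller/negative margins are penalized more.
def hinge(u):
    return np.maximum(1.0 - u, 0.0)

def logistic(u):
    return np.log1p(np.exp(-u))

def exp_loss(u):
    return np.exp(-u)

u = np.array([-1.0, 0.0, 2.0])
print(hinge(u))     # zero once the margin reaches 1, linear penalty below
print(logistic(u))  # strictly positive, decays to 0 as u grows
print(exp_loss(u))  # strictly positive, decays to 0 as u grows
```

Note that hinge loss is exactly zero past margin 1, while logistic and exp loss are strictly positive everywhere; this difference matters for assumption 2 below.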
Our overall loss is the sum of the per-sample losses:
\(L(w) = \sum_{i=1}^{N}l(y_i w^t x_i)\)
Since \(x_i\) and \(y_i\) are constants, define \(\tilde{x}_i = y_i x_i\).
Now our loss is \(L(w) = \sum_{i=1}^{N}l(w^t \tilde{x}_i)\). Below we drop the tilde and simply write \(x_i\) for \(y_i x_i\).
Assumptions:
# Data is linearly separable: \(\exists w^*\) s.t. \((w^*)^t x_i > 0 \; \forall i\).
# \(l(u)\) is positive, differentiable, monotonically decreasing to 0, and \(\beta\)-smooth.
#* Logistic and exp loss satisfy assumption 2; hinge loss does not (it is not differentiable everywhere and is exactly 0 for \(u \geq 1\)).
Hard SVM:
Call the SVM solution \(w_{svm}\), with decision boundary \(w_{svm}^t x = 0\).
Recall the SVM optimization problem:
<math>
\begin{aligned}
\min_w \;&\Vert w \Vert^2\\
\text{s.t. } &w^t x_i \geq 1 \; \forall i
\end{aligned}
</math>
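The QP above can be handed to a generic constrained solver; the following is a minimal sketch on assumed toy data (labels already folded in, i.e. each row of `X` is \(y_i x_i\)):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (an assumption for illustration, not from the notes).
X = np.array([[2.0, 1.0], [1.0, 2.0], [1.5, 1.5]])

res = minimize(
    lambda w: w @ w,                                        # minimize ||w||^2
    x0=np.ones(2),                                          # feasible start: X @ x0 >= 1
    constraints=[{"type": "ineq", "fun": lambda w: X @ w - 1.0}],  # w^t x_i >= 1
)
w_svm = res.x
print(w_svm)  # for this data the exact solution is (1/3, 1/3)
```

All three constraints are active at the optimum here, so every point is a support vector.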
We use GD to minimize the loss:
\(\min_w \sum_{i=1}^{N} l(w^t x_i)\)
<math>
\begin{aligned}
w(t+1) &= w(t) - \eta \nabla L(w(t))\\
&= w(t) - \eta \sum_{i=1}^{N} l'(w(t)^t x_i) x_i
\end{aligned}
</math>
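The update above can be sketched directly for the logistic loss, where \(l'(u) = -1/(1+e^u)\); the toy data is an assumption (labels folded in as \(x_i \leftarrow y_i x_i\)):

```python
import numpy as np

# Toy linearly separable data: (w*)^t x_i > 0 for w* = (1, 1).
X = np.array([[2.0, 1.0], [1.0, 2.0], [1.5, 1.5], [2.5, 0.5]])

def grad_L(w):
    # nabla L(w) = sum_i l'(w^t x_i) x_i, with l'(u) = -1/(1 + e^u) for logistic loss
    u = X @ w
    return -(1.0 / (1.0 + np.exp(u))) @ X

w = np.zeros(2)
eta = 0.1
for t in range(2000):
    w = w - eta * grad_L(w)

print(X @ w)  # every margin w^t x_i ends up positive: all points correctly classified
```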
Since the data is linearly separable, there exists \(w^*\) such that for every \(w\):
\((w^*)^t \nabla L(w) = \sum_{i=1}^{N} l'(w^t x_i) x_i^t w^* < 0\)
Here \(l'(w^t x_i) < 0\) (since \(l\) is monotonically decreasing) and \(x_i^t w^* > 0\) (by separability), so every term in the sum is negative.
This implies there are no finite critical points (where \(\nabla L(w) = 0\)): the gradient is nonzero everywhere.
But GD with a sufficiently small step size on a \(\beta\)-smooth loss converges to a critical point.
This implies \(\Vert w(t) \Vert \to \infty\) and the loss converges to 0.
Since the loss converges to 0, which is the infimum of \(L\), GD attains the global minimum value in the limit.
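This behavior can be observed numerically: a sketch tracking \(\Vert w(t) \Vert\), \(L(w(t))\), and the direction \(w(t)/\Vert w(t) \Vert\) under GD on the logistic loss. The toy data is an assumption; it is symmetric in the two coordinates, so the max-margin direction is \((1,1)/\sqrt{2}\):

```python
import numpy as np

# Toy separable data (assumed), symmetric under swapping the two coordinates.
X = np.array([[2.0, 1.0], [1.0, 2.0], [1.5, 1.5]])

def loss(w):
    return np.sum(np.log1p(np.exp(-X @ w)))

def grad_L(w):
    return -(1.0 / (1.0 + np.exp(X @ w))) @ X

w = np.zeros(2)
eta = 0.1
norms, losses = [], []
for t in range(1, 20001):
    w = w - eta * grad_L(w)
    if t % 5000 == 0:
        norms.append(float(np.linalg.norm(w)))
        losses.append(float(loss(w)))

print(norms)                  # strictly increasing: ||w(t)|| diverges
print(losses)                 # strictly decreasing toward 0
print(w / np.linalg.norm(w))  # close to the max-margin direction (1, 1)/sqrt(2)
```

The norm grows only logarithmically in \(t\), which is why the direction converges slowly in general; here symmetry makes it exact.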
==Misc==