Deep Learning

* Data is linearly separable
* No bias term (b=0)
Why do we pick \(w^*\) to be the max-margin solution? 
To have better generalization.
We will consider functions \(f_w(x) = w^t x\) where we have \(\{(x_i, y_i)\}_{1}^{N}\), \(f_w: \mathbb{R}^d \to \mathbb{R}\), \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\).
We can use one of the loss functions:
* hinge loss \(l(u)=\max(1-u, 0)\)
* logistic loss \(l(u) = \log(1+e^{-u})\)
* exp loss \(l(u)=e^{-u}\)
Our overall loss is the sum of the per-sample losses: 
\(L(w) = \sum_{i=1}^{N} l(y_i w^t x_i)\) 
Here \(x_i\) and \(y_i\) are constants, so define \(\tilde{x}_i = y_i x_i\). 
Now our loss is \(L(w) = \sum_{i=1}^{N} l(w^t \tilde{x}_i)\). (Below we drop the tilde and simply write \(x_i\) for the label-multiplied sample \(\tilde{x}_i\).)
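Written out, the objective is just a sum of scalar losses over the margins \(w^t \tilde{x}_i\). A minimal NumPy sketch (the names <code>X</code>, <code>y</code>, and <code>total_loss</code> are hypothetical; <code>X</code> has shape (N, d) and <code>y</code> takes values in {-1, 1}):
<syntaxhighlight lang="python">
import numpy as np

# Candidate per-sample losses l(u), evaluated at the margin u = w^t x_tilde_i.
def hinge_loss(u):
    return np.maximum(1.0 - u, 0.0)

def logistic_loss(u):
    return np.logaddexp(0.0, -u)   # log(1 + e^{-u}), numerically stable

def exp_loss(u):
    return np.exp(-u)

def total_loss(w, X, y, loss=logistic_loss):
    """L(w) = sum_i l(y_i w^t x_i) = sum_i l(w^t x_tilde_i)."""
    X_tilde = y[:, None] * X       # x_tilde_i = y_i * x_i, shape (N, d)
    return loss(X_tilde @ w).sum()
</syntaxhighlight>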
Assumptions:
# Data is linearly separable: \(\exists w^* \) s.t. \((w^*)^t x_i > 0 \; \forall i\).
# \(l(u)\) is positive, differentiable, monotonically decreasing to 0, and \(\beta\)-smooth.
#* Logistic and exp loss satisfy assumption 2.
Hard SVM: 
Call the SVM solution \(w_{svm}\); its decision boundary is \(w_{svm}^t x = 0\). 
Recall the SVM optimization is:
<math>
\begin{aligned}
\min_w \;&\Vert w \Vert^2\\
\text{s.t. } &w^t x_i \geq 1 \; \forall i
\end{aligned}
</math>
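For comparison later, this hard-margin problem can be handed to a generic convex solver. A sketch, assuming CVXPY is available and <code>X_tilde</code> is the (N, d) matrix whose rows are the label-multiplied samples (hypothetical names):
<syntaxhighlight lang="python">
import cvxpy as cp

def hard_margin_svm(X_tilde):
    """Solve min ||w||^2  s.t.  w^t x_i >= 1 for all i (no bias term)."""
    n, d = X_tilde.shape
    w = cp.Variable(d)
    constraints = [X_tilde @ w >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value   # w_svm; the decision boundary is w_svm^t x = 0
</syntaxhighlight>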
We use GD to minimize the loss: 
\(\min_w \sum_{i=1}^{N} l(w^t x_i)\) 
<math>
\begin{aligned}
w(t+1) &= w(t) - \eta \nabla L(w(t))\\
&= w(t) - \eta \sum_{i=1}^{N} l'(w(t)^t x_i) x_i
\end{aligned}
</math>
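The update rule translates directly into code. A sketch for the logistic loss, using the same hypothetical <code>X_tilde</code> matrix of label-multiplied samples:
<syntaxhighlight lang="python">
import numpy as np

def grad_L(w, X_tilde):
    """Gradient of L(w) = sum_i l(w^t x_i) for the logistic loss l(u) = log(1 + e^{-u})."""
    u = X_tilde @ w                     # margins w(t)^t x_i, shape (N,)
    l_prime = -1.0 / (1.0 + np.exp(u))  # l'(u) = -e^{-u} / (1 + e^{-u}) < 0
    return X_tilde.T @ l_prime          # sum_i l'(w^t x_i) x_i, shape (d,)

def gd_step(w, X_tilde, eta):
    """One step: w(t+1) = w(t) - eta * grad L(w(t))."""
    return w - eta * grad_L(w, X_tilde)
</syntaxhighlight>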
Since the data is linearly separable, there exists \(w^*\) such that: 
\((w^*)^t \nabla L(w) = \sum_{i=1}^{N} l'(w^t x_i) x_i^t w^* < 0\) 
Here \(l'(w^t x_i) < 0\) (since \(l\) is monotonically decreasing) and \(x_i^t w^* > 0\) (by linear separability), so every term of the sum is negative. 
This implies there are no finite critical points (i.e., points where \(\nabla L(w) = 0\)). 
But GD on a smooth loss (with a sufficiently small step size) converges to a critical point, i.e., \(\nabla L(w(t)) \to 0\). 
Since no finite critical point exists, \(\Vert w(t) \Vert\) must go to infinity and the loss converges to 0. 
Since 0 is the infimum of the loss, GD converges to a global minimum in the limit.
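A quick numerical sanity check of this behavior (a sketch with arbitrary toy data, step size, and iteration count, using the logistic loss): the iterate norm keeps growing while the loss keeps shrinking toward 0.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0])            # separable through the origin by w* = (1, 0)
X_tilde = y[:, None] * X        # label-multiplied samples

w, eta = np.zeros(2), 0.01
for t in range(1, 20001):
    u = X_tilde @ w
    grad = X_tilde.T @ (-1.0 / (1.0 + np.exp(u)))
    w -= eta * grad
    if t % 5000 == 0:
        loss = np.logaddexp(0.0, -X_tilde @ w).sum()
        # ||w(t)|| keeps increasing; L(w(t)) keeps decreasing toward 0
        print(t, np.linalg.norm(w), loss)
</syntaxhighlight>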


==Misc==