Deep Learning

* Data is linearly separable
* No bias term (b=0)
Why do we pick \(w^*\) to be the max-margin solution? 
To have better generalization.
We will consider functions \(f_w(x) = w^t x\) where we have \(\{(x_i, y_i)\}_{1}^{N}\), \(f_w: \mathbb{R}^d \to \mathbb{R}\), \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\).
We can use one of the loss functions:
* hinge loss \(l(u)=\max(1-u, 0)\)
* logistic loss \(l(u) = \log(1+e^{-u})\)
* exp loss \(l(u)=e^{-u}\)
Our overall loss is the sum of the per-sample losses: 
\(L(w) = \sum_{i=1}^{N} l(y_i w^t x_i)\) 
Here \(x_i\) and \(y_i\) are constants, so define \(\tilde{x}_i = y_i x_i\). 
Now our loss is \(L(w) = \sum_{i=1}^{N} l(w^t \tilde{x}_i)\). (Below we drop the tilde and simply write \(x_i\) for the label-multiplied sample \(\tilde{x}_i\).)
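Written out, the objective is just a sum of scalar losses over the margins \(w^t \tilde{x}_i\). A minimal NumPy sketch (the names <code>X</code>, <code>y</code>, and <code>total_loss</code> are hypothetical; <code>X</code> has shape (N, d) and <code>y</code> takes values in {-1, 1}):
<syntaxhighlight lang="python">
import numpy as np

# Candidate per-sample losses l(u), evaluated at the margin u = w^t x_tilde_i.
def hinge_loss(u):
    return np.maximum(1.0 - u, 0.0)

def logistic_loss(u):
    return np.logaddexp(0.0, -u)   # log(1 + e^{-u}), numerically stable

def exp_loss(u):
    return np.exp(-u)

def total_loss(w, X, y, loss=logistic_loss):
    """L(w) = sum_i l(y_i w^t x_i) = sum_i l(w^t x_tilde_i)."""
    X_tilde = y[:, None] * X       # x_tilde_i = y_i * x_i, shape (N, d)
    return loss(X_tilde @ w).sum()
</syntaxhighlight>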
Assumptions:
# Data is linearly separable: \(\exists w^* \) s.t. \((w^*)^t x_i > 0 \; \forall i\).
# \(l(u)\) is positive, differentiable, monotonically decreasing to 0, and \(\beta\)-smooth.
#* Logistic and exp loss satisfy assumption 2.
Hard SVM: 
Call the SVM solution \(w_{svm}\); its decision boundary is \(w_{svm}^t x = 0\). 
Recall the SVM optimization is:
<math>
\begin{aligned}
\min_w \;&\Vert w \Vert^2\\
\text{s.t. } &w^t x_i \geq 1 \; \forall i
\end{aligned}
</math>
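For comparison later, this hard-margin problem can be handed to a generic convex solver. A sketch, assuming CVXPY is available and <code>X_tilde</code> is the (N, d) matrix whose rows are the label-multiplied samples (hypothetical names):
<syntaxhighlight lang="python">
import cvxpy as cp

def hard_margin_svm(X_tilde):
    """Solve min ||w||^2  s.t.  w^t x_i >= 1 for all i (no bias term)."""
    n, d = X_tilde.shape
    w = cp.Variable(d)
    constraints = [X_tilde @ w >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value   # w_svm; the decision boundary is w_svm^t x = 0
</syntaxhighlight>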
We use GD to minimize the loss: 
\(\min_w \sum_{i=1}^{N} l(w^t x_i)\) 
<math>
\begin{aligned}
w(t+1) &= w(t) - \eta \nabla L(w(t))\\
&= w(t) - \eta \sum_{i=1}^{N} l'(w(t)^t x_i) x_i
\end{aligned}
</math>
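The update rule translates directly into code. A sketch for the logistic loss, using the same hypothetical <code>X_tilde</code> matrix of label-multiplied samples:
<syntaxhighlight lang="python">
import numpy as np

def grad_L(w, X_tilde):
    """Gradient of L(w) = sum_i l(w^t x_i) for the logistic loss l(u) = log(1 + e^{-u})."""
    u = X_tilde @ w                     # margins w(t)^t x_i, shape (N,)
    l_prime = -1.0 / (1.0 + np.exp(u))  # l'(u) = -e^{-u} / (1 + e^{-u}) < 0
    return X_tilde.T @ l_prime          # sum_i l'(w^t x_i) x_i, shape (d,)

def gd_step(w, X_tilde, eta):
    """One step: w(t+1) = w(t) - eta * grad L(w(t))."""
    return w - eta * grad_L(w, X_tilde)
</syntaxhighlight>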
Since the data is linearly separable, there exists \(w^*\) such that: 
\((w^*)^t \nabla L(w) = \sum_{i=1}^{N} l'(w^t x_i) x_i^t w^* < 0\) 
Here \(l'(w^t x_i) < 0\) (since \(l\) is monotonically decreasing) and \(x_i^t w^* > 0\) (by linear separability), so every term of the sum is negative. 
This implies there are no finite critical points (i.e., points where \(\nabla L(w) = 0\)). 
But GD on a smooth loss (with a sufficiently small step size) converges to a critical point, i.e., \(\nabla L(w(t)) \to 0\). 
Since no finite critical point exists, \(\Vert w(t) \Vert\) must go to infinity and the loss converges to 0. 
Since 0 is the infimum of the loss, GD converges to a global minimum in the limit.
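A quick numerical sanity check of this behavior (a sketch with arbitrary toy data, step size, and iteration count, using the logistic loss): the iterate norm keeps growing while the loss keeps shrinking toward 0.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0])            # separable through the origin by w* = (1, 0)
X_tilde = y[:, None] * X        # label-multiplied samples

w, eta = np.zeros(2), 0.01
for t in range(1, 20001):
    u = X_tilde @ w
    grad = X_tilde.T @ (-1.0 / (1.0 + np.exp(u)))
    w -= eta * grad
    if t % 5000 == 0:
        loss = np.logaddexp(0.0, -X_tilde @ w).sum()
        # ||w(t)|| keeps increasing; L(w(t)) keeps decreasing toward 0
        print(t, np.linalg.norm(w), loss)
</syntaxhighlight>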


==Misc==