<math display="block">
\begin{aligned}
L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla L(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w')(w_{t+1}-w_t)\\
&= L(w_t) + (-\eta)\nabla L(w_t)^T \nabla L(w_t) + \frac{1}{2}(-\eta)\nabla L(w_t)^T H(w')(-\eta) \nabla L(w_t)\\
&= L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w') \nabla L(w_t)\\
&\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 \left(1-\frac{\eta \beta}{2}\right) &\text{by assumption 3}\\
&\leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}
\end{aligned}
</math>
Applying the <math>\mu</math>-PL condition, <math>\frac{1}{2}\Vert \nabla L(w_t) \Vert^2 \geq \mu L(w_t)</math> (taking the optimal loss to be <math>0</math>), to the last inequality gives <math>L(w_{t+1}) \leq (1-\eta \mu) L(w_t)</math>.
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
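To make the rate concrete, here is a minimal numerical sketch (not part of the original derivation; the toy loss and all constants in it are illustrative assumptions) using a quadratic loss <math>L(w) = \frac{1}{2} w^T A w</math> with symmetric positive definite <math>A</math>. This loss is <math>\beta</math>-smooth with <math>\beta = \lambda_{\max}(A)</math> and satisfies the <math>\mu</math>-PL condition with <math>\mu = \lambda_{\min}(A)</math> (under the convention <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>), so gradient descent with <math>\eta = 1/\beta</math> should stay below the envelope <math>(1-\eta \mu)^t L(w_0)</math>.
<syntaxhighlight lang="python">
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with symmetric positive definite A
# (illustrative choice). For this loss, mu = lambda_min(A) and beta = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.1 * np.eye(5)

def L(w):
    return 0.5 * w @ A @ w

def grad_L(w):
    return A @ w

eigvals = np.linalg.eigvalsh(A)        # eigenvalues in ascending order
mu, beta = eigvals[0], eigvals[-1]     # PL constant and smoothness constant
eta = 1.0 / beta                       # step size satisfying eta <= 1/beta

w = rng.standard_normal(5)
L0 = L(w)
for t in range(1, 51):
    w = w - eta * grad_L(w)            # gradient descent step
    bound = (1 - eta * mu) ** t * L0   # geometric envelope from the derivation
    assert L(w) <= bound + 1e-12       # the bound holds at every iteration

print(f"mu = {mu:.3f}, beta = {beta:.3f}, final loss {L(w):.3e} <= bound {bound:.3e}")
</syntaxhighlight>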
If we don't observe convergence, we can't immediately conclude that the <math>\mu</math>-PL condition is violated.
It is possible that one of the other assumptions is violated instead (e.g. the learning rate is too large, as in the sketch below).
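As an illustration of this caveat (again a toy sketch with assumed constants, not from the original notes): on the same quadratic loss the <math>\mu</math>-PL condition still holds, but choosing a step size larger than <math>2/\beta</math> violates the learning-rate assumption and the loss grows instead of shrinking.
<syntaxhighlight lang="python">
import numpy as np

# Same toy quadratic as above: L(w) = 0.5 * w^T A w with SPD A.
# The mu-PL condition holds for this loss, but the step size below breaks the
# eta <= 1/beta assumption (it even exceeds 2/beta), so the iterates diverge.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.1 * np.eye(5)
beta = np.linalg.eigvalsh(A)[-1]       # smoothness constant (largest Hessian eigenvalue)

def L(w):
    return 0.5 * w @ A @ w

w = rng.standard_normal(5)
eta_bad = 2.5 / beta                   # too large: violates the learning-rate assumption
losses = [L(w)]
for _ in range(20):
    w = w - eta_bad * (A @ w)          # gradient descent step with the oversized step size
    losses.append(L(w))

print(f"initial loss {losses[0]:.3e}, loss after 20 steps {losses[-1]:.3e}")
</syntaxhighlight>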
==Misc==