Deep Learning: Difference between revisions

4,095 bytes added ,  15 October 2020
Line 1,109: Line 1,109:
==Min-max Optimization==
==Min-max Optimization==
Beginning of Lecture 14 (Oct. 15, 2020)
Beginning of Lecture 14 (Oct. 15, 2020)
<math>
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\VCdim}{VCdim}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\argmin}{argmin}
\DeclareMathOperator{\argmax}{argmax}
</math>
;Problem
<math>\min_{x \in X} \max_{y \in Y} f(x,y)</math>. 
* This is a zero-sum game.
* Assume <math>f</math> is smooth and differentiable.
;Goal
Find <math>(x^*, y^*)</math> which is a global solution, saddle point, or equilibrium
* <math>y^* \in \argmax f(x^*,y)</math>
* <math>x^* \in \argmin f(x, y^*)</math>
We know:
<math>f(x^*,y) \leq f(x^*, y^*) \leq f(x, y^*)</math>
===Simultaneous Gradient Descent (GDA)===
<math>
\begin{cases}
x_{t+1} = x_t - \eta \nabla_{x} f(x_t, y_t)\\
y_{t+1} = y_t + \eta \nabla_{y} f(x_t, y_t)
\end{cases}
</math>
;Notes
* <math>x, y \in \mathbb{R}^d</math>
* In the constrained case, we need to project back onto <math>S</math>.
Define <math>\theta = [x, y]</math>. 
Then we can write the above update equation as <math>\theta_{t+1} = \theta_{t} + \eta \overrightarrow{g}(\theta_t)</math> where
<math>\overrightarrow{g}(\theta_t) =
\begin{bmatrix}
- \nabla_x f(x_t, y_t)\\
\nabla_y f(x_t, y_t)
\end{bmatrix}
</math> 
Or in other words, <math>\theta_{t+1} = F(\theta_t)</math>. 
We want <math>F</math> to lead to <math>\theta^*</math> until <math>\theta^* = F(\theta^*)</math>.
Here, <math>\theta^*</math> is a fixed point of <math>F</math>.
{{hidden | Example |
Consider <math>\min_x \max_y f(x,y)</math> with <math>f(x,y) = xy</math>. 
his has a global saddle point at <math>(0,0)</math>. 
Our update rule is:
<math>
\begin{pmatrix}
x_{t+1} \\ y_{t+1}
\end{pmatrix}
=
\begin{pmatrix}
1 & -\eta\\
\eta & 1
\end{pmatrix}
\begin{pmatrix}
x_t \\ y_t
\end{pmatrix}
</math>
Does this converge? 
We can check the magnitudes: 
<math>
\begin{cases}
\Vert x_{t+1} \Vert^2 = \Vert x_t \Vert^2 + \eta^2 \Vert y_t \Vert^2 - 2 \eta x_t y_t\\
\Vert y_{t+1} \Vert^2 = \Vert y_t \Vert^2 + \eta^2 \Vert x_t \Vert^2 + 2 \eta x_y y_t
\end{cases}
</math> 
<math>
\Vert x_{t+1} \Vert^2 + \Vert y_{t+1} \Vert^2 = (1 + \eta^2) (\Vert x_t \Vert^2 + \Vert y_t \Vert^2)
</math>
Thus, each update gets further and further from the origin.
}}
===Convex-concave min-max===
The min-max theorem by [Von Neumann (1928)]. 
* Suppose <math>X,Y</math> be compact/convex.
* Suppose <math>f</math> is continuous and convex-concave.
Then:
* <math>\min_{x \in X} \max_{y \in Y} f(x,y) = \max_{y \in Y} \min_{x \in X} f(x,y)</math>
* min-max optimal point is unique if f is strictly convex-concave otherwise a convex-set of solutions exists.
;Notes
* If <math>f</math> is non-convex concave, it doesn't hold.
* bilinear core: <math>f(x,y) = x^t A y + b^t x + c^t y</math>
[Von Neumann-Dantzig 1947] show that there is a strong connection between the min-max theorem and strong <math>L_p</math> duality.
[Frennd & Schapive 199] show the convergence of sim GDA in an ''average sense'': 
Assume:
* f-> convex in x/concave in y 
* S-> convex/compact/bounded 
* <math>\eta \approx \frac{1}{\sqrt{T}</math> 
Then for Sim GDA: <math>f(\bar{x}, \bar{y}) \to f(x^*, y^*)</math> with order <math>O(1/\sqrt{T})</math> 
No guarantees exist for the last iteration.
===Optimistic GDA (OGDA)===
Gradient descent with negative momentum: 
<math>x_{t+1} = x_t - \eta \nabla f(x_t) + \frac{\eta}{2} \nabla f(x_{t-1})</math> 
This technique by [Popov 1980] helps the convergence stability of GD.
;OGDA
<math>
\begin{cases}
x_{t+1} = x_t - \eta \nabla_x f(x_t, y_t) + \frac{\eta}{2} \nabla_x f(x_{t-1},y_{t-1})\\
y_{t+1} = y_t + \eta \nabla_y f(x_t, y_t) - \frac{\eta}{2} \nabla_y f(x_{t-1},y_{t-1})
\end{cases}
</math>
{{hidden | Example |
<math>f(x, y) = xy</math>
Using OGDA 
<math>
\begin{cases}
x_{t+1} = x_t - \eta \nabla_x y_t + \frac{\eta}{2} \nabla_x y_{t-1}\\
y_{t+1} = y_t + \eta \nabla_y x_t - \frac{\eta}{2} \nabla_y x_{t-1}
\end{cases}
</math>
This small changes allows GDA to converge to <math>(0,0)</math>.
}}
OGDA has ''last iter'' convergence and linear rates for unconstrained bilinear min-max.
===Nonconvex-nonconcave min-max opt===


==Misc==
==Misc==