Deep Learning: Difference between revisions

← Older edit Newer edit →

@@ Line 1,109: / Line 1,109: @@
 ==Min-max Optimization==
 Beginning of Lecture 14 (Oct. 15, 2020)
+<math>
+\DeclareMathOperator{\Tr}{Tr}
+\DeclareMathOperator{\VCdim}{VCdim}
+\DeclareMathOperator{\sign}{sign}
+\DeclareMathOperator{\rank}{rank}
+\DeclareMathOperator{\argmin}{argmin}
+\DeclareMathOperator{\argmax}{argmax}
+</math>
+;Problem
+<math>\min_{x \in X} \max_{y \in Y} f(x,y)</math>.
+* This is a zero-sum game.
+* Assume <math>f</math> is smooth and differentiable.
+;Goal
+Find <math>(x^*, y^*)</math> which is a global solution, saddle point, or equilibrium
+* <math>y^* \in \argmax f(x^*,y)</math>
+* <math>x^* \in \argmin f(x, y^*)</math>
+We know:
+<math>f(x^*,y) \leq f(x^*, y^*) \leq f(x, y^*)</math>
+===Simultaneous Gradient Descent (GDA)===
+<math>
+\begin{cases}
+x_{t+1} = x_t - \eta \nabla_{x} f(x_t, y_t)\\
+y_{t+1} = y_t + \eta \nabla_{y} f(x_t, y_t)
+\end{cases}
+</math>
+;Notes
+* <math>x, y \in \mathbb{R}^d</math>
+* In the constrained case, we need to project back onto <math>S</math>.
+Define <math>\theta = [x, y]</math>.
+Then we can write the above update equation as <math>\theta_{t+1} = \theta_{t} + \eta \overrightarrow{g}(\theta_t)</math> where
+<math>\overrightarrow{g}(\theta_t) =
+\begin{bmatrix}
+- \nabla_x f(x_t, y_t)\\
+\nabla_y f(x_t, y_t)
+\end{bmatrix}
+</math>
+Or in other words, <math>\theta_{t+1} = F(\theta_t)</math>.
+We want <math>F</math> to lead to <math>\theta^*</math> until <math>\theta^* = F(\theta^*)</math>.
+Here, <math>\theta^*</math> is a fixed point of <math>F</math>.
+{{hidden | Example |
+Consider <math>\min_x \max_y f(x,y)</math> with <math>f(x,y) = xy</math>.
+his has a global saddle point at <math>(0,0)</math>.
+Our update rule is:
+<math>
+\begin{pmatrix}
+x_{t+1} \\ y_{t+1}
+\end{pmatrix}
+=
+\begin{pmatrix}
+& -\eta\\
+\eta & 1
+\end{pmatrix}
+\begin{pmatrix}
+x_t \\ y_t
+\end{pmatrix}
+</math>
+Does this converge?
+We can check the magnitudes:
+<math>
+\begin{cases}
+\Vert x_{t+1} \Vert^2 = \Vert x_t \Vert^2 + \eta^2 \Vert y_t \Vert^2 - 2 \eta x_t y_t\\
+\Vert y_{t+1} \Vert^2 = \Vert y_t \Vert^2 + \eta^2 \Vert x_t \Vert^2 + 2 \eta x_y y_t
+\end{cases}
+</math>
+<math>
+\Vert x_{t+1} \Vert^2 + \Vert y_{t+1} \Vert^2 = (1 + \eta^2) (\Vert x_t \Vert^2 + \Vert y_t \Vert^2)
+</math>
+Thus, each update gets further and further from the origin.
+}}
+===Convex-concave min-max===
+The min-max theorem by [Von Neumann (1928)].
+* Suppose <math>X,Y</math> be compact/convex.
+* Suppose <math>f</math> is continuous and convex-concave.
+Then:
+* <math>\min_{x \in X} \max_{y \in Y} f(x,y) = \max_{y \in Y} \min_{x \in X} f(x,y)</math>
+* min-max optimal point is unique if f is strictly convex-concave otherwise a convex-set of solutions exists.
+;Notes
+* If <math>f</math> is non-convex concave, it doesn't hold.
+* bilinear core: <math>f(x,y) = x^t A y + b^t x + c^t y</math>
+[Von Neumann-Dantzig 1947] show that there is a strong connection between the min-max theorem and strong <math>L_p</math> duality.
+[Frennd & Schapive 199] show the convergence of sim GDA in an ''average sense'':
+Assume:
+* f-> convex in x/concave in y
+* S-> convex/compact/bounded
+* <math>\eta \approx \frac{1}{\sqrt{T}</math>
+Then for Sim GDA: <math>f(\bar{x}, \bar{y}) \to f(x^*, y^*)</math> with order <math>O(1/\sqrt{T})</math>
+No guarantees exist for the last iteration.
+===Optimistic GDA (OGDA)===
+Gradient descent with negative momentum:
+<math>x_{t+1} = x_t - \eta \nabla f(x_t) + \frac{\eta}{2} \nabla f(x_{t-1})</math>
+This technique by [Popov 1980] helps the convergence stability of GD.
+;OGDA
+<math>
+\begin{cases}
+x_{t+1} = x_t - \eta \nabla_x f(x_t, y_t) + \frac{\eta}{2} \nabla_x f(x_{t-1},y_{t-1})\\
+y_{t+1} = y_t + \eta \nabla_y f(x_t, y_t) - \frac{\eta}{2} \nabla_y f(x_{t-1},y_{t-1})
+\end{cases}
+</math>
+{{hidden | Example |
+<math>f(x, y) = xy</math>
+Using OGDA
+<math>
+\begin{cases}
+x_{t+1} = x_t - \eta \nabla_x y_t + \frac{\eta}{2} \nabla_x y_{t-1}\\
+y_{t+1} = y_t + \eta \nabla_y x_t - \frac{\eta}{2} \nabla_y x_{t-1}
+\end{cases}
+</math>
+This small changes allows GDA to converge to <math>(0,0)</math>.
+}}
+OGDA has ''last iter'' convergence and linear rates for unconstrained bilinear min-max.
+===Nonconvex-nonconcave min-max opt===
 ==Misc==