==Min-max Optimization==
Beginning of Lecture 14 (Oct. 15, 2020)
<math>
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\VCdim}{VCdim}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\argmin}{argmin}
\DeclareMathOperator{\argmax}{argmax}
</math>
;Problem
<math>\min_{x \in X} \max_{y \in Y} f(x,y)</math>.
* This is a zero-sum game.
* Assume <math>f</math> is smooth and differentiable.
;Goal
Find <math>(x^*, y^*)</math> which is a global solution, saddle point, or equilibrium:
* <math>y^* \in \argmax_{y \in Y} f(x^*, y)</math>
* <math>x^* \in \argmin_{x \in X} f(x, y^*)</math>
At such a point we know:
<math>f(x^*, y) \leq f(x^*, y^*) \leq f(x, y^*) \quad \forall x \in X, y \in Y</math>
===Simultaneous Gradient Descent-Ascent (Sim GDA)===
<math>
\begin{cases}
x_{t+1} = x_t - \eta \nabla_{x} f(x_t, y_t)\\
y_{t+1} = y_t + \eta \nabla_{y} f(x_t, y_t)
\end{cases}
</math>
;Notes
* <math>x, y \in \mathbb{R}^d</math>
* In the constrained case, we need to project back onto the constraint set <math>S</math>.
Define <math>\theta = [x, y]</math>.
Then we can write the above update as <math>\theta_{t+1} = \theta_{t} + \eta \overrightarrow{g}(\theta_t)</math> where
<math>\overrightarrow{g}(\theta_t) =
\begin{bmatrix}
- \nabla_x f(x_t, y_t)\\
\nabla_y f(x_t, y_t)
\end{bmatrix}
</math>
Or, in other words, <math>\theta_{t+1} = F(\theta_t)</math>.
We want repeated application of <math>F</math> to drive the iterates to a point <math>\theta^*</math> satisfying <math>\theta^* = F(\theta^*)</math>.
Such a <math>\theta^*</math> is a fixed point of <math>F</math>.
{{hidden | Example |
Consider <math>\min_x \max_y f(x,y)</math> with <math>f(x,y) = xy</math>.
This has a global saddle point at <math>(0,0)</math>.
Our update rule is:
<math>
\begin{pmatrix}
x_{t+1} \\ y_{t+1}
\end{pmatrix}
=
\begin{pmatrix}
1 & -\eta\\
\eta & 1
\end{pmatrix}
\begin{pmatrix}
x_t \\ y_t
\end{pmatrix}
</math>
Does this converge?
We can check the magnitudes:
<math>
\begin{cases}
\Vert x_{t+1} \Vert^2 = \Vert x_t \Vert^2 + \eta^2 \Vert y_t \Vert^2 - 2 \eta x_t y_t\\
\Vert y_{t+1} \Vert^2 = \Vert y_t \Vert^2 + \eta^2 \Vert x_t \Vert^2 + 2 \eta x_t y_t
\end{cases}
</math>
The cross terms cancel when we sum:
<math>
\Vert x_{t+1} \Vert^2 + \Vert y_{t+1} \Vert^2 = (1 + \eta^2) (\Vert x_t \Vert^2 + \Vert y_t \Vert^2)
</math>
Thus, each update moves the iterate further from the origin, so Sim GDA diverges for any <math>\eta > 0</math>.
}}
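The divergence can also be checked numerically. Below is a minimal sketch (the step size and iteration count are arbitrary choices of ours) that runs Sim GDA on <math>f(x,y) = xy</math> and prints the growing squared norm:
<syntaxhighlight lang="python">
# Sim GDA on f(x, y) = x * y, which has its saddle point at (0, 0).
# Here grad_x f = y and grad_y f = x.
eta = 0.1
x, y = 1.0, 1.0

for t in range(100):
    gx, gy = y, x                      # gradients at the current iterate
    x, y = x - eta * gx, y + eta * gy  # simultaneous descent/ascent step
    if t % 20 == 0:
        print(t, x**2 + y**2)          # grows by a factor (1 + eta^2) per step

# The iterates spiral away from the saddle point instead of converging.
</syntaxhighlight>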
===Convex-concave min-max===
The min-max theorem by [Von Neumann (1928)]:
* Suppose <math>X, Y</math> are compact and convex.
* Suppose <math>f</math> is continuous and convex-concave (convex in <math>x</math>, concave in <math>y</math>).
Then:
* <math>\min_{x \in X} \max_{y \in Y} f(x,y) = \max_{y \in Y} \min_{x \in X} f(x,y)</math>
* The min-max optimal point is unique if <math>f</math> is strictly convex-concave; otherwise, a convex set of solutions exists.
;Notes
* If <math>f</math> is not convex-concave, the theorem does not hold in general.
* An important special case is the bilinear problem: <math>f(x,y) = x^T A y + b^T x + c^T y</math>.
[Von Neumann-Dantzig 1947] showed that there is a strong connection between the min-max theorem and strong LP (linear programming) duality.
[Freund & Schapire 1999] showed the convergence of Sim GDA in an ''average sense''.
Assume:
* <math>f</math> is convex in <math>x</math> and concave in <math>y</math>
* <math>S</math> is convex, compact, and bounded
* <math>\eta \approx \frac{1}{\sqrt{T}}</math>
Then for Sim GDA, the averaged iterates <math>\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t</math> and <math>\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t</math> satisfy <math>f(\bar{x}, \bar{y}) \to f(x^*, y^*)</math> at rate <math>O(1/\sqrt{T})</math>.
No guarantees exist for the last iterate.
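As an illustration, here is a minimal sketch (assuming NumPy; the horizon <math>T</math> and the unconstrained toy problem <math>f(x,y) = xy</math> are our own choices, and the projection onto <math>S</math> is omitted since the iterates drift only slowly over this horizon) showing that the averaged iterates approach the equilibrium <math>(0,0)</math> even though the last iterate does not:
<syntaxhighlight lang="python">
import numpy as np

T = 10_000
eta = 1.0 / np.sqrt(T)   # step size ~ 1/sqrt(T), as in the theorem
x, y = 1.0, 1.0
xs, ys = [], []

for t in range(T):
    gx, gy = y, x                      # gradients of f(x, y) = x * y
    x, y = x - eta * gx, y + eta * gy  # Sim GDA step
    xs.append(x)
    ys.append(y)

print("last iterate:    ", (x, y))                      # drifts away from (0, 0)
print("averaged iterate:", (np.mean(xs), np.mean(ys)))  # close to (0, 0)
</syntaxhighlight>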
===Optimistic GDA (OGDA)===
Gradient descent with negative momentum:
<math>x_{t+1} = x_t - \eta \nabla f(x_t) + \frac{\eta}{2} \nabla f(x_{t-1})</math>
This technique, due to [Popov 1980], improves the convergence stability of GD.
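A minimal sketch of this update on the toy objective <math>f(x) = x^2/2</math> (our own example; the step size is arbitrary), where the extra term partially cancels the previous step:
<syntaxhighlight lang="python">
# Gradient descent with negative momentum on f(x) = x^2 / 2, so grad f(x) = x.
eta = 0.1
x_prev, x = 1.0, 1.0  # the recursion needs the two most recent iterates

for t in range(200):
    x_prev, x = x, x - eta * x + (eta / 2) * x_prev

print(x)  # approaches the minimizer x = 0
</syntaxhighlight>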
;OGDA
<math>
\begin{cases}
x_{t+1} = x_t - \eta \nabla_x f(x_t, y_t) + \frac{\eta}{2} \nabla_x f(x_{t-1},y_{t-1})\\
y_{t+1} = y_t + \eta \nabla_y f(x_t, y_t) - \frac{\eta}{2} \nabla_y f(x_{t-1},y_{t-1})
\end{cases}
</math>
{{hidden | Example |
<math>f(x, y) = xy</math>
Using OGDA, since <math>\nabla_x f = y</math> and <math>\nabla_y f = x</math>:
<math>
\begin{cases}
x_{t+1} = x_t - \eta y_t + \frac{\eta}{2} y_{t-1}\\
y_{t+1} = y_t + \eta x_t - \frac{\eta}{2} x_{t-1}
\end{cases}
</math>
This small change allows GDA to converge to <math>(0,0)</math>.
}}
OGDA has ''last-iterate'' convergence with linear rates for unconstrained bilinear min-max problems.
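A minimal sketch (same toy problem <math>f(x,y) = xy</math> and step size as before) showing that the optimistic correction turns the divergent Sim GDA spiral into last-iterate convergence:
<syntaxhighlight lang="python">
# OGDA on f(x, y) = x * y, where grad_x f = y and grad_y f = x.
eta = 0.1
x_prev, y_prev = 1.0, 1.0  # iterate at time t - 1
x, y = 1.0, 1.0            # iterate at time t

for t in range(5000):
    x_next = x - eta * y + (eta / 2) * y_prev
    y_next = y + eta * x - (eta / 2) * x_prev
    x_prev, y_prev, x, y = x, y, x_next, y_next

print(x, y)  # the last iterate itself approaches the saddle point (0, 0)
</syntaxhighlight>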
===Nonconvex-nonconcave min-max opt===
==Misc==