\end{aligned}
</math>
}}
===Regularized GDA===
<math>\theta_{t+1} = \theta_t + \eta R(\theta_t) g(\theta_t)</math>
* <math>R(\theta)</math> is chosen so that it does not change the fixed points of the original problem.
* Suppose <math>R(\theta)</math> is invertible. Then <math>\theta^*</math> is a fixed point iff <math>g(\theta^*) = 0</math>.
Proof:
(->) If <math>\theta^*</math> is a fixed point, then <math>0 = \eta R(\theta^*) g(\theta^*)</math>, and since <math>R(\theta^*)</math> is invertible, <math>g(\theta^*) = 0</math>.
(<-) If <math>g(\theta^*)=0</math>, then <math>\theta^* + \eta R(\theta^*) g(\theta^*) = \theta^*</math>, so <math>\theta^*</math> is a fixed point.
For local stability, consider the Jacobian of the update: if <math>g(\theta^*)=0</math>, then <math>\tilde{J}(\theta^*) = I + \eta R(\theta^*) H(\theta^*)</math> (the term involving <math>\nabla R</math> vanishes since it multiplies <math>g(\theta^*)=0</math>), where <math>H = \partial g / \partial \theta</math>.
One choice is <math>R(\theta) = I - \gamma H^T</math>. Then
<math>R(\theta) H(\theta) = (I-\gamma H^T)H = H - \gamma H^T H</math>
and <math>\tilde{J} = I + \eta R H = I + \eta H - \eta \gamma H^T H</math>.
Since <math>-\gamma H^T H</math> is negative semi-definite, this regularization pushes the real parts of the eigenvalues of <math>RH</math> to be more negative, so that training is more stable.
This reduces <math>|Im(\lambda)|/|Re(\lambda)|</math>, but it may not necessarily help because it also increases <math>|Re(\lambda)|</math>, which can force a smaller step size.
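A minimal numerical sketch of this update on a toy bilinear game <math>f(x,y) = xy</math> (the game, <math>\gamma = 0.5</math>, and the step size below are assumed for illustration, not from the lecture): plain simultaneous GDA spirals away from the equilibrium at the origin because the eigenvalues of <math>H</math> are purely imaginary, while the regularized update with <math>R = I - \gamma H^T</math> shifts the real parts to <math>-\gamma</math> and converges.
<syntaxhighlight lang="python">
import numpy as np

# Toy bilinear game f(x, y) = x * y: min over x, max over y.
# GDA vector field g(theta) = (-df/dx, +df/dy) = (-y, x), with Jacobian H = dg/dtheta.
def g(theta):
    x, y = theta
    return np.array([-y, x])

H = np.array([[0.0, -1.0],
              [1.0,  0.0]])          # constant for this game; eigenvalues are +-i

def step(theta, eta, gamma):
    # Regularized GDA: theta <- theta + eta * (I - gamma * H^T) g(theta).
    # gamma = 0 recovers plain simultaneous GDA.
    R = np.eye(2) - gamma * H.T
    return theta + eta * R @ g(theta)

print(np.linalg.eigvals(H))                              # purely imaginary: GDA cycles/diverges
print(np.linalg.eigvals((np.eye(2) - 0.5 * H.T) @ H))    # real parts pushed to -gamma = -0.5

theta_gda = theta_reg = np.array([1.0, 1.0])
for _ in range(500):
    theta_gda = step(theta_gda, eta=0.1, gamma=0.0)
    theta_reg = step(theta_reg, eta=0.1, gamma=0.5)
print(np.linalg.norm(theta_gda), np.linalg.norm(theta_reg))  # GDA drifts outward, regularized -> 0
</syntaxhighlight>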
===Approximate Local Min-Max===
Under continuity and differentiability assumptions, the set of fixed points is guaranteed to be non-empty (by Brouwer's fixed point theorem); however, the set of local min-max points can be empty.
[Daskalakis & Panageas 2018, Adolphs et al. 2018]
In the unconstrained case, the set of strong local min-max points is a subset of the stable fixed points of GDA, which is in turn a subset of the stable fixed points of OGDA.
;Definition | |||
<math>(x^*, y^*)</math> is an <math>(\epsilon, \delta)</math> approximate local min-max if: | |||
<math>f(x^*, y) - \epsilon \leq f(x^*, y^*) \leq f(x, y^*) + \epsilon</math> | |||
<math>\forall (x,y) \in B_{\delta}(x^*, y^*)</math> | |||
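A quick numerical sanity check of this definition (a sketch only; the test function, candidate points, and tolerances below are assumed for illustration):
<syntaxhighlight lang="python">
import numpy as np

def is_approx_local_minmax(f, x_star, y_star, eps, delta, n_samples=10000, seed=0):
    # Sample points in the delta-ball around (x*, y*) and test the two inequalities
    # f(x*, y) - eps <= f(x*, y*) <= f(x, y*) + eps for every sampled (x, y).
    rng = np.random.default_rng(seed)
    f_star = f(x_star, y_star)
    for _ in range(n_samples):
        dx, dy = rng.normal(size=2)
        scale = rng.uniform(0, delta) / max(np.hypot(dx, dy), 1e-12)
        x, y = x_star + scale * dx, y_star + scale * dy
        if f(x_star, y) - eps > f_star or f_star > f(x, y_star) + eps:
            return False
    return True

f = lambda x, y: x * y                                             # toy game, equilibrium at (0, 0)
print(is_approx_local_minmax(f, 0.0, 0.0, eps=1e-3, delta=0.1))    # True
print(is_approx_local_minmax(f, 1.0, 1.0, eps=1e-3, delta=0.1))    # False
</syntaxhighlight>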
Suppose f is G-Lipschitz and L-smooth.
* For <math>\delta < \epsilon/G</math>, every point is an approximate local min-max: Lipschitzness gives <math>|f(x,y) - f(x^*,y^*)| \leq G\delta < \epsilon</math> for every point in the <math>\delta</math>-ball.
* For <math>\delta > \sqrt{2\epsilon/L}</math>, an approximate local min-max may not exist, and it is NP-hard to compute.
* For <math>\delta</math> in between, an approximate local min-max is guaranteed to exist; the computation is PPAD (believed to require super-polynomial time) but not NP-hard.
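As a concrete (hypothetical) illustration of the three regimes: with <math>G = 1</math>, <math>L = 2</math>, and <math>\epsilon = 0.01</math>, every point is an approximate local min-max for <math>\delta < \epsilon/G = 0.01</math>; existence may fail and is NP-hard for <math>\delta > \sqrt{2 \cdot 0.01 / 2} = 0.1</math>; and for <math>0.01 \leq \delta \leq 0.1</math> existence is guaranteed but finding one is PPAD.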
;So what? | |||
For non-convex ERM minimization, we use GD. Over-parameterization helped us in terms of global convergence.
Would over-parameterization help in solving non-convex concave min-max problems?
Yes. See the ICLR 2021 submission by anonymous authors, ''Understanding the role of over-parameterization in GANs''.
;Theorem | |||
Consider a WGAN problem. | |||
Suppose the generator is a one-hidden-layer network and the discriminator is linear.
<math>\min_{G} \max_{D} f(G,D)</math> is a non-convex concave optimization problem.
If the generator is sufficiently over-parameterized, then simultaneous GDA converges to a global min-max solution.
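A minimal sketch of this setting (the architecture sizes, toy data, step size, and WGAN weight clipping below are assumptions for illustration, not the paper's construction): a one-hidden-layer generator, a linear discriminator, and simultaneous GDA, where both players' gradients are evaluated at the same iterate before either is updated.
<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
d, k, width = 2, 4, 128                       # data dim, latent dim, hidden width (over-parameterized)
real = torch.randn(512, d) @ torch.tensor([[2.0, 0.3], [0.3, 1.0]])   # toy "real" data

W = torch.randn(width, k, requires_grad=True)                   # generator hidden layer
V = (torch.randn(d, width) / width**0.5).requires_grad_(True)   # generator output layer
w = torch.zeros(d, requires_grad=True)                          # linear discriminator D(x) = <w, x>

eta = 1e-2
for t in range(2000):
    z = torch.randn(512, k)
    fake = torch.relu(z @ W.T) @ V.T          # one-hidden-layer generator G(z) = V relu(W z)
    # WGAN objective f(G, D) = E[D(real)] - E[D(fake)]
    f_val = (real @ w).mean() - (fake @ w).mean()
    gW, gV, gw = torch.autograd.grad(f_val, [W, V, w])
    with torch.no_grad():                     # simultaneous updates from the same iterate
        W -= eta * gW                         # generator: descent on f
        V -= eta * gV
        w += eta * gw                         # discriminator: ascent on f
        w.clamp_(-1.0, 1.0)                   # crude Lipschitz constraint via weight clipping
</syntaxhighlight>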
==Misc==