\end{aligned}
</math>
}}
===Regularized GDA===
<math>\theta_{t+1} = \theta_t + \eta R(\theta_t) g(\theta_t)</math>
* <math>R(\theta)</math> is chosen so that it does not change the fixed points of the original problem.
* Suppose <math>R(\theta)</math> is invertible. Then <math>\theta^*</math> is a fixed point iff <math>g(\theta^*) = 0</math>.
Proof:
(->) If <math>\theta^*</math> is a fixed point, then <math>0 = \eta R(\theta^*) g(\theta^*)</math>, and since <math>R(\theta^*)</math> is invertible, <math>g(\theta^*) = 0</math>.
(<-) If <math>g(\theta^*)=0</math>, then <math>\theta^* + \eta R(\theta^*) g(\theta^*) = \theta^*</math>, so <math>\theta^*</math> is a fixed point.
For local stability, consider the Jacobian of the update: if <math>g(\theta^*)=0</math>, then <math>\tilde{J}(\theta^*) = I + \eta R(\theta^*) H(\theta^*)</math> (the term involving <math>\nabla R</math> vanishes since it multiplies <math>g(\theta^*)=0</math>), where <math>H = \partial g / \partial \theta</math>.
One choice is <math>R(\theta) = I - \gamma H^T</math>. Then
<math>R(\theta) H(\theta) = (I-\gamma H^T)H = H - \gamma H^T H</math>
and <math>\tilde{J} = I + \eta R H = I + \eta H - \eta \gamma H^T H</math>.
Since <math>-\gamma H^T H</math> is negative semi-definite, this regularization pushes the real parts of the eigenvalues of <math>RH</math> to be more negative, so that training is more stable.
This reduces <math>|Im(\lambda)|/|Re(\lambda)|</math>, but it may not necessarily help because it also increases <math>|Re(\lambda)|</math>, which can force a smaller step size.
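A minimal numerical sketch of this update on a toy bilinear game <math>f(x,y) = xy</math> (the game, <math>\gamma = 0.5</math>, and the step size below are assumed for illustration, not from the lecture): plain simultaneous GDA spirals away from the equilibrium at the origin because the eigenvalues of <math>H</math> are purely imaginary, while the regularized update with <math>R = I - \gamma H^T</math> shifts the real parts to <math>-\gamma</math> and converges.
<syntaxhighlight lang="python">
import numpy as np

# Toy bilinear game f(x, y) = x * y: min over x, max over y.
# GDA vector field g(theta) = (-df/dx, +df/dy) = (-y, x), with Jacobian H = dg/dtheta.
def g(theta):
    x, y = theta
    return np.array([-y, x])

H = np.array([[0.0, -1.0],
              [1.0,  0.0]])          # constant for this game; eigenvalues are +-i

def step(theta, eta, gamma):
    # Regularized GDA: theta <- theta + eta * (I - gamma * H^T) g(theta).
    # gamma = 0 recovers plain simultaneous GDA.
    R = np.eye(2) - gamma * H.T
    return theta + eta * R @ g(theta)

print(np.linalg.eigvals(H))                              # purely imaginary: GDA cycles/diverges
print(np.linalg.eigvals((np.eye(2) - 0.5 * H.T) @ H))    # real parts pushed to -gamma = -0.5

theta_gda = theta_reg = np.array([1.0, 1.0])
for _ in range(500):
    theta_gda = step(theta_gda, eta=0.1, gamma=0.0)
    theta_reg = step(theta_reg, eta=0.1, gamma=0.5)
print(np.linalg.norm(theta_gda), np.linalg.norm(theta_reg))  # GDA drifts outward, regularized -> 0
</syntaxhighlight>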
===Approximate Local Min-Max===
Under continuity and differentiability assumptions, the set of fixed points is guaranteed to be non-empty (by Brouwer's fixed point theorem); however, the set of local min-max points can be empty.
[Daskalakis & Panageas 2018, Adolphs et al. 2018]
In the unconstrained case, the set of strong local min-max points is a subset of the stable fixed points of GDA, which is in turn a subset of the stable fixed points of OGDA.
;Definition | |||
<math>(x^*, y^*)</math> is an <math>(\epsilon, \delta)</math> approximate local min-max if: | |||
<math>f(x^*, y) - \epsilon \leq f(x^*, y^*) \leq f(x, y^*) + \epsilon</math> | |||
<math>\forall (x,y) \in B_{\delta}(x^*, y^*)</math> | |||
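A quick numerical sanity check of this definition (a sketch only; the test function, candidate points, and tolerances below are assumed for illustration):
<syntaxhighlight lang="python">
import numpy as np

def is_approx_local_minmax(f, x_star, y_star, eps, delta, n_samples=10000, seed=0):
    # Sample points in the delta-ball around (x*, y*) and test the two inequalities
    # f(x*, y) - eps <= f(x*, y*) <= f(x, y*) + eps for every sampled (x, y).
    rng = np.random.default_rng(seed)
    f_star = f(x_star, y_star)
    for _ in range(n_samples):
        dx, dy = rng.normal(size=2)
        scale = rng.uniform(0, delta) / max(np.hypot(dx, dy), 1e-12)
        x, y = x_star + scale * dx, y_star + scale * dy
        if f(x_star, y) - eps > f_star or f_star > f(x, y_star) + eps:
            return False
    return True

f = lambda x, y: x * y                                             # toy game, equilibrium at (0, 0)
print(is_approx_local_minmax(f, 0.0, 0.0, eps=1e-3, delta=0.1))    # True
print(is_approx_local_minmax(f, 1.0, 1.0, eps=1e-3, delta=0.1))    # False
</syntaxhighlight>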
Suppose f is G-Lipschitz and L-smooth.
* For <math>\delta < \epsilon/G</math>, every point is an approximate local min-max: Lipschitzness gives <math>|f(x,y) - f(x^*,y^*)| \leq G\delta < \epsilon</math> for every point in the <math>\delta</math>-ball.
* For <math>\delta > \sqrt{2\epsilon/L}</math>, an approximate local min-max may not exist, and it is NP-hard to compute.
* For <math>\delta</math> in between, an approximate local min-max is guaranteed to exist; the computation is PPAD (believed to require super-polynomial time) but not NP-hard.
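As a concrete (hypothetical) illustration of the three regimes: with <math>G = 1</math>, <math>L = 2</math>, and <math>\epsilon = 0.01</math>, every point is an approximate local min-max for <math>\delta < \epsilon/G = 0.01</math>; existence may fail and is NP-hard for <math>\delta > \sqrt{2 \cdot 0.01 / 2} = 0.1</math>; and for <math>0.01 \leq \delta \leq 0.1</math> existence is guaranteed but finding one is PPAD.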
;So what? | |||
For non-convex ERM minimization, we use GD. Over-parameterization helped us in terms of global convergence.
Would over-parameterization help in solving non-convex concave min-max problems?
Yes. See the ICLR 2021 submission by anonymous authors, ''Understanding the role of over-parameterization in GANs''.
;Theorem | |||
Consider a WGAN problem. | |||
Suppose the generator is a one-hidden-layer network and the discriminator is linear.
<math>\min_{G} \max_{D} f(G,D)</math> is a non-convex concave optimization problem.
If the generator is sufficiently over-parameterized, then simultaneous GDA converges to a global min-max solution.
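A minimal sketch of this setting (the architecture sizes, toy data, step size, and WGAN weight clipping below are assumptions for illustration, not the paper's construction): a one-hidden-layer generator, a linear discriminator, and simultaneous GDA, where both players' gradients are evaluated at the same iterate before either is updated.
<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
d, k, width = 2, 4, 128                       # data dim, latent dim, hidden width (over-parameterized)
real = torch.randn(512, d) @ torch.tensor([[2.0, 0.3], [0.3, 1.0]])   # toy "real" data

W = torch.randn(width, k, requires_grad=True)                   # generator hidden layer
V = (torch.randn(d, width) / width**0.5).requires_grad_(True)   # generator output layer
w = torch.zeros(d, requires_grad=True)                          # linear discriminator D(x) = <w, x>

eta = 1e-2
for t in range(2000):
    z = torch.randn(512, k)
    fake = torch.relu(z @ W.T) @ V.T          # one-hidden-layer generator G(z) = V relu(W z)
    # WGAN objective f(G, D) = E[D(real)] - E[D(fake)]
    f_val = (real @ w).mean() - (fake @ w).mean()
    gW, gV, gw = torch.autograd.grad(f_val, [W, V, w])
    with torch.no_grad():                     # simultaneous updates from the same iterate
        W -= eta * gW                         # generator: descent on f
        V -= eta * gV
        w += eta * gw                         # discriminator: ascent on f
        w.clamp_(-1.0, 1.0)                   # crude Lipschitz constraint via weight clipping
</syntaxhighlight>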
==Misc==