Learning: We select <math>p</math> features <math>T \subseteq [d]</math>, <math>|T| = p</math>, and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math>, <math>\beta^*_{T^c} = 0</math>. Here <math>T^c</math> is the set of features we are not using.
 
Define <math>X_{T} =
\begin{pmatrix}
(x^{(1)}_T)^t\\
\vdots\\
(x^{(n)}_T)^t
\end{pmatrix} \in \mathbb{R}^{n \times p}
</math>
 
Quadratic loss: <math>\min_{\beta_T} \Vert X_T \beta_T - y \Vert_{2}^{2} \in \mathbb{R}</math>. 
The optimal solution is <math>\beta_{T}^* = (X_{T}^t X_{T})^{-1} X_{T}^t y = X_{T}^{+} y</math> (when <math>X_{T}^t X_{T}</math> is invertible), where <math>X_{T}^{+}</math> is the ''Moore–Penrose pseudo-inverse''; for <math>p > n</math>, <math>X_{T}^{+} y</math> is the minimum-norm solution.
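
A minimal NumPy sketch of this fit (not from the original notes): it assumes the Gaussian design <math>x \sim N(0, I_d)</math> and noiseless labels <math>y = x^t \beta</math> used in the derivation below, and all sizes and variable names are illustrative.

<syntaxhighlight lang="python">
# Sketch: fit beta_T^* = X_T^+ y on a subset T of p features.
# Assumed data model: x ~ N(0, I_d), y = x^t beta (noiseless).
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 100, 20                  # samples, total features, selected features
beta = 1.0 / np.arange(1, d + 1)       # example true coefficients
X = rng.standard_normal((n, d))        # rows are (x^(i))^t
y = X @ beta                           # noiseless labels

T = np.arange(p)                       # selected feature indices (first p features)
X_T = X[:, T]

# Moore-Penrose pseudo-inverse: least-squares solution when p <= n,
# minimum-norm interpolating solution when p > n.
beta_T_star = np.linalg.pinv(X_T) @ y
</syntaxhighlight>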
 
Since we know <math>P_{X, Y}</math>, we can compute the generalization error exactly. 
<math>
\begin{aligned}
E_{X,Y} \left[(y - x^t \beta^*)^2 \right] &= E \left[(x^t(\beta - \beta^*))^2\right]\\
&= E \left[(\beta - \beta^*)^t x x^t (\beta - \beta^*)\right]\\
&= (\beta - \beta^*)^t E \left[ x x^t \right] (\beta - \beta^*)\\
&= \Vert \beta - \beta^* \Vert^2 \qquad \text{(since } E\left[ x x^t \right] = I)\\
&= \Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2
\end{aligned}
</math>
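
Continuing the sketch above, a quick Monte Carlo check (an illustration, not part of the notes) that the empirical generalization error matches <math>\Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2</math>:

<syntaxhighlight lang="python">
# Continues the previous sketch (reuses beta, beta_T_star, d, p, T, rng).
beta_hat = np.zeros(d)
beta_hat[T] = beta_T_star

X_test = rng.standard_normal((100_000, d))   # fresh samples from P_X
y_test = X_test @ beta
empirical = np.mean((y_test - X_test @ beta_hat) ** 2)

closed_form = np.sum(beta[p:] ** 2) + np.sum((beta[:p] - beta_T_star) ** 2)
print(empirical, closed_form)                # the two numbers should agree closely
</syntaxhighlight>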
 
;Theorem: Assume <math>\beta_{T^c} \neq 0</math>. Then
<math>
E \left[(y - x^t \beta^*)^2 \right] =
\begin{cases}
\Vert \beta_{T^c} \Vert^2 (1 + \frac{p}{n-p-1}) & p \leq n-2\\
+\infty & n-1 \leq p \leq n+1\\
\Vert \beta_{T} \Vert ^2  (1 - \frac{n}{p}) + \Vert \beta_{T^c} \Vert^2 (1 + \frac{n}{p-n-1}) & p \geq n+2
\end{cases}
</math>
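
A hedged transcription of the theorem into code (the helper name <code>risk</code> is made up here); it assumes, as in the prescient selection described next, that <math>T</math> consists of the first <math>p</math> coordinates:

<syntaxhighlight lang="python">
import numpy as np

def risk(beta, p, n):
    """Exact generalization error from the theorem, with T = {1, ..., p}."""
    norm_T = np.sum(beta[:p] ** 2)     # ||beta_T||^2
    norm_Tc = np.sum(beta[p:] ** 2)    # ||beta_{T^c}||^2
    if p <= n - 2:
        return norm_Tc * (1 + p / (n - p - 1))
    if p >= n + 2:
        return norm_T * (1 - n / p) + norm_Tc * (1 + n / (p - n - 1))
    return np.inf                      # n-1 <= p <= n+1
</syntaxhighlight>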
 
With ''prescient'' feature selection we include features in <math>T</math> in decreasing order of <math>\beta_j^2</math>; for example, with <math>\beta_j^2 = \frac{1}{j^2}</math> we take <math>T = \{1, \dots, p\}</math>. Plotting the exact risk above as a function of <math>p</math> then shows double-descent behavior; see the sketch below.
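
A small sketch (with made-up sizes) that evaluates this risk for <math>\beta_j^2 = \frac{1}{j^2}</math> using the <code>risk</code> helper above; the printed values trace the double-descent shape:

<syntaxhighlight lang="python">
n, d = 40, 200
beta = 1.0 / np.arange(1, d + 1)       # beta_j = 1/j, so beta_j^2 = 1/j^2
for p in [10, 20, 30, 38, 41, 45, 50, 80, 200]:
    print(p, risk(beta, p, n))
# The risk is finite and U-shaped for p <= n-2, blows up around p = n,
# and drops again once p exceeds n+1: the double-descent shape.
</syntaxhighlight>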


==Misc==