\end{pmatrix}
</math>
Learning: We select <math>p</math> features: <math>T \subseteq [d]</math>, <math>|T| = p</math>, and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math> with <math>\beta^*_{T^c} = 0</math>. Here <math>T^c</math> is the set of features we are not using.
Define <math>X_{T} =
\begin{pmatrix}
(x^{(1)}_T)^t\\
\vdots\\
(x^{(n)}_T)^t
\end{pmatrix} \in \mathbb{R}^{n \times p}.
</math>
Quadratic loss: <math>\min_{\beta_T} \Vert X_T \beta_T - y \Vert_{2}^{2} \in \mathbb{R}</math>.
The optimal solution is <math>\beta_{T}^* = (X_{T}^t X_{T})^{-1} X_{T}^t y = X_{T}^{+} y</math>, where <math>X_{T}^{+}</math> is the ''Moore–Penrose pseudo-inverse''. When <math>X_{T}^t X_{T}</math> is not invertible (in particular when <math>p > n</math>), <math>X_{T}^{+} y</math> is the minimum-norm solution.
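A minimal numpy sketch of this fit, assuming an isotropic design and noiseless targets; the sizes, the choice of <math>T</math>, and the coefficient vector are illustrative assumptions, not part of the notes:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

d, n, p = 50, 30, 10                 # ambient dimension, samples, selected features (illustrative)
beta = 1.0 / np.arange(1, d + 1)     # assumed ground-truth coefficients
X = rng.standard_normal((n, d))      # rows x^(i) drawn with E[x x^t] = I_d
y = X @ beta                         # noiseless targets y = x^t beta

T = np.arange(p)                     # selected feature indices (here simply the first p)
X_T = X[:, T]                        # n x p design restricted to T

# Least-squares fit on the selected features via the Moore-Penrose pseudo-inverse;
# for p > n this is the minimum-norm interpolating solution.
beta_T_star = np.linalg.pinv(X_T) @ y

beta_star = np.zeros(d)              # full-length estimate with beta*_{T^c} = 0
beta_star[T] = beta_T_star
</syntaxhighlight>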
Since we know <math>P_{X, Y}</math>, we can compute the generalization error exactly.
<math>
\begin{aligned}
E_{X,Y} \left[(y - x^t \beta^*)^2 \right] &= E \left[(x^t(\beta - \beta^*))^2\right]\\
&= E \left[(\beta - \beta^*)^t x x^t (\beta - \beta^*)\right]\\
&= (\beta - \beta^*)^t E \left[ x x^t \right] (\beta - \beta^*)\\
&= \Vert \beta - \beta^* \Vert^2 \qquad \text{(using } E[x x^t] = I_d\text{)}\\
&= \Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2
\end{aligned}
</math>
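Continuing the sketch above, the risk can be evaluated from this decomposition and checked against a Monte Carlo estimate (again assuming <math>E[x x^t] = I_d</math> and <math>y = x^t \beta</math>):
<syntaxhighlight lang="python">
# Exact risk from the decomposition ||beta_{T^c}||^2 + ||beta_T - beta_T^*||^2
Tc = np.setdiff1d(np.arange(d), T)
risk_exact = np.sum(beta[Tc] ** 2) + np.sum((beta[T] - beta_T_star) ** 2)

# Monte Carlo check: average squared error on freshly drawn (x, y) pairs
X_test = rng.standard_normal((200_000, d))
y_test = X_test @ beta
risk_mc = np.mean((y_test - X_test @ beta_star) ** 2)

print(risk_exact, risk_mc)           # the two values agree up to Monte Carlo error
</syntaxhighlight>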
;Theorem: If <math>\beta_{T^c} \neq 0</math>, then
<math>
E \left[(y - x^t \beta^*)^2 \right] =
\begin{cases}
\Vert \beta_{T^c} \Vert^2 \left(1 + \frac{p}{n-p-1}\right) & p \leq n-2\\
+\infty & n-1 \leq p \leq n+1\\
\Vert \beta_{T} \Vert^2 \left(1 - \frac{n}{p}\right) + \Vert \beta_{T^c} \Vert^2 \left(1 + \frac{n}{p-n-1}\right) & p \geq n+2
\end{cases}
</math>
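A sketch of the theorem's closed-form risk as a function of <math>p</math>; it assumes <math>T</math> consists of the first <math>p</math> coordinates of the coefficient vector passed in, and the function name is illustrative:
<syntaxhighlight lang="python">
import numpy as np

def closed_form_risk(beta, p, n):
    """Risk E[(y - x^t beta*)^2] from the theorem, with T = first p coordinates of beta."""
    head = np.sum(beta[:p] ** 2)     # ||beta_T||^2
    tail = np.sum(beta[p:] ** 2)     # ||beta_{T^c}||^2
    if p <= n - 2:
        return tail * (1 + p / (n - p - 1))
    if p >= n + 2:
        return head * (1 - n / p) + tail * (1 + n / (p - n - 1))
    return np.inf                    # n-1 <= p <= n+1
</syntaxhighlight>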
Alternatively, with ''prescient'' feature selection we include features in <math>T</math> in decreasing order of <math>\beta_j^2</math>, for example <math>\beta_j^2 = \frac{1}{j^2}</math>. The resulting generalization error as a function of <math>p</math> then shows a double-descent behavior (see the sketch below).
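Using the function above with <math>\beta_j^2 = 1/j^2</math> (already in decreasing order) and illustrative sizes <math>d</math> and <math>n</math>:
<syntaxhighlight lang="python">
d, n = 200, 40                                   # illustrative sizes
beta = 1.0 / np.arange(1, d + 1)                 # beta_j^2 = 1/j^2, prescient order

risks = [closed_form_risk(beta, p, n) for p in range(1, d + 1)]

# The risk follows the classical U-shape for p <= n-2, blows up around p ~ n,
# and decreases again for p >= n+2: a double-descent curve in p.
for p in (10, n - 2, n, n + 2, 100, d):
    print(p, risks[p - 1])
</syntaxhighlight>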
==Misc==