Learning: We select <math>p</math> features <math>T \subseteq [d]</math>, <math>|T| = p</math>, and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math>, <math>\beta^*_{T^c} = 0</math>. Here <math>T^c</math> is the set of features we are not using.
 
Define <math>X_{T} =
\begin{pmatrix}
(x^{(1)}_T)^t\\
\vdots\\
(x^{(n)}_T)^t
\end{pmatrix} \in \mathbb{R}^{n \times p}
</math>
 
Quadratic loss: <math>\min_{\beta_T} \Vert X_T \beta_T - y \Vert_{2}^{2} \in \mathbb{R}</math>. 
The optimal solution is <math>\beta_{T}^* = (X_{T}^t X_{T})^{-1} X_{T}^t y = X_{T}^{+} y</math> (when <math>X_{T}^t X_{T}</math> is invertible), where <math>X_{T}^{+}</math> is the ''Moore–Penrose pseudo-inverse''; for <math>p > n</math>, <math>X_{T}^{+} y</math> is the minimum-norm solution.
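
A minimal NumPy sketch of this fit (not from the original notes): it assumes the Gaussian design <math>x \sim N(0, I_d)</math> and noiseless labels <math>y = x^t \beta</math> used in the derivation below, and all sizes and variable names are illustrative.

<syntaxhighlight lang="python">
# Sketch: fit beta_T^* = X_T^+ y on a subset T of p features.
# Assumed data model: x ~ N(0, I_d), y = x^t beta (noiseless).
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 100, 20                  # samples, total features, selected features
beta = 1.0 / np.arange(1, d + 1)       # example true coefficients
X = rng.standard_normal((n, d))        # rows are (x^(i))^t
y = X @ beta                           # noiseless labels

T = np.arange(p)                       # selected feature indices (first p features)
X_T = X[:, T]

# Moore-Penrose pseudo-inverse: least-squares solution when p <= n,
# minimum-norm interpolating solution when p > n.
beta_T_star = np.linalg.pinv(X_T) @ y
</syntaxhighlight>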
 
Since we know <math>P_{X, Y}</math>, we can compute the generalization error exactly. 
<math>
\begin{aligned}
E_{X,Y} \left[(y - x^t \beta^*)^2 \right] &= E \left[(x^t(\beta - \beta^*))^2\right]\\
&= E \left[(\beta - \beta^*)^t x x^t (\beta - \beta^*)\right]\\
&= (\beta - \beta^*)^t E \left[ x x^t \right] (\beta - \beta^*)\\
&= \Vert \beta - \beta^* \Vert^2 \qquad \text{(since } E\left[ x x^t \right] = I)\\
&= \Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2
\end{aligned}
</math>
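
Continuing the sketch above, a quick Monte Carlo check (an illustration, not part of the notes) that the empirical generalization error matches <math>\Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2</math>:

<syntaxhighlight lang="python">
# Continues the previous sketch (reuses beta, beta_T_star, d, p, T, rng).
beta_hat = np.zeros(d)
beta_hat[T] = beta_T_star

X_test = rng.standard_normal((100_000, d))   # fresh samples from P_X
y_test = X_test @ beta
empirical = np.mean((y_test - X_test @ beta_hat) ** 2)

closed_form = np.sum(beta[p:] ** 2) + np.sum((beta[:p] - beta_T_star) ** 2)
print(empirical, closed_form)                # the two numbers should agree closely
</syntaxhighlight>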
 
;Theorem: Assume <math>\beta_{T^c} \neq 0</math>. Then
<math>
E \left[(y - x^t \beta^*)^2 \right] =
\begin{cases}
\Vert \beta_{T^c} \Vert^2 (1 + \frac{p}{n-p-1}) & p \leq n-2\\
+\infty & n-1 \leq p \leq n+1\\
\Vert \beta_{T} \Vert ^2  (1 - \frac{n}{p}) + \Vert \beta_{T^c} \Vert^2 (1 + \frac{n}{p-n-1}) & p \geq n+2
\end{cases}
</math>
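
A hedged transcription of the theorem into code (the helper name <code>risk</code> is made up here); it assumes, as in the prescient selection described next, that <math>T</math> consists of the first <math>p</math> coordinates:

<syntaxhighlight lang="python">
import numpy as np

def risk(beta, p, n):
    """Exact generalization error from the theorem, with T = {1, ..., p}."""
    norm_T = np.sum(beta[:p] ** 2)     # ||beta_T||^2
    norm_Tc = np.sum(beta[p:] ** 2)    # ||beta_{T^c}||^2
    if p <= n - 2:
        return norm_Tc * (1 + p / (n - p - 1))
    if p >= n + 2:
        return norm_T * (1 - n / p) + norm_Tc * (1 + n / (p - n - 1))
    return np.inf                      # n-1 <= p <= n+1
</syntaxhighlight>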
 
With ''prescient'' feature selection we include features in <math>T</math> in decreasing order of <math>\beta_j^2</math>; for example, with <math>\beta_j^2 = \frac{1}{j^2}</math> we take <math>T = \{1, \dots, p\}</math>. Plotting the exact risk above as a function of <math>p</math> then shows double-descent behavior; see the sketch below.
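
A small sketch (with made-up sizes) that evaluates this risk for <math>\beta_j^2 = \frac{1}{j^2}</math> using the <code>risk</code> helper above; the printed values trace the double-descent shape:

<syntaxhighlight lang="python">
n, d = 40, 200
beta = 1.0 / np.arange(1, d + 1)       # beta_j = 1/j, so beta_j^2 = 1/j^2
for p in [10, 20, 30, 38, 41, 45, 50, 80, 200]:
    print(p, risk(beta, p, n))
# The risk is finite and U-shaped for p <= n-2, blows up around p = n,
# and drops again once p exceeds n+1: the double-descent shape.
</syntaxhighlight>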


==Misc==