Machine Learning
==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
If our model is linear regression <math>h(x)=w^Tx</math> then this is convex.<br>
{{hidden|Proof|
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_w J(w) &= \sum 2(w^T x^{(i)} - y^{(i)})x^{(i)}\\
\implies \nabla^2_w J(w) &= \sum 2 x^{(i)} (x^{(i)})^T
\end{align}
</math><br>
which is a PSD matrix
}}
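;Example
A minimal numpy sketch (the toy data, weights, and function names are made-up placeholders) of the squared error and its gradient for a linear model <math>h(x)=w^Tx</math>:
<pre>
import numpy as np

# made-up toy data: 5 examples, 3 features
X = np.random.randn(5, 3)
y = np.random.randn(5)
w = np.zeros(3)

def squared_error(w, X, y):
    # J(w) = sum_i (w^T x^(i) - y^(i))^2
    r = X @ w - y
    return np.sum(r ** 2)

def squared_error_grad(w, X, y):
    # gradient: 2 * sum_i (w^T x^(i) - y^(i)) x^(i)
    return 2 * X.T @ (X @ w - y)

print(squared_error(w, X, y))
print(squared_error_grad(w, X, y))
</pre>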
===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)})g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
&= \sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
\end{align}
</math><br>
which is a PSD matrix
}}
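;Example
As a sanity check on the gradient derived above, a small numpy sketch (toy data and names are illustrative assumptions) of the cross entropy loss and its gradient <math>-\sum (y^{(i)} - g(\theta^Tx^{(i)}))x^{(i)}</math> for logistic regression:
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -sum_i [y log g(theta^T x) + (1 - y) log(1 - g(theta^T x))]
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(theta, X, y):
    # gradient from the proof: -sum_i (y^(i) - g(theta^T x^(i))) x^(i)
    return -X.T @ (y - sigmoid(X @ theta))

# made-up toy data: 4 examples, 2 features, binary labels
X = np.array([[0.5, 1.0], [-1.0, 0.2], [1.5, -0.3], [0.1, 0.8]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.zeros(2)
print(cross_entropy(theta, X, y))
print(cross_entropy_grad(theta, X, y))
</pre>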
===Hinge Loss===
update using above gradient
</pre>
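A possible concrete version of the update above, as a minibatch gradient descent loop in numpy; the learning rate, batch size, and grad_fn here are placeholder choices, not fixed by these notes:
<pre>
import numpy as np

def minibatch_gd(grad_fn, theta, X, y, lr=0.01, batch_size=32, epochs=10):
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # current minibatch
            # update using the gradient on the minibatch
            theta = theta - lr * grad_fn(theta, X[idx], y[idx])
    return theta

# e.g. with the cross entropy gradient sketched earlier:
# theta = minibatch_gd(cross_entropy_grad, np.zeros(2), X, y)
</pre>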
;Batch Size
* [https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A Medium post empirically evaluating the effect of batch_size]
===Coordinate Block Descent===
Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then<br>
<math>
\begin{align}
\mathbf{v}^T \mathbf{K} \mathbf{v} &= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j \sum_k v_{i}\phi_k(\mathbf{x}^{(i)})\phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))(\sum_j v_{j} \phi_k(\mathbf{x}^{(j)}))\\
&= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
&\geq 0
\end{align}
</math>
}}
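A quick numerical illustration (random made-up features, numpy assumed): the Gram matrix <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math> has no negative eigenvalues, up to floating-point error:
<pre>
import numpy as np

np.random.seed(0)
Phi = np.random.randn(10, 4)   # row i is the feature vector phi(x^(i))
K = Phi @ Phi.T                # K_ij = phi(x^(i))^T phi(x^(j))

print(np.linalg.eigvalsh(K).min())   # >= 0 up to floating-point error
v = np.random.randn(10)
print(v @ K @ v)                     # v^T K v >= 0
</pre>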
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
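For example, a made-up simulation with fair Bernoulli variables (so <math>a=0, b=1</math>); the empirical tail probability should stay below the Hoeffding bound:
<pre>
import numpy as np

n, t, trials = 100, 0.1, 20000
np.random.seed(0)
X = np.random.binomial(1, 0.5, size=(trials, n))   # X_i bounded in [0, 1]
dev = np.abs(X.mean(axis=1) - 0.5)                  # |Xbar - E[Xbar]| per trial

empirical = np.mean(dev >= t)
bound = 2 * np.exp(-2 * n * t ** 2 / (1 - 0) ** 2)
print(empirical, bound)   # empirical tail probability vs. Hoeffding bound
</pre>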