Machine Learning
==Hyperparameters==
===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
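A minimal NumPy sketch (an illustration, not taken from the linked post) of where <code>batch_size</code> enters a mini-batch SGD loop; the linear model, squared-error gradient, learning rate, and epoch count are assumptions for the example.
<pre>
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10):
    # Linear model w trained on the squared-error loss; batch_size controls
    # how many samples contribute to each gradient estimate.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w
</pre>
Smaller batch sizes give noisier but more frequent updates, while larger ones give smoother but costlier gradient estimates, which is roughly the trade-off the linked post measures empirically.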


==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
If our model is linear regression, <math>h(x)=w^Tx</math>, then this is convex.<br>
{{hidden|Proof|
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_w J(w) &= \nabla_w \sum |w^T x^{(i)} - y^{(i)}|^2\\
&= \sum 2(w^T x^{(i)} - y^{(i)})x^{(i)}\\
\implies \nabla^2_w J(w) &= \sum 2 x^{(i)} (x^{(i)})^T
\end{align}
</math><br>
which is a PSD matrix
}}
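A short NumPy check (illustrative, not part of the proof) that the squared-error Hessian <math>2\sum_i x^{(i)}(x^{(i)})^T</math> of a linear model is positive semi-definite on random data; the data shapes are arbitrary assumptions.
<pre>
import numpy as np

# Hessian of J(w) = sum_i |w^T x_i - y_i|^2 for a linear model is 2 * sum_i x_i x_i^T.
X = np.random.randn(100, 5)                    # 100 samples, 5 features (arbitrary)
H = 2 * X.T @ X                                # equals 2 * sum_i x_i x_i^T
print(np.linalg.eigvalsh(H).min() >= -1e-8)    # True: all eigenvalues are non-negative
</pre>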


===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* If our model is <math>g(\theta^Tx^{(i)})</math>, where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math>, then this is convex
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)}) g(\theta^t x^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left(-\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\right)\\
&= \sum_i g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T\\
\end{align}
</math><br>
which is a PSD matrix
}}
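A small NumPy sketch (an assumed example, not from the article) of the cross entropy loss for logistic regression together with the gradient <math>\nabla_\theta J(\theta) = -\sum_i (y^{(i)} - g(\theta^T x^{(i)}))x^{(i)}</math> obtained above; the function names are illustrative.
<pre>
import numpy as np

def sigmoid(z):
    # g(z) = e^z / (1 + e^z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -sum_i [ y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i)) ]
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # From the derivation above: grad J = -sum_i (y_i - g(theta^T x_i)) x_i = X^T (p - y)
    p = sigmoid(X @ theta)
    return X.T @ (p - y)
</pre>
A finite-difference comparison of <code>gradient</code> against <code>cross_entropy</code> is a quick way to confirm the sign conventions.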
===Hinge Loss===


     update using above gradient
</pre>
===Coordinate Block Descent===


Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then <br>
<math>
\begin{align}
\mathbf{v}^T \mathbf{K} \mathbf{v}&= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j \sum_k v_{i}\phi_k(\mathbf{x}^{(i)})\phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))(\sum_j v_{j} \phi_k(\mathbf{x}^{(j)}))\\
&= \sum_k (\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
&\geq 0
\end{align}
</math>
}}
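A NumPy illustration of the same argument (the feature map and data are assumptions for the example): build <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math> from an explicit feature map and check that <math>\mathbf{v}^T \mathbf{K} \mathbf{v} \geq 0</math>.
<pre>
import numpy as np

def phi(x):
    # Toy explicit feature map (an assumption for this example).
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

X = np.random.randn(20, 2)              # 20 points in R^2 (arbitrary)
Phi = np.array([phi(x) for x in X])     # row i is phi(x^(i))
K = Phi @ Phi.T                         # K_ij = phi(x^(i))^T phi(x^(j))

v = np.random.randn(20)
print(v @ K @ v >= -1e-10)              # True: v^T K v >= 0 for any v
</pre>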
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
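A quick simulation (an illustrative check, not part of the statement) comparing the empirical tail probability of the sample mean of Bernoulli variables against the bound <math>2\exp(-\frac{2nt^2}{(b-a)^2})</math>; the values of <math>n</math>, <math>t</math>, and the number of trials are arbitrary.
<pre>
import numpy as np

n, t, trials = 50, 0.1, 100_000
# X_i ~ Bernoulli(0.5) are independent and bounded in [0, 1], so (b - a)^2 = 1.
means = np.random.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
bound = 2 * np.exp(-2 * n * t ** 2)
print(f"empirical={empirical:.3f}  hoeffding bound={bound:.3f}")
</pre>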