Machine Learning
==Hyperparameters==
===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
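A minimal NumPy sketch (an illustration, not taken from the linked post) of where <code>batch_size</code> enters a mini-batch SGD loop; the linear model, squared-error gradient, learning rate, and epoch count are assumptions for the example.
<pre>
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10):
    # Linear model w trained on the squared-error loss; batch_size controls
    # how many samples contribute to each gradient estimate.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w
</pre>
Smaller batch sizes give noisier but more frequent updates, while larger ones give smoother but costlier gradient estimates, which is roughly the trade-off the linked post measures empirically.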


==Loss functions==
===(Mean) Squared Error===
The squared error is:<br>
<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
If our model is linear regression, <math>h(x)=w^Tx</math>, then this is convex.<br>
{{hidden|Proof|
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_w J(w) &= \nabla_w \sum |w^T x^{(i)} - y^{(i)}|^2\\
&= \sum 2(w^T x^{(i)} - y^{(i)})x^{(i)}\\
\implies \nabla^2_w J(w) &= \sum 2 x^{(i)} (x^{(i)})^T
\end{align}
</math><br>
which is a PSD matrix
}}
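A short NumPy check (illustrative, not part of the proof) that the squared-error Hessian <math>2\sum_i x^{(i)}(x^{(i)})^T</math> of a linear model is positive semi-definite on random data; the data shapes are arbitrary assumptions.
<pre>
import numpy as np

# Hessian of J(w) = sum_i |w^T x_i - y_i|^2 for a linear model is 2 * sum_i x_i x_i^T.
X = np.random.randn(100, 5)                    # 100 samples, 5 features (arbitrary)
H = 2 * X.T @ X                                # equals 2 * sum_i x_i x_i^T
print(np.linalg.eigvalsh(H).min() >= -1e-8)    # True: all eigenvalues are non-negative
</pre>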


===Cross Entropy===
The cross entropy loss is
* <math>J(\theta) = -\sum [(y^{(i)})\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>
;Notes
* If our model is <math>g(\theta^Tx^{(i)})</math>, where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math>, then this is convex
{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)}) g(\theta^t x^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left(-\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\right)\\
&= \sum_i g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T\\
\end{align}
</math><br>
which is a PSD matrix
}}
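A small NumPy sketch (an assumed example, not from the article) of the cross entropy loss for logistic regression together with the gradient <math>\nabla_\theta J(\theta) = -\sum_i (y^{(i)} - g(\theta^T x^{(i)}))x^{(i)}</math> obtained above; the function names are illustrative.
<pre>
import numpy as np

def sigmoid(z):
    # g(z) = e^z / (1 + e^z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -sum_i [ y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i)) ]
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # From the derivation above: grad J = -sum_i (y_i - g(theta^T x_i)) x_i = X^T (p - y)
    p = sigmoid(X @ theta)
    return X.T @ (p - y)
</pre>
A finite-difference comparison of <code>gradient</code> against <code>cross_entropy</code> is a quick way to confirm the sign conventions.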
===Hinge Loss===


     update using above gradient
</pre>
===Coordinate Block Descent===


Positive Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then <br>
<math>
\begin{align}
\mathbf{v}^T \mathbf{K} \mathbf{v}&= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j \sum_k v_{i}\phi_k(\mathbf{x}^{(i)})\phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k (\sum_i v_{i} \phi_k(\mathbf{x}^{(i)}))(\sum_j v_{j} \phi_k(\mathbf{x}^{(j)}))\\
&= \sum_k (\sum_i  v_{i} \phi_k(\mathbf{x}^{(i)}))^2\\
&\geq 0
\end{align}
</math>
}}
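A NumPy illustration of the same argument (the feature map and data are assumptions for the example): build <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math> from an explicit feature map and check that <math>\mathbf{v}^T \mathbf{K} \mathbf{v} \geq 0</math>.
<pre>
import numpy as np

def phi(x):
    # Toy explicit feature map (an assumption for this example).
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

X = np.random.randn(20, 2)              # 20 points in R^2 (arbitrary)
Phi = np.array([phi(x) for x in X])     # row i is phi(x^(i))
K = Phi @ Phi.T                         # K_ij = phi(x^(i))^T phi(x^(j))

v = np.random.randn(20)
print(v @ K @ v >= -1e-10)              # True: v^T K v >= 0 for any v
</pre>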
====Hoeffding's inequality====
Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
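A quick simulation (an illustrative check, not part of the statement) comparing the empirical tail probability of the sample mean of Bernoulli variables against the bound <math>2\exp(-\frac{2nt^2}{(b-a)^2})</math>; the values of <math>n</math>, <math>t</math>, and the number of trials are arbitrary.
<pre>
import numpy as np

n, t, trials = 50, 0.1, 100_000
# X_i ~ Bernoulli(0.5) are independent and bounded in [0, 1], so (b - a)^2 = 1.
means = np.random.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
bound = 2 * np.exp(-2 * n * t ** 2)
print(f"empirical={empirical:.3f}  hoeffding bound={bound:.3f}")
</pre>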