Machine Learning: Difference between revisions

Revision as of 16:12, 10 November 2019

Cross Entropy

The cross entropy loss is

\(\displaystyle J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]\)

Notes

If our model is \(\displaystyle g(\theta^Tx^{(i)})\) where \(\displaystyle g(x)\) is the sigmoid function \(\displaystyle \frac{e^x}{1+e^x}\) then this is convex

Proof

\(\displaystyle \begin{aligned} \nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\ &= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\ &= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\ &= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\ &= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\ \implies \nabla^2_\theta J(\theta) &= \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\ &= \sum g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T\\ \end{aligned} \)
which is a PSD matrix

@@ Line 1: / Line 1: @@
-Machine Learning
-==Loss functions==
+===Cross Entropy===
-===(Mean) Squared Error===
+The cross entropy loss is
-The squared error is:<br>
+* <math>J(\theta) = \sum [(y^{(i)})\log(h_\theta(x)) + (1-y^{(i)})\log(1-h_\theta(x))]</math>
-<math>J(\theta) = \sum|h_{\theta}(x^{(i)}) - y^{(i)}|^2</math><br>
+;Notes
-If our model is linear regression <math>h(x)=w^tx</math> then this is convex.<br>
+* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math> then this is convex
-{{hidden|Proof|
+{{hidden | Proof |
 <math>
 \begin{aligned}
-\nabla_{w} J(w) &= \nabla_{w} \sum (w^tx^{(i)} - y^{(i)})^2\\
+\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^t x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^t x^{(i)}))]\\
-&= 2\sum (w^t x^{(i)} - y^{(i)})x \\
+&= -\sum [(y^{(i)})\frac{g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{g(\theta^t x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)}))}{1-g(\theta^t x^{(i)})}x^{(i)}]\\
-\implies \nabla_{w}^2 J(w) &= \nabla 2\sum (w^T x^{(i)} - y^{(i)})x^{(i)}\\
+&= -\sum [(y^{(i)})(1-g(\theta^t x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^t x^{(i)})x^{(i)}]\\
-&= 2 \sum x^{(i)}(x^{(i)})^T
+&= -\sum [(y^{(i)})x^{(i)} -(y^{(i)}) g(\theta^t x^{(i)}))x^{(i)} - g(\theta^t x^{(i)})x^{(i)} + y^{(i)}g(\theta^t x^{(i)})x^{(i)}]\\
+&= -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
+\implies \nabla^2_\theta J(\theta) &= \nabla_\theta -\sum [(y^{(i)})x^{(i)} - g(\theta^t x^{(i)})x^{(i)}]\\
+&= \sum g(\theta^t x^{(i)})(1-g(\theta^t x^{(i)})) x^{(i)} (x^{(i)})^T\\
 \end{aligned}
 </math><br>
-so the hessian is positive semi-definite
+which is a PSD matrix
 }}