
Machine Learning

* If our model is <math>g(\theta^Tx^{(i)})</math> where <math>g(x)</math> is the sigmoid function <math>\frac{e^x}{1+e^x}</math>, then this is convex


{{hidden | Proof |
We show that the Hessian is positive semi-definite.<br>
<math>
\begin{align}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum [(y^{(i)})\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]\\
&= -\sum [(y^{(i)})\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}]\\
&= -\sum [(y^{(i)})(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - (y^{(i)}) g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left(-\sum [(y^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\right)\\
&= \sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
\end{align}
</math><br>
which is PSD, since each term is a nonnegative scalar times <math>x^{(i)} (x^{(i)})^T</math>.
}}
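As a quick numerical check of this result, here is a minimal NumPy sketch (synthetic data; the helper name <code>sigmoid</code> is illustrative, not from the page) that forms the Hessian <math>\sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))\, x^{(i)} (x^{(i)})^T</math> and verifies that its eigenvalues are nonnegative.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples x^{(i)}, 3 features
theta = rng.normal(size=3)

# Hessian of the negative log-likelihood: sum_i g(1-g) x x^T
g = sigmoid(X @ theta)
H = (X * (g * (1 - g))[:, None]).T @ X

print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True: H is PSD
</syntaxhighlight>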


===Hinge Loss===
Then <br>
<math>
\begin{align}
\mathbf{v}^T \mathbf{K} \mathbf{v} &= \mathbf{v}^T [\sum_j K_{ij}v_j]\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \left(\sum_i v_{i} \phi_k(\mathbf{x}^{(i)})\right)^2\\
&\geq 0
\end{align}
</math>
}}
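As a sanity check, the sketch below uses an arbitrary illustrative feature map <math>\phi(x) = (x, x^2)</math> (any <math>\phi</math> works for the argument) to build the Gram matrix <math>K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})</math> and confirms the quadratic form is nonnegative for a random <math>\mathbf{v}</math>.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))          # 20 points x^{(i)}

# Illustrative feature map phi(x) = (x, x^2)
def phi(x):
    return np.concatenate([x, x**2])

Phi = np.array([phi(x) for x in X])   # rows are phi(x^{(i)})
K = Phi @ Phi.T                       # K_ij = phi(x_i)^T phi(x_j)

v = rng.normal(size=20)
print(v @ K @ v >= -1e-10)            # True: quadratic form is nonnegative
</syntaxhighlight>
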
====Common Kernels====
; Polynomial Kernel
* See [[wikipedia:Polynomial kernel]]
* <math>K(x,z) = (c+x^Tz)^d</math>
* For <math>d=2</math> and <math>c=1</math> (verified numerically in the sketch below):
** we have <math>(1+x^Tz)^2 = 1 + 2(x^Tz) + (x^Tz)^2</math>
** <math>= 1 + 2 \sum x_i z_i + (\sum x_i z_i)(\sum x_j z_j)</math>
** <math>= 1 + 2 \sum x_i z_i + 2\sum_{i < j} (x_i x_j) (z_i z_j) + \sum (x_i^2)(z_i)^2</math>
** <math>= 1 + \sum (\sqrt{2}x_i) (\sqrt{2}z_i) + \sum_{i < j} (\sqrt{2} x_i x_j)(\sqrt{2} z_i z_j) + \sum x_i^2 z_i^2</math>
* The dimension of the feature map associated with this kernel is exponential in d
** There are <math>1+n+\binom{n}{2} + \cdots + \binom{n}{d} = O(n^d)</math> terms
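The <math>d=2</math> expansion above can be checked numerically; in this sketch the explicit feature map <code>phi2</code> (assuming <math>c=1</math>) is an illustrative construction that follows the terms listed above.

<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def phi2(x):
    # Explicit feature map for K(x, z) = (1 + x^T z)^2
    return np.concatenate([
        [1.0],
        np.sqrt(2) * x,                                            # sqrt(2) x_i
        [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)],
        x**2,                                                      # x_i^2
    ])

rng = np.random.default_rng(2)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose((1 + x @ z) ** 2, phi2(x) @ phi2(z)))             # True
</syntaxhighlight>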


==Learning Theory==
Past the VC dimension, the growth function grows only polynomially in <math>m</math> (Sauer's lemma):
* If <math>VCdim(H)\leq d < \infty</math> then <math>\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}</math>
* Also, if <math>m > d+1</math> then <math>\tau_H(m) \leq \left(\frac{em}{d}\right)^d</math>; the two bounds are compared numerically below.
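For a sense of scale, here is a short sketch with example values <math>d=3</math>, <math>m=20</math> (chosen purely for illustration) comparing the exact bound <math>\sum_{i=0}^{d}\binom{m}{i}</math>, the estimate <math>\left(\frac{em}{d}\right)^d</math>, and the unrestricted count <math>2^m</math>.

<syntaxhighlight lang="python">
import math

d, m = 3, 20                                          # example VC dimension and sample size
sauer = sum(math.comb(m, i) for i in range(d + 1))    # sum_{i=0}^{d} C(m, i)
estimate = (math.e * m / d) ** d                      # (em/d)^d, applicable since m > d+1
print(sauer, round(estimate), 2 ** m)                 # 1351, about 5951, 1048576
</syntaxhighlight>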


===Bias-Variance Tradeoff===