===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
==Optimization==
===Gradient Descent===
Also known as steepest descent, since the negative gradient is the direction of steepest decrease of the loss.<br>
To minimize a loss function, take steps in the direction opposite to the gradient.
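Concretely, with parameters <math>\theta</math>, loss <math>L</math>, and learning rate <math>\eta > 0</math>, each step is
<math>\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)</math><br>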
===Stochastic Gradient Descent===
Oftentimes, with large amounts of data, you can't compute the gradient over all of your data at once.
In this case, we use batches of data.<br>
We can form batches by taking a random permutation of the indices (sampling without replacement) and splitting it into chunks.
<pre>
for epoch = 1 to n
  shuffle the data and split it into m batches
  for i = 1 to m
    compute the gradient of the loss w.r.t. batch i
    update the parameters using this gradient
</pre>
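A minimal NumPy sketch of the loop above, using linear least-squares as an illustrative model (the model, loss, and names here are assumptions, not part of the original page):
<pre>
import numpy as np

def sgd_linear_least_squares(X, y, lr=0.1, batch_size=32, epochs=10):
    # Minibatch SGD sketch; the linear least-squares model is only illustrative.
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        idx = np.random.permutation(n)            # random order of indices, no replacement
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)  # gradient of MSE on the batch
            w -= lr * grad                        # update using the batch gradient
    return w
</pre>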
===Block Coordinate Descent===
At each step, minimize the loss over a single block of coordinates while holding the remaining coordinates fixed, then cycle through the blocks.


===Learning Rate===
 
==SVM==
===Kernel Trick===
Oftentimes, linear classifiers such as the perceptron and SVM fail to classify data for which the true decision boundary is non-linear.<br>
One way to get around this is to apply a non-linear transformation <math>\phi(x)</math> to the data as a preprocessing step.<br>
If our original model and training procedure only used inner products <math>\langle x, z\rangle</math>, then after the transformation we only need <math>\phi(x)^T\phi(z)</math>, never <math>\phi(x)</math> on its own.<br>
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.<br>
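A standard example: for <math>x, z \in \mathbb{R}^2</math>, <math>K(x,z) = (x^T z)^2</math> is a kernel, since <math>(x^T z)^2 = \phi(x)^T\phi(z)</math> with <math>\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)</math>; the kernel can be evaluated without ever forming <math>\phi(x)</math>.<br>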
====Identifying if a function is a kernel====
Basic check:
Since a kernel is an inner product between <math>\phi(x)</math> and <math>\phi(z)</math>, it must satisfy the axioms of inner products, in particular <math>K(x,z)=K(z,x)</math>; if this symmetry fails, the function is not a kernel.<br>
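For example, <math>K(x,z) = x_1 z_2</math> is not symmetric in general, so it cannot be a kernel.<br>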
====Mercer's Theorem====
Let our kernel function be <math>K(x,z)</math>.
Then for any sample <math>S = \{x^{(1)}, \dots, x^{(n)}\}</math>, the corresponding Gram matrix <math>\mathbf{K}</math> with <math>K_{ij} = K(x^{(i)},x^{(j)})</math> is symmetric positive semi-definite.
{{hidden | Proof |
Symmetry: <math>K_{ij} = K(x^{(i)},x^{(j)}) = \phi(x^{(i)})^T\phi(x^{(j)}) = \phi(x^{(j)})^T\phi(x^{(i)}) = K_{ji}</math><br>
Positive semi-definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \sum_i v_i \left[ \sum_j K_{ij} v_j \right]
= \sum_i \sum_j v_i K_{ij} v_j \\
&= \sum_i \sum_j v_i \phi(x^{(i)})^T \phi(x^{(j)}) v_j \\
&= \sum_i \sum_j v_i \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_j \\
&= \sum_k \sum_i \sum_j v_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_j \\
&= \sum_k \sum_i v_i \phi_k(x^{(i)}) \sum_j \phi_k(x^{(j)}) v_j \\
&= \sum_k \left( \sum_i v_i \phi_k(x^{(i)}) \right)^2 \\
&\geq 0
\end{aligned}
</math>
}}
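As a quick numerical sanity check (a sketch, not part of the page; the Gaussian/RBF kernel and the random sample are arbitrary choices), one can build the Gram matrix for a sample and verify that it is symmetric with non-negative eigenvalues:
<pre>
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2), chosen only for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # sample S of 20 points in R^3

# Gram matrix K_ij = K(x^(i), x^(j))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))          # symmetry
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue, >= 0 up to floating-point error
</pre>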


==Learning Theory==