===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
==Optimization==
===Gradient Descent===
Also known as steepest descent, since the gradient points in the direction of steepest ascent.<br>
To minimize a loss function, take steps in the opposite direction of the gradient.
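For instance, a minimal NumPy sketch of gradient descent on a least-squares objective (the matrix, step size, and iteration count below are illustrative choices, not from these notes):
<pre>
import numpy as np

# Minimize f(w) = ||Aw - b||^2 with plain gradient descent.
# A, b, the step size, and the iteration count are illustrative assumptions.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def grad(w):
    # Gradient of ||Aw - b||^2 is 2 A^T (Aw - b)
    return 2 * A.T @ (A @ w - b)

w = np.zeros(2)
step_size = 0.005
for _ in range(2000):
    w = w - step_size * grad(w)   # step opposite the gradient

print(w)   # approaches the least-squares solution [0, 0.5]
</pre>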
===Stochastic Gradient Descent===
Oftentimes with large amounts of data, you can't take the gradient of all your data at once.
In this case, we use mini-batches of data.<br>
We can form batches by taking a random permutation of the indices, so each example is used once per epoch (i.e. sampled without replacement).
<pre>
for epoch = 1 to n
  generate batches from a random permutation of the indices
  for i = 1 to m
    compute the gradient of the loss w.r.t. batch i
    update the parameters using this gradient
</pre>
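A minimal NumPy sketch of this loop for mean-squared-error linear regression (the data, true weights, learning rate, and batch size are illustrative assumptions):
<pre>
import numpy as np

# Minibatch SGD for linear least squares; data and hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
learning_rate = 0.1
batch_size = 32

for epoch in range(10):
    # Generate batches: a random permutation of indices, split without replacement.
    indices = rng.permutation(len(X))
    for batch in np.array_split(indices, len(X) // batch_size):
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error on this batch.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= learning_rate * grad   # update using the batch gradient

print(w)   # close to the true weights used to generate y
</pre>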
===Block Coordinate Descent===
===Learning Rate===
==SVM==
===Kernel Trick===
Oftentimes, linear classifiers such as the perceptron and SVM fail to classify data for which the true decision boundary is non-linear.<br>
One way to get around this is to perform a non-linear preprocessing of the data with a feature map <math>\phi(x)</math>.<br>
If our original model and training only used the inner product <math>\langle x, z\rangle</math>, then after preprocessing we only need <math>\phi(x)^T\phi(z)</math>.<br>
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some feature map <math>\phi</math>.<br>
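As a concrete illustration, the degree-2 polynomial kernel <math>K(x,z)=(x^Tz)^2</math> corresponds to the feature map of all pairwise products <math>x_i x_j</math>. A small sketch verifying this numerically (the vectors are arbitrary choices):
<pre>
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: all products x_i * x_j.
    return np.outer(x, x).ravel()

def K(x, z):
    # Kernel computed without ever forming phi: (x^T z)^2.
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# Both compute the same inner product, but K never builds the d^2-dimensional features.
print(phi(x) @ phi(z))   # 20.25
print(K(x, z))           # 20.25
</pre>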
====Identifying if a function is a kernel====
Basic check:
Since the kernel is an inner product between <math>\phi(x)</math> and <math>\phi(z)</math>, it must satisfy the axioms of inner products; in particular, symmetry <math>K(x,z)=K(z,x)</math>. If a function is not symmetric, it is not a kernel.<br>
====Mercer's Theorem====
Let our kernel function be <math>K(x,z)</math>.
Then for any sample <math>S</math>, the corresponding Gram matrix <math>\mathbf{K}</math> where <math>K_{ij} = K(x^{(i)},x^{(j)})</math> is symmetric positive semi-definite.
{{hidden | Proof |
Symmetry: <math>K_{ij} = K(x^{(i)},x^{(j)}) = \phi(x^{(i)})^T\phi(x^{(j)}) = \phi(x^{(j)})^T\phi(x^{(i)}) = K_{ji}</math><br>
Positive Semi-Definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \sum_i \sum_j v_{i} K_{ij} v_{j} \\
&= \sum_i \sum_j v_{i}\phi(x^{(i)})^T\phi(x^{(j)})v_{j} \\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_{j} \\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_{j} \\
&= \sum_k \left( \sum_i v_{i} \phi_k(x^{(i)}) \right) \left( \sum_j v_{j} \phi_k(x^{(j)}) \right) \\
&= \sum_k \left( \sum_i v_{i} \phi_k(x^{(i)}) \right)^2 \\
&\geq 0
\end{aligned}
</math>
}}
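A quick numerical illustration of this property: build a Gram matrix from a known kernel on a random sample and check symmetry and non-negative eigenvalues. The RBF kernel, sample size, and bandwidth below are assumed choices for the sketch, not from these notes.
<pre>
import numpy as np

# Check that a Gram matrix built from the RBF kernel is symmetric PSD.
# The kernel, sample, and bandwidth are illustrative choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))          # True: symmetric
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue is >= 0 (up to rounding)
</pre>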
==Learning Theory==