===Batch Size===
[https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]
==Optimization==
===Gradient Descent===
Also known as steepest descent, since the negative gradient is the direction of steepest decrease of the loss.<br>
To minimize a loss function, take steps in the direction opposite to the gradient.
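Concretely, with parameters <math>\theta</math>, loss <math>L</math>, and learning rate <math>\eta > 0</math>, each step is
<math>\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)</math><br>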
===Stochastic Gradient Descent===
Oftentimes, with large amounts of data, you can't compute the gradient over all of your data at once.
In this case, we use batches of data.<br>
We can form batches by taking a random permutation of the indices (sampling without replacement) and splitting it into chunks.
<pre>
for epoch = 1 to n
  shuffle the data and split it into m batches
  for i = 1 to m
    compute the gradient of the loss w.r.t. batch i
    update the parameters using this gradient
</pre>
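A minimal NumPy sketch of the loop above, using linear least-squares as an illustrative model (the model, loss, and names here are assumptions, not part of the original page):
<pre>
import numpy as np

def sgd_linear_least_squares(X, y, lr=0.1, batch_size=32, epochs=10):
    # Minibatch SGD sketch; the linear least-squares model is only illustrative.
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        idx = np.random.permutation(n)            # random order of indices, no replacement
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)  # gradient of MSE on the batch
            w -= lr * grad                        # update using the batch gradient
    return w
</pre>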
===Block Coordinate Descent===
At each step, minimize the loss over a single block of coordinates while holding the remaining coordinates fixed, then cycle through the blocks.


===Learning Rate===
 
==SVM==
===Kernel Trick===
Oftentimes, linear classifiers such as the perceptron and SVM fail to classify data for which the true decision boundary is non-linear.<br>
One way to get around this is to apply a non-linear transformation <math>\phi(x)</math> to the data as a preprocessing step.<br>
If our original model and training procedure only used inner products <math>\langle x, z\rangle</math>, then after the transformation we only need <math>\phi(x)^T\phi(z)</math>, never <math>\phi(x)</math> on its own.<br>
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>.<br>
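A standard example: for <math>x, z \in \mathbb{R}^2</math>, <math>K(x,z) = (x^T z)^2</math> is a kernel, since <math>(x^T z)^2 = \phi(x)^T\phi(z)</math> with <math>\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)</math>; the kernel can be evaluated without ever forming <math>\phi(x)</math>.<br>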
====Identifying if a function is a kernel====
Basic check:
Since a kernel is an inner product between <math>\phi(x)</math> and <math>\phi(z)</math>, it must satisfy the axioms of inner products, in particular <math>K(x,z)=K(z,x)</math>; if this symmetry fails, the function is not a kernel.<br>
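For example, <math>K(x,z) = x_1 z_2</math> is not symmetric in general, so it cannot be a kernel.<br>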
====Mercer's Theorem====
Let our kernel function be <math>K(x,z)</math>.
Then for any sample <math>S = \{x^{(1)}, \dots, x^{(n)}\}</math>, the corresponding Gram matrix <math>\mathbf{K}</math> with <math>K_{ij} = K(x^{(i)},x^{(j)})</math> is symmetric positive semi-definite.
{{hidden | Proof |
Symmetry: <math>K_{ij} = K(x^{(i)},x^{(j)}) = \phi(x^{(i)})^T\phi(x^{(j)}) = \phi(x^{(j)})^T\phi(x^{(i)}) = K_{ji}</math><br>
Positive semi-definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \sum_i v_i \left[ \sum_j K_{ij} v_j \right]
= \sum_i \sum_j v_i K_{ij} v_j \\
&= \sum_i \sum_j v_i \phi(x^{(i)})^T \phi(x^{(j)}) v_j \\
&= \sum_i \sum_j v_i \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_j \\
&= \sum_k \sum_i \sum_j v_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) v_j \\
&= \sum_k \sum_i v_i \phi_k(x^{(i)}) \sum_j \phi_k(x^{(j)}) v_j \\
&= \sum_k \left( \sum_i v_i \phi_k(x^{(i)}) \right)^2 \\
&\geq 0
\end{aligned}
</math>
}}
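As a quick numerical sanity check (a sketch, not part of the page; the Gaussian/RBF kernel and the random sample are arbitrary choices), one can build the Gram matrix for a sample and verify that it is symmetric with non-negative eigenvalues:
<pre>
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2), chosen only for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # sample S of 20 points in R^3

# Gram matrix K_ij = K(x^(i), x^(j))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))          # symmetry
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue, >= 0 up to floating-point error
</pre>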


==Learning Theory==