Machine Learning

==Loss functions==

===(Mean) Squared Error===

The squared error is:<br>
<math>J(\theta) = \sum_i (h_{\theta}(x^{(i)}) - y^{(i)})^2</math><br>
If our model is linear regression <math>h(x)=w^Tx</math>, then this loss is convex in <math>w</math>.<br>
{{hidden|Proof|
<math>
\begin{aligned}
\nabla_{w} J(w) &= \nabla_{w} \sum_i (w^T x^{(i)} - y^{(i)})^2\\
&= 2\sum_i (w^T x^{(i)} - y^{(i)})x^{(i)} \\
\implies \nabla_{w}^2 J(w) &= \nabla_{w}\, 2\sum_i (w^T x^{(i)} - y^{(i)})x^{(i)}\\
&= 2 \sum_i x^{(i)}(x^{(i)})^T
\end{aligned}
</math><br>
so the Hessian is positive semi-definite and the loss is convex.
}}
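
A minimal NumPy sketch of this loss and its gradient for linear regression; the names X, y, w and the helper functions are illustrative, not from these notes.
<pre>
import numpy as np

def squared_error(w, X, y):
    """J(w) = sum_i (w^T x_i - y_i)^2 for an (n, d) design matrix X."""
    r = X @ w - y
    return np.sum(r ** 2)

def squared_error_grad(w, X, y):
    """Gradient: 2 * sum_i (w^T x_i - y_i) x_i."""
    return 2 * X.T @ (X @ w - y)

# The Hessian is 2 * X.T @ X, which is positive semi-definite, matching the proof above.
</pre>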

===Cross Entropy===

The cross entropy loss is:
* <math>J(\theta) = -\sum_i [y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]</math>

;Notes
* If our model is <math>h_\theta(x) = g(\theta^Tx)</math> where <math>g(z)=\frac{e^z}{1+e^z}</math> is the sigmoid function, then this loss is convex in <math>\theta</math>.
{{hidden | Proof |
<math>
\begin{aligned}
\nabla_\theta J(\theta) &= -\nabla_\theta \sum_i [y^{(i)}\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)}))]\\
&= -\sum_i \left[y^{(i)}\frac{g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{g(\theta^T x^{(i)})}x^{(i)} + (1-y^{(i)})\frac{-g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)}))}{1-g(\theta^T x^{(i)})}x^{(i)}\right]\\
&= -\sum_i [y^{(i)}(1-g(\theta^T x^{(i)}))x^{(i)} - (1-y^{(i)})g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum_i [y^{(i)}x^{(i)} - y^{(i)} g(\theta^T x^{(i)})x^{(i)} - g(\theta^T x^{(i)})x^{(i)} + y^{(i)}g(\theta^T x^{(i)})x^{(i)}]\\
&= -\sum_i [y^{(i)}x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\\
\implies \nabla^2_\theta J(\theta) &= \nabla_\theta \left(-\sum_i [y^{(i)}x^{(i)} - g(\theta^T x^{(i)})x^{(i)}]\right)\\
&= \sum_i g(\theta^T x^{(i)})(1-g(\theta^T x^{(i)})) x^{(i)} (x^{(i)})^T
\end{aligned}
</math><br>
which is a positive semi-definite matrix, so the loss is convex.
}}
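
A minimal NumPy sketch of the cross entropy loss and its gradient for the sigmoid model above; the array names and the small epsilon used for numerical stability are illustrative.
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y, eps=1e-12):
    """J(theta) = -sum_i [y_i log(h_i) + (1 - y_i) log(1 - h_i)] with h = sigmoid(X theta)."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def cross_entropy_grad(theta, X, y):
    """Gradient: -sum_i (y_i - h_i) x_i = X^T (h - y), matching the derivation above."""
    return X.T @ (sigmoid(X @ theta) - y)
</pre>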

===Hinge Loss===

==Optimization==

===Gradient Descent===

Also known as steepest descent.<br>
To minimize a loss function, take steps in the opposite direction of the gradient.
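
A minimal sketch of the update rule, assuming a fixed learning rate and a gradient function supplied by the caller (the names and defaults are illustrative):
<pre>
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step in the direction opposite the gradient."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w.
w_star = gradient_descent(lambda w: 2 * w, np.array([3.0, -2.0]))
</pre>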

===Stochastic Gradient Descent===

Oftentimes with large amounts of data, you can't take the gradient of all your data at once.
In this case, we use batches of data.<br>
We can form batches by iterating over a random permutation of the indices, i.e. sampling without replacement.
<pre>
for epoch = 1 to n
    generate batches
    for i = 1 to m
        take gradient w.r.t. batch i
        update parameters using the above gradient
</pre>
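
A minimal NumPy sketch of the loop above; the function signature, learning rate, and batch size are illustrative, and grad could be, for example, the squared error gradient from the loss section.
<pre>
import numpy as np

def sgd(grad, w, X, y, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Mini-batch SGD: shuffle the indices each epoch, then step on each batch."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(n)                      # random order, without replacement
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad(w, X[batch], y[batch])  # update using the batch gradient
    return w
</pre>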

;Batch Size
* [https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e A medium post empirically evaluating the effect of batch_size]

===Coordinate Block Descent===

===Learning Rate===

==SVM==

[http://cs229.stanford.edu/notes/cs229-notes3.pdf Andrew Ng Notes]<br>
Support Vector Machine<br>
This is a linear classifier, like the perceptron, except the goal is not just to classify our data correctly but also to maximize the margin.<br>
<math>h_{w,b}(x) = g(w^Tx+b)</math> where <math>g(z) = I[z \geq 0]-I[z<0]</math> is the sign function.<br>

===Margins===

The margin, denoted by <math>\gamma</math>, is the distance between our classifier and the closest point.<br>

;Functional Margin
The functional margin of a single example is:<br>
<math>\hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)}+b)</math><br>
The functional margin of the entire sample is the smallest margin over the examples.

;Geometric Margin
The geometric margin is the actual distance:<br>
<math>\gamma^{(i)} = y^{(i)}\left(\left(\frac{w}{\Vert w \Vert}\right)^Tx^{(i)}+\frac{b}{\Vert w \Vert}\right)</math><br>

;Notes
* <math>\mathbf{w}</math> is the normal vector of our hyperplane, so <math>\left(\frac{w}{\Vert w \Vert}\right)^Tx^{(i)}</math> is the length of the projection of <math>x^{(i)}</math> onto the normal vector.
: Together with the offset <math>b/\Vert w \Vert</math>, this gives the distance to our hyperplane.
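
A short NumPy sketch of both margins for a sample (the variable names are illustrative):
<pre>
import numpy as np

def functional_margin(w, b, X, y):
    """hat{gamma}^(i) = y_i (w^T x_i + b) for each example."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """gamma^(i) = y_i ((w / ||w||)^T x_i + b / ||w||)."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# The margin of the whole sample is the minimum over the examples:
# gamma = geometric_margin(w, b, X, y).min()
</pre>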

===Lagrangians===

The goal of the SVM is to maximize the margin:<br>
<math>
\begin{aligned}
\max_{\hat{\gamma}, w, b}\ &\frac{\hat{\gamma}}{\Vert w \Vert}\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq \hat{\gamma} \quad \forall i
\end{aligned}
</math><br>
which is equivalent, after rescaling <math>(w,b)</math> so that <math>\hat{\gamma}=1</math>, to<br>
<math>
\begin{aligned}
\min_{w, b}\ &\Vert w \Vert ^2\\
\text{s.t. }& y^{(i)}(w^Tx^{(i)} + b) \geq 1 \quad \forall i
\end{aligned}
</math><br><br>
In general, given an optimization problem in the (primal) form:<br>
<math>
\begin{aligned}
\min_w\ & f(w)\\
\text{s.t. }& g_i(w) \leq 0 \quad \forall i\\
& h_i(w) = 0 \quad \forall i
\end{aligned}
</math><br>
we can rewrite the optimization as<br>
<math>
\min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
</math><br>
where <math>\mathcal{L}(w, \alpha, \beta) = f(w) + \sum \alpha_i g_i(w) + \sum \beta_i h_i(w)</math> is called the Lagrangian, with <math>\alpha_i \geq 0</math> multiplying the inequality constraints.<br>
Since <math>\max \min \leq \min \max</math>, we have:<br>
<math>
\max_{\alpha, \beta \mid \alpha \geq 0}\min_{w} \mathcal{L}(w, \alpha, \beta) \leq \min_{w}\max_{\alpha, \beta \mid \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
</math><br>
The left term is called the dual problem.<br>
If the solution to the dual problem satisfies certain conditions, called the KKT conditions, then it is also the solution to the original problem.
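
As a worked instance of the general form (using the <math>\Vert w \Vert^2</math> objective above, which has only the inequality constraints <math>g_i(w,b) = 1 - y^{(i)}(w^Tx^{(i)} + b) \leq 0</math>), the SVM Lagrangian is:<br>
<math>
\mathcal{L}(w, b, \alpha) = \Vert w \Vert^2 - \sum_i \alpha_i \left[ y^{(i)}(w^Tx^{(i)} + b) - 1 \right], \quad \alpha_i \geq 0
</math>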

===Kernel Trick===

Oftentimes, linear classifiers such as the perceptron and SVM fail to classify data for which the true decision boundary is non-linear.<br>
In this case, one way to get around this is to perform a non-linear preprocessing of the data with a feature map <math>\phi(x)</math>,<br>
for example <math>\phi(x) = \begin{bmatrix}x \\ x^2 \\ x^3\end{bmatrix}</math>.<br>
If our original model and training only used the inner product <math>\langle x, z\rangle</math>, then after preprocessing we only need <math>\phi(x)^T\phi(z)</math>.<br>
A kernel <math>K(x,z)</math> is a function that can be expressed as <math>K(x,z)=\phi(x)^T\phi(z)</math> for some function <math>\phi</math>, so it can be evaluated without computing <math>\phi</math> explicitly.<br>

====Identifying if a function is a kernel====

Basic check:
Since the kernel is an inner product between <math>\phi(x)</math> and <math>\phi(z)</math>, it must satisfy the axioms of inner products; in particular, if <math>K(x,z) \neq K(z,x)</math> for some <math>x, z</math>, then it is not a kernel.<br>

====Mercer's Theorem====

Let our kernel function be <math>K(z,x)</math>.
Then <math>K</math> is a valid kernel if and only if, for any sample <math>S</math>, the corresponding kernel matrix <math>\mathbf{K}</math> with <math>K_{ij} = K(x^{(i)},x^{(j)})</math> is symmetric positive semi-definite.
{{hidden | Proof |
Symmetry: <math>K_{ij} = K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}</math><br>
Positive semi-definite:<br>
Let <math>\mathbf{v} \in \mathbb{R}^n</math>.<br>
Then
<math>
\begin{aligned}
\mathbf{v}^T \mathbf{K} \mathbf{v}
&= \sum_i v_i \left(\sum_j K_{ij}v_j\right)\\
&= \sum_i \sum_j v_{i}K_{ij}v_{j}\\
&= \sum_i \sum_j v_{i}\phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})v_{j}\\
&= \sum_i \sum_j v_{i} \sum_k \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i \sum_j v_{i} \phi_k(\mathbf{x}^{(i)}) \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \sum_i v_{i} \phi_k(\mathbf{x}^{(i)}) \sum_j \phi_k(\mathbf{x}^{(j)})v_{j}\\
&= \sum_k \left(\sum_i v_{i} \phi_k(\mathbf{x}^{(i)})\right)^2\\
&\geq 0
\end{aligned}
</math>
}}
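
The Gram-matrix conditions can be checked numerically. A small NumPy sketch using a polynomial kernel (the kernel choice, sample size, and tolerance are illustrative):
<pre>
import numpy as np

def poly_kernel(x, z, degree=3):
    """K(x, z) = (1 + x^T z)^degree, a standard polynomial kernel."""
    return (1.0 + x @ z) ** degree

# Build the kernel (Gram) matrix for a random sample and check Mercer's conditions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                   # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)  # eigenvalues >= 0, up to numerical error
</pre>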

==Learning Theory==

===PAC Learning===

Probably Approximately Correct (PAC)<br>
A hypothesis class <math>H</math> is PAC learnable if, given <math>0 < \epsilon, \delta < 1</math>, there is some function <math>m(\epsilon, \delta)</math>, polynomial in <math>1/\epsilon, 1/\delta</math>, such that if we have a sample of size <math>\geq m(\epsilon, \delta)</math>, then with probability <math>1-\delta</math> the hypothesis we learn will have true error <math>\leq \epsilon</math>.

===Uniform Convergence===

If for all hypotheses <math>h</math>, <math>|L_S(h)-L_D(h)| \leq \epsilon</math>, then the training set <math>S</math> is called <math>\epsilon</math>-representative.<br>
If <math>S</math> is <math>\epsilon/2</math>-representative, <math>h_S</math> minimizes the empirical loss <math>L_S</math>, and <math>h_D</math> minimizes the true loss <math>L_D</math>, then
<math>
L_D(h_S) \leq L_S(h_S) + \epsilon / 2 \leq L_S(h_D) + \epsilon / 2 \leq L_D(h_D) + \epsilon
</math><br>
where the middle inequality holds because <math>h_S</math> minimizes <math>L_S</math>.<br>
A hypothesis class <math>H</math> has the uniform convergence property if there exists <math>m^{UC}(\epsilon, \delta)</math> such that for every <math>\epsilon, \delta</math>, if we draw a sample <math>S</math> of size at least <math>m^{UC}(\epsilon, \delta)</math>, then with probability <math>1-\delta</math>, <math>S</math> is <math>\epsilon</math>-representative.

===VC dimension===

[https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension Wikipedia Page]<br>
[https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/10notes.pdf Some slides]

====Shattering====

A model <math>f</math> parameterized by <math>\theta</math> is said to shatter a set of points <math>\{x_1, ..., x_n\}</math> if for every possible set of binary labellings <math>\{y_1,...,y_n\}</math> there exists <math>\theta</math> such that <math>f</math> makes no errors.

====Definition====

Intuitively, the VC dimension of a hypothesis class measures how complex a model it is.<br>
Stolen from Wikipedia:<br>
The VC dimension of a model <math>f</math> is the maximum number of points that can be arranged so that <math>f</math> shatters them.
More formally, it is the maximum cardinality <math>D</math> such that some data point set of cardinality <math>D</math> can be shattered by <math>f</math>.

;Notes
* To show the VC dimension is at least <math>n</math>, find a training set of size <math>n</math> that can be shattered by our hypothesis class (see the sketch below).
* To show the VC dimension is at most <math>n</math>, prove that no training set of size <math>n+1</math> can be shattered by our hypothesis class.
* The number of parameters does not necessarily correspond to the VC dimension:
: <math>H=\{h(x)=\sin(\theta x)\}</math> has infinite VC dimension with a single parameter.
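
A brute-force shattering check makes the notes above concrete. The sketch below uses threshold classifiers <math>h_\theta(x) = +1</math> if <math>x \geq \theta</math>, else <math>-1</math>, on the real line (an illustrative hypothesis class, not from these notes) and verifies that one point can be shattered but a particular pair cannot.
<pre>
from itertools import product

def shatters(points, thetas):
    """Check whether thresholds h_theta(x) = (+1 if x >= theta else -1) realize every labelling."""
    for labels in product([-1, 1], repeat=len(points)):
        realized = any(
            all((1 if x >= t else -1) == y for x, y in zip(points, labels))
            for t in thetas
        )
        if not realized:
            return False
    return True

print(shatters([0.0], [-1.0, 1.0]))            # True: a single point can be shattered
print(shatters([0.0, 1.0], [-1.0, 0.5, 2.0]))  # False: the labelling (+1, -1) is impossible
</pre>
Note that a positive answer only needs witnesses for <math>\theta</math>, while ruling out shattering in general requires an argument over all <math>\theta</math>; here the labelling <math>(+1,-1)</math> fails for every threshold.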

====Theory====

[https://nowak.ece.wisc.edu/SLT09/lecture8.pdf Reference]<br>
In the case where the hypothesis class <math>\mathcal{H}</math> is finite, with probability at least <math>1-\delta</math>, for all <math>h \in \mathcal{H}</math>:
* <math>|L_D(h) - L_S(h)| < \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}</math>
: where <math>m</math> is the size of the sample.
For a general hypothesis class with finite VC dimension, with probability at least <math>1-\delta</math>, for all <math>h \in \mathcal{H}</math>:
* <math>|L_D(h) - L_S(h)| < K_1 \sqrt{\frac{VCdim + K_2 \log(2/\delta)}{2m}}</math>
: for some constants <math>K_1, K_2</math>.

===Bias-Variance Tradeoff===

* Let <math>L_D(h)</math> be the true loss of hypothesis <math>h</math>
: and <math>L_S(h)</math> be the empirical (training) loss of hypothesis <math>h</math>.
* <math>L_D(h_S^*) = L_D(h_D^*) + [L_D(h_S^*) - L_D(h_D^*)]</math>
* The term <math>L_D(h_D^*)</math> is called the bias.
* The term <math>[L_D(h_S^*) - L_D(h_D^*)]</math> is called the variance.
* A larger hypothesis class gives smaller bias but larger variance.
* This is the tradeoff between underfitting and overfitting.

===Rademacher Complexity===

====Definition====

The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.<br>
In this case, we measure how well the class can fit random noise.<br>
Given a set <math>A \subset \mathbb{R}^m</math>, the Rademacher complexity is:<br>
<math>R(A) = \frac{1}{m}E_{\sigma} [\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i]</math><br>
where each <math>\sigma_i</math> is drawn uniformly and independently from <math>\{-1, 1\}</math>.<br>
Given a sample <math>S=\{z_1,...,z_m\}</math> and a function class <math>F</math>, the empirical Rademacher complexity is:<br>
<math>R(F \circ S)</math><br>
where <math>F \circ S = \{(f(z_1),...,f(z_m)) \mid f \in F\}</math>.<br>
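
A Monte Carlo sketch of the empirical Rademacher complexity for a finite function class, where each row of the array A is one vector <math>(f(z_1),...,f(z_m))</math>; the function and argument names are illustrative.
<pre>
import numpy as np

def empirical_rademacher(A, n_draws=10000, seed=0):
    """Estimate R(A) = (1/m) E_sigma[ sup_{a in A} sum_i sigma_i a_i ] by sampling sigma."""
    rng = np.random.default_rng(seed)
    k, m = A.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher signs
    # For each draw of signs, take the sup over the (finite) set A, then average over draws.
    return np.mean(np.max(sigma @ A.T, axis=1)) / m
</pre>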

;Notes

===Concentration Bounds===

====Hoeffding's inequality====

Let <math>X_1,...,X_n</math> be independent random variables bounded in <math>[a,b]</math>.<br>
Then <math>P(|\bar{X}-E[\bar{X}]| \geq t) \leq 2\exp(-\frac{2nt^2}{(b-a)^2})</math>
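
A quick NumPy simulation of the bound for Bernoulli(0.5) variables, which are bounded in [0, 1] (the sample size, threshold, and number of trials are illustrative):
<pre>
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100000

# Empirical frequency of the deviation event vs. the Hoeffding bound.
X_bar = rng.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(X_bar - 0.5) >= t)
bound = 2 * np.exp(-2 * n * t ** 2)

print(empirical, "<=", bound)  # the observed frequency should not exceed the bound
</pre>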