Machine Learning

From David's Wiki
Revision as of 08:39, 11 November 2019 by David (talk | contribs) (→‎Cross Entropy)
Jump to navigation Jump to search

Machine Learning

Loss functions

(Mean) Squared Error

The squared error is:

If our model is linear regression then this is convex.

Proof


so the hessian is positive semi-definite

Cross Entropy

The cross entropy loss is

Notes
  • If our model is where is the sigmoid function then this is convex
Proof

We show that the Hessian is positive semi definite.







which is a PSD matrix

Hinge Loss

Optimization

Gradient Descent

Also known as direction of steepest gradient.
To minimize a loss function, just take steps in the opposite direction of the gradient.

Stochastic Gradient Descent

Oftentimes with large amounts of data, you can't take the gradient of all your data at once. In this case, we use batches of data.
We can take batches by getting a random order of indices without replacement.

for epoch=1 to n
  generate batches
  for i=1 to m
    take gradient w.r.t batch i
    update using above gradient
Batch Size

Coordinate Block Descent

Learning Rate

SVM

Andrew Ng Notes
Support Vector Machine
This is a linear classifier the same as a perceptron except the goal is not to just classify our data properly but to also maximize the margin.
where is the sign function.

Margins

The margin denoted by is the distance between our classifier and the closest point.

Functional Margin

The margin corresponding to one example is:
The margin for our entire sample is the smallest margin per example.

Geometric Margin

The geometric margin is the actual distance.

Notes
  • is the normal vector of our hyperplane so is the length of the projection of x onto our normal vector.
This is the distance to our hyperplane.

Lagrangians

The goal for svm is to maximize the margin:

which is equivalent to, by setting ,


In general, given an optimization in the (primal) form:

we can rewrite the optimization as

where is called the lagrangian.
Since ,
we have:

The left term is called the dual problem.
If the solution to the dual problem satisfy some conditions called the KKT conditions, then it is also the solution to the original problem.

Kernel Trick

Oftentimes, using linear classifiers such as perceptron and SVM may fail to classify data for which the true decision boundary is non-linear.
In this case, one way to get around this is to perform a non-linear preprocessing of the data .
For example If our original model and training only used then we only need
A kernel is a function that can be expressed as for some function

Identifying if a function is a kernel

Basic check: Since the kernel is an inner-product between , it should satisfy the axioms of inner products, namely , otherwise it is not a kernel.

Mercer's Theorem

Let our kernel function be . Then for any sample S, the corresponding matrix where is symmetric positive definite.

Proof

Symmetry:
Positive Definite:
Let .
Then

Learning Theory

PAC Learning

Probably Approximately Correct (PAC)
A hypothesis class is PAC learnable if given , there is some function polynomial in such that if we have a sample size then with probability the hypothesis we will learn will have an average error .

Uniform Convergence

If for all hypothesis , , then the training set is called -representative.
Then .
A hypothesis class has uniform convergence if there exists such that for every , if we draw a sample then with probability , is -representative.

VC dimension

Wikipedia Page
Some slides

Shattering

A model parameterized by is said to shatter a set of points if for every possible set of binary labellings there exists such that makes no errors.

Definition

Intuitively, the VC dimension of a hypothesis set is how complex of a model it is.
Stolen from wikipedia:
The VC dimension of a model is the maximum number of points that can be arranged so that shatters them. More formally, it is the maximum cardinal such that some data point set of cardinality can be shattered by .

Notes
  • To show VCdim is at least n, find a training set of size n that can be shattered by our hypothesis class.
  • To show VCdim is leq n, prove no training set of size n+1 can be shattered by our hypothesis class.
  • Number of parameters do not necessarily correspond to VC dimension.
has infinite VC dimension with one parameter

Theory

Reference
In the case where the Hypothesis class is finite, we have

where is the size of the sample.

For all h in H,

for some constants

Growth Function

The growth function is maximum number of ways examples can be labelled using hypotheses from

Notes
  • If , then

Sauer's Lemma

Reference
After the VCdim, the growth function grows as a polynomial

  • If then
  • Also if then .

Bias-Variance Tradeoff

  • Let be the true loss of hypothesis h
and be the true loss of hypothesis h
  • The term is called the bias
  • The term is called variance.
  • Larger hypothesis class will get smaller bias but larger variance.
  • Overfitting vs. underfitting

Rademacher Complexity

Definition

The Rademacher complexity, like the VC dimension, measures how "rich" the hypothesis space is.
In this case, we see how well we can fit random noise.
Given a set the Rademacher complexity is:

where each are from a discrete uniform distribution
Given a sample and a function class , the empirical rademacher complexity is:

where

Notes

Concentration Bounds

Hoeffding's inequality

Let be bounded in (a,b)
Then