\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
K-shot classification means <math>k</math> training samples per label; some papers instead use <math>k</math> to denote the size of the whole training set.
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
** Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math> (a toy sketch follows this list).
* Test time: given a new task <math>(D^{train}, D^{test})</math>, apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>.
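A toy sketch of this bi-level training loop (numpy). Everything in it is a hypothetical stand-in chosen only to make the structure concrete: <math>\theta</math> is a scalar "initialization", the base learner <code>A</code> pulls it halfway toward a task's training-target mean, and the loss is mean squared error on the task's test targets.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins, only to make the loop concrete:
# theta is a scalar "initialization", A adapts it halfway toward the task's
# training-target mean, and the loss is MSE on the task's test targets.
def A(D_train, theta):
    _, y = D_train
    return theta + 0.5 * (y.mean() - theta)            # phi_i = A(D_i^train, theta)

def loss(D_test, phi):
    _, y = D_test
    return ((y - phi) ** 2).mean()

def make_task():
    offset = rng.normal(loc=2.0)                        # each task regresses to its own offset
    x = rng.normal(size=10)
    y = offset + 0.1 * rng.normal(size=10)
    return (x[:5], y[:5]), (x[5:], y[5:])               # (D_i^train, D_i^test)

theta, lr = 0.0, 0.1
for step in range(200):
    grad = 0.0
    for D_tr, D_te in (make_task() for _ in range(8)):  # a batch of meta-training tasks
        phi = A(D_tr, theta)
        # chain rule: d loss / d theta = (d phi / d theta) * (d loss / d phi)
        grad += 0.5 * (-2.0) * (D_te[1] - phi).mean()
    theta -= lr * grad / 8
print(theta)   # ends up near the average task offset, i.e. a good starting point for adaptation
</syntaxhighlight>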
[Finn et al. 2017]
* Idea: train a model that is easy to fine-tune on new tasks
* One-step fine-tuning: <math>\phi_i = \theta - \alpha \nabla_{\theta} L(\theta, D_{i}^{train})</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* Meta-objective: <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \min_{\theta} \sum_{i=1}^{n} L(\theta - \alpha \nabla_{\theta} L(\theta, D_i^{train}), D_i^{test})</math>
1. Model-agnostic meta-learning (MAML)
* Use gradient descent to optimize over <math>\theta</math>
** <math>\nabla_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n} (\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives, which are expensive.
* First-order MAML: ignore the <math>\nabla_{\theta} \phi_i</math> term, i.e. replace it with the identity matrix (see the sketch below).
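A minimal first-order MAML sketch in PyTorch for a toy linear-regression task distribution (the model, step sizes, and task sampler are assumptions, not the paper's setup). The inner gradient step produces <math>\phi_i</math>; the meta-update then applies <math>\nabla_{\phi} L(\phi_i, D_i^{test})</math> directly to <math>\theta</math>, i.e. <math>\nabla_{\theta}\phi_i</math> is treated as the identity.

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
alpha, beta, dim = 0.05, 0.01, 5                        # inner / outer step sizes (hypothetical)
theta = {"w": torch.zeros(dim, requires_grad=True),     # meta parameters
         "b": torch.zeros(1, requires_grad=True)}

def model(params, x):                                   # simple linear model
    return x @ params["w"] + params["b"]

def mse(params, x, y):
    return ((model(params, x) - y) ** 2).mean()

def sample_task(n=10):                                  # hypothetical task distribution
    w_true = torch.randn(dim)
    x = torch.randn(2 * n, dim)
    y = x @ w_true + 0.1 * torch.randn(2 * n)
    return (x[:n], y[:n]), (x[n:], y[n:])               # (D_i^train, D_i^test)

for step in range(1000):
    meta_grads = {k: torch.zeros_like(v) for k, v in theta.items()}
    tasks = [sample_task() for _ in range(4)]
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        # inner loop: one gradient step on D_i^train gives phi_i
        g = torch.autograd.grad(mse(theta, x_tr, y_tr), list(theta.values()))
        phi = {k: (v - alpha * gi).detach().requires_grad_()
               for (k, v), gi in zip(theta.items(), g)}
        # first-order meta-gradient: grad of the test loss w.r.t. phi_i, applied to theta as-is
        g_test = torch.autograd.grad(mse(phi, x_te, y_te), list(phi.values()))
        for k, gi in zip(theta, g_test):
            meta_grads[k] += gi / len(tasks)
    with torch.no_grad():
        for k in theta:
            theta[k] -= beta * meta_grads[k]
</syntaxhighlight>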
Reptile [Nichol et al. 2018]: repeatedly fine-tune <math>\theta</math> on a sampled task and move <math>\theta</math> a small step toward the fine-tuned parameters; this also avoids second-order derivatives.
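A sketch of the Reptile update under the same kind of toy linear-regression setup (hyperparameters and task sampler are assumptions): fine-tune a copy of <math>\theta</math> on one task for a few steps to get <math>\phi</math>, then move <math>\theta</math> a small step toward <math>\phi</math>. No second-order derivatives and no per-task train/test split are needed.

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
alpha, epsilon, dim, inner_steps = 0.05, 0.1, 5, 3       # hypothetical hyperparameters
theta = {"w": torch.zeros(dim), "b": torch.zeros(1)}      # meta parameters

def mse(params, x, y):
    return ((x @ params["w"] + params["b"] - y) ** 2).mean()

def sample_task(n=10):                                     # hypothetical task distribution
    w_true = torch.randn(dim)
    x = torch.randn(n, dim)
    return x, x @ w_true + 0.1 * torch.randn(n)

for step in range(1000):
    x, y = sample_task()
    # inner loop: ordinary fine-tuning of a copy of theta on the sampled task
    phi = {k: v.clone().requires_grad_() for k, v in theta.items()}
    for _ in range(inner_steps):
        g = torch.autograd.grad(mse(phi, x, y), list(phi.values()))
        phi = {k: (v - alpha * gi).detach().requires_grad_()
               for (k, v), gi in zip(phi.items(), g)}
    # Reptile meta-update: theta <- theta + epsilon * (phi - theta)
    with torch.no_grad():
        for k in theta:
            theta[k] += epsilon * (phi[k] - theta[k])
</syntaxhighlight>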
2. <math>A</math> is a simple linear (or non-parametric) learner on data embeddings computed via <math>f_{\theta}</math>
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use the dual form of the SVM, so the number of optimization variables = (number of training samples) × (number of classes), which is small in the few-shot setting (see the sketch below).
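A rough sketch of this split of roles at meta-test time, with a hypothetical embedding network and scikit-learn's LinearSVC as the linear head. Lee et al. 2019 actually solve the SVM dual with a differentiable solver so that meta-gradients flow back into <math>f_{\theta}</math>; the scikit-learn classifier here cannot do that and only illustrates the structure.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

torch.manual_seed(0)
# hypothetical embedding network f_theta (in practice a CNN trained by meta-learning)
f_theta = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def embed(x):
    with torch.no_grad():
        return f_theta(x).numpy()

# hypothetical 5-way 5-shot task: support (D^train) and query (D^test) sets
x_support, y_support = torch.randn(25, 32), torch.arange(5).repeat_interleave(5)
x_query,   y_query   = torch.randn(50, 32), torch.arange(5).repeat(10)

# A: fit a linear classifier on the support embeddings, then predict on the query embeddings
clf = LinearSVC()
clf.fit(embed(x_support), y_support.numpy())
accuracy = (clf.predict(embed(x_query)) == y_query.numpy()).mean()
print(accuracy)
</syntaxhighlight>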
3. <math>A</math> is a non-parametric learner
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define class prototypes (cluster centers); a sketch follows this list
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math> where <math>S_k</math> is the set of training samples in <math>D_i^{train}</math> with label <math>k</math>
** <math>P_{\theta}(y=k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math>
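A sketch of the prototypical-network classification rule with a hypothetical embedding network and squared Euclidean distance as <math>d</math> (the choice used by Snell et al. 2017).

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

torch.manual_seed(0)
f_theta = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))  # hypothetical embedding net

# hypothetical 5-way 5-shot support set (D^train) and a batch of query points (D^test)
x_support, y_support = torch.randn(25, 32), torch.arange(5).repeat_interleave(5)
x_query = torch.randn(50, 32)

z_support = f_theta(x_support)                       # (25, 16) embeddings
# prototypes: c_k = mean of the support embeddings with label k
prototypes = torch.stack([z_support[y_support == k].mean(dim=0) for k in range(5)])   # (5, 16)

z_query = f_theta(x_query)                           # (50, 16)
# squared Euclidean distance d(f_theta(x), c_k) for every query/prototype pair
dists = torch.cdist(z_query, prototypes) ** 2        # (50, 5)
# P_theta(y = k | x) = softmax over -d(f_theta(x), c_k)
probs = torch.softmax(-dists, dim=1)
pred = probs.argmax(dim=1)                           # predicted class per query point
</syntaxhighlight>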
4. <math>A</math> is a black box (e.g. an LSTM that maps <math>D_i^{train}</math> directly to <math>\phi_i</math>).
==Misc==