\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
K-shot classification means <math>k</math> training samples per label; some papers instead use <math>k</math> to denote the size of the whole training set.
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
** Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math> (a toy sketch follows this list).
* Test time: given a new task <math>(D^{train}, D^{test})</math>, apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>.
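A toy sketch of this bi-level training loop (numpy). Everything in it is a hypothetical stand-in chosen only to make the structure concrete: <math>\theta</math> is a scalar "initialization", the base learner <code>A</code> pulls it halfway toward a task's training-target mean, and the loss is mean squared error on the task's test targets.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins, only to make the loop concrete:
# theta is a scalar "initialization", A adapts it halfway toward the task's
# training-target mean, and the loss is MSE on the task's test targets.
def A(D_train, theta):
    _, y = D_train
    return theta + 0.5 * (y.mean() - theta)            # phi_i = A(D_i^train, theta)

def loss(D_test, phi):
    _, y = D_test
    return ((y - phi) ** 2).mean()

def make_task():
    offset = rng.normal(loc=2.0)                        # each task regresses to its own offset
    x = rng.normal(size=10)
    y = offset + 0.1 * rng.normal(size=10)
    return (x[:5], y[:5]), (x[5:], y[5:])               # (D_i^train, D_i^test)

theta, lr = 0.0, 0.1
for step in range(200):
    grad = 0.0
    for D_tr, D_te in (make_task() for _ in range(8)):  # a batch of meta-training tasks
        phi = A(D_tr, theta)
        # chain rule: d loss / d theta = (d phi / d theta) * (d loss / d phi)
        grad += 0.5 * (-2.0) * (D_te[1] - phi).mean()
    theta -= lr * grad / 8
print(theta)   # ends up near the average task offset, i.e. a good starting point for adaptation
</syntaxhighlight>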
[Finn et al. 2017]
* Idea: train a model that is easy to fine-tune on new tasks
* One-step fine-tuning: <math>\phi_i = \theta - \alpha \nabla_{\theta} L(\theta, D_{i}^{train})</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* Meta-objective: <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \min_{\theta} \sum_{i=1}^{n} L(\theta - \alpha \nabla_{\theta} L(\theta, D_i^{train}), D_i^{test})</math>
1. Model-agnostic meta-learning (MAML)
* Use gradient descent to optimize over <math>\theta</math>
** <math>\nabla_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n} (\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives, which are expensive.
* First-order MAML: ignore the <math>\nabla_{\theta} \phi_i</math> term, i.e. replace it with the identity matrix (see the sketch below).
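A minimal first-order MAML sketch in PyTorch for a toy linear-regression task distribution (the model, step sizes, and task sampler are assumptions, not the paper's setup). The inner gradient step produces <math>\phi_i</math>; the meta-update then applies <math>\nabla_{\phi} L(\phi_i, D_i^{test})</math> directly to <math>\theta</math>, i.e. <math>\nabla_{\theta}\phi_i</math> is treated as the identity.

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
alpha, beta, dim = 0.05, 0.01, 5                        # inner / outer step sizes (hypothetical)
theta = {"w": torch.zeros(dim, requires_grad=True),     # meta parameters
         "b": torch.zeros(1, requires_grad=True)}

def model(params, x):                                   # simple linear model
    return x @ params["w"] + params["b"]

def mse(params, x, y):
    return ((model(params, x) - y) ** 2).mean()

def sample_task(n=10):                                  # hypothetical task distribution
    w_true = torch.randn(dim)
    x = torch.randn(2 * n, dim)
    y = x @ w_true + 0.1 * torch.randn(2 * n)
    return (x[:n], y[:n]), (x[n:], y[n:])               # (D_i^train, D_i^test)

for step in range(1000):
    meta_grads = {k: torch.zeros_like(v) for k, v in theta.items()}
    tasks = [sample_task() for _ in range(4)]
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        # inner loop: one gradient step on D_i^train gives phi_i
        g = torch.autograd.grad(mse(theta, x_tr, y_tr), list(theta.values()))
        phi = {k: (v - alpha * gi).detach().requires_grad_()
               for (k, v), gi in zip(theta.items(), g)}
        # first-order meta-gradient: grad of the test loss w.r.t. phi_i, applied to theta as-is
        g_test = torch.autograd.grad(mse(phi, x_te, y_te), list(phi.values()))
        for k, gi in zip(theta, g_test):
            meta_grads[k] += gi / len(tasks)
    with torch.no_grad():
        for k in theta:
            theta[k] -= beta * meta_grads[k]
</syntaxhighlight>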
Reptile [Nichol et al. 2018]: repeatedly fine-tune <math>\theta</math> on a sampled task and move <math>\theta</math> a small step toward the fine-tuned parameters; this also avoids second-order derivatives.
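A sketch of the Reptile update under the same kind of toy linear-regression setup (hyperparameters and task sampler are assumptions): fine-tune a copy of <math>\theta</math> on one task for a few steps to get <math>\phi</math>, then move <math>\theta</math> a small step toward <math>\phi</math>. No second-order derivatives and no per-task train/test split are needed.

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
alpha, epsilon, dim, inner_steps = 0.05, 0.1, 5, 3       # hypothetical hyperparameters
theta = {"w": torch.zeros(dim), "b": torch.zeros(1)}      # meta parameters

def mse(params, x, y):
    return ((x @ params["w"] + params["b"] - y) ** 2).mean()

def sample_task(n=10):                                     # hypothetical task distribution
    w_true = torch.randn(dim)
    x = torch.randn(n, dim)
    return x, x @ w_true + 0.1 * torch.randn(n)

for step in range(1000):
    x, y = sample_task()
    # inner loop: ordinary fine-tuning of a copy of theta on the sampled task
    phi = {k: v.clone().requires_grad_() for k, v in theta.items()}
    for _ in range(inner_steps):
        g = torch.autograd.grad(mse(phi, x, y), list(phi.values()))
        phi = {k: (v - alpha * gi).detach().requires_grad_()
               for (k, v), gi in zip(phi.items(), g)}
    # Reptile meta-update: theta <- theta + epsilon * (phi - theta)
    with torch.no_grad():
        for k in theta:
            theta[k] += epsilon * (phi[k] - theta[k])
</syntaxhighlight>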
2. <math>A</math> is a simple linear (or non-parametric) learner on data embeddings computed via <math>f_{\theta}</math>
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use the dual form of the SVM, so the number of optimization variables = (number of training samples) × (number of classes), which is small in the few-shot setting (see the sketch below).
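A rough sketch of this split of roles at meta-test time, with a hypothetical embedding network and scikit-learn's LinearSVC as the linear head. Lee et al. 2019 actually solve the SVM dual with a differentiable solver so that meta-gradients flow back into <math>f_{\theta}</math>; the scikit-learn classifier here cannot do that and only illustrates the structure.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

torch.manual_seed(0)
# hypothetical embedding network f_theta (in practice a CNN trained by meta-learning)
f_theta = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def embed(x):
    with torch.no_grad():
        return f_theta(x).numpy()

# hypothetical 5-way 5-shot task: support (D^train) and query (D^test) sets
x_support, y_support = torch.randn(25, 32), torch.arange(5).repeat_interleave(5)
x_query,   y_query   = torch.randn(50, 32), torch.arange(5).repeat(10)

# A: fit a linear classifier on the support embeddings, then predict on the query embeddings
clf = LinearSVC()
clf.fit(embed(x_support), y_support.numpy())
accuracy = (clf.predict(embed(x_query)) == y_query.numpy()).mean()
print(accuracy)
</syntaxhighlight>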
3. <math>A</math> is a non-parametric learner
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define class prototypes (cluster centers); a sketch follows this list
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math> where <math>S_k</math> is the set of training samples in <math>D_i^{train}</math> with label <math>k</math>
** <math>P_{\theta}(y=k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math>
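A sketch of the prototypical-network classification rule with a hypothetical embedding network and squared Euclidean distance as <math>d</math> (the choice used by Snell et al. 2017).

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

torch.manual_seed(0)
f_theta = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))  # hypothetical embedding net

# hypothetical 5-way 5-shot support set (D^train) and a batch of query points (D^test)
x_support, y_support = torch.randn(25, 32), torch.arange(5).repeat_interleave(5)
x_query = torch.randn(50, 32)

z_support = f_theta(x_support)                       # (25, 16) embeddings
# prototypes: c_k = mean of the support embeddings with label k
prototypes = torch.stack([z_support[y_support == k].mean(dim=0) for k in range(5)])   # (5, 16)

z_query = f_theta(x_query)                           # (50, 16)
# squared Euclidean distance d(f_theta(x), c_k) for every query/prototype pair
dists = torch.cdist(z_query, prototypes) ** 2        # (50, 5)
# P_theta(y = k | x) = softmax over -d(f_theta(x), c_k)
probs = torch.softmax(-dists, dim=1)
pred = probs.argmax(dim=1)                           # predicted class per query point
</syntaxhighlight>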
4. <math>A</math> is a black box (e.g. an LSTM that maps <math>D_i^{train}</math> directly to <math>\phi_i</math>).
==Misc==