
\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
K-shot classification means <math>k</math> training samples per label (e.g., 5-way 1-shot means five classes with one labeled sample each). Some papers instead use <math>k</math> for the size of the whole training set.
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
* Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> that perform well on <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math>.
* Test time: given a new task <math>(D^{train}, D^{test})</math>, apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math> (see the sketch below).
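Schematically, the objective above can be written as a plain Python function. Here <code>A</code>, <code>loss</code>, and <code>tasks</code> are illustrative placeholders for whichever base learner, loss, and task sampler a paper uses, not names from any specific implementation:
<syntaxhighlight lang="python">
def meta_objective(theta, tasks, A, loss):
    # tasks: list of (D_i_train, D_i_test) pairs drawn from D_metatrain
    total = 0.0
    for d_train, d_test in tasks:
        phi_i = A(d_train, theta)     # base learner outputs task-specific parameters
        total += loss(d_test, phi_i)  # evaluate phi_i on the task's test split
    return total                      # minimize this over theta
</syntaxhighlight>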
[Finn et al. 2017]
* Idea: train a model that is easy to fine-tune on new tasks
* One-step fine-tuning: <math>\phi_{i} = \theta - \alpha \nabla_{\theta} L(\theta, D_{i}^{train})</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \min_{\theta} \sum_{i=1}^{n} L(\theta - \alpha \nabla_{\theta} L(\theta, D_i^{train}), D_i^{test})</math>
1. Model-agnostic meta-learning (MAML)
* Use GD to optimize over <math>\theta</math>
** <math>\nabla_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n} (\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives, which are expensive to compute.
* First-order MAML: simply ignore the <math>\nabla_{\theta} \phi_i</math> term by replacing it with the identity matrix (see the sketch below).
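A minimal PyTorch sketch of one MAML meta-gradient computation, assuming a functional model <code>f(x, params)</code> and a differentiable <code>loss_fn</code> (both hypothetical names). Setting <code>first_order=True</code> drops the second-order term, giving first-order MAML:
<syntaxhighlight lang="python">
import torch

def maml_meta_loss(params, tasks, f, loss_fn, alpha=0.01, first_order=False):
    # params: list of tensors with requires_grad=True (the meta-parameters theta)
    # tasks:  iterable of (x_tr, y_tr, x_te, y_te) tuples
    meta_loss = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:
        inner_loss = loss_fn(f(x_tr, params), y_tr)
        # create_graph=True keeps second-order terms in the graph;
        # first-order MAML treats the inner gradient as a constant instead.
        grads = torch.autograd.grad(inner_loss, params,
                                    create_graph=not first_order)
        phi = [p - alpha * g for p, g in zip(params, grads)]  # one-step fine-tune
        meta_loss = meta_loss + loss_fn(f(x_te, phi), y_te)
    return meta_loss  # backprop this w.r.t. params, then step an outer optimizer
</syntaxhighlight>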
Reptile [Nichol et al. 2018]
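Reptile avoids the second-order term differently: run a few SGD steps on a sampled task, then move <math>\theta</math> toward the adapted weights. A minimal PyTorch sketch, where <code>task_batches</code> (a list of <code>(x, y)</code> batches from one task) is an assumed input:
<syntaxhighlight lang="python">
import copy
import torch

def reptile_step(model, task_batches, loss_fn, inner_lr=0.01, meta_lr=0.1):
    # Inner loop: adapt a copy of the model to the task with plain SGD.
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for x, y in task_batches:
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    # Outer update: theta <- theta + meta_lr * (phi - theta).
    with torch.no_grad():
        for p, q in zip(model.parameters(), adapted.parameters()):
            p.add_(meta_lr * (q - p))
</syntaxhighlight>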
2. <math>A</math> is a simple linear/non-parametric learner on data embeddings computed via <math>f_{\theta}</math>
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use the dual form of the SVM so # of optimization variables = # of training samples × # of classes (see the sketch below)
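A hedged illustration of this base learner with scikit-learn: embed the support set with <math>f_{\theta}</math>, fit a linear SVM, and score the queries. Note that [Lee et al. 2019] solve a differentiable dual QP so that gradients flow back into <math>f_{\theta}</math>; the off-the-shelf <code>SVC</code> below is a non-differentiable stand-in used for illustration only, and <code>embed_fn</code> is an assumed wrapper returning numpy embeddings:
<syntaxhighlight lang="python">
from sklearn.svm import SVC

def svm_base_learner(embed_fn, support_x, support_y, query_x):
    z_support = embed_fn(support_x)  # f_theta embeddings of the support set
    z_query = embed_fn(query_x)      # f_theta embeddings of the queries
    clf = SVC(kernel="linear")       # solved internally in its dual form
    clf.fit(z_support, support_y)
    return clf.predict(z_query)
</syntaxhighlight>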
3. <math>A</math> is a non-parametric learner
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define prototypes (cluster centers)
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math> where <math>S_k</math> is the set of training samples with label <math>k</math>
** <math>P_{\theta}(y=k|x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math> (see the sketch below)
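A minimal PyTorch sketch of the two formulas above: class prototypes are mean support embeddings, and queries are classified by a softmax over negative squared Euclidean distances to the prototypes. Here <code>embed</code> stands in for <math>f_{\theta}</math>:
<syntaxhighlight lang="python">
import torch

def proto_classify(embed, support_x, support_y, query_x, n_classes):
    z_support = embed(support_x)          # f_theta on the support set
    z_query = embed(query_x)              # f_theta on the queries
    # c_k: mean embedding of the support examples with label k
    protos = torch.stack([z_support[support_y == k].mean(dim=0)
                          for k in range(n_classes)])
    # d(f_theta(x), c_k): squared Euclidean distance to each prototype
    dists = torch.cdist(z_query, protos).pow(2)
    return torch.softmax(-dists, dim=1)   # P_theta(y = k | x) per query
</syntaxhighlight>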
4. <math>A</math> is a black box (e.g. an LSTM).


==Misc==