*: The domain classifier penalizes the distance between <math>F(Q_X)</math> and <math>F(P_X)</math>.
Example 1: MMD distance (Maximum mean discrepancy)
Define <math>\tilde{x}_i = F(x_i)</math>.
<math>D_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) \stackrel{\triangle}{=} \Vert \frac{1}{m}\sum \phi(\tilde{x}_i^S) - \frac{1}{m}\sum \phi(\tilde{x}_i^T) \Vert</math>
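A minimal sketch of the empirical estimate, assuming the simplest feature map <math>\phi(x) = x</math> (so the MMD reduces to the distance between the two feature means); the random arrays below are stand-ins for the encoded samples <math>\tilde{x}_i</math>:
<syntaxhighlight lang="python">
import numpy as np

def mmd_linear(feat_s, feat_t):
    # Empirical MMD with the identity feature map phi(x) = x:
    # the norm of the difference between the two feature means.
    return np.linalg.norm(feat_s.mean(axis=0) - feat_t.mean(axis=0))

# Toy usage: features F(x^S), F(x^T) as (m, d) arrays.
rng = np.random.default_rng(0)
feat_s = rng.normal(0.0, 1.0, size=(100, 16))
feat_t = rng.normal(0.5, 1.0, size=(100, 16))
print(mmd_linear(feat_s, feat_t))
</syntaxhighlight>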
MMD-based DA (Tzeng ''et al.'' 2014):
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda D_{MMD}(F(x^s), F(x^t))</math>
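A hedged sketch of one training step for this objective in PyTorch; the encoder <code>F</code>, classifier <code>C1</code>, optimizer, and weight <code>lam</code> are placeholders, not the exact setup of Tzeng ''et al.'':
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def mmd(feat_s, feat_t):
    # Linear-kernel MMD: distance between batch feature means.
    return (feat_s.mean(dim=0) - feat_t.mean(dim=0)).norm()

def train_step(F, C1, opt, x_s, y_s, x_t, lam=0.25):
    # One step of: min L_cls(C1(F(x^s)), y^s) + lambda * D_MMD(F(x^s), F(x^t))
    opt.zero_grad()
    feat_s, feat_t = F(x_s), F(x_t)
    loss = nn.functional.cross_entropy(C1(feat_s), y_s) + lam * mmd(feat_s, feat_t)
    loss.backward()
    opt.step()
    return loss.item()
</syntaxhighlight>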
Example 2: Wasserstein distance | |||
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda W(F(x^s), F(x^t))</math>
The Wasserstein distance is computed using Kantorovich duality.
It is an example of an IPM (integral probability metric) distance.
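As an illustration, the exact Wasserstein distance between two small batches of features can be computed with the POT library (a sketch; in deep DA the distance is instead estimated with a critic network via the Kantorovich dual, as in WGAN-style training):
<syntaxhighlight lang="python">
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_features(feat_s, feat_t):
    # Solve the primal OT linear program between two empirical
    # feature distributions (feasible for small batches only).
    a = np.full(len(feat_s), 1.0 / len(feat_s))  # uniform source weights
    b = np.full(len(feat_t), 1.0 / len(feat_t))  # uniform target weights
    M = ot.dist(feat_s, feat_t, metric='euclidean')  # pairwise cost matrix
    return ot.emd2(a, b, M)  # optimal transport cost (W_1)
</syntaxhighlight>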
* We can also use improved and robust versions of the Wasserstein distance in DA:
** Robust Wasserstein [Balaji ''et al.'' NeurIPS 2020]
** Normalized Wasserstein [....ICCV] | |||
===CycleGAN=== | |||
[Zhu ''et al.'' 2017]
Another approach to domain adaptation, based on image-to-image translation.
Source: <math>(x^s, y^s)</math> | |||
Target: <math>x^t</math> | |||
Train two functions: <math>G_{S \to T}</math> and <math>G_{T \to S}</math>. | |||
Losses: | |||
* <math>L_{GAN}(x^s, x^t, G_{S\to T}, D^T) = E_{x^t}\left[\log D^T(x^t)\right] + E_{x^s}\left[\log(1-D^T(G_{S\to T}(x^s))) \right]</math>.
* <math>L_{GAN}(x^s, x^t, G_{T \to S}, D^S)</math> is defined analogously.
* Cycle consistency: <math>L_{cyc} = E\left[ \Vert G_{T\to S}(G_{S \to T}(x^s)) - x^s \Vert \right] + E \left[ \Vert G_{S \to T}(G_{T\to S}(x^t)) - x^t \Vert \right]</math>
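A minimal sketch of these two losses in PyTorch; <code>G_st</code>, <code>G_ts</code>, <code>D_t</code> stand in for <math>G_{S \to T}</math>, <math>G_{T \to S}</math>, <math>D^T</math>, the L1 norm follows the original paper, and all architectures are assumptions:
<syntaxhighlight lang="python">
import torch

def gan_loss_s_to_t(D_t, G_st, x_s, x_t):
    # L_GAN for the S -> T direction; D_t is assumed to output probabilities.
    real = torch.log(D_t(x_t)).mean()
    fake = torch.log(1.0 - D_t(G_st(x_s))).mean()
    return real + fake

def cycle_loss(G_st, G_ts, x_s, x_t):
    # L_cyc: each sample should survive a round trip through both generators.
    loss_s = (G_ts(G_st(x_s)) - x_s).abs().mean()  # S -> T -> S
    loss_t = (G_st(G_ts(x_t)) - x_t).abs().mean()  # T -> S -> T
    return loss_s + loss_t
</syntaxhighlight>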
Other tricks: | |||
* Domain-specific batch norms | |||
* Entropy-based regularization
===Are assumptions necessary?=== | |||
Assumptions: | |||
* Covariate shift | |||
* <math>d_H(Q_X, P_X)</math> is small
* <math>\epsilon_{joint}</math> is small
See [Ben-David ''et al.''] for impossibility results:
* The covariate shift assumption alone is not sufficient for DA.
* A small <math>d_H(Q_X, P_X)</math> is necessary for DA.
* A small joint training error <math>\epsilon_{joint}</math> is necessary for DA.
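For reference, these three quantities appear together in the Ben-David ''et al.'' generalization bound (stated schematically, with <math>d_H</math> standing in for the <math>\mathcal{H}\Delta\mathcal{H}</math>-divergence used in the paper): for every hypothesis <math>h</math>,

<math>\epsilon_T(h) \le \epsilon_S(h) + d_H(Q_X, P_X) + \epsilon_{joint}, \qquad \epsilon_{joint} = \min_{h'} \left[ \epsilon_S(h') + \epsilon_T(h') \right].</math>

The impossibility theorems show that none of these conditions can be dropped in general.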
==Misc==