*: The domain classifier penalizes the distance between <math>F(Q_X)</math> and <math>F(P_X)</math>.


Example 1: MMD distance (Maximum mean discrepancy)
Define <math>\tilde{x}_i = F(x_i)</math>.
<math>D_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) \stackrel{\triangle}{=} \Vert \frac{1}{m}\sum \phi(\tilde{x}_i^S) - \frac{1}{m}\sum \phi(\tilde{x}_i^T) \Vert</math>
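A minimal PyTorch sketch of this empirical estimate, taking <math>\phi</math> to be the identity map on the extracted features (i.e., a linear-kernel MMD); the tensor shapes and names below are illustrative placeholders, not part of any particular implementation.
<syntaxhighlight lang="python">
import torch

def mmd_linear(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """|| (1/m) sum phi(x_tilde_i^S) - (1/m) sum phi(x_tilde_i^T) || with phi = identity."""
    return (feat_s.mean(dim=0) - feat_t.mean(dim=0)).norm(p=2)

# Random features standing in for x_tilde^S = F(x^s) and x_tilde^T = F(x^t):
m, d = 64, 128                       # batch size m and feature dimension (illustrative)
feat_s = torch.randn(m, d)
feat_t = torch.randn(m, d) + 0.5     # shifted target distribution
print(mmd_linear(feat_s, feat_t))    # grows as the two feature distributions drift apart
</syntaxhighlight>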
MMD-based DA (Tzeng ''et al.'' 2014):
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda D_{MMD}(F(x^s), F(x^t))</math>
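A minimal sketch of this combined objective: a source batch contributes the classification loss, and the source/target feature means contribute the MMD penalty. The feature extractor, classifier, optimizer, and value of <math>\lambda</math> below are illustrative choices, not the setup of Tzeng ''et al.''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

F_net = nn.Sequential(nn.Linear(256, 128), nn.ReLU())   # feature extractor F (placeholder)
C1 = nn.Linear(128, 10)                                  # classifier C_1 (placeholder)
opt = torch.optim.Adam(list(F_net.parameters()) + list(C1.parameters()), lr=1e-3)
lam = 0.25                                               # trade-off weight lambda (illustrative)

def train_step(x_s, y_s, x_t):
    feat_s, feat_t = F_net(x_s), F_net(x_t)
    cls_loss = nn.functional.cross_entropy(C1(feat_s), y_s)  # L_cls(C_1 o F(x^s), y^s)
    mmd = (feat_s.mean(0) - feat_t.mean(0)).norm(p=2)         # D_MMD(F(x^s), F(x^t)), phi = identity
    loss = cls_loss + lam * mmd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One step on random data shaped like (batch, 256) source/target inputs:
x_s, y_s, x_t = torch.randn(64, 256), torch.randint(0, 10, (64,)), torch.randn(64, 256)
print(train_step(x_s, y_s, x_t))
</syntaxhighlight>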
Example 2: Wasserstein distance
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda W(F(x^s), F(x^t))</math>
The Wasserstein distance is computed using the Kantorovich duality.
This is also called an IPM (integral probability metric) distance.
* We can also use improved & robust version of Wasserstein in DA.
** Robust Wasserstein [Balaji ''et al.'' NeurIPS 2020]
** Normalized Wasserstein [....ICCV]
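A sketch of estimating <math>W(F(x^s), F(x^t))</math> through the Kantorovich dual, <math>W_1(P,Q) = \sup_{\Vert f \Vert_L \le 1} E_P[f] - E_Q[f]</math>, with a small critic network standing in for the dual potential <math>f</math>. The 1-Lipschitz constraint is enforced crudely via weight clipping (as in WGAN), and all names and hyperparameters are illustrative rather than taken from a specific paper.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # dual potential f
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def wasserstein_estimate(feat_s, feat_t, critic_steps=5, clip=0.01):
    # Inner maximization over the critic f (the dual variable in the Kantorovich form).
    for _ in range(critic_steps):
        w = critic(feat_s).mean() - critic(feat_t).mean()
        opt_c.zero_grad()
        (-w).backward()                      # gradient ascent on E_P[f] - E_Q[f]
        opt_c.step()
        for p in critic.parameters():
            p.data.clamp_(-clip, clip)       # crude 1-Lipschitz constraint via weight clipping
    # The final value approximates W_1 between the two feature distributions;
    # in DA it would be added to L_cls and minimized with respect to F.
    return critic(feat_s).mean() - critic(feat_t).mean()

feat_s, feat_t = torch.randn(64, 128), torch.randn(64, 128) + 1.0   # stand-ins for F(x^s), F(x^t)
print(wasserstein_estimate(feat_s, feat_t).item())
</syntaxhighlight>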
===CycleGAN===
[Zhu ''et al.'' 2017]
Another approach for image-to-image translation. 
Source: <math>(x^s, y^s)</math> 
Target: <math>x^t</math> 
Train two functions: <math>G_{S \to T}</math> and <math>G_{T \to S}</math>. 
Losses:
* <math>L_{GAN}(x^s, x^t, G_{S\to T}, D^T) = E_{x^t}\left[\log D^T(x^t)\right] + E_{x^s}\left[\log(1-D^T(G_{S\to T}(x^s)))\right]</math>.
* <math>L_{GAN}(x^t, x^s, G_{T \to S}, D^S)</math> is the analogous adversarial loss for the reverse direction.
* Cycle consistency: <math>L_{cyc} = E\left[ \Vert G_{T\to S}(G_{S \to T}(x^s)) - x^s \Vert \right] + E \left[ \Vert G_{S \to T}(G_{T\to S}(x^t)) - x^t \Vert \right]</math>
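A compact sketch of the three loss terms above on toy, vector-valued "images"; the tiny linear generators and discriminators are placeholders for illustration only, not the architectures of Zhu ''et al.'' 2017.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

d = 32  # toy "image" dimension (illustrative)
G_st, G_ts = nn.Linear(d, d), nn.Linear(d, d)        # G_{S->T} and G_{T->S} (placeholders)
D_t = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())   # D^T
D_s = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())   # D^S

def cyclegan_losses(x_s, x_t, lam_cyc=10.0, eps=1e-7):
    # L_GAN(x^s, x^t, G_{S->T}, D^T): D^T scores real target samples vs. translated source samples.
    l_gan_st = torch.log(D_t(x_t) + eps).mean() + torch.log(1 - D_t(G_st(x_s)) + eps).mean()
    # L_GAN(x^t, x^s, G_{T->S}, D^S): the symmetric term for the reverse direction.
    l_gan_ts = torch.log(D_s(x_s) + eps).mean() + torch.log(1 - D_s(G_ts(x_t)) + eps).mean()
    # Cycle consistency: translating to the other domain and back should recover the input.
    l_cyc = (G_ts(G_st(x_s)) - x_s).abs().mean() + (G_st(G_ts(x_t)) - x_t).abs().mean()
    return l_gan_st, l_gan_ts, lam_cyc * l_cyc

# The generators minimize (and the discriminators maximize) the adversarial terms,
# while both generators also minimize the cycle term:
x_s, x_t = torch.randn(16, d), torch.randn(16, d)
print([v.item() for v in cyclegan_losses(x_s, x_t)])
</syntaxhighlight>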
Other tricks:
* Domain-specific batch norms
* Entropy-based regularization
===Are assumptions necessary?===
Assumptions:
* Covariate shift
* <math>d_H(Q_x, P_x)</math> is small
* <math>\epsilon_{joint}</math> small
See [Ben-David ''et al.'']. 
* The covariate shift assumption alone is not sufficient.
* Necessity of small <math>d_{H}(P,Q)</math> for DA.
* Necessity of small joint training error.


==Misc==