A Theoretical Analysis of Domain Adaptation

This post is a write-up of two papers:

  • A theory of learning from different domains
  • Analysis of Representations for Domain Adaptation

Both papers prove that the error on the target domain is bounded in part by the error on the source domain, which makes them valuable theoretical guidance for domain adaptation.

A theory of learning from different domains

First, some basic setup. Let $\mathcal{D}_S, f_S$ denote the distribution on the source domain and the labeling function on that domain (here $f_S$ labels a binary task; following the papers, it may take values in $[0,1]$, with fractional values encoding label noise). Similarly, the target domain is described by $\mathcal{D}_T, f_T$.

A hypothesis is a classification function $h:\mathcal{X}\rightarrow\{0,1\}$. The error (disagreement) between a hypothesis $h$ and a labeling function $f$ is defined as:

$$\epsilon_S(h,f)=\mathrm{E}_{\mathbf{x}\sim\mathcal{D}_S}\left[\,|h(\mathbf{x})-f(\mathbf{x})|\,\right]$$

which is the error between $h$ and $f$ on the source domain. In particular, when $f=f_S$ is the true labeling function, we abbreviate $\epsilon_S(h)=\epsilon_S(h,f_S)$; likewise on the target domain, $\epsilon_T(h)=\epsilon_T(h,f_T)$. We now introduce the central quantity, the H-divergence.

H-divergence

A divergence is a weakened notion of distance: it need not satisfy all the axioms of a metric (for example, symmetry may fail). The $\mathcal{H}$-divergence between two distributions $\mathcal{D}$ and $\mathcal{D}'$ is defined relative to a hypothesis space $\mathcal{H}$:

$$d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')=2\sup_{h\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}}[h(\mathbf{x})=1]-\Pr_{\mathbf{x}\sim\mathcal{D}'}[h(\mathbf{x})=1]\right|$$

Intuitively, the divergence searches the hypothesis space $\mathcal{H}$ for an $h$ that makes $\Pr_{\mathbf{x}\sim\mathcal{D}}[h(\mathbf{x})=1]$ as large as possible while making $\Pr_{\mathbf{x}\sim\mathcal{D}'}[h(\mathbf{x})=1]$ as small as possible; the largest gap attainable this way is taken as the distance between $\mathcal{D}$ and $\mathcal{D}'$. Equivalently, this $h$ is the hypothesis in $\mathcal{H}$ that best discriminates between the two distributions.
Moreover, the divergence can be estimated from finite samples:

Lemma 1 Let $\mathcal{H}$ be a hypothesis space on $\mathcal{X}$ with VC dimension $d$. If $\mathcal{U}$ and $\mathcal{U}'$ are samples of size $m$ from $\mathcal{D}$ and $\mathcal{D}'$ respectively, and $\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')$ is the empirical $\mathcal{H}$-divergence between the samples, then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')\leq\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')+4\sqrt{\frac{d\log(2m)+\log\frac{2}{\delta}}{m}}$$

This is just a standard VC bound, where $d$ is the VC dimension of $\mathcal{H}$ and $m$ is the sample size. Clearly, when $d$ is finite, the estimate converges as the sample size goes to infinity. Next comes a way to compute the empirical divergence:
Lemma 2 The empirical divergence can be computed from the samples as

$$\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')=2\left(1-\min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I[\mathbf{x}\in\mathcal{U}']\right]\right)$$

where $I[\mathbf{x}\in\mathcal{U}]$ is the indicator that equals 1 when $\mathbf{x}\in\mathcal{U}$, so each sum simply counts the relevant points.
In fact one can check directly that the bracketed expression estimates exactly the probability gap in the H-divergence, with probabilities taken over the samples:

$$1-\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I[\mathbf{x}\in\mathcal{U}']\right]=\Pr_{\mathbf{x}\sim\mathcal{U}}[h(\mathbf{x})=1]-\Pr_{\mathbf{x}\sim\mathcal{U}'}[h(\mathbf{x})=1]$$
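To make the estimator concrete, here is a minimal sketch of Lemma 2 for one-dimensional data, with $\mathcal{H}$ taken to be threshold classifiers and their complements (the function names and the Gaussian test data are my own illustration, not anything from the papers):

```python
import numpy as np

def empirical_h_divergence(u, u_prime):
    """Empirical H-divergence of Lemma 2, with H the 1-D threshold
    classifiers h_t(x) = 1[x >= t] together with their complements."""
    m = len(u)
    assert len(u_prime) == m, "Lemma 2 assumes samples of equal size m"
    best = np.inf  # min over h of the bracketed term in Lemma 2
    for t in np.concatenate([u, u_prime]):    # candidate thresholds
        for sign in (+1.0, -1.0):             # h_t and its complement
            h_u = sign * (u - t) >= 0         # h(x) on the points of U
            h_up = sign * (u_prime - t) >= 0  # h(x) on the points of U'
            # (1/m)|{x in U : h(x)=0}| + (1/m)|{x in U' : h(x)=1}|
            best = min(best, np.mean(~h_u) + np.mean(h_up))
    return 2.0 * (1.0 - best)

def lemma1_slack(d, m, delta=0.05):
    """Confidence term of Lemma 1: 4*sqrt((d log(2m) + log(2/delta))/m)."""
    return 4.0 * np.sqrt((d * np.log(2 * m) + np.log(2.0 / delta)) / m)

rng = np.random.default_rng(0)
m = 1000
u = rng.normal(0.0, 1.0, size=m)    # sample U from D
u_p = rng.normal(1.0, 1.0, size=m)  # sample U' from D', mean shifted by 1
# For this class the population d_H is 2*(Phi(0.5) - Phi(-0.5)) ~ 0.77.
print("empirical d_H:", empirical_h_divergence(u, u_p))
print("Lemma 1 slack:", lemma1_slack(d=2, m=m))  # thresholds + complements have VC dim 2
```

For realistic hypothesis spaces the minimum over $\mathcal{H}$ cannot be enumerated, so in practice it is commonly approximated by training a classifier to distinguish the two samples and plugging its error into the formula: the harder the samples are to tell apart, the smaller the divergence.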

Definition 1 The symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set of hypotheses of the form

$$g\in\mathcal{H}\Delta\mathcal{H}\iff g(\mathbf{x})=h(\mathbf{x})\oplus h'(\mathbf{x})\ \text{ for some }h,h'\in\mathcal{H}$$

where $\oplus$ denotes XOR: $g(\mathbf{x})=1$ exactly when $h(\mathbf{x})\neq h'(\mathbf{x})$.

Intuitively, $g$ tells us whether two hypotheses disagree on an input. The benefit is that functions in this class encode the probability that two hypotheses disagree, which is precisely the error between them; finding the pair of hypotheses whose disagreement differs most between the two domains yields exactly the value of the $\mathcal{H}\Delta\mathcal{H}$-divergence:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|$$

The derivation is contained in Lemma 3:

Lemma 3 For any hypotheses $h,h'\in\mathcal{H}$,

$$\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\leq\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$

Proof

$$\begin{aligned}
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}_S}\left[h(\mathbf{x})\oplus h'(\mathbf{x})=1\right]-\Pr_{\mathbf{x}\sim\mathcal{D}_T}\left[h(\mathbf{x})\oplus h'(\mathbf{x})=1\right]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}_S}\left[h(\mathbf{x})\neq h'(\mathbf{x})\right]-\Pr_{\mathbf{x}\sim\mathcal{D}_T}\left[h(\mathbf{x})\neq h'(\mathbf{x})\right]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\geq 2\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|
\end{aligned}$$

∎

With the lemmas above we can prove an important theorem. It tells us that, provided the divergence between the domains and the term $\lambda$ (defined below) are small, finding an $h$ with small source-domain error also makes the target-domain error small.
Theorem 1 Let $d$ be the VC dimension of $\mathcal{H}$, and let $\mathcal{U}_S,\mathcal{U}_T$ be unlabeled samples of size $m'$ drawn from $\mathcal{D}_S,\mathcal{D}_T$ respectively. Then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\frac{2}{\delta}}{m'}}+\lambda$$

Proof: Let $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_S(h)+\epsilon_T(h))$ be the ideal joint hypothesis and $\lambda=\epsilon_S(h^*)+\epsilon_T(h^*)$ its combined error. The proof uses Lemma 1, Lemma 3, and the triangle inequality $\epsilon_T(h,f_T)\leq\epsilon_T(f_T,h^*)+\epsilon_T(h,h^*)$:

$$\begin{aligned}
\epsilon_T(h)&\leq\epsilon_T(h^*)+\epsilon_T(h,h^*)\\
&=\epsilon_T(h^*)+\epsilon_T(h,h^*)+\epsilon_S(h,h^*)-\epsilon_S(h,h^*)\\
&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\left|\epsilon_T(h,h^*)-\epsilon_S(h,h^*)\right|\\
&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&&\text{(Lemma 3)}\\
&\leq\epsilon_T(h^*)+\epsilon_S(h)+\epsilon_S(h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&&\text{(triangle inequality)}\\
&=\epsilon_S(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)+\lambda\\
&\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\frac{2}{\delta}}{m'}}+\lambda&&\text{(Lemma 1)}
\end{aligned}$$

Line 1 uses the triangle inequality; line 5 uses it again in the form $\epsilon_S(h,h^*)\leq\epsilon_S(h,f_S)+\epsilon_S(h^*,f_S)$; line 4 is Lemma 3. The last line uses VC theory (Lemma 1) to bound the deviation of $\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ from its sample estimate; the factor $2d$ appears because the VC dimension of $\mathcal{H}\Delta\mathcal{H}$ is at most twice that of $\mathcal{H}$.
∎

In essence, this bound uses the $\mathcal{H}\Delta\mathcal{H}$-divergence to tie together the error gap between the two domains:

$$|\epsilon_S-\epsilon_T|\lesssim\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$
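As a back-of-the-envelope check, the sketch below plugs numbers into the Theorem 1 bound (the helper name and all parameter values are assumptions chosen for illustration) to show how a large inter-domain divergence loosens it:

```python
import math

def theorem1_bound(eps_s, d_hat, m_prime, d, delta, lam):
    """Right-hand side of Theorem 1.
    eps_s: source error of h; d_hat: empirical HΔH-divergence;
    m_prime: unlabeled sample size per domain; d: VC dimension of H;
    lam: combined error of the ideal joint hypothesis h*."""
    slack = 4 * math.sqrt((2 * d * math.log(2 * m_prime) + math.log(2 / delta)) / m_prime)
    return eps_s + 0.5 * d_hat + slack + lam

# Same source error: well-aligned vs. badly misaligned domains.
print(theorem1_bound(eps_s=0.05, d_hat=0.10, m_prime=1_000_000, d=20, delta=0.05, lam=0.02))  # ~0.22
print(theorem1_bound(eps_s=0.05, d_hat=0.80, m_prime=1_000_000, d=20, delta=0.05, lam=0.02))  # ~0.57
```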

Analysis of Representations for Domain Adaptation

This paper generalizes the DA error analysis to distributions over a representation space. Assume a representation function $\mathcal{R}$ that maps the input space $\mathcal{X}$ to a feature space $\mathcal{Z}$. Once $\mathcal{R}$ is fixed, each domain induces a distribution and a labeling function on $\mathcal{Z}$: the probability of an event $B$ in the feature space is defined through its preimage $\mathcal{R}^{-1}(B)\subseteq\mathcal{X}$ under the original domain distribution:

$$\Pr_{\tilde{\mathcal{D}}}[B]\stackrel{\mathrm{def}}{=}\Pr_{\mathcal{D}}\left[\mathcal{R}^{-1}(B)\right],\qquad\tilde{f}(\mathbf{z})\stackrel{\mathrm{def}}{=}\mathrm{E}_{\mathcal{D}}\left[f(\mathbf{x})\mid\mathcal{R}(\mathbf{x})=\mathbf{z}\right]$$

Simply put, $B$ is an event in the feature space, and $\Pr_{\tilde{\mathcal{D}}}[B]$ is the induced probability measure on representations. Meanwhile $\tilde{f}(\mathbf{z})$ is the mean of $f(\mathbf{x})$ over all inputs represented by $\mathbf{z}$: each $f(\mathbf{x})$ is a label, and their conditional average serves as the label of the representation $\mathbf{z}$.
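A toy numeric sketch of these two definitions (the finite input space, distribution, and labels below are my own construction, chosen purely for illustration): a representation that merges inputs pairwise, and the induced measure and labeling it produces.

```python
import numpy as np

# Finite input space X = {0,1,2,3} with a domain distribution D and labels f.
p_x = np.array([0.1, 0.4, 0.3, 0.2])  # D(x)
f = np.array([0.0, 1.0, 1.0, 1.0])    # labeling function f(x) in [0,1]

# Representation R merges inputs pairwise: R(x) = x // 2, so Z = {0, 1}.
R = np.array([0, 0, 1, 1])

# Induced distribution: Pr_D~[{z}] = Pr_D[R^{-1}({z})].
p_z = np.array([p_x[R == z].sum() for z in (0, 1)])

# Induced labeling: f~(z) = E_D[f(x) | R(x) = z].
f_tilde = np.array([(p_x[R == z] * f[R == z]).sum() / p_z[z] for z in (0, 1)])

print("Pr_D~ :", p_z)      # [0.5, 0.5]
print("f~    :", f_tilde)  # [0.8, 1.0]
```

Note that $z=0$ merges inputs with different labels, so $\tilde{f}(0)=0.8$ is fractional: no classifier on $\mathcal{Z}$ can reach zero error there, which is precisely the kind of information a representation can destroy.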

In the DA problem, $\mathcal{D}_S$ denotes the source-domain distribution over $\mathcal{X}$, and $\tilde{\mathcal{D}}_S$ denotes the induced source distribution on the feature space, obtained through $\mathcal{R}$ exactly as in the definitions above.

The error likewise carries over to the representation setting: we simply sample $\mathbf{z}$ from $\tilde{\mathcal{D}}_S$. With $h$ an arbitrary classifier on $\mathcal{Z}$, its source-domain error is computed as:

$$\begin{aligned}
\epsilon_S(h)&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left[\mathrm{E}_{y\sim\tilde{f}_S(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left|\tilde{f}_S(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}$$

Similarly, the target-domain error is:

$$\begin{aligned}
\epsilon_T(h)&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left[\mathrm{E}_{y\sim\tilde{f}_T(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left|\tilde{f}_T(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}$$

That is, $\epsilon_S(h)=\epsilon_S(h,\tilde{f}_S)$ and $\epsilon_T(h)=\epsilon_T(h,\tilde{f}_T)$.

Next we extend Theorem 1 to the setting with a representation.

Theorem 2 Let $\mathcal{R}$ be a fixed representation function from $\mathcal{X}$ to $\mathcal{Z}$ and let $\mathcal{H}$ be a hypothesis space of VC dimension $d$. If a random labeled sample of size $m$ is generated by applying $\mathcal{R}$ to a $\mathcal{D}_S$-i.i.d. sample labeled according to $f$, then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda$$

where $e$ is the base of the natural logarithm.
Proof
Let $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_T(h)+\epsilon_S(h))$, and write $\lambda_T=\epsilon_T(h^*)$, $\lambda_S=\epsilon_S(h^*)$, and $\lambda=\lambda_T+\lambda_S$.

$$\begin{aligned}
\epsilon_T(h)&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\\
&=\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+\left|\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\right|\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&\leq\lambda_T+\lambda_S+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&=\lambda+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)
\end{aligned}$$

where $\mathcal{Z}_h=\{\mathbf{z}\in\mathcal{Z}:h(\mathbf{z})=1\}$ and $\Delta$ denotes the set symmetric difference, so $\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]$ can be read as $\epsilon_T(h,h^*)$.
The first inequality follows from the triangle inequality $\epsilon_T(h,\tilde{f}_T)\leq\epsilon_T(h^*,\tilde{f}_T)+\epsilon_T(h^*,h)$.
The fifth line follows from the triangle inequality $\epsilon_S(h^*,h)\leq\epsilon_S(h^*,\tilde{f}_S)+\epsilon_S(h,\tilde{f}_S)$.
Finally, Vapnik-Chervonenkis theory (V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998) gives

$$\epsilon_S(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}$$

Therefore,

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda$$

Similarly, replacing $d_{\mathcal{H}}(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T)$ by its empirical estimate from $m'$ samples of each induced distribution (Lemma 1 again), the bound can further be written as:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+\hat{d}_{\mathcal{H}}\left(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T\right)+4\sqrt{\frac{d\log(2m')+\log\frac{4}{\delta}}{m'}}+\lambda$$

∎
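To see how the two finite-sample terms of this last bound behave, here is a small sketch (the function name and all parameter values are assumptions for illustration). Collecting more unlabeled data (larger $m'$) only shrinks the divergence-estimation term; tightening the first term requires more labeled source data (larger $m$):

```python
import math

def theorem2_bound(eps_s_hat, m, d, d_hat, m_prime, delta, lam):
    """Right-hand side of the final bound of Theorem 2."""
    labeled = math.sqrt((4 / m) * (d * math.log(2 * math.e * m / d) + math.log(4 / delta)))
    unlabeled = 4 * math.sqrt((d * math.log(2 * m_prime) + math.log(4 / delta)) / m_prime)
    return eps_s_hat + labeled + d_hat + unlabeled + lam

base = dict(eps_s_hat=0.05, m=50_000, d=20, d_hat=0.10, delta=0.05, lam=0.02)
for m_prime in (1_000, 100_000, 10_000_000):
    # ~1.88 (vacuous), ~0.49, ~0.32 respectively
    print(m_prime, theorem2_bound(m_prime=m_prime, **base))
```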

References

A theory of learning from different domains

Analysis of Representations for Domain Adaptation

V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998
