A Theoretical Analysis of Domain Adaptation

This post is a write-up of two papers:

  • A theory of learning from different domains
  • Analysis of Representations for Domain Adaptation

Both papers prove that the error on the target domain is bounded in part by the error on the source domain, which makes them valuable theoretical guidance for domain adaptation.

A theory of learning from different domains

First, some basic setup. Let $\mathcal{D}_S, f_S$ denote the distribution on the source domain and the labeling function on that domain (here $f_S$ labels a binary task; following the papers, it may take values in $[0,1]$, with fractional values encoding label noise). Similarly, the target domain is described by $\mathcal{D}_T, f_T$.

A hypothesis is a classification function $h:\mathcal{X}\rightarrow\{0,1\}$. The error (disagreement) between a hypothesis $h$ and a labeling function $f$ is defined as:

$$\epsilon_S(h,f)=\mathrm{E}_{\mathbf{x}\sim\mathcal{D}_S}\left[\,|h(\mathbf{x})-f(\mathbf{x})|\,\right]$$

which is the error between $h$ and $f$ on the source domain. In particular, when $f=f_S$ is the true labeling function, we abbreviate $\epsilon_S(h)=\epsilon_S(h,f_S)$; likewise on the target domain, $\epsilon_T(h)=\epsilon_T(h,f_T)$. We now introduce the central quantity, the H-divergence.

H-divergence

A divergence is a weakened notion of distance: it need not satisfy all the axioms of a metric (for example, symmetry may fail). The $\mathcal{H}$-divergence between two distributions $\mathcal{D}$ and $\mathcal{D}'$ is defined relative to a hypothesis space $\mathcal{H}$:

$$d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')=2\sup_{h\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}}[h(\mathbf{x})=1]-\Pr_{\mathbf{x}\sim\mathcal{D}'}[h(\mathbf{x})=1]\right|$$

Intuitively, the divergence searches the hypothesis space $\mathcal{H}$ for an $h$ that makes $\Pr_{\mathbf{x}\sim\mathcal{D}}[h(\mathbf{x})=1]$ as large as possible while making $\Pr_{\mathbf{x}\sim\mathcal{D}'}[h(\mathbf{x})=1]$ as small as possible; the largest gap attainable this way is taken as the distance between $\mathcal{D}$ and $\mathcal{D}'$. Equivalently, this $h$ is the hypothesis in $\mathcal{H}$ that best discriminates between the two distributions.
Moreover, the divergence can be estimated from finite samples:

Lemma 1 Let $\mathcal{H}$ be a hypothesis space on $\mathcal{X}$ with VC dimension $d$. If $\mathcal{U}$ and $\mathcal{U}'$ are samples of size $m$ from $\mathcal{D}$ and $\mathcal{D}'$ respectively, and $\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')$ is the empirical $\mathcal{H}$-divergence between the samples, then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')\leq\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')+4\sqrt{\frac{d\log(2m)+\log\frac{2}{\delta}}{m}}$$

This is just a standard VC bound, where $d$ is the VC dimension of $\mathcal{H}$ and $m$ is the sample size. Clearly, when $d$ is finite, the estimate converges as the sample size goes to infinity. Next comes a way to compute the empirical divergence:
Lemma 2 The empirical divergence can be computed from the samples as

$$\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')=2\left(1-\min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I[\mathbf{x}\in\mathcal{U}']\right]\right)$$

where $I[\mathbf{x}\in\mathcal{U}]$ is the indicator that equals 1 when $\mathbf{x}\in\mathcal{U}$, so each sum simply counts the relevant points.
In fact one can check directly that the bracketed expression estimates exactly the probability gap in the H-divergence, with probabilities taken over the samples:

$$1-\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I[\mathbf{x}\in\mathcal{U}']\right]=\Pr_{\mathbf{x}\sim\mathcal{U}}[h(\mathbf{x})=1]-\Pr_{\mathbf{x}\sim\mathcal{U}'}[h(\mathbf{x})=1]$$
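To make the estimator concrete, here is a minimal sketch of Lemma 2 for one-dimensional data, with $\mathcal{H}$ taken to be threshold classifiers and their complements (the function names and the Gaussian test data are my own illustration, not anything from the papers):

```python
import numpy as np

def empirical_h_divergence(u, u_prime):
    """Empirical H-divergence of Lemma 2, with H the 1-D threshold
    classifiers h_t(x) = 1[x >= t] together with their complements."""
    m = len(u)
    assert len(u_prime) == m, "Lemma 2 assumes samples of equal size m"
    best = np.inf  # min over h of the bracketed term in Lemma 2
    for t in np.concatenate([u, u_prime]):    # candidate thresholds
        for sign in (+1.0, -1.0):             # h_t and its complement
            h_u = sign * (u - t) >= 0         # h(x) on the points of U
            h_up = sign * (u_prime - t) >= 0  # h(x) on the points of U'
            # (1/m)|{x in U : h(x)=0}| + (1/m)|{x in U' : h(x)=1}|
            best = min(best, np.mean(~h_u) + np.mean(h_up))
    return 2.0 * (1.0 - best)

def lemma1_slack(d, m, delta=0.05):
    """Confidence term of Lemma 1: 4*sqrt((d log(2m) + log(2/delta))/m)."""
    return 4.0 * np.sqrt((d * np.log(2 * m) + np.log(2.0 / delta)) / m)

rng = np.random.default_rng(0)
m = 1000
u = rng.normal(0.0, 1.0, size=m)    # sample U from D
u_p = rng.normal(1.0, 1.0, size=m)  # sample U' from D', mean shifted by 1
# For this class the population d_H is 2*(Phi(0.5) - Phi(-0.5)) ~ 0.77.
print("empirical d_H:", empirical_h_divergence(u, u_p))
print("Lemma 1 slack:", lemma1_slack(d=2, m=m))  # thresholds + complements have VC dim 2
```

For realistic hypothesis spaces the minimum over $\mathcal{H}$ cannot be enumerated, so in practice it is commonly approximated by training a classifier to distinguish the two samples and plugging its error into the formula: the harder the samples are to tell apart, the smaller the divergence.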

Definition 1 The symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set of hypotheses of the form

$$g\in\mathcal{H}\Delta\mathcal{H}\iff g(\mathbf{x})=h(\mathbf{x})\oplus h'(\mathbf{x})\ \text{ for some }h,h'\in\mathcal{H}$$

where $\oplus$ denotes XOR: $g(\mathbf{x})=1$ exactly when $h(\mathbf{x})\neq h'(\mathbf{x})$.

Intuitively, $g$ tells us whether two hypotheses disagree on an input. The benefit is that functions in this class encode the probability that two hypotheses disagree, which is precisely the error between them; finding the pair of hypotheses whose disagreement differs most between the two domains yields exactly the value of the $\mathcal{H}\Delta\mathcal{H}$-divergence:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|$$

The derivation is contained in Lemma 3:

Lemma 3 For any hypotheses $h,h'\in\mathcal{H}$,

$$\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\leq\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$

Proof

$$\begin{aligned}
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}_S}\left[h(\mathbf{x})\oplus h'(\mathbf{x})=1\right]-\Pr_{\mathbf{x}\sim\mathcal{D}_T}\left[h(\mathbf{x})\oplus h'(\mathbf{x})=1\right]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{\mathbf{x}\sim\mathcal{D}_S}\left[h(\mathbf{x})\neq h'(\mathbf{x})\right]-\Pr_{\mathbf{x}\sim\mathcal{D}_T}\left[h(\mathbf{x})\neq h'(\mathbf{x})\right]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\geq 2\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|
\end{aligned}$$

∎

With the lemmas above we can prove an important theorem. It tells us that, provided the divergence between the domains and the term $\lambda$ (defined below) are small, finding an $h$ with small source-domain error also makes the target-domain error small.
Theorem 1 Let $d$ be the VC dimension of $\mathcal{H}$, and let $\mathcal{U}_S,\mathcal{U}_T$ be unlabeled samples of size $m'$ drawn from $\mathcal{D}_S,\mathcal{D}_T$ respectively. Then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\frac{2}{\delta}}{m'}}+\lambda$$

Proof: Let $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_S(h)+\epsilon_T(h))$ be the ideal joint hypothesis and $\lambda=\epsilon_S(h^*)+\epsilon_T(h^*)$ its combined error. The proof uses Lemma 1, Lemma 3, and the triangle inequality $\epsilon_T(h,f_T)\leq\epsilon_T(f_T,h^*)+\epsilon_T(h,h^*)$:

$$\begin{aligned}
\epsilon_T(h)&\leq\epsilon_T(h^*)+\epsilon_T(h,h^*)\\
&=\epsilon_T(h^*)+\epsilon_T(h,h^*)+\epsilon_S(h,h^*)-\epsilon_S(h,h^*)\\
&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\left|\epsilon_T(h,h^*)-\epsilon_S(h,h^*)\right|\\
&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&&\text{(Lemma 3)}\\
&\leq\epsilon_T(h^*)+\epsilon_S(h)+\epsilon_S(h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)&&\text{(triangle inequality)}\\
&=\epsilon_S(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)+\lambda\\
&\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\frac{2}{\delta}}{m'}}+\lambda&&\text{(Lemma 1)}
\end{aligned}$$

Line 1 uses the triangle inequality; line 5 uses it again in the form $\epsilon_S(h,h^*)\leq\epsilon_S(h,f_S)+\epsilon_S(h^*,f_S)$; line 4 is Lemma 3. The last line uses VC theory (Lemma 1) to bound the deviation of $\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ from its sample estimate; the factor $2d$ appears because the VC dimension of $\mathcal{H}\Delta\mathcal{H}$ is at most twice that of $\mathcal{H}$.
∎

In essence, this bound uses the $\mathcal{H}\Delta\mathcal{H}$-divergence to tie together the error gap between the two domains:

$$|\epsilon_S-\epsilon_T|\lesssim\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$
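As a back-of-the-envelope check, the sketch below plugs numbers into the Theorem 1 bound (the helper name and all parameter values are assumptions chosen for illustration) to show how a large inter-domain divergence loosens it:

```python
import math

def theorem1_bound(eps_s, d_hat, m_prime, d, delta, lam):
    """Right-hand side of Theorem 1.
    eps_s: source error of h; d_hat: empirical HΔH-divergence;
    m_prime: unlabeled sample size per domain; d: VC dimension of H;
    lam: combined error of the ideal joint hypothesis h*."""
    slack = 4 * math.sqrt((2 * d * math.log(2 * m_prime) + math.log(2 / delta)) / m_prime)
    return eps_s + 0.5 * d_hat + slack + lam

# Same source error: well-aligned vs. badly misaligned domains.
print(theorem1_bound(eps_s=0.05, d_hat=0.10, m_prime=1_000_000, d=20, delta=0.05, lam=0.02))  # ~0.22
print(theorem1_bound(eps_s=0.05, d_hat=0.80, m_prime=1_000_000, d=20, delta=0.05, lam=0.02))  # ~0.57
```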

Analysis of Representations for Domain Adaptation

This paper generalizes the DA error analysis to distributions over a representation space. Assume a representation function $\mathcal{R}$ that maps the input space $\mathcal{X}$ to a feature space $\mathcal{Z}$. Once $\mathcal{R}$ is fixed, each domain induces a distribution and a labeling function on $\mathcal{Z}$: the probability of an event $B$ in the feature space is defined through its preimage $\mathcal{R}^{-1}(B)\subseteq\mathcal{X}$ under the original domain distribution:

$$\Pr_{\tilde{\mathcal{D}}}[B]\stackrel{\mathrm{def}}{=}\Pr_{\mathcal{D}}\left[\mathcal{R}^{-1}(B)\right],\qquad\tilde{f}(\mathbf{z})\stackrel{\mathrm{def}}{=}\mathrm{E}_{\mathcal{D}}\left[f(\mathbf{x})\mid\mathcal{R}(\mathbf{x})=\mathbf{z}\right]$$

Simply put, $B$ is an event in the feature space, and $\Pr_{\tilde{\mathcal{D}}}[B]$ is the induced probability measure on representations. Meanwhile $\tilde{f}(\mathbf{z})$ is the mean of $f(\mathbf{x})$ over all inputs represented by $\mathbf{z}$: each $f(\mathbf{x})$ is a label, and their conditional average serves as the label of the representation $\mathbf{z}$.
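A toy numeric sketch of these two definitions (the finite input space, distribution, and labels below are my own construction, chosen purely for illustration): a representation that merges inputs pairwise, and the induced measure and labeling it produces.

```python
import numpy as np

# Finite input space X = {0,1,2,3} with a domain distribution D and labels f.
p_x = np.array([0.1, 0.4, 0.3, 0.2])  # D(x)
f = np.array([0.0, 1.0, 1.0, 1.0])    # labeling function f(x) in [0,1]

# Representation R merges inputs pairwise: R(x) = x // 2, so Z = {0, 1}.
R = np.array([0, 0, 1, 1])

# Induced distribution: Pr_D~[{z}] = Pr_D[R^{-1}({z})].
p_z = np.array([p_x[R == z].sum() for z in (0, 1)])

# Induced labeling: f~(z) = E_D[f(x) | R(x) = z].
f_tilde = np.array([(p_x[R == z] * f[R == z]).sum() / p_z[z] for z in (0, 1)])

print("Pr_D~ :", p_z)      # [0.5, 0.5]
print("f~    :", f_tilde)  # [0.8, 1.0]
```

Note that $z=0$ merges inputs with different labels, so $\tilde{f}(0)=0.8$ is fractional: no classifier on $\mathcal{Z}$ can reach zero error there, which is precisely the kind of information a representation can destroy.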

In the DA problem, $\mathcal{D}_S$ denotes the source-domain distribution over $\mathcal{X}$, and $\tilde{\mathcal{D}}_S$ denotes the induced source distribution on the feature space, obtained through $\mathcal{R}$ exactly as in the definitions above.

The error likewise carries over to the representation setting: we simply sample $\mathbf{z}$ from $\tilde{\mathcal{D}}_S$. With $h$ an arbitrary classifier on $\mathcal{Z}$, its source-domain error is computed as:

$$\begin{aligned}
\epsilon_S(h)&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left[\mathrm{E}_{y\sim\tilde{f}_S(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left|\tilde{f}_S(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}$$

Similarly, the target-domain error is:

$$\begin{aligned}
\epsilon_T(h)&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left[\mathrm{E}_{y\sim\tilde{f}_T(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left|\tilde{f}_T(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}$$

That is, $\epsilon_S(h)=\epsilon_S(h,\tilde{f}_S)$ and $\epsilon_T(h)=\epsilon_T(h,\tilde{f}_T)$.

Next we extend Theorem 1 to the setting with a representation.

Theorem 2 Let $\mathcal{R}$ be a fixed representation function from $\mathcal{X}$ to $\mathcal{Z}$ and let $\mathcal{H}$ be a hypothesis space of VC dimension $d$. If a random labeled sample of size $m$ is generated by applying $\mathcal{R}$ to a $\mathcal{D}_S$-i.i.d. sample labeled according to $f$, then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda$$

where $e$ is the base of the natural logarithm.
Proof
Let $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_T(h)+\epsilon_S(h))$, and write $\lambda_T=\epsilon_T(h^*)$, $\lambda_S=\epsilon_S(h^*)$, and $\lambda=\lambda_T+\lambda_S$.

$$\begin{aligned}
\epsilon_T(h)&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\\
&=\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+\left|\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]\right|\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&\leq\lambda_T+\lambda_S+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&=\lambda+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)
\end{aligned}$$

where $\mathcal{Z}_h=\{\mathbf{z}\in\mathcal{Z}:h(\mathbf{z})=1\}$ and $\Delta$ denotes the set symmetric difference, so $\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\,\Delta\,\mathcal{Z}_{h^*}]$ can be read as $\epsilon_T(h,h^*)$.
The first inequality follows from the triangle inequality $\epsilon_T(h,\tilde{f}_T)\leq\epsilon_T(h^*,\tilde{f}_T)+\epsilon_T(h^*,h)$.
The fifth line follows from the triangle inequality $\epsilon_S(h^*,h)\leq\epsilon_S(h^*,\tilde{f}_S)+\epsilon_S(h,\tilde{f}_S)$.
Finally, Vapnik-Chervonenkis theory (V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998) gives

$$\epsilon_S(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}$$

Therefore,

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda$$

Similarly, replacing $d_{\mathcal{H}}(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T)$ by its empirical estimate from $m'$ samples of each induced distribution (Lemma 1 again), the bound can further be written as:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+\hat{d}_{\mathcal{H}}\left(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T\right)+4\sqrt{\frac{d\log(2m')+\log\frac{4}{\delta}}{m'}}+\lambda$$

∎
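To see how the two finite-sample terms of this last bound behave, here is a small sketch (the function name and all parameter values are assumptions for illustration). Collecting more unlabeled data (larger $m'$) only shrinks the divergence-estimation term; tightening the first term requires more labeled source data (larger $m$):

```python
import math

def theorem2_bound(eps_s_hat, m, d, d_hat, m_prime, delta, lam):
    """Right-hand side of the final bound of Theorem 2."""
    labeled = math.sqrt((4 / m) * (d * math.log(2 * math.e * m / d) + math.log(4 / delta)))
    unlabeled = 4 * math.sqrt((d * math.log(2 * m_prime) + math.log(4 / delta)) / m_prime)
    return eps_s_hat + labeled + d_hat + unlabeled + lam

base = dict(eps_s_hat=0.05, m=50_000, d=20, d_hat=0.10, delta=0.05, lam=0.02)
for m_prime in (1_000, 100_000, 10_000_000):
    # ~1.88 (vacuous), ~0.49, ~0.32 respectively
    print(m_prime, theorem2_bound(m_prime=m_prime, **base))
```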

References

A theory of learning from different domains

Analysis of Representations for Domain Adaptation

V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998
