[Paper Reading] Deep Cocktail Network: Multi-source Unsupervised Domain Adaptation with Category Shift


SUMMARY@ 2020/5/12


1. Method abstract

Inspired by the distribution weighted combining rule in [33], the target distribution can be represented as the weighted combination of the multi-source distributions.

An ideal target predictor can be obtained by integrating all source predictions based on the corresponding source distribution weights.

  • Besides the feature extractor,
  • DCTN also includes a (multi-source) category classifier to predict the class from the perspective of each source,
  • and a (multi-source) domain discriminator to produce multiple source-target-specific perplexity scores as approximations of the source distribution weights.

During training, DCTN alternates between two adaptation steps:

  • domain discriminator: the multi-way adversarial adaptation implicitly reduces domain shifts among those sources.

    • deploys multi-way adversarial learning to minimize the discrepancy between the target and each of the multiple source domains,
    • also predicts the source-specific perplexity scores, i.e., the possibilities that a target sample belongs to the different source domains.
  • feature extractor and the category classifier

    • the multi-source category classifiers are integrated with the perplexity scores to classify target samples, and the pseudo-labeled target samples together with the source samples are used to update the multi-source category classifier and the feature extractor

2. Motivation

This paper focuses on the problem of multi-source domain adaptation, where there is category shift between diverse sources.

Category shift is a new protocol in MDA, where domain shift and categorical disalignment co-exist among the sources.

This paper addresses domain shift and category shift together.

3. Challenges /Problem to be solved

  • we cannot simply apply single-source UDA by combining all source domains, since there may be domain shifts among the sources
  • eliminating the distribution discrepancy between the target and each source may be too strict, and even harmful
  • category shift in sources

4. Contribution

    1. We present a novel and realistic MDA protocol termed category shift that relaxes the requirement of a shared category set among the source domains.
    2. Inspired by the distribution weighted combining rule, we propose the deep cocktail network (DCTN) together with an alternating adaptation algorithm to learn transferable and discriminative representations.
    3. We conduct comprehensive experiments on three well-known benchmarks, evaluating our model in both the vanilla and the category shift settings. Our method achieves the state of the art across most transfer tasks.

5. Related work

5.1 Unsupervised domain adaptation with single source

  • domain discrepancy based methods: reduce the domain shift across the source and the target
  • deep-model-based
  • adversarial learning based
  • others: semi-supervised methods [42], domain reconstruction [14], duality [19], alignments [9], [50], [44], manifold learning [15], tensor methods [24], [31], etc.

5.2 Domain adaptation with multiple sources

  • originates from A-SVM [49]
  • shallow models [8], [22], [27]
  • theoretical
    • learning bounds for multi-source DA [3]
    • distribution weighted combining rule [33]

5.3 Two branches of (supervised) transfer learning closely related to MDA

  • continual transfer learning (CTL) [43], [39]
    • CTLs train the learner to sequentially master multiple tasks across multiple domains.
  • domain generalization (DG)
    • uses the existing multiple labeled domains for training, regardless of the unlabeled target samples [13, 35]

6. Settings

  • Suppose the classifier for each source domain is known

  • Vanilla MDA: samples from diverse sources share the same category set

  • Category Shift: the categories from different sources may differ

  • $N$ different underlying source distributions $\{p_{s_j}(x,y)\}_{j=1}^{N}$

    • $X_{s_{j}}=\left\{x_{i}^{s_{j}}\right\}_{i=1}^{\left|X_{s_{j}}\right|}$
    • $Y_{s_{j}}=\left\{y_{i}^{s_{j}}\right\}_{i=1}^{\left|Y_{s_{j}}\right|}$
  • one target distribution $p_t(x,y)$, without labels

    • $X_{t}=\left\{x_{i}^{t}\right\}_{i=1}^{\left|X_{t}\right|}$
  • training set ensemble: $N+1$ datasets

  • testing set: drawn from the target distribution

  • the target domain is labeled by the union of all categories in those sources (a toy illustration follows):

    $\mathcal{C}_{t}=\bigcup\limits_{j=1}^{N} \mathcal{C}_{s_{j}}$
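As a toy illustration of the category-shift protocol (made-up class names, plain Python):

```python
# Category shift: each source carries its own label set; the target is
# labeled by their union, C_t = union_j C_sj.
sources = {
    "s1": {"bike", "car", "dog"},
    "s2": {"car", "horse"},
    "s3": {"dog", "horse", "monitor"},
}
C_t = set().union(*sources.values())
print(sorted(C_t))  # ['bike', 'car', 'dog', 'horse', 'monitor']
```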

7. Compared with Open Set DA

  • The uncommon classes are unified into a negative category called “unknown”.

  • In contrast, category shift considers the specific disaligned categories among multiple sources to enrich the classification in transfer.

8. DCTN: framework details

[Figure: DCTN framework overview (original image unavailable)]

8.1 Feature extractor $F$

  • deep convolutional nets as the backbone
  • shared weights: maps all images from the $N$ sources and the target into a common feature space
  • employs adversarial learning to obtain the optimal mapping
    • because it can successfully learn both domain-invariant features and each target-source-specific relation.

8.2 (Multi-source) domain discriminator $D$

  • $N$ source-specific discriminators: $\left\{D_{s_j}\right\}_{j=1}^{N}$

  • Given an image $x$ from source $j$ or the target domain, the domain discriminator $D$ receives the features $F(x)$ and classifies whether $x$ comes from source $j$ or the target

  • for the data flow from each target instance $x^t$, the domain discriminator $D$ yields the $N$ source-specific discriminative results
    $\left\{D_{s_j}(F(x^t))\right\}_{j=1}^{N}$

  • target-source perplexity scores
    $\mathcal{S}_{c f}\left(x^{t} ; F, D_{s_{j}}\right)=-\log \left(1-D_{s_{j}}\left(F\left(x^{t}\right)\right)\right)+\alpha_{s_{j}}$
    where $\alpha_{s_{j}}$ is the source-specific concentration constant, obtained by averaging the source-$j$ discriminator losses over $X_{s_j}$.

    In the supplementary material, a different score is used, with a different $\alpha$:

    $\alpha_{s_{j}}=\frac{1}{N_{T}} \sum_{i}^{N_{T}}\left(D_{s_{j}}\left(1-F\left(x_{i}^{s_{j}}\right)\right)\right)^{2}$

    $N_T$ denotes how many times the target samples have been visited to train the model;

    $x_{i}^{s_{j}}$ denotes the source-$j$ instance that comes along with the coupled target instances in the adversarial learning. A minimal sketch of the score computation follows below.
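The score is easy to compute from the discriminator outputs. Below is a minimal PyTorch sketch, assuming each discriminator returns the probability that a feature comes from its source; the names (`perplexity_scores`, `discriminators`, `alphas`) are illustrative, not from the paper's released code.

```python
import torch

def perplexity_scores(feat_t, discriminators, alphas):
    """Compute S_cf(x^t; F, D_sj) = -log(1 - D_sj(F(x^t))) + alpha_sj
    for one batch of target features, for every source j.

    feat_t:         (B, d) target features F(x^t)
    discriminators: list of N modules; D_sj(feat) in (0, 1), shape (B, 1)
    alphas:         list of N source-specific concentration constants
    """
    scores = []
    for d_sj, alpha_sj in zip(discriminators, alphas):
        p_src = d_sj(feat_t).squeeze(-1).clamp(max=1 - 1e-6)  # avoid log(0)
        scores.append(-torch.log1p(-p_src) + alpha_sj)        # -log(1-p) + alpha
    return torch.stack(scores, dim=1)                          # (B, N)
```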

8.3 (Multi-source) category classifier $C$

  • a multi-output net composed of $N$ source-specific predictors $\left\{C_{s_j}\right\}_{j=1}^{N}$

  • Each predictor is a softmax classifier

  • for an image from source $j$: only the output of $C_{s_j}$ is activated and provides the gradient for training

  • For a target image $x^t$ instead, all source-specific predictors provide $N$ categorization results $\{C_{s_j}(F(x^t))\}_{j=1}^{N}$ to the target classification operator.

8.4 Target classification operator

  • for each target feature $F(x^t)$, the target classification operator takes each source perplexity score $\mathcal{S}_{c f}\left(x^{t} ; F, D_{s_{j}}\right)$ to re-weight the corresponding source-specific prediction $C_{s_j}(F(x^t))$

    the confidence that $x^t$ belongs to class $c$ is

$$Confidence\left(c | x^{t}\right):=\sum_{c \in \mathcal{C}_{s_{j}}} \frac{\mathcal{S}_{c f}\left(x^{t} ; F, D_{s_{j}}\right)}{\sum\limits_{c \in \mathcal{C}_{s_{k}}} \mathcal{S}_{c f}\left(x^{t} ; F, D_{s_{k}}\right)} C_{s_{j}}\left(c | F\left(x^{t}\right)\right), \quad c\in\bigcup_{j=1}^{N} \mathcal{C}_{s_{j}}\tag{2}$$

  • $C_{s_{j}}\left(c | F\left(x^{t}\right)\right)$ denotes the softmax value of source $j$ for class $c$
  • $\sum\limits_{c\in \mathcal C_{s_j}}$ means only those sources containing class $c$ join the perplexity-score weighting
  • $\sum\limits_{c\in \mathcal C_{s_k}}$ runs over all the sources; see the sketch after this list
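A hedged sketch of the operator in Eq. 2. It assumes each source head's softmax has been zero-padded onto the union label set, and follows the note above in normalizing the scores over all sources; `class_mask` is an illustrative construct marking which classes each source contains.

```python
import torch

def target_confidence(softmax_out, scores, class_mask):
    """Eq. (2): confidence over the union label set for a target batch.

    softmax_out: (B, N, C) per-source softmax over the C union classes,
                 zero at classes a source does not contain
    scores:      (B, N) perplexity scores S_cf(x^t; F, D_sj)
    class_mask:  (N, C) bool; class_mask[j, c] = True iff c is in C_sj
    """
    w = scores / scores.sum(dim=1, keepdim=True)   # normalize over all sources
    w = w.unsqueeze(-1) * class_mask.float()       # only sources holding class c
    return (w * softmax_out).sum(dim=1)            # (B, C)
```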

8.5 Connection to distribution weighted combining rule

  • In the distribution weighted combining rule [33], the target distribution is treated as a mixture of the multi-source distributions, with coefficients given by the source distributions weighted by unknown positives $\{\lambda_j\}_{j=1}^N$, namely $\mathcal{D}_{t}(x)=\sum_{k=1}^{N} \lambda_{k} \mathcal{D}_{s_{k}}(x)$

$$h_{\lambda}(x)=\sum_{i=1}^{k} \frac{\lambda_{i} D_{i}(x)}{\sum_{j=1}^{k} \lambda_{j} D_{j}(x)} h_{i}(x)$$

note that the hypothesis has a one-dimensional output $h_i(x)\in \mathbb R$

  • in this paper

The ideal target classifier presents as the weighted combination of source classifiers.

Note that here each per-source classifier $C_{s_j}$ produces a multi-output softmax result.

$$C_{t}\left(c | x^{t}\right)=\sum_{c \in \mathcal{C}_{s_j}} \frac{\lambda_{j} \mathcal{D}_{s_{j}}\left(x^{t}\right)}{\sum_{c \in \mathcal{C}_{s_{k}}} \lambda_{k} \mathcal{D}_{s_{k}}\left(x^{t}\right)} C_{s_{j}}\left(c | F\left(x^{t}\right)\right)$$

As the probability that $x^t$ comes from source $j$ increases, $D_{s_{j}}\left(F\left(x^{t}\right)\right)\rightarrow 1$ and $\mathcal D_{s_{j}}\left(x^{t}\right)\rightarrow 1$,

so $\lambda_{j} \mathcal{D}_{s_{j}}\left(x^{t}\right) \propto \mathcal{S}_{c f}\left(x^{t} ; F, D_{s_{j}}\right)=-\log \left(1-D_{s_{j}}\left(F\left(x^{t}\right)\right)\right)+\alpha_{s_{j}}$

  • so the perplexity score substitutes for the distribution weighting
  • target images should be categorized by the classifiers from multiple sources: the more similar a source's features are to the target, the more trustworthy that source classifier's predictions

9. Learning

9.1 Pre-training C and F

  • take all source images to jointly train the feature extractor $F$ and the category classifier $C$

  • pseudo labels for the target: those networks and the target classification operator then predict categories for all target images and annotate those with high confidence.

  • Since the domain discriminator hasn't been trained yet, we take the uniform simplex weight as the perplexity scores for the target classification operator.

  • Finally, we obtain the pre-trained feature extractor and category classifier by further fine-tuning them with the sources and the pseudo-labeled target images.

    In object recognition, we initialize DCTN in the same way as DAN (start with an AlexNet model pretrained on ImageNet 2012 and fine-tune it).

    In digit recognition, DCTN is trained from scratch.

9.2 Multi-way Adversarial Adaptation

Reference: the ADDA paper, Adversarial Discriminative Domain Adaptation.

  • original GAN ($M$ denotes the mapping / feature extractor):

$$\mathcal{L}_{\mathrm{adv}_{D}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, M_{s}, M_{t}\right)= -\mathbb{E}_{\mathbf{x}_{s} \sim \mathbf{X}_{s}}\left[\log D\left(M_{s}\left(\mathbf{x}_{s}\right)\right)\right] -\mathbb{E}_{\mathbf{x}_{t} \sim \mathbf{X}_{t}}\left[\log \left(1-D\left(M_{t}\left(\mathbf{x}_{t}\right)\right)\right)\right]$$

$$\mathcal{L}_{\mathrm{adv}_{M}}=-\mathcal{L}_{\mathrm{adv}_{D}}$$

$$\min _{D} \mathcal{L}_{\mathrm{adv}_{D}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, M_{s}, M_{t}\right), \qquad \min _{M_{s}, M_{t}} \mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)$$

  • change method 1: early in training the discriminator converges quickly, causing the generator gradient to vanish; this change alters the generator objective, splitting the optimization into two independent objectives, one for the generator and one for the discriminator:

$$\mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)=-\mathbb{E}_{\mathbf{x}_{t} \sim \mathbf{X}_{t}}\left[\log D\left(M_{t}\left(\mathbf{x}_{t}\right)\right)\right] \tag{**}$$

  • change method 2: in the setting where both distributions are changing, this objective leads to oscillation: when the mapping converges to its optimum, the discriminator can simply flip the sign of its prediction in response.

    Tzeng et al. instead proposed the domain confusion objective, under which the mapping is trained using a cross-entropy loss function against a uniform distribution.

    This loss ensures that the adversarial discriminator views the two domains identically.

    To "confuse" the discriminator is to leave it half-believing: the marginal distributions of the mapped source and target should be as close as possible, so that each sample is about equally likely to have come from source or target (equivalently, the true domain labels 1 and 0 are each assigned probability one half). Minimizing this cross-entropy against the uniform distribution pushes the mapped source and target distributions together; once the two mapped domains are that similar, the DA task is accomplished.

$$\mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)= -\sum_{d \in\{s, t\}} \mathbb{E}_{\mathbf{x}_{d} \sim \mathbf{X}_{d}}\left[\frac{1}{2} \log D\left(M_{d}\left(\mathbf{x}_{d}\right)\right) +\frac{1}{2} \log \left(1-D\left(M_{d}\left(\mathbf{x}_{d}\right)\right)\right)\right] \tag{*}$$

    Note: equation (*) is in fact not used for ADDA's final result; it appears there only as related work, and ADDA keeps (**).

    Equation (*) was proposed in Simultaneous Deep Transfer Across Domains and Tasks.

    The ADDA paper changes the generator's optimization objective to (**). The sketch below contrasts the three objectives.
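To make the three generator objectives concrete, here is a minimal PyTorch sketch of them side by side (my own paraphrase of this review; `d_out_*` denote discriminator output probabilities):

```python
import torch

EPS = 1e-6  # keep log() away from 0

def inverted_loss(d_out_t):
    """Original GAN: L_advM = -L_advD, i.e. minimize E[log(1 - D(M_t(x_t)))];
    its gradient vanishes once D becomes confident."""
    return torch.log1p(-d_out_t.clamp(max=1 - EPS)).mean()

def nonsaturating_loss(d_out_t):
    """(**): minimize -E[log D(M_t(x_t))]; the objective ADDA actually uses."""
    return -torch.log(d_out_t.clamp(min=EPS)).mean()

def confusion_loss(d_out_s, d_out_t):
    """(*): cross-entropy of both domains against the uniform 1/2 label;
    minimized when D outputs 1/2 everywhere."""
    loss = 0.0
    for d in (d_out_s, d_out_t):
        d = d.clamp(EPS, 1 - EPS)
        loss = loss - 0.5 * (torch.log(d) + torch.log1p(-d)).mean()
    return loss
```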

In this paper:

  • minmax adversarial domain adaptation

$$\min _{F} \max _{D} V(F, D ; \bar{C})=\mathcal{L}_{a d v}(F, D)+\mathcal{L}_{c l s}(F, \bar{C})\tag{4}$$

    • the classifier $C$ is fixed as $\bar C$ to provide stable gradient values
    • the first term denotes the adversarial mechanism
    • the second term is the multi-source classification loss

The optimization based on Eq. 4 works well for $D$ but not for $F$.

Since the feature extractor learns the mapping from the multiple sources and the target, the domain distributions keep changing simultaneously during the adversary, which results in an oscillation that then spoils the feature extractor.

When the source and target feature mappings share their architecture, the domain confusion loss can replace the adversarial objective, and it is stable for learning the mapping $F$.

  • multidomain confusion loss

$$\mathcal{L}_{a d v}(F, D)=\frac{1}{N} \sum_{j}^{N} \mathbb{E}_{x \sim X_{s_{j}}} \mathcal{L}_{c f}\left(x ; F, D_{s_{j}}\right) +\mathbb{E}_{x \sim X_{t}} \mathcal{L}_{c f}\left(x ; F, D_{s_{j}}\right) \tag{6}$$

    where

$$\mathcal{L}_{c f}\left(x ; F, D_{s_{j}}\right)= \frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right) \tag{7}$$

    i.e.

$$\mathcal{L}_{a d v}(F, D)=\frac{1}{N} \sum_{j}^{N} \mathbb{E}_{x \sim X_{s_{j}}} \Big[\frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right)\Big] +\frac{1}{N} \sum_{j}^{N}\mathbb{E}_{x \sim X_{t}} \Big[\frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right)\Big]$$

    Differences from (*):

    • there is no minus sign

    • this is multi-source, so there are $N$ discriminators, each handling the domain discrimination between one source and the target

    • in (*) the source and target mappings differ, whereas here the feature extractor is shared

    • this paper directly turns (*) into a loss function shared by the discriminator and the generator (up to a sign flip, since Eq. 7 is negative), expressed between the target and each source; a sketch follows below

      cross-entropy measures the discrepancy between two distributions, and note that a cross-entropy loss is always positive

      • maximizing Eq. 6 is equivalent to minimizing the cross-entropy loss, i.e., optimally training the discriminator
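A minimal sketch of Eqs. 6 and 7 under the shared feature extractor, assuming one feature batch per source (names are illustrative; the sign handling follows the min-max of Eq. 4):

```python
import torch

def l_cf(d_out, eps=1e-6):
    """Eq. (7): 1/2 log D(F(x)) + 1/2 log(1 - D(F(x)));
    always negative, peaking when D outputs 1/2."""
    d = d_out.clamp(eps, 1 - eps)
    return 0.5 * torch.log(d) + 0.5 * torch.log1p(-d)

def multidomain_confusion(feats_src, feat_t, discriminators):
    """Eq. (6): average over the N sources; each D_sj scores both its own
    source batch and the shared target batch.

    feats_src:      list of N tensors, one feature batch per source
    feat_t:         feature batch of the target
    discriminators: list of N source-specific discriminators D_sj
    """
    terms = []
    for feat_sj, d_sj in zip(feats_src, discriminators):
        terms.append(l_cf(d_sj(feat_sj)).mean() + l_cf(d_sj(feat_t)).mean())
    # Per Eq. (4), D ascends this quantity while F descends it.
    return torch.stack(terms).mean()
```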

Online hard domain batch mining

  • samples from different sources are sometimes useless for improving the adaptation to the target, and as training proceeds, more and more redundant source samples drag down the whole model's performance

  • minibatch: sample a batch of size $M$ for the target and for each source domain

  • each source-target discriminator $D_{s_j}$'s loss is viewed as the degree to which it distinguishes the $M$ target samples $x^t_i$ from the $j$th source's $M$ samples:

$$\sum_i^M - \log D_{s_{j}}(F(x_i^{s_j})) - \log \left(1-D_{s_{j}}(F(x_i^{t}))\right)$$

This is the cross-entropy loss in the original GAN form. The larger it is, the worse the $M$ source samples and $M$ target samples are distinguished as coming from source $j$ versus the target domain, i.e., the worse the discriminator of source $j$ performs.

  • find the hard source domain: the $j^*$th source that the feature extractor $F$ performs the worst at confusing when transforming the target samples

$$j^*= \arg\max_j\Big\{ \sum_i^M - \log D_{s_{j}}(F(x_i^{s_j})) - \log \left(1-D_{s_{j}}(F(x_i^{t}))\right) \Big\}_{j=1}^N$$

  • we use the source $j^*$ and the target samples in the minibatch to train the feature extractor (see the sketch below)

  • the following objectives drive Algorithm 1, which iteratively updates the feature extractor:

$$\mathcal{L}_{a d v}^{s_{j^*}}(F, D)=\sum_{i}^{M} \mathcal{L}_{c f}\left(x_i^{s_{j^*}} ; F, D_{s_{j^*}}\right) +\mathcal{L}_{c f}\left(x_i^t ; F, D_{s_{j^*}}\right)$$

$$\min _{F} \max _{D} V(F, D ; \bar{C})=\mathcal{L}_{a d v}^{s_{j^*}}(F, D)+\mathcal{L}_{c l s}(F, \bar{C})\tag{4}$$
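A sketch of the mining step under the minibatch scheme above; the per-source quantity is exactly the GAN-style discriminator loss, and the source with the largest value is selected (function and argument names are illustrative):

```python
import torch

@torch.no_grad()
def hard_source_index(feats_src, feat_t, discriminators, eps=1e-6):
    """j* = argmax_j sum_i [-log D_sj(F(x_i^sj)) - log(1 - D_sj(F(x_i^t)))].

    feats_src:      list of N tensors, each (M, d): one batch per source
    feat_t:         (M, d) target batch
    discriminators: list of N source-specific discriminators
    """
    losses = []
    for feat_sj, d_sj in zip(feats_src, discriminators):
        p_s = d_sj(feat_sj).clamp(eps, 1 - eps)
        p_t = d_sj(feat_t).clamp(eps, 1 - eps)
        losses.append((-torch.log(p_s) - torch.log1p(-p_t)).sum())
    return int(torch.stack(losses).argmax())   # the mined hard source j*
```

Only the mined source $j^*$ and the target batch then enter $\mathcal{L}_{adv}^{s_{j^*}}$ to update $F$.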

[Figure: Algorithm 1, the multi-way adversarial adaptation procedure (original image unavailable)]

9.3 Target Discriminative Adaptation

  • Aided by the multi-way adversary, DCTN is able to obtain good domain-invariant features, yet these are not necessarily discriminative in the target domain.

  • auto-labeling strategy: annotate target samples, then jointly train the feature extractor and the multi-source category classifier with source and target images under their (pseudo-)labels

  • classification losses from the multiple source images and the pseudo-labeled target images:

$$\min _{F, C} \mathcal{L}_{c l s}(F, C)=\sum_{j}^{N} \mathbb{E}_{(x, y) \sim\left(X_{s_{j}}, Y_{s_{j}}\right)}\left[\mathcal{L}\left(C_{s_{j}}(F(x)), y\right)\right] +\mathbb{E}_{\left(x^{t}, \hat{y}\right) \sim\left(X_{t}^{p}, Y_{t}^{p}\right)}\left[\sum_{\hat{y} \in \mathcal{C}_{\hat{s}}} \mathcal{L}\left(C_{\hat{s}}\left(F\left(x^{t}\right)\right), \hat{y}\right)\right] \tag{8}$$

We apply the target classification operator to assign pseudo labels, and the samples with confidence higher than a preset threshold are selected into $X^{p}_t$.

Given a target instance $x^t$ with pseudo-labeled class $\hat y$, we find those sources $\hat s$ that include this class ($\hat y \in \mathcal C_{\hat s}$), then update the network via the sum of the multi-source classification losses; a sketch follows below.
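A hedged sketch of the pseudo-labeling step and the target term of Eq. 8. The threshold value and the `class_mask` layout are illustrative assumptions, and each source head is assumed to be expressed over the union label set:

```python
import torch
import torch.nn.functional as nnF

def select_pseudo_labels(confidence, threshold=0.9):
    """Keep target samples whose Eq. (2) confidence exceeds a preset
    threshold (0.9 is an illustrative value, not taken from the paper)."""
    conf, pseudo = confidence.max(dim=1)               # (B,), (B,)
    keep = (conf > threshold).nonzero(as_tuple=True)[0]
    return keep, pseudo[keep]

def target_cls_loss(logits_per_source, pseudo, class_mask):
    """Target term of Eq. (8): for each pseudo-labeled x^t, sum the
    cross-entropy over every source s_hat with y_hat in C_s_hat.

    logits_per_source: (B, N, C) per-source logits on the union label set
    pseudo:            (B,) pseudo labels y_hat of the kept target samples
    class_mask:        (N, C) bool; True iff class c is in C_sj
    """
    loss = logits_per_source.new_zeros(())
    for j in range(logits_per_source.shape[1]):
        has = class_mask[j, pseudo]                    # sources containing y_hat
        if has.any():
            loss = loss + nnF.cross_entropy(
                logits_per_source[has, j], pseudo[has])
    return loss
```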

[Figure: the target discriminative adaptation procedure (original image unavailable)]

10. Experiments

10.1 Benchmarks

  • 3 widely used UDA benchmarks
    • Office-31 [41]:
      • an object recognition benchmark with 31 categories and 4652 images unevenly spread across three visual domains: A (Amazon), D (DSLR), W (Webcam).
    • ImageCLEF-DA:
      • 50 images per category
      • 600 images in total per domain
      • derives from the ImageCLEF 2014 domain adaptation challenge, and is organized by selecting the 12 object categories (aeroplane, bike, bird, boat, bottle, bus, car, dog, horse, monitor, motorbike, and people) shared by three famous real-world datasets: I (ImageNet ILSVRC 2012), P (Pascal VOC 2012), C (Caltech-256).
    • Digits-five
      • five digit image sets respectively sampled from following public datasets
        • mt (MNIST) [26]
        • mm (MNIST-M) [11]
        • sv(SVHN) [36]
        • up (USPS)
        • sy (Synthetic Digits) [11].
      • For the images in MNIST, MNIST-M, SVHN and Synthetic Digits, we draw 25000 for training and 9000 for testing from each dataset.
      • There are only 9298 images in USPS, so we take the entire dataset as that domain.

10.2 Evaluations in the vanilla (common) setting

baseline

  • multi-source: two shallow methods

    • sparse FRAME (sFRAME) [46]
      • a non-stationary Markov random field model that reproduces the observed statistical properties of filter responses at a subset of selected locations, scales and orientations.
      • it can represent a wide variety of object patterns in natural images, and the learned models are useful for object classification.
    • SGF [16]
      • Motivated by incremental learning, we create intermediate representations of data between the two domains by viewing the generative subspaces (of same dimension) created from these domains as points on the Grassmann manifold, and sampling points along the geodesic between them to obtain subspaces that provide a meaningful description of the underlying domain shift.
  • single-source models → multi-source: conventional (TCA, GFK) / deep

    Since those methods operate in the single-source setting, we introduce two MDA standards for different purposes:

    • Source combine: all source domains are combined into a traditional single-source vs. target setting.
      • This standard tests whether the multi-source data is valuable to exploit.
    • Single best: among the multi-source domains, we report the single-source transfer result that performs best on the test set.
      • This tests whether we can further improve the best single-source UDA by introducing another source transfer.
  • source only

    • serves as the baseline in both the Source-combine and multi-source standards
    • uses all images from the sources to train backbone-based multi-source classifiers and directly applies them to classify target images

10.3 Evaluations in the category shift setting

  • split all categories into two non-overlapping class sets and define them as the private classes of each source, in two variants:

    • overlap
    • disjoint
  • DAN also suffers negative transfer gains in most situations, which indicates that the transferability of DAN is crippled under category shift.

  • In contrast, DCTN reduces the performance drop compared to the model in the vanilla setting, and obtains positive transfer gains in all situations. This reveals that DCTN can resist the negative transfer caused by the category shift.

11. Further Analysis

11.1 Feature visualization.

Visualize the DCTN activations before and after adaptation:

  • DCTN can successfully learn transferable features with multiple sources
  • the features learned by DCTN attain the desired discriminative property

11.2 Ablation study

  • The adversarial-only model excludes the pseudo labels and updates the category classifier with source samples only.

  • The pseudo-only model forbids the adversary and categorizes target samples with averaged multi-source results.

  • A third variant removes the online hard domain batch mining technique.

11.3 Convergence analysis

Despite frequent fluctuations, the classification loss, adversarial loss, and testing error gradually converge.
