abstract:
This paper proposes a new learning method, Transfer Component Analysis (TCA), to find a good feature representation across domains for domain adaptation. TCA learns transfer components shared by all domains (i.e., components that do not cause distribution change across domains while preserving the intrinsic structure of the original data), so that the distribution difference between domains is reduced in the projected subspace.
I. INTRODUCTION:
Notes:
Our main contribution is on proposing a novel dimensionality reduction method to reduce the distance between domains via projecting data onto a learned transfer subspace.
TCA and its semisupervised extension SSTCA are much more efficient than MMDE and can handle the out-of-sample extension problem.
Excerpts:
This is an important learning problem because labeled data are often difficult to come by, making it desirable to make the best use of any related data available. For example.....
A major computational problem in domain adaptation is how to reduce the difference between the distributions of the source and target domain data. Intuitively, discovering a good feature representation across domains is crucial...
In this paper, we propose a new feature extraction approach, called transfer component analysis (TCA), for domain adaptation. (After reviewing prior work and its shortcomings, the paper presents its own method.)
More specifically, if two domains are related to each other, there may exist several common components (or latent variables) underlying them.
II. PREVIOUS WORKS AND PRELIMINARIES
Notes: In Section II, we first introduce the domain adaptation problem and traditional dimensionality reduction methods, and describe the Hilbert space embedding for distances and dependence measures between distributions.
A. Domain Adaptation
The main difference between these methods and our proposed method is that we aim to match data distributions between domains in a latent space, where data properties can be preserved, instead of matching them in the original feature space.
Excerpt: The key assumption in most domain adaptation methods is that P ≠ Q, but P(Ys|Xs) = P(Yt|Xt).
B. Hilbert Space Embedding of Distributions
In 2006, Borgwardt et al. [1] proposed a distribution distance criterion based on the Reproducing Kernel Hilbert Space (RKHS): Maximum Mean Discrepancy (MMD).
The empirical MMD between source samples Xs = {x_s1, ..., x_sn1} and target samples Xt = {x_t1, ..., x_tn2} is

Dist(Xs, Xt) = ‖ (1/n1) Σ_{i=1}^{n1} φ(x_si) − (1/n2) Σ_{i=1}^{n2} φ(x_ti) ‖_H,

where ‖·‖_H is the RKHS norm and φ(·) is the kernel-induced feature map. By minimizing this quantity, a mapping that brings the two domain distributions close together can be obtained.
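As a sketch, the empirical (squared) MMD can be computed purely from kernel evaluations via the kernel trick. The RBF kernel and the bandwidth `sigma` below are illustrative choices for this note, not prescribed by the paper:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise RBF kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(Xs, Xt, sigma=1.0):
    # Squared MMD between the empirical distributions of Xs and Xt in the RKHS:
    # mean k(s, s') + mean k(t, t') - 2 * mean k(s, t)
    Kss = rbf_kernel(Xs, Xs, sigma)
    Ktt = rbf_kernel(Xt, Xt, sigma)
    Kst = rbf_kernel(Xs, Xt, sigma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()
```

Identical sample sets give an MMD of zero, and the value grows as the two empirical distributions drift apart.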
C. Embedding Using HSIC
III. TCA
Excerpt:
As mentioned in Section II-A, most domain adaptation methods assume that P ≠ Q, but P(Ys|Xs) = P(Yt|Xt). However, in many real-world applications, the conditional probability P(Y|X) may also change across domains due to noisy or dynamic factors underlying the observed data.
A. Minimizing Distance Between P(φ(Xs)) and P(φ(Xt))
Instead of finding the nonlinear transformation φ explicitly, we first revisit a dimensionality reduction-based domain adaptation method called MMDE.
The inputs are the two feature matrices. First compute the L and H matrices; then choose a common kernel (e.g., a linear or Gaussian kernel) to compute K; next, take the leading m eigenvectors of ((KLK + μI)^-1)KHK — these eigenvectors form the solution W. The projection W^T K then gives the dimensionality-reduced source and target data, on which conventional machine-learning methods can be applied.
Summary:
The final optimization objective of TCA is:

min_W  tr(W^T K L K W) + μ tr(W^T W)   s.t.  W^T K H K W = I_m,

where H = I − (1/(n1+n2)) 1 1^T is the centering matrix, L encodes the MMD weights, μ > 0 is a trade-off parameter, and W is the matrix being solved for.
Once the transfer-component matrix W has been obtained from the TCA procedure, it can be used to project data from both the source and target domains into a common latent space. Standard machine-learning methods can then be applied to the transformed data for tasks such as classification or regression: the transformed data serve as input to any learning algorithm, so a classifier or regression model trained on the source domain can be applied to the target domain.
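For instance, given projected source data `Zs` with labels `ys` and projected target data `Zt` (all arrays below are hypothetical), even a plain 1-nearest-neighbour rule in the shared subspace suffices to transfer labels:

```python
import numpy as np

def nn_classify(Zs, ys, Zt):
    # 1-NN in the latent space: each target point takes the label
    # of its closest source point
    d = np.sum((Zt[:, None, :] - Zs[None, :, :]) ** 2, axis=2)
    return ys[np.argmin(d, axis=1)]

# hypothetical 1-D projections of two source points and two target points
Zs = np.array([[0.0], [1.0]])
ys = np.array([0, 1])
Zt = np.array([[0.1], [0.9]])
yt_pred = nn_classify(Zs, ys, Zt)
```

Because source and target now share one coordinate system, distance comparisons between domains are meaningful, which is exactly what training in the original feature spaces could not guarantee.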