Two ICDM 2018 machine-learning papers, from http://mlda.swu.edu.cn/publication.php
- First, notes on the short paper (fPML for short)
- Then the long paper (ML-JMF for short)
- Finally, a summary of their similarities and differences (ongoing)
Feature-induced Partial Multi-label Learning (fPML)
ICDM 2018
Problem
- However, the performance of multi-label learning may be compromised by noisy (or incorrect) labels of training instances.
- The ground-truth labels are concealed in a set of candidate noisy labels, and the number of ground-truth labels is also unknown.
Most relevant
- partial multi-label learning [Xie et al., AAAI, 2018]
- It optimizes the label confidence values and the relevance ordering of labels of each instance by exploiting structural information in the feature and label spaces, and by minimizing a confidence-weighted ranking loss.
- However, it has to simultaneously optimize multiple binary predictors and a very large number of confidence rankings of candidate label pairs; hence it suffers from heavy computational costs.
Motivation
Why
- Since labels are correlated, the label correlation and the ground-truth instance-label association matrices have a linear dependence structure, and thus they are low-rank [Zhu et al., TKDE, 2018; Xu et al., ICDM, 2014].
- The low-rank approximation of a noisy matrix is robust to noise [Konstantinides et al., TIP, 1997; Meng et al., ICCV, 2013].
How
- We seek the ground-truth instance-label association matrix via learning the low-rank approximation of the observed association matrix, which contains noisy associations.
- The labels of an instance depend on its features, and thus the features of instances should be used to estimate noisy labels.
Method
- The main idea is to posit a noise-free label matrix $\widehat{\mathbf{Y}}$ and force it to factorize into low-rank matrices $\mathbf{S}$ and $\mathbf{G}$:

$$\widehat{\mathbf{Y}} \simeq \mathbf{S}\mathbf{G}^{T} \tag{1}$$

Note the dimensions of the two matrices:
- $\mathbf{S} \in \mathbb{R}^{q \times k}$ maps the $q$ labels into $k$ new (latent) labels
- $\mathbf{G} \in \mathbb{R}^{n \times k}$ maps the $n$ instances into the same $k$-dimensional latent space
- The objective at this point is Eq. (2):

$$\min_{\mathbf{S},\mathbf{G}} \left\|\mathbf{Y}-\mathbf{S}\mathbf{G}^{T}\right\|_{F}^{2} \tag{2}$$
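By the Eckart–Young theorem, the exact minimizer of Eq. (2) for a fixed rank $k$ comes from a truncated SVD. A minimal numpy sketch (sizes $q$, $n$, $k$ and the random label matrix are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sizes: q labels, n instances, latent rank k
q, n, k = 6, 20, 3
rng = np.random.default_rng(0)
Y = (rng.random((q, n)) < 0.3).astype(float)  # observed noisy label matrix

# Best rank-k fit of Eq. (2): truncated SVD (Eckart-Young)
U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
S = U[:, :k] * sigma[:k]   # q x k: labels -> latent labels
G = Vt[:k, :].T            # n x k: instances -> latent space
Y_hat = S @ G.T            # denoised low-rank approximation of Y

err = np.linalg.norm(Y - Y_hat, 'fro') ** 2  # residual of the fit
```

The paper's actual optimizer handles the additional terms below jointly; this only illustrates the plain rank-$k$ factorization.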
- So far only the label information has been used. The authors' contribution here is to also exploit the feature information of the original data $\mathbf{X}$ to constrain $\mathbf{G}$ (the paper describes this as sharing $\mathbf{G}$), adding a linear transformation with parameter $\mathbf{F}$, which yields Eq. (3):

$$\min_{\mathbf{S},\mathbf{F},\mathbf{G}} \left\|\mathbf{Y}-\mathbf{S}\mathbf{G}^{T}\right\|_{F}^{2}+\lambda_{1}\left\|\mathbf{X}-\mathbf{F}\mathbf{G}^{T}\right\|_{F}^{2} \tag{3}$$
Learning $\mathbf{F} \in \mathbb{R}^{d \times k}$ captures the relationships among features; $\lambda_{1}$ controls the trade-off between the two terms.
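Because $\mathbf{G}$ is shared by both terms, Eq. (3) can be minimized by alternating least squares: fixing $\mathbf{G}$ decouples $\mathbf{S}$ and $\mathbf{F}$ into independent least-squares problems, and fixing $\mathbf{S},\mathbf{F}$ lets $\mathbf{G}$ be solved by stacking the two residuals. A sketch under illustrative sizes and random data (this is not the paper's actual update rule, just one valid solver for the same objective):

```python
import numpy as np

q, n, d, k, lam = 6, 20, 10, 3, 1.0   # illustrative sizes and lambda_1
rng = np.random.default_rng(1)
Y = (rng.random((q, n)) < 0.3).astype(float)  # candidate label matrix
X = rng.standard_normal((d, n))               # feature matrix
S = rng.standard_normal((q, k))
F = rng.standard_normal((d, k))
G = rng.standard_normal((n, k))

def objective(S, F, G):
    return (np.linalg.norm(Y - S @ G.T) ** 2
            + lam * np.linalg.norm(X - F @ G.T) ** 2)

obj0 = objective(S, F, G)
for _ in range(50):
    # Fix G: S and F are independent least-squares problems
    S = np.linalg.lstsq(G, Y.T, rcond=None)[0].T
    F = np.linalg.lstsq(G, X.T, rcond=None)[0].T
    # Fix S, F: stack both residual terms so G solves them jointly
    A = np.vstack([S, np.sqrt(lam) * F])        # (q+d) x k
    B = np.vstack([Y, np.sqrt(lam) * X])        # (q+d) x n
    G = np.linalg.lstsq(A, B, rcond=None)[0].T  # n x k
obj_final = objective(S, F, G)
```

Each alternating step solves its subproblem exactly, so the objective is non-increasing across iterations.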
- Finally, to map predictions back to the label space, a linear operator $\mathbf{W}$ is introduced, turning Eq. (4) into Eq. (5):

$$\min_{\mathbf{W}} \left\|\mathbf{Y}-\mathbf{W}^{T}\mathbf{X}\right\|_{F}^{2} \tag{4}$$

$$\min_{\mathbf{W}} \left\|\mathbf{S}\mathbf{G}^{T}-\mathbf{W}^{T}\mathbf{X}\right\|_{F}^{2} \tag{5}$$
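With $\mathbf{S}$ and $\mathbf{G}$ fixed, Eq. (5) is an ordinary least-squares problem in $\mathbf{W}$ and has a closed-form solution via the normal equations. A sketch with illustrative sizes (the tiny ridge term is my addition to keep $\mathbf{X}\mathbf{X}^{T}$ invertible; it is not part of the paper's formulation):

```python
import numpy as np

q, n, d, k = 6, 20, 10, 3   # illustrative sizes
rng = np.random.default_rng(2)
X = rng.standard_normal((d, n))   # feature matrix
S = rng.standard_normal((q, k))
G = rng.standard_normal((n, k))
B = S @ G.T                       # denoised targets from Eq. (1), q x n

# Normal equations for Eq. (5): (X X^T) W = X (S G^T)^T,
# with a small ridge for numerical stability
ridge = 1e-6
W = np.linalg.solve(X @ X.T + ridge * np.eye(d), X @ B.T)  # d x q

# Predict label scores for a new feature vector x via W^T x
x_new = rng.standard_normal(d)
scores = W.T @ x_new   # length-q vector of label scores
```

At test time, ranking the entries of $\mathbf{W}^{T}\mathbf{x}$ gives the relevance ordering of the $q$ labels for a new instance.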