《Bayes-Optimal Hierarchical Multi-label Classification》 - TKDE

This paper systematically reviews the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and the Ranking loss to support class hierarchies.
Reading Difficulty: ⋆⋆
Creativity: ⋆⋆
Comprehensiveness: ⋆⋆⋆⋆⋆

Symbol System:

| Symbol | Meaning |
| --- | --- |
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i),\downarrow(i),\Uparrow(i),\Downarrow(i),\Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ | The label vector for the classes in $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | The indicator function: 1 when $x$ is true, 0 otherwise |
| $\mathcal{R}$ | The conditional risk |

Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$

For DAG-type HMC, there are two interpretations:

  1. AND-interpretation: $y_i = 1 \Rightarrow y_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.

  2. OR-interpretation: $y_i = 1 \Rightarrow \exists\, j \in \uparrow(i): y_j = 1$, i.e., at least one parent of a positive node must be positive.

Loss functions for Flat and Hierarchical Classification
It is a review.

Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}} \neq \mathbf{y})$$

Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}},\mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$

Top-$k$ precision:
It considers the $k$ most-confident predicted labels for each sample:
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true positives among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$$
So the loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$$

Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j):y_i > y_j} \Big(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j)\Big)$$
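
To make the definitions above concrete, here is a minimal Python sketch of these flat losses; the 0/1 label vectors, the real-valued confidence scores, and all function names are illustrative assumptions, not code from the paper.

```python
# Illustrative flat losses; y and y_hat are 0/1 vectors, scores are real-valued
# confidences used for the top-k and ranking losses (all names are assumptions).

def zero_one_loss(y_hat, y):
    return int(y_hat != y)                       # 1 iff the whole label vector is wrong

def hamming_loss(y_hat, y):
    return sum(int(a != b) for a, b in zip(y_hat, y))

def top_k_loss(scores, y, k):
    # k most-confident labels; count how many of them are truly positive.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return 1.0 - sum(y[i] for i in top_k) / k

def ranking_loss(scores, y):
    # Pairs (i, j) with y_i > y_j, i.e., i positive and j negative.
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                if scores[i] < scores[j]:
                    loss += 1.0
                elif scores[i] == scores[j]:
                    loss += 0.5
    return loss

y, y_hat, scores = [1, 0, 1, 0], [1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1]
print(zero_one_loss(y_hat, y), hamming_loss(y_hat, y),
      top_k_loss(scores, y, k=2), ranking_loss(scores, y))   # 1 2 0.5 1.0
```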

Hierarchical Multi-class Classification
A review.
Note: in hierarchical multi-class classification, only a single path in the hierarchy can be predicted positive.

Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where $c_i$ is the cost for node $i$.

Dekel et al.:
Their loss is more involved, but this paper treats it as essentially similar to the cost-weighted loss above.

Hierarchical multi-label classification

H-Loss:
$$\ell_H = \alpha \sum_{i:y_i=1,\hat{y}_i=0} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i:y_i=0,\hat{y}_i = 1} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.

Often, misclassifications at the upper levels of the hierarchy are considered more expensive than those at the lower levels.
Thus, a common cost-assignment scheme is
$$c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
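
A minimal sketch of this cost assignment on a toy tree; the dictionary-based hierarchy representation and the node numbering (ids increase with depth) are my own assumptions.

```python
# Toy example of the cost assignment above: node 0 is the root, and each other
# node's cost is its parent's cost divided by its number of siblings (itself included).
parent = {1: 0, 2: 0, 3: 1, 4: 1}      # child -> parent

def num_siblings(i):
    return sum(1 for j in parent if parent[j] == parent[i])

costs = {0: 1.0}
for i in sorted(parent):               # node ids increase with depth, so parents come first
    costs[i] = costs[parent[i]] / num_siblings(i)

print(costs)                           # {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.25, 4: 0.25}
```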

Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i:y_i=1}\phi(i, \hat{\mathbf{y}}) + \beta \sum_{i:\hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i,\mathbf{y}) = \min_{j:y_j=1} \text{cost}(j\rightarrow i)$$
and $\text{cost}(j\rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
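
Below is a hedged sketch of the matching loss on a toy tree, assuming the unweighted undirected path length as $\text{cost}(j\rightarrow i)$; the tree, the labels, and $\alpha=\beta=1$ are illustrative choices, not values from the paper.

```python
# Toy matching loss with cost(j -> i) = unweighted path length in the tree.
from collections import deque

parent = {1: 0, 2: 0, 3: 1, 4: 1}                  # node 0 is the root

def path_length(src, dst):
    adj = {}                                       # undirected adjacency of the tree
    for c, p in parent.items():
        adj.setdefault(c, set()).add(p)
        adj.setdefault(p, set()).add(c)
    dist, queue = {src: 0}, deque([src])
    while queue:                                   # plain BFS
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[dst]

def phi(i, y):
    # Distance from node i to the closest positive node in y.
    return min(path_length(j, i) for j in range(len(y)) if y[j] == 1)

def matching_loss(y_hat, y, alpha=1.0, beta=1.0):
    fn = sum(phi(i, y_hat) for i in range(len(y)) if y[i] == 1)
    fp = sum(phi(i, y) for i in range(len(y_hat)) if y_hat[i] == 1)
    return alpha * fn + beta * fp

y     = [1, 1, 0, 1, 0]    # true path 0 -> 1 -> 3
y_hat = [1, 1, 0, 0, 1]    # predicted path 0 -> 1 -> 4
print(matching_loss(y_hat, y))   # phi(3, y_hat) + phi(4, y) = 1 + 1 = 2.0
```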

Verspoor et al.: hierarchical versions of precision, recall, and F-score, but these are more expensive to compute.

Condensing Sort and Selection Algorithm for HMC
It is a review.
It can be used on both tree and DAG hierarchies.

It solves the following optimization objective via a greedy procedure, the condensing sort and selection algorithm:
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}} \quad & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.} \quad & \psi_i \leq \psi_{\uparrow(i)}, \ \forall i \in \mathcal{H}\setminus \{0\},\\ & \psi_0 = 1, \ \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L \end{aligned}$$

where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and $\psi_i = 0$ otherwise.

When the label hierarchy is a DAG, the first constraint above has to be replaced by
$$\psi_i \leq \psi_j, \ \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$$

Extending Flat Losses
This paper extends the Hamming loss and the Ranking loss to support class hierarchies.

For the hierarchical Hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$

For a DAG class hierarchy, the costs are defined as
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in \uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.

The original paper also handles some special cases, but they are straightforward and not discussed here.
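
A minimal sketch of this DAG cost assignment, reading the sum as running over the parents of $i$; the toy DAG and its node numbering (which happens to follow a topological order) are my own assumptions.

```python
# Toy DAG cost assignment: each node's cost sums, over its parents j, the parent's
# cost divided by j's number of children (node ids follow a topological order here).
parents  = {1: [0], 2: [0], 3: [1, 2]}    # node 3 has two parents (a DAG)
children = {0: [1, 2], 1: [3], 2: [3]}

costs = {0: 1.0}
for i in sorted(parents):                 # parents are processed before their children
    costs[i] = sum(costs[j] / len(children[j]) for j in parents[i])

print(costs)                              # {0: 1.0, 1: 0.5, 2: 0.5, 3: 1.0}
```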

For the hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j):y_i > y_j} c_{ij} \Big(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2}I(\hat{y}_i = \hat{y}_j)\Big),$$

where $c_{ij} = c_i c_j$; this ensures a high penalty when an upper-level positive label is ranked below a lower-level negative label.
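
A short sketch of the hierarchical ranking loss with $c_{ij} = c_i c_j$; the node costs, scores, and labels below are made-up values for illustration.

```python
# Toy hierarchical ranking loss with pairwise costs c_ij = c_i * c_j.

def h_rank_loss(scores, y, c):
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                      # positive i should be ranked above negative j
                if scores[i] < scores[j]:
                    loss += c[i] * c[j]
                elif scores[i] == scores[j]:
                    loss += 0.5 * c[i] * c[j]
    return loss

c      = [1.0, 0.5, 0.5, 0.25]    # upper-level nodes get larger costs
y      = [1, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.1]     # positive node 1 is ranked below negative node 2
print(h_rank_loss(scores, y, c))  # c_1 * c_2 = 0.25
```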

Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}},\mathbf{y})$ over all possible ground truths $\mathbf{y}$, i.e.,
$$\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}),$$
and the Bayes-optimal prediction minimizes this risk (expected risk minimization):
$$\hat{\mathbf{y}}^\star = \argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}}).$$

There are three issues to be addressed:
(1) Estimating $P(\mathbf{y} \mid \mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.

This paper estimates the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule along the hierarchy, and the risk is then rewritten into a different form for each loss.
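
The sketch below shows one way the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ could be obtained via the chain rule down a tree hierarchy, assuming per-node conditionals $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$ are available from some base classifier; the numbers are hard-coded and this is only my reading of the chain-rule step, not code from the paper.

```python
# Toy chain-rule computation of the marginals p_i = P(y_i = 1 | x) on a tree:
# cond[i] stands for P(y_i = 1 | y_parent(i) = 1, x), with made-up values.
parent = {1: 0, 2: 0, 3: 1}
cond   = {0: 0.9, 1: 0.7, 2: 0.4, 3: 0.6}     # cond[0] is P(y_0 = 1 | x) for the root

p = {0: cond[0]}
for i in sorted(parent):                       # parents are processed before children
    # Under the hierarchy constraint, y_i = 1 requires y_parent(i) = 1.
    p[i] = p[parent[i]] * cond[i]

print({i: round(v, 3) for i, v in p.items()})  # {0: 0.9, 1: 0.63, 2: 0.36, 3: 0.378}
```
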
The risk for the matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i:\hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$$

where $q_i = \sum_{j=0}^{d(i)-1}\sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)}\, P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths $0$ to $j$.

The risk for the hierarchical Hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} c_i p_i + \beta \sum_{i:\hat{y}_i=1} c_i(1 - p_i)$$
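
A minimal sketch of evaluating this hierarchical Hamming risk for a candidate prediction; the marginals and costs below are illustrative values.

```python
# Hierarchical Hamming risk of a candidate prediction, given marginals p and costs c.

def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
    fn = sum(c[i] * p[i] for i in range(len(y_hat)) if y_hat[i] == 0)          # expected FN cost
    fp = sum(c[i] * (1.0 - p[i]) for i in range(len(y_hat)) if y_hat[i] == 1)  # expected FP cost
    return alpha * fn + beta * fp

p = [0.9, 0.7, 0.2]
c = [1.0, 0.5, 0.5]
print(round(h_hamming_risk([1, 1, 0], p, c), 3))   # 1.0*0.1 + 0.5*0.3 + 0.5*0.2 = 0.35
```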

The risk for the hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij}\Big(p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \frac{p_i+p_j}{2}I(\hat{y}_i = \hat{y}_j)\Big) - C,$$
where $C$ is a constant that does not depend on $\hat{\mathbf{y}}$.

Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1,\dots,N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}}\in \Omega,\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}})$$
and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$, so $|\text{supp}(\hat{\mathbf{y}})|$ is the number of predicted positive labels.

This is actually a fairly simple and easy-to-understand optimization strategy: optimize separately for each possible number of positive labels, i.e., for each different $L$.
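
Below is a naive, exhaustive sketch of this per-$L$ minimization on a tiny tree, using the hierarchical Hamming risk; it only illustrates the decomposition over $L$, and the toy hierarchy, marginals, costs, and the requirement that the root is always positive are my own assumptions. The paper instead solves the inner problem with the greedy condensing sort and selection algorithm rather than exhaustive search.

```python
# Exhaustive per-L minimization of the hierarchical Hamming risk on a toy tree
# (the paper replaces this inner search with a greedy condensing-sort-and-selection step).
from itertools import product

parent = {1: 0, 2: 0, 3: 1}              # toy tree, node 0 is the root
p = [0.9, 0.7, 0.2, 0.6]                 # illustrative marginals P(y_i = 1 | x)
c = [1.0, 0.5, 0.5, 0.5]                 # illustrative node costs
N = len(p)

def consistent(y_hat):
    # Hierarchy constraint: a positive node requires a positive parent.
    return all(y_hat[parent[i]] == 1 for i in parent if y_hat[i] == 1)

def risk(y_hat):                         # hierarchical Hamming risk with alpha = beta = 1
    return sum(c[i] * p[i] for i in range(N) if y_hat[i] == 0) + \
           sum(c[i] * (1.0 - p[i]) for i in range(N) if y_hat[i] == 1)

best = None
for L in range(1, N + 1):
    # All hierarchy-consistent predictions with exactly L positive labels (root included).
    candidates = [y for y in product([0, 1], repeat=N)
                  if sum(y) == L and y[0] == 1 and consistent(y)]
    if candidates:
        y_L = min(candidates, key=risk)              # best prediction for this L
        if best is None or risk(y_L) < risk(best):
            best = y_L

print(best, round(risk(best), 3))        # (1, 1, 0, 1) 0.55
```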

For tree label hierarchies, this paper adopts CSSAG (the condensing sort and selection algorithm proposed by Bi et al.), which is a greedy strategy.

Conclusions

This paper extends the matching loss, the Hamming loss, and the ranking loss to support both tree-type and DAG-type class hierarchies.
The paper is easy to understand and not especially innovative, but it is well organized and highly comprehensive, which is why it was published in TKDE.
