This paper systematically summarizes the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and ranking loss to support class hierarchies.
Reading Difficulty: ⋆⋆
Creativity: ⋆⋆
Comprehensiveness: ⋆⋆⋆⋆⋆
Symbol System:
| Symbol | Meaning |
|---|---|
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i),\downarrow(i),\Uparrow(i),\Downarrow(i),\Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{\mathbf{i}}$ | The label vector for the set of classes $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | An indicator function that outputs 1 when $x$ is true and 0 otherwise |
| $\mathcal{R}$ | The conditional risk |
Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$
For DAG-type HMC, there are two interpretations:
- AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of $i$ must be positive.
- OR-interpretation: $y_i = 1 \Rightarrow \exists j \in \uparrow(i)$ such that $y_j = 1$, i.e., at least one parent of $i$ must be positive.
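To make the two interpretations concrete, here is a minimal Python sketch (the toy parent lists and label vectors are made up for illustration) that checks whether a label vector satisfies the AND- or OR-interpretation of the hierarchy constraint:

```python
def satisfies_hierarchy(y, parents, mode="AND"):
    """Check the hierarchy constraint for a label vector y.

    y       : list of 0/1 labels, one per node (node 0 is the root).
    parents : parents[i] is the list of parents of node i ([] for the root).
    mode    : "AND" requires all parents positive; "OR" requires at least one.
    """
    for i, yi in enumerate(y):
        if yi == 1 and parents[i]:
            parent_labels = [y[j] for j in parents[i]]
            ok = all(parent_labels) if mode == "AND" else any(parent_labels)
            if not ok:
                return False
    return True

# Toy DAG: node 3 has two parents (1 and 2).
parents = [[], [0], [0], [1, 2]]
print(satisfies_hierarchy([1, 1, 0, 1], parents, mode="AND"))  # False
print(satisfies_hierarchy([1, 1, 0, 1], parents, mode="OR"))   # True
```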
Loss functions for Flat and Hierarchical Classification
It is a review.
Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}}\neq \mathbf{y})$$
Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}},\mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$
Top-$k$ precision: take the $k$ most-confident predicted positive labels for each sample, then
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true-positive predictions among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}.$$
So the loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}.$$
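As a quick sanity check on the flat losses above, here is a minimal Python sketch (the label vectors and scores are made up) computing the Hamming loss and the top-$k$ precision:

```python
def hamming_loss(y_hat, y):
    # Number of label positions where prediction and ground truth disagree.
    return sum(int(a != b) for a, b in zip(y_hat, y))

def top_k_precision(scores, y, k):
    # Fraction of the k highest-scoring labels that are truly positive.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(y[i] for i in top_k) / k

y      = [1, 0, 1, 0, 1]
y_hat  = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(hamming_loss(y_hat, y))               # 2
print(top_k_precision(scores, y, k=3))      # 2/3: labels 0 and 2 are true positives
print(1 - top_k_precision(scores, y, k=3))  # the corresponding top-k loss
```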
Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j):y_i > y_j} \left(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j)\right)$$
Hierarchical Multi-class Classification
A review.
Note: Only a single path can be predicted positive.
Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where $c_i$ is the cost for node $i$.
Dekel et al.: this loss looks more complicated, but the paper appears to treat it as similar to the loss above.
Hierarchical multi-label classification
H-Loss:
$$\ell_H = \alpha \sum_{i:y_i=1,\hat{y}_i=0} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i:y_i=0,\hat{y}_i = 1} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.
Often, misclassifications at upper levels of the class hierarchy are considered more expensive than those at lower levels.
Thus, one cost-assignment approach is
$$c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
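A minimal sketch of this cost assignment on a tree (the parent array is a made-up example); each node receives its parent's cost divided by the size of its sibling group:

```python
def tree_costs(parent):
    """parent[i] is the parent of node i; parent[0] = -1 for the root.
    Assumes nodes are numbered so that parent[i] < i.
    Returns c with c[0] = 1 and c[i] = c[parent[i]] / (#siblings of i, including i)."""
    n = len(parent)
    n_children = [0] * n        # n_children[p] = size of the sibling group under p
    for i in range(1, n):
        n_children[parent[i]] += 1
    c = [0.0] * n
    c[0] = 1.0
    for i in range(1, n):
        c[i] = c[parent[i]] / n_children[parent[i]]
    return c

# Root 0 with children {1, 2}; node 1 with children {3, 4}.
print(tree_costs([-1, 0, 0, 1, 1]))  # [1.0, 0.5, 0.5, 0.25, 0.25]
```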
Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i:y_i=1}\phi(i, \hat{\mathbf{y}}) + \beta \sum_{i:\hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i,\mathbf{y}) = \min_{j:y_j=1} \text{cost}(j\rightarrow i),$$
and $\text{cost}(j\rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
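The function $\phi$ can be computed directly from its definition; below is a minimal sketch, assuming $\text{cost}(j\rightarrow i)$ is the unweighted path length in the (undirected) hierarchy graph, with a made-up toy tree:

```python
from collections import deque

def path_length(adj, src, dst):
    # BFS distance between two nodes in the (undirected) hierarchy graph.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")

def phi(i, y, adj):
    # phi(i, y) = min over positive nodes j of cost(j -> i), here the path length.
    return min(path_length(adj, j, i) for j, yj in enumerate(y) if yj == 1)

# Tree: 0 - {1, 2}, 1 - {3}.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(phi(3, [1, 1, 0, 0], adj))  # 1: the closest positive node to node 3 is node 1
```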
Verspoor et al.: hierarchical versions of precision, recall, and F-score, but these are more expensive to compute.
Condensing Sort and Selection Algorithm (CSSA) for HMC
It is a review.
It can be used on both tree and DAG hierarchies.
It solves the following optimization objective via a greedy algorithm, the condensing sort and selection algorithm:
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}} \quad & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.} \quad & \psi_i \leq \psi_{\uparrow(i)}, \ \forall i \in \mathcal{H}\setminus \{0\},\\ & \psi_0 = 1, \ \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L, \end{aligned}$$
where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and $\psi_i = 0$ otherwise.
When the label hierarchy is a DAG, the first constraint of the above objective has to be replaced by
$$\psi_i \leq \psi_j, \ \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$$
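For intuition, here is a simplified greedy sketch of this constrained selection problem on a tree: grow a rooted subtree by repeatedly adding the highest-scoring node whose parent is already selected. This is only an approximation for illustration; the actual CSSA additionally condenses nodes into supernodes to guarantee an optimal solution. The scores (playing the role of $\widetilde{y}_i$) and the tree are made up:

```python
import heapq

def greedy_subtree(scores, parent, L):
    """Pick L nodes forming a rooted subtree, greedily by score.
    parent[i] < i and parent[0] = -1 (topological numbering).
    NOTE: a simplification of CSSA, not guaranteed to be optimal."""
    n = len(scores)
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)
    selected = {0}                      # psi_0 = 1: the root is always selected
    frontier = [(-scores[c], c) for c in children[0]]
    heapq.heapify(frontier)
    while len(selected) < L and frontier:
        _, i = heapq.heappop(frontier)  # best unselected node whose parent is selected
        selected.add(i)
        for c in children[i]:
            heapq.heappush(frontier, (-scores[c], c))
    return sorted(selected)

scores = [1.0, 0.2, 0.9, 0.8, 0.1]
parent = [-1, 0, 0, 2, 2]
print(greedy_subtree(scores, parent, L=3))  # [0, 2, 3]
```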
Extending the Flat Losses
This paper extends the Hamming loss and ranking loss to support the class hierarchy.
For the hierarchical Hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$
For a DAG class hierarchy, the costs become
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in \uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.
There are special cases in the original paper, but they are straightforward and not discussed here.
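A minimal sketch of this DAG cost assignment (assuming, as written above, that the sum runs over the parents of $i$ and that nodes are indexed in topological order; the example DAG is made up):

```python
def dag_costs(parents):
    """parents[i] is the list of parents of node i (empty for the root, node 0).
    Assumes topological numbering (every parent index < child index).
    c[0] = 1 and c[i] = sum over parents j of c[j] / (#children of j)."""
    n = len(parents)
    n_children = [0] * n
    for i in range(1, n):
        for j in parents[i]:
            n_children[j] += 1
    c = [0.0] * n
    c[0] = 1.0
    for i in range(1, n):
        c[i] = sum(c[j] / n_children[j] for j in parents[i])
    return c

# DAG: 0 -> {1, 2}, 1 -> {3, 4}, 2 -> {4} (node 4 has two parents).
print(dag_costs([[], [0], [0], [1], [1, 2]]))  # [1.0, 0.5, 0.5, 0.25, 0.75]
```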
For the hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j):y_i > y_j} c_{ij} \left(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2}I(\hat{y}_i = \hat{y}_j)\right),$$
where $c_{ij} = c_i c_j$, which ensures a high penalty when an upper-level positive label is ranked after a lower-level negative label.
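A minimal sketch of the hierarchical ranking loss with $c_{ij} = c_i c_j$ (node costs, labels, and scores are made up); setting all $c_i = 1$ recovers the flat ranking loss above:

```python
def h_ranking_loss(scores, y, c):
    # Weighted ranking loss: each mis-ranked (or tied) pair (i, j) with
    # y_i > y_j is penalized by c_ij = c_i * c_j.
    loss = 0.0
    for i, yi in enumerate(y):
        for j, yj in enumerate(y):
            if yi > yj:
                c_ij = c[i] * c[j]
                if scores[i] < scores[j]:
                    loss += c_ij
                elif scores[i] == scores[j]:
                    loss += 0.5 * c_ij
    return loss

y      = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
c      = [1.0, 0.5, 0.5, 0.25]
print(h_ranking_loss(scores, y, c))  # 0.25: positive node 1 is ranked below negative node 2
```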
Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}},\mathbf{y})$ over all possible ground truths $\mathbf{y}$, and the goal is expected-risk minimization:
$$\argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}}) = \argmin_{\hat{\mathbf{y}} \in \Omega} \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}).$$
There are three issues to be addressed:
(1) Estimating $P(\mathbf{y}\mid\mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.
This paper computes the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule, and the risk is rewritten into a different form for each loss.
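A minimal sketch of the chain-rule computation of $p_i$ on a tree, assuming per-node conditional probabilities $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$ are available (e.g., from per-node classifiers; the numbers below are made up):

```python
def marginals(cond_prob, parent):
    """cond_prob[i] = P(y_i = 1 | y_parent(i) = 1, x), with cond_prob[0] = P(y_0 = 1 | x).
    parent[i] < i (topological numbering), parent[0] = -1.
    Under the tree hierarchy constraint, p_i = cond_prob[i] * p_parent(i)."""
    p = [0.0] * len(parent)
    p[0] = cond_prob[0]
    for i in range(1, len(parent)):
        p[i] = cond_prob[i] * p[parent[i]]
    return p

cond_prob = [1.0, 0.8, 0.3, 0.5]   # here the root is always positive
parent    = [-1, 0, 0, 1]
print(marginals(cond_prob, parent))  # [1.0, 0.8, 0.3, 0.4]
```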
The risk for matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} p_i\, \phi(i, \hat{\mathbf{y}}) + \beta \sum_{i: \hat{y}_i = 1} q_i,$$
where $q_i = \sum_{j=0}^{d(i)-1}\sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)} P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.
The risk for hierarchical hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} c_i p_i + \beta \sum_{i:\hat{y}_i=1} c_i(1 - p_i)$$
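Given the marginals $p_i$ and the node costs $c_i$, the hierarchical Hamming risk of a candidate prediction is a direct sum; a minimal sketch with made-up inputs:

```python
def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
    # Expected hierarchical Hamming loss: each predicted-negative node contributes
    # c_i * p_i (risk of a false negative), each predicted-positive node contributes
    # c_i * (1 - p_i) (risk of a false positive).
    risk = 0.0
    for i, yi_hat in enumerate(y_hat):
        if yi_hat == 0:
            risk += alpha * c[i] * p[i]
        else:
            risk += beta * c[i] * (1.0 - p[i])
    return risk

p = [1.0, 0.8, 0.3, 0.4]
c = [1.0, 0.5, 0.5, 0.25]
print(h_hamming_risk([1, 1, 0, 0], p, c))  # 0.35
```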
The risk for hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij}\left(p_i I (\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \frac{p_i+p_j}{2}I(\hat{y}_i = \hat{y}_j)\right) - C$$
Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1,\dots,N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}}\in \Omega:\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}}),$$
and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$.
This is actually a fairly simple and easy-to-understand optimization objective: optimize separately for each possible number of positive labels, i.e., for each different $L$.
This paper adopts CSSAG (the condensing sort and selection algorithm proposed by Bi) for tree label hierarchies, which is a greedy strategy.
Conclusions
This paper extends matching loss, hamming loss and ranking loss to support tree-type as well as DAG-type class hierarchies.
This paper seems easy to understand and not particularly innovative, but it is well organized and highly comprehensive, which is why it was published in TKDE.