This paper systematically summarizes the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and ranking loss to support class hierarchies.
Reading Difficulty: ⋆⋆
Creativity: ⋆⋆
Comprehensiveness: ⋆⋆⋆⋆⋆
Symbol System:
| Symbol | Meaning |
|---|---|
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i),\downarrow(i),\Uparrow(i),\Downarrow(i),\Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{\mathbf{i}}$ | The label vector for the set of classes $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | An indicator function that outputs 1 when $x$ is true and 0 otherwise |
| $\mathcal{R}$ | The conditional risk |
Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$
For DAG-type HMC, there are two interpretations:
- AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of $i$ must be positive.
- OR-interpretation: $y_i = 1 \Rightarrow \exists j \in \uparrow(i)$ such that $y_j = 1$, i.e., at least one parent of $i$ must be positive.
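To make the two interpretations concrete, here is a minimal Python sketch (the toy parent lists and label vectors are made up for illustration) that checks whether a label vector satisfies the AND- or OR-interpretation of the hierarchy constraint:

```python
def satisfies_hierarchy(y, parents, mode="AND"):
    """Check the hierarchy constraint for a label vector y.

    y       : list of 0/1 labels, one per node (node 0 is the root).
    parents : parents[i] is the list of parents of node i ([] for the root).
    mode    : "AND" requires all parents positive; "OR" requires at least one.
    """
    for i, yi in enumerate(y):
        if yi == 1 and parents[i]:
            parent_labels = [y[j] for j in parents[i]]
            ok = all(parent_labels) if mode == "AND" else any(parent_labels)
            if not ok:
                return False
    return True

# Toy DAG: node 3 has two parents (1 and 2).
parents = [[], [0], [0], [1, 2]]
print(satisfies_hierarchy([1, 1, 0, 1], parents, mode="AND"))  # False
print(satisfies_hierarchy([1, 1, 0, 1], parents, mode="OR"))   # True
```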
Loss functions for Flat and Hierarchical Classification
It is a review.
Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}}\neq \mathbf{y})$$
Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}},\mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$
Top-$k$ precision: take the $k$ most-confident predicted positive labels for each sample, then
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true-positive predictions among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}.$$
So the loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}.$$
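As a quick sanity check on the flat losses above, here is a minimal Python sketch (the label vectors and scores are made up) computing the Hamming loss and the top-$k$ precision:

```python
def hamming_loss(y_hat, y):
    # Number of label positions where prediction and ground truth disagree.
    return sum(int(a != b) for a, b in zip(y_hat, y))

def top_k_precision(scores, y, k):
    # Fraction of the k highest-scoring labels that are truly positive.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(y[i] for i in top_k) / k

y      = [1, 0, 1, 0, 1]
y_hat  = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(hamming_loss(y_hat, y))               # 2
print(top_k_precision(scores, y, k=3))      # 2/3: labels 0 and 2 are true positives
print(1 - top_k_precision(scores, y, k=3))  # the corresponding top-k loss
```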
Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j):y_i > y_j} \left(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j)\right)$$
Hierarchical Multi-class Classification
A review.
Note: Only a single path can be predicted positive.
Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where $c_i$ is the cost for node $i$.
Dekel et al.: this loss looks more complicated, but the paper appears to treat it as similar to the loss above.
Hierarchical multi-label classification
H-Loss:
$$\ell_H = \alpha \sum_{i:y_i=1,\hat{y}_i=0} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i:y_i=0,\hat{y}_i = 1} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.
Often, misclassifications at upper levels of the class hierarchy are considered more expensive than those at lower levels.
Thus, one cost-assignment approach is
$$c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
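A minimal sketch of this cost assignment on a tree (the parent array is a made-up example); each node receives its parent's cost divided by the size of its sibling group:

```python
def tree_costs(parent):
    """parent[i] is the parent of node i; parent[0] = -1 for the root.
    Assumes nodes are numbered so that parent[i] < i.
    Returns c with c[0] = 1 and c[i] = c[parent[i]] / (#siblings of i, including i)."""
    n = len(parent)
    n_children = [0] * n        # n_children[p] = size of the sibling group under p
    for i in range(1, n):
        n_children[parent[i]] += 1
    c = [0.0] * n
    c[0] = 1.0
    for i in range(1, n):
        c[i] = c[parent[i]] / n_children[parent[i]]
    return c

# Root 0 with children {1, 2}; node 1 with children {3, 4}.
print(tree_costs([-1, 0, 0, 1, 1]))  # [1.0, 0.5, 0.5, 0.25, 0.25]
```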
Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i:y_i=1}\phi(i, \hat{\mathbf{y}}) + \beta \sum_{i:\hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i,\mathbf{y}) = \min_{j:y_j=1} \text{cost}(j\rightarrow i),$$
and $\text{cost}(j\rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
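The function $\phi$ can be computed directly from its definition; below is a minimal sketch, assuming $\text{cost}(j\rightarrow i)$ is the unweighted path length in the (undirected) hierarchy graph, with a made-up toy tree:

```python
from collections import deque

def path_length(adj, src, dst):
    # BFS distance between two nodes in the (undirected) hierarchy graph.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")

def phi(i, y, adj):
    # phi(i, y) = min over positive nodes j of cost(j -> i), here the path length.
    return min(path_length(adj, j, i) for j, yj in enumerate(y) if yj == 1)

# Tree: 0 - {1, 2}, 1 - {3}.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(phi(3, [1, 1, 0, 0], adj))  # 1: the closest positive node to node 3 is node 1
```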
Verspoor et al.: hierarchical versions of precision, recall, and F-score, but these are more expensive to compute.
Condensing Sort and Selection Algorithm (CSSA) for HMC
It is a review.
It can be used on both tree and DAG hierarchies.
It solves the following optimization objective via a greedy algorithm, the condensing sort and selection algorithm:
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}} \quad & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.} \quad & \psi_i \leq \psi_{\uparrow(i)}, \ \forall i \in \mathcal{H}\setminus \{0\},\\ & \psi_0 = 1, \ \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L, \end{aligned}$$
where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and $\psi_i = 0$ otherwise.
When the label hierarchy is a DAG, the first constraint of the above objective has to be replaced by
$$\psi_i \leq \psi_j, \ \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$$
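For intuition, here is a simplified greedy sketch of this constrained selection problem on a tree: grow a rooted subtree by repeatedly adding the highest-scoring node whose parent is already selected. This is only an approximation for illustration; the actual CSSA additionally condenses nodes into supernodes to guarantee an optimal solution. The scores (playing the role of $\widetilde{y}_i$) and the tree are made up:

```python
import heapq

def greedy_subtree(scores, parent, L):
    """Pick L nodes forming a rooted subtree, greedily by score.
    parent[i] < i and parent[0] = -1 (topological numbering).
    NOTE: a simplification of CSSA, not guaranteed to be optimal."""
    n = len(scores)
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)
    selected = {0}                      # psi_0 = 1: the root is always selected
    frontier = [(-scores[c], c) for c in children[0]]
    heapq.heapify(frontier)
    while len(selected) < L and frontier:
        _, i = heapq.heappop(frontier)  # best unselected node whose parent is selected
        selected.add(i)
        for c in children[i]:
            heapq.heappush(frontier, (-scores[c], c))
    return sorted(selected)

scores = [1.0, 0.2, 0.9, 0.8, 0.1]
parent = [-1, 0, 0, 2, 2]
print(greedy_subtree(scores, parent, L=3))  # [0, 2, 3]
```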
Extending the Flat Losses
This paper extends the Hamming loss and ranking loss to support the class hierarchy.
For the hierarchical Hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$
For a DAG class hierarchy, the costs become
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in \uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.
There are special cases in the original paper, but they are straightforward and not discussed here.
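A minimal sketch of this DAG cost assignment (assuming, as written above, that the sum runs over the parents of $i$ and that nodes are indexed in topological order; the example DAG is made up):

```python
def dag_costs(parents):
    """parents[i] is the list of parents of node i (empty for the root, node 0).
    Assumes topological numbering (every parent index < child index).
    c[0] = 1 and c[i] = sum over parents j of c[j] / (#children of j)."""
    n = len(parents)
    n_children = [0] * n
    for i in range(1, n):
        for j in parents[i]:
            n_children[j] += 1
    c = [0.0] * n
    c[0] = 1.0
    for i in range(1, n):
        c[i] = sum(c[j] / n_children[j] for j in parents[i])
    return c

# DAG: 0 -> {1, 2}, 1 -> {3, 4}, 2 -> {4} (node 4 has two parents).
print(dag_costs([[], [0], [0], [1], [1, 2]]))  # [1.0, 0.5, 0.5, 0.25, 0.75]
```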
For the hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j):y_i > y_j} c_{ij} \left(I(\hat{y}_i < \hat{y}_j) + \frac{1}{2}I(\hat{y}_i = \hat{y}_j)\right),$$
where $c_{ij} = c_i c_j$, which ensures a high penalty when an upper-level positive label is ranked after a lower-level negative label.
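A minimal sketch of the hierarchical ranking loss with $c_{ij} = c_i c_j$ (node costs, labels, and scores are made up); setting all $c_i = 1$ recovers the flat ranking loss above:

```python
def h_ranking_loss(scores, y, c):
    # Weighted ranking loss: each mis-ranked (or tied) pair (i, j) with
    # y_i > y_j is penalized by c_ij = c_i * c_j.
    loss = 0.0
    for i, yi in enumerate(y):
        for j, yj in enumerate(y):
            if yi > yj:
                c_ij = c[i] * c[j]
                if scores[i] < scores[j]:
                    loss += c_ij
                elif scores[i] == scores[j]:
                    loss += 0.5 * c_ij
    return loss

y      = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
c      = [1.0, 0.5, 0.5, 0.25]
print(h_ranking_loss(scores, y, c))  # 0.25: positive node 1 is ranked below negative node 2
```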
Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}},\mathbf{y})$ over all possible ground truths $\mathbf{y}$, and the goal is expected-risk minimization:
$$\argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}}) = \argmin_{\hat{\mathbf{y}} \in \Omega} \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}).$$
There are three issues to be addressed:
(1) Estimating $P(\mathbf{y}\mid\mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.
This paper computes the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule, and the risk is rewritten into a different form for each loss.
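A minimal sketch of the chain-rule computation of $p_i$ on a tree, assuming per-node conditional probabilities $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$ are available (e.g., from per-node classifiers; the numbers below are made up):

```python
def marginals(cond_prob, parent):
    """cond_prob[i] = P(y_i = 1 | y_parent(i) = 1, x), with cond_prob[0] = P(y_0 = 1 | x).
    parent[i] < i (topological numbering), parent[0] = -1.
    Under the tree hierarchy constraint, p_i = cond_prob[i] * p_parent(i)."""
    p = [0.0] * len(parent)
    p[0] = cond_prob[0]
    for i in range(1, len(parent)):
        p[i] = cond_prob[i] * p[parent[i]]
    return p

cond_prob = [1.0, 0.8, 0.3, 0.5]   # here the root is always positive
parent    = [-1, 0, 0, 1]
print(marginals(cond_prob, parent))  # [1.0, 0.8, 0.3, 0.4]
```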
The risk for matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} p_i\, \phi(i, \hat{\mathbf{y}}) + \beta \sum_{i: \hat{y}_i = 1} q_i,$$
where $q_i = \sum_{j=0}^{d(i)-1}\sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)} P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.
The risk for hierarchical hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} c_i p_i + \beta \sum_{i:\hat{y}_i=1} c_i(1 - p_i)$$
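Given the marginals $p_i$ and the node costs $c_i$, the hierarchical Hamming risk of a candidate prediction is a direct sum; a minimal sketch with made-up inputs:

```python
def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
    # Expected hierarchical Hamming loss: each predicted-negative node contributes
    # c_i * p_i (risk of a false negative), each predicted-positive node contributes
    # c_i * (1 - p_i) (risk of a false positive).
    risk = 0.0
    for i, yi_hat in enumerate(y_hat):
        if yi_hat == 0:
            risk += alpha * c[i] * p[i]
        else:
            risk += beta * c[i] * (1.0 - p[i])
    return risk

p = [1.0, 0.8, 0.3, 0.4]
c = [1.0, 0.5, 0.5, 0.25]
print(h_hamming_risk([1, 1, 0, 0], p, c))  # 0.35
```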
The risk for hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij}\left(p_i I (\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \frac{p_i+p_j}{2}I(\hat{y}_i = \hat{y}_j)\right) - C$$
Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1,\dots,N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}}\in \Omega:\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}}),$$
and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$.
This is actually a fairly simple and easy-to-understand optimization objective: optimize separately for each possible number of positive labels, i.e., for each different $L$.
This paper adopts CSSAG (the condensing sort and selection algorithm proposed by Bi) for tree label hierarchies, which is a greedy strategy.
Conclusions
This paper extends matching loss, hamming loss and ranking loss to support tree-type as well as DAG-type class hierarchies.
This paper seems easy to understand and not particularly innovative, but it is well organized and highly comprehensive, which is why it was published in TKDE.