Introduction
- Given a class hierarchy tree, the goal is to learn class feature representations such that the Euclidean distance between the centers of different classes approximates the shortest-path distance between the corresponding nodes in the hierarchy tree. To this end, the authors propose using the CPCC to regularize the output features of the samples.
Method
Tree Metric
- Given a class hierarchy tree $\mathcal T=(V,E,d)$, where $d$ is the distance function between nodes, the tree metric $d_{\mathcal T}(v,v')$ is defined as the length of the shortest weighted path between nodes $v$ and $v'$.
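As a quick illustration (not the authors' code), the tree metric on a toy two-level hierarchy can be computed as all-pairs shortest paths; the class names and unit edge weights below are assumptions:

```python
import networkx as nx

# Hypothetical toy hierarchy (class names are made up): root -> coarse -> fine,
# with unit edge weights, so d is simply the hop count between nodes.
T = nx.Graph()
T.add_edges_from(
    [("root", "animal"), ("root", "vehicle"),
     ("animal", "cat"), ("animal", "dog"),
     ("vehicle", "car"), ("vehicle", "bus")],
    weight=1.0,
)

# Tree metric d_T(v, v'): shortest weighted path length between two nodes.
d_T = dict(nx.all_pairs_dijkstra_path_length(T))
print(d_T["cat"]["dog"])  # 2.0  (cat -> animal -> dog)
print(d_T["cat"]["bus"])  # 4.0  (cat -> animal -> root -> vehicle -> bus)
```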
Structured Representations by Embedding the Tree Metric
- $d$ is the tree metric and $\rho$ is the distance between classes in feature space.
- The Cophenetic Correlation Coefficient (CPCC) is Pearson's correlation coefficient between two sequences of pairwise distances:
$$\operatorname{CPCC}(d, \rho):=\frac{\sum_{i<j}\left(d(v_i, v_j)-\bar{d}\right)\left(\rho(v_i, v_j)-\bar{\rho}\right)}{\left(\sum_{i<j}\left(d(v_i, v_j)-\bar{d}\right)^2\right)^{1/2}\left(\sum_{i<j}\left(\rho(v_i, v_j)-\bar{\rho}\right)^2\right)^{1/2}}\in[-1,1]$$
where $\bar{d}:=2\sum_{i<j} d(v_i, v_j)/k(k-1)$ and $\bar{\rho}:=2\sum_{i<j} \rho(v_i, v_j)/k(k-1)$ are the means of all pairwise distances under $d$ and $\rho$, respectively. $\rho(v_i, v_j)$ can be taken as the Euclidean distance between the mean feature of the class-$v_i$ samples and the mean feature of the class-$v_j$ samples, where the feature of a sample $x$ is $f_{\theta}(x)$.
- To reduce computation, the authors do not compute the CPCC over all samples in the dataset; instead, at each training iteration the CPCC is computed over a single batch of data. When all samples in a batch belong to the same class, the variance of $d$ is 0 and the CPCC cannot be computed; in that case its value is manually set to 0.
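A minimal PyTorch sketch of the batch-wise CPCC described above, assuming `tree_dist` holds the pairwise tree-metric distances between class ids and that `features` are the outputs of $f_\theta$; the zero-variance guard implements the single-class-batch rule:

```python
import torch

def batch_cpcc(features, labels, tree_dist):
    """CPCC between tree distances d and Euclidean distances rho of class means.

    features: (B, D) batch features f_theta(x); labels: (B,) integer class ids;
    tree_dist: dict mapping (class_i, class_j) with class_i < class_j to d_T.
    """
    classes = labels.unique()                      # classes present in this batch
    if classes.numel() < 2:                        # single-class batch: CPCC undefined
        return features.new_zeros(())
    # rho uses the mean feature (class center) of each class within the batch.
    centers = torch.stack([features[labels == c].mean(dim=0) for c in classes])
    d, rho = [], []
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            ci, cj = sorted((classes[i].item(), classes[j].item()))
            d.append(tree_dist[(ci, cj)])
            rho.append(torch.norm(centers[i] - centers[j]))
    d = torch.tensor(d, dtype=features.dtype, device=features.device)
    rho = torch.stack(rho)
    d_c, rho_c = d - d.mean(), rho - rho.mean()
    denom = d_c.norm() * rho_c.norm()
    if denom == 0:                                 # zero variance: set CPCC to 0
        return features.new_zeros(())
    return (d_c * rho_c).sum() / denom

# In training, the CPCC is added as a regularizer to be maximized alongside the
# classification loss (e.g., loss = CE - lambda * batch_cpcc(...)).
```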
The Benefits of Structured Representations
- Interpretability. CPCC regularization can potentially lead to more interpretable features, as closer classes (in the sense of the tree metric) are closer to each other in feature space.
- Generalization. If the hierarchy correctly captures the coarse-fine relationships, unseen fine-grained classes from a given coarse category will lie further away from classes under a different coarse category. This may help generalization to unseen fine-grained classes in zero- or few-shot learning.
Experiments
Structure of the Learnt Representations
- In the distance matrix, the within-coarse-cluster distances are much smaller than the between-coarse-cluster distances (corresponding to the $5\times5$ blocks along the diagonal).
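A small sketch (not from the paper) of how such a class-center distance matrix can be computed; ordering the fine classes by their coarse parent, with 5 fine classes per coarse class, is what makes the $5\times5$ diagonal blocks visible:

```python
import torch

def class_center_distance_matrix(features, fine_labels):
    """Pairwise Euclidean distances between fine-class centers.

    If the fine classes are ordered so that classes sharing a coarse parent are
    adjacent (e.g., 5 fine classes per coarse class), the coarse structure
    appears as low-distance 5x5 blocks along the diagonal.
    """
    classes = sorted(fine_labels.unique().tolist())
    centers = torch.stack([features[fine_labels == c].mean(dim=0) for c in classes])
    return torch.cdist(centers, centers)           # (K, K) distance matrix
```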
Generalization on Datasets with a Shared Hierarchy
- Baselines. (1) Flat $\ell_{CE}$; (2) Multi-task Learning: train two classification heads, one for fine-grained and one for coarse-grained classes, with the sum of the two CE losses as the objective; (3) Curriculum Learning: first train with the CE loss on the coarse-grained classes, then discard the classification head and retrain a new head with the CE loss on the fine-grained classes; (4) Sum Loss: $\sum \ell_{\mathrm{CE}}\left(y_{\text{coarse}}, \mathbf{W} h(x)\right)+\ell_{\mathrm{CE}}\left(y_{\text{fine}}, h(x)\right)$, where $\mathbf{W}\in\mathbb{R}^{k_1\times k_2}$ encodes the relation between labels: $\mathbf{W}_{ji}=1$ if fine class $i$ belongs to coarse class $j$; (5) HXE: the Hierarchical Cross Entropy; (6) Soft: the soft-labels objective; (7) Quad: the Quadruplet multi-task loss.
- Metrics. (1) Silhouette score, to measure the salience of clustering patterns at the coarse level; (2) CPCC, to measure how similar the overall representation structure is to the fine-coarse hierarchy; (3) FineAcc & CoarseAcc (in-hierarchy): classification accuracy on the fine-grained and coarse-grained classes, the two levels used during training; (4) MidAcc & CoarserAcc (out-of-hierarchy): the model never sees these two levels during training; at test time the predictions for these new levels are obtained by simple marginalization (e.g., the probability of a mid class is the sum of the probabilities of all fine classes that belong to it; see the marginalization sketch after this list).
- The authors also test whether CPCC-structured representations generalize to unseen classes (assume a model has learned that cats and dogs are animals; does knowing the animal concept help it understand giraffes or horses better?). Concretely, the model is trained on the mid and fine classes of the training set and then evaluated on the test set at the coarse and coarser levels as well as on new mid and fine classes. The coarse and coarser levels allow "zero-shot" transfer via marginalization, while for the new mid and fine classes the authors use one-shot generalization: $f_\theta$ is frozen and two new classification heads for the mid and fine levels are fine-tuned with one image per new class (see the one-shot sketch after this list). The results show that (1) under this subpopulation shift, using CPCC still outperforms the original loss functions at the coarse and coarser levels (zero-shot); (2) in one-shot generalization to new mid classes, CPCC often gives a large advantage: each coarse class contains only a few mid classes, so a model that classifies coarse classes well should intuitively also classify mid classes well, since features of samples within a coarse class are close together. The one exception is ENTITY13, where each coarse class contains many mid classes, so even when the coarse class is predicted correctly it is hard to tell its mid classes apart (it becomes difficult to recover the decision boundary of a mid class as it is close to that of other fine classes in the same coarse class); (3) CPCC regularization is often harmful to one-shot fine-level generalization due to coarse grouping: new fine classes are pulled close together at the coarse level, making them hard to linearly separate.
- OOD Detection. The OOD score is the probability that the model assigns a higher likelihood to the true class of a random CIFAR100 image than to the maximum class likelihood of a random CIFAR10 image (see the sketch after this list). Why an OOD task? The OOD task benefits from the more structured representation; see Hoffmann, David T., et al., "Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives," AAAI 2022.
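A minimal sketch of the marginalization used for the out-of-hierarchy metrics and the zero-shot transfer above; `fine_to_mid`, the mapping from each fine class to its mid class, is an assumed input:

```python
import torch
import torch.nn.functional as F

def marginalized_prediction(fine_logits, fine_to_mid):
    """Predict an unseen coarser level by summing fine-class probabilities.

    fine_logits: (B, num_fine) logits of the trained fine classifier.
    fine_to_mid: (num_fine,) long tensor mapping each fine class to its mid class.
    The probability of a mid class is the sum of the probabilities of the fine
    classes that belong to it.
    """
    fine_probs = F.softmax(fine_logits, dim=-1)
    num_mid = int(fine_to_mid.max().item()) + 1
    mid_probs = fine_probs.new_zeros(fine_probs.size(0), num_mid)
    mid_probs.index_add_(1, fine_to_mid, fine_probs)   # accumulate per mid class
    return mid_probs.argmax(dim=-1)
```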
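A rough sketch of the one-shot protocol for new mid/fine classes, assuming a frozen backbone `f_theta` and exactly one support image per new class; the optimizer and step count are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_one_shot_head(f_theta, support_x, support_y, num_classes, steps=100, lr=1e-2):
    """Fit a new linear head on frozen features, one labeled image per new class."""
    f_theta.eval()
    with torch.no_grad():
        feats = f_theta(support_x)                  # frozen features, backbone not updated
    head = nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(head(feats), support_y)
        loss.backward()
        opt.step()
    return head
```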
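A hedged sketch of the OOD score from the last bullet; estimating the probability by comparing every in-distribution image against every OOD image is an assumption:

```python
import torch
import torch.nn.functional as F

def ood_score(id_logits, id_labels, ood_logits):
    """Estimate P(true-class prob of an ID image > max class prob of an OOD image)."""
    p_true = F.softmax(id_logits, dim=-1).gather(1, id_labels.unsqueeze(1)).squeeze(1)
    p_max_ood = F.softmax(ood_logits, dim=-1).max(dim=-1).values
    # Average the comparison over all (ID, OOD) image pairs.
    return (p_true.unsqueeze(1) > p_max_ood.unsqueeze(0)).float().mean()
```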