Introduction
- Given a class hierarchy tree, the goal is to learn class feature representations such that the Euclidean distance between the centers of different classes approximates the shortest-path distance between the corresponding nodes in the hierarchy tree. To this end, the authors propose using the CPCC to regularize the output features of the samples.
Method
Tree Metric
- Given a class hierarchy tree $\mathcal T=(V,E,d)$, where $d$ is the distance function between nodes, the tree metric $d_{\mathcal T}(v,v')$ is defined as the length of the shortest weighted path between nodes $v$ and $v'$.
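As a quick illustration (not the authors' code), the tree metric on a toy two-level hierarchy can be computed as all-pairs shortest paths; the class names and unit edge weights below are assumptions:

```python
import networkx as nx

# Hypothetical toy hierarchy (class names are made up): root -> coarse -> fine,
# with unit edge weights, so d is simply the hop count between nodes.
T = nx.Graph()
T.add_edges_from(
    [("root", "animal"), ("root", "vehicle"),
     ("animal", "cat"), ("animal", "dog"),
     ("vehicle", "car"), ("vehicle", "bus")],
    weight=1.0,
)

# Tree metric d_T(v, v'): shortest weighted path length between two nodes.
d_T = dict(nx.all_pairs_dijkstra_path_length(T))
print(d_T["cat"]["dog"])  # 2.0  (cat -> animal -> dog)
print(d_T["cat"]["bus"])  # 4.0  (cat -> animal -> root -> vehicle -> bus)
```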
Structured Representations by Embedding the Tree Metric
- $d$ is the tree metric and $\rho$ is the distance between classes in feature space.
- The Cophenetic Correlation Coefficient (CPCC) is Pearson's correlation coefficient between two sequences of pairwise distances:
$$\operatorname{CPCC}(d, \rho):=\frac{\sum_{i<j}\left(d(v_i, v_j)-\bar{d}\right)\left(\rho(v_i, v_j)-\bar{\rho}\right)}{\left(\sum_{i<j}\left(d(v_i, v_j)-\bar{d}\right)^2\right)^{1/2}\left(\sum_{i<j}\left(\rho(v_i, v_j)-\bar{\rho}\right)^2\right)^{1/2}}\in[-1,1]$$
where $\bar{d}:=2\sum_{i<j} d(v_i, v_j)/k(k-1)$ and $\bar{\rho}:=2\sum_{i<j} \rho(v_i, v_j)/k(k-1)$ are the means of all pairwise distances under $d$ and $\rho$, respectively. $\rho(v_i, v_j)$ can be taken as the Euclidean distance between the mean feature of the class-$v_i$ samples and the mean feature of the class-$v_j$ samples, where the feature of a sample $x$ is $f_{\theta}(x)$.
- To reduce computation, the authors do not compute the CPCC over all samples in the dataset; instead, at each training iteration the CPCC is computed over a single batch of data. When all samples in a batch belong to the same class, the variance of $d$ is 0 and the CPCC cannot be computed; in that case its value is manually set to 0.
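A minimal PyTorch sketch of the batch-wise CPCC described above, assuming `tree_dist` holds the pairwise tree-metric distances between class ids and that `features` are the outputs of $f_\theta$; the zero-variance guard implements the single-class-batch rule:

```python
import torch

def batch_cpcc(features, labels, tree_dist):
    """CPCC between tree distances d and Euclidean distances rho of class means.

    features: (B, D) batch features f_theta(x); labels: (B,) integer class ids;
    tree_dist: dict mapping (class_i, class_j) with class_i < class_j to d_T.
    """
    classes = labels.unique()                      # classes present in this batch
    if classes.numel() < 2:                        # single-class batch: CPCC undefined
        return features.new_zeros(())
    # rho uses the mean feature (class center) of each class within the batch.
    centers = torch.stack([features[labels == c].mean(dim=0) for c in classes])
    d, rho = [], []
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            ci, cj = sorted((classes[i].item(), classes[j].item()))
            d.append(tree_dist[(ci, cj)])
            rho.append(torch.norm(centers[i] - centers[j]))
    d = torch.tensor(d, dtype=features.dtype, device=features.device)
    rho = torch.stack(rho)
    d_c, rho_c = d - d.mean(), rho - rho.mean()
    denom = d_c.norm() * rho_c.norm()
    if denom == 0:                                 # zero variance: set CPCC to 0
        return features.new_zeros(())
    return (d_c * rho_c).sum() / denom

# In training, the CPCC is added as a regularizer to be maximized alongside the
# classification loss (e.g., loss = CE - lambda * batch_cpcc(...)).
```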
The Benefits of Structured Representations
- Interpretability. CPCC regularization can potentially lead to more interpretable features, as closer classes (in the sense of the tree metric) are closer to each other in feature space.
- Generalization. If the hierarchy correctly captures the coarse-fine relationships, unseen fine-grained classes from a given coarse category will lie further away from classes under a different coarse category. This may help generalization to unseen fine-grained classes in zero- or few-shot learning.
Experiments
Structure of the Learnt Representations
- In the distance matrix, the within-coarse-cluster distances are much smaller than the between-coarse-cluster distances (corresponding to the $5\times5$ blocks along the diagonal).
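A small sketch (not from the paper) of how such a class-center distance matrix can be computed; ordering the fine classes by their coarse parent, with 5 fine classes per coarse class, is what makes the $5\times5$ diagonal blocks visible:

```python
import torch

def class_center_distance_matrix(features, fine_labels):
    """Pairwise Euclidean distances between fine-class centers.

    If the fine classes are ordered so that classes sharing a coarse parent are
    adjacent (e.g., 5 fine classes per coarse class), the coarse structure
    appears as low-distance 5x5 blocks along the diagonal.
    """
    classes = sorted(fine_labels.unique().tolist())
    centers = torch.stack([features[fine_labels == c].mean(dim=0) for c in classes])
    return torch.cdist(centers, centers)           # (K, K) distance matrix
```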
Generalization on Datasets with a Shared Hierarchy
- Baselines. (1) Flat $\ell_{CE}$; (2) Multi-task Learning: train two classification heads, one for fine-grained and one for coarse-grained classes, with the sum of the two CE losses as the objective; (3) Curriculum Learning: first train with the CE loss on the coarse-grained classes, then discard the classification head and retrain a new head with the CE loss on the fine-grained classes; (4) Sum Loss: $\sum \ell_{\mathrm{CE}}\left(y_{\text{coarse}}, \mathbf{W} h(x)\right)+\ell_{\mathrm{CE}}\left(y_{\text{fine}}, h(x)\right)$, where $\mathbf{W}\in\mathbb{R}^{k_1\times k_2}$ encodes the relation between labels: $\mathbf{W}_{ji}=1$ if fine class $i$ belongs to coarse class $j$; (5) HXE: the Hierarchical Cross Entropy; (6) Soft: the soft-labels objective; (7) Quad: the Quadruplet multi-task loss.
- Metrics. (1) Silhouette score, to measure the salience of clustering patterns at the coarse level; (2) CPCC, to measure how similar the overall representation structure is to the fine-coarse hierarchy; (3) FineAcc & CoarseAcc (in-hierarchy): classification accuracy on the fine-grained and coarse-grained classes, the two levels used during training; (4) MidAcc & CoarserAcc (out-of-hierarchy): the model never sees these two levels during training; at test time the predictions for these new levels are obtained by simple marginalization (e.g., the probability of a mid class is the sum of the probabilities of all fine classes that belong to it; see the marginalization sketch after this list).
- The authors also test whether CPCC-structured representations generalize to unseen classes (assume a model has learned that cats and dogs are animals; does knowing the animal concept help it understand giraffes or horses better?). Concretely, the model is trained on the mid and fine classes of the training set and then evaluated on the test set at the coarse and coarser levels as well as on new mid and fine classes. The coarse and coarser levels allow "zero-shot" transfer via marginalization, while for the new mid and fine classes the authors use one-shot generalization: $f_\theta$ is frozen and two new classification heads for the mid and fine levels are fine-tuned with one image per new class (see the one-shot sketch after this list). The results show that (1) under this subpopulation shift, using CPCC still outperforms the original loss functions at the coarse and coarser levels (zero-shot); (2) in one-shot generalization to new mid classes, CPCC often gives a large advantage: each coarse class contains only a few mid classes, so a model that classifies coarse classes well should intuitively also classify mid classes well, since features of samples within a coarse class are close together. The one exception is ENTITY13, where each coarse class contains many mid classes, so even when the coarse class is predicted correctly it is hard to tell its mid classes apart (it becomes difficult to recover the decision boundary of a mid class as it is close to that of other fine classes in the same coarse class); (3) CPCC regularization is often harmful to one-shot fine-level generalization due to coarse grouping: new fine classes are pulled close together at the coarse level, making them hard to linearly separate.
- OOD Detection. The OOD score is the probability that the model assigns a higher likelihood to the true class of a random CIFAR100 image than to the maximum class likelihood of a random CIFAR10 image (see the sketch after this list). Why an OOD task? The OOD task benefits from the more structured representation; see Hoffmann, David T., et al., "Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives," AAAI 2022.
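A minimal sketch of the marginalization used for the out-of-hierarchy metrics and the zero-shot transfer above; `fine_to_mid`, the mapping from each fine class to its mid class, is an assumed input:

```python
import torch
import torch.nn.functional as F

def marginalized_prediction(fine_logits, fine_to_mid):
    """Predict an unseen coarser level by summing fine-class probabilities.

    fine_logits: (B, num_fine) logits of the trained fine classifier.
    fine_to_mid: (num_fine,) long tensor mapping each fine class to its mid class.
    The probability of a mid class is the sum of the probabilities of the fine
    classes that belong to it.
    """
    fine_probs = F.softmax(fine_logits, dim=-1)
    num_mid = int(fine_to_mid.max().item()) + 1
    mid_probs = fine_probs.new_zeros(fine_probs.size(0), num_mid)
    mid_probs.index_add_(1, fine_to_mid, fine_probs)   # accumulate per mid class
    return mid_probs.argmax(dim=-1)
```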
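A rough sketch of the one-shot protocol for new mid/fine classes, assuming a frozen backbone `f_theta` and exactly one support image per new class; the optimizer and step count are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_one_shot_head(f_theta, support_x, support_y, num_classes, steps=100, lr=1e-2):
    """Fit a new linear head on frozen features, one labeled image per new class."""
    f_theta.eval()
    with torch.no_grad():
        feats = f_theta(support_x)                  # frozen features, backbone not updated
    head = nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(head(feats), support_y)
        loss.backward()
        opt.step()
    return head
```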
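A hedged sketch of the OOD score from the last bullet; estimating the probability by comparing every in-distribution image against every OOD image is an assumption:

```python
import torch
import torch.nn.functional as F

def ood_score(id_logits, id_labels, ood_logits):
    """Estimate P(true-class prob of an ID image > max class prob of an OOD image)."""
    p_true = F.softmax(id_logits, dim=-1).gather(1, id_labels.unsqueeze(1)).squeeze(1)
    p_max_ood = F.softmax(ood_logits, dim=-1).max(dim=-1).values
    # Average the comparison over all (ID, OOD) image pairs.
    return (p_true.unsqueeze(1) > p_max_ood.unsqueeze(0)).float().mean()
```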