Knowledge Distillation Principles and PVKD Paper Reading Notes

昼行plus

Last modified: 2022-11-05 12:47:16

Category: Object Detection and Recognition. Tags: paper reading

First published: 2022-11-03 22:14:46

Article link: https://blog.csdn.net/weixin_48592526/article/details/127665058

Contents

Knowledge Distillation (Distilling the Knowledge)
- 1. Basic Concepts
- 2. Network Model and Loss Functions
PVKD (Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation)
Reference

Knowledge Distillation (Distilling the Knowledge)

The slides are taken from a Bilibili video.

1. Basic Concepts

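My understanding: the original idea of knowledge distillation is to transfer the knowledge of a high-accuracy, complex network into a small network (which on its own may be less accurate) so that it is easier to deploy?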

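1. After a temperature T is introduced, a larger T makes the softmax output softer. But what is the point of that?
2. When the two models differ greatly, the temperature should be lower. Why?

Both questions are answered later, after the network structure diagram of the distillation process, based on my own understanding.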
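To make the effect of the temperature concrete, here is a minimal sketch of a softmax with temperature. The function name `softmax_with_temperature` and the example logits are assumptions made for illustration; they do not come from the slides.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a flatter (softer) distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes.
logits = [5.0, 2.0, 0.5]
for T in (1.0, 4.0, 10.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

As T grows, the largest probability shrinks and the smaller ones grow, while the ranking of the classes stays the same. The relative sizes of the non-target probabilities are the extra information a student can learn from.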

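2. Network Model and Loss Functions

[Figure: network structure diagram of the distillation process]

Loss ① exists to let the student learn from the teacher's experience; the precondition is that the teacher must be strong enough.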

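After a temperature T is introduced, a larger T makes the softmax output softer. But what is the point of that?
When the two models differ greatly, the temperature should be lower. Why?

Looking at the network structure diagram, the student loss is just the ordinary supervised loss with T = 1, so its strength is unchanged.
The distillation loss uses T = t. When T is large, there is more tolerance for differences between the student's and the teacher's predictions, as if to say "you may learn from me, but do not copy me exactly."
As for question 2, when the two models differ greatly, a tiny change in the input may cause a large difference between their outputs. Lowering T in this case pushes the student's output toward fitting the teacher and reduces its own "individuality."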
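The two losses described here can be sketched as follows. This is a minimal illustration, not code from the slides or from PVKD: the weighting factor `alpha`, the default `T = 4.0`, and the function names are assumptions. The factor T² on the soft term follows the recommendation in Hinton et al. to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    # Distillation loss: teacher and student are softened with the same T.
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    distill = (T ** 2) * kl_divergence(soft_teacher, soft_student)
    # Student loss: ordinary cross-entropy against the hard label, at T = 1.
    hard = -math.log(softmax(student_logits, 1.0)[label] + 1e-12)
    # alpha is a hypothetical mixing weight between the two terms.
    return alpha * distill + (1.0 - alpha) * hard

print(kd_loss([2.0, 1.0, 0.1], [3.0, 1.0, 0.0], label=0))
```

When the student's logits equal the teacher's, the distillation term is zero and only the hard-label term remains.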

The uploader also ran a test of their own; as expected, the student's accuracy did not exceed the teacher's.

The core question: under knowledge distillation, can the student surpass the teacher?

PVKD seems to answer yes.

PVKD (Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation)

From the title, one might assume the paper fuses point-based and voxel-based methods.
It does not.

Main points of the paper:

To address the difficulty of distilling knowledge from point cloud, namely sparsity, randomness and varying density, we propose the point-to-voxel knowledge distillation.
We put forward the supervoxel partition to make the affinity distillation tractable.
On these supervoxels, we propose inter-point and inter-voxel affinity distillation, where the similarity information between points and voxels can help the student model better capture the structural information of the surrounding environment.
A difficulty-aware sampling strategy is also employed to more frequently sample supervoxels that contain minority classes and distant objects, thus remarkably enhancing the distillation efficacy on these hard cases.
Notably, on the challenging nuScenes and SemanticKITTI datasets, our method can achieve roughly 75% MACs reduction and 2× speedup on the competitive Cylinder3D model and rank 1st on the SemanticKITTI leaderboard among all published algorithms.

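The figure below shows that the proposed PVD method is more accurate, especially for small and distant objects.
[Figure: segmentation results of the proposed PVD method]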

Next, we go through the first three points in detail.

I. Basic Principles

1.1 Point-to-voxel knowledge distillation

[Formula image: point-to-voxel output distillation loss]

$C$: number of labels (classes)
$N$: number of points in the point cloud
$R, A, H$: the number of voxels is $R \times A \times H$
$KL(\cdot)$: KL divergence
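The loss formula for this subsection is only present as an image. A form that is consistent with the symbols listed here is sketched below. It is an assumption, not a transcription: the output symbols $O_S^{p}, O_T^{p}$ (student and teacher pointwise outputs) and $O_S^{v}, O_T^{v}$ (voxelwise outputs), the order of the arguments inside $KL$, and the normalizing constants are all introduced here for illustration and should be checked against the paper.

$$
\mathcal{L}_{out}^{p} = \frac{1}{N C} \sum_{n=1}^{N} \sum_{c=1}^{C} KL\left(O_S^{p}(n, c) \,\|\, O_T^{p}(n, c)\right)
$$

$$
\mathcal{L}_{out}^{v} = \frac{1}{R A H C} \sum_{r=1}^{R} \sum_{a=1}^{A} \sum_{h=1}^{H} \sum_{c=1}^{C} KL\left(O_S^{v}(r, a, h, c) \,\|\, O_T^{v}(r, a, h, c)\right)
$$

The first term compares the two models point by point, the second voxel by voxel, which is where the "point-to-voxel" in the name comes from.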

1.2 Supervoxel partition

We divide the whole point cloud into several supervoxels whose size is $R_s \times A_s \times H_s$. Each supervoxel is comprised of a fixed number of voxels and the total number of supervoxels is $N_s = \lceil \frac{R}{R_s} \rceil \times \lceil \frac{A}{A_s} \rceil \times \lceil \frac{H}{H_s} \rceil$, where $\lceil \cdot \rceil$ is the ceiling function. We will sample K supervoxels to perform the affinity distillation.
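The supervoxel count and the mapping from a voxel to its supervoxel can be sketched in a few lines. The function names and the grid and supervoxel sizes in the example are hypothetical values chosen for illustration; the source gives no concrete numbers.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)

def num_supervoxels(R, A, H, Rs, As, Hs):
    """N_s = ceil(R / R_s) * ceil(A / A_s) * ceil(H / H_s)."""
    return ceil_div(R, Rs) * ceil_div(A, As) * ceil_div(H, Hs)

def supervoxel_index(r, a, h, Rs, As, Hs):
    """Map a voxel index (r, a, h) to the index of the supervoxel containing it."""
    return (r // Rs, a // As, h // Hs)

# Hypothetical voxel grid and supervoxel size.
print(num_supervoxels(480, 360, 32, 60, 60, 8))
print(supervoxel_index(59, 60, 31, 60, 60, 8))
```

The ceiling matters when the grid size is not a multiple of the supervoxel size: the last supervoxel along that axis is then only partially filled but still counted.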

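1.3 Difficulty-aware sampling strategy

First, the purpose of sampling needs to be clear (my own understanding):

The number of supervoxels is certainly not small. If the affinity distillation loss were computed for every supervoxel, then given how PVKD computes it (Section 1.4 of this post) and the sheer size of a point cloud, the computation would likely become intractable.

To make supervoxels that contain less frequent classes and faraway objects more likely to be sampled, we present the difficulty-aware sampling strategy.

The weight of the i-th supervoxel is computed as follows:
[Formula image: weight of the i-th supervoxel]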
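The weight formula itself is not reproduced in the text, so the sketch below only illustrates the sampling step: drawing K distinct supervoxels with probability proportional to a given weight. The weights in the example are made-up numbers, and sampling without replacement is an assumption; the paper may do this differently.

```python
import random

def sample_supervoxels(weights, K, seed=None):
    """Draw up to K distinct supervoxel indices, each with probability
    proportional to its weight among those not yet chosen.
    Assumes all weights are positive."""
    rng = random.Random(seed)
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(min(K, len(remaining))):
        w = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Hypothetical weights: supervoxel 2 is "hard" (rare class or far away).
weights = [0.1, 0.1, 5.0, 0.1]
print(sample_supervoxels(weights, K=2, seed=0))
```

With these weights, supervoxel 2 is chosen far more often than the others, which is the intended effect of weighting hard cases more heavily.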

1.4 Point/voxel features processing

Distilling the knowledge of the pointwise and voxelwise outputs is insufficient as it merely considers the knowledge of each element and fails to capture the structural information of the surrounding environment.

As to the calculation of loss function, it is desirable to keep the number of features fixed. Hence, we set the number of retained point features and non-empty voxel features as $N_p$ and $N_v$, respectively.

If the number of point features is larger than $N_p$, then $N_p$ point features are retained by randomly discarding the extra point features of the majority classes.
If the number of point features is smaller than $N_p$, all-zero features are appended to the current features to obtain $N_p$ features, as shown in Figure 3 (a).
Voxel features are processed in a similar way. (A rather crude approach, frankly.)
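The drop-or-pad rule described above can be sketched as follows. This is an illustration under stated assumptions, not PVKD's code: the function name, the `majority_classes` argument, and the fallback of dropping arbitrary points when there are too few majority-class points are all hypothetical.

```python
import random

def fix_feature_count(features, labels, majority_classes, target, seed=None):
    """Return exactly `target` feature vectors by dropping or zero-padding."""
    rng = random.Random(seed)
    n = len(features)
    if n > target:
        excess = n - target
        majority_idx = [i for i, y in enumerate(labels) if y in majority_classes]
        # Prefer dropping majority-class points; fall back to any point if
        # there are not enough of them (this fallback is an assumption).
        pool = majority_idx if len(majority_idx) >= excess else list(range(n))
        drop = set(rng.sample(pool, excess))
        return [f for i, f in enumerate(features) if i not in drop]
    # Too few (or exactly enough) features: append all-zero vectors.
    dim = len(features[0]) if features else 0
    padding = [[0.0] * dim for _ in range(target - n)]
    return [list(f) for f in features] + padding

feats = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
labels = [0, 0, 1, 0]  # class 0 is treated as the majority class here
print(fix_feature_count(feats, labels, {0}, target=2, seed=1))
print(fix_feature_count(feats[:1], labels[:1], {0}, target=3))
```

Dropping only from majority classes keeps the rare-class points, which are the ones the distillation most needs to see.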

[Figure 3: processing of point and voxel features]

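Figure 4 shows that PVD brings the student's affinity map closer to the teacher's. With PVD, features of the same class are pulled closer together while features of different classes are pushed apart in feature space, yielding a very clear affinity map. Compared with the CD method, PVD transfers structural knowledge from teacher to student more effectively, which strongly supports the advantage of PVD in distilling LiDAR segmentation models.
[Figure 4: affinity maps of the student and the teacher]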
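To make "affinity map" concrete, here is a minimal sketch that builds a pairwise similarity matrix from a set of features and compares the student's matrix with the teacher's. Cosine similarity and the mean squared difference are assumptions made for illustration; the exact definitions used by PVKD are not given in the text.

```python
import math

def cosine_affinity(features):
    """Pairwise cosine similarity matrix of a list of feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:  # zero-padded rows stay at 0
                dot = sum(a * b for a, b in zip(features[i], features[j]))
                aff[i][j] = dot / (norms[i] * norms[j])
    return aff

def affinity_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher affinity maps.
    Assumes both lists are non-empty and have the same number of features."""
    a_s = cosine_affinity(student_feats)
    a_t = cosine_affinity(teacher_feats)
    n = len(a_s)
    total = sum((a_s[i][j] - a_t[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
print(affinity_loss(student, teacher))
```

An affinity matrix has one row and one column per feature, regardless of the feature dimension, so student and teacher can be compared even when their channel widths differ. Its size grows quadratically with the number of features, which helps explain why the number of features is fixed and only K supervoxels are sampled.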

Question: is the teacher's affinity map really this clear? And if so, how does PVD surpass the teacher (Cylinder3D)?

1.5 Final objective

[Formula image: final objective]

II. Experiments

2.1 Implementation details

[Image: implementation details from the paper]

2.2 Experimental results

[Images: experimental result tables from the paper]

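These results all support the earlier idea: knowledge distillation cannot do better than the original teacher model; at best it matches it.

That makes the asterisk in Table 1 well worth attention!
The paper explains it as follows:
[Image: the paper's explanation of the asterisk]
This can be understood as applying some data augmentation. So what if Cylinder3D were given the same augmentation?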

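III. Code

I had hoped to find the code corresponding to each of the key points above and analyze it.
Unfortunately, PVKD's open-source code is almost identical to Cylinder3D's! None of the key methods proposed by PVKD, that is, everything in Part I (Basic Principles), has been released!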

Reference

Hou Y, Zhu X, Ma Y, et al. Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 8479-8488.
Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015, 2(7).
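【论文精讲|无废话版】知识蒸馏 (Bilibili video; title in English: "Paper walkthrough, no-nonsense edition: Knowledge Distillation")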