HuggingFace Ships Another Model-Training Power Tool: Sparse Matrix Operations for Everyone

夕小瑶

Published 2020-09-30 12:05:00


Article link: https://blog.csdn.net/xixiaoyaoww/article/details/108892020


By rumor酱

Edited by YY

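When it comes to speeding up models, the first things that come to mind are distillation, (structured) pruning, and quantization (FP16). Sparse matrix operations, by contrast, have never been popular. The reasons are simple: first, there is no ready-made code at hand (laziness); second, even if you do use them, they are not necessarily faster than the dense matrix operations they replace.

Framework developers have not stood still, though. Not long ago, HuggingFace happily announced support for sparse matrix operations: 75% sparsity buys you 1/4 of the memory and a 2x speedup!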

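This is fairly exciting news. A sparse matrix leaves the zero values out of storage, and the computation does not need to produce any result that only involves zeros. In principle, then, sparse matrices can speed up computation significantly and save a lot of storage.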
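To make the storage and compute argument concrete, here is a minimal pure-Python sketch. It is a toy illustration, not the layout used by HuggingFace's kernels: the 2x2 block size and the helper names `to_block_sparse` and `block_sparse_matvec` are assumptions made for readability. Only blocks containing a non-zero value are stored, and the matrix-vector product touches only those blocks.

```python
# Illustrative only: a toy block-sparse matrix that keeps just its non-zero blocks.
BLOCK = 2  # toy block size (assumption, chosen for readability)

def to_block_sparse(dense, block=BLOCK):
    """Return {(block_row, block_col): tile} for every tile that contains a non-zero."""
    rows, cols = len(dense), len(dense[0])
    blocks = {}
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            tile = [row[bc:bc + block] for row in dense[br:br + block]]
            if any(v != 0 for r in tile for v in r):
                blocks[(br // block, bc // block)] = tile
    return blocks

def block_sparse_matvec(blocks, x, n_rows, block=BLOCK):
    """Compute y = A @ x while visiting only the stored blocks."""
    y = [0.0] * n_rows
    for (br, bc), tile in blocks.items():
        for i, row in enumerate(tile):
            for j, v in enumerate(row):
                y[br * block + i] += v * x[bc * block + j]
    return y

# A 4x4 matrix with one non-zero 2x2 block out of four: 75% block sparsity.
dense = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
blocks = to_block_sparse(dense)
stored = sum(len(r) for tile in blocks.values() for r in tile)  # values actually kept
total = len(dense) * len(dense[0])                              # values in the dense form
y = block_sparse_matvec(blocks, [1, 1, 1, 1], n_rows=4)
print(stored, total, y)
```

In this toy case a quarter of the values are stored and a quarter of the multiply-adds are performed, which is where the theoretical 4x figure for 75% sparsity comes from.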

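The first reaction from veterans, though, will surely be: the efficiency is nice, but what about the quality (accuracy)?

Just so-so... (note the highlighted word "modest" in the announcement pictured above; the results really do seem unremarkable, otherwise they would simply have posted the numbers =.=)

Anyway, even if the accuracy leaves something to be desired, the speed alone is already very good. Progress comes one step at a time, and given how fast HuggingFace moves, more should follow.

Attentive readers will be puzzled by now: why does a 4x compression yield only a 2x speedup?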

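The pytorch_block_sparse^[1] GitHub repository explains this in detail: the main reason is that the CUTLASS library it currently uses is not yet fast enough.

Before going further, here are a few basics of GPU programming: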

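CUDA (Compute Unified Device Architecture): Nvidia's programming platform, which turns programs written in C++ and similar languages into GPU instructions.
BLAS (Basic Linear Algebra Subprograms): an API standard for linear algebra computation.
cuBLAS: a GPU BLAS library implemented in CUDA. Frameworks such as PyTorch and TensorFlow are built on a collection of CUDA libraries like this one. cuBLAS only handles dense matrix operations, and it is already very well tuned for the GPU. That is why sparse matrices used to get so little attention: going sparse means giving up cuBLAS and adding extra logic, which may well end up slower than simply running the dense computation through cuBLAS.
CUTLASS (CUDA Templates for Linear Algebra Subroutines): a collection of CUDA C++ templates for implementing a wider variety of matrix multiplication (GEMM) computations on CUDA.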
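For readers unfamiliar with the term, GEMM in the BLAS standard denotes the operation C ← αAB + βC. The following is a plain-Python reference sketch of that definition only; the function name `gemm` and the list-of-lists representation are choices made for illustration, and the code bears no resemblance to how cuBLAS or CUTLASS implement it on a GPU.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM: return alpha * (a @ b) + beta * c for list-of-lists matrices."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = gemm(2, a, b, 1, c)  # 2 * (a @ b) + 1 * c
print(result)
```

A `Linear` layer's forward pass is one such matrix multiplication, which is why the speed of the underlying GEMM kernel dominates the numbers discussed here.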

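To implement sparse matrices, HuggingFace chose CUTLASS, which is itself roughly 2x slower than cuBLAS at matrix multiplication. So although 75% sparsity should in theory give a 4x speedup, the measured gain is only 2x.

This suggests that a purpose-built sparse matrix library, developed in depth, could push the speed further.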
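The arithmetic behind that statement can be written out as a back-of-the-envelope sketch. The function `expected_speedup` is hypothetical, and the model is deliberately simplified: it assumes compute time scales linearly with density and that the kernel slowdown is a constant factor, ignoring memory bandwidth and launch overhead.

```python
def expected_speedup(density, kernel_slowdown):
    """Speedup over a dense cuBLAS baseline under a simplified linear cost model.

    density:         fraction of the weight matrix that is kept (0.25 for 75% sparsity)
    kernel_slowdown: how many times slower the sparse-capable kernel is than the baseline
    """
    ideal = 1.0 / density           # work saved by skipping zero blocks
    return ideal / kernel_slowdown  # minus the penalty of the slower kernel

speedup = expected_speedup(density=0.25, kernel_slowdown=2.0)
print(speedup)
```

With a density of 0.25 and a kernel that is 2x slower, the ideal 4x shrinks to 2x, matching the figure quoted in the announcement.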

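For those who want to try it out, HuggingFace cares as much as ever about a "works out of the box" experience and offers two ways to use it:

When writing your own network, you can simply replace a Linear layer with BlockSparseLinear: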

# from torch.nn import Linear
from pytorch_block_sparse import BlockSparseLinear

# self.fc = nn.Linear(1024, 256)
self.fc = BlockSparseLinear(1024, 256, density=0.1)
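To get a feel for what `density=0.1` means for this 1024 x 256 layer, the sketch below estimates the number of stored weights. It rests on two assumptions that are not stated in this article: that the weight matrix is divided into 32x32 blocks, and that the number of kept blocks is obtained by simple rounding. The helper `block_sparse_weight_count` is hypothetical and the library's real bookkeeping may differ.

```python
def block_sparse_weight_count(in_features, out_features, density, block=32):
    """Estimate stored weights for a block-sparse layer (block size is an assumption)."""
    assert in_features % block == 0 and out_features % block == 0
    total_blocks = (in_features // block) * (out_features // block)
    kept_blocks = max(1, round(total_blocks * density))  # rounding rule is an assumption
    return kept_blocks * block * block

dense_weights = 1024 * 256
sparse_weights = block_sparse_weight_count(1024, 256, density=0.1)
print(dense_weights, sparse_weights, sparse_weights / dense_weights)
```

Under these assumptions the layer keeps 26 of its 256 blocks, or about a tenth of the dense weight count.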

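To convert a network someone else has already written, you can patch the whole model directly. Unfortunately the parameters cannot be converted automatically, so the model has to be retrained.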

from pytorch_block_sparse import BlockSparseModelPatcher

# `model` is assumed to be an already-loaded RoBERTa model; it is not defined in this snippet.

# Create a model patcher
mp = BlockSparseModelPatcher()

# Select some layers to sparsify.
# This is the "artful" part: some layers tolerate sparsification well, others may hurt model precision too much.

# Match layers by regular expression. Raw strings (r"...") keep the escaped dots literal;
# [0-9]+ matches any layer number.
# A density of 0.5 is set on these layers; you can test other layers / densities.
mp.add_pattern(r"roberta\.encoder\.layer\.[0-9]+\.intermediate\.dense", {"density": 0.5})
mp.add_pattern(r"roberta\.encoder\.layer\.[0-9]+\.output\.dense", {"density": 0.5})
mp.add_pattern(r"roberta\.encoder\.layer\.[0-9]+\.attention\.output\.dense", {"density": 0.5})
mp.patch_model(model)

print(f"Final model parameters count={model.num_parameters()}")

# => 68 million parameters instead of 84 million parameters (embeddings take a lot of space in RoBERTa)
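It can help to check which layers a set of patterns would select before patching anything. The sketch below uses only Python's standard `re` module on a hand-written list of module names. The names are hypothetical examples in the style of `model.named_modules()`, and the use of `re.fullmatch` is an assumption; the patcher's exact matching rule is not described in this article.

```python
import re

patterns = [
    r"roberta\.encoder\.layer\.[0-9]+\.intermediate\.dense",
    r"roberta\.encoder\.layer\.[0-9]+\.output\.dense",
    r"roberta\.encoder\.layer\.[0-9]+\.attention\.output\.dense",
]

# Hypothetical module names, written by hand for illustration.
module_names = [
    "roberta.embeddings.word_embeddings",
    "roberta.encoder.layer.0.attention.self.query",
    "roberta.encoder.layer.0.attention.output.dense",
    "roberta.encoder.layer.0.intermediate.dense",
    "roberta.encoder.layer.0.output.dense",
    "roberta.encoder.layer.11.output.dense",
    "lm_head.dense",
]

# Keep a name if any pattern matches it in full.
selected = [n for n in module_names if any(re.fullmatch(p, n) for p in patterns)]
for name in selected:
    print(name)
```

Here the embeddings, the attention query projection, and the language-model head are left dense, which is consistent with the note above that embeddings account for much of the remaining parameter count.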

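So far HuggingFace has taken only a small step. CUTLASS will keep improving, and the authors plan to reproduce more results from the research literature. Beyond HuggingFace, OpenAI announced in early 2020 that it would port some of its TensorFlow compute code to PyTorch, and the June paper from Google and Stanford, Sparse GPU Kernels for Deep Learning^[2], also promised to release its source code. It is time to put sparse matrix optimization on your study list.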


References

[1] pytorch_block_sparse:
https://github.com/huggingface/pytorch_block_sparse
[2] Sparse GPU Kernels for Deep Learning:
https://arxiv.org/abs/2006.10901