Machine Learning Weekly Report: Week 18

Ramos_zl

Tags: machine learning, artificial intelligence

Article link: https://blog.csdn.net/2301_78609379/article/details/134163013

Abstract

Knowledge distillation is a model compression technique that compresses a large, complex neural network into a small, efficient one by training a small student model on the outputs of a large teacher model. Transferring the teacher's "knowledge", that is, its output probability distribution, to the student improves the student's performance and generalization. Pruning is a neural network optimization technique that reduces model size and computational complexity while maintaining predictive performance: removing unnecessary neurons or connections slims the model and lowers its storage and compute requirements. Quantization converts the weights and activations of a neural network from floating-point numbers to lower bit-width integers or fixed-point numbers. This reduces storage requirements and improves computational efficiency, which is especially useful in resource-constrained environments such as embedded devices.

1. Literature review: A survey of techniques for optimizing transformer inference

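Paper summary:
This paper surveys techniques for optimizing the inference phase of Transformer networks. It covers algorithm-level methods (knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design), hardware-level optimizations, and novel hardware accelerators designed for Transformers. It also summarizes quantitative results on parameter/FLOP counts and accuracy for many models and techniques, and outlines future directions for this fast-moving research area.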

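Background:
1. Context: In recent years, Transformer neural networks such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pretrained Transformer (GPT), and Vision Transformer (ViT) have made great progress in both performance and range of applications. In the pursuit of higher predictive performance, however, the memory and compute cost of Transformers has grown exponentially, which creates the need for techniques that optimize Transformer inference.

2. Solutions: To address the memory and compute cost of Transformer inference, researchers have proposed a range of optimization techniques: algorithm-level methods such as knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design, together with hardware-level optimizations and novel hardware accelerators designed for Transformers.

3. Motivation: As Transformer models keep growing, the compute and storage demands of inference grow with them, so inference optimization is needed to improve efficiency and save resources. In addition, as hardware platforms evolve, dedicated accelerators designed for Transformers have become another viable solution.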

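Optimization methods:
1. Knowledge distillation
1.1 Definition of knowledge distillation
Knowledge distillation (KD) [45] is a widely used model compression technique that transfers knowledge from a large pretrained teacher model to a small student model, so that the student can replicate or imitate the teacher's behavior. Distillation methods typically use the teacher's predictions to guide the student's training. The process starts from a large trained network, and the task is to have a smaller Transformer approximate the function that the larger network has learned. The student is trained to predict both the correct output and the soft targets produced by the teacher, where the soft targets are the probabilities the teacher assigns when it predicts on a given input. This is done by minimizing a distillation loss between the targets produced by the teacher and the predictions produced by the student.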
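The combination of a hard-label term and a soft-target term described above can be sketched as follows. This is a minimal illustration and not the survey's formulation: the temperature, the weighting factor `alpha`, the use of KL divergence for the soft term, and the `temperature ** 2` scaling are common conventions assumed here, and plain Python lists stand in for tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T).

    temperature and alpha are assumed hyperparameters, not values from the survey.
    """
    # Hard term: cross-entropy against the ground-truth label at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[label])
    # Soft term: KL divergence between temperature-softened teacher and student outputs.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft))
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss

teacher_logits = [2.0, 1.0, 0.0]
matched = distillation_loss([2.0, 1.0, 0.0], teacher_logits, label=0)
mismatched = distillation_loss([0.0, 1.0, 2.0], teacher_logits, label=0)
print(f"student agrees with teacher: {matched:.4f}")
print(f"student disagrees with teacher: {mismatched:.4f}")
```

When the student reproduces the teacher's logits the soft term vanishes and only the hard-label term remains, so the second loss is larger than the first.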
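1.2 Categories of knowledge distillation
KD methods can be roughly divided into two categories according to how task-specific the knowledge transferred from the teacher to the student is. They are summarized below.

Task-agnostic KD
Task-agnostic KD extracts "general" knowledge without regard to any particular task, which can be useful for a range of downstream applications. Homotopic distillation (HomoDistil) [72] is a task-agnostic distillation method that combines iterative pruning with layer-wise transfer learning (on attention and hidden layers). The student is initialized from the teacher and pruned iteratively until it reaches the target width. Throughout distillation, each pruning iteration removes the parameters that are least important with respect to the final score.
Task-specific KD
Task-specific distillation transfers knowledge into a small model for one particular downstream application. This helps obtain the best performance on that task. Task-agnostic distillation, by contrast, transfers only general knowledge and may not reach the best performance on the target task. DeiT [52] was the first distillation method for ViT. The authors trained a student Transformer on the target ImageNet dataset to match the hard labels provided by a pretrained CNN teacher. They used only the final outputs of the teacher and student models and ignored the intermediate-layer information in both networks.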
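Matching a teacher's hard labels, as described for DeiT, can be sketched as a cross-entropy against the teacher's argmax prediction alongside the usual cross-entropy against the ground truth. The function name and the equal 0.5 weighting of the two terms are assumptions made for illustration; this is not taken from the DeiT code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]

def hard_label_distillation_loss(student_logits, teacher_logits, true_label):
    """0.5 * CE(student, true label) + 0.5 * CE(student, teacher's hard label).

    The equal weighting is an assumption for this sketch.
    """
    # The teacher's hard label is simply its most confident class.
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    log_probs = log_softmax(student_logits)
    ce_true = -log_probs[true_label]
    ce_teacher = -log_probs[teacher_label]
    return 0.5 * ce_true + 0.5 * ce_teacher

student = [1.0, 2.0, 0.5]
agree = hard_label_distillation_loss(student, [0.1, 3.0, 0.2], true_label=1)
disagree = hard_label_distillation_loss(student, [3.0, 0.1, 0.2], true_label=1)
print(f"teacher agrees with ground truth: {agree:.4f}")
print(f"teacher disagrees with ground truth: {disagree:.4f}")
```

Only the final logits of both models enter the loss, which mirrors the point that intermediate-layer information is ignored.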

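1.3 Methods by distillation granularity
Distillation granularity refers to the level at which information is transferred between the teacher and student networks.

Network-level distillation
Network/model-level distillation transfers knowledge only at the model output. The student is trained to match the teacher's outputs by minimizing a loss between the teacher's and the student's outputs. This technique is also called prediction-layer distillation, because the student is trained to match predictions.
Layer-wise distillation
Layer-wise distillation transfers knowledge at the level of individual layers: the student is trained to produce outputs at selected layers that resemble those of the teacher. Hidden-state transfer is one form of layer-wise learning that minimizes the loss between the hidden states of the teacher and student networks. Here a hidden state is the output of the MHA (multi-head attention) or FFN (feed-forward network) module of an encoder or decoder.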
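A hidden-state loss of this kind can be sketched as a mean squared error between teacher and student hidden states. Because the student is usually narrower than the teacher, the sketch assumes a projection matrix that maps student width to teacher width; the projection, the function name, and the choice of MSE are illustrative assumptions, not details given in the text above.

```python
def hidden_state_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between projected student states and teacher states.

    student_hidden: one list of d_student floats per token
    teacher_hidden: one list of d_teacher floats per token
    projection:     d_student x d_teacher matrix (assumed to be learned during training)
    """
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_hidden, teacher_hidden):
        # Project the student state into the teacher's hidden dimension.
        projected = [
            sum(s * projection[i][j] for i, s in enumerate(s_tok))
            for j in range(len(t_tok))
        ]
        total += sum((p - t) ** 2 for p, t in zip(projected, t_tok))
        count += len(t_tok)
    return total / count

# One token, student width 2, teacher width 3.
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
student_hidden = [[1.0, 2.0]]
loss_same = hidden_state_loss(student_hidden, [[1.0, 2.0, 0.0]], projection)
loss_diff = hidden_state_loss(student_hidden, [[1.0, 2.0, 3.0]], projection)
print(loss_same, loss_diff)
```

In practice this term would be summed over the selected layer pairs and added to the prediction-level loss.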
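Attention-based distillation
Attention distillation trains the student's attention matrices to follow those of the teacher, thereby transferring linguistic information. The motivation is that the attention weights learned by BERT capture rich linguistic knowledge, including syntax and coreference information.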
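As a rough sketch of attention distillation, the snippet below builds single-head attention matrices with scaled dot-product attention and compares teacher and student row by row with a KL divergence. The choice of KL (rather than, say, MSE), the single head, and all names are assumptions for illustration.

```python
import math

def attention_matrix(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)) for a single attention head."""
    d = len(queries[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def attention_kl(teacher_attn, student_attn):
    """Mean over query positions of KL(teacher row || student row)."""
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(t * math.log(t / s) for t, s in zip(t_row, s_row))
    return total / len(teacher_attn)

keys = [[1.0, 0.0], [0.0, 1.0]]
teacher_attn = attention_matrix([[1.0, 0.0], [0.0, 1.0]], keys)
student_attn = attention_matrix([[0.5, 0.5], [0.5, 0.5]], keys)  # attends uniformly
loss_attn = attention_kl(teacher_attn, student_attn)
print(f"attention KL: {loss_attn:.4f}")
```

Each row of an attention matrix is a probability distribution over key positions, which is why a divergence between distributions is a natural choice here.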
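Embedding-layer distillation
Beyond the model output, attention, and hidden states, knowledge from the teacher's embedding layer can also be transferred to the equivalent layer of the student so that the student learns its embeddings.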
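2. Pruning
Neural network pruning reduces network size and computational complexity by removing redundant weights and activations. A pruning algorithm sets as many weights/nodes/neurons/heads as possible to zero for inference.
2.1 Classification by matrix sparsity pattern
A neural network can be pruned at different levels, which produces different sparsity patterns. Methods are classified as unstructured, semi-structured, and structured. These techniques are described in Table 3 and illustrated in Figure 9 of the survey.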
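The three sparsity patterns can be contrasted on a small weight matrix. The sketch below uses magnitude as the importance score and takes N:M sparsity (keep N of every M consecutive weights) as one example of a semi-structured pattern; both choices, and all function names, are assumptions for illustration and may not match the survey's exact definitions.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude fraction of individual weights, anywhere in the matrix."""
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    k = int(len(magnitudes) * sparsity)
    if k == 0:
        return [list(row) for row in matrix]
    threshold = magnitudes[k - 1]  # ties at the threshold are pruned as well
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def semi_structured_prune(matrix, n=2, m=4):
    """N:M sparsity: in every group of m consecutive weights in a row, keep the n largest."""
    pruned = []
    for row in matrix:
        new_row = list(row)
        for start in range(0, len(row), m):
            group = range(start, min(start + m, len(row)))
            keep = set(sorted(group, key=lambda i: abs(row[i]), reverse=True)[:n])
            for i in group:
                if i not in keep:
                    new_row[i] = 0.0
        pruned.append(new_row)
    return pruned

def structured_prune(matrix, rows_to_remove):
    """Remove whole rows (e.g. neurons) with the smallest squared L2 norm."""
    norms = [sum(w * w for w in row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: norms[i])
    drop = set(order[:rows_to_remove])
    return [list(row) for i, row in enumerate(matrix) if i not in drop]

weights = [[0.9, -0.1, 0.4, 0.05],
           [-0.2, 0.7, -0.03, 0.6],
           [0.01, 0.02, -0.08, 0.3]]

unstructured = unstructured_prune(weights, sparsity=0.5)
semi_structured = semi_structured_prune(weights, n=2, m=4)
structured = structured_prune(weights, rows_to_remove=1)
print(unstructured)
print(semi_structured)
print(structured)
```

Unstructured pruning leaves zeros scattered irregularly, N:M pruning leaves the same number of zeros in every group, and structured pruning shrinks the matrix itself, which is the easiest case for ordinary dense hardware to exploit.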

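2. RNN

Two implementations are given below:

A unidirectional/bidirectional recurrent neural network based on the PyTorch API
A unidirectional/bidirectional recurrent neural network written by hand from the formula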

import torch
from torch import nn

hidden_in = 4
hidden_out = 3
num_layers = 1

# define the RNN layers
rnn_layer = nn.RNN(input_size=hidden_in, hidden_size=hidden_out, num_layers=num_layers,
				   batch_first=True)
batch_size = 2
sequence_length = 4

# randomly initialize the input
my_input = torch.randn(batch_size, sequence_length, hidden_in)

# zero-initialize the initial hidden state
h_prev = torch.zeros(batch_size, hidden_out)
# my_output holds the hidden state at every time step
# h_n is the final hidden state
my_output, h_n = rnn_layer(my_input, h_prev.unsqueeze(0))


# print(f"my_output={my_output}")

# print(f"my_output.shape={my_output.shape}")

# print(f"h_n={h_n}")

# print(f"h_n.shape={h_n.shape}")


# custom_rnn_function
def custom_rnn_function(input, w_ih, w_hh, b_ih, b_hh, h_prev):
	"""
	formula:

	h_t = tanh(w_{ih}*x_t+b_{ih}+w_{hh}*h_{t-1}+b_{hh})
	x_t is the input at time t

	:param input: input(batch_size,sequence_length,hidden_in)
	:param w_ih: weight w_ih (hidden_out,hidden_in)
	:param w_hh: weight w_hh (hidden_out,hidden_out)

	:param b_ih: bias b_ih (hidden_out)
	:param b_hh: bias b_hh (hidden_out)
	:param h_prev: initial hidden state h_prev (batch_size,hidden_out)
	:return: output (batch_size,sequence_length,hidden_out), h_n (1,batch_size,hidden_out)

	"""
	batch_size, sequence_length, hidden_in = input.shape
	hidden_out, hidden_in = w_ih.shape
	output = torch.zeros(batch_size, sequence_length, hidden_out)
	for t in range(sequence_length):
		# input[:,t,:].shape = [batch_size,hidden_in] -> (batch_size,hidden_in,1)
		x_t = input[:, t, :].unsqueeze(2)

		# w_ih.shape = [hidden_out,hidden_in] -> (batch_size,hidden_out,hidden_in)
		w_ih_batch = w_ih.unsqueeze(0).tile(batch_size, 1, 1)

		# w_hh = [hidden_out,hidden_out] -> (batch_size,hidden_out,hidden_out)
		# h_prev = [batch_size,hidden_out]

		w_hh_batch = w_hh.unsqueeze(0).tile(batch_size, 1, 1)

		# w_ih_times_x.shape=(batch_size,hidden_out,1) -> (batch_size,hidden_out)
		w_ih_times_x = torch.bmm(w_ih_batch, x_t).squeeze(-1)

		# w_hh_times_h.shape =(batch_size,hidden_out,1)->(batch_size,hidden_out)
		# h_prev = [batch_size,hidden_out] -> (batch_size,hidden_out,1)
		# w_hh = [hidden_out,hidden_out] -> (batch_size,hidden_out,hidden_out)
		w_hh_times_h = torch.bmm(w_hh_batch, h_prev.unsqueeze(2)).squeeze(-1)
		h_prev = torch.tanh((w_ih_times_x + b_ih + w_hh_times_h + b_hh))
		output[:, t, :] = h_prev

	return output, h_prev.unsqueeze(0)


# get the rnn_layers weights data
custom_w_ih = rnn_layer.weight_ih_l0
custom_w_hh = rnn_layer.weight_hh_l0
custom_bias_ih = rnn_layer.bias_ih_l0
custom_bias_hh = rnn_layer.bias_hh_l0

# pass rnn_layer's weights to custom_rnn_function
# if output and h_n match those returned by nn.RNN,
# the custom function is correct.
custom_output, custom_hn = custom_rnn_function(my_input, custom_w_ih, custom_w_hh,
											   custom_bias_ih, custom_bias_hh, h_prev)


# print(f"custom_output={custom_output}")
# print(f"custom_hn={custom_hn}")
# print(f"my_output={my_output}")
# print(f"h_n={h_n}")
# print("check whether custom_output is equal to my_output")
# print(torch.isclose(custom_output, my_output))
# print("check whether custom_hn is equal to h_n")
# print(torch.isclose(custom_hn, h_n))


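# custom bidirectional RNN built from two custom_rnn_function passes
# note: the same h_prev is used for both directions, which matches nn.RNN here
# only because the initial hidden states are all zeros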
def bicstm_rnn_function(input, w_ih, w_hh, b_ih, b_hh, h_prev,
						w_ih_reverse, w_hh_reverse, b_ih_reverse, b_hh_reverse):
	batch_size, sequence_length, hidden_in = input.shape
	hidden_out, hidden_in = w_ih.shape
	output = torch.zeros(batch_size, sequence_length, hidden_out * 2)


	forward_output = custom_rnn_function(input, w_ih, w_hh, b_ih, b_hh, h_prev)[0]
	backward_output = custom_rnn_function(torch.flip(input, [1]), w_ih_reverse, w_hh_reverse, b_ih_reverse,
													 b_hh_reverse,h_prev)[0]
	output[:, :, :hidden_out] = forward_output
	output[:, :, hidden_out:] = torch.flip(backward_output,[1])


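	# old (incorrect): output[:, -1, hidden_out:] is the backward state at the last time step,
	# not the backward direction's final state, which sits at time step 0
	# return output, output[:, -1, :].reshape((batch_size, 2, hidden_out)).transpose(0, 1)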
	return output, torch.cat([forward_output[:,-1,:].unsqueeze(0),backward_output[:,-1,:].unsqueeze(0)],dim=0)


bi_rnn_layer = nn.RNN(input_size=hidden_in, hidden_size=hidden_out, num_layers=num_layers,
					  batch_first=True, bidirectional=True)

bi_h_prev = torch.zeros(2, batch_size, hidden_out)

bi_my_output, bi_h_n = bi_rnn_layer(my_input, bi_h_prev)
print(f"bi_my_output={bi_my_output}")
print(f"bi_h_n={bi_h_n}")

for k, v in bi_rnn_layer.named_parameters():
	print(k, v, v.shape)

bicstm_weight_ih_l0 = bi_rnn_layer.weight_ih_l0
bicstm_weight_hh_l0 = bi_rnn_layer.weight_hh_l0
bicstm_bias_ih_l0 = bi_rnn_layer.bias_ih_l0
bicstm_bias_hh_l0 = bi_rnn_layer.bias_hh_l0
bicstm_weight_ih_l0_reverse = bi_rnn_layer.weight_ih_l0_reverse
bicstm_weight_hh_l0_reverse = bi_rnn_layer.weight_hh_l0_reverse
bicstm_bias_ih_l0_reverse = bi_rnn_layer.bias_ih_l0_reverse
bicstm_bias_hh_l0_reverse = bi_rnn_layer.bias_hh_l0_reverse

bicstm_output, bicstm_h_n = bicstm_rnn_function(my_input, bicstm_weight_ih_l0, bicstm_weight_hh_l0, bicstm_bias_ih_l0,
												bicstm_bias_hh_l0,
												bi_h_prev[0], bicstm_weight_ih_l0_reverse, bicstm_weight_hh_l0_reverse,
												bicstm_bias_ih_l0_reverse,
												bicstm_bias_hh_l0_reverse)

print("pytorch API rnn")
# bi_my_output, bi_h_n
print(f"bi_my_output={bi_my_output}")
print(f"bi_h_n={bi_h_n}")
print("bicstm_output is equal to bi_my_output")
print(torch.isclose(bicstm_output,bi_my_output))
print(torch.isclose(bicstm_output,bi_my_output).shape)
print("bicstm_h_n is equal to bi_h_n")
print(torch.isclose(bicstm_h_n,bi_h_n))
print(torch.isclose(bicstm_h_n,bi_h_n).shape)

print("custom bidirectional rnn")
print(f"bicstm_output={bicstm_output}")
print(f"bicstm_output.shape={bicstm_output.shape}")
print(f"bicstm_h_n={bicstm_h_n}")
print(f"bicstm_h_n.shape={bicstm_h_n.shape}")
print("*"*50)
print(f"bi_my_output={bi_my_output}")
print(f"bi_my_output.shape={bi_my_output.shape}")
print(f"bi_h_n={bi_h_n}")
print(f"bi_h_n.shape={bi_h_n.shape}")
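
The way the bidirectional output and final states line up can be checked on a scalar toy case. The sketch below is a dependency-free illustration in plain Python, with a hidden size of 1 and made-up weights; the name `rnn_scan` is hypothetical and not part of the code above.

```python
import math

def rnn_scan(xs, w_ih, w_hh, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_ih * x_t + w_hh * h_{t-1} + b)."""
    h, states = h0, []
    for x in xs:
        h = math.tanh(w_ih * x + w_hh * h + b)
        states.append(h)
    return states, h

xs = [0.5, -1.0, 2.0]

# Forward direction reads x_0, x_1, x_2.
fwd_states, fwd_final = rnn_scan(xs, w_ih=0.8, w_hh=0.3, b=0.1)

# Backward direction reads x_2, x_1, x_0 with its own weights.
bwd_states_reversed, bwd_final = rnn_scan(xs[::-1], w_ih=-0.4, w_hh=0.6, b=0.0)
bwd_states = bwd_states_reversed[::-1]  # re-align with the original time order

# Per-step output: (forward state, backward state) at each original time index.
output = list(zip(fwd_states, bwd_states))
h_n = (fwd_final, bwd_final)

print("forward final is at the last step:", h_n[0] == output[-1][0])
print("backward final is at the first step:", h_n[1] == output[0][1])
```

This is why the final hidden state has to be assembled from the last forward step and the first (time-aligned) backward step, rather than by slicing the last time step of the concatenated output.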

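Summary

Knowledge distillation transfers the knowledge of a large model to a small one, which enables model compression and performance improvement. Pruning reduces model size and improves efficiency. Quantization lowers the compute and storage cost of a model. RNNs are a key tool for processing sequence data and are used in a variety of natural language processing and time-series tasks. These techniques play different but important roles in deep learning, and the appropriate one can be chosen for model optimization according to the task and resource requirements. Next week I will continue studying this paper.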