Paper Reproduction 1, Paper Re-read: Black-Box Tuning for Language-Model-as-a-Service

YingJingh

Article link: https://blog.csdn.net/Hekena/article/details/128312418

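Core idea of the paper: adapt a PLM (pre-trained language model) to downstream tasks through prompt learning, using only the model's inference API and no access to its parameters or gradients.

Specifically, continuous prompts are prepended to the input text and then optimized within a derivative-free framework.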

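I. Background (Introduction)

Premise:

For commercial and other reasons, the parameters of large models are often not released. In addition, fine-tuning a large model is expensive.

Large models are, however, usually exposed through an API for users to call. This scenario is called "Language-Model-as-a-Service (LMaaS)".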

users can solve the language tasks of interest using the black-box APIs by crafting task-specific text prompts or including training samples in the input texts

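Technical background:

Optimization that does not rely on gradients to update parameters is called "derivative-free optimization" (DFO). DFO scales poorly with the number of parameters: in a high-dimensional search space it converges extremely slowly.
Despite their huge number of parameters, large models have a low intrinsic dimensionality: optimizing within a small subspace is already enough for effective fine-tuning.

Overall idea:

Use a linear projection to tie the high-dimensional prompt space of the large model to a low-dimensional subspace.
Solve the optimization problem in that low-dimensional subspace with a derivative-free method.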

II. Background (Literature Review)

Intrinsic dimensionality of PTMs (pre-trained models)
Derivative-free optimization: realize optimization only via the function values f(x) on the sampled solutions x (that is, the optimizer only ever sees f(x) at the points x it samples, never a gradient)
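To make "optimization only via function values" concrete, here is a minimal sketch of a derivative-free optimizer. It is a plain random-search hill climber, not the algorithm used in the paper, and the objective `sphere` is a hypothetical stand-in for a black-box loss.

```python
import random


def random_search(f, x0, sigma=0.3, budget=2000, seed=0):
    """Minimize f using only function values (no gradients are ever computed)."""
    rng = random.Random(seed)
    best_x, best_val = list(x0), f(x0)
    for _ in range(budget):
        # Propose a Gaussian perturbation of the best solution found so far.
        cand = [xi + rng.gauss(0.0, sigma) for xi in best_x]
        val = f(cand)  # the only information the optimizer gets back
        if val < best_val:
            best_x, best_val = cand, val
    return best_x, best_val


def sphere(x):
    # Toy objective standing in for a black-box loss; its minimum is at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


x_best, f_best = random_search(sphere, [0.0, 0.0])
print(x_best, f_best)
```

Each call to `f` corresponds to one query of the black box, so `budget` plays the role of the number of API calls one is willing to spend.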

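III. Approach (Method)

1 Problem definition

Given inputs X and labels Y, some engineering (such as verbalizer engineering, ...) converts them into $\tilde{X}$ and $\tilde{Y}$. The API f, conditioned on a continuous prompt, then produces the prediction:

$\hat{Y} = f(p; \tilde{X})$

$\hat{Y}$ is the prediction, p is the continuous prompt, and $\tilde{X}$ is the converted input. p has dimension D.

However, the authors state that "our goal is to find the optimal prompt $p^{*} = \arg\min_{p} \mathcal{L}(\hat{Y}, \tilde{Y})$". So is the final goal simply to find the optimal prompt?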

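2 Solving the problem

To make the search tractable, p is reparameterized in a low-dimensional space through a projection matrix A (whose entries are sampled from a normal distribution) of size $D \times d$: $p = Az + p_{0}$. Here z, a d-dimensional vector, is the variable being optimized, and $p_{0}$ is a fixed initial prompt.

The objective is:

$z^{*} = \arg\min_{z} \mathcal{L}(f(Az + p_{0}; \tilde{X}), \tilde{Y})$

The search space is thereby reduced from D dimensions to d dimensions.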
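The reparameterization can be sketched in a few lines. This is a toy illustration under stated assumptions: the sizes `D` and `d`, the standard deviation of the entries of `A`, and the function `toy_api_loss` (a stand-in for the loss computed from the black-box API's output) are all hypothetical and not taken from the paper.

```python
import random

D, d = 8, 3  # toy sizes; in practice D is much larger than d


def make_projection(D, d, seed=0):
    """Random D x d matrix with normally distributed entries (scale is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]


def project(A, z, p0):
    """Compute p = A z + p0, mapping the d-dim vector z to a D-dim prompt."""
    return [sum(a * zj for a, zj in zip(row, z)) + p0_i
            for row, p0_i in zip(A, p0)]


def toy_api_loss(p):
    # Hypothetical stand-in for L(f(p; X), Y): here just a quadratic bowl.
    return sum((pi - 0.2) ** 2 for pi in p)


A = make_projection(D, d)
p0 = [0.1] * D  # fixed initial prompt


def objective(z):
    # The optimizer only ever sees z (d numbers); A and p0 stay fixed.
    return toy_api_loss(project(A, z, p0))


print(objective([0.0] * d))
```

Any derivative-free optimizer can then be pointed at `objective`, which takes d numbers instead of D.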

Loss function:

Three loss functions were tried: cross entropy, hinge loss, and negative accuracy.
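For reference, the three losses can be written over a batch of class logits as follows. This is a sketch: the multi-class hinge form and the default `margin` are common choices assumed here, not values confirmed by these notes.

```python
import math


def cross_entropy(logits_batch, labels):
    """Mean negative log-likelihood of the gold label under softmax."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[y]
    return total / len(labels)


def hinge_loss(logits_batch, labels, margin=1.0):
    """Mean multi-class hinge loss: penalize wrong classes within `margin` of the gold one."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total += sum(max(0.0, margin + v - logits[y])
                     for i, v in enumerate(logits) if i != y)
    return total / len(labels)


def negative_accuracy(logits_batch, labels):
    """Minus the fraction of examples whose argmax equals the gold label."""
    correct = sum(
        1 for logits, y in zip(logits_batch, labels)
        if max(range(len(logits)), key=logits.__getitem__) == y
    )
    return -correct / len(labels)


batch = [[2.0, 0.0], [0.0, 3.0]]
print(cross_entropy(batch, [0, 1]), hinge_loss(batch, [0, 1]), negative_accuracy(batch, [0, 1]))
```

Negative accuracy is piecewise constant, so small changes to the prompt often leave it unchanged, whereas cross entropy and hinge loss respond to every change in the logits.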

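Optimization algorithm:

CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which is widely used for non-convex black-box optimization in continuous domains.

Are the queries obtained by sampling from a multivariate normal distribution? In CMA-ES, yes: at each iteration a population of candidate solutions is sampled from a multivariate normal distribution, each candidate is evaluated, and the mean, step size, and covariance matrix of the distribution are updated from the best candidates.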
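The sample-evaluate-update loop can be illustrated with a heavily simplified evolution strategy. This sketch is not CMA-ES: it keeps an isotropic covariance and shrinks the step size on a fixed schedule, whereas CMA-ES adapts the full covariance matrix and the step size from the search history. The function `toy_loss` and all hyperparameters are hypothetical.

```python
import random


def simple_es(loss, dim, popsize=20, iters=60, sigma=0.5, seed=0):
    """Simplified evolution strategy: sample from N(mean, sigma^2 I), keep the best quarter."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(iters):
        # Each candidate is one "query": a point drawn from a Gaussian around the mean.
        pop = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(popsize)]
        pop.sort(key=loss)  # only loss values are used, no gradients
        elite = pop[: popsize // 4]
        # Move the mean of the search distribution to the average of the elite.
        mean = [sum(col) / len(elite) for col in zip(*elite)]
        sigma *= 0.95  # fixed decay; CMA-ES adapts this instead
    return mean


def toy_loss(z):
    # Hypothetical stand-in for the black-box objective; minimum at z_i = 0.5.
    return sum((zi - 0.5) ** 2 for zi in z)


z_star = simple_es(toy_loss, dim=5)
print(z_star, toy_loss(z_star))
```

With `popsize=20` and `iters=60`, this run spends 1200 loss evaluations, which is the quantity that matters when every evaluation is an API call.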

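3 Model diagram

IV. Experiments

Datasets: natural language understanding tasks, covering sentiment analysis, topic classification, natural language inference, and paraphrase.

Few-shot setting: k samples are randomly drawn from each class to form a k-shot training set, and the dataset sizes satisfy $|D_{test}| \gg |D_{train}| = |D_{dev}|$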
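A k-shot split of this kind could be built as sketched below. The function name `sample_k_shot` and the toy data are hypothetical, and drawing the development set as a second, disjoint group of k samples per class is an assumption made so that the two sets have equal size.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k, seed=0):
    """Split (text, label) pairs into disjoint train/dev sets with k samples per class each."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, dev = [], []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < 2 * k:
            raise ValueError(f"class {label!r} has fewer than {2 * k} examples")
        picked = rng.sample(items, 2 * k)  # sampling without replacement
        train.extend(picked[:k])
        dev.extend(picked[k:])
    return train, dev


# Toy data: 10 examples for each of two classes.
toy = [(f"text {i}", i % 2) for i in range(20)]
train_set, dev_set = sample_k_shot(toy, k=3)
print(len(train_set), len(dev_set))
```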

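PLM backbone: RoBERTa-large

Baselines: gradient-based methods and gradient-free methods, roughly as follows.

Gradient-based methods:

Prompt tuning: keep the PLM parameters frozen and only tune the continuous prompt.
P-Tuning v2: rather than injecting continuous prompts only at the input layer, inject and optimize them at every layer of the PLM.
Model tuning: fine-tune the entire PLM.

Gradient-free methods:

Manual prompt: hand-crafted prompts.
In-context learning: randomly select 32 samples from the training set and concatenate them with the input text.
Feature-based methods: take the embeddings produced by the PLM and train a classifier on top of them to complete the task.

In practice, feature-MLP trains a two-layer MLP classifier on the [CLS] representation to complete the classification task.

Feature-BiLSTM trains a Bi-LSTM followed by a classifier on top of the PLM features.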

V. Experimental results

Hyperparameter settings

Overall comparison

Conclusions drawn in the paper:

Gradient-based optimization tends to overfit the small training data.
Model tuning performs much better than prompt tuning and black-box tuning when the number of classes is large.
With pre-trained prompt embedding (§ 3.4), prompt tuning and black-box tuning significantly outperform model tuning.

We suspect .... (note the choice of words), We find ....