Background
Text similarity aims to determine whether two pieces of text are semantically similar. It is an important research direction in natural language processing and plays a key role in information retrieval, news recommendation, intelligent customer service, and other applications, giving it considerable commercial value.
Several public Chinese text-similarity datasets, backed by the papers that introduced them, have been used to evaluate existing open text-similarity models fairly comprehensively and are considered authoritative. This open-source project collects these datasets to evaluate model performance comprehensively, aiming to give researchers and developers a platform for academic and technical exchange, raise the level of text-similarity research, and promote its application in NLP.
- Competition page: https://aistudio.baidu.com/aistudio/competition/detail/45/0/task-definition
- Dataset: https://download.csdn.net/download/turkeym4/82336620
Data
The competition provides three different text datasets. We are asked to run predictions on each of the three datasets separately, then package the results into a single archive for submission.
Dataset | Description | Train size | Dev size | Test size |
---|---|---|---|---|
LCQMC | Chinese question pairs from Baidu Knows | 238,766 | 8,802 | 12,500 |
BQ Corpus | Question pairs from the banking/finance domain | 100,000 | 10,000 | 10,000 |
PAWS-X | Paraphrase pairs from Google | 49,401 | 2,000 | 2,000 |
Approach
From the data, this is a typical text-similarity problem, which can also be framed as binary classification. The solutions fall into three parts:
- TF-IDF + machine learning
- Traditional neural networks
- BERT-based models
Data analysis
Taking the first sentence of each pair, I computed length statistics and found that the datasets differ in typical length. For simplicity, I use the 97.5th percentile as the maximum length for each dataset. The results:
Dataset | Truncation length |
---|---|
paws-x | 88 |
lcqmc | 22 |
bq_corpus | 30 |
# Length statistics, using paws-x as an example
import pandas as pd
import numpy as np
train = pd.read_csv('data/paws-x/train.tsv', sep='\t',names=['text_a', 'text_b', 'label'])
train['len_a'] = train['text_a'].apply(lambda x:len(x))
p = np.percentile(train['len_a'].tolist(), [75, 90, 97.5])  # 75th / 90th / 97.5th percentiles
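As a self-contained illustration of the percentile call, with synthetic lengths in place of the competition file:

```python
import numpy as np

# Synthetic sentence lengths standing in for train['len_a']
lengths = [5, 8, 10, 12, 15, 18, 20, 25, 30, 40]
p75, p90, p975 = np.percentile(lengths, [75, 90, 97.5])
print(p75, p90, p975)  # 23.75 31.0 37.75
```

The 97.5th percentile keeps nearly all sentences intact while capping the rare outliers that would otherwise dominate padding cost.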
TF-IDF + machine learning
Shared imports
import numpy as np
import pandas as pd
import jieba
import distance
from tqdm import tqdm
from gensim import corpora,models,similarities
from gensim.test.utils import common_texts
from gensim.models import Word2Vec,TfidfModel
from gensim import corpora
Approach
This approach is relatively simple and serves as the baseline for the task. The steps are:
- Tokenize both texts
- Compute TF-IDF values
- Compute a TF-IDF-based distance between the texts
- Compute the difference in text length
- Compute the difference in word count
- Train a binary classifier with LightGBM
def cut(content):
try:
seg_list = jieba.lcut(content, cut_all=True)
except AttributeError as ex:
print(content)
raise ex
return seg_list
def rate(words_1, words_2):
int_list = list(set(words_1).intersection(set(words_2)))
return len(int_list)/len(set(words_1))
def edit_distance(s1, s2):
return distance.levenshtein(s1, s2)
def data_anaysis(df):
    # Edit distance
    df['edit_dist'] = df.apply(lambda row: edit_distance(row['text_a'], row['text_b']), axis=1)
    # Tokenize
    df['words_a'] = df['text_a'].apply(lambda x: cut(x))
    df['words_b'] = df['text_b'].apply(lambda x: cut(x))
    # Character counts
    df['text_a_len'] = df['text_a'].apply(lambda x: len(x))
    df['text_b_len'] = df['text_b'].apply(lambda x: len(x))
    # Word counts
    df['words_a_len'] = df['words_a'].apply(lambda x: len(x))
    df['words_b_len'] = df['words_b'].apply(lambda x: len(x))
    # Shared-word ratios
    df['rate_a'] = df.apply(lambda row: rate(row['words_a'], row['words_b']), axis=1)
    df['rate_b'] = df.apply(lambda row: rate(row['words_b'], row['words_a']), axis=1)
    return df
train = pd.read_csv('data/paws-x-zh/train.tsv', sep='\t',names=['text_a', 'text_b', 'label'])
test = pd.read_csv('data/paws-x-zh/test.tsv', sep='\t',names=['text_a', 'text_b', 'label'])
# train = train[train['label'].isin(['0','1'])]
test['label'] = -1
train = train.dropna()
test = test.dropna()
train = data_anaysis(train)
test = data_anaysis(test)
test
# TF-IDF-weighted word-overlap feature
def tfidf_word_match_share(row, stops):
    # Relies on a module-level `weights` dict mapping word -> TF-IDF weight
    q1words = {}
    q2words = {}
    for word in row['words_a']:
        if word not in stops:
            q1words[word] = 1
    for word in row['words_b']:
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # Some pairs consist of nothing but stopwords
        return 0
    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + [weights.get(w, 0) for w in q2words.keys() if w in q1words]
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    R = np.sum(shared_weights) / np.sum(total_weights)
    return R
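The snippet above relies on a `weights` dict that is not defined in this excerpt. A common construction (popularized by the Quora question-pairs kernels this function appears to be adapted from) assigns each word a smoothed inverse-frequency weight; a minimal sketch, where the function names and the `eps`/`min_count` values are illustrative:

```python
from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # Words seen fewer than min_count times are likely noise: weight 0;
    # otherwise the weight shrinks as corpus frequency grows.
    return 0.0 if count < min_count else 1.0 / (count + eps)

def build_weights(token_lists, eps=10000, min_count=2):
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {w: get_weight(c, eps, min_count) for w, c in counts.items()}

# Usage (hypothetical, following the tokenized columns above):
# weights = build_weights(pd.concat([train['words_a'], train['words_b']]))
```

Stopword-like words end up with tiny weights automatically, so the feature emphasizes overlap on informative words.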
train['tfidf_word_match'] = train.apply(lambda row: tfidf_word_match_share(row, stop_words), axis=1)
test['tfidf_word_match'] = test.apply(lambda row: tfidf_word_match_share(row, stop_words), axis=1)
# Final feature engineering
train['text_len_diff'] = abs(train['text_a_len'] - train['text_b_len'])
train['word_len_diff'] = abs(train['words_a_len'] - train['words_b_len'])
test['text_len_diff'] = abs(test['text_a_len'] - test['text_b_len'])
test['word_len_diff'] = abs(test['words_a_len'] - test['words_b_len'])
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
# Modeling
# Note: only features actually computed above are used here
features = ['text_len_diff', 'word_len_diff', 'edit_dist', 'rate_a', 'rate_b', 'tfidf_word_match']
X = train[features]
y = train['label']
test_features = test[features]
model = lgb.LGBMClassifier(num_leaves=128,
max_depth=10,
learning_rate=0.01,
n_estimators=2000,
subsample=0.8,
feature_fraction=0.8,
reg_alpha=0.5,
reg_lambda=0.5,
random_state=2022,
metric='auc',
boosting_type='gbdt',
subsample_freq=1,
bagging_fraction=0.8)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
prob = []
mean_acc = 0
for k,(train_index, test_index) in enumerate(skf.split(X, y)):
print(k)
X_train, X_val = X.iloc[train_index], X.iloc[test_index]
y_train, y_val = y.iloc[train_index], y.iloc[test_index]
        # Train on this fold
print(y_val)
model = model.fit(X_train,
y_train,
eval_set=[(X_val, y_val)],
eval_metric='auc',
verbose = True)
        # Predict on the test set
test_y_pred = model.predict_proba(test_features)
prob.append(test_y_pred)
Traditional neural networks
The traditional neural-network approach has two parts:
- Text embedding
- Network training and inference
Typically, the embedding can either be learned during network training or obtained from Word2Vec.
Word2Vec
Word2Vec is the word-vector method everyone knows best. We first tokenize the sentences, train a vector for each word, and then fuse the word vectors into a sentence vector.
A sentence vector can be built in several ways:
- max-pooling over the word vectors
- mean-pooling (averaging) over the word vectors
- IDF-weighted averaging (IDF-mean-pooling)
- SIF-weighted averaging (SIF-mean-pooling)
import re
import math
from sklearn.decomposition import TruncatedSVD
# Load the stopword list
def get_stopwords():
stop_words = []
with open('baidu_stopwords.txt', 'r', encoding='utf-8') as f:
for line in f.readlines():
stop_words.append(line.replace('\n', ''))
return stop_words
# Tokenize with jieba
def cut(content, stop_words):
    # Strip punctuation and symbols
content = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]", "",content)
result = []
try:
seg_list = jieba.lcut(content, cut_all=True)
for i in seg_list:
if i not in stop_words:
result.append(i)
except AttributeError as ex:
print(content)
raise ex
return result
# Ratio of shared words
def rate(words_1, words_2):
int_list = list(set(words_1).intersection(set(words_2)))
return len(int_list)/len(set(words_1))
# Edit distance
def edit_distance(s1, s2):
return distance.levenshtein(s1, s2)
def data_anaysis(df, stop_words):
    # Edit distance
    df['edit_dist'] = df.apply(lambda row: edit_distance(row['text_a'], row['text_b']), axis=1)
    # Tokenize
    df['words_a'] = df['text_a'].apply(lambda x: cut(x, stop_words))
    df['words_b'] = df['text_b'].apply(lambda x: cut(x, stop_words))
    # Character counts
    df['text_a_len'] = df['text_a'].apply(lambda x: len(x))
    df['text_b_len'] = df['text_b'].apply(lambda x: len(x))
    # Word counts
    df['words_a_len'] = df['words_a'].apply(lambda x: len(x))
    df['words_b_len'] = df['words_b'].apply(lambda x: len(x))
    # Shared-word ratios
    df['rate_a'] = df.apply(lambda row: rate(row['words_a'], row['words_b']), axis=1)
    df['rate_b'] = df.apply(lambda row: rate(row['words_b'], row['words_a']), axis=1)
    return df
# Load the stopwords
stop_words = get_stopwords()
train = pd.read_csv('data/paws-x-zh/train.tsv', sep='\t',names=['text_a', 'text_b', 'label'])
test = pd.read_csv('data/paws-x-zh/test.tsv', sep='\t',names=['text_a', 'text_b', 'label'])
test['label'] = -1
train = train.dropna()
test = test.dropna()
train = data_anaysis(train, stop_words)
test = data_anaysis(test, stop_words)
# Train the word vectors
context = []
for i in tqdm(range(len(train))):
row = train.iloc[i]
context.append(row['words_a'])
context.append(row['words_b'])
for i in tqdm(range(len(test))):
row = test.iloc[i]
context.append(row['words_a'])
context.append(row['words_b'])
wv_model = Word2Vec(sentences=context, vector_size=100, window=5, min_count=1, workers=4)
# Word2Vec already trains in its constructor; if calling train() again,
# total_examples must be the corpus size, not 1
wv_model.train(context, total_examples=len(context), epochs=1)
# Count in how many sentences each word appears (document frequency)
from collections import Counter
count_list = []
words_num = 0
for i in tqdm(range(len(train))):
count_list += list(set(train.iloc[i]['words_a']))
count_list += list(set(train.iloc[i]['words_b']))
words_num +=2
for i in tqdm(range(len(test))):
count_list += list(set(test.iloc[i]['words_a']))
count_list += list(set(test.iloc[i]['words_b']))
words_num +=2
count = Counter(count_list)
# Compute the IDF table
idf = {}
for k, v in tqdm(dict(count).items()):
idf[k] = math.log(words_num/(v+1))
# Convert sentences to pooled sentence vectors
def text_to_wv(model, data, operation='max_pooling',key='wv'):
full_wv_a = []
full_wv_b = []
    # Build the word-vector representation of each sentence
for i in tqdm(range(len(data))):
row = data.iloc[i]
wv_a = []
words_a = row['words_a']
for i in words_a:
wv_a.append(model.wv[i])
if operation == 'max_pooling':
full_wv_a.append(np.amax(wv_a, axis=0))
elif operation == 'mean_pooling':
full_wv_a.append(np.mean(wv_a, axis=0))
wv_b = []
words_b = row['words_b']
for i in words_b:
wv_b.append(model.wv[i])
if operation == 'max_pooling':
full_wv_b.append(np.amax(wv_b, axis=0))
elif operation == 'mean_pooling':
full_wv_b.append(np.mean(wv_b, axis=0))
data[key + '_a'] = full_wv_a
data[key + '_b'] = full_wv_b
# IDF-weighted sentence vectors
def idf_to_wv(model, data, idf):
full_wv_a = []
full_wv_b = []
    # Build the word-vector representation of each sentence
for i in tqdm(range(len(data))):
row = data.iloc[i]
wv_a = []
words_a = row['words_a']
for i in words_a:
wv_a.append(model.wv[i] * idf[i])
full_wv_a.append(np.mean(wv_a, axis=0))
wv_b = []
words_b = row['words_b']
for i in words_b:
wv_b.append(model.wv[i] * idf[i])
full_wv_b.append(np.mean(wv_b, axis=0))
data['idf_wv_a'] = full_wv_a
data['idf_wv_b'] = full_wv_b
# Max-pooled sentence vectors
text_to_wv(wv_model, train, 'max_pooling', 'max_wv')
text_to_wv(wv_model, test, 'max_pooling', 'max_wv')
# Mean-pooled sentence vectors
text_to_wv(wv_model, train, 'mean_pooling', 'mean_wv')
text_to_wv(wv_model, test, 'mean_pooling', 'mean_wv')
# IDF-weighted mean sentence vectors
idf_to_wv(wv_model, train, idf)
idf_to_wv(wv_model, test, idf)
# SIF sentence vectors
# Compute principal components; npc = number of components
def compute_pc(X, npc):
    svd = TruncatedSVD(n_components=npc, n_iter=5, random_state=0)
    svd.fit(X)
    return svd.components_
# Remove the principal components
def remove_pc(X, npc=1):
    pc = compute_pc(X, npc)
    if npc == 1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX
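As a sanity check on remove_pc: after removing the first principal component, every row should be orthogonal to the removed direction. A numpy-only sketch of the same computation (using a full SVD in place of TruncatedSVD; the function name is illustrative):

```python
import numpy as np

def remove_first_pc(X):
    # First right-singular vector = first principal direction of X
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0:1]                  # (1, d)
    return X - X.dot(pc.T) * pc   # project the component out of every row

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
XX = remove_first_pc(X)
_, _, vt = np.linalg.svd(X, full_matrices=False)
# Every row of XX is numerically orthogonal to the removed direction
print(np.abs(XX.dot(vt[0])).max() < 1e-10)
```

Note that, as in the SIF paper, the matrix is not mean-centered before the SVD.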
# SIF word weights
def sif_weight(count, a=3e-5):
    # Total token count
    word_num = 0
    for k, v in dict(count).items():
        word_num += v
    # Compute the weights
    sif = {}
    for k, v in dict(count).items():
        sif[k] = a / (a + v/word_num)
    return sif
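The SIF weight a / (a + p(w)) smoothly down-weights frequent words. A condensed, self-contained version of the function above, applied to a toy count dict:

```python
def sif_weight(counts, a=3e-5):
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

weights = sif_weight({'the': 9000, 'protein': 10}, a=3e-5)
# A very frequent word gets a much smaller weight than a rare, informative one
print(weights['the'] < weights['protein'])  # True
```

This is the weighting from Arora et al.'s "A Simple but Tough-to-Beat Baseline for Sentence Embeddings"; the principal-component removal above is the second half of that recipe.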
# SIF-weighted sentence vectors
def sif_to_wv(model, data, sif):
    full_wv_a = []
    full_wv_b = []
    # Build the word-vector representation of each sentence
    for i in tqdm(range(len(data))):
        row = data.iloc[i]
        wv_a = []
        words_a = row['words_a']
        # Collect the SIF-weighted word vectors
        for i in words_a:
            wv_a.append(model.wv[i] * sif[i])
        # Record the averaged result
        full_wv_a.append(np.mean(wv_a, axis=0))
        wv_b = []
        words_b = row['words_b']
        for i in words_b:
            wv_b.append(model.wv[i] * sif[i])
        full_wv_b.append(np.mean(wv_b, axis=0))
    # Remove the first principal component
    full_wv_a = remove_pc(np.array(full_wv_a))
    full_wv_b = remove_pc(np.array(full_wv_b))
    data['sif_wv_a'] = list(full_wv_a)
    data['sif_wv_b'] = list(full_wv_b)
# Compute the SIF word weights
sif = sif_weight(count)
sif_to_wv(wv_model, train, sif)
sif_to_wv(wv_model, test, sif)
# Inspect the first rows
print(train[['max_wv_a', 'max_wv_b', 'mean_wv_a', 'mean_wv_b', 'idf_wv_a',
'idf_wv_b', 'sif_wv_a', 'sif_wv_b']][:5])
Typically, for machine learning (these sentence vectors can be fed as features into the LightGBM classifier above) or unsupervised similarity computation, we can use the sentence vectors directly. For a neural network, we instead initialize the embedding layer with the trained word vectors; below is a PyTorch example.
import torch
from torch import nn
# Initialize the word-vector matrix
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
# Copy the trained Word2Vec vectors into the matrix, row i = vector for token id i
for i in range(0, config.vocab_size):
    word_vectors[i, :] = torch.from_numpy(wv_model.wv[i])
# Create the embedding layer and initialize it with the matrix
self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)
Network architectures
With the embeddings in place, what remains is the network itself. I implemented three different architectures:
Architecture | Notes |
---|---|
Siamese Network | 1. A siamese model whose encoder can be CNN- or RNN-based 2. The CNN/RNN encodings are concatenated for inference |
InferSent | 1. Similar to the siamese net, but the RNN encodings are combined by concatenation, multiplication, and subtraction 2. The combined features are concatenated for the final inference |
ESIM | 1. A more complex network built from RNNs, attention, composition, and inference 2. It involves several equations; the code is below, see the paper for the formulas |
Since the full code is long, only the network definitions are listed below. See the repository for the complete code.
Siam_CNN
class LinModel(nn.Module):
def __init__(self, in_features, out_features):
super(LinModel, self).__init__()
self.fc_1 = nn.Sequential(
nn.Linear(in_features, 256),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_2 = nn.Sequential(
nn.Linear(256, 32),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_3 = nn.Sequential(
nn.Linear(32, 4),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_4 = nn.Sequential(
nn.Linear(4, out_features),
)
self.softmax = nn.Softmax(1)
def forward(self, X):
X = self.fc_1(X)
X = self.fc_2(X)
X = self.fc_3(X)
output = self.fc_4(X)
return self.softmax(output)
class SiamCNN(nn.Module):
def __init__(self, wv_mode, config):
super(SiamCNN, self).__init__()
self.device = config.device
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
for i in range(0, config.vocab_size):
word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Create the embedding layer
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
if config.update_embed is False:
self.embedding.weight.requires_grad = False
self.conv_1 = nn.Sequential(
nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=2, stride=1),
nn.ReLU(),
nn.MaxPool1d(3))
self.conv_2 = nn.Sequential(
nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=3, stride=1),
nn.ReLU(),
nn.MaxPool1d(3))
self.conv_3 = nn.Sequential(
nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=5, stride=1),
nn.ReLU(),
nn.MaxPool1d(3))
self.flattern = nn.Flatten()
        # Pooling layer
        self.max_pool = nn.MaxPool1d(3)
        # Linear head
        self.lin_model = LinModel(1552, 2)
    # Cosine similarity of two vectors
    def cos_sim(self, vector_a, vector_b):
        """
        Compute the cosine similarity between two vectors.
        :param vector_a: vector a
        :param vector_b: vector b
        :return: sim
        """
        return torch.tensor([torch.cosine_similarity(vector_a, vector_b, 0, 1e-8)])
    def forward_one(self, text):
        # Encode one sentence
        x = self.embedding(text)
        conv_1 = self.conv_1(x)
        conv_2 = self.conv_2(x)
        conv_3 = self.conv_3(x)
        # Concatenate the outputs of the three conv branches
        x = torch.cat([conv_1, conv_2, conv_3], 2)
        x = x.view(x.size(0), -1)
        return self.lin_model(x)
    def forward(self, words_a, words_b):
        # words_a: (batch_size, seq_len), e.g. (32, 27)
        # Encode sentence A
        x_a = self.forward_one(words_a)
        # Encode sentence B
        x_b = self.forward_one(words_b)
        return x_a, x_b
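The forward pass returns a pair of encodings (x_a, x_b), but the loss that compares them is not shown in this excerpt. A common choice for siamese outputs is the contrastive loss; below is a plain-numpy sketch (the function name and margin value are illustrative, not from the original code):

```python
import numpy as np

def contrastive_loss(x_a, x_b, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar pairs."""
    d = np.linalg.norm(x_a - x_b, axis=1)               # Euclidean distance per pair
    pos = label * d ** 2                                # pull similar pairs together
    neg = (1 - label) * np.maximum(margin - d, 0) ** 2  # push dissimilar pairs apart
    return np.mean(pos + neg)

x_a = np.array([[0.0, 0.0], [0.0, 0.0]])
x_b = np.array([[0.0, 0.0], [3.0, 4.0]])
label = np.array([0, 1])  # pair 1 labeled dissimilar, pair 2 labeled similar
print(contrastive_loss(x_a, x_b, label))  # -> 13.0
```

Here pair 1 is dissimilar but has zero distance (penalty 1 = margin²), while pair 2 is similar but far apart (penalty 25 = d²), so the mean is 13.0.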
Siam_RNN
class LinModel(nn.Module):
def __init__(self, in_features, out_features):
super(LinModel, self).__init__()
self.fc_1 = nn.Sequential(
nn.Linear(in_features, 256),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_2 = nn.Sequential(
nn.Linear(256, 32),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_3 = nn.Sequential(
nn.Linear(32, 4),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_4 = nn.Sequential(
nn.Linear(4, out_features),
)
self.softmax = nn.Softmax(1)
def forward(self, X):
X = self.fc_1(X)
X = self.fc_2(X)
X = self.fc_3(X)
output = self.fc_4(X)
return self.softmax(output)
class SiamLSTM(nn.Module):
def __init__(self, wv_mode, config):
super(SiamLSTM, self).__init__()
self.device = config.device
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
for i in range(0, config.vocab_size):
word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Create the embedding layer
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # RNN
        self.rnn = nn.LSTM(input_size=config.embed_dim, hidden_size=10, num_layers=1)
        # Linear head
        self.lin_model = LinModel(270, 2)
    def forward_one(self, text):
        x = self.embedding(text)  # embedding lookup
        # The RNN expects (L, B, H), so swap the batch and length dims
        x = x.transpose(0, 1)
        x, _ = self.rnn(x)
        # Swap back after the RNN
        x = x.transpose(0, 1)
        x = x.contiguous().view(x.size(0), -1)
        return self.lin_model(x)
    def forward(self, words_a, words_b):
        # Encode sentence A
        x_a = self.forward_one(words_a)
        # Encode sentence B
        x_b = self.forward_one(words_b)
        return x_a, x_b
InferSent
class LinModel(nn.Module):
def __init__(self, in_features, out_features):
super(LinModel, self).__init__()
self.fc_1 = nn.Sequential(
nn.Linear(in_features, 256),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_2 = nn.Sequential(
nn.Linear(256, 32),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_3 = nn.Sequential(
nn.Linear(32, 4),
nn.ReLU(),
nn.Dropout(0.02)
)
self.fc_4 = nn.Sequential(
nn.Linear(4, out_features),
)
self.softmax = nn.Softmax(1)
def forward(self, X):
X = self.fc_1(X)
X = self.fc_2(X)
X = self.fc_3(X)
output = self.fc_4(X)
return self.softmax(output)
class InferSent(nn.Module):
def __init__(self, wv_mode, config):
super(InferSent, self).__init__()
self.device = config.device
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
for i in range(0, config.vocab_size):
word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Create the embedding layer
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # Two-layer bidirectional LSTM
        self.rnn = nn.LSTM(input_size=config.embed_dim, hidden_size=10, num_layers=2, bidirectional=True)
        # Linear head
        self.lin_model = LinModel(2160, 2)
    def forward(self, words_a, words_b):
        # Encode sentence A
        x_a = self.embedding(words_a)  # embedding lookup
        # The RNN expects (L, B, H): swap dims, run the LSTM, swap back
        x_a = x_a.transpose(0, 1)
        x_a, _ = self.rnn(x_a)
        x_a = x_a.transpose(0, 1)
        # Encode sentence B
        x_b = self.embedding(words_b)
        x_b = x_b.transpose(0, 1)
        x_b, _ = self.rnn(x_b)
        x_b = x_b.transpose(0, 1)
        '''
        Three ways of combining the encodings
        shapes:
        sentence-1 encoding x_a: (128, 27, 20)
        sentence-2 encoding x_b: (128, 27, 20)
        concatenation  X_1: (128, 27, 40)
        multiplication X_2: (128, 27, 20)
        subtraction    X_3: (128, 27, 20)
        '''
        # Method 1: concatenation
        X_1 = torch.cat([x_a, x_b], 2)
        # Method 2: element-wise multiplication
        X_2 = torch.mul(x_a, x_b)
        # Method 3: subtraction
        X_3 = torch.sub(x_a, x_b)
        # Concatenate all three combinations and flatten
        X = torch.cat([X_1, X_2, X_3], 2)  # (128, 27, 80)
        X = X.view(X.size(0), -1)          # (128, 2160)
        # Linear inference head
        output = self.lin_model(X)
        return output
ESIM
class RNNDropout(nn.Dropout):
    # Zero out some embedding dimensions, sharing the mask across timesteps
    def forward(self, sequences_batch):  # (B, L, D)
        # All-ones tensor of shape (B, D)
        ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
        # Random dropout mask of shape (B, D)
        dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
        # Apply the mask to the data (B, L, D); unsqueeze adds the length dim
        return dropout_mask.unsqueeze(1) * sequences_batch
# Custom stacked RNN
class StackedBRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers,
                 dropout_rate=0, dropout_output=False, rnn_type=nn.LSTM,
                 concat_layers=False):
        super().__init__()
        # Store the configuration
        self.dropout_output = dropout_output
        self.dropout_rate = dropout_rate
        self.num_layers = num_layers
        self.concat_layers = concat_layers  # use the last layer or the concatenation of all layers
        self.rnns = nn.ModuleList()
        # Stack the configured number of RNN layers in a ModuleList
        for i in range(num_layers):
            # From the second layer on, the input is the bidirectional hidden size
            if i != 0:
                input_size = 2 * hidden_size
            self.rnns.append(rnn_type(input_size, hidden_size, num_layers=1, bidirectional=True))
    def forward(self, x):  # (B, L, D)
        # Convert to the layout the RNN expects
        x = x.transpose(0, 1)  # (L, B, D)
        # Record the outputs of each layer, starting from the input
        outputs = [x]
        for i in range(self.num_layers):
            rnn_input = outputs[-1]
            # dropout
            if self.dropout_rate > 0:
                rnn_input = F.dropout(rnn_input, p=self.dropout_rate, training=self.training)
            # Feed the previous layer's output into the current layer
            rnn_output = self.rnns[i](rnn_input)[0]  # keep only the output, not (h_n, c_n)
            outputs.append(rnn_output)
        if self.concat_layers:  # concatenate all layer outputs
            # Index 0 is the input x, so take layers from index 1 on
            output = torch.cat(outputs[1:], 2)  # (L, B, D)
        else:  # use only the last layer
            output = outputs[-1]  # (L, B, D)
        # Restore the original layout
        output = output.transpose(0, 1)  # (B, L, D)
        # dropout
        if self.dropout_output and self.dropout_rate > 0:
            output = F.dropout(output, p=self.dropout_rate, training=self.training)  # (B, L, D)
        # transpose leaves the tensor non-contiguous in memory; contiguous() fixes that
        return output.contiguous()
class BidirectionalAttention(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, v1, v1_mask, v2, v2_mask):
        '''
        v1 (B, L, H)
        v1_mask (B, L)
        v2 (B, R, H)
        v2_mask (B, R)
        '''
        # 1. similarity matrix v1 @ v2^T
        similarity_matrix = v1.bmm(v2.transpose(2, 1).contiguous())  # (B, L, R)
        # 2. mask out PAD positions before the softmax (no attention to padding)
        # 3. softmax
        # Mask the weights corresponding to PAD positions in v1;
        # v1_mask (B, L) is unsqueezed to (B, L, 1)
        v2_v1_attn = F.softmax(
            similarity_matrix.masked_fill(
                v1_mask.unsqueeze(2), -1e7), dim=1)  # (B, L, R)
        # Mask the weights corresponding to PAD positions in v2;
        # v2_mask (B, R) is unsqueezed to (B, 1, R)
        v1_v2_attn = F.softmax(
            similarity_matrix.masked_fill(
                v2_mask.unsqueeze(1), -1e7), dim=2)  # (B, L, R)
        # 4. apply the attention
        # v2's contribution to v1
        # attented_v1: (B, L, R) @ (B, R, H)
        attented_v1 = v1_v2_attn.bmm(v2)  # (B, L, H)
        # v1's contribution to v2
        # v2_v1_attn: (B, L, R) -> (B, R, L) @ (B, L, H)
        attented_v2 = v2_v1_attn.transpose(1, 2).bmm(v1)  # (B, R, H)
        # Zero out the PAD positions; masked_fill is out-of-place,
        # so the in-place variant masked_fill_ is needed here
        attented_v1.masked_fill_(v1_mask.unsqueeze(2), 0)
        attented_v2.masked_fill_(v2_mask.unsqueeze(2), 0)
        return attented_v1, attented_v2
class ESIM(nn.Module):
def __init__(self, wv_mode, config: Config):
super(ESIM, self).__init__()
# ----------------------- encoding ---------------------#
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
for i in range(0, config.vocab_size):
word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Create the embedding layer
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # dropout applied before the RNNs
        self.rnn_dropout = RNNDropout(config.dropout)
        rnn_size = config.hidden_size
        if config.concat_layers is True:
            rnn_size //= config.num_layers
        config.hidden_size = rnn_size // 2 * 2 * 2  # first *2: bidirectional; second *2: concatenation
self.input_encoding = StackedBRNN(input_size=config.embed_dim,
hidden_size=rnn_size // 2,
num_layers=config.num_layers,
rnn_type=nn.LSTM,
concat_layers=config.concat_layers)
# ----------------------- encoding ---------------------#
        # ----------------------- attention ---------------------#
        self.attention = BidirectionalAttention()
        # ----------------------- attention ---------------------#
        # ----------------------- composition ---------------------#
        self.projection = nn.Sequential(
            nn.Linear(4 * config.hidden_size, config.hidden_size),
            nn.ReLU()
        )
        self.composition = StackedBRNN(input_size=config.hidden_size,
                                       hidden_size=rnn_size // 2,
                                       num_layers=config.num_layers,
                                       rnn_type=nn.LSTM,
                                       concat_layers=config.concat_layers)
        # ----------------------- composition ---------------------#
        # ----------------------- inference ---------------------#
        self.classification = nn.Sequential(
            nn.Dropout(p=config.dropout),
            nn.Linear(4 * config.hidden_size, config.hidden_size),
            nn.Tanh(),
            nn.Dropout(p=config.dropout))
        self.out = nn.Linear(config.hidden_size, config.num_labels)
        # ----------------------- inference ---------------------#
    def forward(self, words_a, words_b):
        '''
        Dimension legend
        B: batch_size
        L: length of sentence a
        R: length of sentence b
        D: embedding size
        H: hidden size
        '''
        query = words_a  # (B, L)
        doc = words_b    # (B, R)
        # ----------------------- encoding ---------------------#
        # Build masks: positions equal to 0 are PAD
        # query: [2,3,4,5,0,0,0] -> query_mask: [0,0,0,0,1,1,1]
        query_mask = (query == 0)  # (B, L)
        doc_mask = (doc == 0)      # (B, R)
        # Embedding lookup
        query = self.embedding(query)  # (B, L, D)
        doc = self.embedding(doc)      # (B, R, D)
        # dropout: randomly zero some embedding dimensions
        query = self.rnn_dropout(query)  # (B, L, D)
        doc = self.rnn_dropout(doc)      # (B, R, D)
        # Encode with ESIM's stacked bidirectional RNN
        query = self.input_encoding(query)  # (B, L, H)
        doc = self.input_encoding(doc)      # (B, R, H)
        # ----------------------- encoding ---------------------#
        # ----------------------- attention ---------------------#
        '''
        1. compute the similarity matrix of the two sentences
        2. mask out the PAD positions before attention
        3. softmax
        4. compute the attention
        '''
        attended_query, attended_doc = self.attention(query, query_mask, doc, doc_mask)
        # ----------------------- attention ---------------------#
        # ----------------------- concatenation ---------------------#
        # Concatenate the encodings with their attended versions; this is m in the paper
        enhanced_query = torch.cat([query, attended_query, query - attended_query, query * attended_query],
                                   dim=-1)  # (B, L, 4*H)
        enhanced_doc = torch.cat([doc, attended_doc, doc - attended_doc, doc * attended_doc],
                                 dim=-1)  # (B, R, 4*H)
        # ----------------------- concatenation ---------------------#
        # ----------------------- composition ---------------------#
        # Project the concatenated tensors; F(m) in the paper
        projected_query = self.projection(enhanced_query)  # (B, L, H)
        projected_doc = self.projection(enhanced_doc)      # (B, R, H)
        # Run the bidirectional RNN again
        query = self.composition(projected_query)  # (B, L, H)
        doc = self.composition(projected_doc)      # (B, R, H)
        # ----------------------- composition ---------------------#
        # ----------------------- pooling ---------------------#
        '''
        1. average pooling
        2. max pooling
        3. concatenate the 4 pooled tensors
        '''
        # Averaging over padded positions would be biased, so invert the mask
        # and divide by the true sentence length
        reverse_query_mask = 1. - query_mask.float()  # (B, L)
        reverse_doc_mask = 1. - doc_mask.float()      # (B, R)
        # average pooling
        query_avg = torch.sum(query * reverse_query_mask.unsqueeze(2), dim=1) / (
            torch.sum(reverse_query_mask, dim=1, keepdim=True) + 1e-8)  # (B, H)
        doc_avg = torch.sum(doc * reverse_doc_mask.unsqueeze(2), dim=1) / (
            torch.sum(reverse_doc_mask, dim=1, keepdim=True) + 1e-8)  # (B, H)
        # Fill PAD positions with a large negative value so max() never picks them
        query = query.masked_fill(query_mask.unsqueeze(2), -1e7)
        doc = doc.masked_fill(doc_mask.unsqueeze(2), -1e7)
        # max pooling
        query_max, _ = query.max(dim=1)  # (B, H)
        doc_max, _ = doc.max(dim=1)      # (B, H)
        # concatenate
        X = torch.cat([query_avg, query_max, doc_avg, doc_max], dim=-1)
        # ----------------------- pooling ---------------------#
        # ----------------------- inference ---------------------#
        X = self.classification(X)
        output = self.out(X)
        # ----------------------- inference ---------------------#
        return output
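The masked average pooling in the forward pass is worth illustrating on its own, since naively averaging over the padded length biases short sentences toward zero. A small numpy sketch of the same computation:

```python
import numpy as np

# Batch of 1 sentence, length 4, hidden size 2; the last two positions are PAD
h = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]])  # (B, L, H)
pad_mask = np.array([[False, False, True, True]])                  # True where PAD

keep = (~pad_mask).astype(float)  # 1.0 at real tokens, 0.0 at PAD
avg = (h * keep[..., None]).sum(axis=1) / (keep.sum(axis=1, keepdims=True) + 1e-8)
print(avg)  # averages only the two real positions, approximately [[2., 3.]]
```

Dividing by the padded length 4 instead would give [[1., 1.5]], halving the representation of this short sentence.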
Conclusions for the traditional networks
Looking at the overall scores, accuracy improves by roughly 0.02 at each step from the siamese network to InferSent to ESIM. So for this text-similarity task: tfidf < Siamese < InferSent < ESIM. That said, the hyperparameters were barely tuned, so these accuracies are indicative only.
Category | Model | Details | Score |
---|---|---|---|
tfidf | tfidf.py | 1. character-count difference 2. Baidu stopword list 3. word-count difference after stopword removal 4. tfidf | bq_corpus: 0.6533 lcqmc: 0.7343 paws-x: 0.5585 score: 0.6487 |
SiamCNN | SiamCNN_LSTM.py | 1. gensim word vectors initialize the embedding layer 2. siamese CNN + linear head | bq_corpus: 0.6849 lcqmc: 0.753 paws-x: 0.5405 score: 0.6595 |
SiamLSTM | SiamCNN_LSTM.py | 1. SiamCNN with the encoder swapped for a siamese LSTM + linear head | bq_corpus: 0.6964 lcqmc: 0.77 paws-x: 0.5735 score: 0.68 |
InferSent | InferSent.py | 1. SiamCNN with the model swapped for InferSent | bq_corpus: 0.7264 lcqmc: 0.778 paws-x: 0.6055 score: 0.7033 |
ESIM | ESIM.py | 1. PAD mapped to dictionary index 0 2. SiamCNN with the model swapped for ESIM | bq_corpus: 0.7557 lcqmc: 0.7744 paws-x: 0.632 score: 0.7207 |
BERT models
These days almost every NLP task is tackled with BERT, mainly because it delivers better model performance. BERT handles many NLP tasks; for this one we can use BERTForSequenceClassification. Since BERT is end-to-end, I will also make a series of modifications on top of it along the way.
Note that BERT is a large pre-trained model and extremely GPU-hungry, so consider your GPU budget before running. The free GPU resources on Baidu AIStudio are a good option.
The modification schemes are described below.
Approach
Plain BERT baseline
The first step is to do nothing clever and simply run BERT as-is, trusting it to do the whole job. The rough pipeline:
- Load the datasets and drop empty rows from the train and dev sets
- Choose the BERT variant; initially Baidu's ERNIE-Gram
- Encode the text with the tokenizer to get input_ids, token_type_ids, and attention_mask
- Build the Dataset and Dataloader, truncating and padding the encoded text in collator_fn
- Create the BERT model, train, and validate
- Predict
The final score is 0.8299, about 0.1 higher than ESIM: even a plain BERT run beats the traditional networks by a wide margin. The training code is below; the rest is in the gitee repository.
def train(config: Config, train_dataloader: DataLoader, dev_dataloader: DataLoader):
    # Create the model
model = AutoModelForSequenceClassification.from_pretrained(config.model_path,num_classes=config.num_labels)
    # Optimizer
opt = optimizer.AdamW(learning_rate=config.learning_rate, parameters=model.parameters())
    # Loss function
loss_fn = nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
    # Training loop
for epoch in range(config.epochs):
model.train()
for iter_id, mini_batch in enumerate(train_dataloader):
input_ids = mini_batch['input_ids']
token_type_ids = mini_batch['token_type_ids']
attention_mask = mini_batch['attention_mask']
labels = mini_batch['labels']
            # -------- this call hit a paddle bug (PR submitted upstream) -------- #
            # logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
            # -------- working version -------- #
            logits = model(input_ids=input_ids, token_type_ids=token_type_ids)
            # Compute the loss
loss = loss_fn(logits, labels)
            # Compute the accuracy metric
probs = paddle.nn.functional.softmax(logits, axis=1)
correct = metric.compute(probs, labels)
metric.update(correct)
acc = metric.accumulate()
            # Backpropagation
loss.backward()
opt.step()
opt.clear_grad()
            # Log training metrics
if iter_id%config.print_loss == 0:
print('epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(epoch, iter_id, loss, acc))
        # Validate on the dev set after each epoch
avg_val_loss, acc = evaluation(model, loss_fn, metric, dev_dataloader)
print('-' * 50)
print('epoch: {}, val_loss: {}, val_acc: {}'.format(epoch, avg_val_loss, acc))
print('-' * 50)
return model
Adversarial training with BERT
Adversarial training means that during training, noise is added to some of the model's parameters and the batch is trained a second time. Under this design each batch is trained twice (once with the normal parameters, once with the perturbed ones), which tends to improve generalization. The pipeline:
- Load the datasets and drop empty rows from the train and dev sets
- Choose the BERT variant; initially Baidu's ERNIE-Gram
- Encode the text with the tokenizer to get input_ids, token_type_ids, and attention_mask
- Build the Dataset and Dataloader, truncating and padding the encoded text in collator_fn
- Create the BERT model and train on the clean parameters
- Backpropagate
- Add noise to some parameters and train on the same batch again
- Backpropagate, then remove the noise to restore the parameters
- Predict
The final score is 0.8304, a marginal 0.0005 above the plain BERT run; so on this task, adversarial training helps the model a little. The training code is below; the rest is in the gitee repository.
# Training
def train(config: Config, train_dataloader: DataLoader, dev_dataloader: DataLoader):
    # Create the model
model = AutoModelForSequenceClassification.from_pretrained(config.model_path,num_classes=config.num_labels)
    # Optimizer
opt = optimizer.AdamW(learning_rate=config.learning_rate, parameters=model.parameters())
    # Loss function
loss_fn = nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
    # Optionally enable adversarial training
    if config.adv == 'fgm':
        adver_method = extra_fgm.FGM(model)
best_acc = 0
    # Training loop
for epoch in range(config.epochs):
model.train()
for iter_id, mini_batch in enumerate(train_dataloader):
input_ids = mini_batch['input_ids']
token_type_ids = mini_batch['token_type_ids']
attention_mask = mini_batch['attention_mask']
labels = mini_batch['labels']
logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
            # Compute the loss
loss = loss_fn(logits, labels)
            # Compute the accuracy metric
probs = paddle.nn.functional.softmax(logits, axis=1)
correct = metric.compute(probs, labels)
metric.update(correct)
acc = metric.accumulate()
opt.clear_grad()
loss.backward()
            # If adversarial training is enabled
            if config.adv == 'fgm':
                # Perturb the embedding parameters by r
                adver_method.attack(epsilon=config.eps)
                # Forward pass on x + r
                logits_adv = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
                loss_adv = loss_fn(logits_adv, labels)
                # Backpropagate; these gradients accumulate onto the clean-pass gradients
                loss_adv.backward()
                # Restore the embedding to its pre-attack values
                adver_method.restore()
            # Optimizer step
opt.step()
            # Log training metrics
if iter_id%config.print_loss == 0:
print('epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(epoch, iter_id, loss, acc))
        # Validate on the dev set after each epoch
avg_val_loss, acc = evaluation(model, loss_fn, metric, dev_dataloader)
print('-' * 50)
print('epoch: {}, val_loss: {}, val_acc: {}'.format(epoch, avg_val_loss, acc))
print('-' * 50)
        # Keep the best model
        if best_acc < acc:
            best_acc = acc
            # Save the model and tokenizer
            model.save_pretrained('./checkpoint/'+config.dataset+'/'+config.model_path+config.model_suffix)
            config.tokenizer.save_pretrained('./checkpoint/'+config.dataset+'/'+config.model_path+config.model_suffix)
return model
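The `extra_fgm.FGM` helper used above is not shown in this excerpt. FGM (Fast Gradient Method) perturbs the embedding weights by epsilon * grad / ||grad|| before the adversarial pass, then restores them. A framework-agnostic numpy sketch of the attack/restore idea (the class shape and the manual gradient argument are illustrative; the real helper reads gradients from the model's embedding parameters):

```python
import numpy as np

class FGM:
    def __init__(self, embedding_weight):
        self.w = embedding_weight  # perturbed in place, like a model parameter
        self.backup = None

    def attack(self, grad, epsilon=0.5):
        # Save the clean weights, then step in the normalized gradient direction
        self.backup = self.w.copy()
        norm = np.linalg.norm(grad)
        if norm > 0:
            self.w += epsilon * grad / norm

    def restore(self):
        # Put the clean weights back after the adversarial pass
        self.w[...] = self.backup
        self.backup = None

w = np.array([1.0, 0.0])
fgm = FGM(w)
fgm.attack(np.array([0.0, 2.0]), epsilon=0.5)
print(w)  # perturbed along the normalized gradient -> [1.  0.5]
fgm.restore()
print(w)  # -> [1. 0.]
```

Because the perturbation follows the loss gradient, it is the "worst-case" small noise for the current batch, which is why FGM regularizes better than random noise of the same magnitude.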
Data augmentation
The usual text augmentation is EDA: creating new sentences via synonym replacement, insertion, deletion, and so on. I tried it, and it did not work well here. I believe the core issue is that this kind of augmentation distorts the training sentences, so the model ends up learning distorted text.
Instead I used a transitive augmentation: if A is related to B and A is related to C, then I assume B is also related to C. This expansion yields more training data without distorting any sentence. The training procedure itself is unchanged; only the augmentation differs, so only the augmentation code is shown:
# Data augmentation
def aug_group_by_a(df):
    aug_data = defaultdict(list)
    # Group rows that share the same sentence in text_a
    for g, data in df.groupby(by=['text_a']):
        if len(data) < 2:
            continue
        for i in range(len(data)):
            for j in range(i + 1, len(data)):
                # b and the label of the pair (a, b)
                row_i_text = data.iloc[i, 1]
                row_i_label = data.iloc[i, 2]
                # c and the label of the pair (a, c)
                row_j_text = data.iloc[j, 1]
                row_j_label = data.iloc[j, 2]
if row_i_label == row_j_label == 0:
continue
aug_label = 1 if row_i_label == row_j_label == 1 else 0
aug_data['text_a'].append(row_i_text)
aug_data['text_b'].append(row_j_text)
aug_data['label'].append(aug_label)
return pd.DataFrame(aug_data)
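The transitivity rule can be checked on a toy example in pure Python, independent of the pandas implementation above:

```python
from collections import defaultdict

pairs = [("A", "B", 1), ("A", "C", 1), ("A", "D", 0)]

# Group pair partners by the shared sentence text_a
by_a = defaultdict(list)
for a, b, label in pairs:
    by_a[a].append((b, label))

augmented = []
for a, partners in by_a.items():
    for i in range(len(partners)):
        for j in range(i + 1, len(partners)):
            (b, lb), (c, lc) = partners[i], partners[j]
            if lb == lc == 0:
                continue  # two negatives tell us nothing about (b, c)
            augmented.append((b, c, 1 if lb == lc == 1 else 0))

print(augmented)  # -> [('B', 'C', 1), ('B', 'D', 0), ('C', 'D', 0)]
```

Two positives yield a new positive, a positive plus a negative yields a new negative, and two negatives are skipped, exactly mirroring the label logic in `aug_group_by_a`.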
The final score of this experiment is 0.832, a slight improvement over adversarial training alone. So adversarial training plus data augmentation does effectively improve model performance.
Separate truncation and padding
Initially my maximum sentence length was computed from a single sentence, while truncation and padding were applied after concatenating the two sentences. Experiments revealed a problem: truncation can keep all of sentence A while cutting off sentence B, so the model cannot weigh the two sentences fairly.
To fix this I made two changes:
- Double all the maximum lengths, so that both sentences get their share
- Hand truncation entirely to the tokenizer, per sentence: with a maximum of 88 x 2, sentence A gets up to 88 tokens and sentence B gets up to 88, so the input sequence does not favor either sentence
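The per-sentence truncation idea can be sketched without a real tokenizer (the helper below is illustrative, not the paddlenlp API):

```python
def truncate_pair(tokens_a, tokens_b, max_len_each):
    # Truncate each sentence independently so neither side is favored
    return tokens_a[:max_len_each], tokens_b[:max_len_each]

a = list(range(100))   # stands in for a 100-token sentence A
b = list(range(10))    # a short sentence B
ta, tb = truncate_pair(a, b, 88)
print(len(ta), len(tb))  # -> 88 10
```

With joint truncation of the concatenated pair at length 88, sentence B would have been cut entirely; per-sentence truncation keeps it intact.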
Part of the code:
# Load the data
def read_data(config: Config):
if config.operation == 'train':
train = pd.read_csv('data/data52714/' + config.dataset + '/train.tsv', sep='\t',
names=['text_a', 'text_b', 'label'])
dev = pd.read_csv('data/data52714/' + config.dataset + '/dev.tsv', sep='\t',
names=['text_a', 'text_b', 'label'])
test_size = len(dev) / (len(train)+len(dev))
if len(set(train['label'])) > 2:
train = train[train['label'].isin(['0', '1'])]
train['label'] = train['label'].astype('int')
train = train.dropna()
        if len(set(dev['label'])) > 2:
dev = dev[dev['label'].isin(['0', '1'])]
dev['label'] = dev['label'].astype('int')
dev = dev.dropna()
# 最终返回的数据
data = pd.concat([train, dev])
# 数据增强,加大训练集数据量
if config.need_data_aug is True:
aug_train = aug_group_by_a(train)
aug_dev = aug_group_by_a(dev)
# 拼接数据
data = pd.concat([data, aug_train, aug_dev])
# 随机切分数据
X = data[['text_a', 'text_b']]
y = data['label']
X_train, X_dev, y_train, y_dev = train_test_split(
X, y, random_state=config.random_seed, test_size=test_size)
X_train['label'] = y_train
X_dev['label'] = y_dev
# tokenizer
tokenizer = config.tokenizer
data_df = {'train': X_train, 'dev': X_dev}
full_data_dict = {}
for k, df in data_df.items():
inputs = defaultdict(list)
for i, row in tqdm(df.iterrows(), desc='encode {} data'.format(k), total=len(df)):
seq_a = row[0]
seq_b = row[1]
label = row[2]
inputs_dict = tokenizer.encode(seq_a, seq_b, return_special_tokens_mask=True,
return_token_type_ids=True,
return_attention_mask=True, max_seq_len=config.max_seq_len,
pad_to_max_seq_len=True)
inputs['input_ids'].append(inputs_dict['input_ids'])
inputs['token_type_ids'].append(inputs_dict['token_type_ids'])
inputs['attention_mask'].append(inputs_dict['attention_mask'])
inputs['labels'].append(label)
full_data_dict[k] = inputs
return full_data_dict['train'], full_data_dict['dev']
elif config.operation == 'predict':
test = pd.read_csv('data/data52714/' + config.dataset + '/test.tsv', sep='\t', names=['text_a', 'text_b'])
test['label'] = 0
# tokenizer
tokenizer = config.tokenizer
data_df = {'test': test}
full_data_dict = {}
for k, df in data_df.items():
inputs = defaultdict(list)
for i, row in tqdm(df.iterrows(), desc='encode {} data'.format(k), total=len(df)):
seq_a = row[0]
seq_b = row[1]
label = row[2]
inputs_dict = tokenizer.encode(seq_a, seq_b, return_special_tokens_mask=True,
return_token_type_ids=True,
return_attention_mask=True, max_seq_len=config.max_seq_len,
pad_to_max_seq_len=True)
inputs['input_ids'].append(inputs_dict['input_ids'])
inputs['token_type_ids'].append(inputs_dict['token_type_ids'])
inputs['attention_mask'].append(inputs_dict['attention_mask'])
inputs['labels'].append(label)
full_data_dict[k] = inputs
return full_data_dict['test'], len(test)
else:
raise Exception('错误的模型行为!')
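The effect of the doubled budget can be seen with a plain-string sketch (hypothetical helper names; in the real pipeline the truncation is done inside the tokenizer):

```python
def naive_truncate(a, b, max_len):
    """Concatenate first, then truncate: sentence B always takes the cut."""
    return (a + b)[:max_len]

def symmetric_truncate(a, b, max_len):
    """Give each sentence half of the (doubled) budget before joining."""
    half = max_len // 2
    return a[:half] + b[:half]

a, b = 'A' * 88, 'B' * 88
print(naive_truncate(a, b, 88).count('B'))       # 0  -> sentence B is cut entirely
print(symmetric_truncate(a, b, 176).count('B'))  # 88 -> both sentences survive
```

With the naive scheme the model sees nothing of sentence B for long inputs; with the symmetric scheme both sentences contribute equally to the input sequence.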
Final score: 0.855, a big jump over the previous model. The hypothesis was right: the model needs to receive information from both sentences as evenly as possible.
5-fold model fusion
The last step is straightforward 5-fold model fusion.
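The fold split itself is unremarkable; for reference, here is a sketch using scikit-learn's KFold (the author's create_dataloader presumably produces the per-fold loaders internally):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # stand-in for the training examples
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Each fold holds out 1/5 of the data as its dev set
sizes = [(len(tr), len(dv)) for tr, dv in kf.split(data)]
print(sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```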
# Training
def train(config: Config):
    # K-fold cross training
    for k in range(config.k_flod):
        k += 4  # shift the fold index (resuming from fold 4)
        # Load the data
        train_dataloader, dev_dataloader = create_dataloader(conf)
        # Build the model
        model = AutoModelForSequenceClassification.from_pretrained(config.model_path, num_classes=config.num_labels)
        # Define the optimizer
        num_training_steps = len(train_dataloader) * config.epochs
        lr_scheduler = LinearDecayWithWarmup(config.learning_rate, num_training_steps, 0.1)
        decay_params = [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ]
        opt = optimizer.AdamW(learning_rate=lr_scheduler,
                              parameters=model.parameters(),
                              weight_decay=0.01,
                              apply_decay_param_fun=lambda x: x in decay_params)
        # Define the loss function and metric
        loss_fn = nn.loss.CrossEntropyLoss()
        metric = paddle.metric.Accuracy()
        # Check whether adversarial training is enabled
        if conf.adv == 'fgm':
            adver_method = extra_fgm.FGM(model)
        best_acc = 0
        # Train for the configured number of epochs
        for epoch in range(config.epochs):
            model.train()
            for iter_id, mini_batch in enumerate(train_dataloader):
                input_ids = mini_batch['input_ids']
                token_type_ids = mini_batch['token_type_ids']
                attention_mask = mini_batch['attention_mask']
                labels = mini_batch['labels']
                logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
                # Compute the loss
                loss = loss_fn(logits, labels)
                # Compute and accumulate accuracy
                probs = paddle.nn.functional.softmax(logits, axis=1)
                correct = metric.compute(probs, labels)
                metric.update(correct)
                acc = metric.accumulate()
                loss.backward()
                # Adversarial training step (FGM)
                if conf.adv == 'fgm':
                    # Perturb the embeddings: x -> x + r
                    adver_method.attack(epsilon=conf.eps)
                    # Forward pass on the perturbed input
                    logits_adv = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
                    loss_adv = loss_fn(logits_adv, labels)
                    # Backward pass; the adversarial gradients accumulate onto the clean ones
                    loss_adv.backward()
                    # Restore the original embeddings
                    adver_method.restore()
                # Optimizer step
                opt.step()
                lr_scheduler.step()
                opt.clear_grad()
                # Log training metrics
                if iter_id % config.print_loss == 0:
                    print('k:{}, epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(k, epoch, iter_id, loss, acc))
            # Validate on the dev set after each epoch
            avg_val_loss, avg_val_acc = evaluation(model, loss_fn, metric, dev_dataloader)
            print('-' * 50)
            print('k:{}, epoch: {}, val_loss: {}, val_acc: {}'.format(k, epoch, avg_val_loss, avg_val_acc))
            print('-' * 50)
            model.save_pretrained('./checkpoint/'+conf.dataset+'/k_flod/'+conf.model_path+'_'+str(k))
            conf.tokenizer.save_pretrained('./checkpoint/'+conf.dataset+'/k_flod/'+conf.model_path+'_'+str(k))
    return model
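For reference, the perturbation that extra_fgm.FGM applies to the embedding matrix in attack() is r = ε·g/‖g‖, and restore() simply puts the saved weights back. The arithmetic in isolation (a numpy stand-in, not the actual Paddle implementation):

```python
import numpy as np

def fgm_perturb(weights, grad, epsilon=0.5):
    """Return weights + epsilon * grad / ||grad|| (the FGM attack step)."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return weights
    return weights + epsilon * grad / norm

emb = np.zeros(3)
g = np.array([3.0, 0.0, 4.0])            # ||g|| = 5
print(fgm_perturb(emb, g, epsilon=0.5))  # [0.3 0.  0.4]
```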
Final score: 0.864, rank 38. Further improvement would require a large amount of additional research and training time, so the work on this competition stops here.
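The training loop above only saves one checkpoint per fold; the fusion itself happens at prediction time by averaging the per-fold probabilities. A minimal sketch (hypothetical numbers, with the fold count reduced to 3 for brevity):

```python
import numpy as np

def ensemble_predict(fold_probs):
    """Average per-fold softmax probabilities, then take the argmax."""
    return np.mean(fold_probs, axis=0).argmax(axis=-1)

# shape: (folds, samples, classes) -- 3 folds, 2 samples, 2 classes
probs = np.array([
    [[0.9, 0.1], [0.4,  0.6]],
    [[0.6, 0.4], [0.3,  0.7]],
    [[0.2, 0.8], [0.45, 0.55]],
])
print(ensemble_predict(probs))  # [0 1]
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident fold outweigh two uncertain ones, which is usually the safer choice for ensembles of this size.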
BERT network summary
Looking back over the whole process, the final solution was:
- Use Baidu's Ernie-Gram as the base pre-trained model
- Apply data augmentation to enlarge the training set
- Truncate and pad the two sentences separately
- Add adversarial training to improve robustness
- Fuse 5-fold models to boost accuracy
The final scores:
Category | Model | Details | Score |
---|---|---|---|
Vanilla Ernie-Gram | PaddleBERT.py | 1. Vanilla Ernie-Gram | bq_corpus:0.8412 lcqmc:0.8639 paws-x:0.7845 score:0.8299 |
Vanilla BERT + augmentation | TorchBERT | 1. Vanilla chinese-bert-wwm-ext 2. Data augmentation | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.7495 score:0.8112 |
Adversarial training | TorchBERTFGM | 1. Adversarial training on top of the augmented version | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.76 score:0.8147 |
Adversarial training | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.786 score:0.8304 |
Adversarial training + augmentation | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training 2. Data augmentation | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.791 score:0.832 |
Per-dataset configuration | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram-FGM trained with separate hyperparameters per dataset 2. Smaller batch_size and learning rate for paws-x | bq_corpus:0.8363 lcqmc: paws-x:0.8605 score: |
BERT + header | Ernie-Gram-FGM-Header.ipynb | 1. Three-layer linear header on top of Ernie-Gram-FGM | bq_corpus:0.8363 lcqmc:0.8596 paws-x:0.852 score:0.8493 |
Separate padding | Ernie-Gram-分开填充.ipynb | 1. Based on Ernie-Gram-FGM 2. Symmetric per-sentence truncation and padding at the encoding stage | bq_corpus:0.8353 lcqmc:0.8696 paws-x:0.86 score:0.855 |
5-fold | 多模型五折.ipynb | 1. Separate padding 2. 5-fold fusion | bq_corpus:0.848 lcqmc:0.875 paws-x:0.869 score:0.864 |
Competition summary
Across the whole pipeline, the models used and their scores are as follows (a merge of the tables above):
Category | Model | Details | Score |
---|---|---|---|
tfidf | tfidf.py | 1. Character-count difference 2. Baidu stopword list 3. Word-count difference after stopword removal 4. tfidf | bq_corpus:0.6533 lcqmc:0.7343 paws-x:0.5585 score:0.6487 |
SiamCNN | SiamCNN_LSTM.py | 1. gensim word vectors to initialize the embedding layer 2. Siamese CNN + linear layers | bq_corpus:0.6849 lcqmc:0.753 paws-x:0.5405 score:0.6595 |
SiamLSTM | SiamCNN_LSTM.py | 1. SiamCNN with the network swapped for a Siamese LSTM + linear layers | bq_corpus:0.6964 lcqmc:0.77 paws-x:0.5735 score:0.68 |
InferSent | InferSent.py | 1. SiamCNN with the network swapped for InferSent | bq_corpus:0.7264 lcqmc:0.778 paws-x:0.6055 score:0.7033 |
ESIM | ESIM.py | 1. Map PAD to index 0 in the vocabulary 2. SiamCNN with the network swapped for ESIM | bq_corpus:0.7557 lcqmc:0.7744 paws-x:0.632 score:0.7207 |
Vanilla Ernie-Gram | PaddleBERT.py | 1. Vanilla Ernie-Gram | bq_corpus:0.8412 lcqmc:0.8639 paws-x:0.7845 score:0.8299 |
Vanilla BERT + augmentation | TorchBERT | 1. Vanilla chinese-bert-wwm-ext 2. Data augmentation | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.7495 score:0.8112 |
Adversarial training | TorchBERTFGM | 1. Adversarial training on top of the augmented version | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.76 score:0.8147 |
Adversarial training | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.786 score:0.8304 |
Adversarial training + augmentation | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training 2. Data augmentation | bq_corpus:0.8227 lcqmc:0.8614 paws-x:0.791 score:0.832 |
Per-dataset configuration | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram-FGM trained with separate hyperparameters per dataset 2. Smaller batch_size and learning rate for paws-x | bq_corpus:0.8363 lcqmc: paws-x:0.8605 score: |
BERT + header | Ernie-Gram-FGM-Header.ipynb | 1. Three-layer linear header on top of Ernie-Gram-FGM | bq_corpus:0.8363 lcqmc:0.8596 paws-x:0.852 score:0.8493 |
Separate padding | Ernie-Gram-分开填充.ipynb | 1. Based on Ernie-Gram-FGM 2. Symmetric per-sentence truncation and padding at the encoding stage | bq_corpus:0.8353 lcqmc:0.8696 paws-x:0.86 score:0.855 |
5-fold | 多模型五折.ipynb | 1. Separate padding 2. 5-fold fusion | bq_corpus:0.848 lcqmc:0.875 paws-x:0.869 score:0.864 |
Due to space constraints, only the core code is shown here; the full code is available in my gitee repository. That said, there is still plenty of room for improvement:
- Use UDA semi-supervised learning so the model can also learn from part of the test set
- Try other BERT variants, e.g. NEZHA, RoBERTa
- Use large versions of the pre-trained models
- More data analysis to check for noisy or adversarial samples
- Introduce a knowledge graph to assist the model (resources permitting)