Competition Overview
Background
In this transfer-learning task, the iFLYTEK Intelligent Automotive BU provides a relatively large Chinese corpus of in-car human-machine interaction, plus small Chinese-English, Chinese-Japanese and Chinese-Arabic parallel corpora as the training set. Participants build models on these data for intent classification and key-information extraction, and the final evaluation is carried out on English, Japanese and Arabic test data.
The task is characterized by low resources, cross-lingual transfer, multiple sub-tasks and class imbalance.
Data Description
The dataset covers three kinds of in-car interaction utterances: command-and-control, navigation and music. Both the larger Chinese corpus and the smaller multilingual parallel corpora are annotated with intents and key information. Participants need to make full use of the provided data to perform well on intent classification and key-information extraction for the English, Japanese and Arabic corpora. The label types and their value types are listed in the table below.
Parallel corpus: the Chinese and English sentences come in aligned pairs.
Evaluation Metrics
Intent classification accuracy = number of correctly predicted intents / total number of samples
Key-information extraction accuracy = number of samples whose key information is entirely correct / total number of samples
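A minimal sketch of the two metrics, assuming predictions and gold labels sit in DataFrames with the same columns as the training files (意图, 槽值1, 槽值2) and that an empty slot counts as correct when both sides are empty:
import pandas as pd

def intent_accuracy(pred: pd.DataFrame, gold: pd.DataFrame) -> float:
    # intent accuracy = number of correct intents / total number of samples
    return (pred['意图'] == gold['意图']).mean()

def slot_accuracy(pred: pd.DataFrame, gold: pd.DataFrame) -> float:
    # a sample counts as correct only if both slot columns match exactly (NaN matches NaN)
    match1 = (pred['槽值1'] == gold['槽值1']) | (pred['槽值1'].isna() & gold['槽值1'].isna())
    match2 = (pred['槽值2'] == gold['槽值2']) | (pred['槽值2'].isna() & gold['槽值2'].isna())
    return (match1 & match2).mean()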
Solution
Intent classification can be cast as a text-classification problem, and slot recognition as a named-entity-recognition (sequence-labeling) task. The two sub-tasks can be modeled separately or with a single joint model. The preferred route here is a multilingual pretrained model that solves both sub-tasks jointly.
Multilingual pretrained models:
bert-base-multilingual-uncased
bert-base-multilingual-cased
Task1: Competition sign-up
Data loading and dtype inspection
import pandas as pd
train_cn = pd.read_excel('./中文_trian.xlsx')
train_ja = pd.read_excel('./日语_train.xlsx')
train_en = pd.read_excel('./英文_train.xlsx')
test_ja = pd.read_excel('testA.xlsx', sheet_name='日语_testA')
test_en = pd.read_excel('testA.xlsx', sheet_name='英文_testA')
train_cn.shape,train_ja.shape,train_en.shape
((32275, 4), (1002, 5), (1001, 5))
df = pd.concat([pd.DataFrame(train_cn.dtypes),pd.DataFrame(train_ja.dtypes),pd.DataFrame(train_en.dtypes)],axis=1)
df.columns = ['train_cn_dtypes','train_ja_dtypes','train_en_dtypes']
EDA
df_ana = pd.DataFrame()
for label, df in zip(['中文', '英文', '日文'], [train_cn, train_en, train_ja]):
    temp = pd.DataFrame(df.意图.value_counts()).reset_index()
    temp.columns = ['%s意图' % label, '个数']
    df_ana = pd.concat([df_ana, temp], axis=1)
Intent distribution
# count the number of samples per intent label
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='notebook',font='simhei',style='whitegrid')
plt.title("中文标签分布")
sns.countplot(y='意图',data=train_ja)
plt.show()
plt.title("英文标签分布")
sns.countplot(y='意图',data=train_en)
plt.show()
plt.title("日文标签分布")
sns.countplot(y='意图',data=train_cn)
plt.show()
Slot distribution
# collect all intent-slot label combinations from the three training sets
import numpy as np
SLOT_LIST = set()
SLOT_LIST.add(' ')
for df in [train_cn, train_en, train_ja]:
    for idx, rows in df.iterrows():
        for col in ['槽值1', '槽值2']:
            if rows[col] is not np.NaN:
                slot_name = rows["意图"] + "-" + rows[col].split(":")[0]
                SLOT_LIST.add(slot_name)
SLOT_LIST = sorted(SLOT_LIST)
print(f"{len(SLOT_LIST)} intent-slot labels in total:\n{SLOT_LIST}")
11 intent-slot labels in total:
[' ', 'adjust_ac_temperature_to_number-offset', 'adjust_ac_windspeed_to_number-offset', 'close_car_device-device', 'music_search_artist_song-singer', 'music_search_artist_song-song', 'navigate_landmark_poi-landmark', 'navigate_landmark_poi-poi', 'navigate_poi-poi', 'open_ac_mode-mode', 'open_car_device-device']
Text length distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))  # create a figure with 1 row and 3 columns
train_cn['Chinese_text_len'] = [len(i) for i in train_cn["原始文本"]]
train_en['English_text_len'] = [len(i.split(" ")) for i in train_en["原始文本"]]
train_ja['Japan_text_len'] = [len(i) for i in train_ja["原始文本"]]
sns.distplot(train_cn['Chinese_text_len'],bins=10,ax=axes[0])
sns.distplot(train_en['English_text_len'],bins=10,ax=axes[1])
sns.distplot(train_ja['Japan_text_len'],bins=10,ax=axes[2])
- The intent labels in all three training corpora are highly imbalanced; not every utterance contains a slot, and the slot labels are imbalanced as well.
- The maximum text length never exceeds 40, so max_len does not need to be large; this write-up uses max_len = 48, which may not be optimal.
Task2: Text analysis and tokenization
Tokenize Chinese with jieba
import jieba
import jieba.posseg as pseg
words = jieba.lcut("今天北京天气不错。")
print(words)
words = pseg.lcut("今天北京天气不错。")
print(words)
['今天', '北京', '天气', '不错', '。']
[pair('今天', 't'), pair('北京', 'ns'), pair('天气', 'n'), pair('不错', 'a'), pair('。', 'x')]
def cutword_cn(txt):
    return jieba.lcut(txt)
train_cn['words'] = train_cn['原始文本'].apply(cutword_cn)
train_cn.head(10)
Tokenize Japanese with nagisa
# https://github.com/taishi-i/nagisa
import nagisa
text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
# Get a list of words
print(words.words)
# Get a list of POS-tags
print(words.postags)
Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞
['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
def cutword_ja(txt):
    words = nagisa.tagging(txt)
    return words.words
train_ja['words'] = train_ja['原始文本'].apply(cutword_ja)
train_ja.head(10)
Task3: TF-IDF and text classification
How TF-IDF works, in brief
Step 1: learn how to use TF-IDF and extract TF-IDF features from the corpus.
TF-IDF (Term Frequency - Inverse Document Frequency) is a feature-vectorization method.
TF-IDF is a keyword-oriented statistical measure used to evaluate how important a word is to a document within a document collection or corpus. A word's importance increases with the number of times it appears in the document and decreases with the number of documents in the corpus that contain it. This weighting effectively suppresses the influence of common words and strengthens the association between keywords and documents.
TF: term frequency, the number of times a term occurs in a document. It is usually normalized as TF = (occurrences of the term in the document / total number of terms in the document), which prevents the score from being biased toward long documents (the same term tends to have a higher raw count in a long document than in a short one).
D: the corpus; d: a document.
TF(t, d): term frequency, the number of times term t occurs in document d.
DF(t, D): document frequency, the number of documents that contain term t.
IDF: inverse document frequency. The fewer documents contain a term, the larger its IDF and the stronger its discriminative power. IDF = log(total number of documents in the corpus / (number of documents containing the term + 1)); the +1 avoids a zero denominator. TF-IDF = TF · IDF: the larger the TF-IDF value, the more important the term is to that document.
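A small worked example of these formulas (a sketch for intuition; note that scikit-learn's TfidfVectorizer uses a slightly different, smoothed IDF by default):
import math

docs = [["今天", "天气", "不错"],
        ["今天", "去", "北京"],
        ["北京", "天气", "好"]]

def tf(term, doc):
    return doc.count(term) / len(doc)          # TF = count in doc / doc length

def idf(term, docs):
    df = sum(1 for d in docs if term in d)     # DF(t, D)
    return math.log(len(docs) / (df + 1))      # IDF = log(N / (DF + 1))

print(tf("天气", docs[0]) * idf("天气", docs))   # common term -> weight 0.0 here
print(tf("不错", docs[0]) * idf("不错", docs))   # rarer term -> higher weight (~0.135)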
TF-IDF implementation
API: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'今天天气不错.',
'I am happy everyday.',
'How are you.',
'Are you happy and smile everydat',
]
vectorizer = TfidfVectorizer()
tdm = vectorizer.fit_transform(corpus)
space = vectorizer.vocabulary_
print(space)
{'今天天气不错': 9, 'am': 0, 'happy': 5, 'everyday': 4, 'how': 6, 'are': 2, 'you': 8, 'and': 1, 'smile': 7, 'everydat': 3}
vectorizer.get_feature_names()
['am',
'and',
'are',
'everydat',
'everyday',
'happy',
'how',
'smile',
'you',
'今天天气不错']
print(tdm.shape)
(4, 10)
Notes on attributes and methods:
- vocabulary_: a dict mapping each feature (term) to its column index in the TF-IDF matrix; the output above shows this mapping between feature words and matrix columns.
- get_feature_names(): returns the list of features; for the example above, vectorizer.get_feature_names() returns the list shown earlier.
Notes on parameters:
- stop_words: a collection of stop words (words with little standalone meaning, such as particles and interjections) to exclude from the computation, which can improve accuracy. If set to 'english', the words in ENGLISH_STOP_WORDS are ignored; if a list is given, the listed words are ignored.
- max_df: ignore terms whose document frequency (df) exceeds this threshold. A float between 0 and 1 is interpreted as a proportion of documents; an int is interpreted as an absolute document count.
fit: learn the vocabulary and IDF statistics from the data;
transform: convert data into the TF-IDF matrix;
fit_transform: load the data and convert it into a matrix, equivalent to fit + transform.
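In practice, fit (or fit_transform) is called on the training corpus only, and transform reuses the learned vocabulary and IDF on new text; a short sketch:
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["i am happy everyday", "are you happy", "navigate to the station"]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_corpus)           # fit + transform on the training data
X_new = vectorizer.transform(["are you happy everyday"])   # transform only; unseen words are ignored
print(X_train.shape, X_new.shape)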
Intent recognition and slot filling
From https://zhuanlan.zhihu.com/p/165963264 大话知识图谱 -- intent recognition and slot filling
Natural language understanding (NLU):
For a question-answering bot to answer users' questions accurately from its knowledge base, it must first understand the question correctly, i.e. perform natural language understanding on the user's text or speech. NLU consists of two parts: intent recognition and slot filling.
Intent recognition, as the name suggests, determines what the user wants to do. When a user asks the bot a question, the bot must decide whether it is about the weather, about travel, or about some movie. At its core, intent recognition is a text-classification problem. Since it is classification, the set of intents must be made explicit first: the intent classes have to be defined in advance before intent recognition can even be attempted. How are intent classes defined? Unlike sentiment classification, which can be split into positive, negative and neutral regardless of the scenario, intent classes must be defined for a specific application scenario, and different scenarios have different intent taxonomies. The Meituan app, for example, divides users' search intents into ordering food delivery, booking hotels, buying attraction tickets, buying movie tickets, booking flights, and so on.
Once the intent is recognized, suppose the user types "book one ticket for this afternoon's showing of Wolf Warrior": the system recognizes the "book movie ticket" intent and starts arranging the booking (let's call this system Zhang San). But Zhang San is now puzzled: should it book Wolf Warrior 1 or Wolf Warrior 2, and at which cinema? Zhang San checks the films currently showing and finds that only Wolf Warrior 2 is in re-release; it finds that the cinema closest to the user is one called XX Cinema; it also checks the time: it is 1 p.m., XX Cinema has a Wolf Warrior 2 showing at 1:30 p.m., and the slightly farther YY Cinema has one at 2:30 p.m. So Zhang San interacts with the user:
"Hello, Wolf Warrior 2 is showing at XX Cinema at 1:30 p.m. and at YY Cinema at 2:30 p.m.; which cinema would you like?"
"YY Cinema."
"Hello, you can now choose the number of tickets and the seats to complete the booking."
Isn't Zhang San clever: it thinks and interacts just like a person! In fact this targeted "thinking" relies entirely on slot filling. Once Zhang San recognizes the "book movie ticket" intent, it pulls out the semantic slot frame that corresponds to this intent and starts "filling in the blanks" (never mind for now what this frame is or where it comes from; let's just look at what it looks like):
It finds that the blanks to fill include the movie name, the cinema name, the time, the number of tickets, the seat positions, and so on. How does Zhang San fill them? Through named-entity recognition and slot prediction. It scans the input "book one ticket for this afternoon's showing of Wolf Warrior" and recognizes the movie name "Wolf Warrior" and the time "this afternoon"; it finds no cinema name, so based on the user's current location it predicts XX Cinema or YY Cinema; the ticket count and seat positions cannot be predicted at all. For slots that cannot be predicted, Zhang San asks the user or offers choices to confirm ("you can choose the number of tickets and the seats to complete the booking"); when a predicted slot value is not unique, such as XX Cinema vs. YY Cinema, it lets the user pick one; when a recognized slot is ambiguous, it "clarifies": the entity "Wolf Warrior" could be Wolf Warrior 1 or Wolf Warrior 2, and since only Wolf Warrior 2 is currently showing, Zhang San clarifies automatically; if both were showing it would have to ask the user, e.g. "Do you want to watch Wolf Warrior 1 or Wolf Warrior 2?"
As you can see, Zhang San's entire reasoning process follows the semantic slot frame: whatever slots exist are what it reasons about. Once the slot frame is completely filled and disambiguated, the natural language understanding task is done, and the system can answer the question from its knowledge base or carry out the requested operation. But wait: where does the slot frame Zhang San pulls out actually come from, and how does it "ask" the user questions?
That is a question of semantic-slot design. As noted above, the intent classes are predefined, so naturally the slot frame corresponding to each intent is predefined as well: however many intents your application scenario has, that is how many slot frames you need to define in advance. In the example above I defined the slot frame of "book movie ticket" as in Figure 2, which may look too simple and does not show what Zhang San relies on to interact with the user; in fact it is enough to attach a corresponding prompt to every slot in the frame, and whenever Zhang San finds a slot that is unfilled or ambiguous it uses that slot's predefined prompt to "ask". The "book movie ticket" slot frame can therefore be revised as shown in the figure below:
The slot-frame design above is not necessarily the best or the most appropriate; slot frames in real business settings are more complex.
How to do intent recognition
**(1) Rule templates**
.*?[地名]{到|去|飞|飞往}[地名].*?{机票|飞机票|航班}.*?
.*? matches any characters
[] denotes an entity type or part of speech
{} denotes keywords
| denotes alternation.
When the user enters 查询后天广州到上海的航班 ("look up flights from Guangzhou to Shanghai the day after tomorrow"), we tokenize and POS-tag the query and match the place names 广州 and 上海 and the keywords 到 and 航班; their combination matches the predefined template closely, so we conclude that the query carries the "book flight" intent. For 有没有下周二到贵阳的航班 ("any flights to Guiyang next Tuesday") only one place name and the keyword 航班 can be matched, which is still a fairly good match, and if this score is higher than the query's match against the templates of other intents, we can confidently assign the "book flight" intent as well.
Rule templates give high precision but low recall, especially on long-tail queries; the approach also requires heavy manual template engineering, is hard to automate, and is even harder to port to new domains. A sketch of this kind of matching follows below.
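A minimal sketch of rule-template matching with Python's re module (the place-name gazetteer and the compiled template below are illustrative assumptions, not part of the competition data):
import re

PLACES = "广州|上海|北京|贵阳"   # toy place-name gazetteer, purely illustrative
FLIGHT_TEMPLATE = re.compile(
    rf".*?({PLACES})(到|去|飞|飞往)({PLACES}).*?(机票|飞机票|航班).*?"
)

query = "查询后天广州到上海的航班"
m = FLIGHT_TEMPLATE.match(query)
if m:
    print("intent: book_flight, from:", m.group(1), "to:", m.group(3))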
(2) Statistical machine learning
Text classification with statistical machine learning requires hand-crafted features such as n-grams, POS features and entity-type features; the features are then vectorized with TF-IDF and a classifier such as an SVM, logistic regression or a random forest is trained. This approach also needs a lot of manual, domain-specific feature engineering, and the accuracy of classical machine-learning classifiers on this kind of task is generally not ideal.
(3) Deep learning
Neural-network classifiers remove the manual feature design and extraction steps and can build on pretrained word vectors that already carry semantic knowledge; but they need a lot of training data, which means a lot of manual annotation, whereas the rule-template approach needs no labeled data at all.
Difficulties of intent recognition
Although intent recognition is text classification, it differs from ordinary text classification in ways that make it somewhat harder. The main difficulties:
- Non-standard input: typos, keyword stuffing, non-standard natural language.
- Multiple intents: the input carries too little information, so the intent is unclear or ambiguous. For the input 仙剑奇侠传 (Chinese Paladin), does the user want the game download, the TV series, the movie, the soundtrack, or the novel?
- Intent strength: the input seems to belong to both intent A and intent B, and neither scores highly.
- Timeliness: intents are strongly time-dependent; the same query at different points in time may carry different intents. For example, the query 战狼 (Wolf Warrior) right now probably means the user wants to watch Wolf Warrior 1 or 2 online, whereas the same query around the time Wolf Warrior 3 starts shooting would very likely be about news of Wolf Warrior 3.
How to do slot filling
Slot filling consists of named-entity recognition and slot prediction. Strictly speaking, calling it named-entity recognition is not accurate: the slot frame of the "book flight" intent should contain a departure city and an arrival city; both are place names, but they are different and their roles cannot be swapped, so they cannot be collapsed into a single "place name" label, which is exactly what plain NER would do. Slot filling is therefore a sequence-labeling task, but a sequence-labeling task is not necessarily NER, and the data must be annotated differently; the figure below shows the difference between the two annotation schemes:
As the slot-filling annotation shows, the city entities of the "book flight" intent are tagged with dept and arr to distinguish departure and arrival cities, so we should label the data with the second scheme above, train a sequence-labeling model, and use that model to recognize the slot values.
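A toy, character-level illustration of the difference (the dept/arr tag names follow the figure described above; the sentence and its alignment are illustrative):
# "查询广州到上海的航班" -- check flights from Guangzhou to Shanghai
chars = ["查", "询", "广", "州", "到", "上", "海", "的", "航", "班"]

# plain NER: both cities collapse into a single "city" label
ner_tags = ["O", "O", "B-city", "I-city", "O", "B-city", "I-city", "O", "O", "O"]

# slot filling: departure and arrival are kept apart
slot_tags = ["O", "O", "B-dept", "I-dept", "O", "B-arr", "I-arr", "O", "O", "O"]

for c, n, s in zip(chars, ner_tags, slot_tags):
    print(c, n, s)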
Slot prediction comes in when some slot values cannot be recognized from the user's query: instead of pestering the user about every missing value, we first try to make a reasonably reliable prediction. How? In the movie-ticket example above, Zhang San used the user's location and the current time to help predict different slots (which is why some apps ask for your phone's location permission); many more signals can be used for slot prediction.
Usually, intent recognition and slot filling are trained jointly in a single model.
For this competition task:
- The intent-recognition data can be encoded uniformly as {'texts': [], 'intent_labels': []}.
- Slot recognition uses a simplified BIO-style encoding: positions inside a slot are tagged with the slot's class, and all other positions are tagged 'O'. A slot class such as navigate_poi-poi has the intent as its first half and the slot as its second half, joined by "-": navigate_poi is the intent and poi is the slot name, which makes it easy to enumerate all intent-slot combinations. With joint training, the intent label and the slot labels are merged, and a processed training example looks like this:
{ 'words': ['目', '的', '地', 'を', '阿', '波', '加', '茂', '駅', 'に', '設', '定', 'す', 'る'], 'intents': 'navigate_poi', 'slots': ['O', 'O', 'O', 'O', 'navigate_poi-poi', 'navigate_poi-poi', 'navigate_poi-poi', 'navigate_poi-poi', 'navigate_poi-poi', 'O', 'O', 'O', 'O', 'O'] }
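A rough sketch of this conversion for a single Chinese row (a simplification that tokenizes by character and aligns the slot value with str.find; the real pipeline below aligns against a WordPiece tokenizer instead):
def row_to_example(text, intent, slot1):
    # character-level tokens for illustration
    words = list(text)
    slots = ['O'] * len(words)
    if isinstance(slot1, str):
        slot_name, slot_value = slot1.split(":")
        start = text.find(slot_value)
        if start != -1:
            for k in range(start, start + len(slot_value)):
                slots[k] = intent + "-" + slot_name
    return {'words': words, 'intents': intent, 'slots': slots}

print(row_to_example("导航到北京西站", "navigate_poi", "poi:北京西站"))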
Model building: intent recognition
Train a TF-IDF + logistic-regression model (here on the Japanese and English corpora) and predict the intents of the test sets;
# Tokenize the text
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_ja['words'] = train_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en['原始文本'].apply(lambda x: x.lower())
test_ja['words'] = test_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en['原始文本'].apply(lambda x: x.lower())
# Fit TF-IDF and logistic regression
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression()
)
pipeline.fit(
    train_ja['words'].tolist() + train_en['words'].tolist(),
    train_ja['意图'].tolist() + train_en['意图'].tolist()
)
# Predict on the test sets
test_ja['意图'] = pipeline.predict(test_ja['words'])
test_en['意图'] = pipeline.predict(test_en['words'])
test_en['槽值1'] = np.nan
test_en['槽值2'] = np.nan
test_ja['槽值1'] = np.nan
test_ja['槽值2'] = np.nan
# Write the submission file
writer = pd.ExcelWriter('submit.xlsx')
test_en.drop(['words'], axis=1).to_excel(writer, sheet_name='英文_testA', index=None)
test_ja.drop(['words'], axis=1).to_excel(writer, sheet_name='日语_testA', index=None)
writer.save()
writer.close()
Split off a validation set from the training data for offline evaluation
# Tokenize the text
train_ja['words'] = train_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en['原始文本'].apply(lambda x: x.lower())
test_ja['words'] = test_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en['原始文本'].apply(lambda x: x.lower())
from sklearn.model_selection import train_test_split
train_en1, dev_en = train_test_split(train_en, test_size=.25, random_state=2048)
train_jp1, dev_jp = train_test_split(train_ja, test_size=.25, random_state=1024)
# Fit TF-IDF and logistic regression
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression()
)
pipeline.fit(
    train_jp1['words'].tolist() + train_en1['words'].tolist(),
    train_jp1['意图'].tolist() + train_en1['意图'].tolist()
)
train_x = train_en1['words'].tolist() + train_jp1['words'].tolist()
train_y = train_en1['意图'].tolist() + train_jp1['意图'].tolist()
train_y_pred = pipeline.predict(train_x)
num_train = len(train_x)
tmp = 0
for i in range(num_train):
    if train_y_pred[i] == train_y[i]:
        tmp += 1
acc_train = tmp / num_train
print(acc_train)
dev_x = dev_en['words'].tolist() + dev_jp['words'].tolist()
dev_y = dev_en['意图'].tolist() + dev_jp['意图'].tolist()
dev_y_pred = pipeline.predict(dev_x)
num_dev = len(dev_x)
tmp = 0
for i in range(num_dev):
    if dev_y_pred[i] == dev_y[i]:
        tmp += 1
acc = tmp / num_dev
print(acc)
0.966688874083944
0.9063745019920318
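The same numbers can be computed more compactly with scikit-learn's accuracy_score, which is equivalent to the counting loops above:
from sklearn.metrics import accuracy_score

print(accuracy_score(train_y, train_y_pred))   # training accuracy
print(accuracy_score(dev_y, dev_y_pred))       # validation accuracy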
Task4: Regular expressions
# 'intent' holds the slot name and 'value' the slot value, presumably parsed from 槽值1/槽值2 in an earlier step (not shown)
for i in ['offset', 'poi', 'mode', 'song', 'device', 'singer']:
    print(i, ":", list(set(train_cn[train_cn['intent'] == i].意图)))
offset : ['adjust_ac_windspeed_to_number', 'adjust_ac_temperature_to_number']
poi : ['navigate_poi', 'navigate_landmark_poi']
mode : ['open_ac_mode']
song : ['music_search_artist_song']
device : ['close_car_device', 'open_car_device']
singer : ['music_search_artist_song']
Match slot values separately for each intent category:
import re
def get_num_cn(txt):
    return re.findall(r"\d+\.?\d*", txt)
train_cn['num'] = train_cn['原始文本'].apply(get_num_cn)
train_cn
def get_num_ja(txt):
    return re.findall(r"[一二三四五六七八九十]+", txt)
train_ja['num'] = train_ja['原始文本'].apply(get_num_ja)
train_ja[train_ja['num'].apply(len) > 0]
# build an alternation pattern from all mode values seen in the Chinese training set
mode_values = list(set(train_cn[train_cn['intent'] == 'mode']['value']))
mode_pattern = "(" + "|".join(re.escape(v) for v in mode_values) + ")"
# mode_pattern
def getcaozhi_mode(txt):
    mode_v = re.findall(mode_pattern, txt)
    if len(mode_v) > 0:
        return "mode:" + mode_v[0]
    else:
        return np.nan
train_cn['槽值_test'] = np.nan
for i in range(len(train_cn)):
    if train_cn.iloc[i, 1] == "open_ac_mode":
        train_cn.loc[train_cn.index[i], '槽值_test'] = getcaozhi_mode(train_cn.iloc[i, 0])
Metacharacter | Meaning |
---|---|
$ | Anchor; matches the end of the input string. If the RegExp Multiline property is set, $ also matches before '\n' or '\r'. To match a literal $, use \$. |
^ | Anchor; matches the start of the input string, except inside a bracket expression, where it negates the character set. To match a literal ^, use \^. |
\b | Anchor; matches a word boundary, i.e. the position between a word and a space. |
\B | Anchor; matches a non-word boundary. |
* | Matches the preceding subexpression zero or more times (0, 1, or many). To match a literal *, use \*. |
+ | Matches the preceding subexpression one or more times. To match a literal +, use \+. |
. | Matches any single character except the newline \n. To match a literal ., use \. |
? | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match a literal ?, use \?. |
\w | Matches letters, digits and underscore; equivalent to [A-Za-z0-9_]. |
[ABC] | Matches any character inside [...]; e.g. [aeiou] matches every e, o, u, a in "google runoob taobao". |
[^ABC] | Matches any character not inside [...]; e.g. [^aeiou] matches every letter of "google runoob taobao" except e, o, u, a. |
[A-Z] | A range; matches any uppercase letter. |
[a-z] | Matches any lowercase letter. |
[\s\S] | Matches anything: \s matches any whitespace character including newlines, \S matches any non-whitespace character. |
{n} | n is a non-negative integer; matches exactly n times. E.g. 'o{2}' does not match the 'o' in "Bob" but matches the two o's in "food". |
{n,} | n is a non-negative integer; matches at least n times. E.g. 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'; 'o{0,}' is equivalent to 'o*'. |
{n,m} | m and n are non-negative integers with n <= m; matches at least n and at most m times. E.g. "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that no space is allowed between the comma and the numbers. |
( ) | Marks the start and end of a subexpression, which can be captured for later use. To match literal parentheses, use \( and \). |
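A few of these metacharacters in action with Python's re module:
import re

print(re.findall(r"o{2}", "food"))                 # ['oo'] -- {n}: exactly n repetitions
print(re.findall(r"[aeiou]", "taobao"))            # ['a', 'o', 'a', 'o'] -- character class
print(re.findall(r"\d+\.?\d*", "温度调到25.5度"))    # ['25.5'] -- digits with an optional decimal part
print(re.sub(r"\s+", " ", "a   b\tc"))             # 'a b c' -- \s matches any whitespace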
Task5: Getting started with BERT
Google's BERT model took state-of-the-art results on 11 NLP tasks and set the whole NLP community on fire. A key building block behind it is the Transformer, proposed in the paper "Attention is All You Need", which is now the reference model recommended for Google Cloud TPUs. The paper's TensorFlow code is available on GitHub as part of the Tensor2Tensor package, and Harvard's NLP group has published an annotated PyTorch implementation of the paper.
Attention is All You Need: https://arxiv.org/abs/1706.03762
A key factor in BERT's success is the power of the Transformer. Google's Transformer model was first used for machine translation, where it reached SOTA at the time. The Transformer removes the drawback that RNNs cannot be trained in parallel, using self-attention to enable fast parallel computation; it can also be stacked to a very large depth, fully exploiting the capacity of deep networks and improving model accuracy.
Model structure:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# AutoTokenizer: the tokenizer class
# Auto*: the concrete class is resolved automatically from the checkpoint name
model_name = "bert-base-chinese"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig, BertModel, AutoModel
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
inputs = tokenizer(["把闭锁自动升窗功能关闭。", "帮我查询附近国家电网充电站"], truncation=True, max_length=20, padding=True)
print(inputs)
# input_ids: the index of each token in the vocabulary
# token_type_ids: whether a token belongs to the first or the second sentence
# attention_mask: whether a token is real input or padding
import torch
item = {key: torch.tensor(val) for key, val in inputs.items()}  # shape (2, seq_len): one row per sentence
output = model(input_ids=item['input_ids'], attention_mask=item['attention_mask'])
print(output)
print(output.last_hidden_state.shape, output.pooler_output.shape)
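For sentence-level classification, the usual choice is either pooler_output or the hidden state at the [CLS] position of last_hidden_state; a short sketch of feeding it into a linear classifier (the class count of 18 matches the intent model built below):
import torch
from torch import nn

cls_vec = output.last_hidden_state[:, 0, :]   # (batch, 768): hidden state at the [CLS] token
classifier = nn.Linear(768, 18)               # 18 intent classes, as in the model below
logits = classifier(cls_vec)
print(logits.shape)                           # torch.Size([2, 18])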
Task6: Text classification with BERT
Model building: intent recognition
for tag in ['intent', 'device', 'mode', 'offset', 'endloc', 'landmark', 'singer', 'song']:
train_ja['槽值1'] = train_ja['槽值1'].str.replace(f'{tag}:', '')
train_ja['槽值2'] = train_ja['槽值2'].str.replace(f'{tag}:', '')
train_cn['槽值1'] = train_cn['槽值1'].str.replace(f'{tag}:', '')
train_cn['槽值2'] = train_cn['槽值2'].str.replace(f'{tag}:', '')
train_en['槽值1'] = train_en['槽值1'].str.replace(f'{tag}:', '')
train_en['槽值2'] = train_en['槽值2'].str.replace(f'{tag}:', '')
train_df = pd.concat([
train_ja[['原始文本', '意图', '槽值1', '槽值2']],
train_cn[['原始文本', '意图', '槽值1', '槽值2']],
#train_cn[['原始文本', '意图', '槽值1', '槽值2']].sample(10000),
train_en[['原始文本', '意图', '槽值1', '槽值2']],
],axis = 0)
train_df = train_df.sample(frac=1.0)
train_df['意图_encode'], lbl_ecode = pd.factorize(train_df['意图'])
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
# Dataset wrapper
class Load_Dataset(Dataset):
def __init__(self, encodings, intent):
self.encodings = encodings
self.intent = intent
    # fetch a single sample
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['label'] = torch.tensor(int(self.intent[idx]))
return item
def __len__(self):
return len(self.intent)
class Model(nn.Module):
def __init__(self, num_labels):
super(Model,self).__init__()
        self.model = AutoModel.from_pretrained("bert-base-multilingual-cased")
self.dropout = nn.Dropout(0.1)
self.classifier = nn.Linear(768, num_labels)
def forward(self, input_ids=None, attention_mask=None,labels=None):
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
sequence_output = self.dropout(outputs[0]) #outputs[0]=last hidden state
logits = self.classifier(sequence_output[:,0,:].view(-1,768))
return logits
def train():
model.train()
total_train_loss = 0
iter_num = 0
total_iter = len(train_loader)
for batch in train_loader:
        # forward pass
optim.zero_grad()
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
label = batch['label'].to(device)
pred = model(
input_ids,
attention_mask
)
loss = loss_fn(pred, label)
        # backward pass
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # parameter update
optim.step()
iter_num += 1
if(iter_num % 100 == 0):
print("iter_num: %d, loss: %.4f, %.2f%% %.4f" % (
iter_num, loss.item(), iter_num/total_iter*100,
(pred.argmax(1) == label).float().data.cpu().numpy().mean(),
))
def validation():
model.eval()
label_acc = 0
for batch in val_dataloader:
with torch.no_grad():
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
label = batch['label'].to(device)
pred = model(
input_ids,
attention_mask
)
label_acc += (pred.argmax(1) == label).float().sum().item()
label_acc = label_acc / len(val_dataloader.dataset)
print("-------------------------------")
print("Accuracy: %.4f" % (label_acc))
print("-------------------------------")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
config = AutoConfig.from_pretrained("bert-base-multilingual-cased")
train_encoding = tokenizer(train_df['原始文本'].tolist()[:-500], truncation=True, padding=True, max_length=40)
val_encoding = tokenizer(train_df['原始文本'].tolist()[-500:], truncation=True, padding=True, max_length=40)
train_dataset = Load_Dataset(train_encoding, train_df['意图_encode'].tolist()[:-500])
val_dataset = Load_Dataset(val_encoding, train_df['意图_encode'].tolist()[-500:])
# wrap single-sample reads into batched DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)
model = Model(18)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = 'cpu'
model = model.to(device)
loss_fn = CrossEntropyLoss()  # the default ignore_index is -100
optim = AdamW(model.parameters(), lr=5e-5)
for epoch in range(2):
train()
validation()
iter_num: 100, loss: 0.8919, 13.91% 0.6250
iter_num: 200, loss: 0.3708, 27.82% 0.8750
iter_num: 300, loss: 0.3024, 41.72% 0.9375
iter_num: 400, loss: 0.2056, 55.63% 0.8750
iter_num: 500, loss: 0.0389, 69.54% 1.0000
iter_num: 600, loss: 1.5263, 83.45% 0.6875
iter_num: 700, loss: 0.2882, 97.36% 0.9375
-------------------------------
Accuracy: 0.9380
-------------------------------
iter_num: 100, loss: 0.0069, 13.91% 1.0000
iter_num: 200, loss: 0.2506, 27.82% 0.9375
iter_num: 300, loss: 1.1997, 41.72% 0.7500
iter_num: 400, loss: 0.0121, 55.63% 1.0000
iter_num: 500, loss: 0.0082, 69.54% 1.0000
iter_num: 600, loss: 0.0483, 83.45% 1.0000
iter_num: 700, loss: 0.3702, 97.36% 0.8750
-------------------------------
Accuracy: 0.9700
-------------------------------
def prediction():
model.eval()
test_label = []
for batch in test_dataloader:
with torch.no_grad():
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
pred = model(input_ids, attention_mask)
test_label += list(pred.argmax(1).data.cpu().numpy())
return test_label
test_encoding = tokenizer(test_en['原始文本'].tolist(), truncation=True, padding=True, max_length=40)
test_dataset = Load_Dataset(test_encoding, [0] * len(test_en))
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)
test_en_intent = prediction()
test_encoding = tokenizer(test_ja['原始文本'].tolist(), truncation=True, padding=True, max_length=40)
test_dataset = Load_Dataset(test_encoding, [0] * len(test_ja))
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)
test_ja_intent = prediction()
test_ja['意图'] = [lbl_ecode[x] for x in test_ja_intent]
test_en['意图'] = [lbl_ecode[x] for x in test_en_intent]
test_en['槽值1'] = np.nan
test_en['槽值2'] = np.nan
test_ja['槽值1'] = np.nan
test_ja['槽值2'] = np.nan
writer = pd.ExcelWriter('submit.xlsx')
test_en[['意图', '槽值1', '槽值2']].to_excel(writer, sheet_name='英文_testA', index=None)
test_ja[['意图', '槽值1', '槽值2']].to_excel(writer, sheet_name='日语_testA', index=None)
writer.save()
writer.close()
Model building: slot recognition
# Dataset wrapper
class XunFeiDataset2(Dataset):
    def __init__(self, encodings, label_i, label_j):
        self.encodings = encodings
        self.label_i = label_i
        self.label_j = label_j
    # fetch a single sample
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['label_i'] = torch.tensor(int(self.label_i[idx]))
        item['label_j'] = torch.tensor(int(self.label_j[idx]))
        return item
    def __len__(self):
        return len(self.label_i)
class XunFeiModel2(nn.Module):
    def __init__(self, num_labels_i, num_labels_j):
        super(XunFeiModel2, self).__init__()
        # Load Model with given checkpoint and extract its body
        self.model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
        self.dropout = nn.Dropout(0.1)
        self.classifier_i = nn.Linear(768, num_labels_i)
        self.classifier_j = nn.Linear(768, num_labels_j)
    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs[0])  # outputs[0] = last hidden state
        # classification head for the first label set
        logits_i = self.classifier_i(sequence_output[:, 0, :].view(-1, 768))
        # classification head for the second label set
        logits_j = self.classifier_j(sequence_output[:, 0, :].view(-1, 768))
        return logits_i, logits_j
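A hedged sketch of how this two-head model could be trained: sum the two cross-entropy losses and back-propagate once (train_loader2 is a hypothetical DataLoader built from XunFeiDataset2; the other names come from the intent-classification code above):
model2 = XunFeiModel2(num_labels_i=18, num_labels_j=len(SLOT_LIST))
model2 = model2.to(device)
optim2 = AdamW(model2.parameters(), lr=5e-5)
loss_fn2 = CrossEntropyLoss()

for batch in train_loader2:  # train_loader2: assumed DataLoader over XunFeiDataset2
    optim2.zero_grad()
    logits_i, logits_j = model2(batch['input_ids'].to(device), batch['attention_mask'].to(device))
    loss = loss_fn2(logits_i, batch['label_i'].to(device)) + loss_fn2(logits_j, batch['label_j'].to(device))
    loss.backward()
    optim2.step()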
Task7: BERT entity extraction
Build a joint model
—— build a model that trains intent recognition and slot filling jointly
import os
import time
import json
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import langid
import paddle
import paddlenlp
import paddle.nn.functional as F
from functools import partial
from paddlenlp.data import Stack, Dict, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers.bert.tokenizer import BertTokenizer
import matplotlib.pyplot as plt
import seaborn as sns
# fix the random seeds for reproducible results
seed = 1024
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
set_seed(seed)
# Hyperparameters
# Pretrained models:
# 1. bert-base-multilingual-cased: multilingual BERT, cased (case-sensitive)
# 2. bert-base-multilingual-uncased: multilingual BERT, uncased (case-insensitive)
MODEL_NAME = 'bert-base-multilingual-cased' # bert-base-multilingual-cased
max_seq_length = 48 # maximum text length
intent_train_batch_size = 16
intent_valid_batch_size = 64
intent_test_batch_size = 32
ignore_label = -100 # value used to pad slot labels
learning_rate = 4e-5
epochs = 20 # number of training epochs
warmup_proportion = 0.1 # learning-rate warm-up proportion
weight_decay = 0.01 # weight-decay coefficient, a regularization strategy that helps avoid overfitting
max_grad_norm = 1.0
# directory where model parameters are saved after training
save_dir_curr = "./intent_model_{}".format(MODEL_NAME)
# logging intervals for training loss and evaluation
loggiing_print = 50
loggiing_eval = 50
# Split train/validation sets (English and Japanese only here; adding the Chinese data was tried but did not help)
intent_train_dataset1, intent_validation_dataset1 = train_test_split(train_ja, test_size=0.3, random_state=seed)
intent_train_dataset2, intent_validation_dataset2 = train_test_split(train_en, test_size=0.3, random_state=seed)
intent_validation_dataset = pd.concat([intent_validation_dataset1, intent_validation_dataset2], axis=0)
intent_train_dataset = pd.concat([intent_train_dataset1, intent_train_dataset2], axis=0)
# the test set used below is assumed to be the English and Japanese test sheets stacked together, English first
test = pd.concat([test_en, test_ja], axis=0).reset_index(drop=True)
print(intent_train_dataset.shape, intent_validation_dataset.shape)
# Intent label statistics
JAPAN_INTENT_LIST = list(set(train_ja['意图']))
print("Japanese training set: %d rows\t%d intent labels" % (train_ja.shape[0], len(JAPAN_INTENT_LIST)))
EN_INTENT_LIST = list(set(train_en['意图']))
print("English training set: %d rows\t%d intent labels" % (train_en.shape[0], len(EN_INTENT_LIST)))
ZH_INTENT_LIST = list(set(train_cn['意图']))
print("Chinese training set: %d rows\t%d intent labels" % (train_cn.shape[0], len(ZH_INTENT_LIST)))
print("Test set: %d rows" % (test.shape[0]))
# intent label set
INTENT_LIST = list(set(JAPAN_INTENT_LIST + ZH_INTENT_LIST + EN_INTENT_LIST))
print(f"{len(INTENT_LIST)} intent labels in total:\n{INTENT_LIST}")
len(set(JAPAN_INTENT_LIST) & set(ZH_INTENT_LIST)), len(set(JAPAN_INTENT_LIST) & set(EN_INTENT_LIST)), len(set(ZH_INTENT_LIST) & set(EN_INTENT_LIST)), len(INTENT_LIST)
# Collect all intent-slot label combinations
SLOT_LIST = set()
SLOT_LIST.add('O')
for df in [train_ja, train_en, train_cn]:
    for idx, rows in df.iterrows():
        for col in ['槽值1', '槽值2']:
            if rows[col] is not np.NaN:
                slot_name = rows["意图"] + "-" + rows[col].split(":")[0]
                SLOT_LIST.add(slot_name)
SLOT_LIST = sorted(SLOT_LIST)
print(f"{len(SLOT_LIST)} intent-slot labels in total:\n{SLOT_LIST}")
# load the tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# build label maps: dictionaries that map slot types and intent types to ids and back
def get_label_map(label_list):
id2label = dict([(idx, label) for idx, label in enumerate(label_list)])
label2id = dict([(label, idx) for idx, label in enumerate(label_list)])
return id2label, label2id
id2intent, intent2id = get_label_map(INTENT_LIST)
id2slot, slot2id = get_label_map(SLOT_LIST)
def en_slot_tokenizer_gen(texts, intents=None, slot_label=None, slot_value=None, modeSlot=True):
    # tokenize the raw text and the slot value
    new_texts = list(tokenizer.convert_ids_to_tokens(tokenizer(texts)['input_ids']))
    new_slots = np.full(shape=len(new_texts), fill_value='O', dtype='object')
    # texts without a slot get an all-'O' sequence; otherwise locate the slot value in the tokenized sequence and tag it
    if modeSlot is False:
        return new_slots
    else:
        new_slot_value = tokenizer.convert_ids_to_tokens(tokenizer(slot_value)['input_ids'])[1:-1]
        s = new_texts.index(new_slot_value[0])
        e = s + len(new_slot_value)
        new_slots[s:e] = [(intents + "-" + slot_label)] * len(new_slot_value)
        return new_slots
def ja_slot_tokenizer_gen(texts, intents=None, slot_label=None, slot_value=None, modeSlot=True):
    # tokenize the raw text and the slot value
    new_texts = list(tokenizer.convert_ids_to_tokens(tokenizer(texts)['input_ids']))
    new_slots = np.full(shape=len(new_texts), fill_value='O', dtype='object')
    # texts without a slot get an all-'O' sequence; otherwise locate the slot value in the tokenized sequence and tag it
    if modeSlot is False:
        return new_slots
    else:
        try:
            new_slot_value = tokenizer.convert_ids_to_tokens(tokenizer(slot_value)['input_ids'])[1:-1]
            s = [idx for idx, i in enumerate(new_texts) if new_slot_value[0] in i][0]
            e = s + len(new_slot_value)
            new_slots[s:e] = [(intents + "-" + slot_label)] * len(new_slot_value)
        except:
            print("error while aligning this slot; returning an all-'O' label sequence for this text")
            print(slot_value)
            for i, j in zip(new_texts, new_slots):
                print(i, "\t", j)
            print("===" * 30)
        return new_slots
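For intuition, this is roughly how the alignment is used for one English utterance (an illustrative call; it assumes tokenizer is the multilingual BERT tokenizer loaded above, and the exact WordPiece split may differ):
toks = tokenizer.convert_ids_to_tokens(tokenizer("navigate to central park")['input_ids'])
tags = en_slot_tokenizer_gen("navigate to central park",
                             intents="navigate_poi",
                             slot_label="poi",
                             slot_value="central park",
                             modeSlot=True)
for t, s in zip(toks, tags):
    print(t, "\t", s)
# expected pattern: [CLS]/O, navigate/O, to/O, central/navigate_poi-poi, park/navigate_poi-poi, [SEP]/O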
# Define the data-reading generator according to the local file format and the scheme above
def read(df, istrain=True):
    # for the training set yield text plus labels; otherwise yield the text only
    if istrain:
        for idx, rows in tqdm(df.iterrows(), total=df.shape[0]):
            text = rows['原始文本']
            intents = rows['意图']
            # use langid to detect the language of the text and build an initial all-'O' slot sequence
            language = langid.classify(text)[0]
            if language == 'en':
                texts = text
                slots = en_slot_tokenizer_gen(texts, modeSlot=False)
            elif language == 'ja':
                texts = text
                slots = ja_slot_tokenizer_gen(texts, modeSlot=False)
            # drop slot columns that are empty
            slot_cols = ['槽值1', '槽值2']
            if rows['槽值1'] is np.NaN:
                slot_cols.remove('槽值1')
            if rows['槽值2'] is np.NaN:
                slot_cols.remove('槽值2')
            # process whichever slot columns remain
            if len(slot_cols) > 0:
                for solt_col in slot_cols:
                    # get the slot label and the slot value
                    slot_value = rows[solt_col].split(":")[1]
                    slot_label = rows[solt_col].split(":")[0]
                    # align the slot value against the tokenized text for the detected language
                    if language == 'en':
                        slots = en_slot_tokenizer_gen(texts, intents, slot_label, slot_value, modeSlot=True)
                    elif language == 'ja':
                        slots = ja_slot_tokenizer_gen(texts, intents, slot_label, slot_value, modeSlot=True)
            yield {
                "words": texts,
                "intents": intents,
                "slots": slots,
            }
    else:
        for idx, rows in df.iterrows():
            yield {
                'words': rows['原始文本'],
            }
# pass the generator to load_dataset
intent_train_ds = load_dataset(read, df=intent_train_dataset, lazy=False)
intent_valid_ds = load_dataset(read, df=intent_validation_dataset, lazy=False)
for idx in range(78,80):
print(intent_train_ds[idx])
print("==="*30)
for idx in range(100,102):
print(intent_valid_ds[idx])
print("==="*30)
{'words': '2000円くらいで食事できる飛騨小坂駅を教える', 'intents': 'navigate_landmark_poi', 'slots': array(['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'navigate_landmark_poi-poi', 'navigate_landmark_poi-poi',
'navigate_landmark_poi-poi', 'navigate_landmark_poi-poi',
'navigate_landmark_poi-poi', 'O', 'O', 'O', 'O'], dtype=object)}
==================================================================
{'words': '目的地を阿波加茂駅に設定する', 'intents': 'navigate_poi', 'slots': array(['O', 'O', 'O', 'O', 'O', 'navigate_poi-poi', 'navigate_poi-poi',
'navigate_poi-poi', 'navigate_poi-poi', 'navigate_poi-poi', 'O',
'O', 'O', 'O', 'O'], dtype=object)}
==================================================================
{'words': '窓を閉める', 'intents': 'close_car_device', 'slots': array(['O', 'close_car_device-device', 'O', 'O', 'O', 'O'], dtype=object)}
==================================================================
{'words': '会津西方駅に行きたい', 'intents': 'navigate_poi', 'slots': array(['O', 'navigate_poi-poi', 'navigate_poi-poi', 'navigate_poi-poi',
'navigate_poi-poi', 'navigate_poi-poi', 'O', 'O', 'O', 'O', 'O'],
dtype=object)}
==================================================================
# Encoding
def convert_example(example, tokenizer, max_seq_len=512, mode='train'):
    # convert the text to ids with the tokenizer
    tokenized_input = tokenizer(
        example['words'],
        is_split_into_words=True,
        max_seq_len=max_seq_len)
    if mode == "test":
        return tokenized_input
    # convert the intent and slot labels to numeric ids
    tokenized_input['intent_labels'] = [intent2id[example['intents']]]
    tokenized_input['slot_labels'] = [slot2id[i] for i in example['slots']]
    return tokenized_input
intent_train_trans_func = partial(
convert_example,
tokenizer=tokenizer,
mode='train',
max_seq_len=max_seq_length)
intent_valid_trans_func = partial(
convert_example,
tokenizer=tokenizer,
mode='dev',
max_seq_len=max_seq_length)
intent_train_ds.map(intent_train_trans_func, lazy=False)
intent_valid_ds.map(intent_valid_trans_func, lazy=False)
# initialize the BatchSamplers
np.random.seed(seed)
intent_train_batch_sampler = paddle.io.BatchSampler(
intent_train_ds, batch_size=intent_train_batch_size, shuffle=True)
intent_valid_batch_sampler = paddle.io.BatchSampler(
intent_valid_ds, batch_size=intent_valid_batch_size, shuffle=False)
# define batchify_fn
batchify_fn = lambda samples, fn = Dict({
"input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
"token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
"intent_labels": Stack(dtype="int32"),
"slot_labels":Pad(axis=0, pad_val=ignore_label)
}): fn(samples)
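To see what the Pad/Stack helpers inside batchify_fn do, here is a toy call (a small illustration, separate from the training pipeline):
from paddlenlp.data import Pad, Stack

print(Pad(axis=0, pad_val=0)([[1, 2, 3], [4, 5]]))   # pads the shorter sample: [[1, 2, 3], [4, 5, 0]]
print(Stack()([[1], [2]]))                           # stacks same-length samples: [[1], [2]]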
# initialize the DataLoaders
intent_train_data_loader = paddle.io.DataLoader(
dataset=intent_train_ds,
batch_sampler=intent_train_batch_sampler,
collate_fn=batchify_fn,
return_list=True)
intent_valid_data_loader = paddle.io.DataLoader(
dataset=intent_valid_ds,
batch_sampler=intent_valid_batch_sampler,
collate_fn=batchify_fn,
return_list=True)
# build the test set in the same way
intent_test_ds = load_dataset(read,df=test,istrain=False, lazy=False)
intent_test_trans_func = partial(
convert_example,
tokenizer=tokenizer,
mode='test',
max_seq_len=max_seq_length)
intent_test_ds.map(intent_test_trans_func, lazy=False)
intent_test_batch_sampler = paddle.io.BatchSampler(intent_test_ds, batch_size=intent_test_batch_size, shuffle=False)
test_batchify_fn = lambda samples, fn = Dict({
"input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
"token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
}): fn(samples)
intent_test_data_loader = paddle.io.DataLoader(
dataset=intent_test_ds,
batch_sampler=intent_test_batch_sampler,
collate_fn=test_batchify_fn,
return_list=True)
Build the joint recognition model
import paddle
from paddle import nn
from paddlenlp.transformers.bert.modeling import BertPretrainedModel
# Model
class BERTIntentModel(BertPretrainedModel):
def __init__(self, bert, intent_dim, slot_dim, dropout=None):
super(BERTIntentModel, self).__init__()
        # label-space sizes
        self.intent_num_labels = intent_dim
        self.slot_num_labels = slot_dim
        # pretrained backbone
        self.bert = bert
        # dropout
        self.intent_dropout = nn.Dropout(dropout if dropout is not None else self.bert.config["hidden_dropout_prob"])
        self.slot_dropout = nn.Dropout(dropout if dropout is not None else self.bert.config["hidden_dropout_prob"])
# intent classifier
self.intent_classifier = nn.Linear(self.bert.config['hidden_size'], self.intent_num_labels)
# slot classifier
self.slot_classifier = nn.Linear(self.bert.config['hidden_size'], self.slot_num_labels)
# self.apply(self.init_weights)
def forward(self,input_ids,token_type_ids=None):
sequence_output, _ = self.bert(input_ids,token_type_ids=token_type_ids)
pooled_output_intent = sequence_output.mean(axis=1)
pooled_output_intent = self.intent_dropout(pooled_output_intent)
intent_logits = self.intent_classifier(pooled_output_intent)
        pooled_output_slot = self.slot_dropout(sequence_output)
slot_logits = self.slot_classifier(pooled_output_slot)
return intent_logits,slot_logits
model = BERTIntentModel.from_pretrained(MODEL_NAME, intent_dim=len(intent2id), slot_dim=len(slot2id), dropout=0.1)
# total number of training steps
num_training_steps = len(intent_train_data_loader) * epochs
# learning-rate schedule with linear decay and warm-up
lr_scheduler = paddlenlp.transformers.LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
decay_params = [
p.name for n, p in model.named_parameters()
if not any(nd in n for nd in ["bias", "norm"])
]
# define the optimizer
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
weight_decay=weight_decay,
apply_decay_param_fun=lambda x: x in decay_params,
grad_clip=paddle.nn.ClipGradByGlobalNorm(max_grad_norm))
Model training and saving
# Evaluation
@paddle.no_grad()
def evaluation(model, data_loader):
model.eval()
labels = []
preds = []
slot_acc_count = 0.
for batch in data_loader:
input_ids, token_type_ids, intent_labels, slot_labels = batch
intent_logits, slot_logits = model(input_ids, token_type_ids)
        # compare predicted slot labels with the ground truth
slot_logits = paddle.argmax(slot_logits,axis=2)
length_labels = [len([i for i in s_slot if i != -100]) for s_slot in slot_labels]
for r,p,l in zip(slot_logits,slot_labels,length_labels):
if ((r[:l]==p[:l]).sum() == l):
slot_acc_count += 1
        # collect predicted and true intent labels
preds.extend(paddle.argmax(intent_logits,axis=1).numpy())
labels.extend(intent_labels.numpy())
slot_acc = slot_acc_count / intent_validation_dataset.shape[0]
intent_acc = accuracy_score(y_true=labels,y_pred=preds)
return intent_acc,slot_acc
# Training loop
def do_train(model,data_loader):
total_step = len(data_loader) * epochs
intent_model_total_epochs = 0
batch_time = time.time()
best_score = 0.95
best_intent_acc = 0.
best_slot_acc = 0.
    # train
print("train ...")
for epoch in range(0, epochs):
model.train()
this_epoch_training_loss = 0
for step, batch in enumerate(data_loader, start=1):
input_ids, token_type_ids, intent_labels, slot_labels = batch
intent_logits, slot_logits = model(input_ids, token_type_ids)
intent_loss = F.softmax_with_cross_entropy(intent_logits, intent_labels).mean()
slot_loss = F.cross_entropy(slot_logits, slot_labels,ignore_index=ignore_label).mean()
loss = intent_loss + slot_loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.clear_grad()
this_epoch_training_loss += loss.numpy()
intent_model_total_epochs += 1
if intent_model_total_epochs % loggiing_print == 0:
print("step: %d / %d, training loss: %.5f" % (intent_model_total_epochs, total_step, this_epoch_training_loss/intent_model_total_epochs))
        # validate
eval_intent_score,eval_slot_score = evaluation(model, intent_valid_data_loader)
eval_score = (eval_intent_score + eval_slot_score) / 2
print("eval acc: %.5f speed: %.3f s" % (eval_score,(time.time()-batch_time)))
batch_time = time.time()
if best_score < eval_score:
print("Intent ACC update %.5f ---> %.5f Slot ACC update %.5f ---> %.5f Score update %.5f ---> %.5f" % (best_intent_acc,eval_intent_score,best_slot_acc,eval_slot_score,best_score,eval_score))
best_score = eval_score
best_intent_acc = eval_intent_score
best_slot_acc = eval_slot_score
            # save the model
os.makedirs(save_dir_curr,exist_ok=True)
save_param_path = os.path.join(save_dir_curr, 'model_best.pdparams')
paddle.save(model.state_dict(), save_param_path)
            # save the tokenizer
tokenizer.save_pretrained(save_dir_curr)
else:
print("but best score least %.5f" % best_score )
return best_score
# run training
best_score = do_train(model,intent_train_data_loader)
train ...
step: 50 / 2020, training loss: 3.28993
step: 100 / 2020, training loss: 2.49215
eval acc: 0.66791 speed: 332.651 s
but best score least 0.95000
step: 150 / 2020, training loss: 0.27191
step: 200 / 2020, training loss: 0.38707
eval acc: 0.83458 speed: 329.139 s
but best score least 0.95000
step: 250 / 2020, training loss: 0.06808
step: 300 / 2020, training loss: 0.12014
eval acc: 0.91045 speed: 330.730 s
but best score least 0.95000
step: 350 / 2020, training loss: 0.03155
step: 400 / 2020, training loss: 0.06935
eval acc: 0.91418 speed: 331.229 s
but best score least 0.95000
step: 450 / 2020, training loss: 0.01728
step: 500 / 2020, training loss: 0.04216
eval acc: 0.93532 speed: 335.089 s
but best score least 0.95000
step: 550 / 2020, training loss: 0.01810
step: 600 / 2020, training loss: 0.03292
eval acc: 0.94652 speed: 335.920 s
but best score least 0.95000
step: 650 / 2020, training loss: 0.01072
step: 700 / 2020, training loss: 0.02260
eval acc: 0.95647 speed: 346.218 s
Intent ACC update 0.00000 ---> 0.95522 Slot ACC update 0.00000 ---> 0.95771 Score update 0.95000 ---> 0.95647
[2022-07-17 12:48:17,252] [ INFO] - tokenizer config file saved in checkpoint/intent_model_bert-base-multilingual-cased/tokenizer_config.json
[2022-07-17 12:48:17,257] [ INFO] - Special tokens file saved in checkpoint/intent_model_bert-base-multilingual-cased/special_tokens_map.json
step: 750 / 2020, training loss: 0.00686
step: 800 / 2020, training loss: 0.02053
eval acc: 0.95522 speed: 356.755 s
but best score least 0.95647
step: 850 / 2020, training loss: 0.00754
step: 900 / 2020, training loss: 0.01586
eval acc: 0.95522 speed: 346.474 s
but best score least 0.95647
step: 950 / 2020, training loss: 0.00447
step: 1000 / 2020, training loss: 0.01250
eval acc: 0.94900 speed: 344.225 s
but best score least 0.95647
step: 1050 / 2020, training loss: 0.00423
step: 1100 / 2020, training loss: 0.00993
eval acc: 0.95522 speed: 353.853 s
but best score least 0.95647
step: 1150 / 2020, training loss: 0.00401
step: 1200 / 2020, training loss: 0.00825
eval acc: 0.95647 speed: 354.875 s
but best score least 0.95647
step: 1250 / 2020, training loss: 0.00235
step: 1300 / 2020, training loss: 0.00769
……
step: 1900 / 2020, training loss: 0.00257
eval acc: 0.96144 speed: 378.821 s
but best score least 0.96144
step: 1950 / 2020, training loss: 0.00071
step: 2000 / 2020, training loss: 0.00244
eval acc: 0.96020 speed: 374.558 s
but best score least 0.96144
# Run logging
logging_dir = '../res/summit'
logging_name = os.path.join(logging_dir,'run_logging.csv')
os.makedirs(logging_dir,exist_ok=True)
var = [MODEL_NAME, seed, learning_rate, max_seq_length, best_score]
names = ['model', 'seed', 'lr', "max_len", 'best_score']
vars_dict = {k: v for k, v in zip(names, var)}
results = dict(**vars_dict)
keys = list(results.keys())
values = list(results.values())
if not os.path.exists(logging_name):
ori = []
ori.append(values)
logging_df = pd.DataFrame(ori, columns=keys)
logging_df.to_csv(logging_name, index=False)
else:
logging_df= pd.read_csv(logging_name)
new = pd.DataFrame(results, index=[1])
logging_df = logging_df.append(new, ignore_index=True)
logging_df.to_csv(logging_name, index=False)
# Model prediction
def do_sample_predict(model,data_loader):
model.eval()
pred_intents = []
pred_slots = []
for batch in data_loader:
input_ids, token_type_ids = batch
intent_logits, slot_logits = model(input_ids, token_type_ids)
pred_intents.extend(paddle.argmax(intent_logits,axis=1).numpy())
slot_logits = paddle.argmax(slot_logits,axis=2)
length_labels = [len([i for i in i_input if i != tokenizer.pad_token_id]) for i_input in input_ids]
pred_slots.extend([(s.numpy(),i.numpy(),l) for s,i,l in zip(slot_logits,input_ids,length_labels)])
return pred_intents,pred_slots
state_dict = paddle.load(os.path.join(save_dir_curr,'model_best.pdparams'))
model.load_dict(state_dict)
print("predict start ...")
predict_intent,predict_slot = do_sample_predict(model,intent_test_data_loader)
print("predict end ...")
# Re-implement the tokenizer's convert_ids_to_string to strip the spaces and other artifacts introduced by convert_ids_to_tokens
def convert_ids_to_string(tokenizer,ids):
tokens = tokenizer.convert_ids_to_tokens(ids)
out_string = " ".join(tokens).replace(" ##", "").strip()
language = langid.classify(out_string)[0]
if language in ["en",'fr']:
out_string = out_string.replace(" - ","-").replace(" . ",".")
elif language in ['ja','zh']:
out_string = out_string.replace(" ","").replace("##", "")
return out_string
# convert intent label ids back to text
pred_intent = [id2intent[i] for i in predict_intent]
# convert slot label ids back to text
pred_slots1 = []
pred_slots2 = []
for idx,sample in tqdm(enumerate(predict_slot)):
slot_label,text_input_ids,length = sample
slot_label = slot_label[1:length-1]
text_input_ids = text_input_ids[1:length-1]
if slot_label.sum() == 0:
pred_slots1.append(np.NAN)
pred_slots2.append(np.NAN)
else:
slot_name = [id2slot[i] for i in slot_label]
slot_dict = {slot:[] for slot in set(slot_name) if slot!='O'}
for word_id,tag in zip(text_input_ids,slot_name):
if tag != 'O':
slot_dict[tag].append(word_id)
for slot in slot_dict.keys():
slot_dict[slot] = convert_ids_to_string(tokenizer,slot_dict[slot])
# print(slot_dict[slot])
results = ["{}:{}".format(k.split("-")[1],v) for k,v in slot_dict.items()]
pred_slots1.append(results[0])
if len(results) > 1:
pred_slots2.append(results[1])
else:
pred_slots2.append(np.NAN)
assert len(pred_slots1) == len(pred_slots2)
assert len(pred_slots1) == len(pred_intent)
assert len(pred_slots2) == len(pred_intent)
# name of the submission file
nums = len(logging_df) - 1
submit_excel_name = "../res/submit_testA{}.xlsx".format(nums)
# build the submission dataframe
predict_submit = pd.DataFrame([], columns=["原始文本", "意图", "槽值1", "槽值2"])
# raw texts
texts = test['原始文本'].tolist()
# add one row per prediction
for text, intent, slot1, slot2 in zip(texts, pred_intent, pred_slots1, pred_slots2):
    predict_submit.loc[len(predict_submit), :] = [text, intent, slot1, slot2]
# split into the English and Japanese parts
submit_en = predict_submit.loc[:test_en.shape[0] - 1, :]
submit_japan = predict_submit.loc[test_en.shape[0]:, :]
# store the two sheets in a dict
sheet_names = ['英文_testA', '日语_testA']
report_dict = {'英文_testA': submit_en, '日语_testA': submit_japan}
# create an ExcelWriter and save each sheet
writer = pd.ExcelWriter(submit_excel_name, engine='openpyxl')
for sheet_name in sheet_names:
    report_dict[sheet_name].to_excel(excel_writer=writer, sheet_name=sheet_name, encoding="utf-8", index=False)
    print(f"{sheet_name} save complete")
writer.save()
writer.close()
print(f"==={submit_excel_name} complete===")