NER (Named Entity Recognition) is a fundamental natural language processing technique that extracts entities from a given text and classifies them into predefined categories, such as company names, person names, and place names. The spaCy library lets us either update an existing model for a specific context or train a new model from scratch. In this article, we will walk through building a custom NER model.
1. First, load the required dependencies:
import os
import pandas as pd
import spacy
import random
import joblib
from spacy.training.example import Example
import time
2. After annotating the data in Label Studio, export the JSON files and merge them into a single Excel file:
# After exporting from Label Studio, read each JSON file with pandas
path = 'C:/Users/xxx/downloads/'
dfs = []
for file in os.listdir(path):
    data = pd.read_json(path + str(file), encoding='utf-8')
    # Excel cannot store timezone-aware datetimes, so strip the tzinfo
    data['created_at'] = data['created_at'].apply(lambda x: x.replace(tzinfo=None) if x is not None else None)
    data['updated_at'] = data['updated_at'].apply(lambda x: x.replace(tzinfo=None) if x is not None else None)
    dfs.append(data)
# Merge all the DataFrames
merged_df = pd.concat(dfs)
# Write the merged DataFrame to an Excel file
merged_df.to_excel(path + 'labeling output.xlsx', index=False)
3. Convert the Excel output above into a training dataset:
df = pd.read_excel("C:/Users/xxx/downloads/labeling output.xlsx")
df = df.dropna()
label = df.label.values.tolist()
content = df['Content'].values.tolist()
all_label, all_values = [], []
## Parse the Label column: each cell is a string holding a list of
## annotation dicts, so eval() turns it back into Python objects
for line in range(len(label)):
    ner_label, values = [], []
    for i in eval(label[line]):
        start = i['start']
        end = i['end']
        labels = i['labels'][0]
        value = i['text']
        ner_label.append((start, end, labels))
        values.append(value)
    all_label.append(ner_label)
    all_values.append(values)
df['ner-label'] = all_label
df['values'] = all_values
df.to_excel("C:/Users/xxx/downloads/ner.xlsx", index=False)
4. Check the data quality of ner.xlsx to make sure it meets the training requirements; the data should contain the following columns:
Content | ner-label | values
text 1 | [(start_id1, end_id1, Entity1), (start_id2, end_id2, Entity2), ...] | ['entity text 1', 'entity text 2', ...]
text 2 | [(start_id3, end_id3, Entity3), (start_id4, end_id4, Entity2), ...] | ['entity text 3', 'entity text 4', ...]
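One quick quality check is to verify that each (start, end) offset actually slices out the recorded entity text. Below is a minimal sketch of that check on a single made-up row; the sample sentence, offsets, and label names are illustrative, not from the real dataset:

```python
# Hypothetical sample row in the ner.xlsx format described above
content = "Apple hired John Smith in London."
ner_label = [(0, 5, "ORG"), (12, 22, "PERSON"), (26, 32, "GPE")]
values = ["Apple", "John Smith", "London"]

# Verify every (start, end) offset slices out the expected entity text
for (start, end, label), value in zip(ner_label, values):
    assert content[start:end] == value, f"misaligned span {start}:{end} for {label}"
print("all spans aligned")
# -> all spans aligned
```

Misaligned spans are worth catching here, because spaCy will reject (or silently skip, with the try/except used below in training) any example whose character offsets do not line up with token boundaries.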
5. For the model training stage, convert the loaded dataset into the data format that spaCy expects for NER training:
def GroupData(content, label, TRAIN_DATA):
    for line in range(len(content)):
        Entity = {}
        Entities = []
        for i in eval(label[line]):
            # A cell holding a single bare tuple evals to ints when iterated,
            # so wrap the whole tuple back into a one-element list
            if 'int' in str(type(i)):
                Entities.append(eval(label[line]))
                break
            if 'tuple' in str(type(i)):
                Entities.append(i)
        Entity["entities"] = Entities
        TRAIN_DATA.append((content[line], Entity))
    return TRAIN_DATA
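The conversion can be sanity-checked on a tiny made-up sample. The snippet below inlines the same logic as GroupData (using isinstance in place of the string-based type checks); the sentences, offsets, and label names are illustrative only:

```python
# Simulate two rows as read from ner.xlsx: the label column is a string,
# which is why the conversion uses eval() on it. The second row is a bare
# tuple to exercise the single-entity branch.
content = ["Apple opened a store in Paris.", "Bob joined IBM."]
label = ["[(0, 5, 'ORG'), (24, 29, 'GPE')]", "(0, 3, 'PERSON')"]

TRAIN_DATA = []
for line in range(len(content)):
    entities = []
    for i in eval(label[line]):
        if isinstance(i, int):      # bare tuple: iterating yields ints
            entities.append(eval(label[line]))
            break
        if isinstance(i, tuple):    # list of tuples: keep each tuple
            entities.append(i)
    TRAIN_DATA.append((content[line], {"entities": entities}))

print(TRAIN_DATA[0])
# -> ('Apple opened a store in Paris.', {'entities': [(0, 5, 'ORG'), (24, 29, 'GPE')]})
```

This (text, {"entities": [(start, end, label), ...]}) shape is what the training function below feeds into Example.from_dict.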
6. Next, the model training function, where iterations is the number of training passes over the data:
def train_spacy(TRAIN_DATA, iterations):
    # Create a blank model: "en" for English, "zh" for Chinese
    nlp = spacy.blank("en")
    # Add the NER component if it is not present yet
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    # Register every entity label with the spaCy model
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    # Collect all pipes other than NER
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    # Disable the other pipes during training so they are unaffected
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        optimizer.learn_rate = 1e-3
        for itn in range(iterations):
            print("Starting iteration", itn + 1)
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                try:
                    doc = nlp.make_doc(text)
                    example = Example.from_dict(doc, annotations)
                    nlp.update([example], losses=losses, sgd=optimizer)  # drop=0.4
                except Exception:
                    # Skip examples whose entity spans fail to align
                    continue
    return nlp
7. The main function:
if __name__ == '__main__':
    df = pd.read_excel(r"C:\Users\xxx\Desktop\data\NER\ner-train.xlsx")
    df = df.dropna()
    ## Convert the columns to lists ##
    content = df['Content'].values.tolist()
    label = df['label'].values.tolist()
    TRAIN_DATA = []
    ## Assemble the training data
    GroupData(content, label, TRAIN_DATA)
    ## Train the model
    begin = time.perf_counter()
    trained_nlp = train_spacy(TRAIN_DATA, 50)
    joblib.dump(trained_nlp, r"C:\Users\xxx\Desktop\data\NER\NER.m")
    end_time = time.perf_counter()
    run_time = end_time - begin
    print('Model trained successfully; training time:', run_time, 's')
With that, you should have successfully trained a custom NER model. The next article in this NLP series, (4) Using and Evaluating the NER Model, will cover how to load the model and assess its performance.