Named Entity Recognition (NER) is a subtask of information extraction that locates named entities in text and classifies them into predefined categories such as person, organization, and location. It is a fundamental technique for applications such as information extraction, question answering, and syntactic parsing, and a key step toward extracting structured information.
Publicly available, high-quality, fine-grained Chinese NER datasets are scarce. To address this, CLUE selected part of THUCNEWS, the text classification dataset open-sourced by Tsinghua University, annotated it with fine-grained named entities, and cleaned the data to obtain a fine-grained NER dataset.
Project: GitHub - CLUEbenchmark/CLUENER2020: CLUENER2020 中文细粒度命名实体识别 Fine Grained Named Entity Recognition
For details, see the paper:
CLUENER2020: Fine-grained Named Entity Recognition Dataset and Benchmark for Chinese
https://arxiv.org/ftp/arxiv/papers/2001/2001.04351.pdf
A brief introduction to the dataset follows.
Data Categories
The data is divided into 10 label categories: address (地址), book (书名), company (公司), game (游戏), government (政府), movie (电影), name (姓名), organization (组织机构), position (职位), and scene (景点).
Label Definitions & Annotation Rules
address: province/city/district/street/house-number patterns, as well as roads, streets, villages, etc. (labeled even when such a component appears alone). Addresses are labeled as completely and as fine-grained as possible.
book: novels, magazines, workbooks, textbooks, study guides, atlases, cookbooks, and other kinds of books you can buy in a bookstore, including e-books.
company: companies, groups, and banks (except central banks such as the People's Bank of China, which count as government agencies), e.g. 新东方; also includes media outlets such as 新华网 and 中国军网.
game: common games. Note that some games are adapted from novels or TV series; analyze the specific context to decide whether the mention actually refers to the game.
government: both central and local administrative bodies. Central bodies include the State Council, its constituent departments (ministries, commissions, the People's Bank of China, and the National Audit Office), agencies directly under the State Council (e.g. customs, taxation, industry and commerce, the environmental protection administration), the military, etc.
movie: films, including documentaries shown in cinemas. If a film is adapted from a book, use the surrounding context to carefully distinguish whether the mention is the movie title or the book title.
name: person names in general, including characters from novels (宋江, 武松, 郭靖), their nicknames (及时雨, 花和尚), and aliases of famous people, as long as the alias can be mapped to a specific person.
organization: basketball teams, football teams, orchestras, clubs, etc.; also includes sects from novels, e.g. 少林寺, 丐帮, 铁掌帮, 武当, 峨眉.
position: historical titles such as 巡抚, 知州, 国师, and modern ones such as general manager, reporter, CEO, artist, and collector.
scene: common tourist attractions, e.g. 长沙公园, 深圳动物园, aquariums, botanical gardens, 黄河 (the Yellow River), 长江 (the Yangtze River).
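The baseline code below expects a label2id mapping over these 10 categories (defined in the project's config.py). As an illustrative sketch only, a BIOS-style tag vocabulary could be built like this; the real config may order the tags differently and add special tags such as padding or start/end markers:

```python
# Hypothetical sketch: building a BIOS tag vocabulary from the 10 CLUENER
# categories. The actual mapping lives in the project's config.py.
labels = ["address", "book", "company", "game", "government",
          "movie", "name", "organization", "position", "scene"]

label2id = {"O": 0}
for label in labels:
    for prefix in ("B-", "I-", "S-"):  # BIOS: Begin / Inside / Single-token
        label2id[prefix + label] = len(label2id)

id2label = {i: tag for tag, i in label2id.items()}
print(len(label2id))  # 31 tags: 3 per category plus "O"
```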
Data Download
Data Distribution
Training set: 10,748 examples
Dev set: 1,343 examples
Per-label counts for the training set are given below. (Note: every entity occurring in an example is annotated, so if an example contains two address entities, it contributes 2 to the address count.)
[Training set] label distribution:
address: 2829
book: 1131
company: 2897
game: 2325
government: 1797
movie: 1109
name: 3661
organization: 3075
position: 3052
scene: 1462
[Dev set] label distribution:
address: 364
book: 152
company: 366
game: 287
government: 244
movie: 150
name: 451
organization: 344
position: 425
scene: 199
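The counting rule from the note above (every entity mention counts once, so an example with two address entities adds 2 to the address tally) can be sketched in a few lines. This is an illustrative sketch assuming the train.json record format described in the next section, not code from the project:

```python
import json
from collections import Counter


def count_label_mentions(lines):
    """Count entity mentions per label: each annotated span counts once."""
    counter = Counter()
    for line in lines:
        record = json.loads(line)
        for label, entities in record.get("label", {}).items():
            for spans in entities.values():
                counter[label] += len(spans)
    return counter


# Two toy records; the second has two address spans, so it adds 2.
sample = [
    '{"text": "...", "label": {"name": {"周荫如": [[15, 17]]}}}',
    '{"text": "...", "label": {"address": {"丹棱县": [[1, 3], [10, 12]]}}}',
]
print(count_label_mentions(sample))  # Counter({'address': 2, 'name': 1})
```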
Data Fields
Taking train.json as an example, each record has two fields, text and label: text is the raw sentence, and label contains every entity in the text that falls into one of the 10 categories.
For example:
text: "北京勘察设计协会副会长兼秘书长周荫如"
label: {"organization": {"北京勘察设计协会": [[0, 7]]}, "name": {"周荫如": [[15, 17]]}, "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}
Here organization, name, and position are entity categories.
"organization": {"北京勘察设计协会": [[0, 7]]} means that in the original text, "北京勘察设计协会" is an entity of category organization, with start_index 0 and end_index 7 (note: indices are 0-based and the end index is inclusive).
"name": {"周荫如": [[15, 17]]} means that "周荫如" is an entity of category name, with start_index 15 and end_index 17.
"position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]} means that "副会长" is an entity of category position with start_index 8 and end_index 10, and "秘书长" is also a position entity, with start_index 12 and end_index 14.
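Since indices are 0-based and the end index is inclusive, an entity string can be recovered with text[start:end + 1]. A quick standard-library sanity check of the example above:

```python
import json

# The example record from the dataset description above.
line = ('{"text": "北京勘察设计协会副会长兼秘书长周荫如", '
        '"label": {"organization": {"北京勘察设计协会": [[0, 7]]}, '
        '"name": {"周荫如": [[15, 17]]}, '
        '"position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}}')

record = json.loads(line)
text = record["text"]
for label, entities in record["label"].items():
    for word, spans in entities.items():
        for start, end in spans:
            # end_index is inclusive, hence the end + 1
            assert text[start:end + 1] == word
            print(label, word, start, end)
```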
Data Source
The dataset was built on THUCTC, the text classification dataset open-sourced by Tsinghua University, by selecting a subset of the data for fine-grained named entity annotation; the original data comes from Sina News RSS.
Baseline: BiLSTM-CRF
Environment:
- pytorch 1.12
- python 3.7
Main model code (the CRF layer is imported from the project's crf.py):
from torch.nn import LayerNorm
import torch.nn as nn
from crf import CRF  # CRF implementation from the CLUENER2020 project


class SpatialDropout(nn.Dropout2d):
    """Drop whole embedding channels rather than individual elements."""
    def __init__(self, p=0.6):
        super(SpatialDropout, self).__init__(p=p)

    def forward(self, x):
        x = x.unsqueeze(2)                          # (N, T, 1, K)
        x = x.permute(0, 3, 2, 1)                   # (N, K, 1, T)
        x = super(SpatialDropout, self).forward(x)  # (N, K, 1, T), some channels are zeroed
        x = x.permute(0, 3, 2, 1)                   # (N, T, 1, K)
        x = x.squeeze(2)                            # (N, T, K)
        return x


class NERModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size,
                 label2id, device, drop_p=0.1):
        super(NERModel, self).__init__()
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.bilstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                              batch_first=True, num_layers=2, dropout=drop_p,
                              bidirectional=True)
        self.dropout = SpatialDropout(drop_p)
        self.layer_norm = LayerNorm(hidden_size * 2)
        self.classifier = nn.Linear(hidden_size * 2, len(label2id))
        self.crf = CRF(tagset_size=len(label2id), tag_dictionary=label2id, device=device)

    def forward(self, inputs_ids, input_mask):
        embs = self.embedding(inputs_ids)
        embs = self.dropout(embs)
        embs = embs * input_mask.float().unsqueeze(2)  # zero out padding positions
        sequence_output, _ = self.bilstm(embs)
        sequence_output = self.layer_norm(sequence_output)
        features = self.classifier(sequence_output)
        return features

    def forward_loss(self, input_ids, input_mask, input_lens, input_tags=None):
        features = self.forward(input_ids, input_mask)
        if input_tags is not None:
            return features, self.crf.calculate_loss(features, tag_list=input_tags, lengths=input_lens)
        else:
            return features
Training and evaluation code:
import json
import torch
import argparse
import torch.nn as nn
from torch import optim

import config
from model import NERModel
from dataset_loader import DatasetLoader
from progressbar import ProgressBar
from ner_metrics import SeqEntityScore
from data_processor import CluenerProcessor
from lr_scheduler import ReduceLROnPlateau
from utils_ner import get_entities
from common import (init_logger,
                    logger,
                    json_to_text,
                    load_model,
                    AverageMeter,
                    seed_everything)


def train(args, model, processor):
    train_dataset = load_and_cache_examples(args, processor, data_type='train')
    train_loader = DatasetLoader(data=train_dataset, batch_size=args.batch_size,
                                 shuffle=False, seed=args.seed, sort=True,
                                 vocab=processor.vocab, label2id=args.label2id)
    parameters = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.Adam(parameters, lr=args.learning_rate)
    scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3,
                                  verbose=1, epsilon=1e-4, cooldown=0, min_lr=0, eps=1e-8)
    best_f1 = 0
    for epoch in range(1, 1 + args.epochs):
        print(f"Epoch {epoch}/{args.epochs}")
        pbar = ProgressBar(n_total=len(train_loader), desc='Training')
        train_loss = AverageMeter()
        model.train()
        assert model.training
        for step, batch in enumerate(train_loader):
            input_ids, input_mask, input_tags, input_lens = batch
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            input_tags = input_tags.to(args.device)
            features, loss = model.forward_loss(input_ids, input_mask, input_lens, input_tags)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_norm)
            optimizer.step()
            optimizer.zero_grad()
            pbar(step=step, info={'loss': loss.item()})
            train_loss.update(loss.item(), n=1)
        print(" ")
        train_log = {'loss': train_loss.avg}
        if 'cuda' in str(args.device):
            torch.cuda.empty_cache()
        eval_log, class_info = evaluate(args, model, processor)
        logs = dict(train_log, **eval_log)
        show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
        logger.info(show_info)
        scheduler.epoch_step(logs['eval_f1'], epoch)
        if logs['eval_f1'] > best_f1:
            logger.info(f"\nEpoch {epoch}: eval_f1 improved from {best_f1} to {logs['eval_f1']}")
            logger.info("save model to disk.")
            best_f1 = logs['eval_f1']
            if isinstance(model, nn.DataParallel):
                model_stat_dict = model.module.state_dict()
            else:
                model_stat_dict = model.state_dict()
            state = {'epoch': epoch, 'arch': args.arch, 'state_dict': model_stat_dict}
            model_path = args.output_dir / 'best-model.bin'
            torch.save(state, str(model_path))
            print("Eval Entity Score: ")
            for key, value in class_info.items():
                info = f"Subject: {key} - Acc: {value['acc']} - Recall: {value['recall']} - F1: {value['f1']}"
                logger.info(info)


def evaluate(args, model, processor):
    eval_dataset = load_and_cache_examples(args, processor, data_type='dev')
    eval_dataloader = DatasetLoader(data=eval_dataset, batch_size=args.batch_size,
                                    shuffle=False, seed=args.seed, sort=False,
                                    vocab=processor.vocab, label2id=args.label2id)
    pbar = ProgressBar(n_total=len(eval_dataloader), desc="Evaluating")
    metric = SeqEntityScore(args.id2label, markup=args.markup)
    eval_loss = AverageMeter()
    model.eval()
    with torch.no_grad():
        for step, batch in enumerate(eval_dataloader):
            input_ids, input_mask, input_tags, input_lens = batch
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            input_tags = input_tags.to(args.device)
            features, loss = model.forward_loss(input_ids, input_mask, input_lens, input_tags)
            eval_loss.update(val=loss.item(), n=input_ids.size(0))
            tags, _ = model.crf._obtain_labels(features, args.id2label, input_lens)
            input_tags = input_tags.cpu().numpy()
            target = [input_[:len_] for input_, len_ in zip(input_tags, input_lens)]
            metric.update(pred_paths=tags, label_paths=target)
            pbar(step=step)
    print(" ")
    eval_info, class_info = metric.result()
    eval_info = {f'eval_{key}': value for key, value in eval_info.items()}
    result = {'eval_loss': eval_loss.avg}
    result = dict(result, **eval_info)
    return result, class_info


def predict(args, model, processor):
    model_path = args.output_dir / 'best-model.bin'
    model = load_model(model, model_path=str(model_path))
    test_data = []
    with open(str(args.data_dir / "test.json"), 'r') as f:
        idx = 0
        for line in f:
            json_d = {}
            line = json.loads(line.strip())
            text = line['text']
            words = list(text)
            labels = ['O'] * len(words)
            json_d['id'] = idx
            json_d['context'] = " ".join(words)
            json_d['tag'] = " ".join(labels)
            json_d['raw_context'] = "".join(words)
            idx += 1
            test_data.append(json_d)
    pbar = ProgressBar(n_total=len(test_data))
    results = []
    for step, line in enumerate(test_data):
        token_a = line['context'].split(" ")
        input_ids = [processor.vocab.to_index(w) for w in token_a]
        input_mask = [1] * len(token_a)
        input_lens = [len(token_a)]
        model.eval()
        with torch.no_grad():
            input_ids = torch.tensor([input_ids], dtype=torch.long)
            input_mask = torch.tensor([input_mask], dtype=torch.long)
            input_lens = torch.tensor([input_lens], dtype=torch.long)
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            features = model.forward_loss(input_ids, input_mask, input_lens, input_tags=None)
            tags, _ = model.crf._obtain_labels(features, args.id2label, input_lens)
        label_entities = get_entities(tags[0], args.id2label)
        json_d = {}
        json_d['id'] = step
        json_d['tag_seq'] = " ".join(tags[0])
        json_d['entities'] = label_entities
        results.append(json_d)
        pbar(step=step)
    print(" ")
    output_predic_file = str(args.output_dir / "test_prediction.json")
    output_submit_file = str(args.output_dir / "test_submit.json")
    with open(output_predic_file, "w") as writer:
        for record in results:
            writer.write(json.dumps(record) + '\n')
    test_text = []
    with open(str(args.data_dir / 'test.json'), 'r') as fr:
        for line in fr:
            test_text.append(json.loads(line))
    test_submit = []
    for x, y in zip(test_text, results):
        json_d = {}
        json_d['id'] = x['id']
        json_d['label'] = {}
        entities = y['entities']
        words = list(x['text'])
        if len(entities) != 0:
            for subject in entities:
                tag = subject[0]
                start = subject[1]
                end = subject[2]
                word = "".join(words[start:end + 1])
                if tag in json_d['label']:
                    if word in json_d['label'][tag]:
                        json_d['label'][tag][word].append([start, end])
                    else:
                        json_d['label'][tag][word] = [[start, end]]
                else:
                    json_d['label'][tag] = {}
                    json_d['label'][tag][word] = [[start, end]]
        test_submit.append(json_d)
    json_to_text(output_submit_file, test_submit)


def load_and_cache_examples(args, processor, data_type='train'):
    # Load data features from cache or dataset file
    cached_examples_file = args.data_dir / 'cached_crf-{}_{}_{}'.format(
        data_type,
        args.arch,
        str(args.task_name))
    if cached_examples_file.exists():
        logger.info("Loading features from cached file %s", cached_examples_file)
        examples = torch.load(cached_examples_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if data_type == 'train':
            examples = processor.get_train_examples()
        elif data_type == 'dev':
            examples = processor.get_dev_examples()
        logger.info("Saving features into cached file %s", cached_examples_file)
        torch.save(examples, str(cached_examples_file))
    return examples


def main():
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument("--do_train", default=False, action='store_true')
    parser.add_argument('--do_eval', default=False, action='store_true')
    parser.add_argument("--do_predict", default=False, action='store_true')
    parser.add_argument('--markup', default='bios', type=str, choices=['bios', 'bio'])
    parser.add_argument("--arch", default='bilstm_crf', type=str)
    parser.add_argument('--learning_rate', default=0.001, type=float)
    parser.add_argument('--seed', default=1234, type=int)
    parser.add_argument('--gpu', default='0', type=str)
    parser.add_argument('--epochs', default=50, type=int)
    parser.add_argument('--batch_size', default=32, type=int)
    parser.add_argument('--embedding_size', default=128, type=int)
    parser.add_argument('--hidden_size', default=384, type=int)
    parser.add_argument("--grad_norm", default=5.0, type=float, help="Max gradient norm.")
    parser.add_argument("--task_name", type=str, default='ner')
    args = parser.parse_args()
    args.data_dir = config.data_dir
    if not config.output_dir.exists():
        config.output_dir.mkdir()
    args.output_dir = config.output_dir / '{}'.format(args.arch)
    if not args.output_dir.exists():
        args.output_dir.mkdir()
    init_logger(log_file=str(args.output_dir / '{}-{}.log'.format(args.arch, args.task_name)))
    seed_everything(args.seed)
    if args.gpu != '':
        args.device = torch.device(f"cuda:{args.gpu}")
    else:
        args.device = torch.device("cpu")
    args.id2label = {i: label for i, label in enumerate(config.label2id)}
    args.label2id = config.label2id
    processor = CluenerProcessor(data_dir=config.data_dir)
    processor.get_vocab()
    model = NERModel(vocab_size=len(processor.vocab), embedding_size=args.embedding_size,
                     hidden_size=args.hidden_size, device=args.device, label2id=args.label2id)
    model.to(args.device)
    if args.do_train:
        train(args, model, processor)
    if args.do_eval:
        model_path = args.output_dir / 'best-model.bin'
        model = load_model(model, model_path=str(model_path))
        evaluate(args, model, processor)
    if args.do_predict:
        predict(args, model, processor)


if __name__ == "__main__":
    main()
Running
1. Run the following command to train the model:
python run_lstm_crf.py --do_train
After about four hours of training, the following results are obtained.
After 50 epochs,
eval_f1 reaches 0.7234823215476984
Per-category evaluation results:
- name - Acc: 0.7734 - Recall: 0.7634 - F1: 0.7684
- address - Acc: 0.542 - Recall: 0.5013 - F1: 0.5209
- movie - Acc: 0.7447 - Recall: 0.6954 - F1: 0.7192
- position - Acc: 0.787 - Recall: 0.7252 - F1: 0.7548
- organization - Acc: 0.8058 - Recall: 0.7575 - F1: 0.7809
- company - Acc: 0.7688 - Recall: 0.7302 - F1: 0.749
- scene - Acc: 0.6568 - Recall: 0.5311 - F1: 0.5873
- government - Acc: 0.7378 - Recall: 0.7976 - F1: 0.7665
- book - Acc: 0.7984 - Recall: 0.6688 - F1: 0.7279
- game - Acc: 0.7814 - Recall: 0.8237 - F1: 0.802
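Assuming "Acc" here denotes entity-level precision, as reported by the baseline's SeqEntityScore metric, each F1 is the harmonic mean of precision and recall. A quick check against the name row:

```python
def f1_score(precision, recall):
    """Harmonic mean of entity-level precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The name row above: Acc 0.7734, Recall 0.7634
print(round(f1_score(0.7734, 0.7634), 4))  # 0.7684, matching the reported F1
```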
2. Run the following command to run prediction on the test set:
python run_lstm_crf.py --do_predict
Taking the first few test records as examples:
{"id": 0, "text": "四川敦煌学”。近年来,丹棱县等地一些不知名的石窟迎来了海内外的游客,他们随身携带着胡文和的著作。"}
{"id": 1, "text": "尼日利亚海军发言人当天在阿布贾向尼日利亚通讯社证实了这一消息。"}
{"id": 2, "text": "销售冠军:辐射3-Bethesda"}
{"id": 3, "text": "所以大多数人都是从巴厘岛南部开始环岛之旅。"}
{"id": 4, "text": "备受瞩目的动作及冒险类大作《迷失》在其英文版上市之初就受到了全球玩家的大力追捧。"}
{"id": 5, "text": "filippagowski:14岁时我感觉自己像梵高"}
Extracted entities:
{"id": 0, "label": {"address": {"四川敦煌": [[0, 3]], "丹棱县": [[11, 13]]}, "name": {"胡文和": [[41, 43]]}}}
{"id": 1, "label": {"government": {"尼日利亚海军": [[0, 5]]}, "position": {"发言人": [[6, 8]]}, "organization": {"阿布贾": [[12, 14]]}, "company": {"尼日利亚通讯社": [[16, 22]]}}}
{"id": 2, "label": {}}
{"id": 3, "label": {"scene": {"巴厘岛": [[9, 11]]}}}
{"id": 4, "label": {"game": {"《迷失》": [[13, 16]]}}}
{"id": 5, "label": {"name": {"filippagowski": [[0, 12]], "梵高": [[24, 25]]}}}
Sequence labeling over the full sentences:
{"id": 0, "tag_seq": "B-address I-address I-address I-address O O O O O O O B-address I-address I-address O O O O O O O O O O O O O O O O O O O O O O O O O O O B-name I-name I-name O O O O", "entities": [["address", 0, 3], ["address", 11, 13], ["name", 41, 43]]}
{"id": 1, "tag_seq": "B-government I-government I-government I-government I-government I-government B-position I-position I-position O O O B-organization I-organization I-organization O B-company I-company I-company I-company I-company I-company I-company O O O O O O O O", "entities": [["government", 0, 5], ["position", 6, 8], ["organization", 12, 14], ["company", 16, 22]]}
{"id": 2, "tag_seq": "O O O O O O O O O O O O O O O O O", "entities": []}
{"id": 3, "tag_seq": "O O O O O O O O O B-scene I-scene I-scene O O O O O O O O O", "entities": [["scene", 9, 11]]}
{"id": 4, "tag_seq": "O O O O O O O O O O O O O B-game I-game I-game I-game O O O O O O O O O O O O O O O O O O O O O O O", "entities": [["game", 13, 16]]}
{"id": 5, "tag_seq": "B-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name O O O O O O O O O O O B-name I-name", "entities": [["name", 0, 12], ["name", 24, 25]]}
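The entities field is produced by decoding the tag sequence (the project uses get_entities from utils_ner). A minimal illustrative decoder for B-/I-/S- tags, not the project's exact implementation, could look like this:

```python
def decode_tags(tags):
    """Decode a BIOS tag sequence into (type, start, end) triples, end inclusive."""
    entities, etype, start = [], None, None
    for i, tag in enumerate(tags):
        # Close any open entity unless this tag continues it with a matching I-.
        if etype is not None and not (tag.startswith("I-") and tag[2:] == etype):
            entities.append((etype, start, i - 1))
            etype = None
        if tag.startswith("B-"):
            etype, start = tag[2:], i           # open a new multi-token entity
        elif tag.startswith("S-"):
            entities.append((tag[2:], i, i))    # single-token entity
    if etype is not None:
        entities.append((etype, start, len(tags) - 1))
    return entities


tags = "B-scene I-scene I-scene O O B-name I-name I-name".split()
print(decode_tags(tags))  # [('scene', 0, 2), ('name', 5, 7)]
```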
As you can see, the recognition results are reasonably good!
References
GitHub - CLUEbenchmark/CLUENER2020: CLUENER2020 中文细粒度命名实体识别 Fine Grained Named Entity Recognition
CLUENER 细粒度命名实体识别 baseline: BiLSTM-CRF (Rock_y's blog, CSDN)
CLUENER2020: 中文细粒度命名实体识别数据集来了 (Zhihu)