Hands-on NER: a demo

Introduction to the NER task

Named Entity Recognition (NER) is one of the more important tasks in Natural Language Processing (NLP); nearly half of all text-processing projects touch on it in one form or another.

NER is a sequence labeling task, in the same family as word segmentation and part-of-speech tagging: the input is a sequence, and the output is a sequence of the same length. For example:
Input: [北, 京, 天, 气, 真, 不, 错]
Output: [1, 2, 0, 0, 0, 0, 0]
Here 1 marks the first character of a location entity (B-LOC) and 2 marks a character inside one (I-LOC); from B-LOC and I-LOC we can recover the location entity 北京 (Beijing).
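As a quick illustration, here is a minimal sketch of turning such an index sequence back into entities (the helper decode_entities and its defaults are mine, not part of the demo code below):

def decode_entities(chars, tags, begin=1, inside=2):
    """Recover entity strings from a B/I/O index sequence."""
    entities, current = [], ''
    for c, t in zip(chars, tags):
        if t == begin:                 # B: open a new entity
            if current:
                entities.append(current)
            current = c
        elif t == inside and current:  # I: extend the open entity
            current += c
        else:                          # O: close any open entity
            if current:
                entities.append(current)
            current = ''
    if current:
        entities.append(current)
    return entities

print(decode_entities(['北', '京', '天', '气', '真', '不', '错'],
                      [1, 2, 0, 0, 0, 0, 0]))  # ['北京']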

Introduction to NER algorithms

NER algorithms have come a long way, from the ancient days of HMMs, through CRFs, to today's hugely popular deep learning combination CNN+BiLSTM+CRF (paper link); accuracy has climbed steadily, and a modern CNN+BiLSTM+CRF model can already exceed 97% accuracy.
I built some labeled data myself and wrote a demo on top of Keras; to improve prediction accuracy and generalization, the model stacks Bidirectional (GRU) layers and a TimeDistributed layer, topped with a CRF. The prediction accuracy turned out quite decent, so I'm sharing it here.

Building the data

Building the annotation function

The task is to recognize name, age and city; we use [], () and {} respectively to mark entity positions in the raw text. A more elaborate tagging scheme would be overkill for a dataset this simple, so we use plain BIO tags: B for the first character of an entity, I for the rest, and O for everything else.

lst = 'O\tB-NAME\tI-NAME\tB-AGE\tI-AGE\tB-CITY\tI-CITY'
print(lst.split('\t'))

# ['O', 'B-NAME', 'I-NAME', 'B-AGE', 'I-AGE', 'B-CITY', 'I-CITY']
# Name: []
# Age: ()
# City: {}

label_tag = {
    'NAME': '[]',
    'AGE': '()',
    'CITY': '{}'
}

label_tag

Output:

['O', 'B-NAME', 'I-NAME', 'B-AGE', 'I-AGE', 'B-CITY', 'I-CITY']
{'NAME': '[]', 'AGE': '()', 'CITY': '{}'}

Next we tag every character: B marks the first character of a named entity, I marks a character inside one, and O marks everything that is not part of an entity.

txt = "Hi, I'm [Bob] and I from {SZ}, I'm (8) years old."

def annotator(text):
    data = []
    outside = 'O'
    tag = outside
    flag = False
    for c in text:
        for k, v in label_tag.items():
            start, end = v     # v is a two-character string like '[]'
            if c == start:     # opening marker: remember the entity type
                tag = k
                flag = True
                break
            elif c == end:     # closing marker: back to outside
                tag = outside
                break
        else:                  # ordinary character, no marker matched
            if flag:           # first character after an opening marker
                _tag = 'B-' + tag
            elif tag != outside:
                _tag = 'I-' + tag
            else:
                _tag = tag
            data.append((c, _tag))
            flag = False

    return ['\t'.join(line) for line in data]

annotator(txt)

Output:

['H\tO',
 'i\tO',
 ',\tO',
 ' \tO',
 'I\tO',
 "'\tO",
 'm\tO',
 ' \tO',
 'B\tB-NAME',
 'o\tI-NAME',
 'b\tI-NAME',
 ' \tO',
 'a\tO',
 'n\tO',
 'd\tO',
 ' \tO',
 'I\tO',
 ' \tO',
 'f\tO',
 'r\tO',
 'o\tO',
 'm\tO',
 ' \tO',
 'S\tB-CITY',
 'Z\tI-CITY',
 ',\tO',
 ...

Creating the data set

import re
import random
import string

names = [''.join(random.sample(string.ascii_lowercase, random.randint(3, 5))).capitalize() for i in range(1000)]
cities = [''.join(random.sample(string.ascii_lowercase, random.randint(5, 10))).upper() for i in range(1000)]
age = [random.randint(20, 100) for i in range(1000)]

text = '''
Hi, I'm [Bob] and I from {SZ}, I'm (18) years old.
Hello, I am [RPI], I'm (20) years old and I come from {GZ}.
I born in {YuLin}. I'm (24), just call me [ATA].
I don't want to tell you my name, but I from {ShangHai}.
Hey My name is [CZW], I live in {HangZhou}, what's your name?
I don't wanna tell you my name, but I from {BeiJing}.
Oh, I'm [Linda] and I from {NanNing}, I'm (25) years old.
Hi, I'm [Tian] and I from {YUI}, I'm (19) years old
Hey, I'm [Lily] and I from {PEI}, I'm (19) years old
'''
train_text = []
sentences = text.strip().split('\n')
for i in range(500):
    for row in sentences:
        # swap each bracketed placeholder for a random entity of the same type
        strs = re.sub(r'\[.*?\]', '[%s]' % random.choice(names), row)
        strs = re.sub(r'\{.*?\}', '{%s}' % random.choice(cities), strs)
        strs = re.sub(r'\(.*?\)', '(%s)' % random.choice(age), strs)
        train_text.append(strs)

random.shuffle(train_text)

char_lst = [annotator(row) + ['END'] for row in train_text]
with open('train.txt', 'w+') as fp:
    fp.write('\n'.join(sum(char_lst, [])))  # flatten: one char\ttag pair per line, END between sentences
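To sanity-check the file format, you can peek at the first few lines (this readback is just a quick check, not part of the pipeline):

with open('train.txt') as fp:
    for line in fp.readlines()[:5]:
        print(repr(line.rstrip('\n')))  # e.g. "H\tO", "i\tO", ...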

Another way to build such a dataset is described in my previous blog post on generating fixed-format random text with regular expressions in Python.

Text processing

Process the text: pad the x and y vectors to a uniform length and build arrays that the model can consume.

# pad x and y to a uniform length of sentence_max_length
sentence_max_length = 100
# voc covers every character that can occur in this training set:
# digits, letters and a little punctuation
voc = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.'? "
label = ['O', 'B-NAME', 'I-NAME', 'B-AGE', 'I-AGE', 'B-CITY', 'I-CITY']

# encode characters as 1-based indices into voc, reserving 0 for padding
def voc_idx(lst):
    _voc = list(voc)
    res = [_voc.index(i) + 1 if i in _voc else 0 for i in lst]
    if len(res) > sentence_max_length:
        raise Exception('exceed sentence_max_length')

    return res + [0] * (sentence_max_length - len(res))

# encode tags as their index in label; padding reuses 0 ('O')
def label_idx(lst):
    res = [label.index(i) for i in lst]
    if len(res) > sentence_max_length:
        raise Exception('exceed sentence_max_length')

    return res + [0] * (sentence_max_length - len(res))
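A quick check of the two encoders on short inputs (the index values follow from the voc string above):

print(voc_idx(['H', 'i'])[:5])         # [44, 19, 0, 0, 0] plus padding up to length 100
print(label_idx(['O', 'B-NAME'])[:5])  # [0, 1, 0, 0, 0]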


with open('train.txt') as fp:
    text = [row.strip().split('\n') for row in fp.read().split('END')]

train_data = []
label_data = []

for row in text:
    if len(row) <= 1:
        continue
    train_lst, label_lst = zip(*[i.split('\t') for i in row])  # unzip into chars and tags
    train_data.append(voc_idx(train_lst))    # pad to sentence_max_length
    label_data.append(label_idx(label_lst))  # pad to sentence_max_length

# build the input and target arrays
import numpy as np
x, y = np.array(train_data), np.array(label_data)
y = y.reshape((y.shape[0], y.shape[1], 1))  # trailing axis required by the CRF's sparse_target mode
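A quick shape check (the 4500 comes from the 9 template sentences repeated 500 times above):

print(x.shape)  # (4500, 100): one padded row per sentence
print(y.shape)  # (4500, 100, 1): per-timestep integer targets for the CRF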

Create model layers

Import the required packages

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, GRU, TimeDistributed, Dense
from keras_contrib.layers import CRF

Create model

model = Sequential()
# vocabulary size + 1 to account for the padding index 0
model.add(Embedding(len(voc) + 1, 200))
# two stacked bidirectional GRU layers;
# return_sequences=True keeps one output per timestep
model.add(Bidirectional(GRU(256, return_sequences=True)))
model.add(Bidirectional(GRU(256, return_sequences=True)))
# per-timestep projection onto the label space
model.add(TimeDistributed(Dense(len(label))))
crf = CRF(len(label), sparse_target=True)
model.add(crf)
model.summary()

Model structure

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_9 (Embedding)      (None, None, 200)         13600     
_________________________________________________________________
bidirectional_17 (Bidirectio (None, None, 512)         701952    
_________________________________________________________________
bidirectional_18 (Bidirectio (None, None, 512)         1181184   
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 7)           3591      
_________________________________________________________________
crf_7 (CRF)                  (None, None, 7)           119       
=================================================================
Total params: 1,900,446
Trainable params: 1,900,446
Non-trainable params: 0
_________________________________________________________________
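As a sanity check, these parameter counts can be reproduced by hand (the GRU here has 3 gates with one bias each, i.e. the pre-reset_after Keras formula; the 119 CRF weights follow keras_contrib's kernel, transition, bias and boundary terms):

emb = (len(voc) + 1) * 200               # 68 * 200 = 13600
gru1 = 2 * 3 * 256 * (200 + 256 + 1)     # 701952: two directions of one GRU
gru2 = 2 * 3 * 256 * (512 + 256 + 1)     # 1181184: input is now 512 (both directions concatenated)
dense = 512 * 7 + 7                      # 3591
crf_params = 7 * 7 + 7 * 7 + 7 + 7 + 7   # 119: input kernel, transitions, bias, boundary energies
print(emb, gru1, gru2, dense, crf_params)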

Train model

model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
model.fit(x, y, batch_size=32, epochs=10)

The model has quite a few parameters, so training takes a while; you can reduce epochs to shorten it.
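If you want to watch generalization during training, holding out a validation split is a one-line change (a sketch; EarlyStopping is a standard Keras callback):

from keras.callbacks import EarlyStopping

model.fit(x, y, batch_size=32, epochs=10,
          validation_split=0.1,                   # hold out 10% of sentences
          callbacks=[EarlyStopping(patience=2)])  # stop once val loss stops improving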

Model predict

# d is one sentence already encoded with voc_idx, i.e. an example x row

d = np.array([45, 67, 13, 11, 24, 67, 24, 25, 30, 67, 30, 15, 22, 22, 67, 35, 25,
       31, 67, 23, 35, 67, 24, 11, 23, 15, 63, 67, 12, 31, 30, 67, 45, 67,
       16, 28, 25, 23, 67, 58, 37, 40, 38, 39, 41, 64,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

voc = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.'? "  # maps index-1 back to a character
pred = model.predict(d[np.newaxis, :])[0]         # add the batch axis; output shape is (100, 7)
result_label = [int(np.argmax(p)) for p in pred]  # most likely tag index per character
print(result_label)

Finally, map the predicted indices back to tags to judge the prediction: for example 0 is O, 5 is B-CITY and 6 is I-CITY.
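Putting it together, a minimal sketch (the helper decode_prediction is mine) that pairs each non-padding character with its predicted tag:

def decode_prediction(encoded, pred_idx):
    chars = [voc[i - 1] for i in encoded if i != 0]  # undo the +1 index shift, drop padding
    tags = [label[t] for t in pred_idx[:len(chars)]]
    return list(zip(chars, tags))

for char, tag in decode_prediction(d, result_label):
    if tag != 'O':  # show only characters that fall inside an entity
        print(char, tag)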
