Named Entity Recognition with CRF (CodingPark编程公园)

Overview

This article walks through the full pipeline:
raw corpus -> corpus cleaning -> corpus splitting -> building training and test data -> CRF++ training -> named-entity retrieval -> model evaluation

The Full Project

  1. Raw corpus
    The January 1998 People's Daily corpus is used as the example.

  2. Corpus cleaning
    (1) Convert all full-width (SBC) characters in the corpus to half-width (DBC).
    (2) Collapse triple spaces to double spaces.
    The separator between tagged tokens is defined as a double space, but some triple spaces occur.
    (3) Expand single spaces to double spaces.
    The separator between tagged tokens is defined as a double space, but some single spaces occur.
    (4) Merge bracketed content.
    For example: '学生/n  ,/w  奔波/v  于/p  两/m  个/q  课堂/n  。/w  [上海市/ns  房屋/n  土地/n  管理局/n]nt  为/p'
    (5) Merge person names.
    For example, "金/nr  正日/nr" is merged into "金正日/nr".
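The full-width rule in step (1) can be checked in isolation: U+3000 (the ideographic space) maps to an ASCII space, and the full-width block U+FF01 to U+FF5E maps onto ASCII by subtracting 65248 (0xFEE0). A minimal single-character sketch of that conversion (the complete cleaning script appears below):

```python
# Full-width (SBC) -> half-width (DBC) for one character,
# mirroring the rule described in step (1) above.
def to_dbc(ch):
    code = ord(ch)
    if code == 12288:             # U+3000 ideographic space -> ASCII space
        code = 32
    elif 65281 <= code <= 65374:  # U+FF01..U+FF5E -> shift down by 65248
        code -= 65248
    return chr(code)

print(to_dbc('，'))  # full-width comma U+FF0C -> ','
```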

  3. Corpus splitting
    Split cleaned_data.txt randomly 8:2 into raw training and test corpora.

  4. Building the training and test data
    Training data are generated with the "BMEWO" tagging scheme:
    'B': Begin
    'M': Middle
    'E': End
    'W': a single-character entity
    'O': Other
    Mapping between recognized entity types and corpus part-of-speech tags:
    Time: TIME, /t
    Person: PERSON, /nr
    Location: LOCATION, /ns
    Organization: ORGANIZATION, /nt
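As an illustration of the scheme, labeling one segmented word can be sketched like this (a hypothetical helper for illustration only; the project's full generator appears in the code section below):

```python
# BMEWO labels for one segmented word, e.g. "上海市" tagged /ns -> LOCATION.
def bmewo_labels(word, entity):
    if len(word) == 1:
        return [(word, 'W')]                               # single-character entity
    labels = [(word[0], 'B_' + entity)]                    # first character
    labels += [(ch, 'M_' + entity) for ch in word[1:-1]]   # middle characters
    labels.append((word[-1], 'E_' + entity))               # last character
    return labels

print(bmewo_labels('上海市', 'LOCATION'))
# [('上', 'B_LOCATION'), ('海', 'M_LOCATION'), ('市', 'E_LOCATION')]
```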

  5. CRF++ training


./CRF++-0.58/crf_learn -f 2 -c 3.0 /usr/cellar/NER/CRF++-0.58/example/seg/template labeled_train_data.txt model -t

"./CRF++-0.58/crf_learn" is the path to the crf_learn binary
"-f 2 -c 3.0" are training parameters (minimum feature frequency and the C hyperparameter)
"/usr/cellar/NER/CRF++-0.58/example/seg/template" is the path to the feature template
"labeled_train_data.txt" is the training data
"-t" additionally writes the model in text format

Training finishes, producing model and model.txt.

  6. CRF++ testing
    Test command: …/crf_test -[options] -m model test.data
    For example: …/CRF++-0.58/crf_test -m model labeled_test_data.txt >> Fin_testdata.txt
    In Fin_testdata.txt the last column is the model's prediction, and it is what the named-entity retrieval below reads. The second column (the gold label) mainly serves model evaluation; if no evaluation is needed it could in principle hold any value.

  7. Named-entity retrieval

  8. Model evaluation
    Accuracy, Precision, Recall, and F1 are easily confused concepts in evaluating machine-learning models.
    TP: actual positive, predicted positive; FP: actual negative, predicted positive;
    FN: actual positive, predicted negative; TN: actual negative, predicted negative.
    Accuracy = ( TP + TN ) / ( TP + FP + FN + TN )
    Accuracy is the fraction of all samples classified correctly; it reflects overall predictive accuracy.
    Precision = TP / ( TP + FP )
    Precision is the fraction of predicted positives that are truly positive.
    Recall = TP / ( TP + FN )
    Recall is the fraction of actual positives that the model finds (recalls).
    F1 = 2 * Precision * Recall / ( Precision + Recall )
    F1 is the harmonic mean of Precision and Recall. The two usually trade off against each other (higher Recall tends to mean lower Precision), so F1 combines them into a single score; the higher the F1, the better the classifier.
    Accuracy vs. Precision:
    Both behave similarly (higher is better), but Accuracy assumes a balanced dataset; with severely imbalanced classes Accuracy becomes misleading and Precision (with Recall) should be used instead.
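The four formulas above can be captured in a few lines (a sketch using the same definitions; the project's actual evaluation script appears in full in the code section below):

```python
# Accuracy, Precision, Recall and F1 from confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(8, 2, 2, 88)
print(round(acc, 2), round(p, 2), round(r, 2), round(f1, 2))  # 0.96 0.8 0.8 0.8
```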


Full Code

Corpus cleaning

import copy
import re

# Read the January 1998 People's Daily corpus
def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        srcdata = fr.readlines()
    return srcdata


# Convert full-width (SBC) characters to half-width (DBC)
def sbc2dbc(data):
    temp_data = copy.deepcopy(data)
    for i, sentence in enumerate(data):
        temp_sentence = ''
        for word in sentence:
            word_code = ord(word)
            if word_code == 12288:
                word_code = 32
            elif (word_code >= 65281 and word_code <= 65374):
                word_code -= 65248
            temp_sentence += chr(word_code)
        temp_data[i] = temp_sentence
    return temp_data


# Triple spaces -> double spaces
def three_space_2_double_space(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('   ')
        data[i] = '  '.join(sentence)
    return data


# Single spaces -> double spaces
def single_space_2_double_space(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        len_sen = len(sentence)
        for j in range(len_sen):
            if ' ' in sentence[j]:
                sentence[j] = sentence[j].replace(' ', '  ')
        data[i] = '  '.join(sentence)
    return data

# Merge bracketed content, e.g.
# '学生/n  ,/w  奔波/v  于/p  两/m  个/q  课堂/n  。/w  [上海市/ns  房屋/n  土地/n  管理局/n]nt  为/p'
def merge_bracket_content(data):
    pattern = re.compile(r'([a-z]{1,5}\][a-z]{1,5})')
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        len_sen = len(sentence)
        flag = False
        tag1 = False
        tag2 = False
        for j in range(len_sen)[::-1]:
            if not flag:
                re_val = re.findall(pattern, sentence[j])
                tag1 = True if len(re_val) >= 1 else False
            if tag1 and j-1 >= 0:
                tag2 = True if sentence[j-1][0] != '[' else False
            else:
                continue
            if tag1 and tag2:
                pat = re.compile(r'/(.*])[a-z]{1,5}')
                val = re.findall(pat, sentence[j])
                if len(val)>=1:
                    string = val[0]
                    sentence[j] = sentence[j].replace(string, '')
                sentence[j-1] = sentence[j-1].split('/')[0] + sentence[j]
                flag = True  
                del sentence[j]
            elif tag1 and not tag2:
                pat = re.compile(r'/(.*])[a-z]{1,5}')
                val = re.findall(pat, sentence[j])
                if len(val)>=1:
                    string = val[0]
                    sentence[j] = sentence[j].replace(string, '')
                sentence[j-1] = sentence[j-1].strip('[').split('/')[0] + sentence[j]
                flag = False
                del sentence[j]
        data[i] = '  '.join(sentence)
    return data
                    
                      
# Merge person names, e.g. "金/nr  正日/nr" -> "金正日/nr"
def merge_name(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        len_sen = len(sentence)
        temp_j = len_sen - 1
        for j in range(len_sen)[::-1]:
            if j-1 >= 0:
                tag1 = sentence[j].split('/')[-1]
                tag2 = sentence[j-1].split('/')[-1]
                if tag1=='nr' and tag2 == 'nr':
                    if j == temp_j and j != len_sen-1:
                        continue
                    sentence[j-1] = sentence[j-1].rsplit('/', 1)[0]+sentence[j]  # rsplit, not strip('/nr'): strip removes any of '/', 'n', 'r' from both ends
                    temp_j = j-1
                    del sentence[j]
        data[i] = '  '.join(sentence)
    return data


# Merge times, e.g. "1月/t  26日/t" -> "1月26日/t"
def merge_time(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        len_sen = len(sentence)
        for j in range(len_sen)[::-1]:
            if j-1 >= 0:
                tag1 = sentence[j].split('/')[-1]
                tag2 = sentence[j-1].split('/')[-1]
                if tag1=='t' and tag2 == 't':
                    sentence[j-1] = sentence[j-1].rsplit('/', 1)[0]+sentence[j]  # rsplit, not strip('/t'): strip removes any of '/', 't' from both ends
                    del sentence[j]
        data[i] = '  '.join(sentence)
    return data  
            

def main():
    # Clean the data in the following order
    data0 = read_srcdata('199801.txt')  # path to the raw corpus
    data1 = sbc2dbc(data0)
    data2 = three_space_2_double_space(data1)
    data3 = single_space_2_double_space(data2)
    data4 = merge_bracket_content(data3)
    data5 = merge_name(data4)
    data6 = merge_time(data5)
    with open('cleaned_data.txt','w', encoding='utf-8') as fw:
        for i, meta in enumerate(data6):
            if meta == '\n':
                continue
            fw.write(meta)

if __name__ == '__main__':
    main()



Corpus splitting

# -*- coding: utf-8 -*-
"""
Split the corpus 8:2 into training and test sets
"""

import random

def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        srcdata = fr.readlines()
    return srcdata


def train_test_segment(datapath, train_txt, test_txt, ratio):
    srcdata = read_srcdata(datapath)
    del srcdata[-1]     # drop the trailing '\n' line
    src_len = len(srcdata)
    max_len = int(src_len * ratio)
    index_set = list(range(src_len))
    random.shuffle(index_set)
    train_fw = open(train_txt, 'w', encoding='utf-8')
    test_fw = open(test_txt, 'w', encoding='utf-8')
    for i in range(len(index_set)):
        if i <= max_len:
            train_fw.write(srcdata[index_set[i]])
        else:
            test_fw.write(srcdata[index_set[i]])
    train_fw.close()
    test_fw.close() 
    return True
    


if __name__ == '__main__':
    datapath = 'cleaned_data.txt'
    trainpath = 'train_data.txt'
    testpath = 'test_data.txt'
    ratio = 0.8     # 0.8 of the data goes to training
    result = train_test_segment(datapath, trainpath, testpath, ratio)
    print(result)



Building the training and test data

# -*- coding: utf-8 -*-

# Generate training data with the "BMEWO" tagging scheme
"""
'B': Begin
'M': Middle
'E': End
'W': a single-character entity
'O': Other
Time: TIME, /t
Person: PERSON, /nr
Location: LOCATION, /ns
Organization: ORGANIZATION, /nt
"""

import codecs 
# Read the cleaned January 1998 People's Daily corpus
def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        srcdata = fr.readlines()
    return srcdata

def generate_train_data(input_path, output_path):
    data0 = read_srcdata(input_path)
    delimiter = '\t'
    with codecs.open(output_path,'w', encoding='utf-8') as fw:
        for i, sentence in enumerate(data0):
            words = sentence.split('  ')
            for j,word in enumerate(words):
                if j==0 or word == '\n':
                    continue
                split_word = word.split('/')
                tag = split_word[-1]
                word_meta = split_word[0]
                meta_len = len(word_meta)                    
                tag2entity = {'t': 'TIME', 'nr': 'PERSON', 'ns': 'LOCATION', 'nt': 'ORGANIZATION'}
                if tag in tag2entity:
                    entity = tag2entity[tag]
                    if meta_len == 1:
                        fw.write(word_meta+delimiter+'W'+'\n')
                        continue
                    for k, char in enumerate(word_meta):
                        if k == 0:
                            fw.write(char+delimiter+'B_'+entity+'\n')
                        elif k == meta_len - 1:
                            fw.write(char+delimiter+'E_'+entity+'\n')
                        else:
                            fw.write(char+delimiter+'M_'+entity+'\n')
                else:
                else:
                    for k, char in enumerate(word_meta):
                        fw.write(char+delimiter+'O'+'\n')
            fw.write('\n')
    return True
    
if __name__ == '__main__':
    input_file = 'test_data.txt'
    output_file = 'labeled_test_data.txt'
    result = generate_train_data(input_file, output_file)
    print(result)
    


Named-entity retrieval

"""

人名识别

"""
def readtxt(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        content = fr.readlines()                # readlines (plural) reads all lines, unlike readline
        return content


def findit(content):
    name = []
    c1 = c2 = c3 = ''
    for line in content:
        if line == '\n': continue
        line = line.strip('\n').split('\t')
        predicted_value = line[-1]
        if predicted_value == 'O': continue

        if predicted_value == 'B_PERSON':
            c1, c2, c3 = line[0], '', ''   # reset so a previous name's characters cannot leak in
        if predicted_value == 'M_PERSON':
            c2 += line[0]                  # names may have more than one middle character
        if predicted_value == 'E_PERSON':
            c3 = line[0]
            C = c1 + c2 + c3
            print(C)
            name.append(C)
            c1 = c2 = c3 = ''

    return name




if __name__ == '__main__':
    raw = 'Fin_testdata.txt'
    content = readtxt(raw)
    name = findit(content)
    with open('FindName.txt', 'w', encoding='utf-8') as fw:
        fw.write(str(name))
        print()
        print("---done---")



Model evaluation

# -*- coding: utf-8 -*-
# Compute Accuracy, Precision, Recall and F1 for the B_LOCATION tag
# Count TP, FN, FP and TN separately
    
# Count TP
def cal_tp(data, tag):
    TP = 0
    for line in data:
        if line == '\n': continue
        line = line.strip('\n').split('\t')
        actual_value = line[-2]
        predicted_value = line[-1]
        if actual_value == tag and actual_value == predicted_value:
            TP += 1
    print('TP  ->  ', TP)
    return TP

# Count FN
def cal_fn(data, tag):
    FN = 0
    for line in data:
        if line == '\n': continue
        line = line.strip('\n').split('\t')
        actual_value = line[-2]
        predicted_value = line[-1]
        if actual_value == tag and actual_value != predicted_value:
            FN += 1
    print('FN  ->  ', FN)
    return FN

# Count FP
def cal_fp(data, tag):
    FP = 0
    for line in data:
        if line == '\n': continue
        line = line.strip('\n').split('\t')
        actual_value = line[-2]
        predicted_value = line[-1]
        if predicted_value == tag and actual_value != tag:
            FP += 1
    print('FP  ->  ', FP)
    return FP

# Count TN
def cal_tn(data, tag):
    TN = 0
    for line in data:
        if line == '\n': continue
        line = line.strip('\n').split('\t')
        actual_value = line[-2]
        predicted_value = line[-1]
        if predicted_value != tag and actual_value != tag:
            TN += 1
    print('TN  ->  ', TN)
    return TN

def cal_main():
    with open('Fin_testdata.txt','r', encoding='UTF-8') as fw:
        data = fw.readlines()
    tag = 'B_LOCATION'
    TP = cal_tp(data, tag)
    FN = cal_fn(data, tag)
    FP = cal_fp(data, tag)
    TN = cal_tn(data, tag)
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    accuracy = (TP+TN)/(TP+TN+FN+FP)
    f1 = 2*precision*recall/(precision+recall)
    return round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3)      # round to 3 decimal places
    
if __name__ == '__main__':
    accuracy, precision, recall, f1 = cal_main()
    print(accuracy, precision, recall, f1)
    
    
    

Special Thanks

📍 The Linux commands ./configure, make, make install explained
https://www.100txy.com/article/207.html
📍 How to install CRF++
https://www.jianshu.com/p/9a98701799af
📍 How to use CRF++
https://www.jianshu.com/p/1fdece7f7c41
📍 How to evaluate CRF++ results
https://www.jianshu.com/p/13f8792bfbe4
