CRF for Named Entity Recognition (A Quick-Start Implementation)

早睡身体好_

Article link: https://blog.csdn.net/Q_M_X_D_D_/article/details/110500176

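This article builds a simple CRF model on a named-entity-annotated dataset of traditional Chinese medicine (TCM) classics. It covers data preprocessing (reading the corpus, processing the sequences, extracting features), then model construction, training, and prediction with the sklearn_crfsuite library, evaluates the model with the metrics module, and closes with an analysis of the results and the full program source code.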

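Preface

Recently I have been looking at models for named entity recognition, and our lab happens to have a named-entity-annotated dataset of traditional Chinese medicine (TCM) classics, so I used it to practise building a simple CRF model and wrote the process down. The code can serve as a reference: if you have an annotated dataset of your own, you can use it to train your own CRF model. This experiment uses the sklearn_crfsuite library, a lightweight CRF library that provides evaluation methods in addition to training and prediction. The dataset format is roughly as follows:

Each line contains one character and its tag, and a blank line separates one sentence from the next. Four symbols are used (B, I, O, S): the first character of an entity, the remaining characters of an entity, a non-entity character, and a single-character entity. The corpus also contains a fifth symbol, E, which marks the last character of an entity. It did not seem to add much, so to keep things simple every E is converted to I.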
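
To make the format concrete, here is a minimal sketch of reading such a corpus and folding E into I. The sample text and the tag names (such as `B-SYM`) are made up for illustration, and `parse_corpus` is a hypothetical helper, not part of the code in this post; it also assumes the tags look like `PREFIX-TYPE`.

```python
# Hypothetical sample in the described format: one "character tag" pair per line,
# a blank line between sentences. The tag names are invented for this example.
SAMPLE = """胸 B-SYM
腹 I-SYM
满 I-SYM
痛 E-SYM
， O

汗 S-SYM
出 O
"""


def parse_corpus(text):
    """Split corpus text into sentences, each a list of (character, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
            current = []
            continue
        char, tag = line.split(" ")
        # Fold the end-of-entity prefix E into I; the type suffix is left alone
        if tag.startswith("E"):
            tag = "I" + tag[1:]
        current.append((char, tag))
    # Keep the last sentence when the text does not end with a blank line
    if current:
        sentences.append(current)
    return sentences


sentences = parse_corpus(SAMPLE)
print(sentences[0])
```

Only the leading E is rewritten here, so an entity-type name that happens to contain the letter E would not be affected.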

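Data Preprocessing

This part prepares the data for the model, namely the features and the tag sequences. The following functions are needed:

(1) Read the whole corpus and return it

    def __init__(self):
        # Path to the annotated corpus (one "character tag" pair per line)
        self.file_path = "中医语料.txt"

    def read_file(self):
        # The with-block closes the file even if reading fails
        with open(self.file_path, encoding="utf-8") as f:
            return f.readlines()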

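(2) Put each sentence into its own list, giving a two-dimensional list.

    def pre_process(self):
        lines = self.read_file()
        sentences = []
        sentence = []
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                # A blank line ends the current sentence
                if sentence:
                    sentences.append(sentence)
                sentence = []
            else:
                sentence.append(line)
        # Keep the last sentence when the file does not end with a blank line
        if sentence:
            sentences.append(sentence)
        return sentences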

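(3) Store the character sequence and the tag sequence of every sentence in two separate two-dimensional lists. The word_list argument is the two-dimensional list returned by pre_process above. A "<BOS>" marker is added before each sentence and an "<EOS>" marker after it.

    def init(self, word_list):
        # Character sequences, padded so every character has a left and a right neighbour
        self.word_seq = [[u'<BOS>'] + [word.split(" ")[0] for word in words] + [u'<EOS>'] for words in word_list]
        # Tag sequences; only a leading "E" becomes "I", so an "E" inside a type name is untouched
        tags = [[word.split(" ")[1] for word in words] for words in word_list]
        self.tag_seq = [["I" + tag[1:] if tag.startswith("E") else tag for tag in seq] for seq in tags]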

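In this experiment the features are simple N-gram features, so a sliding window function and a feature extraction function are needed.

(4) Implement the sliding window function: every three consecutive characters form one gram. The word_list argument here is the character sequence of one sentence, as produced by init above.

    def segment_by_window(self, word_list):
        # Slide a window of width 3 over the padded sequence, one position at a time;
        # a sentence of n characters (n + 2 with padding) yields exactly n windows
        return [word_list[begin:begin + 3] for begin in range(len(word_list) - 2)]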
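
As a quick illustration of what the window produces, here is a standalone sketch. It is written as a plain function rather than a method, and the `size` parameter is my own addition for the example.

```python
def segment_by_window(word_list, size=3):
    """Return every run of `size` consecutive items, moving one step at a time."""
    return [word_list[i:i + size] for i in range(len(word_list) - size + 1)]


# Pad the sentence so the first and last characters also sit in the middle of a window
padded = ["<BOS>"] + list("神昏谵语") + ["<EOS>"]
windows = segment_by_window(padded)
for window in windows:
    print(window)
```

Because of the padding, the number of windows equals the number of characters in the sentence, so each window lines up with exactly one tag.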

(5) The feature extraction function. It takes the grams generated from each sentence's character sequence and builds tri-gram features from them.

    def extract_features(self, word_grams):
        features = []
        for sentence_grams in word_grams:
            features_list = []
            for word_gram in sentence_grams:
                # word_gram = [previous character, current character, next character]
                feature = {u'w-1': word_gram[0], u'w': word_gram[1], u'w+1': word_gram[2],
                           u'w-1:w': word_gram[0] + word_gram[1], u'w:w+1': word_gram[1] + word_gram[2],
                           u'bias': 1.0}
                features_list.append(feature)
            features.append(features_list)
        return features
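
For reference, this is roughly what the feature dictionaries look like for a two-character sentence. The sketch is standalone, and `window_to_features` is a hypothetical helper name that mirrors the dictionary built above.

```python
def window_to_features(gram):
    """Turn one [previous, current, next] window into a feature dictionary."""
    prev_char, char, next_char = gram
    return {
        "w-1": prev_char,              # left neighbour
        "w": char,                     # current character
        "w+1": next_char,              # right neighbour
        "w-1:w": prev_char + char,     # left bigram
        "w:w+1": char + next_char,     # right bigram
        "bias": 1.0,                   # constant feature shared by every position
    }


padded = ["<BOS>", "神", "昏", "<EOS>"]
grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
features = [window_to_features(gram) for gram in grams]
print(features[0])
```

Each character ends up with one dictionary, which is the per-token input format that sklearn_crfsuite accepts.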

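(6) Assemble the input data for the CRF model. This function combines the sliding window function with the feature extraction function and produces the data that is finally fed to the CRF model.

    def generator(self):
        # word_grams is three-dimensional here: sentence -> gram -> character
        word_grams = [self.segment_by_window(word_list) for word_list in self.word_seq]
        features = self.extract_features(word_grams)
        return features, self.tag_seq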

Model Construction

Set the initialization parameters of the CRF model: algorithm is the optimization algorithm; c1 is the L1 regularization coefficient; c2 is the L2 regularization coefficient; max_iterations is the maximum number of iterations; model_path is the path where the model is saved. The corpus is then initialized.

    def __init__(self):
        self.algorithm = "lbfgs"
        self.c1 = 0.1
        self.c2 = 0.2
        self.max_iterations = 100
        self.model_path = "TCM_model.pkl"
        # Load the corpus and build the character and tag sequences
        self.corpus = init_process()
        self.corpus_text = self.corpus.pre_process()
        self.corpus.init(self.corpus_text)
        self.model = None

    def init_model(self):
        self.model = sklearn_crfsuite.CRF(algorithm=self.algorithm,
                                          c1=float(self.c1),
                                          c2=float(self.c2),
                                          max_iterations=self.max_iterations,
                                          all_possible_transitions=True)

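Model Training

Once the model and the corpus are initialized, split the data into a training set and a test set (the first 1000 sentences are held out for testing), generate the input data, train the model, and use the metrics module to evaluate it.

    def train(self):
        self.init_model()
        x, y = self.corpus.generator()
        # The first 1000 sentences form the test set, the rest the training set
        x_train, y_train = x[1000:], y[1000:]
        x_test, y_test = x[:1000], y[:1000]
        self.model.fit(x_train, y_train)
        # Evaluate on entity labels only; "O" would dominate the scores
        labels = list(self.model.classes_)
        labels.remove("O")
        y_predict = self.model.predict(x_test)
        print(metrics.flat_f1_score(y_test, y_predict, average="weighted", labels=labels))
        # Group the B and I tags of the same entity type together in the report
        sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
        print(metrics.flat_classification_report(y_test, y_predict, labels=sorted_labels, digits=3))
        self.save_model()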
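
To give a feel for what the "flat" metrics measure, here is a rough pure-Python sketch: all sentences are flattened into one long tag list, and precision, recall, and F1 are computed per label. This is only an illustration, not the library's implementation; `flat_prf` and the toy tags are hypothetical.

```python
from collections import Counter


def flat_prf(y_true, y_pred, labels):
    """Per-label (precision, recall, f1, support) over flattened tag sequences."""
    true_flat = [tag for seq in y_true for tag in seq]
    pred_flat = [tag for seq in y_pred for tag in seq]
    hits, predicted, actual = Counter(), Counter(), Counter()
    for true_tag, pred_tag in zip(true_flat, pred_flat):
        actual[true_tag] += 1
        predicted[pred_tag] += 1
        if true_tag == pred_tag:
            hits[true_tag] += 1
    report = {}
    for label in labels:
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / actual[label] if actual[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1, actual[label])
    return report


# Toy data with invented tags: the second sentence misses one B-SYM
y_true = [["B-SYM", "I-SYM", "O"], ["O", "B-SYM", "I-SYM"]]
y_pred = [["B-SYM", "I-SYM", "O"], ["O", "O", "I-SYM"]]
report = flat_prf(y_true, y_pred, ["B-SYM", "I-SYM"])
print(report)
```

Note that these are per-character scores, so a partially recognized entity still earns credit for the characters that were tagged correctly.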

Model Prediction

Once the model is trained it can be used for prediction, but the raw output of the prediction function is not very readable, so some post-processing is needed.

    def predict(self, sentence):
        self.load_model()
        # Use the same <BOS>/<EOS> padding as in training so the features match
        word_lists = [[u'<BOS>'] + [word for word in sentence] + [u'<EOS>']]
        word_gram = [self.corpus.segment_by_window(word_list) for word_list in word_lists]
        print(word_lists)
        features = self.corpus.extract_features(word_gram)
        y_predict = self.model.predict(features)
        print(y_predict)
        tags = y_predict[0]
        entity = ""
        for index in range(len(tags)):
            if tags[index] != u'O':
                entity += sentence[index]
                # Add a separator when the next tag ends differently (another type, or "O")
                if index < len(tags) - 1 and tags[index][-2:] != tags[index + 1][-2:]:
                    entity += " "
        return entity
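
If you want structured output rather than a space-separated string, the tags can also be grouped into entities explicitly. The following is a minimal sketch under the assumption that the tags look like `PREFIX-TYPE` (for example `B-SYM`); `extract_entities` and the example tags are hypothetical and not part of the code above.

```python
def extract_entities(sentence, tags):
    """Group per-character tags into (entity_text, entity_type) pairs."""
    entities = []
    chars, current_type = [], None
    for char, tag in zip(sentence, tags):
        prefix, _, tag_type = tag.partition("-")
        # B, S, and O always close the open entity; so does a change of type
        starts_new = prefix in ("B", "S", "O") or tag_type != current_type
        if starts_new and chars:
            entities.append(("".join(chars), current_type))
            chars, current_type = [], None
        if prefix != "O":
            chars.append(char)
            current_type = tag_type
            if prefix == "S":
                # A single-character entity is complete immediately
                entities.append((char, tag_type))
                chars, current_type = [], None
    # Flush an entity that runs to the end of the sentence
    if chars:
        entities.append(("".join(chars), current_type))
    return entities


sentence = "胸腹满痛，舌干"
tags = ["B-SYM", "I-SYM", "I-SYM", "I-SYM", "O", "B-SYM", "I-SYM"]
entities = extract_entities(sentence, tags)
print(entities)
```

Unlike a comparison of tag suffixes, this also separates two adjacent entities of the same type, because a B tag always starts a new one.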

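Result Analysis

The classification report printed after training shows that precision and recall are reasonably good for most labels; support is the number of times each label occurs in the test set.

The trained model is then applied to a new sentence, "服药五日，渐变神昏谵语，胸腹满痛，舌干不饮水，小便清长。"; predict returns the recognized entities as a space-separated string.

Program Source Code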

import sklearn_crfsuite
from sklearn_crfsuite import metrics
import joblib

class init_process(object):

    def __init__(self):
        # Path to the annotated corpus (one "character tag" pair per line)
        self.file_path = "中医语料.txt"

    def read_file(self):
        # The with-block closes the file even if reading fails
        with open(self.file_path, encoding="utf-8") as f:
            return f.readlines()

    def pre_process(self):
        lines = self.read_file()
        sentences = []
        sentence = []
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                # A blank line ends the current sentence
                if sentence:
                    sentences.append(sentence)
                sentence = []
            else:
                sentence.append(line)
        # Keep the last sentence when the file does not end with a blank line
        if sentence:
            sentences.append(sentence)
        return sentences

    def extract_features(self, word_grams):
        features = []
        for sentence_grams in word_grams:
            features_list = []
            for word_gram in sentence_grams:
                # word_gram = [previous character, current character, next character]
                feature = {u'w-1': word_gram[0], u'w': word_gram[1], u'w+1': word_gram[2],
                           u'w-1:w': word_gram[0] + word_gram[1], u'w:w+1': word_gram[1] + word_gram[2],
                           u'bias': 1.0}
                features_list.append(feature)
            features.append(features_list)
        return features

    def segment_by_window(self, word_list):
        # Slide a window of width 3 over the padded sequence, one position at a time
        return [word_list[begin:begin + 3] for begin in range(len(word_list) - 2)]

    def init(self, word_list):
        # Character sequences, padded so every character has a left and a right neighbour
        self.word_seq = [[u'<BOS>'] + [word.split(" ")[0] for word in words] + [u'<EOS>'] for words in word_list]
        # Tag sequences; only a leading "E" becomes "I", so an "E" inside a type name is untouched
        tags = [[word.split(" ")[1] for word in words] for words in word_list]
        self.tag_seq = [["I" + tag[1:] if tag.startswith("E") else tag for tag in seq] for seq in tags]

    def generator(self):
        # word_grams is three-dimensional: sentence -> gram -> character
        word_grams = [self.segment_by_window(word) for word in self.word_seq]
        features = self.extract_features(word_grams)
        return features, self.tag_seq

class ner(object):

    def __init__(self):
        self.algorithm = "lbfgs"
        self.c1 = 0.1
        self.c2 = 0.2
        self.max_iterations = 100
        self.model_path = "TCM_model.pkl"
        # Load the corpus and build the character and tag sequences
        self.corpus = init_process()
        self.corpus_text = self.corpus.pre_process()
        self.corpus.init(self.corpus_text)
        self.model = None

    def init_model(self):
        self.model = sklearn_crfsuite.CRF(algorithm=self.algorithm,
                                          c1=float(self.c1),
                                          c2=float(self.c2),
                                          max_iterations=self.max_iterations,
                                          all_possible_transitions=True)

    def train(self):
        self.init_model()
        x, y = self.corpus.generator()
        # The first 1000 sentences form the test set, the rest the training set
        x_train, y_train = x[1000:], y[1000:]
        x_test, y_test = x[:1000], y[:1000]
        self.model.fit(x_train, y_train)
        # Evaluate on entity labels only; "O" would dominate the scores
        labels = list(self.model.classes_)
        labels.remove("O")
        y_predict = self.model.predict(x_test)
        print(metrics.flat_f1_score(y_test, y_predict, average="weighted", labels=labels))
        # Group the B and I tags of the same entity type together in the report
        sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
        print(metrics.flat_classification_report(y_test, y_predict, labels=sorted_labels, digits=3))
        self.save_model()

    def predict(self, sentence):
        self.load_model()
        # Use the same <BOS>/<EOS> padding as in training so the features match
        word_lists = [[u'<BOS>'] + [word for word in sentence] + [u'<EOS>']]
        word_gram = [self.corpus.segment_by_window(word_list) for word_list in word_lists]
        print(word_lists)
        features = self.corpus.extract_features(word_gram)
        y_predict = self.model.predict(features)
        print(y_predict)
        tags = y_predict[0]
        entity = ""
        for index in range(len(tags)):
            if tags[index] != u'O':
                entity += sentence[index]
                # Add a separator when the next tag ends differently (another type, or "O")
                if index < len(tags) - 1 and tags[index][-2:] != tags[index + 1][-2:]:
                    entity += " "
        return entity

    def save_model(self):
        joblib.dump(self.model, self.model_path)

    def load_model(self):
        self.model = joblib.load(self.model_path)

NER=ner()
NER.train()
print(NER.predict("服药五日，渐变神昏谵语，胸腹满痛，舌干不饮水，小便清长。"))