Training a Custom Chinese POS Tagging Model with spaCy

TANK CHENG

Article link: https://blog.csdn.net/tankcheng/article/details/115743805

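This article describes how to train a custom Chinese part-of-speech (POS) tagging model with spaCy. The data is first segmented into words, a tag mapping is defined, and the training data is then annotated and used to train the model. Because of how the Chinese pipeline behaves, proper nouns need special handling during segmentation. Finally, the trained model is applied to the test data and saved to disk. The goal of the experiment is automatic POS tagging for Chinese.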

2021/4/14

First load the required packages and read in the data files:

# Imports
import random
import time  # used for the short pauses in main()
from pathlib import Path

import plac
import spacy
from spacy.training import Example

# Read the training and test data,
# stripping newlines and sentence-final punctuation
def clean_str(text):
    result = text
    for ch in ['\n', '；', '。']:
        result = result.replace(ch, '')
    return result

with open('test.txt', 'r', encoding='utf-8') as f:
    test = [clean_str(line) for line in f.readlines()]
with open('train.txt', 'r', encoding='utf-8') as f:
    train = [clean_str(line) for line in f.readlines()]

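Word segmentation is done with the segmenter of spaCy's Chinese pipeline. Proper nouns such as 苹果公司 (Apple Inc.) have to be registered separately so that they are kept as single tokens.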

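# Load the pretrained Chinese pipeline; it is used only for word segmentation
nlp1 = spacy.load('zh_core_web_sm')
# Register proper nouns so the segmenter keeps them as single tokens
proper_nouns = ['苹果公司']
nlp1.tokenizer.pkuseg_update_user_dict(proper_nouns)

def word_split(string):
    # Segment a sentence and join its tokens with '/'
    doc1 = nlp1(string)
    return '/'.join([t.text for t in doc1])

train_word_split = [word_split(t) for t in train]
test_word_split = [word_split(t) for t in test]
print('Training set segmentation:', train_word_split, '\n')
print('Test set segmentation:', test_word_split)
print('------------------------------------------------------------------------------------')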

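First decide on the label set. The main categories considered here for Chinese are subject, predicate, object, adjective, numeral, adverb, particle and preposition. Subject, predicate and object are sentence roles rather than parts of speech, so they are mapped to the noun, pronoun and verb tags.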

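These correspond to the following tags:

Name	Corresponding tag
Subject	NOUN / PRON
Predicate	VERB
Object	NOUN / PRON
Pronoun	PRON
Adjective	ADJ
Numeral	NUM
Adverb	ADV
Particle / preposition	PART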

Build a dictionary that maps each one-letter tag to its part of speech:

TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
    'P': {'pos': 'PRON'},
    'M': {'pos': 'NUM'},
    'A': {'pos': 'ADV'},
    'R': {'pos': 'PART'}
}
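
To show how the one-letter codes are read, the sketch below expands a tag string such as 'PVMN' into POS labels using the same mapping. The helper name expand_tags is hypothetical and is not part of the original script; the mapping is repeated only so that the snippet runs on its own.

```python
# Same mapping as TAG_MAP above, repeated so the snippet is self-contained.
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
    'P': {'pos': 'PRON'},
    'M': {'pos': 'NUM'},
    'A': {'pos': 'ADV'},
    'R': {'pos': 'PART'},
}

def expand_tags(tag_string):
    """Turn a string of one-letter tags into a list of POS labels."""
    return [TAG_MAP[tag]['pos'] for tag in tag_string]

# One label per segmented word of '我/有/一个/梦想'
print(expand_tags('PVMN'))  # ['PRON', 'VERB', 'NUM', 'NOUN']
```

An unknown letter raises a KeyError, which can serve as a cheap check for typos in the hand-written tag strings.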

Annotate the sentences according to the segmentation results:

Training set segmentation: ['我/有/一个/梦想', '我/有/一个/苹果', '很多/人/都/喜欢/吃/苹果', '苹果公司/的/市值/最/高/的', '苹果/的/营养/是/最好/的', '库克/是/苹果公司/的/CEO', '陕西/是/苹果/的/原产地/之一', '今天/天气/真/好', '不/想/工作'] 

Test set segmentation: ['我/喜欢/吃/苹果', '市值/最/高/的/公司/是/苹果', '库克/是/一个/很/好的/CEO', '我/因为/今天/的/天气/不/想/工作']
------------------------------------------------------------------------------------

Training set:

PVMN，PVMN，JNAAVN，NRNAVR，NRNVJR，NVNRN，NVNRNR，JNAJ，AVN

Test set:

PAVN，NAJRNVN，NVMAJN，PAJRNAVN

Convert the annotations above into tag lists:

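# POS annotations, one letter per segmented word
train_bz = 'PVMN，PVMN，JNAAVN，NRNAVR，NRNVJR，NVNRN，NVNRNR，JNAJ，AVN'
test_bz = 'PAVN，NAJRNVN，NVMAJN，PAJRNAVN'
train_bz_list = train_bz.split('，')
test_bz_list = test_bz.split('，')

# Pair each segmented sentence with its tags. The words are joined with
# spaces so that a whitespace-based tokenizer yields one token per tag.
TRAIN_DATA = []
for words, tags in zip(train_word_split, train_bz_list):
    TRAIN_DATA.append((words.replace('/', ' '), {"tags": list(tags)}))
print(TRAIN_DATA)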

Define the command-line annotations and the training routine:

# Command-line options for the script, declared with plac
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='en', output_dir='mymodel', n_iter=25):
    # Create a blank pipeline. The default is English ('en') because its
    # tokenizer splits on whitespace, which preserves pre-segmented words.
    nlp = spacy.blank(lang)
    tagger = nlp.add_pipe('tagger')

    # Register the tag labels with the tagger
    for tag, values in TAG_MAP.items():
        print("tag:", tag)
        print("values:", values)
        tagger.add_label(tag)
    print("tagger:", tagger)

    # Train the model
    optimizer = nlp.initialize()  # initialize the model weights
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)  # shuffle the training examples
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)

    # Test the trained model on ad-hoc input
    print('Testing on ad-hoc input --------------------------------------------------------------------')
    time.sleep(2)
    string1 = "库克经理是苹果公司的老板拥有百万财产"
    string2 = "学校的一点点奶茶店快开张了"
    # Segment first, then separate the words with spaces for the tokenizer
    test_text1 = word_split(string1).replace('/', ' ')
    test_text2 = word_split(string2).replace('/', ' ')
    doc1 = nlp(test_text1)
    doc2 = nlp(test_text2)
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc1])
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc2])

    # Save the model to the output directory
    print('Testing on the test set --------------------------------------------------------------------')
    time.sleep(2)
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Model saved to:", output_dir)

        # Reload the saved model and run it on the test data
        print("Loading model from:", output_dir)
        nlp2 = spacy.load(output_dir)
        for text in test_word_split:
            doc = nlp2(text.replace('/', ' '))
            print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

if __name__ == '__main__':
    plac.call(main)
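
The script prints the predicted tags for the test set but never compares them with the gold annotations in test_bz. A minimal sketch of a token-level accuracy check follows. The function name tag_accuracy is hypothetical, and the predictions in the example are dummy values for illustration, not output of the trained model.

```python
def tag_accuracy(predicted, gold):
    """Token-level accuracy over sentences whose tag lists have equal length."""
    correct = 0
    total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        if len(pred_tags) != len(gold_tags):
            raise ValueError('prediction and gold tags differ in length')
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0

# Gold tags for the first two test sentences, taken from test_bz.
gold = [list('PAVN'), list('NAJRNVN')]
# Dummy predictions for illustration only (one tag differs from the gold).
predicted = [list('PAVN'), list('NNJRNVN')]
print(tag_accuracy(predicted, gold))
```

In the script itself, the predictions could be collected inside the test loop as `[t.tag_ for t in doc]` for each sentence.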

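A blank Chinese pipeline does not perform word segmentation (by default it splits the text into single characters), so training it the way one would train a blank English pipeline produces a length mismatch.

The following error is raised: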

ValueError: [E971] Found incompatible lengths in `Doc.from_array`: 4 for the array and 6 for the Doc itself.

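It means that the number of tokens in the text differs from the number of tags supplied. The two numbers change from run to run because the training data is shuffled.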
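
A mismatch of this kind can be caught before training by comparing word counts with tag counts. The sketch below is one way to do that in plain Python; the helper name find_mismatches is hypothetical, and the '/' separator matches the output of word_split.

```python
def find_mismatches(segmented, tag_strings, sep='/'):
    """Return (index, n_words, n_tags) for each sentence whose counts differ."""
    mismatches = []
    for i, (sentence, tags) in enumerate(zip(segmented, tag_strings)):
        n_words = len(sentence.split(sep))
        if n_words != len(tags):
            mismatches.append((i, n_words, len(tags)))
    return mismatches

# Word-level segmentation lines up with the tags.
segmented = ['我/有/一个/梦想', '今天/天气/真/好', '不/想/工作']
tags = ['PVMN', 'JNAJ', 'AVN']
print(find_mismatches(segmented, tags))  # []

# Character-level splitting gives 6 units for a sentence with only 4 tags,
# which is the situation reported by the E971 error.
chars = ['/'.join('我有一个梦想')]
print(find_mismatches(chars, ['PVMN']))  # [(0, 6, 4)]
```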

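Since the main purpose of this experiment is to train the tagging step, word segmentation is still handled by the pretrained Chinese pipeline (named here as zh_core_web_md-3.0.0-py3-none-any; the code above loads zh_core_web_sm).

A blank English pipeline tokenizes on whitespace, so the training and test sets can first be segmented with the pretrained segmenter, and the POS tagger is then trained on the segmented text.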

Two sentences were used as input:

    string1 = "库克经理是苹果公司的老板拥有百万财产"
    string2 = "学校的一点点奶茶店快开张了"

The tagging results are as follows:

Tags [('库克', 'N', ''), ('经理', 'N', ''), ('是', 'V', ''), ('苹果公司', 'N', ''), ('的', 'R', ''), ('老板', 'N', ''), ('拥有', 'N', ''), ('百万', 'N', ''), ('财产', 'N', '')]

Tags [('学校', 'N', ''), ('的', 'R', ''), ('一点点', 'N', ''), ('奶茶', 'N', ''), ('店', 'V', ''), ('快', 'J', ''), ('开张', 'N', ''), ('了', 'N', '')]
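
The third field of each tuple is t.pos_, and it is empty because in spaCy v3 the tagger only sets t.tag_; the 'pos' values in TAG_MAP are never applied by the training code. One possible way to fill t.pos_ is spaCy's attribute_ruler component, sketched below. The TAG_TO_POS dictionary and the hand-built Doc are assumptions for illustration: the Doc stands in for the tagger's output so that the snippet runs without a trained model.

```python
import spacy
from spacy.tokens import Doc

# Assumed mapping from the article's one-letter tags to universal POS labels.
TAG_TO_POS = {'N': 'NOUN', 'V': 'VERB', 'J': 'ADJ', 'P': 'PRON',
              'M': 'NUM', 'A': 'ADV', 'R': 'PART'}

nlp = spacy.blank('en')
ruler = nlp.add_pipe('attribute_ruler')
for tag, pos in TAG_TO_POS.items():
    # Every token whose fine-grained tag matches receives the coarse POS.
    ruler.add(patterns=[[{'TAG': tag}]], attrs={'POS': pos})

# A hand-built Doc with tags already set stands in for the tagger's output.
doc = Doc(nlp.vocab, words=['苹果公司', '的', '老板'], tags=['N', 'R', 'N'])
doc = ruler(doc)
print([(t.text, t.tag_, t.pos_) for t in doc])
```

In the training script, the same rules could be added to the pipeline after the tagger and before the model is saved, so that the mapping travels with the model.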

Because an output path is set, the trained model is saved to disk.