Part-of-Speech Tagging Tutorial
Basic Concepts
In linguistics, part of speech (Part-Of-Speech, POS) refers to the grammatical category of a word, also called its word class. Words in the same class share similar grammatical properties, and the set of all POS tags is called the POS tagset. Different corpora adopt different tagsets, but most include common categories such as adjectives, verbs, and nouns.
Sequence labeling is the problem of, given a sequence x = x1, x2, ..., xn, predicting the label yi of every element xi. The set of all possible values of y is called the tagset. For example, given a sequence of natural numbers as input, output the parity (odd or even) of each number.
A model that solves a sequence labeling problem is generally called a sequence labeler (tagger); it usually learns from an annotated dataset and then makes predictions. In NLP, x is typically a character or a word, while y is the label to be predicted, such as a word-formation role or a part of speech. Chinese word segmentation, POS tagging, and named entity recognition can all be cast as sequence labeling problems.
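As a toy illustration of the parity example just mentioned (this sketch is not part of the original tutorial; the function name tag_parity is made up), a sequence labeler over the tagset {odd, even} can be written directly:

# A minimal sequence-labeling toy: each observation is a natural number,
# and its label is the parity. The tagset is {"odd", "even"}.
def tag_parity(sequence):
    return ["even" if x % 2 == 0 else "odd" for x in sequence]

print(tag_parity([3, 1, 4, 1, 5, 9, 2, 6]))
# ['odd', 'odd', 'even', 'odd', 'odd', 'odd', 'even', 'even']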
ICTCLAS Chinese POS Tagset
Tag | Name | Notes |
---|---|---|
Ag | adjectival morpheme (形语素) | Adjectival morpheme. The adjective code is a; A is placed before the morpheme code g. |
a | adjective (形容词) | The first letter of the English word "adjective". |
ad | adverbial adjective (副形词) | An adjective used directly as an adverbial; adjective code a combined with adverb code d. |
an | nominal adjective (名形词) | An adjective with nominal function; adjective code a combined with noun code n. |
b | distinguishing word (区别词) | The initial of the Chinese character 别 (bié). |
c | conjunction (连词) | The first letter of "conjunction". |
Dg | adverbial morpheme (副语素) | Adverbial morpheme. The adverb code is d; D is placed before the morpheme code g. |
d | adverb (副词) | The second letter of "adverb", since the first letter is already used for adjectives. |
e | interjection (叹词) | The first letter of "exclamation". |
f | locative word (方位词) | The initial of the Chinese character 方 (fāng). |
g | morpheme (语素) | Most morphemes can serve as the "root" of a compound word; the initial of 根 (gēn, "root"). |
h | prefix component (前接成分) | The first letter of "head". |
i | idiom (成语) | The first letter of "idiom". |
j | abbreviation (简称略语) | The initial of 简 (jiǎn). |
k | suffix component (后接成分) | |
l | fixed expression (习用语) | Fixed expressions that are not yet idioms and feel somewhat "temporary"; the initial of 临 (lín). |
m | numeral (数词) | The third letter of "numeral", since n and u are already taken. |
Ng | nominal morpheme (名语素) | Nominal morpheme. The noun code is n; N is placed before the morpheme code g. |
n | noun (名词) | The first letter of "noun". |
nr | person name (人名) | Noun code n combined with the initial of 人 (rén). |
ns | place name (地名) | Noun code n combined with the locative-word code s. |
nt | organization (机构团体) | The initial of 团 (tuán) is t; noun code n combined with t. |
nz | other proper noun (其他专名) | The initial of 专 (zhuān) begins with z; noun code n combined with z. |
o | onomatopoeia (拟声词) | The first letter of "onomatopoeia". |
p | preposition (介词) | The first letter of "prepositional". |
q | measure word (量词) | The first letter of "quantity". |
r | pronoun (代词) | The second letter of "pronoun", since p is already used for prepositions. |
s | locative noun (处所词) | The first letter of "space". |
Tg | temporal morpheme (时语素) | Temporal morpheme. The time-word code is t; T is placed before the morpheme code g. |
t | time word (时间词) | The first letter of "time". |
u | particle (助词) | The second letter of "auxiliary", since a is already used for adjectives. |
Vg | verbal morpheme (动语素) | Verbal morpheme. The verb code is v; V is placed before the morpheme code g. |
v | verb (动词) | The first letter of "verb". |
vd | adverbial verb (副动词) | A verb used directly as an adverbial; the verb and adverb codes combined. |
vn | nominal verb (名动词) | A verb with nominal function; the verb and noun codes combined. |
w | punctuation (标点符号) | |
x | non-morpheme character (非语素字) | A non-morpheme character is merely a symbol; x conventionally denotes unknowns and symbols. |
y | modal particle (语气词) | The initial of 语 (yǔ). |
z | status word (状态词) | The first letter of the initial of 状 (zhuàng). |
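For reference, HanLP's default pipeline emits tags from an extended variant of this tagset; a quick way to inspect them (a sketch assuming pyhanlp and its data are installed) is:

from pyhanlp import HanLP

# each term prints as word/tag; the exact tags depend on the loaded model and on
# HanLP's extended tagset, so treat this only as a way to explore the labels
print(HanLP.segment("他的希望是希望上学"))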
spaCy English POS Tag Reference
Tag | Full Form | Chinese Name |
---|---|---|
ADJ | adjective | 形容词 |
ADP | adposition | 介词 |
ADV | adverb | 副词 |
AUX | auxiliary verb | 助动词 |
CCONJ | coordinating conjunction | 并列连词 |
DET | determiner | 限定词 |
INTJ | interjection | 感叹词 |
NOUN | noun | 名词 |
NUM | numeral | 数词 |
PART | particle | 助词 |
PRON | pronoun | 代词 |
PROPN | proper noun | 专有名词 |
PUNCT | punctuation | 标点符号 |
SCONJ | subordinating conjunction | 从属连词 |
SYM | symbol | 符号 |
VERB | verb | 动词 |
X | other | 其他 |
spaCy's official glossary of POS tags, dependency labels, and entity types (from its source code):
def explain(term):
    """Get a description for a given POS tag, dependency label or entity type.

    term (str): The term to explain.
    RETURNS (str): The explanation, or `None` if not found in the glossary.

    EXAMPLE:
        >>> spacy.explain(u'NORP')
        >>> doc = nlp(u'Hello world')
        >>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
    """
    if term in GLOSSARY:
        return GLOSSARY[term]
GLOSSARY = {
# POS tags
# Universal POS Tags
# http://universaldependencies.org/u/pos/
"ADJ": "adjective",
"ADP": "adposition",
"ADV": "adverb",
"AUX": "auxiliary",
"CONJ": "conjunction",
"CCONJ": "coordinating conjunction",
"DET": "determiner",
"INTJ": "interjection",
"NOUN": "noun",
"NUM": "numeral",
"PART": "particle",
"PRON": "pronoun",
"PROPN": "proper noun",
"PUNCT": "punctuation",
"SCONJ": "subordinating conjunction",
"SYM": "symbol",
"VERB": "verb",
"X": "other",
"EOL": "end of line",
"SPACE": "space",
# POS tags (English)
# OntoNotes 5 / Penn Treebank
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
".": "punctuation mark, sentence closer",
",": "punctuation mark, comma",
"-LRB-": "left round bracket",
"-RRB-": "right round bracket",
"``": "opening quotation mark",
'""': "closing quotation mark",
"''": "closing quotation mark",
":": "punctuation mark, colon or ellipsis",
"$": "symbol, currency",
"#": "symbol, number sign",
"AFX": "affix",
"CC": "conjunction, coordinating",
"CD": "cardinal number",
"DT": "determiner",
"EX": "existential there",
"FW": "foreign word",
"HYPH": "punctuation mark, hyphen",
"IN": "conjunction, subordinating or preposition",
"JJ": "adjective (English), other noun-modifier (Chinese)",
"JJR": "adjective, comparative",
"JJS": "adjective, superlative",
"LS": "list item marker",
"MD": "verb, modal auxiliary",
"NIL": "missing tag",
"NN": "noun, singular or mass",
"NNP": "noun, proper singular",
"NNPS": "noun, proper plural",
"NNS": "noun, plural",
"PDT": "predeterminer",
"POS": "possessive ending",
"PRP": "pronoun, personal",
"PRP$": "pronoun, possessive",
"RB": "adverb",
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
"TO": 'infinitival "to"',
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
"VBG": "verb, gerund or present participle",
"VBN": "verb, past participle",
"VBP": "verb, non-3rd person singular present",
"VBZ": "verb, 3rd person singular present",
"WDT": "wh-determiner",
"WP": "wh-pronoun, personal",
"WP$": "wh-pronoun, possessive",
"WRB": "wh-adverb",
"SP": "space (English), sentence-final particle (Chinese)",
"ADD": "email",
"NFP": "superfluous punctuation",
"GW": "additional word in multi-word expression",
"XX": "unknown",
"BES": 'auxiliary "be"',
"HVS": 'forms of "have"',
"_SP": "whitespace",
# POS Tags (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
"$(": "other sentence-internal punctuation mark",
"$,": "comma",
"$.": "sentence-final punctuation mark",
"ADJA": "adjective, attributive",
"ADJD": "adjective, adverbial or predicative",
"APPO": "postposition",
"APPR": "preposition; circumposition left",
"APPRART": "preposition with article",
"APZR": "circumposition right",
"ART": "definite or indefinite article",
"CARD": "cardinal number",
"FM": "foreign language material",
"ITJ": "interjection",
"KOKOM": "comparative conjunction",
"KON": "coordinate conjunction",
"KOUI": 'subordinate conjunction with "zu" and infinitive',
"KOUS": "subordinate conjunction with sentence",
"NE": "proper noun",
"NNE": "proper noun",
"PAV": "pronominal adverb",
"PROAV": "pronominal adverb",
"PDAT": "attributive demonstrative pronoun",
"PDS": "substituting demonstrative pronoun",
"PIAT": "attributive indefinite pronoun without determiner",
"PIDAT": "attributive indefinite pronoun with determiner",
"PIS": "substituting indefinite pronoun",
"PPER": "non-reflexive personal pronoun",
"PPOSAT": "attributive possessive pronoun",
"PPOSS": "substituting possessive pronoun",
"PRELAT": "attributive relative pronoun",
"PRELS": "substituting relative pronoun",
"PRF": "reflexive personal pronoun",
"PTKA": "particle with adjective or adverb",
"PTKANT": "answer particle",
"PTKNEG": "negative particle",
"PTKVZ": "separable verbal particle",
"PTKZU": '"zu" before infinitive',
"PWAT": "attributive interrogative pronoun",
"PWAV": "adverbial interrogative or relative pronoun",
"PWS": "substituting interrogative pronoun",
"TRUNC": "word remnant",
"VAFIN": "finite verb, auxiliary",
"VAIMP": "imperative, auxiliary",
"VAINF": "infinitive, auxiliary",
"VAPP": "perfect participle, auxiliary",
"VMFIN": "finite verb, modal",
"VMINF": "infinitive, modal",
"VMPP": "perfect participle, modal",
"VVFIN": "finite verb, full",
"VVIMP": "imperative, full",
"VVINF": "infinitive, full",
"VVIZU": 'infinitive with "zu", full',
"VVPP": "perfect participle, full",
"XY": "non-word containing non-letter",
# POS Tags (Chinese)
# OntoNotes / Chinese Penn Treebank
# https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
"AD": "adverb",
"AS": "aspect marker",
"BA": "把 in ba-construction",
# "CD": "cardinal number",
"CS": "subordinating conjunction",
"DEC": "的 in a relative clause",
"DEG": "associative 的",
"DER": "得 in V-de const. and V-de-R",
"DEV": "地 before VP",
"ETC": "for words 等, 等等",
# "FW": "foreign words"
"IJ": "interjection",
# "JJ": "other noun-modifier",
"LB": "被 in long bei-const",
"LC": "localizer",
"M": "measure word",
"MSP": "other particle",
# "NN": "common noun",
"NR": "proper noun",
"NT": "temporal noun",
"OD": "ordinal number",
"ON": "onomatopoeia",
"P": "preposition excluding 把 and 被",
"PN": "pronoun",
"PU": "punctuation",
"SB": "被 in short bei-const",
# "SP": "sentence-final particle",
"VA": "predicative adjective",
"VC": "是 (copula)",
"VE": "有 as the main verb",
"VV": "other verb",
# Noun chunks
"NP": "noun phrase",
"PP": "prepositional phrase",
"VP": "verb phrase",
"ADVP": "adverb phrase",
"ADJP": "adjective phrase",
"SBAR": "subordinating conjunction",
"PRT": "particle",
"PNP": "prepositional noun phrase",
# Dependency Labels (English)
# ClearNLP / Universal Dependencies
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
"acl": "clausal modifier of noun (adjectival clause)",
"acomp": "adjectival complement",
"advcl": "adverbial clause modifier",
"advmod": "adverbial modifier",
"agent": "agent",
"amod": "adjectival modifier",
"appos": "appositional modifier",
"attr": "attribute",
"aux": "auxiliary",
"auxpass": "auxiliary (passive)",
"case": "case marking",
"cc": "coordinating conjunction",
"ccomp": "clausal complement",
"clf": "classifier",
"complm": "complementizer",
"compound": "compound",
"conj": "conjunct",
"cop": "copula",
"csubj": "clausal subject",
"csubjpass": "clausal subject (passive)",
"dative": "dative",
"dep": "unclassified dependent",
"det": "determiner",
"discourse": "discourse element",
"dislocated": "dislocated elements",
"dobj": "direct object",
"expl": "expletive",
"fixed": "fixed multiword expression",
"flat": "flat multiword expression",
"goeswith": "goes with",
"hmod": "modifier in hyphenation",
"hyph": "hyphen",
"infmod": "infinitival modifier",
"intj": "interjection",
"iobj": "indirect object",
"list": "list",
"mark": "marker",
"meta": "meta modifier",
"neg": "negation modifier",
"nmod": "modifier of nominal",
"nn": "noun compound modifier",
"npadvmod": "noun phrase as adverbial modifier",
"nsubj": "nominal subject",
"nsubjpass": "nominal subject (passive)",
"nounmod": "modifier of nominal",
"npmod": "noun phrase as adverbial modifier",
"num": "number modifier",
"number": "number compound modifier",
"nummod": "numeric modifier",
"oprd": "object predicate",
"obj": "object",
"obl": "oblique nominal",
"orphan": "orphan",
"parataxis": "parataxis",
"partmod": "participal modifier",
"pcomp": "complement of preposition",
"pobj": "object of preposition",
"poss": "possession modifier",
"possessive": "possessive modifier",
"preconj": "pre-correlative conjunction",
"prep": "prepositional modifier",
"prt": "particle",
"punct": "punctuation",
"quantmod": "modifier of quantifier",
"rcmod": "relative clause modifier",
"relcl": "relative clause modifier",
"reparandum": "overridden disfluency",
"root": "root",
"vocative": "vocative",
"xcomp": "open clausal complement",
# Dependency labels (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
# currently missing: 'cc' (comparative complement) because of conflict
# with English labels
"ac": "adpositional case marker",
"adc": "adjective component",
"ag": "genitive attribute",
"ams": "measure argument of adjective",
"app": "apposition",
"avc": "adverbial phrase component",
"cd": "coordinating conjunction",
"cj": "conjunct",
"cm": "comparative conjunction",
"cp": "complementizer",
"cvc": "collocational verb construction",
"da": "dative",
"dh": "discourse-level head",
"dm": "discourse marker",
"ep": "expletive es",
"hd": "head",
"ju": "junctor",
"mnr": "postnominal modifier",
"mo": "modifier",
"ng": "negation",
"nk": "noun kernel element",
"nmc": "numerical component",
"oa": "accusative object",
"oc": "clausal object",
"og": "genitive object",
"op": "prepositional object",
"par": "parenthetical element",
"pd": "predicate",
"pg": "phrasal genitive",
"ph": "placeholder",
"pm": "morphological particle",
"pnc": "proper noun component",
"rc": "relative clause",
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
"sbp": "passivized subject (PP)",
"sp": "subject or predicate",
"svp": "separable verb prefix",
"uc": "unit component",
"vo": "vocative",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
"PERSON": "People, including fictional",
"NORP": "Nationalities or religious or political groups",
"FACILITY": "Buildings, airports, highways, bridges, etc.",
"FAC": "Buildings, airports, highways, bridges, etc.",
"ORG": "Companies, agencies, institutions, etc.",
"GPE": "Countries, cities, states",
"LOC": "Non-GPE locations, mountain ranges, bodies of water",
"PRODUCT": "Objects, vehicles, foods, etc. (not services)",
"EVENT": "Named hurricanes, battles, wars, sports events, etc.",
"WORK_OF_ART": "Titles of books, songs, etc.",
"LAW": "Named documents made into laws.",
"LANGUAGE": "Any named language",
"DATE": "Absolute or relative dates or periods",
"TIME": "Times smaller than a day",
"PERCENT": 'Percentage, including "%"',
"MONEY": "Monetary values, including unit",
"QUANTITY": "Measurements, as of weight or distance",
"ORDINAL": '"first", "second", etc.',
"CARDINAL": "Numerals that do not fall under another type",
# Named Entity Recognition
# Wikipedia
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
"PER": "Named person or family.",
"MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
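A quick way to query this glossary from Python is spacy.explain; the sketch below assumes the small English model en_core_web_sm has been downloaded:

import spacy

# look up a single tag, dependency label or entity type
print(spacy.explain("NORP"))  # Nationalities or religious or political groups
print(spacy.explain("VBZ"))   # verb, 3rd person singular present

# explain every fine-grained tag in a parsed sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello world")
print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])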
Algorithms
POS Tagging with Hidden Markov Models
The HMM is a probabilistic directed graphical model, and its inference is arguably among the most intricate in classical machine learning; decoding relies on dynamic programming (see 《统计学习方法》 for the mathematical details). A hidden Markov model (Hidden Markov Model, HMM) describes the joint distribution p(x, y) of two aligned sequences: the x sequence is visible to the observer and is called the observation sequence, while the y sequence is invisible and is called the state sequence. For instance, the observations x may be words and the states y their parts of speech, and the task is to infer the tags from the word sequence. The model is called "hidden" because, viewed from the outside, the state sequence (e.g., the POS tags) cannot be observed and is exactly the variable to be solved for; accordingly, states are also called hidden states and observations visible states. It is called a "Markov" model because it satisfies the Markov assumption. Going from data to an HMM to predicted tags requires solving three problems: probability evaluation, learning, and prediction; the prediction (decoding) problem is to find the most probable state sequence, i.e., the tag sequence, given the observation sequence, as in the Viterbi sketch below.
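To make the decoding problem concrete, here is a minimal Viterbi decoder for a first-order HMM; the tiny hand-written probabilities at the bottom are invented purely for illustration and are not learned from any corpus.

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for an observation sequence.

    obs     : list of observation indices
    start_p : (S,)   initial state probabilities
    trans_p : (S, S) transition probabilities P(s_t | s_{t-1})
    emit_p  : (S, O) emission probabilities   P(o_t | s_t)
    """
    S = len(start_p)
    T = len(obs)
    delta = np.zeros((T, S))             # best score of any path ending in state s at time t
    back = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] * trans_p[:, s] * emit_p[s, obs[t]]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]]
    # follow backpointers from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: states 0 = noun, 1 = verb; observations 0 = "希望", 1 = "上学" (numbers are made up)
start = np.array([0.6, 0.4])
trans = np.array([[0.3, 0.7],
                  [0.8, 0.2]])
emit = np.array([[0.5, 0.5],
                 [0.6, 0.4]])
print(viterbi([0, 1], start, trans, emit))  # [1, 0]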
POS Tagging with Conditional Random Fields
The CRF is a probabilistic undirected graphical model; its mathematics largely parallels that of the HMM (again, see 《统计学习方法》 for details).
HanLP's CRF implementation runs on the Java virtual machine and is slower than the C++ implementation, so the author recommends installing CRF++ directly on your machine. On macOS this is as simple as brew install crf++; for other platforms, refer to the CRF++ installation documentation.
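For readers who prefer to stay in Python, the sketch below trains a linear-chain CRF POS tagger with the python-crfsuite package (a binding of CRFsuite, not of CRF++ or HanLP); the toy training sentence, the feature template, and the file name toy_pos.crfsuite are all invented for illustration.

import pycrfsuite  # pip install python-crfsuite

def word2features(words, i):
    # very small feature set: current word plus its neighbours
    feats = ["w=" + words[i]]
    feats.append("w-1=" + (words[i - 1] if i > 0 else "<BOS>"))
    feats.append("w+1=" + (words[i + 1] if i < len(words) - 1 else "<EOS>"))
    return feats

# toy segmented-and-tagged sentence (tags follow the PKU-style conventions above)
train = [(["他", "的", "希望", "是", "希望", "上学"],
          ["r",  "u",  "n",    "v",  "v",    "v"])]

trainer = pycrfsuite.Trainer(verbose=False)
for words, tags in train:
    xseq = [word2features(words, i) for i in range(len(words))]
    trainer.append(xseq, tags)
trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
trainer.train("toy_pos.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("toy_pos.crfsuite")
words = ["他", "的", "希望"]
print(tagger.tag([word2features(words, i) for i in range(len(words))]))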
POS Tagging with the Perceptron
The perceptron is essentially a single-layer neural network; stacking several layers yields the multilayer perceptron, the basic building block of deep models (see 《统计学习方法》 for details).
As with Chinese word segmentation, the perceptron can exploit rich contextual features and is therefore usually a better choice than the hidden Markov model; the same holds for POS tagging (a minimal sketch follows).
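The sketch below shows the idea in its simplest form: a greedy, per-token perceptron with hand-written context features (this is not HanLP's structured implementation); the class name, feature template, and toy data are invented for illustration.

from collections import defaultdict

def features(words, i, prev_tag):
    # context features: current word, its neighbours and the previous tag
    return [
        "w=" + words[i],
        "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),
        "w+1=" + (words[i + 1] if i < len(words) - 1 else "<EOS>"),
        "t-1=" + prev_tag,
    ]

class PerceptronTagger:
    def __init__(self, tagset):
        self.tagset = tagset
        self.w = defaultdict(float)  # weights indexed by (feature, tag)

    def score(self, feats, tag):
        return sum(self.w[(f, tag)] for f in feats)

    def predict(self, words):
        tags, prev = [], "<BOS>"
        for i in range(len(words)):
            feats = features(words, i, prev)
            prev = max(self.tagset, key=lambda t: self.score(feats, t))
            tags.append(prev)
        return tags

    def train(self, data, epochs=5):
        for _ in range(epochs):
            for words, gold in data:
                prev = "<BOS>"
                for i, y in enumerate(gold):
                    feats = features(words, i, prev)
                    pred = max(self.tagset, key=lambda t: self.score(feats, t))
                    if pred != y:  # update weights only on mistakes
                        for f in feats:
                            self.w[(f, y)] += 1.0
                            self.w[(f, pred)] -= 1.0
                    prev = y  # use the gold tag as left context during training

data = [(["他", "的", "希望", "是", "希望", "上学"],
         ["r", "u", "n", "v", "v", "v"])]
tagger = PerceptronTagger(tagset=["r", "u", "n", "v"])
tagger.train(data)
print(tagger.predict(["他", "的", "希望", "是", "希望", "上学"]))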
POS Tagging Evaluation

$$\text{Accuracy} = \frac{\text{number of correctly predicted tags}}{\text{total number of tags}}$$
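A direct implementation of this metric (the function name is illustrative):

def tagging_accuracy(gold_tags, pred_tags):
    # token-level accuracy: correctly predicted tags / total tags
    assert len(gold_tags) == len(pred_tags)
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if g == p)
    return correct / len(gold_tags)

print(tagging_accuracy(["r", "u", "n", "v"], ["r", "u", "v", "v"]))  # 0.75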
Code
Hidden Markov Model Implementation
from pyhanlp import *
# corpus: the January 1998 People's Daily (《人民日报》) corpus
from tests.book.ch07.pku import PKU199801_TRAIN

HMMPOSTagger = JClass('com.hankcs.hanlp.model.hmm.HMMPOSTagger')
AbstractLexicalAnalyzer = JClass('com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer')
PerceptronSegmenter = JClass('com.hankcs.hanlp.model.perceptron.PerceptronSegmenter')
FirstOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel')
SecondOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.SecondOrderHiddenMarkovModel')

def train_hmm_pos(corpus, model):
    tagger = HMMPOSTagger(model)  # create the POS tagger
    tagger.train(corpus)          # train it
    # the tagger itself does not segment text, so it only accepts pre-segmented word sequences
    # print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    # wrapping it in a lexical analyzer performs segmentation and POS tagging in one pass
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    # output with abbreviated tags
    # print(analyzer.analyze("今年元旦我要去看升国旗!"))  # segmentation + POS tagging
    # translateLabels() turns the abbreviated tags into human-readable names
    print(analyzer.analyze("他的希望是希望上学").translateLabels())  # segmentation + POS tagging
    return tagger

# first-order hidden Markov model
tagger = train_hmm_pos(PKU199801_TRAIN, FirstOrderHiddenMarkovModel())
# second-order hidden Markov model
tagger = train_hmm_pos(PKU199801_TRAIN, SecondOrderHiddenMarkovModel())
Conditional Random Field Implementation
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import POS_MODEL, PKU199801_TRAIN

CRFPOSTagger = JClass('com.hankcs.hanlp.model.crf.CRFPOSTagger')

def train_crf_pos(corpus):
    # Option 1: train with HanLP's Java API (slow)
    tagger = CRFPOSTagger(None)       # create a blank tagger
    tagger.train(corpus, POS_MODEL)   # train
    tagger = CRFPOSTagger(POS_MODEL)  # load the trained model
    # Option 2: train with CRF++ and load the text model in HanLP
    # (the CRF++ training command is printed by Option 1)
    # tagger = CRFPOSTagger(POS_MODEL + ".txt")
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger

if __name__ == '__main__':
    tagger = train_crf_pos(PKU199801_TRAIN)
Perceptron Implementation
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import PKU199801_TRAIN, POS_MODEL

POSTrainer = JClass('com.hankcs.hanlp.model.perceptron.POSTrainer')
PerceptronPOSTagger = JClass('com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger')

def train_perceptron_pos(corpus):
    trainer = POSTrainer()
    # trainer.train(corpus, POS_MODEL)  # train (uncomment if POS_MODEL has not been built yet)
    tagger = PerceptronPOSTagger(POS_MODEL)  # load the model
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger

if __name__ == '__main__':
    train_perceptron_pos(PKU199801_TRAIN)
Training a Chinese POS Tagger with spaCy
# -*- coding: utf-8 -*-
"""
Created on Fri Apr 1 18:10:18 2022
@author: He Zekai
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.training import Example
from spacy.tokens import Doc
# mapping from the tags produced by LAC (used below) to Chinese display names
TAG_MAP = {
    'n':    {'pos': '普通名词'},   # common noun
    'f':    {'pos': '方位名词'},   # locative noun
    's':    {'pos': '处所名词'},   # place noun
    'nw':   {'pos': '作品名'},     # work/title name
    'nz':   {'pos': '其他专名'},   # other proper noun
    'v':    {'pos': '普通动词'},   # common verb
    'vd':   {'pos': '动副词'},     # adverbial verb
    'vn':   {'pos': '名动词'},     # nominal verb
    'a':    {'pos': '形容词'},     # adjective
    'ad':   {'pos': '副形词'},     # adverbial adjective
    'an':   {'pos': '名形词'},     # nominal adjective
    'd':    {'pos': '副词'},       # adverb
    'm':    {'pos': '数量词'},     # numeral/quantifier
    'q':    {'pos': '量词'},       # measure word
    'r':    {'pos': '代词'},       # pronoun
    'p':    {'pos': '介词'},       # preposition
    'c':    {'pos': '连词'},       # conjunction
    'u':    {'pos': '助词'},       # particle
    'xc':   {'pos': '其他虚词'},   # other function word
    'w':    {'pos': '标点符号'},   # punctuation
    'PER':  {'pos': '人名'},       # person name
    'LOC':  {'pos': '地名'},       # place name
    'ORG':  {'pos': '机构名'},     # organization name
    'TIME': {'pos': '时间'},       # time expression
}
from LAC import LAC  # Baidu LAC is used to produce the (silver-standard) training tags
import pandas as pd

text = []
with open('train_text.txt', encoding='utf8') as f:
    for i in f.readlines():
        text.append(i.strip('\u200b\n'))  # strip zero-width spaces and trailing newlines

random.seed(123)
lac = LAC(mode='lac')
train = random.sample(text, 20)  # randomly sample 20 news items
data = lac.run(train)            # each result is a [words, tags] pair

TRAIN_DATA = []
for i in range(len(data)):
    txt = (data[i][0], {'tags': data[i][1]})
    TRAIN_DATA.append(txt)
# TRAIN_DATA now holds (word list, {'tags': tag list}) pairs
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir="C:/Users/11752/Desktop/大三下/自然语言处理/作业5--自然语言处理作业zip/model/", n_iter=300):
    nlp = spacy.blank(lang)          # blank Chinese pipeline
    tagger = nlp.add_pipe('tagger')  # add an empty tagger component
    print("pipeline:", nlp.pipe_names)
    # register every tag as a label of the tagger
    for tag, values in TAG_MAP.items():
        print("tag:", tag)
        tagger.add_label(tag)
    optimizer = nlp.begin_training()  # initialize the model (deprecated alias of nlp.initialize() in spaCy v3)
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for words, annotations in TRAIN_DATA:
            # build a pre-tokenized Doc so spaCy does not re-segment the text
            example = Example.from_dict(Doc(nlp.vocab, words=words, spaces=[False] * len(words)), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print("iteration:", str(i) + str(losses))

    # tag a few of the training sentences as a sanity check
    test_text = random.sample(train, 5)  # randomly pick 5 of the sampled news items
    test_text = u','.join(test_text)
    doc = nlp(test_text)
    print("doc:", [t.text for t in doc])
    print('Test_Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])

    # save the model to the output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # reload the saved model and test it
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])

if __name__ == '__main__':
    plac.call(main)
References
https://blog.csdn.net/yuebowhu/article/details/112006712
https://blog.csdn.net/weixin_45965387/article/details/123953377