Part-of-Speech Tagging Tutorial
Basic Concepts
In linguistics, part of speech (Part-Of-Speech, POS) refers to the grammatical category of a word, also called its word class. Words in the same class share similar grammatical properties, and the set of all POS tags is called the POS tagset. Different corpora adopt different tagsets, but most include common categories such as adjectives, verbs, and nouns.
Sequence labeling is the problem of, given a sequence x = x1, x2, ..., xn, predicting the label yi of every element xi. The set of all possible values of y is called the tagset. For example, given a sequence of natural numbers as input, output the parity (odd or even) of each number.
A model that solves a sequence labeling problem is generally called a sequence labeler (tagger); it usually learns from an annotated dataset and then makes predictions. In NLP, x is typically a character or a word, while y is the label to be predicted, such as a word-formation role or a part of speech. Chinese word segmentation, POS tagging, and named entity recognition can all be cast as sequence labeling problems.
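As a toy illustration of the parity example just mentioned (this sketch is not part of the original tutorial; the function name tag_parity is made up), a sequence labeler over the tagset {odd, even} can be written directly:

# A minimal sequence-labeling toy: each observation is a natural number,
# and its label is the parity. The tagset is {"odd", "even"}.
def tag_parity(sequence):
    return ["even" if x % 2 == 0 else "odd" for x in sequence]

print(tag_parity([3, 1, 4, 1, 5, 9, 2, 6]))
# ['odd', 'odd', 'even', 'odd', 'odd', 'odd', 'even', 'even']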
ICTCLAS Chinese POS Tagset
Tag | Name | Notes |
---|---|---|
Ag | adjectival morpheme (形语素) | Adjectival morpheme. The adjective code is a; A is placed before the morpheme code g. |
a | adjective (形容词) | The first letter of the English word "adjective". |
ad | adverbial adjective (副形词) | An adjective used directly as an adverbial; adjective code a combined with adverb code d. |
an | nominal adjective (名形词) | An adjective with nominal function; adjective code a combined with noun code n. |
b | distinguishing word (区别词) | The initial of the Chinese character 别 (bié). |
c | conjunction (连词) | The first letter of "conjunction". |
Dg | adverbial morpheme (副语素) | Adverbial morpheme. The adverb code is d; D is placed before the morpheme code g. |
d | adverb (副词) | The second letter of "adverb", since the first letter is already used for adjectives. |
e | interjection (叹词) | The first letter of "exclamation". |
f | locative word (方位词) | The initial of the Chinese character 方 (fāng). |
g | morpheme (语素) | Most morphemes can serve as the "root" of a compound word; the initial of 根 (gēn, "root"). |
h | prefix component (前接成分) | The first letter of "head". |
i | idiom (成语) | The first letter of "idiom". |
j | abbreviation (简称略语) | The initial of 简 (jiǎn). |
k | suffix component (后接成分) | |
l | fixed expression (习用语) | Fixed expressions that are not yet idioms and feel somewhat "temporary"; the initial of 临 (lín). |
m | numeral (数词) | The third letter of "numeral", since n and u are already taken. |
Ng | nominal morpheme (名语素) | Nominal morpheme. The noun code is n; N is placed before the morpheme code g. |
n | noun (名词) | The first letter of "noun". |
nr | person name (人名) | Noun code n combined with the initial of 人 (rén). |
ns | place name (地名) | Noun code n combined with the locative-word code s. |
nt | organization (机构团体) | The initial of 团 (tuán) is t; noun code n combined with t. |
nz | other proper noun (其他专名) | The initial of 专 (zhuān) begins with z; noun code n combined with z. |
o | onomatopoeia (拟声词) | The first letter of "onomatopoeia". |
p | preposition (介词) | The first letter of "prepositional". |
q | measure word (量词) | The first letter of "quantity". |
r | pronoun (代词) | The second letter of "pronoun", since p is already used for prepositions. |
s | locative noun (处所词) | The first letter of "space". |
Tg | temporal morpheme (时语素) | Temporal morpheme. The time-word code is t; T is placed before the morpheme code g. |
t | time word (时间词) | The first letter of "time". |
u | particle (助词) | The second letter of "auxiliary", since a is already used for adjectives. |
Vg | verbal morpheme (动语素) | Verbal morpheme. The verb code is v; V is placed before the morpheme code g. |
v | verb (动词) | The first letter of "verb". |
vd | adverbial verb (副动词) | A verb used directly as an adverbial; the verb and adverb codes combined. |
vn | nominal verb (名动词) | A verb with nominal function; the verb and noun codes combined. |
w | punctuation (标点符号) | |
x | non-morpheme character (非语素字) | A non-morpheme character is merely a symbol; x conventionally denotes unknowns and symbols. |
y | modal particle (语气词) | The initial of 语 (yǔ). |
z | status word (状态词) | The first letter of the initial of 状 (zhuàng). |
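For reference, HanLP's default pipeline emits tags from an extended variant of this tagset; a quick way to inspect them (a sketch assuming pyhanlp and its data are installed) is:

from pyhanlp import HanLP

# each term prints as word/tag; the exact tags depend on the loaded model and on
# HanLP's extended tagset, so treat this only as a way to explore the labels
print(HanLP.segment("他的希望是希望上学"))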
spaCy English POS Tag Reference
Tag | Full Form | Chinese Name |
---|---|---|
ADJ | adjective | 形容词 |
ADP | adposition | 介词 |
ADV | adverb | 副词 |
AUX | auxiliary verb | 助动词 |
CCONJ | coordinating conjunction | 并列连词 |
DET | determiner | 限定词 |
INTJ | interjection | 感叹词 |
NOUN | noun | 名词 |
NUM | numeral | 数词 |
PART | particle | 助词 |
PRON | pronoun | 代词 |
PROPN | proper noun | 专有名词 |
PUNCT | punctuation | 标点符号 |
SCONJ | subordinating conjunction | 从属连词 |
SYM | symbol | 符号 |
VERB | verb | 动词 |
X | other | 其他 |
spaCy's official glossary of POS tags, dependency labels, and entity types (from its source code):
def explain(term):
    """Get a description for a given POS tag, dependency label or entity type.

    term (str): The term to explain.
    RETURNS (str): The explanation, or `None` if not found in the glossary.

    EXAMPLE:
        >>> spacy.explain(u'NORP')
        >>> doc = nlp(u'Hello world')
        >>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
    """
    if term in GLOSSARY:
        return GLOSSARY[term]
GLOSSARY = {
# POS tags
# Universal POS Tags
# http://universaldependencies.org/u/pos/
"ADJ": "adjective",
"ADP": "adposition",
"ADV": "adverb",
"AUX": "auxiliary",
"CONJ": "conjunction",
"CCONJ": "coordinating conjunction",
"DET": "determiner",
"INTJ": "interjection",
"NOUN": "noun",
"NUM": "numeral",
"PART": "particle",
"PRON": "pronoun",
"PROPN": "proper noun",
"PUNCT": "punctuation",
"SCONJ": "subordinating conjunction",
"SYM": "symbol",
"VERB": "verb",
"X": "other",
"EOL": "end of line",
"SPACE": "space",
# POS tags (English)
# OntoNotes 5 / Penn Treebank
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
".": "punctuation mark, sentence closer",
",": "punctuation mark, comma",
"-LRB-": "left round bracket",
"-RRB-": "right round bracket",
"``": "opening quotation mark",
'""': "closing quotation mark",
"''": "closing quotation mark",
":": "punctuation mark, colon or ellipsis",
"$": "symbol, currency",
"#": "symbol, number sign",
"AFX": "affix",
"CC": "conjunction, coordinating",
"CD": "cardinal number",
"DT": "determiner",
"EX": "existential there",
"FW": "foreign word",
"HYPH": "punctuation mark, hyphen",
"IN": "conjunction, subordinating or preposition",
"JJ": "adjective (English), other noun-modifier (Chinese)",
"JJR": "adjective, comparative",
"JJS": "adjective, superlative",
"LS": "list item marker",
"MD": "verb, modal auxiliary",
"NIL": "missing tag",
"NN": "noun, singular or mass",
"NNP": "noun, proper singular",
"NNPS": "noun, proper plural",
"NNS": "noun, plural",
"PDT": "predeterminer",
"POS": "possessive ending",
"PRP": "pronoun, personal",
"PRP$": "pronoun, possessive",
"RB": "adverb",
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
"TO": 'infinitival "to"',
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
"VBG": "verb, gerund or present participle",
"VBN": "verb, past participle",
"VBP": "verb, non-3rd person singular present",
"VBZ": "verb, 3rd person singular present",
"WDT": "wh-determiner",
"WP": "wh-pronoun, personal",
"WP$": "wh-pronoun, possessive",
"WRB": "wh-adverb",
"SP": "space (English), sentence-final particle (Chinese)",
"ADD": "email",
"NFP": "superfluous punctuation",
"GW": "additional word in multi-word expression",
"XX": "unknown",
"BES": 'auxiliary "be"',
"HVS": 'forms of "have"',
"_SP": "whitespace",
# POS Tags (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
"$(": "other sentence-internal punctuation mark",
"$,": "comma",
"$.": "sentence-final punctuation mark",
"ADJA": "adjective, attributive",
"ADJD": "adjective, adverbial or predicative",
"APPO": "postposition",
"APPR": "preposition; circumposition left",
"APPRART": "preposition with article",
"APZR": "circumposition right",
"ART": "definite or indefinite article",
"CARD": "cardinal number",
"FM": "foreign language material",
"ITJ": "interjection",
"KOKOM": "comparative conjunction",
"KON": "coordinate conjunction",
"KOUI": 'subordinate conjunction with "zu" and infinitive',
"KOUS": "subordinate conjunction with sentence",
"NE": "proper noun",
"NNE": "proper noun",
"PAV": "pronominal adverb",
"PROAV": "pronominal adverb",
"PDAT": "attributive demonstrative pronoun",
"PDS": "substituting demonstrative pronoun",
"PIAT": "attributive indefinite pronoun without determiner",
"PIDAT": "attributive indefinite pronoun with determiner",
"PIS": "substituting indefinite pronoun",
"PPER": "non-reflexive personal pronoun",
"PPOSAT": "attributive possessive pronoun",
"PPOSS": "substituting possessive pronoun",
"PRELAT": "attributive relative pronoun",
"PRELS": "substituting relative pronoun",
"PRF": "reflexive personal pronoun",
"PTKA": "particle with adjective or adverb",
"PTKANT": "answer particle",
"PTKNEG": "negative particle",
"PTKVZ": "separable verbal particle",
"PTKZU": '"zu" before infinitive',
"PWAT": "attributive interrogative pronoun",
"PWAV": "adverbial interrogative or relative pronoun",
"PWS": "substituting interrogative pronoun",
"TRUNC": "word remnant",
"VAFIN": "finite verb, auxiliary",
"VAIMP": "imperative, auxiliary",
"VAINF": "infinitive, auxiliary",
"VAPP": "perfect participle, auxiliary",
"VMFIN": "finite verb, modal",
"VMINF": "infinitive, modal",
"VMPP": "perfect participle, modal",
"VVFIN": "finite verb, full",
"VVIMP": "imperative, full",
"VVINF": "infinitive, full",
"VVIZU": 'infinitive with "zu", full',
"VVPP": "perfect participle, full",
"XY": "non-word containing non-letter",
# POS Tags (Chinese)
# OntoNotes / Chinese Penn Treebank
# https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
"AD": "adverb",
"AS": "aspect marker",
"BA": "把 in ba-construction",
# "CD": "cardinal number",
"CS": "subordinating conjunction",
"DEC": "的 in a relative clause",
"DEG": "associative 的",
"DER": "得 in V-de const. and V-de-R",
"DEV": "地 before VP",
"ETC": "for words 等, 等等",
# "FW": "foreign words"
"IJ": "interjection",
# "JJ": "other noun-modifier",
"LB": "被 in long bei-const",
"LC": "localizer",
"M": "measure word",
"MSP": "other particle",
# "NN": "common noun",
"NR": "proper noun",
"NT": "temporal noun",
"OD": "ordinal number",
"ON": "onomatopoeia",
"P": "preposition excluding 把 and 被",
"PN": "pronoun",
"PU": "punctuation",
"SB": "被 in short bei-const",
# "SP": "sentence-final particle",
"VA": "predicative adjective",
"VC": "是 (copula)",
"VE": "有 as the main verb",
"VV": "other verb",
# Noun chunks
"NP": "noun phrase",
"PP": "prepositional phrase",
"VP": "verb phrase",
"ADVP": "adverb phrase",
"ADJP": "adjective phrase",
"SBAR": "subordinating conjunction",
"PRT": "particle",
"PNP": "prepositional noun phrase",
# Dependency Labels (English)
# ClearNLP / Universal Dependencies
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
"acl": "clausal modifier of noun (adjectival clause)",
"acomp": "adjectival complement",
"advcl": "adverbial clause modifier",
"advmod": "adverbial modifier",
"agent": "agent",
"amod": "adjectival modifier",
"appos": "appositional modifier",
"attr": "attribute",
"aux": "auxiliary",
"auxpass": "auxiliary (passive)",
"case": "case marking",
"cc": "coordinating conjunction",
"ccomp": "clausal complement",
"clf": "classifier",
"complm": "complementizer",
"compound": "compound",
"conj": "conjunct",
"cop": "copula",
"csubj": "clausal subject",
"csubjpass": "clausal subject (passive)",
"dative": "dative",
"dep": "unclassified dependent",
"det": "determiner",
"discourse": "discourse element",
"dislocated": "dislocated elements",
"dobj": "direct object",
"expl": "expletive",
"fixed": "fixed multiword expression",
"flat": "flat multiword expression",
"goeswith": "goes with",
"hmod": "modifier in hyphenation",
"hyph": "hyphen",
"infmod": "infinitival modifier",
"intj": "interjection",
"iobj": "indirect object",
"list": "list",
"mark": "marker",
"meta": "meta modifier",
"neg": "negation modifier",
"nmod": "modifier of nominal",
"nn": "noun compound modifier",
"npadvmod": "noun phrase as adverbial modifier",
"nsubj": "nominal subject",
"nsubjpass": "nominal subject (passive)",
"nounmod": "modifier of nominal",
"npmod": "noun phrase as adverbial modifier",
"num": "number modifier",
"number": "number compound modifier",
"nummod": "numeric modifier",
"oprd": "object predicate",
"obj": "object",
"obl": "oblique nominal",
"orphan": "orphan",
"parataxis": "parataxis",
"partmod": "participal modifier",
"pcomp": "complement of preposition",
"pobj": "object of preposition",
"poss": "possession modifier",
"possessive": "possessive modifier",
"preconj": "pre-correlative conjunction",
"prep": "prepositional modifier",
"prt": "particle",
"punct": "punctuation",
"quantmod": "modifier of quantifier",
"rcmod": "relative clause modifier",
"relcl": "relative clause modifier",
"reparandum": "overridden disfluency",
"root": "root",
"vocative": "vocative",
"xcomp": "open clausal complement",
# Dependency labels (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
# currently missing: 'cc' (comparative complement) because of conflict
# with English labels
"ac": "adpositional case marker",
"adc": "adjective component",
"ag": "genitive attribute",
"ams": "measure argument of adjective",
"app": "apposition",
"avc": "adverbial phrase component",
"cd": "coordinating conjunction",
"cj": "conjunct",
"cm": "comparative conjunction",
"cp": "complementizer",
"cvc": "collocational verb construction",
"da": "dative",
"dh": "discourse-level head",
"dm": "discourse marker",
"ep": "expletive es",
"hd": "head",
"ju": "junctor",
"mnr": "postnominal modifier",
"mo": "modifier",
"ng": "negation",
"nk": "noun kernel element",
"nmc": "numerical component",
"oa": "accusative object",
"oc": "clausal object",
"og": "genitive object",
"op": "prepositional object",
"par": "parenthetical element",
"pd": "predicate",
"pg": "phrasal genitive",
"ph": "placeholder",
"pm": "morphological particle",
"pnc": "proper noun component",
"rc": "relative clause",
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
"sbp": "passivized subject (PP)",
"sp": "subject or predicate",
"svp": "separable verb prefix",
"uc": "unit component",
"vo": "vocative",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
"PERSON": "People, including fictional",
"NORP": "Nationalities or religious or political groups",
"FACILITY": "Buildings, airports, highways, bridges, etc.",
"FAC": "Buildings, airports, highways, bridges, etc.",
"ORG": "Companies, agencies, institutions, etc.",
"GPE": "Countries, cities, states",
"LOC": "Non-GPE locations, mountain ranges, bodies of water",
"PRODUCT": "Objects, vehicles, foods, etc. (not services)",
"EVENT": "Named hurricanes, battles, wars, sports events, etc.",
"WORK_OF_ART": "Titles of books, songs, etc.",
"LAW": "Named documents made into laws.",
"LANGUAGE": "Any named language",
"DATE": "Absolute or relative dates or periods",
"TIME": "Times smaller than a day",
"PERCENT": 'Percentage, including "%"',
"MONEY": "Monetary values, including unit",
"QUANTITY": "Measurements, as of weight or distance",
"ORDINAL": '"first", "second", etc.',
"CARDINAL": "Numerals that do not fall under another type",
# Named Entity Recognition
# Wikipedia
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
"PER": "Named person or family.",
"MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
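A quick way to query this glossary from Python is spacy.explain; the sketch below assumes the small English model en_core_web_sm has been downloaded:

import spacy

# look up a single tag, dependency label or entity type
print(spacy.explain("NORP"))  # Nationalities or religious or political groups
print(spacy.explain("VBZ"))   # verb, 3rd person singular present

# explain every fine-grained tag in a parsed sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello world")
print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])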
Algorithms
POS Tagging with Hidden Markov Models
The HMM is a probabilistic directed graphical model, and its inference is arguably among the most intricate in classical machine learning; decoding relies on dynamic programming (see 《统计学习方法》 for the mathematical details). A hidden Markov model (Hidden Markov Model, HMM) describes the joint distribution p(x, y) of two aligned sequences: the x sequence is visible to the observer and is called the observation sequence, while the y sequence is invisible and is called the state sequence. For instance, the observations x may be words and the states y their parts of speech, and the task is to infer the tags from the word sequence. The model is called "hidden" because, viewed from the outside, the state sequence (e.g., the POS tags) cannot be observed and is exactly the variable to be solved for; accordingly, states are also called hidden states and observations visible states. It is called a "Markov" model because it satisfies the Markov assumption. Going from data to an HMM to predicted tags requires solving three problems: probability evaluation, learning, and prediction; the prediction (decoding) problem is to find the most probable state sequence, i.e., the tag sequence, given the observation sequence, as in the Viterbi sketch below.
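To make the decoding problem concrete, here is a minimal Viterbi decoder for a first-order HMM; the tiny hand-written probabilities at the bottom are invented purely for illustration and are not learned from any corpus.

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for an observation sequence.

    obs     : list of observation indices
    start_p : (S,)   initial state probabilities
    trans_p : (S, S) transition probabilities P(s_t | s_{t-1})
    emit_p  : (S, O) emission probabilities   P(o_t | s_t)
    """
    S = len(start_p)
    T = len(obs)
    delta = np.zeros((T, S))             # best score of any path ending in state s at time t
    back = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] * trans_p[:, s] * emit_p[s, obs[t]]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]]
    # follow backpointers from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: states 0 = noun, 1 = verb; observations 0 = "希望", 1 = "上学" (numbers are made up)
start = np.array([0.6, 0.4])
trans = np.array([[0.3, 0.7],
                  [0.8, 0.2]])
emit = np.array([[0.5, 0.5],
                 [0.6, 0.4]])
print(viterbi([0, 1], start, trans, emit))  # [1, 0]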
POS Tagging with Conditional Random Fields
The CRF is a probabilistic undirected graphical model; its mathematics largely parallels that of the HMM (again, see 《统计学习方法》 for details).
HanLP's CRF implementation runs on the Java virtual machine and is slower than the C++ implementation, so the author recommends installing CRF++ directly on your machine. On macOS this is as simple as brew install crf++; for other platforms, refer to the CRF++ installation documentation.
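For readers who prefer to stay in Python, the sketch below trains a linear-chain CRF POS tagger with the python-crfsuite package (a binding of CRFsuite, not of CRF++ or HanLP); the toy training sentence, the feature template, and the file name toy_pos.crfsuite are all invented for illustration.

import pycrfsuite  # pip install python-crfsuite

def word2features(words, i):
    # very small feature set: current word plus its neighbours
    feats = ["w=" + words[i]]
    feats.append("w-1=" + (words[i - 1] if i > 0 else "<BOS>"))
    feats.append("w+1=" + (words[i + 1] if i < len(words) - 1 else "<EOS>"))
    return feats

# toy segmented-and-tagged sentence (tags follow the PKU-style conventions above)
train = [(["他", "的", "希望", "是", "希望", "上学"],
          ["r",  "u",  "n",    "v",  "v",    "v"])]

trainer = pycrfsuite.Trainer(verbose=False)
for words, tags in train:
    xseq = [word2features(words, i) for i in range(len(words))]
    trainer.append(xseq, tags)
trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
trainer.train("toy_pos.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("toy_pos.crfsuite")
words = ["他", "的", "希望"]
print(tagger.tag([word2features(words, i) for i in range(len(words))]))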
POS Tagging with the Perceptron
The perceptron is essentially a single-layer neural network; stacking several layers yields the multilayer perceptron, the basic building block of deep models (see 《统计学习方法》 for details).
As with Chinese word segmentation, the perceptron can exploit rich contextual features and is therefore usually a better choice than the hidden Markov model; the same holds for POS tagging (a minimal sketch follows).
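The sketch below shows the idea in its simplest form: a greedy, per-token perceptron with hand-written context features (this is not HanLP's structured implementation); the class name, feature template, and toy data are invented for illustration.

from collections import defaultdict

def features(words, i, prev_tag):
    # context features: current word, its neighbours and the previous tag
    return [
        "w=" + words[i],
        "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),
        "w+1=" + (words[i + 1] if i < len(words) - 1 else "<EOS>"),
        "t-1=" + prev_tag,
    ]

class PerceptronTagger:
    def __init__(self, tagset):
        self.tagset = tagset
        self.w = defaultdict(float)  # weights indexed by (feature, tag)

    def score(self, feats, tag):
        return sum(self.w[(f, tag)] for f in feats)

    def predict(self, words):
        tags, prev = [], "<BOS>"
        for i in range(len(words)):
            feats = features(words, i, prev)
            prev = max(self.tagset, key=lambda t: self.score(feats, t))
            tags.append(prev)
        return tags

    def train(self, data, epochs=5):
        for _ in range(epochs):
            for words, gold in data:
                prev = "<BOS>"
                for i, y in enumerate(gold):
                    feats = features(words, i, prev)
                    pred = max(self.tagset, key=lambda t: self.score(feats, t))
                    if pred != y:  # update weights only on mistakes
                        for f in feats:
                            self.w[(f, y)] += 1.0
                            self.w[(f, pred)] -= 1.0
                    prev = y  # use the gold tag as left context during training

data = [(["他", "的", "希望", "是", "希望", "上学"],
         ["r", "u", "n", "v", "v", "v"])]
tagger = PerceptronTagger(tagset=["r", "u", "n", "v"])
tagger.train(data)
print(tagger.predict(["他", "的", "希望", "是", "希望", "上学"]))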
POS Tagging Evaluation

$$\text{Accuracy} = \frac{\text{number of correctly predicted tags}}{\text{total number of tags}}$$
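A direct implementation of this metric (the function name is illustrative):

def tagging_accuracy(gold_tags, pred_tags):
    # token-level accuracy: correctly predicted tags / total tags
    assert len(gold_tags) == len(pred_tags)
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if g == p)
    return correct / len(gold_tags)

print(tagging_accuracy(["r", "u", "n", "v"], ["r", "u", "v", "v"]))  # 0.75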
Code
Hidden Markov Model Implementation
from pyhanlp import *
# corpus: the January 1998 People's Daily (《人民日报》) corpus
from tests.book.ch07.pku import PKU199801_TRAIN

HMMPOSTagger = JClass('com.hankcs.hanlp.model.hmm.HMMPOSTagger')
AbstractLexicalAnalyzer = JClass('com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer')
PerceptronSegmenter = JClass('com.hankcs.hanlp.model.perceptron.PerceptronSegmenter')
FirstOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel')
SecondOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.SecondOrderHiddenMarkovModel')

def train_hmm_pos(corpus, model):
    tagger = HMMPOSTagger(model)  # create the POS tagger
    tagger.train(corpus)          # train it
    # the tagger itself does not segment text, so it only accepts pre-segmented word sequences
    # print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    # wrapping it in a lexical analyzer performs segmentation and POS tagging in one pass
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    # output with abbreviated tags
    # print(analyzer.analyze("今年元旦我要去看升国旗!"))  # segmentation + POS tagging
    # translateLabels() turns the abbreviated tags into human-readable names
    print(analyzer.analyze("他的希望是希望上学").translateLabels())  # segmentation + POS tagging
    return tagger

# first-order hidden Markov model
tagger = train_hmm_pos(PKU199801_TRAIN, FirstOrderHiddenMarkovModel())
# second-order hidden Markov model
tagger = train_hmm_pos(PKU199801_TRAIN, SecondOrderHiddenMarkovModel())
Conditional Random Field Implementation
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import POS_MODEL, PKU199801_TRAIN

CRFPOSTagger = JClass('com.hankcs.hanlp.model.crf.CRFPOSTagger')

def train_crf_pos(corpus):
    # Option 1: train with HanLP's Java API (slow)
    tagger = CRFPOSTagger(None)       # create a blank tagger
    tagger.train(corpus, POS_MODEL)   # train
    tagger = CRFPOSTagger(POS_MODEL)  # load the trained model
    # Option 2: train with CRF++ and load the text model in HanLP
    # (the CRF++ training command is printed by Option 1)
    # tagger = CRFPOSTagger(POS_MODEL + ".txt")
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger

if __name__ == '__main__':
    tagger = train_crf_pos(PKU199801_TRAIN)
Perceptron Implementation
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import PKU199801_TRAIN, POS_MODEL

POSTrainer = JClass('com.hankcs.hanlp.model.perceptron.POSTrainer')
PerceptronPOSTagger = JClass('com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger')

def train_perceptron_pos(corpus):
    trainer = POSTrainer()
    # trainer.train(corpus, POS_MODEL)  # train (uncomment if POS_MODEL has not been built yet)
    tagger = PerceptronPOSTagger(POS_MODEL)  # load the model
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger

if __name__ == '__main__':
    train_perceptron_pos(PKU199801_TRAIN)
Training a Chinese POS Tagger with spaCy
# -*- coding: utf-8 -*-
"""
Created on Fri Apr 1 18:10:18 2022
@author: He Zekai
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.training import Example
from spacy.tokens import Doc
# mapping from the tags produced by LAC (used below) to Chinese display names
TAG_MAP = {
    'n':    {'pos': '普通名词'},   # common noun
    'f':    {'pos': '方位名词'},   # locative noun
    's':    {'pos': '处所名词'},   # place noun
    'nw':   {'pos': '作品名'},     # work/title name
    'nz':   {'pos': '其他专名'},   # other proper noun
    'v':    {'pos': '普通动词'},   # common verb
    'vd':   {'pos': '动副词'},     # adverbial verb
    'vn':   {'pos': '名动词'},     # nominal verb
    'a':    {'pos': '形容词'},     # adjective
    'ad':   {'pos': '副形词'},     # adverbial adjective
    'an':   {'pos': '名形词'},     # nominal adjective
    'd':    {'pos': '副词'},       # adverb
    'm':    {'pos': '数量词'},     # numeral/quantifier
    'q':    {'pos': '量词'},       # measure word
    'r':    {'pos': '代词'},       # pronoun
    'p':    {'pos': '介词'},       # preposition
    'c':    {'pos': '连词'},       # conjunction
    'u':    {'pos': '助词'},       # particle
    'xc':   {'pos': '其他虚词'},   # other function word
    'w':    {'pos': '标点符号'},   # punctuation
    'PER':  {'pos': '人名'},       # person name
    'LOC':  {'pos': '地名'},       # place name
    'ORG':  {'pos': '机构名'},     # organization name
    'TIME': {'pos': '时间'},       # time expression
}
from LAC import LAC  # Baidu LAC is used to produce the (silver-standard) training tags
import pandas as pd

text = []
with open('train_text.txt', encoding='utf8') as f:
    for i in f.readlines():
        text.append(i.strip('\u200b\n'))  # strip zero-width spaces and trailing newlines

random.seed(123)
lac = LAC(mode='lac')
train = random.sample(text, 20)  # randomly sample 20 news items
data = lac.run(train)            # each result is a [words, tags] pair

TRAIN_DATA = []
for i in range(len(data)):
    txt = (data[i][0], {'tags': data[i][1]})
    TRAIN_DATA.append(txt)
# TRAIN_DATA now holds (word list, {'tags': tag list}) pairs
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir="C:/Users/11752/Desktop/大三下/自然语言处理/作业5--自然语言处理作业zip/model/", n_iter=300):
    nlp = spacy.blank(lang)          # blank Chinese pipeline
    tagger = nlp.add_pipe('tagger')  # add an empty tagger component
    print("pipeline:", nlp.pipe_names)
    # register every tag as a label of the tagger
    for tag, values in TAG_MAP.items():
        print("tag:", tag)
        tagger.add_label(tag)
    optimizer = nlp.begin_training()  # initialize the model (deprecated alias of nlp.initialize() in spaCy v3)
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for words, annotations in TRAIN_DATA:
            # build a pre-tokenized Doc so spaCy does not re-segment the text
            example = Example.from_dict(Doc(nlp.vocab, words=words, spaces=[False] * len(words)), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print("iteration:", str(i) + str(losses))

    # tag a few of the training sentences as a sanity check
    test_text = random.sample(train, 5)  # randomly pick 5 of the sampled news items
    test_text = u','.join(test_text)
    doc = nlp(test_text)
    print("doc:", [t.text for t in doc])
    print('Test_Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])

    # save the model to the output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # reload the saved model and test it
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])

if __name__ == '__main__':
    plac.call(main)
References
https://blog.csdn.net/yuebowhu/article/details/112006712
https://blog.csdn.net/weixin_45965387/article/details/123953377