Python corpus tagging: [Natural Language Processing with Python] 5.1 Using a Tagger / 5.2 Tagged Corpora

What is part-of-speech tagging?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, or POS tagging.

Parts of speech are also known as word classes or lexical categories.

The collection of tags used for a particular task is known as a tagset.

5.1 Using a Tagger

A simple example of using a part-of-speech tagger:

import nltk

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

# look up the documentation for a particular tag
nltk.help.upenn_tagset('RB')
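With the default tagger, the pos_tag call above produces roughly the following (the exact tags can vary slightly between NLTK and tagger-model versions):

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]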

5.2 Tagged Corpora

Representing Tagged Tokens

# convert the standard string representation 'word/tag' into a (word, tag) tuple
tagged_token = nltk.tag.str2tuple('fly/NN')
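The result is an ordinary (word, tag) tuple, so the word and the tag can be read back by indexing:

>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'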

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Reading Tagged Corpora

# note that not all corpora use the same tagset
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]
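Note that simplify_tags=True only exists in the older NLTK 2.x releases that this post follows. On NLTK 3.x (an assumption about your installed version) the roughly equivalent call uses the universal tagset, whose tag names differ slightly (NOUN instead of N, and so on):

>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ...]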

The Simplified Part-of-Speech Tagset

Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

Nouns

In the simplified tagset, common nouns are tagged N and proper nouns are tagged NP.

# build a list of bigrams over the tagged words to see which tags precede a noun
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
word_tag_pairs = nltk.bigrams(brown_news_tagged)
list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]

From this we can see which parts of speech tend to occur immediately before a noun.
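For readers on NLTK 3.x, here is a minimal sketch of the same analysis (an assumption about your installed version): the simplified tags are replaced by the universal tagset, where nouns are labeled 'NOUN'.

import nltk
from nltk.corpus import brown

# NLTK 3 variant: the universal tagset labels nouns 'NOUN'
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])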

Verbs, adjectives and adverbs (see the related material for details; a brief sketch of the verb analysis is given below).
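As a pointer, here is a minimal sketch of the verb part of that analysis, assuming the Penn Treebank sample that ships with NLTK and the same old simplify_tags API used above (on NLTK 3.x, substitute tagset='universal' and the 'VERB' tag):

import nltk

# count (word, tag) pairs in the treebank sample and list the verb forms
wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
word_tag_fd = nltk.FreqDist(wsj)
print([word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')])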

Unsimplified Tags

Let's begin with a program that finds the most frequent nouns for each noun tag.

NN is the base noun tag:
tags containing $ mark possessive nouns
tags containing S mark plural nouns
tags containing P mark proper nouns
Suffix modifiers:
-NC marks citations
-HL marks words in headlines
-TL marks words in titles

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...

NN ['year', 'time', 'state', 'week', 'man']
NN$ ["year's", "world's", "state's", "nation's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "University's", "League's", "Gallery's", "Army's"]
NN-HL ['cut', 'Salary', 'condition', 'Question', 'business']
NN-NC ['eva', 'ova', 'aya']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'City', 'Commissioner', 'Grove', 'House']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "men's", "janitors'", "taxpayers'"]
NNS$-HL ["Dealers'", "Idols'"]
NNS$-TL ["Women's", "States'", "Giants'", "Officers'", "Bombers'"]
NNS-HL ['years', 'idols', 'Creations', 'thanks', 'centers']
NNS-TL ['States', 'Nations', 'Masters', 'Rules', 'Communists']
NNS-TL-HL ['Nations']
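The listing above follows the Python 2 / NLTK 2 code from the book; cfd[tag].keys()[:5] is not sliceable under Python 3. A sketch of the same function for Python 3 / NLTK 3.x (an assumption about your environment) could use most_common() instead:

import nltk

def findtags(tag_prefix, tagged_text):
    # condition on the tag; count the words observed under each matching tag
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # most_common(5) replaces keys()[:5], which no longer works in Python 3
    return dict((tag, [w for (w, _) in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))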

Searching Tagged Corpora

Supplement: an example of how nltk.ibigrams() is used:

>>> from nltk.util import ibigrams
>>> list(ibigrams([1, 2, 3, 4, 5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

After checking, nltk.bigrams and nltk.ibigrams are used in the same way.
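For readers on NLTK 3.x (again, an assumption about the installed version): ibigrams has been removed there, and nltk.bigrams itself returns a generator, so the equivalent call is:

>>> import nltk
>>> list(nltk.bigrams([1, 2, 3, 4, 5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]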

The following example shows how to study the usage of the word often by looking at the words that follow it:

# looking only at the words that follow "often" is rather crude; combining them with POS tags is more informative
brown_learned_text = brown.words(categories='learned')
sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

# use the POS tag information to study the usage of "often"
brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

# use POS tags to find three-word verb-to-verb phrases
from nltk.corpus import brown
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print w1, w2, w3

>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
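The loop above uses the Python 2 print statement from the book. A sketch of the same search for Python 3 (keeping the original Brown tags, where infinitival to is tagged TO) could look like this:

import nltk
from nltk.corpus import brown

def process(sentence):
    # scan each tagged sentence for verb + "to" + verb sequences
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)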
