NLTK 2: Part-of-Speech Tagging



Natural language is a system of rules that humans have formed through communication. The rules vary in strength: informal occasions call for colloquial speech, formal occasions for written language. Processing natural language must follow these rules as well, or the results will be incomprehensible. A few terms are worth distinguishing:
Grammar: the writing conventions of a language; it describes the language and its structure, and covers both syntax and lexical rules.
Syntax: the rules for sentence structure and for the composition of, and relations between, sentence constituents.
Lexical rules: the rules for word formation and inflection.

Part-of-speech (POS) tagging is a method of analyzing sentence constituents: it identifies the part of speech of each word.

Below is a brief summary of the meanings of the POS tagset; see nltk.help.brown_tagset() for details (a lookup sketch follows the table).

Tag     Part of speech        Examples
ADJ     adjective             new, good, high, special, big, local
ADV     adverb                really, already, still, early, now
CONJ    conjunction           and, or, but, if, while, although
DET     determiner            the, a, some, most, every, no
EX      existential           there, there's
MOD     modal verb            will, can, would, may, must, should
NN      noun                  year, home, costs, time
NNP     proper noun           April, China, Washington
NUM     number                fourth, 2016, 09:30
PRON    pronoun               he, they, us
P       preposition           on, over, with, of
TO      the word "to"         to
UH      interjection          ah, ha, oops
VB      verb
VBD     past tense            made, said, went
VBG     present participle    going, lying, playing
VBN     past participle       taken, given, gone
WH      wh determiner         who, where, when, what
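These tag meanings can also be looked up programmatically. A minimal sketch using NLTK's built-in help functions (both live in nltk.help and need the 'tagsets' data package):

import nltk

# Requires: nltk.download('tagsets')
# Show documentation and examples for Brown-corpus tags matching a pattern
nltk.help.brown_tagset('NNS?')
# nltk.pos_tag (used below) returns Penn Treebank tags, which have their own help:
nltk.help.upenn_tagset('VB.*')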

1. POS tagging English with NLTK

1.1 A POS tagging example

import nltk

sent = "I am going to Beijing tomorrow."

"""
nltk.sent_tokenize(text) #按句子分割 ,python3分不开句子
nltk.word_tokenize(sentence) #分词 
nltk的分词是句子级别的,所以对于一篇文档首先要将文章按句子进行分割,然后句子进行分词: 
"""
'\nnltk.sent_tokenize(text) #按句子分割 ,python3分不开句子\nnltk.word_tokenize(sentence) #分词 \nnltk的分词是句子级别的,所以对于一篇文档首先要将文章按句子进行分割,然后句子进行分词: \n'
# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)
['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']
# POS tagging
tagged_sent = nltk.pos_tag(words)
tagged_sent
[('I', 'PRP'),
 ('am', 'VBP'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('Beijing', 'NNP'),
 ('tomorrow', 'NN'),
 ('.', '.')]
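By default nltk.pos_tag returns fine-grained Penn Treebank tags (PRP, VBP, ...). To get coarse tags close to the simplified set in the table above, pos_tag also accepts tagset='universal'; a sketch (the expected output is my reading of the standard Penn-to-universal mapping):

# Map fine-grained Penn Treebank tags onto the coarse universal tagset.
# Requires: nltk.download('universal_tagset')
nltk.pos_tag(words, tagset='universal')
# e.g. [('I', 'PRON'), ('am', 'VERB'), ('going', 'VERB'), ('to', 'PRT'),
#       ('Beijing', 'NOUN'), ('tomorrow', 'NOUN'), ('.', '.')]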

1.2 Pre-tagged corpus data

The corpus classes provide the following methods for returning pre-tagged data; see the example after the table.

Method                               Description
tagged_words(fileids, categories)    tagged data as a list of words
tagged_sents(fileids, categories)    tagged data as a list of sentences
tagged_paras(fileids, categories)    tagged data as a list of paragraphs
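For example, with the Brown corpus (used throughout the rest of this article):

from nltk.corpus import brown

brown.tagged_words(categories='news')[:3]
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
brown.tagged_sents(categories='news')[0]
# the first sentence as a list of (word, tag) pairs:
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]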

2 Taggers

2.1 The default tagger

The simplest POS tagger tags every word as a noun (NN). Such a tagger has little practical value and very low accuracy, but it establishes a baseline. The following demonstrates NLTK's default tagger.

import nltk
from nltk.corpus import brown
# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news') # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), 
brown_sents = brown.sents(categories='news')
# brown_tagged_sents
# The simplest tagger assigns the same tag to every token. This seems like a
# rather crude approach, but it establishes an important baseline for tagger
# performance. To do as well as possible, we tag every word with the most
# likely tag. The following finds out which tag that is.
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
tags
['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', ...]
tag = nltk.FreqDist(tags).max()
tag
'NN'
# We can now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
[('I', 'NN'),
 ('am', 'NN'),
 ('going', 'NN'),
 ('to', 'NN'),
 ('Beijing', 'NN'),
 ('tomorrow', 'NN'),
 ('.', 'NN')]
default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
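The 0.13 figure is no accident: tagging everything as NN is correct exactly as often as NN occurs, so the default tagger's accuracy equals the relative frequency of 'NN' in the corpus. A quick check:

# Relative frequency of 'NN' among all tags in the news category
nltk.FreqDist(tags).freq('NN')
# ≈ 0.13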

2.2 The regular-expression tagger

The regular-expression tagger assigns tags to tokens based on matching patterns. For example, we might guess that any word ending in ed is a past-tense verb and any word ending in 's is a possessive noun. Such guesses can be expressed as a list of regular expressions, as below.

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (note the escaped dot)
    (r'.*', 'NN')                      # nouns (default)
]

The patterns are processed in order, and the first one that matches is used. We can now build a tagger and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245: about one in five tags is correct
0.20326391789486245

2.3 The lookup tagger

Many of the most frequent words do not take the tag NN. Let's find the 100 most frequent words and store the most likely tag for each. We can then use this information as the model for a "lookup tagger" (NLTK's UnigramTagger), as below:

# Frequency distribution of the words
fd = nltk.FreqDist(brown.words(categories='news')) # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A conditional frequency distribution collects the frequency distributions of
# experiments run under different conditions, recording how often each sample
# occurred under each condition. For example, it can record the frequency of
# each word (type) of a given length in a document. Formally, it is a function
# mapping each condition to that condition's FreqDist.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items()) # dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), 

# "Top-100" words. In Python 3, dict_keys must be converted to a list; note
# that in NLTK 3 FreqDist.keys() is NOT sorted by frequency, so this is not
# really the top-100 list (see the note and fix below).
most_freq_words = list(fd.keys())[:100] # ['The','Fulton','County','Grand','Jury','said','Friday','an',

# Dict comprehension: for each of the 100 words, take its most frequent tag as its tag
likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)
# likely_tags # {'The': 'AT','Fulton': 'NP-TL','County': 'NN-TL','Grand': 'JJ-TL','Jury': 'NN-TL','said': 'VBD','Friday': 'NR',

# Given a model, UnigramTagger looks up the most likely tag for each word and
# uses that information to tag new text.
baseline_tagger = nltk.UnigramTagger(model = likely_tags)

baseline_tagger.evaluate(brown_tagged_sents) # 0.3329355371243312
# brown.tagged_words(categories='news') # [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
baseline_tagger.evaluate([brown.tagged_words(categories='news')]) # tagged_words() returns a flat list, so wrap it in another list: 0.3329355371243312
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]]) # individual sentences can score very high: 0.972972972972973
0.972972972972973

The result here differs from the book, which reports roughly 0.45, i.e. knowing the tags of just the 100 most frequent words already lets us tag a large share of tokens correctly. The likely cause is that in NLTK 3 FreqDist.keys() is no longer sorted by frequency, so the 100 words selected above are not actually the most frequent ones; see the sketch below.
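A sketch of the fix, assuming the discrepancy is indeed the unsorted keys(): FreqDist.most_common() does return samples sorted by frequency, and with it the book's figure should be recovered.

# Take the 100 genuinely most frequent words
most_freq_words = [word for (word, _) in fd.most_common(100)]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)  # should be roughly 0.45, as in the book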

Let's see how it performs on untagged input text:

sent = brown.sents(categories='news')[10]
baseline_tagger.tag(sent)
[('It', 'PPS'),
 ('urged', None),
 ('that', 'CS'),
 ('the', 'AT'),
 ('city', 'NN'),
 ('``', '``'),
 ('take', None),
 ('steps', None),
 ('to', 'TO'),
 ('remedy', None),
 ("''", "''"),
 ('this', 'DT'),
 ('problem', None),
 ('.', '.')]

Many words were assigned the tag None because they are not among the 100 most frequent words. In these cases we would like to assign the default tag NN: first consult the lookup table, and if it cannot assign a tag, fall back to the default tagger. This process is called "backoff".

# Set a default tagger to be used when no lookup entry matches
baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
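Re-tagging the sentence from above, the words that previously got None should now fall back to NN (a quick check, reusing sent from above):

baseline_tagger.tag(sent)
# e.g. [('It', 'PPS'), ('urged', 'NN'), ('that', 'CS'), ('the', 'AT'), ...]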

Finally, having combined the lookup tagger with the default tagger, let's see how it performs with models of different sizes:

def performance(cfd,wordlist):
    lt = dict((word,cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
 
def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    prefs = [performance(cfd,words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes,prefs,'-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
 
display()

[Figure: Lookup Tagger Performance with Varying Model Size]

As the model grows, performance rises quickly at first and then levels off; beyond that point, even large increases in model size yield only marginal gains.

3 Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram, i.e. unigram, tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers. Note that longer context does not automatically improve accuracy, since the training data becomes increasingly sparse.
Besides supplying an N-gram tagger with a lookup model, the other way to build one is training. The constructor of an N-gram tagger is __init__(train=None, model=None, backoff=None); tagged corpus data can be passed as training data to build a tagger.

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train = x_train)
print(tagger.evaluate(x_test)) # 0.8121200039868434
0.8121200039868434

With a unigram tagger trained on 90% of the data, accuracy on the remaining 10% is about 81%. Switching to a bigram tagger drops accuracy to roughly 10%, because a standalone bigram tagger assigns None to any word whose context never appeared in the training data.
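To see why, here is a sketch of a standalone bigram tagger: it conditions on the previous tag, so any (previous tag, word) context unseen in training yields None, and once one tag is None, every following tag in the sentence is None as well.

t_bi = nltk.BigramTagger(train=x_train)  # no backoff
print(t_bi.evaluate(x_test))             # roughly 0.10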

3.2 Combining taggers

The backoff parameter lets us chain several taggers together to improve accuracy.

import nltk
from nltk.corpus import brown
pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything unmatched is tagged NN
]
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train,backoff = t0)
t2 = nltk.BigramTagger(x_train,backoff = t1)
print(t2.evaluate(x_test)) # 0.8627529153792485
0.8627529153792485

As the above shows, with no linguistic knowledge at all, statistics alone can make POS tagging work reasonably well.
For Chinese, given a tagged corpus, an N-gram tagger can be trained by the same procedure.

4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from a base tagger's output, it applies rule-based corrections to achieve higher accuracy.

import nltk
import nltk.tag.brill
from nltk.corpus import brown

pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything unmatched is tagged NN
]
# Split the dataset
brown_tagged_sents = brown.tagged_sents(categories = ['news'])
train_num = int(len(brown_tagged_sents)*0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]
# Base tagger: unigram with regexp backoff
baseline_tagger = nltk.UnigramTagger(x_train,backoff = nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train,max_rules=20,min_acc=0.99)
# Evaluate
print(brill_tagger.evaluate(x_test))# 0.8683344961626632                                    
0.8683344961626632
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])             # gold-standard tags
print(brill_tagger.tag(brown_sents[2007]))  # Brill tagger output
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]

The two outputs differ only in the tag for 'so': the gold standard has CS, while the Brill tagger produced QL.
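The transformations the trainer learned can be inspected directly; BrillTagger exposes them via rules(). A minimal sketch:

# Print the first few learned transformation rules; each one rewrites a tag
# in a specific context, e.g. replacing one tag with another when the
# preceding tag matches a condition.
for rule in brill_tagger.rules()[:5]:
    print(rule)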

5. Training a Chinese tagger

Below we train a Chinese POS tagger based on a unigram model, using the January 1998 People's Daily tagged corpus, which can be downloaded online.

import nltk
import json

lines = open('./词性标注人民日报199801.txt', encoding='utf-8').readlines()
all_tagged_sents = []

for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)
        tagged_sent.append(pair)
    
    if len(tagged_sent)>0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents)*0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]

# Note: nltk.str2tuple() uppercases tags ('N', 'V', ...), so the default tag
# 'n' below never matches the gold tag 'N'; 'N' would be more consistent.
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))
print(tagger.evaluate(x_test)) # 0.8714095491725319
"""
line:
19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  (/w  附/v  图片/n  1/m  张/q  )/w 

line.split():
'\nline:\n19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  (/w  附/v  图片/n  1/m  张/q  )/w  \n'

tagged_sent:
[('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('(', 'W'), ('附', 'V'), ('图片', 'N'), ('1', 'M'), ('张', 'Q'), (')', 'W')]
"""
0.8714095491725319
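The trained tagger expects a pre-segmented list of words, so new text must be word-segmented first. A sketch with a hypothetical, already-segmented sentence (the actual tags depend on the training data; unseen words fall back to the default tag):

tagger.tag('我们 是 好 朋友'.split())
# hypothetical output: [('我们', 'R'), ('是', 'V'), ('好', 'A'), ('朋友', 'N')]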

6. Brown corpus methods

# List of corpus file IDs
brown.fileids()
['ca01', 'ca02', 'ca03', 'ca04', ..., 'cr07', 'cr08', 'cr09']
# File IDs for a given category ('news')
brown.fileids('news')
['ca01', 'ca02', 'ca03', 'ca04', ..., 'ca42', 'ca43', 'ca44']
# Raw text for the given categories
brown.raw(categories=['news'])

"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta’s/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ‘’/‘’ that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, / deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ‘’/‘’ for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.\n\n\n\tThe/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl

Steve/np Barber/np joined/vbd the/at club/nn one/cd week/nn ago/rb after/cs completing/vbg his/pp$ hitch/nn under/in the/at Army’s/nnKaTeX parse error: Undefined control sequence: \nThe at position 108: … ,/, Ky./np ./.\̲n̲T̲h̲e̲/at 22-year-old… bulky/jj spring-training/nn contingent/nn now/rb gradually/rb will/md be/be reduced/vbn as/cs Manager/nn-tl Paul/np Richards/np and/cc his/pp$ coaches/nns seek/vb to/to trim/vb it/ppo down/rp to/in a/at more/ql streamlined/vbn and/cc workable/jj unit/nn ./.\n\n\n\n\n/ Take/vb a/at ride/nn on/in this/dt one/cd ‘’/‘’ ,/, Brooks/np Robinson/np greeted/vbd Hansen/np as/cs the/at Bird/np third/od sacker/nn grabbed/vbd a/at bat/nn ,/, headed/vbd for/in the/at plate/nn and/cc bounced/vbd a/at third-inning/nn two-run/jj double/nn off/in the/at left-centerfield/nn wall/nn tonight/nr ./.\n\n\n\tIt/pps was/bedz the/at first/od of/in two/cd doubles/nns by/in Robinson/np ,/, who/wps was/bedz in/in a/at mood/nn to/to celebrate/vb ./.\n\n\n\tJust/rb before/in game/nn time/nn ,/, Robinson’s/np$ pretty/jj wife/nn ,/, Connie/np informed/vbd him/ppo that/cs an/at addition/nn to/in the/at family/nn can/md be/be expected/vbn late/jj next/ap summer/nn ./.\n\n\n\tUnfortunately/rb ,/, Brooks’s/np$ teammates/nns were/bed not/* in/in such/ql festive/jj mood/nn as/cs the/at Orioles/nps expired/vbd before/in the/at seven-hit

# Raw text for the given file IDs
brown.raw(fileids=['ca01','ca02'])
"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl ... for/in-hl extension/nn-hl \nOther/ap recommendations/nns made/vbn by/in the/at committee/nn are/ber :/: \n\n\tExtension/nn of/in the/at ADC/nn program/nn to/in all/abn children/nns in/in need/nn living/vbg with/in any/dti relatives/nns ,/, including/in both/abx parents/nns ,/, as/cs a/at means/nns of/in preserving/vbg family/nn unity/nn ./.\n\n\n\tResearch/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"
# Sentences for the given file IDs
brown.sents(fileids=['ca01','ca02'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Sentences for the given categories
brown.sents(categories=['news'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# Words for the given file ID
brown.words('ca01')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Words for the given categories
brown.words(categories=['news'])
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Tagged sentences: a list of sentences, each a list of (word, tag) pairs
brown.tagged_sents(categories=['news'])
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')], ...]
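Also useful: brown.categories() lists the category labels that the categories= parameter above accepts.

# All category labels in the Brown corpus
brown.categories()
# ['adventure', 'belles_lettres', 'editorial', ..., 'news', ..., 'science_fiction']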