Natural language is a system of rules that humans have formed through communication. Some rules are strong, some weak; for example, informal settings use colloquial speech while formal settings call for written language. To process natural language we must follow these established rules as well, otherwise the results will be incomprehensible. Below is a brief distinction between a few terms.
Grammar: the conventions governing how text is written; it describes a language and its structure and covers both syntactic and lexical rules.
Syntax: the rules for sentence structure, i.e. how constituents are formed and related.
Lexical (morphology): the rules for word formation and inflection.
Part-of-speech tagging, or POS (Part Of Speech) tagging, is a way of analyzing sentence constituents: it identifies the part of speech of every word.
The table below briefly lists the meaning of the tagset; for details see nltk.help.brown_tagset(). A small lookup example follows the table.
Tag | Part of speech | Examples |
---|---|---|
ADJ | Adjective | new, good, high, special, big, local |
ADV | Adverb | really, already, still, early, now |
CONJ | Conjunction | and, or, but, if, while, although |
DET | Determiner | the, a, some, most, every, no |
EX | Existential "there" | there, there's |
MOD | Modal verb | will, can, would, may, must, should |
NN | Noun | year, home, costs, time |
NNP | Proper noun | April, China, Washington |
NUM | Number | fourth, 2016, 09:30 |
PRON | Pronoun | he, they, us |
P | Preposition | on, over, with, of |
TO | The word "to" | to |
UH | Interjection | ah, ha, oops |
VB | Verb | |
VBD | Verb, past tense | made, said, went |
VBG | Present participle | going, lying, playing |
VBN | Past participle | taken, given, gone |
WH | Wh-word | who, where, when, what |
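As a quick check on what a tag means, NLTK ships documentation for each tagset. A minimal lookup sketch (it assumes the 'tagsets' data package has been downloaded):
import nltk
nltk.download('tagsets')        # one-time download of the tagset documentation
nltk.help.brown_tagset('NN')    # describe the Brown corpus tag 'NN'
nltk.help.upenn_tagset('VB.*')  # tags can also be looked up by regular expression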
1. POS tagging English text with NLTK
1.1 A POS tagging example
import nltk
sent = "I am going to Beijing tomorrow."
"""
nltk.sent_tokenize(text) #按句子分割 ,python3分不开句子
nltk.word_tokenize(sentence) #分词
nltk的分词是句子级别的,所以对于一篇文档首先要将文章按句子进行分割,然后句子进行分词:
"""
# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)
['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']
# Part-of-speech tagging
tagged_sent = nltk.pos_tag(words)
tagged_sent
[('I', 'PRP'),
('am', 'VBP'),
('going', 'VBG'),
('to', 'TO'),
('Beijing', 'NNP'),
('tomorrow', 'NN'),
('.', '.')]
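Following the note above that tokenization is sentence-level, here is a minimal sketch of processing a short multi-sentence document (the document text is made up for illustration):
import nltk
doc = "I am going to Beijing tomorrow. The trip was planned last week."
for sentence in nltk.sent_tokenize(doc):    # split the document into sentences
    words = nltk.word_tokenize(sentence)    # tokenize each sentence into words
    print(nltk.pos_tag(words))              # tag each tokenized sentence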
1.2 Pre-tagged corpus data
The corpus classes provide the following methods for returning pre-tagged data.
Method | Description |
---|---|
tagged_words(fileids, categories) | Returns the tagged data as a list of words |
tagged_sents(fileids, categories) | Returns the tagged data as a list of sentences |
tagged_paras(fileids, categories) | Returns the tagged data as a list of paragraphs |
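A minimal sketch of reading pre-tagged data from the Brown corpus with these methods:
from nltk.corpus import brown
print(brown.tagged_words(categories='news')[:5])  # the first few (word, tag) pairs
print(brown.tagged_sents(categories='news')[0])   # the first tagged sentence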
2 Taggers
2.1 The default tagger
The simplest part-of-speech tagger tags every word as a noun, NN. Such a tagger has little value on its own and its accuracy is low, but it provides a baseline. The example below shows how to use the default tagger provided by NLTK.
import nltk
from nltk.corpus import brown
# 加载数据
brown_tagged_sents = brown.tagged_sents(categories='news') # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
brown_sents = brown.sents(categories='news')
# brown_tagged_sents
# The simplest tagger assigns the same tag to every token. This may seem like a rather crude approach,
# but it establishes an important baseline for tagger performance. To do as well as possible,
# we tag every word with the most likely tag; the example below finds which tag that is.
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
tags
['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', '``', 'AT', 'NN', "''", 'CS', 'DTI', 'NNS', 'VBD', 'NN', ..., 'IN', 'NN', '.', 'NP', 'NPS', 'BER', 'VBG', 'JJ', 'NN', 'TO', 'VB', 'AT', 'NN', 'IN', 'AT', 'CD', 'NN$', ...]
tag = nltk.FreqDist(tags).max()
tag
'NN'
# We can now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
[('I', 'NN'),
('am', 'NN'),
('going', 'NN'),
('to', 'NN'),
('Beijing', 'NN'),
('tomorrow', 'NN'),
('.', 'NN')]
default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
2.2 The regular-expression tagger
The regular-expression tagger assigns tags to tokens based on matching patterns. For example, we might assume that any word ending in 'ed' is a past-tense verb form and any word ending in 's is a possessive noun. These assumptions can be expressed as a list of regular expressions, as in the example below.
patterns = [
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # simple past
(r'.*es$', 'VBZ'), # 3rd singular present
(r'.*ould$', 'MD'), # modals
(r'.*\'s$', 'NN$'), # possessive nouns
(r'.*s$', 'NNS'), # plural nouns
(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
(r'.*', 'NN') # nouns (default)
]
The patterns are processed in order, and the first one that matches is used. Now let's build a tagger from them and use it to tag a sentence.
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245 # roughly one tag in five is correct
0.20326391789486245
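To see the individual decisions the pattern list makes, here is a small sketch that tags the example sentence from section 1.1:
sample = nltk.word_tokenize("I am going to Beijing tomorrow.")
print(regexp_tagger.tag(sample))  # 'going' and 'Beijing' both match .*ing$ and get VBG; everything else falls through to NN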
2.3 The lookup tagger
Many high-frequency words are not tagged NN. Let's find the 100 most frequent words and store their most likely tags, then use this information as the model of a "lookup tagger" (an NLTK UnigramTagger), as in the example below:
# First get the words out
fd = nltk.FreqDist(brown.words(categories='news')) # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# A conditional frequency distribution collects the frequency distributions of single experiments run under different conditions;
# it records how many times each sample occurred under a given experimental condition.
# For example, it can record the frequency of every word (type) of a given length in a document.
# Formally, a conditional frequency distribution is a function mapping each condition to the FreqDist for that condition.
cfd =nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items()) # dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})),
# The 100 most frequent words
most_freq_words = fd.keys() # in Python 3.6+, dict_keys must be converted to a list; note the keys are not ordered by frequency (see the note below)
most_freq_words = list(most_freq_words)[:100] # ['The','Fulton','County','Grand','Jury','said','Friday','an',
# Dict comprehension: for each of the top-100 words, take the tag with the highest frequency as that word's tag
likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)
# likely_tags # {'The': 'AT','Fulton': 'NP-TL','County': 'NN-TL','Grand': 'JJ-TL','Jury': 'NN-TL','said': 'VBD','Friday': 'NR',
# A UnigramTagger finds the most likely tag for each word in the training corpus, then uses that information to tag new tokens.
baseline_tagger = nltk.UnigramTagger(model = likely_tags)
baseline_tagger.evaluate(brown_tagged_sents) # 0.3329355371243312
# brown.tagged_words(categories='news') #[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
baseline_tagger.evaluate([brown.tagged_words(categories='news')]) # brown.tagged_words() is wrapped in a list so evaluate() gets a list of tagged sentences  0.3329355371243312
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]]) # individual sentences can score very high  0.972972972972973
0.972972972972973
The result here (about 0.33) differs from the NLTK book, which reports roughly 0.45. The reason is that in Python 3 fd.keys() is not ordered by frequency, so the 100 words selected above are just the first 100 distinct words encountered in the corpus rather than the 100 most frequent ones. With the genuinely most frequent words, knowing the tags of only 100 words is enough to tag a large share of the tokens correctly.
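A minimal sketch of the corrected selection, using FreqDist.most_common() to pick the genuinely most frequent words (the names top100 and lookup_tagger are illustrative, chosen so the baseline_tagger above is left untouched):
top100 = [word for (word, _) in fd.most_common(100)]       # the actual 100 most frequent words
top100_tags = dict((word, cfd[word].max()) for word in top100)
lookup_tagger = nltk.UnigramTagger(model=top100_tags)
print(lookup_tagger.evaluate(brown_tagged_sents))          # should come out close to the ~0.45 reported in the book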
Let's see how it does on untagged input text:
sent = brown.sents(categories='news')[10] #
baseline_tagger.tag(sent)
[('It', 'PPS'),
('urged', None),
('that', 'CS'),
('the', 'AT'),
('city', 'NN'),
('``', '``'),
('take', None),
('steps', None),
('to', 'TO'),
('remedy', None),
("''", "''"),
('this', 'DT'),
('problem', None),
('.', '.')]
Many words have been assigned the tag None because they are not among the 100 most frequent words. In those cases we would rather assign the default tag NN. In other words, first consult the lookup table, and if it cannot assign a tag, fall back to the default tagger; this process is called backoff.
# Set a default tagger to be used when no lookup match is found
baseline_tagger = nltk.UnigramTagger(model = likely_tags,backoff = nltk.DefaultTagger('NN'))
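With the backoff in place, re-tagging the same sentence should replace every None with NN; a quick check (reusing the sent variable from above):
print(baseline_tagger.tag(sent))                     # words outside the lookup table now fall back to NN
print(baseline_tagger.evaluate(brown_tagged_sents))  # accuracy should also improve over the lookup-only tagger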
Finally, let's combine the lookup tagger with the default tagger and see how it performs with models of different sizes:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
def display():
    import pylab
    # as noted above, a FreqDist is not frequency-ordered in Python 3, so rank the words with most_common()
    words_by_freq = [w for (w, _) in nltk.FreqDist(brown.words(categories='news')).most_common()]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
display()
As the model grows, performance rises quickly at first and then levels off; once it has plateaued, further increases in model size bring only small gains.
3 Training N-gram taggers
3.1 General N-gram tagging
The previous section already used a 1-gram, i.e. Unigram, tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers. Note that a longer context does not automatically improve accuracy, because the longer the context, the more likely it is that a context seen at tagging time never occurred in the training data.
Besides giving an N-gram tagger a word-list model, the other way to build one is to train it. The constructor looks roughly like __init__(train=None, model=None, backoff=None, ...), so a tagged corpus can be passed in as training data to build a tagger.
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train = x_train)
print(tagger.evaluate(x_test)) # 0.8121200039868434
0.8121200039868434
With a Unigram tagger trained on 90% of the data, accuracy on the remaining 10% is about 81%. Switching to a Bigram tagger (with no backoff) drops accuracy to roughly 10%, because most bigram contexts in the test data never occur in the training data, so the tagger cannot assign any tag.
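A minimal sketch of that comparison, reusing x_train and x_test from above (the exact number depends on the split):
bigram_tagger = nltk.BigramTagger(train=x_train)
print(bigram_tagger.evaluate(x_test))  # far lower than the unigram tagger: unseen bigram contexts yield no tag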
3.2 Combining taggers
Several taggers can be chained together with the backoff parameter to improve tagging accuracy.
import nltk
from nltk.corpus import brown
pattern = [
(r'.*ing$','VBG'),
(r'.*ed$','VBD'),
(r'.*es$','VBZ'),
(r'.*\'s$','NN$'),
(r'.*s$','NNS'),
(r'.*', 'NN') # anything unmatched is still tagged NN
]
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train,backoff = t0)
t2 = nltk.BigramTagger(x_train,backoff = t1)
print(t2.evaluate(x_test)) # 0.8627529153792485
0.8627529153792485
As the results above show, without any linguistic knowledge, statistics alone are enough to make part-of-speech tagging work reasonably well.
For Chinese, given a tagged corpus, an N-gram tagger can be trained by exactly the same procedure. The chain can also be extended by one more level, as sketched below.
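A minimal sketch of adding a TrigramTagger backed off to the bigram/unigram/regexp chain built above (the incremental gain is usually small):
t3 = nltk.TrigramTagger(x_train, backoff=t2)  # falls back to t2, then t1, then the regexp tagger
print(t3.evaluate(x_test))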
4. Going further
nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a baseline tagger, it applies rule-based corrections to that output to reach higher accuracy.
import nltk
import nltk.tag.brill
import nltk.tag.brill_trainer
from nltk.corpus import brown
pattern = [
(r'.*ing$','VBG'),
(r'.*ed$','VBD'),
(r'.*es$','VBZ'),
(r'.*\'s$','NN$'),
(r'.*s$','NNS'),
(r'.*', 'NN') # anything unmatched is still tagged NN
]
# Split the dataset
brown_tagged_sents = brown.tagged_sents(categories = ['news'])
train_num = int(len(brown_tagged_sents)*0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]
#
baseline_tagger = nltk.UnigramTagger(x_train,backoff = nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train,max_rules=20,min_acc=0.99)
# Evaluate
print(brill_tagger.evaluate(x_test))# 0.8683344961626632
0.8683344961626632
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
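To see what the Brill trainer actually learned, the transformation rules can be inspected; a small sketch (the printed format depends on the NLTK version):
for rule in brill_tagger.rules()[:5]:  # the first few learned transformation rules
    print(rule)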
5. Training a Chinese tagger
Below we train a Chinese part-of-speech tagger based on a Unigram tagger, using the tagged People's Daily corpus for January 1998, which can be downloaded online.
import nltk
import json
lines = open('./词性标注人民日报199801.txt',
encoding = 'utf-8'
).readlines()
all_tagged_sents = []
for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)
        tagged_sent.append(pair)
    if len(tagged_sent) > 0:
        all_tagged_sents.append(tagged_sent)
train_size = int(len(all_tagged_sents)*0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))  # note: nltk.str2tuple() uppercases tags, so 'N' would match the corpus tags exactly
print(tagger.evaluate(x_test)) # 0.8714095491725319
"""
line:
19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n (/w 附/v 图片/n 1/m 张/q )/w
line.split():
'\nline:\n19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n (/w 附/v 图片/n 1/m 张/q )/w \n'
tagged_sent:
[('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('(', 'W'), ('附', 'V'), ('图片', 'N'), ('1', 'M'), ('张', 'Q'), (')', 'W')]
"""
0.8714095491725319
"\nline:\n19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n (/w 附/v 图片/n 1/m 张/q )/w \n\nline.split():\n'\nline:\n19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n (/w 附/v 图片/n 1/m 张/q )/w \n'\n\ntagged_sent:\n[('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('(', 'W'), ('附', 'V'), ('图片', 'N'), ('1', 'M'), ('张', 'Q'), (')', 'W')]\n"
6. Related Brown corpus methods
# List of file IDs in the corpus
brown.fileids()
['ca01', 'ca02', 'ca03', 'ca04', ..., 'cp17', 'cp18', 'cp19', 'cp20', 'cp21', 'cp22', 'cp23', 'cp24', 'cp25', 'cp26', 'cp27', 'cp28', 'cp29', 'cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']
# File IDs for the given category ('news')
brown.fileids('news')
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', ..., 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44']
# Raw (tagged) text for the given categories
brown.raw(categories=['news'])
"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta’s/np$ recent/jj primary/nn election/nn produced/vbd /
no/at evidence/nn ‘’/‘’ that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, /
deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ‘’/‘’ for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.\n\n\n\tThe/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl
…
Steve/np Barber/np joined/vbd the/at club/nn one/cd week/nn ago/rb after/cs completing/vbg his/pp$ hitch/nn under/in the/at Army’s/nnKaTeX parse error: Undefined control sequence: \nThe at position 108: … ,/, Ky./np ./.\̲n̲T̲h̲e̲/at 22-year-old… bulky/jj spring-training/nn contingent/nn now/rb gradually/rb will/md be/be reduced/vbn as/cs Manager/nn-tl Paul/np Richards/np and/cc his/pp$ coaches/nns seek/vb to/to trim/vb it/ppo down/rp to/in a/at more/ql streamlined/vbn and/cc workable/jj unit/nn ./.\n\n\n\n\n/
Take/vb a/at ride/nn on/in this/dt one/cd ‘’/‘’ ,/, Brooks/np Robinson/np greeted/vbd Hansen/np as/cs the/at Bird/np third/od sacker/nn grabbed/vbd a/at bat/nn ,/, headed/vbd for/in the/at plate/nn and/cc bounced/vbd a/at third-inning/nn two-run/jj double/nn off/in the/at left-centerfield/nn wall/nn tonight/nr ./.\n\n\n\tIt/pps was/bedz the/at first/od of/in two/cd doubles/nns by/in Robinson/np ,/, who/wps was/bedz in/in a/at mood/nn to/to celebrate/vb ./.\n\n\n\tJust/rb before/in game/nn time/nn ,/, Robinson’s/np$ pretty/jj wife/nn ,/, Connie/np informed/vbd him/ppo that/cs an/at addition/nn to/in the/at family/nn can/md be/be expected/vbn late/jj next/ap summer/nn ./.\n\n\n\tUnfortunately/rb ,/, Brooks’s/np$ teammates/nns were/bed not/* in/in such/ql festive/jj mood/nn as/cs the/at Orioles/nps expired/vbd before/in the/at seven-hit
# Raw text for the given file IDs
brown.raw(fileids=['ca01','ca02'])
"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl ... for/in-hl extension/nn-hl \nOther/ap recommendations/nns made/vbn by/in the/at committee/nn are/ber :/: \n\n\tExtension/nn of/in the/at ADC/nn program/nn to/in all/abn children/nns in/in need/nn living/vbg with/in any/dti relatives/nns ,/, including/in both/abx parents/nns ,/, as/cs a/at means/nns of/in preserving/vbg family/nn unity/nn ./.\n\n\n\tResearch/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"
# List of sentences for the given file IDs
brown.sents(fileids=['ca01','ca02'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# List of sentences for the given categories
brown.sents(categories=['news'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
# List of words for the given file IDs
brown.words('ca01')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# List of words for the given categories
brown.words(categories=['news'])
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
# Tagged sentences for the given categories, as a 2-D list of (word, tag) pairs
brown.tagged_sents(categories=['news'])
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')], ...]
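Two more calls worth knowing, sketched below: listing the corpus categories, and mapping the Brown tags onto the simplified universal tagset via the tagset parameter (the mapping may require nltk.download('universal_tagset')):
print(brown.categories())  # e.g. ['adventure', 'belles_lettres', ..., 'news', ...]
print(brown.tagged_words(categories='news', tagset='universal')[:5])  # Brown tags mapped to the universal tagset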