Python NLTK 5: Categorizing and Tagging Words

Latest recommended article published on 2021-04-21 21:39:43

lakomi

Category: NLTK. Tags: python, natural language processing, nltk

Article link: https://blog.csdn.net/Q_s_qiu/article/details/106918850

5 Categorizing and Tagging Words

Categorizing and Tagging Words

English documentation: http://www.nltk.org/book/
Chinese documentation: https://www.bookstack.cn/read/nlp-py-2e-zh/0.md
The section numbers below follow my own convention.

1 Using a Tagger

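A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each one.
NLTK provides the tagger pos_tag(), which takes a list of words (tokens) as its argument.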

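import nltk

# Requires the tokenizer and tagger resources, installed via nltk.download()
text = nltk.word_tokenize("And now for something completely different")
tag_result = nltk.pos_tag(text)
# Tag meanings: CC = coordinating conjunction, RB = adverb, IN = preposition, NN = noun, JJ = adjective
print(tag_result)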
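
To make the tag abbreviations easier to read, the tagged pairs can be joined with a small lookup table. The sketch below is plain Python and needs no NLTK data. It makes two assumptions: `sample_result` hard-codes the output that the NLTK book reports for this sentence (a different tagger model could give different tags), and `TAG_NAMES` is a hypothetical helper table that covers only the five tags mentioned above.

```python
# Output reported in the NLTK book for this sentence; hard-coded here
# (an assumption) so the sketch runs without downloading a tagger model.
sample_result = [
    ('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
    ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'),
]

# Hypothetical lookup table: only the tags that occur in this sentence.
TAG_NAMES = {
    'CC': 'coordinating conjunction',
    'RB': 'adverb',
    'IN': 'preposition',
    'NN': 'noun',
    'JJ': 'adjective',
}

# Attach a readable name to every (word, tag) pair; unknown tags fall back to 'unknown'.
described = [(word, tag, TAG_NAMES.get(tag, 'unknown')) for word, tag in sample_result]
for word, tag, name in described:
    print(f'{word:<12}{tag:<4}{name}')
```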

2 Tagged Corpora

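A tagged token is represented as a tuple of the form (token, tag). The str2tuple() function converts the string form of a tagged token, such as word/tag, into such a tuple. For example: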

tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)         # ('fly', 'NN')
print(tagged_token[0])      # fly
print(tagged_token[1])      # NN

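Besides building a tuple from one short string, a long string can be turned into a list of tagged tokens by splitting it and converting each piece. The code is as follows:

# Triple quotes let a string literal span multiple lines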
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
# split() with no arguments treats any whitespace (spaces, tabs, newlines) as a separator
tagged_token_1 = [nltk.tag.str2tuple(string) for string in sent.split()]
print(tagged_token_1)
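
To see what the conversion does to each piece, here is a minimal pure-Python sketch of the same idea. `split_tagged_token` is a hypothetical stand-in written for illustration, not NLTK's own code; it assumes, as I understand str2tuple() to behave, that the string is split at the last separator and that the tag is upper-cased.

```python
def split_tagged_token(token, sep='/'):
    """Split 'word/TAG' at the LAST separator (hypothetical stand-in for str2tuple)."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: treat the whole string as an untagged word.
        return (token, None)
    # Upper-casing the tag is an assumption about how str2tuple normalizes tags.
    return (word, tag.upper())


sample = "The/AT grand/JJ jury/NN Fulton/NP-tl ''/'' ./."
pairs = [split_tagged_token(piece) for piece in sample.split()]
print(pairs)
```

Splitting at the last separator matters for tokens whose word part could itself contain a slash; splitting at the first one would cut such a word in half.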

3 Reading Tagged Corpora

If a corpus contains tagged text, the tagged_words() method of the NLTK corpus reader returns it.

# Read a tagged corpus
tagged_words_1 = nltk.corpus.brown.tagged_words()
print(tagged_words_1)
# Not all corpora use the same tag set, so pass tagset='universal' to get the words labeled with the universal part-of-speech tags
tagged_words_2 = nltk.corpus.brown.tagged_words(tagset='universal')
print(tagged_words_2)
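
A common next step with a list of (word, tag) pairs is to count how often each tag occurs. The sketch below uses collections.Counter from the standard library. `sample_tagged` is a hand-written list in the same shape that tagged_words() returns; its words and tags are illustrative only and are not read from the Brown corpus.

```python
from collections import Counter

# Hand-written sample in the (word, tag) shape that tagged_words() yields;
# the words and universal tags here are illustrative, not read from Brown.
sample_tagged = [
    ('The', 'DET'), ('jury', 'NOUN'), ('said', 'VERB'),
    ('the', 'DET'), ('election', 'NOUN'), ('was', 'VERB'),
    ('fair', 'ADJ'), ('.', '.'),
]

# Count tags only, ignoring the words.
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common())
```

With the real corpus, the same line should work by replacing `sample_tagged` with the list returned by tagged_words(tagset='universal').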

4 Unsimplified Tags

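dict (dictionary): a dict is a mutable data type written as {key: value, key: value}. Keys must be of an immutable (hashable) type, while values may be of any type. Note that string keys and values must be quoted; single or double quotes both work.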
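
The rule about immutable keys can be checked directly. In this small sketch the dictionary `tag_of` and its entries are made up for illustration: a string and a tuple are accepted as keys, while a list is rejected with a TypeError.

```python
# Strings and tuples are immutable, so both work as keys.
tag_of = {'fly': 'NN', ('New', 'York'): 'NP'}
tag_of['fly'] = 'VB'  # values can be replaced freely

# A list is mutable and therefore unhashable; using it as a key fails.
try:
    tag_of[['New', 'York']] = 'NP'
    list_key_rejected = False
except TypeError:
    list_key_rejected = True

print(tag_of)
print(list_key_rejected)
```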

5 Mapping Words to Properties Using Python Dictionaries

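This section is mainly about using dictionaries. Keys in a dictionary cannot repeat, and they must be of an immutable type.
To start, here is how to create a dictionary. There are two ways to define one (the first is more common).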

# Method 1: create a dictionary with a key-value literal
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
# Method 2: call dict() with keyword arguments
pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
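
The two forms build equal dictionaries, but the keyword form only accepts keys that are valid Python identifiers. The sketch below checks the equality and shows one way around the restriction, passing dict() a list of pairs. The names `pos_literal`, `pos_keyword` and `pos_pairs`, and the entries in `pos_pairs`, are made up for illustration.

```python
pos_literal = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos_keyword = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

# Both constructions produce the same mapping.
same = pos_literal == pos_keyword
print(same)

# Keys containing a space or hyphen are not valid identifiers, so they cannot
# be written as keyword arguments; a list of (key, value) pairs works instead.
pos_pairs = dict([('New York', 'NP'), ('well-known', 'ADJ')])
print(pos_pairs)
```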

Add key-value pairs to it:

# Add key-value pairs
pos = {}
print(pos)    # {}
pos['colorless'] = 'ADJ'
print(pos)      # {'colorless': 'ADJ'}
pos['ideas'] = 'N'
pos['sleep'] = 'V'
print(pos)    # {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V'}
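
Because keys cannot repeat, assigning to a key that already exists replaces its value instead of adding a second entry. A minimal sketch (the dictionary `pos_demo` and the tag values are illustrative):

```python
pos_demo = {}
pos_demo['sleep'] = 'V'
pos_demo['sleep'] = 'N'  # same key again: the old value 'V' is overwritten

size_after = len(pos_demo)
print(pos_demo)
print(size_after)
```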

Dictionary lookup is done by key:

# Look up a value by its key
value_1 = pos['ideas']
print(value_1)    # N
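
Looking up a key that is not in the dictionary raises a KeyError. The sketch below shows three standard ways to deal with that: testing membership with `in`, supplying a default through get(), and catching the exception. The word 'green' and the fallback tag 'UNK' are made-up placeholders.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V'}

# Membership test: checks keys, not values.
has_green = 'green' in pos

# get() returns the given default instead of raising when the key is missing.
fallback = pos.get('green', 'UNK')

# Plain indexing raises KeyError for a missing key.
try:
    pos['green']
    raised = False
except KeyError:
    raised = True

print(has_green, fallback, raised)
```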
