Python word library: tagging words

This article covers word tagging: how to use it, practical tips, worked examples, and points to watch for. It is meant as a reference.

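Tagging is a basic step in text processing: each word is labeled with its grammatical category (its part of speech). The text is first split into tokens with NLTK's word_tokenize function, and pos_tag then produces a tag for every token.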

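import nltk

# word_tokenize and pos_tag rely on NLTK data packages, installed once via nltk.download(...)
# Split the sentence into word tokens
text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")

# Assign a part-of-speech tag to each token
tagged_text = nltk.pos_tag(text)

print(tagged_text)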

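Running the example above produces the following result:

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'),
 ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'),
 ('the', 'DT'), ('nest', 'JJS')]

Each pair holds a token and its Penn Treebank tag. The tagger is statistical and can be wrong: here 'nest' is labeled JJS (superlative adjective) although it is a noun in this sentence.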
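Once the tags are available, a common next step is to group the words by tag. The following is a minimal sketch that works on the tagged list printed above, copied in as a literal so it runs without NLTK; the helper name group_by_tag is hypothetical and not part of NLTK.

```python
from collections import defaultdict

# Tagged output copied from the example above (no NLTK call is made here).
tagged_text = [('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
               ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'),
               ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]


def group_by_tag(pairs):
    """Hypothetical helper: map each tag to the list of words carrying it."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


words_by_tag = group_by_tag(tagged_text)
print(words_by_tag['DT'])   # ['A', 'a', 'the']
print(words_by_tag['VBZ'])  # ['is', 'eats']
```

Words keep their original order inside each group, which makes it easy to check a single tag at a glance.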

Tag descriptions

The meaning of each tag can be looked up with NLTK's built-in help, as in the following program.

import nltk

# Print the description and some example words for each Penn Treebank tag
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

Running the program above gives the following output:

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
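The help output is printed rather than returned, so it cannot be attached to tagged tokens directly. As a small sketch, the three descriptions shown above can be copied into a lookup table; the names TAG_DESCRIPTIONS and describe are hypothetical, and only the NN, IN and DT tags are covered.

```python
# Descriptions copied from the upenn_tagset output above; only three tags are covered.
TAG_DESCRIPTIONS = {
    'NN': 'noun, common, singular or mass',
    'IN': 'preposition or conjunction, subordinating',
    'DT': 'determiner',
}


def describe(pairs, descriptions=TAG_DESCRIPTIONS):
    """Hypothetical helper: attach a description to each (word, tag) pair.

    Tags missing from the table get None instead of a description.
    """
    return [(word, tag, descriptions.get(tag)) for word, tag in pairs]


sample = [('from', 'IN'), ('the', 'DT'), ('serpent', 'NN'), ('eats', 'VBZ')]
for word, tag, meaning in describe(sample):
    print(f"{word:8} {tag:4} {meaning or '(not in table)'}")
```

A fuller table would need one entry per tag in the tagset; the sketch deliberately leaves unknown tags visible instead of guessing their meaning.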

Tagging a corpus

Corpus data can be tagged in the same way, giving a tag for every word in the corpus. See the following code:

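import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Load the raw text of Blake's poems from the Gutenberg corpus
sample = gutenberg.raw("blake-poems.txt")

# Split the text into sentences
tokenized = sent_tokenize(sample)

# Tokenize and tag the first two sentences
for i in tokenized[:2]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    print(tagged)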

Running the example above produces the following result:

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'),
 (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'),
 ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'),
 ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'),
 ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'),
 (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'),
 (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'),
 ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'),
 ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'),
 ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
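With a tagged corpus in hand, simple statistics become possible. The sketch below counts how often each tag occurs and pulls out the noun-like tokens. It uses a hand-copied excerpt of the output above instead of rerunning the tagger, and the choice of treating every tag that starts with NN as a noun is an assumption made for illustration.

```python
from collections import Counter

# Excerpt of the tagged corpus output above (hand-copied, not recomputed).
tagged = [('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'),
          ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'),
          ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ',')]

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in tagged)

# Assumption: any tag beginning with "NN" (NN, NNS, NNP, ...) counts as a noun.
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```

The same two lines work unchanged on the full list returned by pos_tag, so they can be dropped into the loop above to summarize each sentence.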
