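Tagging Words - Python Text Processing Tutorial

Tagging is a fundamental step in text processing: each word is labeled with its grammatical category (its part of speech). We tokenize the text and then use the pos_tag function to assign a tag to each word.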

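import nltk

# One-time setup: fetch the tokenizer and tagger models with nltk.download()
# (resource names vary by NLTK version, e.g. 'punkt').
text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")

# Assign a part-of-speech tag to each token.
tagged_text = nltk.pos_tag(text)
print(tagged_text)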

Running the example above produces the following result:

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'),
 ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'),
 ('the', 'DT'), ('nest', 'JJS')]
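Because the result is an ordinary list of (word, tag) tuples, it can be filtered and counted with plain Python. The sketch below is illustrative only: it copies the output shown above into a literal instead of calling NLTK, and the variable names nouns and tag_counts are assumptions made for this example. Note that 'nest' is missing from the noun list because the tagger labeled it JJS (superlative adjective), a reminder that a statistical tagger can mislabel words.

```python
from collections import Counter

# Tagger output copied from the example above (no NLTK call needed here).
tagged_text = [('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
               ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'),
               ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged_text if tag.startswith('NN')]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged_text)

print(nouns)                      # ['Python', 'serpent', 'eggs']
print(tag_counts.most_common(2))  # [('DT', 3), ('VBZ', 2)]
```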

Tag Descriptions

The following program prints NLTK's built-in description of each tag.

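import nltk

# Print the description and example words for each Penn Treebank tag.
# The tag documentation is a separate resource fetched with nltk.download().
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')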

When we run the program above, we get the following output:

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
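To attach these descriptions to tagged words in a program, rather than just printing them, one option is a small lookup table. This is a minimal sketch under stated assumptions: TAG_DESCRIPTIONS and describe are hypothetical names, not part of NLTK, and the table holds only the three descriptions shown in the output above, so every other tag is reported as unknown.

```python
# Descriptions copied from the upenn_tagset output above; any tag not
# listed here is reported as unknown rather than guessed.
TAG_DESCRIPTIONS = {
    'NN': 'noun, common, singular or mass',
    'IN': 'preposition or conjunction, subordinating',
    'DT': 'determiner',
}


def describe(tagged_tokens, descriptions=TAG_DESCRIPTIONS):
    """Return (word, tag, description) triples for a tagged sentence."""
    return [(word, tag, descriptions.get(tag, 'unknown tag'))
            for word, tag in tagged_tokens]


# A few (word, tag) pairs taken from the first example's output.
sample = [('the', 'DT'), ('serpent', 'NN'), ('from', 'IN'), ('eats', 'VBZ')]
for word, tag, meaning in describe(sample):
    print(f'{word:<8} {tag:<4} {meaning}')
```

Extending the table to further tags is a matter of copying more entries from the upenn_tagset output.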

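Tagging a Corpus

We can also tag corpus data and inspect the tag assigned to each word in that corpus. The following code shows one way to do this: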

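import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Requires the Gutenberg corpus: nltk.download('gutenberg')
sample = gutenberg.raw("blake-poems.txt")

# Split the raw text into sentences.
tokenized = sent_tokenize(sample)

# Tokenize and tag the first two sentences.
for sentence in tokenized[:2]:
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)
    print(tagged)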

Running the example above produces output like the following:

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'),
 (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'),
 ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'),
 ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'),
 ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'),
 (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'),
 (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'),
 ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'),
 ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'),
 ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
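The tagged corpus output shows that the same word can receive different tags depending on its context: 'Piping' appears once as VBG and once as NNP. The sketch below shows one way to surface such cases. It works on a short excerpt copied from the output above rather than calling NLTK, and the names tags_by_word and ambiguous are assumptions made for this illustration.

```python
from collections import defaultdict

# A short excerpt of the tagged output shown above (not the full sentence).
tagged = [('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'),
          ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'),
          ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN')]

# Map each word to the set of tags it received.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items()
             if len(tags) > 1}

print(ambiguous)  # {'Piping': ['NNP', 'VBG']}
```

The same loop can be fed the tagged lists produced by nltk.pos_tag for each sentence to check a larger portion of a corpus.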
