Python Natural Language Processing with NLTK: Part-of-Speech Tagging (pos_tag)

JasonJarvan


Categories: Python, Machine Learning. Tags: Python, NLTK, Tag, Dictionary

Article link: https://blog.csdn.net/jasonjarvan/article/details/79955664


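I have recently been building a sentiment classifier for 40,000 Twitter comments.
When designing a text sentiment classifier, the first thing you need is the NLTK package, which is used to filter words.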

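First, use NLTK's pos_tag function (part-of-speech tagging) to label each word with its part of speech. The result is a list of (word, tag) tuples. From that list we then pick out the words whose tags we need and store them in a new list.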
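To make the shape of that data concrete, here is a minimal sketch that does not call NLTK at all. The list `tagged` is written by hand in the same form that `nltk.pos_tag` returns, a list of `(word, tag)` tuples. The tags in it are assigned manually for illustration and are not real tagger output, and the names `tagged`, `wanted`, and `kept` are made up for this example.

```python
from collections import Counter

# Hand-written stand-in for the return value of nltk.pos_tag:
# a list of (word, tag) tuples. The tags are assigned by hand here.
tagged = [
    ("the", "DT"),
    ("weather", "NN"),
    ("is", "VBZ"),
    ("really", "RB"),
    ("nice", "JJ"),
]

# Each element unpacks into a word and its tag.
words_only = [word for word, _ in tagged]
tags_only = [tag for _, tag in tagged]

# Count how often each tag occurs.
tag_counts = Counter(tags_only)

# Keep only the words whose tag is in a wanted set.
wanted = {"RB", "JJ"}
kept = [word for word, tag in tagged if tag in wanted]

print(words_only)
print(tag_counts["JJ"])
print(kept)
```

Because every element is a plain tuple, ordinary list comprehensions and `collections.Counter` are enough to inspect or filter the tagger's output.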

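Implementation:

First, don't forget the imports. word_tokenize and pos_tag also need NLTK's tokenizer and tagger data, which can be fetched with nltk.download(); the exact resource names depend on the NLTK version, and the LookupError raised for a missing resource names the one to download.

import nltk
from nltk.tokenize import word_tokenize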

Suppose the words we care about are "like" and "hate". Take any piece of English text and split it into tokens:

words = word_tokenize('i hate study on monday. Jim like rabbit.')

Next, pick out all the parts of speech we need. The tag list:

CC coordinating conjunction

CD cardinal number

DT determiner (placed before a noun to limit it, e.g. the, some, this)

EX existential there (as in "there is"; think of it as "there exists")

FW foreign word

IN preposition or subordinating conjunction

JJ adjective, e.g. 'big'

JJR adjective, comparative, e.g. 'bigger'

JJS adjective, superlative, e.g. 'biggest'

LS list item marker, e.g. 1)

MD modal, e.g. could, will

NN noun, singular, e.g. 'desk'

NNS noun, plural, e.g. 'desks'

NNP proper noun, singular, e.g. 'Harrison'

NNPS proper noun, plural, e.g. 'Americans'

PDT predeterminer, e.g. 'all' in 'all the kids'

POS possessive ending, e.g. the 's in parent's

PRP personal pronoun, e.g. I, he, she

PRP$ possessive pronoun, e.g. my, his, her

RB adverb, e.g. very, silently

RBR adverb, comparative, e.g. better

RBS adverb, superlative, e.g. best

RP particle, e.g. 'up' in 'give up' (an adverb or preposition that forms a phrasal verb with the verb)

TO to, e.g. go 'to' the store

UH interjection, e.g. errrrrrrrm

VB verb, base form, e.g. take

VBD verb, past tense, e.g. took

VBG verb, gerund or present participle, e.g. taking

VBN verb, past participle, e.g. taken

VBP verb, non-3rd person singular present, e.g. take

VBZ verb, 3rd person singular present, e.g. takes

WDT wh-determiner, e.g. which

WP wh-pronoun, e.g. who, what

WP$ possessive wh-pronoun, e.g. whose

WRB wh-adverb, e.g. where, when

（https://wenku.baidu.com/view/c63bec3b366baf1ffc4ffe4733687e21af45ffab.html）
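Many of the tags above fall into families that share a prefix (JJ, JJR, JJS for adjectives; VB, VBD, VBG, and so on for verbs). As a rough sketch, a tag can be mapped to its family with a prefix check. The names `FAMILY_PREFIXES` and `tag_family` are hypothetical helpers for this example and are not part of NLTK.

```python
# Prefixes shared by the tag families in the list above.
# This mapping is an illustrative assumption, not an NLTK structure.
FAMILY_PREFIXES = {
    "adjective": "JJ",
    "adverb": "RB",
    "verb": "VB",
    "noun": "NN",
}


def tag_family(tag):
    """Return the coarse family name for a tag, or None if it has none."""
    for family, prefix in FAMILY_PREFIXES.items():
        if tag.startswith(prefix):
            return family
    return None


for tag in ["JJR", "VBZ", "NNPS", "RBS", "RP", "WRB", "MD"]:
    print(tag, tag_family(tag))
```

Note that the check is on the start of the tag, so RP (particle) and WRB (wh-adverb) are not mistaken for members of the RB family.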

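Sentiment classification generally relies on personal pronouns, verbs, adjectives, adverbs, and similar words, so we select a suitable set of tags (the set below keeps modals, interjections, verbs, particles, adverbs, and adjectives). The words and their tags produced by pos_tag are stored in the list pos_tags.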

tags = {'MD', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'RP', 'RB', 'RBR', 'RBS', 'JJ', 'JJR', 'JJS'}
pos_tags = nltk.pos_tag(words)

Then create an empty list ret, iterate over pos_tags, and append to ret every word whose tag is one we need:

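# Wrapped in a function so that `return` is valid; the name filter_words is only illustrative.
def filter_words(pos_tags, tags):
    ret = []
    for word, pos in pos_tags:
        # Keep only the words whose tag is in the wanted set
        if pos in tags:
            ret.append(word)
    # Join the kept words back into one space-separated string
    return ' '.join(ret)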
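The steps above (tokenize, tag, filter, join) can be combined into one function. The following is a sketch under some assumptions rather than the author's exact code: the name `keep_sentiment_words` and its `tokenize` and `tag` parameters are invented for this example, and `toy_tokenize`, `toy_tag`, and `TOY_TAGS` are stand-ins with hand-assigned tags so the example runs without NLTK or its data. When no stand-ins are passed, the function falls back to `nltk.word_tokenize` and `nltk.pos_tag`.

```python
SENTIMENT_TAGS = frozenset({
    "MD", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "RP", "RB", "RBR", "RBS", "JJ", "JJR", "JJS",
})


def keep_sentiment_words(text, tokenize=None, tag=None, wanted=SENTIMENT_TAGS):
    """Tokenize and tag `text`, then keep the words whose tag is in `wanted`."""
    if tokenize is None or tag is None:
        # Imported lazily so the function can be tried with stand-ins
        # when NLTK or its data files are not available.
        import nltk
        tokenize = tokenize or nltk.word_tokenize
        tag = tag or nltk.pos_tag
    tagged = tag(tokenize(text))
    return " ".join(word for word, pos in tagged if pos in wanted)


# Stand-ins for demonstration only: a whitespace tokenizer and a
# lookup-table tagger whose tags are assigned by hand.
def toy_tokenize(text):
    return text.split()


TOY_TAGS = {"i": "PRP", "really": "RB", "hate": "VBP", "mondays": "NNS"}


def toy_tag(tokens):
    # Unknown tokens default to NN in this toy tagger.
    return [(tok, TOY_TAGS.get(tok, "NN")) for tok in tokens]


print(keep_sentiment_words("i really hate mondays", toy_tokenize, toy_tag))
# prints: really hate
```

With NLTK and its data installed, calling `keep_sentiment_words(text)` with no extra arguments uses the real tokenizer and tagger instead of the toy ones.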
