English Part-of-Speech Tagging with NLTK

Latest recommended article published on 2024-02-02 16:27:53

LRita

Categories: Machine Learning, Python. Tags: python, nltk

Article link: https://blog.csdn.net/lrita/article/details/48211499

Part-of-speech tagging of English text with Python's NLTK

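First, install NLTK. On Debian or Ubuntu the rough procedure is:

sudo apt-get install python-nltk

On systems that ship Python 3 the package is named python3-nltk, and pip install nltk also works. Any further dependencies can be looked up as needed.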

After installation, an important step is downloading the NLTK data:

>>> import nltk

>>> nltk.download()

If you see a Connection reset error, it is usually a network problem; switching to a different network generally fixes it.
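As an alternative to the interactive downloader, the step above can be sketched as a non-interactive check that fetches only the tagger model. The helper name `ensure_tagger` is hypothetical, and the resource name is an assumption: which model `nltk.pos_tag` needs depends on the installed NLTK version (recent releases ask for `averaged_perceptron_tagger_eng`), so adjust it to whatever the `LookupError` message names.

```python
def ensure_tagger(resource="averaged_perceptron_tagger"):
    """Download the POS tagger model only if it is not already installed.

    `resource` is an assumed default; newer NLTK releases may need
    "averaged_perceptron_tagger_eng" instead.
    """
    import nltk  # imported here so the helper can be defined without NLTK

    try:
        # Raises LookupError when the resource is missing locally.
        nltk.data.find("taggers/" + resource)
    except LookupError:
        nltk.download(resource)


if __name__ == "__main__":
    ensure_tagger()
```

Running this once before the tagging script avoids opening the downloader window on a machine without a display.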

This article runs part-of-speech tagging on song comment lists that have already been tokenized. Each input line has the form:

id \t word \t word ......
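To make the input format concrete, here is a minimal sketch of splitting one such line into the song id and its word list. The function name `parse_line` and the sample line are illustrative and not taken from the original script.

```python
def parse_line(line):
    """Split one 'id<TAB>word<TAB>word...' line into (song_id, words)."""
    items = line.rstrip("\n").split("\t")
    song_id = items[0]
    # Drop empty fields left by consecutive or trailing tabs.
    words = [w for w in items[1:] if w]
    return song_id, words


# Hypothetical sample line; real data comes from the tokenized comment file.
song_id, words = parse_line("1001\tgreat\tsong\tlovely\tvoice\n")
print(song_id, words)
```

A line that contains only an id yields an empty word list, which the caller can skip or pass through as it sees fit.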

The tagging script is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Tag each tokenized comment line read from stdin (Python 3 version)."""
import sys

import nltk

# Tags to keep when filtering: singular nouns, plural nouns, adjectives.
KEEP_TAGS = ('NN', 'NNS', 'JJ')


def main():
    # Read and write UTF-8 regardless of the locale (Python 3.7+).
    sys.stdin.reconfigure(encoding='utf-8')
    sys.stdout.reconfigure(encoding='utf-8')

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        items = line.split('\t')
        song_id = items[0]
        # Drop empty tokens so the tagger never sees an empty string.
        comment_words = [w for w in items[1:] if w]

        # pos_tag returns a list of (word, tag) tuples.
        pos_result = nltk.pos_tag(comment_words)

        # To keep only nouns and adjectives, filter on KEEP_TAGS instead:
        # pos_result = [(w, t) for w, t in pos_result if t in KEEP_TAGS]

        # Output: the song id, then one "word tag" field per token,
        # all separated by tabs.
        fields = [song_id] + [word + ' ' + tag for word, tag in pos_result]
        print('\t'.join(fields))


if __name__ == '__main__':
    main()

Running python pos.py < inputfile > outfile produces the tagging results. Depending on the needs of your project, you can choose which part-of-speech tags to keep.
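As a rough sketch of choosing which tags to keep, the tagged output can be post-processed line by line. The function name `filter_tagged_line` is hypothetical, the sample line is hand-written rather than real tagger output, and the code assumes each field after the id looks like "word tag".

```python
KEEP_TAGS = {"NN", "NNS", "JJ"}  # singular nouns, plural nouns, adjectives


def filter_tagged_line(line, keep=KEEP_TAGS):
    """Return (song_id, [(word, tag), ...]) keeping only tags in `keep`."""
    # Strip each field so stray spaces around tabs do not matter.
    fields = [f.strip() for f in line.split("\t")]
    song_id = fields[0]
    kept = []
    for field in fields[1:]:
        if not field:
            continue  # skip empty fields from trailing tabs
        # The tag is whatever follows the last space in the field.
        word, sep, tag = field.rpartition(" ")
        if sep and tag in keep:
            kept.append((word, tag))
    return song_id, kept


# Hand-written sample in the script's output format (tags are illustrative).
sample = "1001\tgreat JJ\tsongs NNS\tis VBZ\tvoice NN\n"
print(filter_tagged_line(sample))
```

Filtering after tagging, rather than before, keeps the full sentence context available to the tagger, and the same tagged file can be reused with a different tag set.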