Python Chinese word segmentation statistics: counting Chinese characters and segmenting words

weixin_39516956

Tags: Python Chinese word segmentation statistics

Article link: https://blog.csdn.net/weixin_39516956/article/details/113502343

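Summary (auto-generated by CSDN): This article walks through an example of Chinese word segmentation and character counting in Python, using a remote API to process the text of a novel. The novel is first read into memory and its characters are counted with a regular expression; each piece of text is then sent to a remote segmentation API; finally the script reports the number of items counted and the time taken.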

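I wanted to segment a piece of text into words, and that requires some notion of how the words relate to one another.

I downloaded an arbitrary Chinese novel from the internet, a plain txt file a little over 1 MB. The goal is to find out how many characters this 1 MB of Chinese contains, how many words it segments into, and what part of speech each of those words has.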

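The approach:

1) Read the novel into memory.

2) Match the text against a regular expression to get the total number of Chinese characters in the novel.

3) POST each paragraph of the in-memory novel to the API that provides the segmentation service and collect the segmentation result.

4) Extract the words as described in the API documentation.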
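Steps 1) and 2) can be sketched as follows. This is a minimal Python 3 illustration, not the author's script: it assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the names `count_chinese_chars` and `count_file` are hypothetical.

```python
import re

# Assumption: a "Chinese character" is any code point in the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
HAN_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_chinese_chars(text: str) -> int:
    """Return how many characters of text fall in the assumed Han range."""
    return len(HAN_CHAR.findall(text))


def count_file(path: str, encoding: str = "gbk") -> int:
    """Sum the per-line counts for a text file stored in the given encoding."""
    total = 0
    with open(path, encoding=encoding) as f:
        for line in f:
            total += count_chinese_chars(line)
    return total
```

Calling `count_file("novel2.txt")` would give the character total for a GBK-encoded file; punctuation, Latin letters and digits are deliberately not counted under this assumption.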

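Materials:

1. GNU/Linux: Debian / Ubuntu 12.04 / Linux Mint 13 preferred

2. Python (the script below uses Python 2 modules such as urllib2)

3. A Chinese word segmentation API; here we use http://www.vapsec.com/fenci/

4. The documentation file describing the segmentation (part-of-speech) attributes, downloadable from http://vdisk.weibo.com/s/qR7KSFDa9ON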

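A test script is already written. It accesses the API from a single process only; concurrent access has not been tested yet.

I will add concurrency in later tests.

Here is the test script, test.py: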

#!/usr/bin/env python
# coding: utf-8
# Python 2 script (urllib2 and print statements).

import re
import urllib
import urllib2
from datetime import datetime


def url_post(word='My name is Jake Anderson', geshi="json"):
    # POST one piece of text to the segmentation API and print the raw response.
    url = "http://open.vapsec.com/segment/get_word"
    postDict = {
        "word": word,
        "format": geshi
    }
    postData = urllib.urlencode(postDict)
    request = urllib2.Request(url, postData)
    request.get_method = lambda: 'POST'
    #request.add_header('Authorization', basic)
    response = urllib2.urlopen(request)
    r = response.readlines()
    print r


if __name__ == "__main__":
    # Get the quantity of Chinese characters. Each match is either an ASCII
    # word or one 3-byte UTF-8 sequence, i.e. one Chinese character.
    regex = re.compile(r"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")
    count = 0
    f = open('novel2.txt', 'r')
    for line in f:
        line = line.decode('gbk')
        line = line.encode('utf8')
        # findall returns the matches themselves; split would return
        # the text between matches and give the wrong count.
        word = regex.findall(line)
        count += len(word)
    f.close()
    #print count

    # Second pass: send every line to the remote service and time it.
    f = open('novel2.txt', 'r')
    start_time = datetime.now()
    for line in f:
        line = line.decode('gbk')
        line = line.encode('utf8')
        print line
        url_post(line)
    end_time = datetime.now()
    f.close()

    # Elapsed time is end minus start, so the value is positive.
    tdelta = end_time - start_time
    print "It takes " + str(tdelta.total_seconds()) + " seconds to segment " + str(count) + " Chinese words!"
    print "This means it can segment " + str(count / tdelta.total_seconds()) + " Chinese characters per second!"
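The script above targets Python 2. As a rough sketch of how the same POST request might be built on Python 3, the block below uses `urllib.request` and `urllib.parse`. The endpoint and the `word` / `format` fields are copied from the script; the helper names `build_segment_request` and `post_segment` are hypothetical, and the response is returned unparsed because its layout is not described here. Whether the service is still reachable is not known.

```python
import urllib.parse
import urllib.request

# Endpoint copied from test.py; its current availability is an open question.
SEGMENT_URL = "http://open.vapsec.com/segment/get_word"


def build_segment_request(text: str, fmt: str = "json") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying `word` and `format`."""
    # urlencode percent-encodes non-ASCII text as UTF-8, so the form is ASCII.
    form = urllib.parse.urlencode({"word": text, "format": fmt})
    return urllib.request.Request(
        SEGMENT_URL, data=form.encode("ascii"), method="POST"
    )


def post_segment(text: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body, unparsed."""
    request = build_segment_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending makes it possible to inspect the encoded form data without touching the network.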

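novel2.txt is the downloaded novel. The file is 1.2 MB and contains roughly 580,000 characters.

The novel is encoded in GBK, so it has to be converted to UTF-8 after downloading; the script does this line by line with decode('gbk') followed by encode('utf8').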
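The GBK to UTF-8 conversion can be illustrated with a short Python 3 sketch. The function names `gbk_to_utf8` and `transcode_file` are hypothetical; the technique is simply decoding the GBK bytes to text and re-encoding that text as UTF-8.

```python
def gbk_to_utf8(data: bytes) -> bytes:
    """Decode GBK bytes to text, then re-encode the text as UTF-8."""
    return data.decode("gbk").encode("utf-8")


def transcode_file(src: str, dst: str) -> None:
    """Write a UTF-8 copy of a GBK-encoded text file, line by line."""
    with open(src, encoding="gbk") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```

A common Chinese character occupies 2 bytes in GBK and 3 bytes in UTF-8, which is why the regular expression in test.py looks for runs of exactly three high bytes after the conversion.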

The terminal output looks roughly like this.

Every line of the novel is sent to the remote segmentation service.

Original article: http://www.cnblogs.com/spaceship9/p/3611317.html
