Extracting Word references with Python: Python wordsegment package (program module, PyPI, Python中文网)

Latest recommended article published on 2024-04-02 09:42:18

weixin_39812577


Article tags: extracting Word references with Python

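Tutorial

In your own Python programs, you will mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegment import load, segment
>>> load()
>>> segment('thisisatest')
['this', 'is', 'a', 'test']

The load function reads the unigram and bigram data from disk. The data only needs to be loaded once.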
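To give a feel for what a segmenter has to do, here is a minimal sketch of a count-based splitter. It is not the wordsegment implementation: the names toy_segment, score and TOY_COUNTS are hypothetical, the counts are invented for illustration, and only unigram scores are used.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, invented for illustration only. The real package
# loads much larger unigram and bigram tables from disk via load().
TOY_COUNTS = {'this': 50, 'is': 80, 'a': 100, 'test': 20, 'at': 30, 'his': 10}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 24  # candidate words longer than this are never tried


def score(word):
    """Log10 probability of a word; unknown words are penalized by length."""
    if word in TOY_COUNTS:
        return math.log10(TOY_COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def toy_segment(text):
    """Return (score, words) for the highest-scoring split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = toy_segment(tail)
        candidates.append((score(head) + tail_score, (head,) + tail_words))
    return max(candidates)


best_score, words = toy_segment('thisisatest')
print(list(words))  # ['this', 'is', 'a', 'test']
```

The recursion is memoized so each suffix is scored once; for long inputs an iterative version would avoid Python's recursion limit.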

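WordSegment also provides a command-line interface for batch processing. This interface accepts two arguments: in-file and out-file. Lines from in-file are iteratively segmented, joined by spaces, and written to out-file. Input and output default to stdin and stdout, respectively.

$ echo thisisatest | python -m wordsegment
this is a test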
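The batch behavior described above can be sketched in a few lines. This is only an illustration of the read, segment, join, write loop, not the package's own command-line code; segment_stream and fake_segment are hypothetical names, and fake_segment is a stand-in so the example runs without the wordsegment data.

```python
import io


def segment_stream(infile, outfile, segment_fn):
    """Segment each input line and write the words joined by spaces."""
    for line in infile:
        words = segment_fn(line.strip())
        outfile.write(' '.join(words) + '\n')


def fake_segment(text):
    """Hypothetical stand-in for wordsegment.segment, for demonstration only."""
    known = {'thisisatest': ['this', 'is', 'a', 'test']}
    return known.get(text, [text])


source = io.StringIO('thisisatest\n')
sink = io.StringIO()
segment_stream(source, sink, fake_segment)
print(sink.getvalue(), end='')  # this is a test
```

In a real program you would pass wordsegment.segment (after calling load()) as segment_fn, and sys.stdin and sys.stdout as the streams.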

If you want to run WordSegment as a kind of server process, use Python's -u option for unbuffered output. You can also set PYTHONUNBUFFERED=1 in the environment.

>>> import subprocess as sp
>>> wordsegment = sp.Popen(
...     ['python', '-um', 'wordsegment'],
...     stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT,
...     text=True, bufsize=1)  # text mode, line-buffered pipes (Python 3.7+)
>>> wordsegment.stdin.write('thisisatest\n')
12
>>> wordsegment.stdout.readline()
'this is a test\n'
>>> wordsegment.stdin.write('workswithotherlanguages\n')
24
>>> wordsegment.stdout.readline()
'works with other languages\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.

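0

The maximum segmented word length is 24 characters. Neither the unigram nor the bigram data contains words exceeding that length. The corpus also excludes punctuation, and all letters have been lowercased. Before segmenting text, clean is called to transform the input to a canonical form:

>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']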
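The normalization step can be approximated with the standard library alone. The sketch below is an assumption about what such a canonical form involves (lowercasing, then dropping everything except ASCII letters and digits); clean_sketch and ALLOWED are hypothetical names, and the package's own clean may differ in detail.

```python
import string

# Assumption: only lowercase ASCII letters and digits survive normalization.
ALLOWED = set(string.ascii_lowercase + string.digits)


def clean_sketch(text):
    """Lowercase text and strip every character outside ALLOWED."""
    return ''.join(ch for ch in text.lower() if ch in ALLOWED)


print(clean_sketch('She said, "Python rocks!"'))  # shesaidpythonrocks
```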

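Sometimes it is interesting to explore the unigram and bigram counts themselves. They are stored in Python dictionaries that map words to counts.

>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS['the']
23135851162.0
>>> ws.UNIGRAMS['gray']
21424658.0
>>> ws.UNIGRAMS['grey']
18276942.0

Above we see that the spelling gray is more common than the spelling grey.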
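To put a number on that comparison, the two counts shown above can be turned into a ratio and a share. The sketch copies the values into a plain dictionary so it runs without the package; with the data loaded you could read them from ws.UNIGRAMS instead.

```python
# Counts copied from the ws.UNIGRAMS output shown above.
counts = {'gray': 21424658.0, 'grey': 18276942.0}

ratio = counts['gray'] / counts['grey']
share = counts['gray'] / (counts['gray'] + counts['grey'])

print(f'gray/grey ratio: {ratio:.3f}')
print(f'gray share of both spellings: {share:.1%}')
```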

Bigrams are stored as two words joined by a space:

>>> import heapq
>>> from pprint import pprint
>>> from operator import itemgetter
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[('of the', 2766332391.0),
 ('in the', 1628795324.0),
 ('to the', 1139248999.0),
 ('on the', 800328815.0),
 ('for the', 692874802.0),
 ('and the', 629726893.0),
 ('to be', 505148997.0),
 ('is a', 476718990.0),
 ('with the', 461331348.0),
 ('from the', 428303219.0)]

Some bigrams begin with <s>. This indicates the start of a bigram:

>>> ws.BIGRAMS['<s> where']
15419048.0
>>> ws.BIGRAMS['<s> what']
11779290.0
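Because each bigram key is two tokens joined by a space, the keys are easy to take apart. The following sketch uses a small sample dictionary built from the values shown above (not the full table) to pick out the entries whose first token is the <s> marker; split_bigram, START and sample_bigrams are hypothetical names.

```python
START = '<s>'

# Sample entries copied from the outputs shown above.
sample_bigrams = {
    'of the': 2766332391.0,
    'in the': 1628795324.0,
    '<s> where': 15419048.0,
    '<s> what': 11779290.0,
}


def split_bigram(key):
    """Split a bigram key into its two space-separated tokens."""
    first, second = key.split(' ')
    return first, second


# Map the second word to its count for every bigram that begins with <s>.
starters = {
    split_bigram(key)[1]: count
    for key, count in sample_bigrams.items()
    if split_bigram(key)[0] == START
}
print(starters)  # {'where': 15419048.0, 'what': 11779290.0}
```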

The unigram and bigram data are stored in the unigrams.txt and bigrams.txt files, respectively.
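If you want to inspect such a file yourself, a small parser is enough. The sketch below assumes each line holds a word (or a space-joined bigram) and its count separated by a tab; that layout is an assumption not stated above, so check the actual files first. parse_counts is a hypothetical helper, and the sample text reuses counts shown earlier.

```python
import io


def parse_counts(lines, sep='\t'):
    """Build a {key: count} dict from lines of the form key<sep>count."""
    counts = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        key, value = line.rsplit(sep, 1)
        counts[key] = float(value)
    return counts


# In-memory sample in the assumed format, using counts shown earlier.
sample = io.StringIO('the\t23135851162\ngray\t21424658\ngrey\t18276942\n')
unigrams = parse_counts(sample)
print(unigrams['the'])  # 23135851162.0
```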
