Python phrases: counting phrase frequency in Python 3.3.2

Latest recommended article published on 2022-03-07 16:50:09

weixin_39664774

Article tags: Python phrases

Article link: https://blog.csdn.net/weixin_39664774/article/details/111439285

I have looked at various sources on the web and tried several methods, but I could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:

import collections
import re

wanted = {'inflation', 'gold', 'bank'}
cnt = collections.Counter()

# Read the whole file, lowercase it, and split it into words
with open('02.2003.BenBernanke.txt') as infile:
    words = re.findall(r'\w+', infile.read().lower())

# Count only the words we care about
for word in words:
    if word in wanted:
        cnt[word] += 1

print(cnt)

If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I would appreciate any suggestions or guidance.

Solution

First of all, this is how I would build cnt, reading the file line by line to reduce memory overhead:

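# Relies on the re and collections imports from the question's code
def findWords(filepath):
    # Yield lowercase words one line at a time, so the whole file is never held in memory
    with open(filepath) as infile:
        for line in infile:
            words = re.findall(r'\w+', line.lower())
            yield from words

# Note: this counts every word in the file, not only those in wanted
cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))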
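To get the filtered counts from the question while still reading line by line, the word stream can be filtered before it reaches the Counter. The following is a minimal sketch, not part of the original answer: `iter_words` and `count_wanted` are hypothetical helper names, and an in-memory `io.StringIO` sample with made-up text stands in for the real file.

```python
import collections
import io
import re

def iter_words(lines):
    """Yield lowercase words from any iterable of text lines."""
    for line in lines:
        yield from re.findall(r'\w+', line.lower())

def count_wanted(lines, wanted):
    """Count only the words that appear in the wanted set."""
    return collections.Counter(w for w in iter_words(lines) if w in wanted)

# Hypothetical sample text standing in for the speech file
sample = io.StringIO("The central bank watches inflation.\nGold and the Bank of Japan.\n")
counts = count_wanted(sample, {'inflation', 'gold', 'bank'})
print(counts)
```

Because an open file object is also an iterable of lines, the same `count_wanted` call should work with `open(...)` in place of the sample.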

Now, on to your question about phrases:

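from itertools import tee

phrases = {'central bank', 'high inflation'}

# Duplicate the word stream, then advance the second copy by one word
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)

# zip now yields each pair of adjacent words
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1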
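The same sliding-pair idea extends to phrases of any length. The sketch below is one possible generalization rather than the approach shown above: `ngrams` and `count_phrases` are hypothetical helpers, a `collections.deque` window replaces `tee`, and the sample word list is made up. It also assumes the phrases are already lowercase, matching the lowercased word stream.

```python
import collections

def ngrams(words, n):
    """Yield every run of n consecutive words as a tuple."""
    window = collections.deque(maxlen=n)
    for word in words:
        window.append(word)
        if len(window) == n:
            yield tuple(window)

def count_phrases(words, phrases):
    """Count occurrences of each space-separated phrase in a sequence of words."""
    words = list(words)  # materialized so it can be scanned once per phrase length
    wanted = {tuple(p.split()) for p in phrases}
    counts = collections.Counter()
    for n in {len(p) for p in wanted}:
        for gram in ngrams(words, n):
            if gram in wanted:
                counts[' '.join(gram)] += 1
    return counts

# Hypothetical sample standing in for the output of findWords
sample = "the central bank said the central bank fears high inflation".split()
result = count_phrases(sample, {'central bank', 'high inflation', 'the central bank'})
print(result)
```

Note that this version holds the full word list in memory, trading the streaming behaviour for simplicity when several phrase lengths are involved.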

Hope this helps
