Counting word frequencies in Python and storing them in a dictionary: a list of dictionaries that tracks word frequency per file

Latest recommended article published 2022-12-17 14:59:17

weixin_39743414



Article tags: counting word frequencies in Python, storing them in a dictionary

I wrote some code that counts word frequencies across multiple text files and stores them in a dictionary.

I have been trying to find a way to keep a running count for each file, in a form like this:

word1 [1] [20] [30] [22]

word2 [5] [7] [0] [4]

I have tried using counters, but have not yet found a suitable method or data structure.

import string
from collections import defaultdict
from collections import Counter
import glob
import os

# Words to remove
noise_words_set = {'the','to','of','a','in','is'}  # ...etc...

# Find files
path = r"C:\Users\Logs"
os.chdir(path)

print("Processing files...")

for file in glob.glob("*.txt"):
    # Read file (os.path.join avoids the invalid "\{" escape in the path string)
    with open(os.path.join(path, file), 'r', encoding="utf8") as f:
        txt = f.read()

    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")

    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]

    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]

    # Make a dictionary of words
    D = defaultdict(int)
    for word in words:
        D[word] += 1

    # Add to some data structure (?) that keeps count per file
    #...word1 [1] [20] [30] [22]
    #...word2 [5] [7] [0] [4]

Solution:

You can keep almost your entire structure!

import glob
import string
from collections import Counter

files = dict()  # this may be better as a list, tbh
table = str.maketrans('', '', string.punctuation)

for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            # Strip punctuation, lower-case, then split into words;
            # iterating over the string itself would yield single characters
            words = line.translate(table).lower().split()
            word_count += Counter(w for w in words if w not in noise_words_set)
    files[file] = word_count  # if list: files.append(word_count)
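To see what the per-line counting step produces without touching the file system, here is a minimal sketch that runs the same punctuation stripping and counting on an in-memory string. The helper name `count_words`, the constant names, and the short noise-word set are assumptions made for illustration; they are not part of the original answer.

```python
import string
from collections import Counter

# Hypothetical stand-in for the noise-word set used above
NOISE_WORDS = {'the', 'to', 'of', 'a', 'in', 'is'}

# Translation table that deletes every ASCII punctuation character
TABLE = str.maketrans('', '', string.punctuation)


def count_words(text, noise_words=NOISE_WORDS):
    """Count the non-noise words in text, ignoring case and punctuation."""
    counts = Counter()
    for line in text.splitlines():
        # translate() removes punctuation; split() turns the line into words
        words = line.translate(TABLE).lower().split()
        counts.update(w for w in words if w not in noise_words)
    return counts


sample = "The cat sat.\nThe cat, the dog!"
print(count_words(sample))
```

Lower-casing before the noise-word check means "The" and "the" are both filtered out.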

If you want to turn the counts into a single dictionary of lists, do that afterwards:

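words_count = dict()

# Iterate over the Counter objects; looping over the dict itself yields only file names
for word_count in files.values():
    for word, value in word_count.items():
        try:
            words_count[word].append(value)
        except KeyError:
            words_count[word] = [value]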
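One caveat with appending only the words that occur in each file: a word missing from a file gets no entry for it, so the lists can end up with different lengths and no longer line up column by column as in the layout the question asks for. The sketch below is one possible way to fill the gaps with 0. The function name `build_table` and the sample file names and counts are hypothetical, and sorting the file names is just one assumed way to fix the column order.

```python
from collections import Counter


def build_table(files):
    """Return {word: [count per file, ...]} with 0 where a word is absent."""
    names = sorted(files)  # fix a file order so every list lines up
    vocabulary = set()
    for counts in files.values():
        vocabulary.update(counts)
    # A Counter returns 0 for a missing key, so no KeyError handling is needed
    return {word: [files[name][word] for name in names]
            for word in sorted(vocabulary)}


# Hypothetical per-file counts, standing in for the `files` dict built earlier
files = {
    "a.txt": Counter({"word1": 1, "word2": 5}),
    "b.txt": Counter({"word1": 20, "word2": 7}),
    "c.txt": Counter({"word1": 30}),
}

for word, row in build_table(files).items():
    print(word, " ".join("[{}]".format(n) for n in row))
# word1 [1] [20] [30]
# word2 [5] [7] [0]
```

Each list has one entry per file, in sorted file-name order, which matches the `word [n] [n] [n]` layout from the question.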

Tags: python, data-structures, dictionary
