Python: counting word frequencies in a dictionary, tracking the count per file

I wrote some code to count word frequencies across multiple text files and store them in a dictionary.

I have been trying to find a way to keep a running count per file, in a form like:

word1 [1] [20] [30] [22]

word2 [5] [7] [0] [4]

I have tried using a Counter, but haven't found a suitable method/data structure.
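For reference, the desired shape can be modeled as a plain dict mapping each word to a list with one slot per file. A minimal sketch with made-up per-file counts (the names `counts_per_file` and `totals` are illustrative, not from the original post):

```python
from collections import defaultdict

# Hypothetical per-file word counts (made-up data) illustrating
# the desired "one running count per file" shape.
counts_per_file = [
    {"word1": 1, "word2": 5},    # file 1
    {"word1": 20, "word2": 7},   # file 2
    {"word1": 30},               # file 3 (word2 absent -> should record 0)
    {"word1": 22, "word2": 4},   # file 4
]

# Collect the full vocabulary, then append one value per file per word,
# using 0 when a word does not occur in that file.
all_words = {w for d in counts_per_file for w in d}
totals = defaultdict(list)
for file_counts in counts_per_file:
    for word in all_words:
        totals[word].append(file_counts.get(word, 0))

print(dict(totals))
# e.g. {'word1': [1, 20, 30, 22], 'word2': [5, 7, 0, 4]}
```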

```python
import string
from collections import defaultdict
from collections import Counter
import glob
import os

# Words to remove
noise_words_set = {'the', 'to', 'of', 'a', 'in', 'is'}  # ...etc...

# Find files
path = r"C:\Users\Logs"
os.chdir(path)

print("Processing files...")
for file in glob.glob("*.txt"):
    # Read file
    with open(os.path.join(path, file), 'r', encoding="utf8") as f:
        txt = f.read()

    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")

    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]

    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]

    # Make a dictionary of word counts for this file
    D = defaultdict(int)
    for word in words:
        D[word] += 1

    # Add to some data structure (?) that keeps a count per file
    # ...word1 [1] [20] [30] [22]
    # ...word2 [5] [7] [0] [4]
```

Solution:

You can use almost your entire structure!

```python
from collections import Counter

files = dict()  # this may be better as a list, tbh

table = str.maketrans('', '', string.punctuation)
for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            # translate() strips punctuation; split() is needed to get
            # words rather than individual characters
            word_count += Counter(
                word.lower()
                for word in line.translate(table).split()
                if word.lower() not in noise_words_set
            )
    files[file] = word_count  # if a list: files.append(word_count)
```

If you want to translate these into a single dictionary, do it afterwards:

```python
words_count = dict()
for word_counter in files.values():  # iterate the Counters, not the keys
    for word, value in word_counter.items():
        try:
            words_count[word].append(value)
        except KeyError:
            words_count[word] = [value]
```
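Note that the loop above appends a value only for files in which a word actually occurs, so the per-word lists can drift out of alignment and the zeros from the desired output (e.g. `word2 [5] [7] [0] [4]`) never appear. One way to keep exactly one slot per file, sketched here with made-up data (a `Counter` conveniently returns 0 for missing words):

```python
from collections import Counter

# Assumed input shape: file name -> Counter of word frequencies,
# as built by the loop above (the data here is made up).
files = {
    "a.txt": Counter({"word1": 1, "word2": 5}),
    "b.txt": Counter({"word1": 20, "word2": 7}),
    "c.txt": Counter({"word1": 30}),              # word2 absent
    "d.txt": Counter({"word1": 22, "word2": 4}),
}

file_order = sorted(files)              # fix an explicit file order
vocab = set().union(*files.values())    # all words seen in any file

# counter[word] is 0 when the word is missing, so every list gets
# one entry per file and the lists stay aligned.
words_count = {
    word: [files[name][word] for name in file_order]
    for word in vocab
}

print(words_count["word2"])  # [5, 7, 0, 4]
```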

Tags: python, data-structures, dictionary
