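I've written some code that counts word frequencies across multiple text files and stores them in a dictionary.
I've been trying to find a way to keep a running count per file for each word, in a form like: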
word1 [1] [20] [30] [22]
word2 [5] [7] [0] [4]
I've tried using Counter, but haven't found a suitable method or data structure yet.
import string
from collections import defaultdict
from collections import Counter
import glob
import os

# Words to remove
noise_words_set = {'the','to','of','a','in','is',...etc...}

# Find files
path = r"C:\Users\Logs"
os.chdir(path)

print("Processing files...")

for file in glob.glob("*.txt"):

    # Read file
    with open(os.path.join(path, file), 'r', encoding="utf8") as f:
        txt = f.read()

    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")

    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]

    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]

    # Make a dictionary of words
    D = defaultdict(int)
    for word in words:
        D[word] += 1

    # Add to some data structure (?) that keeps count per file
    #...word1 [1] [20] [30] [22]
    #...word2 [5] [7] [0] [4]
Solution:
You can keep almost your whole structure!
import glob
import string
from collections import Counter

# noise_words_set is the set defined in the question
files = dict()  # this may be better as a list, tbh
table = str.maketrans('', '', string.punctuation)

for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            # Strip punctuation, lower-case, then split the line into words
            words = line.translate(table).lower().split()
            word_count.update(w for w in words if w not in noise_words_set)
    files[file] = word_count  # if list: files.append(word_count)
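To check the cleaning step on its own, here is a minimal sketch that runs the same strip/lower/split/filter pipeline on an in-memory string instead of a file. The helper name `count_words`, the sample sentence and the two noise words are made up for illustration and are not part of the code above.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_words(text, noise_words):
    """Count the words in `text`, ignoring case, punctuation and noise words."""
    words = text.translate(PUNCT_TABLE).lower().split()
    return Counter(w for w in words if w not in noise_words)

# Hypothetical sample input, just for illustration
sample = "The cat sat. The cat ran to the mat!"
counts = count_words(sample, {'the', 'to'})
print(counts['cat'])  # 2
print(counts['dog'])  # 0, a Counter returns 0 for words it has not seen
```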
If you want to convert these per-file counters into a single dictionary keyed by word, do that afterwards:
words_count = dict()
for word_count in files.values():  # iterate over the Counters, not the file names
    for word, value in word_count.items():
        try:
            words_count[word].append(value)
        except KeyError:
            words_count[word] = [value]
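Note that the loop above only appends a count when a word actually occurs in a file, so the lists can end up with different lengths, whereas the layout in the question shows a 0 for a word that is missing from a file. One way to get that layout, sketched under the assumption that the per-file counters are kept in a list in file order, is to look every word up in every counter, since a `Counter` returns 0 for a missing key. The name `build_table` and the sample counts are illustrative only.

```python
from collections import Counter

def build_table(counters):
    """Map each word to a list of counts, one per counter, 0 when absent."""
    all_words = set()
    for counter in counters:
        all_words.update(counter)  # adds the counter's keys (the words)
    return {word: [counter[word] for counter in counters]
            for word in sorted(all_words)}

# Hypothetical per-file counts, in file order
per_file = [
    Counter({'word1': 1, 'word2': 5}),
    Counter({'word1': 20, 'word2': 7}),
    Counter({'word1': 30}),
    Counter({'word1': 22, 'word2': 4}),
]

table = build_table(per_file)
for word, counts in table.items():
    print(word, ' '.join('[{}]'.format(c) for c in counts))
# word1 [1] [20] [30] [22]
# word2 [5] [7] [0] [4]
```

Every list has one entry per file, so position i in each list always refers to the same file.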
Tags: python, data-structures, dictionary