How to Count Word Frequencies with Python: Counting Word Frequencies Efficiently in Python

weixin_40005887

I'd like to count frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file looks like this:

# test.txt

aaa bbb ccc

bbb

I've implemented this in pure Python following some posts. However, I've found that the pure-Python approaches are too slow because the file is huge (> 1 GB).

Borrowing sklearn's power looks like one candidate.

If you let CountVectorizer count frequencies for each line, I guess you can get the word frequencies by summing each column. But that sounds like a rather indirect way.

What is the most efficient and straightforward way to count words in a file with Python?

Update

My (very slow) code is here:

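import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    # Python 3 table that deletes punctuation (Python 2: line.translate(None, string.punctuation))
    strip_punct = str.maketrans('', '', string.punctuation)
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(strip_punct)
            this_wordcount = Counter(line.split())
            # Slow: builds a brand-new dict over every known word for each line
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}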

Solution

The most succinct approach is to use the tools Python gives you.

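try:
    from future_builtins import map  # Python 2 only: lazy map
except ImportError:
    pass  # Python 3: the built-in map is already lazy
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        # Split each line into words lazily and count the flattened stream
        return Counter(chain.from_iterable(map(str.split, f)))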

That's it. map(str.split, f) creates a lazy iterator that yields a list of words for each line. Wrapping it in chain.from_iterable turns that into a single iterator that produces one word at a time. Counter takes an input iterable and counts every unique value in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts. While it is being built, only one line of data and the running totals are held in memory, not the whole file.
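To make the three stages concrete, here is a minimal sketch (Python 3) that runs the same pipeline over an in-memory list instead of a file. The name sample_lines is a hypothetical stand-in for the lines of test.txt from the question.

```python
from collections import Counter
from itertools import chain

# Hypothetical in-memory stand-in for the lines of test.txt
sample_lines = ["aaa bbb ccc\n", "bbb\n"]

# Step 1: map lazily yields one list of words per line
per_line = map(str.split, sample_lines)

# Step 2: chain.from_iterable flattens those lists into one stream of words
words = chain.from_iterable(per_line)

# Step 3: Counter consumes the stream and tallies each distinct word
counts = Counter(words)
print(dict(counts))
```

Nothing is read or split until Counter starts pulling words from the chain, which is why the file version never holds more than one line at a time.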

In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int), because Counter is implemented in Python there, which can make it slower in some cases. Letting Counter do the work is simpler and more self-documenting, though (the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
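For comparison, the manual alternative mentioned above might look roughly like the following sketch (Python 3). The function name count_words_manually is hypothetical, and it takes any iterable of lines so it can be tried without a file.

```python
from collections import defaultdict
from itertools import chain

def count_words_manually(lines):
    """Tally words with an explicit loop; `lines` is any iterable of strings."""
    counts = defaultdict(int)  # missing keys start at 0
    for word in chain.from_iterable(map(str.split, lines)):
        counts[word] += 1
    return dict(counts)

print(count_words_manually(["aaa bbb ccc\n", "bbb\n"]))
```

Whether this beats Counter depends on the interpreter version, so it is worth timing both on your own data before picking one.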

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

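from string import punctuation

# Python 3 translation table that deletes punctuation
# (on Python 2, use line.translate(None, punctuation) instead)
strip_punct = str.maketrans('', '', punctuation)

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(strip_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))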

Your code runs much more slowly because it creates and destroys many small Counter and set objects, and add_merge_two_dict rebuilds the entire result dict for every line, instead of calling .update on a single Counter once per line (which, while slightly slower than the updated code block above, would at least scale in an algorithmically similar way).
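As a rough illustration of the ".update once per line" variant, here is a sketch in Python 3. The names count_with_update and STRIP_PUNCT are hypothetical, and the function accepts any iterable of lines (an open file object works too).

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (Python 3 form)
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def count_with_update(lines):
    """Keep one Counter and update it in place for every line."""
    wordcount = Counter()
    for line in lines:
        # update() adds this line's words to the existing tallies
        # instead of building a new dict each time
        wordcount.update(line.translate(STRIP_PUNCT).lower().split())
    return wordcount

print(dict(count_with_update(["Aaa, bbb ccc.\n", "BBB!\n"])))
```

The work per line here is proportional to the number of words on that line, whereas the merge-based version touches every word seen so far on each iteration.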