This sounds like a job for collections.Counter:

import collections
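# Read the whole file and count whitespace-separated tokens.
with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())
# Single-argument print(...) with % formatting runs on both Python 2 and 3.
print("'Four' appears %d times" % c['Four'])
print("'the' appears %d times" % c['the'])
print("There are %d total words" % sum(c.values()))
print("The 5 most common words are %s" % (c.most_common(5),))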
Result:

$ python foo.py
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]
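Of course, this counts tokens such as "liberty," and "this." as words (note the punctuation attached to them). It also treats "The" and "the" as different words. And reading the whole file in one go can be a problem for very large files, since the entire contents have to fit in memory.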
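To make the first two caveats concrete, here is a minimal sketch using a made-up sample string (the text and the variable names `sample` and `counts` are illustrative assumptions, not taken from gettysburg.txt). It only shows how a plain `split()` keeps punctuation and case in the keys:

```python
import collections

# Hypothetical sample text, chosen only to show the effect.
sample = "The people saw the flag. We can not forget this."

# Same approach as above: split on whitespace and count the tokens.
counts = collections.Counter(sample.split())

# Punctuation stays attached, so the key is "this." rather than "this".
print(counts["this."])  # 1
print(counts["this"])   # 0 (a Counter returns 0 for missing keys)

# Case is significant, so "The" and "the" are separate keys.
print(counts["The"], counts["the"])  # 1 1
```

Any lookup for a bare word such as `counts["this"]` silently returns 0 here, which is why normalizing case and stripping punctuation before counting matters.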
Here is a version that ignores punctuation and case, and is more memory-efficient on large files.

import collections
import re
with open('gettysburg.txt')