Word Count
Let's begin with a small experiment: counting words. First we need a piece of text to work with:
As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.
We save it as input.txt. The goal is to find the 10 most frequently used words and the 10 least frequently used words in the text, which takes a few steps:
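- 1. Convert the text to lowercase, because "car" and "Car" should count as the same word.
- 2. Use a regular expression to strip punctuation. It is best to replace each punctuation mark with a space, so that words separated only by punctuation are not merged.
- 3. Split the text on whitespace, which returns a list of words.
- 4. Count the words with a dictionary; `from collections import defaultdict` can be used for the counting as well.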
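Steps 1 to 3 can be tried on a short string before applying them to the whole file. This is only a minimal sketch: the `sample` sentence is made up for illustration and is not taken from input.txt.

```python
import re

# Hypothetical sample sentence, not part of input.txt
sample = "The car stopped. Car, the CAR!"

# Step 1: lowercase so that "car" and "Car" count as the same word
lowered = sample.lower()

# Step 2: replace every character that is not a letter or digit with a space
cleaned = re.sub(r"[^a-z0-9]", " ", lowered)

# Step 3: split on whitespace to get a list of words
words = cleaned.split()

print(words)  # ['the', 'car', 'stopped', 'car', 'the', 'car']
```

Because `split()` with no argument collapses runs of whitespace, the extra spaces left behind by the substitution do not produce empty strings in the list.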
We start with the word-count function itself:
import re

def count_word(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()
    # Convert to lowercase
    text = text.lower()
    # Replace every non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Split the string on whitespace
    words = text.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
def test_run():
    with open("input.txt", "r") as f:
        text = f.read()
    counts = count_word(text)
    # Sort the (word, count) pairs by count, highest first
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))
    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))

if __name__ == "__main__":
    test_run()
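Step 4 mentions `collections.defaultdict` as an alternative to a plain dictionary. The sketch below shows how that variant might look. The function name `count_word_defaultdict` is hypothetical and chosen here only to distinguish it from `count_word`; the cleaning steps are the same as above.

```python
import re
from collections import defaultdict


def count_word_defaultdict(text):
    """Count how many times each unique word occurs in text."""
    # defaultdict(int) returns 0 for a missing key, so no membership check is needed
    counts = defaultdict(int)
    # Lowercase, replace non-alphanumeric characters with spaces, then split
    words = re.sub(r"[^a-z0-9]", " ", text.lower()).split()
    for word in words:
        counts[word] += 1
    # Convert back to a plain dict so callers see an ordinary mapping
    return dict(counts)


print(count_word_defaultdict("Car, car and bus"))  # {'car': 2, 'and': 1, 'bus': 1}
```

The only real difference from the plain-dictionary version is that the `if word not in counts` branch disappears; the resulting counts are the same.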