NLP Notes --- 1. Word Count

xf8964

Category: 优达NLP笔记 (Udacity NLP Notes)

Article link: https://blog.csdn.net/xf8964/article/details/88919795

Word Count

Let's start with a small experiment: word counting. First, we need some text data:
As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.
Save it as input.txt. We want to find the 10 most frequently used words in the text and the 10 least frequently used ones. This takes a few steps:

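1. Convert the text to lowercase, because car and Car should count as the same word.
2. Use a regular expression to strip punctuation. It is best to replace each punctuation mark with a space, so that words separated only by punctuation (such as "ham—plain" in the text above) are not joined together.
3. Split the text on whitespace, which returns a list of words.
4. Count the words with a dictionary; collections.defaultdict (from collections import defaultdict) also works.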
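To make the intermediate results of steps 1 to 3 visible, here is a minimal sketch on a short string. The sample sentence is made up for illustration and is not part of input.txt.

```python
import re

# Hypothetical sample text, chosen only to illustrate the steps
sample = "Car, car! The car."

# Step 1: lowercase, so "Car" and "car" become the same token
lowered = sample.lower()

# Step 2: replace everything that is not a letter or digit with a space
cleaned = re.sub(r"[^a-z0-9]", " ", lowered)

# Step 3: split on whitespace; runs of spaces produce no empty strings
words = cleaned.split()

print(lowered)  # car, car! the car.
print(words)    # ['car', 'car', 'the', 'car']
```

Calling split() with no argument collapses consecutive spaces, which is why replacing punctuation with spaces in step 2 leaves no empty entries in the list.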
Let's write the word-counting function first.

import re


def count_word(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # maps each word to its number of occurrences

    # Convert to lowercase
    text = text.lower()

    # Replace every non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Split the string on whitespace
    words = text.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts

def test_run():
    with open("input.txt", "r", encoding="utf-8") as f:
        text = f.read()

    counts = count_word(text)
    # Sort (word, count) pairs by count, highest first
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))

    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))


if __name__ == "__main__":
    test_run()
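Step 4 mentions that collections.defaultdict can replace the plain dictionary. The sketch below shows what that might look like, together with collections.Counter, a standard-library alternative that the original post does not use. The function names count_words_defaultdict and count_words_counter and the sample sentence are hypothetical, introduced only for this illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text, replace non-alphanumerics with spaces, and split."""
    return re.sub(r"[^a-z0-9]", " ", text.lower()).split()


def count_words_defaultdict(text):
    """Count words with defaultdict(int): a missing key starts at 0."""
    counts = defaultdict(int)
    for word in tokenize(text):
        counts[word] += 1  # no "if word not in counts" check needed
    return counts


def count_words_counter(text):
    """Count words with Counter, which also provides most_common()."""
    return Counter(tokenize(text))


# Hypothetical sample text, not the passage from input.txt
sample = "The cat saw the other cat. The end."

dd_counts = count_words_defaultdict(sample)
c_counts = count_words_counter(sample)

print(dd_counts["the"])          # 3
print(c_counts.most_common(2))   # [('the', 3), ('cat', 2)]
```

With defaultdict the if/else branch in the counting loop disappears, and Counter.most_common(n) returns the n highest counts directly, so the explicit sorted() call is not needed for the most common words.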