Natural Language Processing Study Notes 4: Word Frequency Statistics with nltk FreqDist, ConditionalFreqDist and tabulate, with a Car Review Example

Most recently recommended article published 2023-01-19 10:48:05

zhuzuwei

Category: Natural Language Processing. Tags: nltk, FreqDist, ConditionalFreqDist, tabulate, word frequency statistics

Article link: https://blog.csdn.net/zhuzuwei/article/details/80487707

1. Load the helper functions and prepare the data

import nltk
import jieba
import numpy as np
import pandas as pd
import re
# Load the review data
def load_comments(filename):
    df = pd.read_csv(filename, encoding='gbk')
    # Keep only non-empty strings; pandas reads missing cells as NaN (a float),
    # and str(NaN) is 'nan', so a length check on str(a) would not filter them out
    pos_comments = [a for a in df['advance'] if isinstance(a, str) and a.strip()]
    neg_comments = [a for a in df['disadvance'] if isinstance(a, str) and a.strip()]
    # Label positive comments 1.0 and negative comments -1.0
    pos_labels = np.ones(len(pos_comments)).tolist()
    neg_labels = (-np.ones(len(neg_comments))).tolist()

    return pos_comments, neg_comments, pos_labels, neg_labels

# Tokenize each review and return a list of word lists (one list per review)
def get_words_list(comments_list):
    words_list = []
    for comment in comments_list:
        if isinstance(comment, str):
            # Strip digits, Latin letters, and some punctuation (. ; " ')
            new_comment = re.sub(r'[0-9.;"\']', '', comment)
            new_comment = re.sub(r'[a-zA-Z]+', '', new_comment)
            comment_words = jieba.lcut(new_comment)
            words_list.append(comment_words)

    return words_list
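To see what the cleaning step in get_words_list actually removes, here is a small sketch that applies equivalent substitutions to a made-up review string, without the jieba step. The helper name strip_noise and the sample text are assumptions for illustration only.

```python
import re

def strip_noise(comment):
    # Equivalent to the two substitutions in get_words_list:
    # first drop digits and the characters . ; " ' and then drop Latin letters
    cleaned = re.sub(r'[0123456789.;"\']', '', comment)
    cleaned = re.sub(r'[a-zA-Z]+', '', cleaned)
    return cleaned

sample = "空间大, 油耗8.5L; 动力OK"  # made-up review text
print(strip_noise(sample))
```

Note that the comma and the spaces survive, because the character class only lists the period, semicolon and the two quote characters. Other punctuation would need to be added to the pattern if it should be removed before tokenizing.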

Read the data and convert it to word lists
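The code for this step is not shown in the post. As a rough sketch of one way to get from the nested output of get_words_list to the flat posWords and negWords sequences used later, the per-review lists can be flattened with itertools.chain. The token lists below are made-up stand-ins for the result of get_words_list, and the names pos_words_list and neg_words_list are assumptions.

```python
from itertools import chain

# Stand-ins for get_words_list(pos_comments) and get_words_list(neg_comments);
# the tokens are made up for illustration.
pos_words_list = [['空间', '大'], ['动力', '强', '空间', '足']]
neg_words_list = [['油耗', '高'], ['噪音', '大']]

# Flatten the per-review lists into one flat list of words per polarity
posWords = list(chain.from_iterable(pos_words_list))
negWords = list(chain.from_iterable(neg_words_list))

print(posWords)
print(len(posWords), len(negWords))
```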

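2. FreqDist(words_list): takes a list as its argument and returns a dictionary-like object in which each key is an element and each value is the number of times that element occurs. The list must be flat (a list of words, not a list of lists), because the elements are used as dictionary keys.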
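A minimal sketch of FreqDist on a flat list of tokens. The word list is made up for illustration and is not taken from the car review data.

```python
from nltk import FreqDist

words = ['空间', '大', '动力', '强', '空间', '足']  # made-up tokens
fd = FreqDist(words)

print(fd['空间'])         # count of a single word
print(fd.N())             # total number of tokens counted
print(fd.B())             # number of distinct words
print(fd.most_common(1))  # the most frequent word with its count
print(fd['油耗'])         # a word that never occurs has count 0
```

Because FreqDist behaves like a dictionary of counts, looking up an unseen word returns 0 rather than raising a KeyError.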

3. ConditionalFreqDist: takes a list of pairs as input. Every event has to be associated with a condition, so each item is a tuple of the form (condition, event).
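As a sketch of the pair-based input, the (condition, event) tuples can be passed straight to the constructor. The word lists here are made up, and the variable names pairs and cfd are assumptions for illustration.

```python
from nltk import ConditionalFreqDist

posWords = ['空间', '大', '动力', '强', '空间', '足']  # made-up tokens
negWords = ['油耗', '高', '噪音', '大']               # made-up tokens

# One (condition, event) tuple per word
pairs = [('pos', w) for w in posWords] + [('neg', w) for w in negWords]
cfd = ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))
print(cfd['pos']['空间'])
print(cfd['neg'].most_common(1))
```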

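Building the distribution: the cond_word_fd built by the code below is a dictionary-like object whose keys are the conditions, here 'pos' and 'neg', i.e. positive reviews and negative reviews.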

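Each value is a FreqDist object holding the word counts for the positive or the negative reviews. Its keys are the distinct words and its values are how often each word occurs in the positive or negative reviews.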

cond_word_fd = nltk.ConditionalFreqDist()  # counts word frequencies separately for positive and negative text
for word in posWords:
    cond_word_fd['pos'][word] += 1

for word in negWords:
    cond_word_fd['neg'][word] += 1
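Once the counts are in place, the tabulate method prints them as a table with one row per condition. The sketch below repeats the incremental construction on made-up word lists so that it runs on its own. The selected samples are arbitrary, and columns holding Chinese characters may not line up perfectly in every terminal.

```python
import nltk

posWords = ['空间', '大', '动力', '强', '空间', '足']  # made-up tokens
negWords = ['油耗', '高', '噪音', '大']               # made-up tokens

cond_word_fd = nltk.ConditionalFreqDist()
for word in posWords:
    cond_word_fd['pos'][word] += 1
for word in negWords:
    cond_word_fd['neg'][word] += 1

# Print the counts of a few chosen words under each condition
cond_word_fd.tabulate(conditions=['pos', 'neg'], samples=['空间', '大', '油耗'])
```

Restricting samples to a handful of words keeps the table readable. Without that argument, every distinct word becomes a column.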