Easy-to-Learn Text Analysis 3: Sentiment Analysis

爱做科研的桶

Last modified: 2024-08-30 16:46:30

Tags: programming languages, python, natural language processing, data analysis

First published: 2024-08-30 16:45:17

Article link: https://blog.csdn.net/llthxx/article/details/141721479

1. Preparation

1. Import the required libraries

import jieba
from collections import defaultdict
import pandas as pd

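jieba: a widely used Chinese word-segmentation library that splits a Chinese sentence into individual words.
defaultdict: imported from the collections module; a dictionary with a default value, which avoids the KeyError raised when an uninitialized key is accessed. The code in this article does not actually use it, so this import can be dropped.
pandas: a widely used data-processing library, typically used for data analysis and manipulation; here it writes the results to an Excel file.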

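2. Generate a new stop-word list that excludes the negation words and degree adverbs found in the original stop-word list

# Build the stop-word list, removing negation words and degree adverbs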
def generate_stopwords():
    with open('停用词.txt', 'r', encoding='utf-8') as fr:
        stopwords = {word.strip() for word in fr}

    with open('否定词.txt', 'r', encoding='utf-8') as not_word_file:
        not_word_list = {w.strip() for w in not_word_file}

    with open('程度副词.txt', 'r', encoding='utf-8') as degree_file:
        degree_list = {item.split(',')[0] for item in degree_file}

    with open('stopwords.txt', 'w', encoding='utf-8') as f:
        for word in stopwords - not_word_list - degree_list:
            f.write(word + '\n')
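
The effect of the set subtraction in generate_stopwords can be seen on a small in-memory example. This is only a sketch: the word lists below are made up for illustration, and the "word,weight" layout of the degree-adverb file is an assumption inferred from the split(',') call above.

```python
# Toy data standing in for the three input files (contents are illustrative only).
raw_stopwords = {"的", "了", "不", "很", "是"}
negation_words = {"不", "没有"}
degree_lines = ["很,1.5\n", "非常,2.0\n"]  # assumed "word,weight" line format

# Keep only the word part of each degree-adverb line.
degree_words = {line.split(",")[0].strip() for line in degree_lines}

# Negation words and degree adverbs must survive stop-word filtering;
# otherwise they could never influence the sentiment score later on.
filtered_stopwords = raw_stopwords - negation_words - degree_words

print(sorted(filtered_stopwords))
```

Here "不" and "很" are removed from the stop-word set, so they stay in the segmented sentence and can later be recognized as a negation word and a degree adverb.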

2. The `seg_word` and `classify_words` functions

1. The `seg_word` function

def seg_word(sentence):
    with open('stopwords.txt', 'r', encoding='utf-8') as fr:
        stopwords = {i.strip() for i in fr}
    return [word for word in jieba.cut(sentence) if word not in stopwords]

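Purpose: segment the input sentence into words and remove the stop words.
Details:
1. Open and read the stop-word list:
  - The code opens stopwords.txt, reads it line by line, and builds the stop-word set stopwords. This is the file written by generate_stopwords(), so that function has to be run at least once beforehand.
2. Segment the sentence:
  - jieba.cut(sentence) segments the input sentence; it returns a generator of words rather than a list.
3. Filter out the stop words:
  - The list comprehension iterates over the segmentation result, keeps the words that are not in the stop-word set, and returns the remaining meaningful words as a list.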
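
Because seg_word re-reads stopwords.txt on every call, one possible refinement is to load the file once and reuse it. The sketch below illustrates that idea under stated assumptions: load_stopwords and filter_tokens are hypothetical helper names, dropping whitespace-only tokens is an extra step not present in the original function, and the demo uses a pre-tokenized sentence so that it runs without jieba.

```python
from functools import lru_cache
from pathlib import Path
import tempfile


@lru_cache(maxsize=None)
def load_stopwords(path):
    """Read the stop-word file once; later calls with the same path reuse the cached set."""
    with open(path, "r", encoding="utf-8") as fr:
        return frozenset(line.strip() for line in fr if line.strip())


def filter_tokens(tokens, stopwords):
    """Drop stop words and whitespace-only tokens from a sequence of tokens."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Demo with a temporary stop-word file and an already segmented sentence.
with tempfile.TemporaryDirectory() as tmp:
    stop_path = str(Path(tmp) / "stopwords.txt")
    Path(stop_path).write_text("的\n了\n", encoding="utf-8")
    stopwords = load_stopwords(stop_path)
    tokens = ["今天", "的", "天气", "不", "好", " "]
    kept = filter_tokens(tokens, stopwords)

print(kept)
```

With jieba installed, the same helpers could be combined as filter_tokens(jieba.cut(sentence), load_stopwords('stopwords.txt')).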

2. The `classify_words` function

def classify_words(word_list):
    with open('BosonNLP_sentiment_score.txt', 'r', encoding='utf-8') as sen_file:
        sen_dict = {line.split(' ')[0]: line.split(' ')[1] for line in sen_file if len(line.split(' ')) == 2}

    with open('否定词.txt', 'r', encoding='utf-8') as not_word_file:
        not_word_list = {line.strip() for line in not_word_file}

    with open('程度副词.txt', 'r', encoding='utf-8') as degree_file:
        degree_dict = {line.split(',')[0]: line.split(',')[1] for line in degree_file}

    sen_word, not_word, degree_word = {}, {}, {}
    for i, word in enumerate(word_list):
        if word in sen_dict and word not in not_word_list and word not in degree_dict:
            sen_word[i] = sen_dict[word]
        elif word in not_word_list:
            not_word[i] = -1
        elif word in degree_dict:
            degree_word[i] = degree_dict[word]

    return sen_word, not_word, degree_word

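Purpose: classify the words in word_list into three categories: sentiment words, negation words, and degree adverbs.
Details:
1. Read the sentiment lexicon:
  - Open BosonNLP_sentiment_score.txt and, for each line made of exactly two space-separated fields, take the first field as the sentiment word and the second as its sentiment score, building the dictionary sen_dict. The scores are stored as strings and are converted to float later, in score_sentiment.
2. Read the negation-word list:
  - Open 否定词.txt and build the set of negation words, not_word_list.
3. Read the degree-adverb lexicon:
  - Open 程度副词.txt and, for each comma-separated line, take the first field as the degree adverb and the second as its weight, building the dictionary degree_dict.
4. Classify the words:
  - Initialize three empty dictionaries, each keyed by a word's position in word_list: sen_word maps the position of a sentiment word to its score, not_word maps the position of a negation word to the fixed value -1, and degree_word maps the position of a degree adverb to its weight.
  - Iterate over word_list and classify each word:
    - If the word is in the sentiment lexicon and is neither a negation word nor a degree adverb, add it to sen_word.
    - Otherwise, if the word is in the negation-word list, add its position with the value -1 to not_word.
    - Otherwise, if the word is in the degree-adverb lexicon, add its position and weight to degree_word.
5. Return the classification:
  - The function returns the three dictionaries, which hold the positions and attributes of the sentiment words, negation words, and degree adverbs in the sentence.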
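
The classification step can be tried out without the lexicon files. The following is a minimal sketch, not the article's implementation: parse_lexicon and classify are hypothetical helpers, the scores and weights are invented, and the three checks are applied in a different but equivalent order to the one in classify_words.

```python
def parse_lexicon(lines, sep):
    """Parse 'word<sep>value' lines into {word: float}, skipping malformed lines."""
    lexicon = {}
    for line in lines:
        parts = line.strip().split(sep)
        if len(parts) != 2:
            continue  # blank line or wrong number of fields
        try:
            lexicon[parts[0]] = float(parts[1])
        except ValueError:
            continue  # second field is not a number
    return lexicon


def classify(tokens, sen_dict, not_words, degree_dict):
    """Map token positions to sentiment scores, negation markers, and degree weights."""
    sen_word, not_word, degree_word = {}, {}, {}
    for i, word in enumerate(tokens):
        if word in not_words:
            not_word[i] = -1
        elif word in degree_dict:
            degree_word[i] = degree_dict[word]
        elif word in sen_dict:
            sen_word[i] = sen_dict[word]
    return sen_word, not_word, degree_word


# Toy lexicons (values are made up for illustration).
sen_dict = parse_lexicon(["好 2.0\n", "糟糕 -3.0\n", "\n"], " ")
degree_dict = parse_lexicon(["很,1.5\n"], ",")
not_words = {"不"}

tokens = ["服务", "不", "好", "味道", "很", "糟糕"]
sen_word, not_word, degree_word = classify(tokens, sen_dict, not_words, degree_dict)
print(sen_word, not_word, degree_word)
```

Converting the values to float while parsing is a design choice of this sketch; classify_words keeps them as strings and leaves the conversion to score_sentiment.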

3. Computing the sentiment score

1. The `score_sentiment` function

def score_sentiment(sen_word, not_word, degree_word, seg_result):
    W, score = 1, 0
    sentiment_indices = list(sen_word.keys())

    for i in range(len(seg_result)):
        if i in sen_word:
            score += W * float(sen_word[i])
            if sentiment_indices.index(i) < len(sentiment_indices) - 1:
                for j in range(i + 1, sentiment_indices[sentiment_indices.index(i) + 1]):
                    if j in not_word:
                        W *= -1
                    elif j in degree_word:
                        W *= float(degree_word[j])

    return score

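Purpose: compute the sentiment score of a sentence.
Details:
1. Initialization:
  - W starts at 1 and holds the current weight; negation words and degree adverbs change it.
  - score starts at 0 and accumulates the sentiment score of the sentence.
  - sentiment_indices is the list of positions of the sentiment words in the sentence.
2. Iterate over the segmentation result:
  - For every position i in seg_result, if the word at that position is a sentiment word (i in sen_word), add the current weight W times that word's score to score.
  - Look up where this sentiment word sits in sentiment_indices; if it is not the last sentiment word, examine the words between it and the next sentiment word.
  - For each of those words, if it is a negation word (j in not_word), multiply W by -1; if it is a degree adverb (j in degree_word), multiply W by that adverb's weight. The updated W then applies to the next sentiment word. Because W is never reset, modifiers accumulate across the whole sentence, and negation words or degree adverbs that appear before the first sentiment word or after the last one have no effect on the score.
3. Return the final score:
  - The function returns the accumulated sentiment score, score.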
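
To make the weighting rule concrete, the sketch below re-implements the same loop with a trace of each step. score_with_trace is a hypothetical helper written for this illustration, and the positions, scores, and weights are invented toy values for the segmented sentence ["服务", "不", "好", "味道", "很", "糟糕"].

```python
def score_with_trace(sen_word, not_word, degree_word):
    """Apply the same weighting rule as score_sentiment and record each step.

    Returns (score, trace), where trace holds one tuple
    (position, weight_used, running_score) per sentiment word.
    """
    W, score, trace = 1.0, 0.0, []
    indices = sorted(sen_word)  # positions of the sentiment words, ascending
    for pos, i in enumerate(indices):
        score += W * float(sen_word[i])
        trace.append((i, W, score))
        if pos < len(indices) - 1:
            # Modifiers between this sentiment word and the next one
            # change the weight that the next sentiment word receives.
            for j in range(i + 1, indices[pos + 1]):
                if j in not_word:
                    W *= -1
                elif j in degree_word:
                    W *= float(degree_word[j])
    return score, trace


# "好" at position 2 (score 2.0), "糟糕" at position 5 (score -3.0),
# "不" at position 1 (negation), "很" at position 4 (weight 1.5).
sen_word = {2: 2.0, 5: -3.0}
not_word = {1: -1}
degree_word = {4: 1.5}

score, trace = score_with_trace(sen_word, not_word, degree_word)
print(score)
print(trace)
```

In this trace "好" is scored with weight 1 even though "不" precedes it, because the loop only looks at modifiers that come after a sentiment word; "糟糕" then receives the weight 1.5 contributed by "很".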

2. The `sentiment_score` function

def sentiment_score(sentence):
    seg_list = seg_word(sentence)
    sen_word, not_word, degree_word = classify_words(seg_list)
    return score_sentiment(sen_word, not_word, degree_word, seg_list)

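Purpose: run sentiment analysis on a single sentence and return its sentiment score.
Details:
1. Segment:
  - seg_word(sentence) segments the sentence and returns the word list seg_list.
2. Classify the words:
  - classify_words(seg_list) sorts the segmented words into sentiment words, negation words, and degree adverbs.
3. Compute the sentiment score:
  - score_sentiment computes the sentiment score from the classified words, and that score is returned.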

3. The `analyze_file` function

def analyze_file(filepath):
    total_score, line_count = 0, 0
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:
                total_score += sentiment_score(line)
                line_count += 1
    average_score = total_score / line_count if line_count > 0 else 0
    return total_score, average_score

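Purpose: run sentiment analysis on every sentence in a text file and compute the total and average sentiment scores. Each non-empty line of the file is treated as one sentence.
Details:
1. Initialization:
  - total_score starts at 0 and accumulates the sentiment scores of all sentences in the file.
  - line_count starts at 0 and counts the sentences in the file.
2. Read the file:
  - Open the file filepath and read it line by line.
  - Use strip() to remove leading and trailing whitespace from each line.
  - If the line is not empty, call sentiment_score(line) to compute its sentiment score, add the result to total_score, and increment line_count by 1.
3. Compute the average score:
  - If line_count is greater than 0, average_score is total_score divided by line_count; otherwise the average is 0.
4. Return the results:
  - The function returns two values: total_score (the total sentiment score of the file) and average_score (the average sentiment score per sentence).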
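
The aggregation logic can be checked separately from the lexicon-based scorer. The sketch below is an illustration under stated assumptions: analyze_lines is a hypothetical variant of analyze_file that accepts any scoring function, and toy_scorer is a stand-in that is unrelated to the BosonNLP lexicon.

```python
import tempfile
from pathlib import Path


def analyze_lines(lines, scorer):
    """Return (total, average) of scorer(line) over the non-empty lines."""
    scores = [scorer(line.strip()) for line in lines if line.strip()]
    total = sum(scores)
    average = total / len(scores) if scores else 0
    return total, average


def toy_scorer(text):
    """Stand-in scorer for the demo: +1 for each "好", -1 for each "坏"."""
    return text.count("好") - text.count("坏")


# Demo on a temporary file that contains blank and whitespace-only lines.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("好好\n\n坏\n   \n好\n", encoding="utf-8")
    with open(path, "r", encoding="utf-8") as f:
        total, average = analyze_lines(f, toy_scorer)

print(total, average)
```

Only the three non-empty lines are counted, so the blank lines affect neither the total nor the average. Passing sentiment_score as the scorer would reproduce the behavior of analyze_file.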

4. Running the functions and saving the results

1. Running the main program

filepath = 'gpt4_system2_80%.txt'
total_score, average_score = analyze_file(filepath)
print(f"总情感得分: {total_score}")    # total sentiment score
print(f"平均情感得分: {average_score}")  # average sentiment score

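Purpose: run the sentiment analysis and print the results.
Details:
1. Specify the file path:
  - filepath = 'gpt4_system2_80%.txt' gives the path and name of the text file to analyze.
2. Call the analysis function:
  - total_score, average_score = analyze_file(filepath) calls the analyze_file function defined earlier with filepath and receives two values: the file's total sentiment score, total_score, and its average sentiment score, average_score.
3. Print the results:
  - print(f"总情感得分: {total_score}") and print(f"平均情感得分: {average_score}") print the total and average sentiment scores to the console.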

2. Saving the results to Excel

df = pd.DataFrame({
    '组别': ['gpt4_system2_80%'],      # group
    '总情感得分': [total_score],        # total sentiment score
    '平均情感得分': [average_score]     # average sentiment score
})
df.to_excel('gpt4_system2_80%.xlsx', index=False)

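Purpose: save the sentiment analysis results to an Excel file.
Details:
1. Create the DataFrame:
  - df = pd.DataFrame({...}) uses pandas to create a data frame df with three columns:
    - '组别' (group): fixed to 'gpt4_system2_80%', the name of the data group being analyzed.
    - '总情感得分' (total sentiment score): holds the total_score computed above.
    - '平均情感得分' (average sentiment score): holds the average_score computed above.
2. Save as an Excel file:
  - df.to_excel('gpt4_system2_80%.xlsx', index=False) writes df to an Excel file named 'gpt4_system2_80%.xlsx'. The argument index=False keeps the row index out of the file. Writing .xlsx files requires an Excel engine such as openpyxl to be installed alongside pandas.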

Full code and results

import jieba
from collections import defaultdict
import pandas as pd


# Build the stop-word list, removing negation words and degree adverbs
def generate_stopwords():
    with open('停用词.txt', 'r', encoding='utf-8') as fr:
        stopwords = {word.strip() for word in fr}

    with open('否定词.txt', 'r', encoding='utf-8') as not_word_file:
        not_word_list = {w.strip() for w in not_word_file}

    with open('程度副词.txt', 'r', encoding='utf-8') as degree_file:
        degree_list = {item.split(',')[0] for item in degree_file}

    with open('stopwords.txt', 'w', encoding='utf-8') as f:
        for word in stopwords - not_word_list - degree_list:
            f.write(word + '\n')


def seg_word(sentence):
    with open('stopwords.txt', 'r', encoding='utf-8') as fr:
        stopwords = {i.strip() for i in fr}
    return [word for word in jieba.cut(sentence) if word not in stopwords]


def classify_words(word_list):
    with open('BosonNLP_sentiment_score.txt', 'r', encoding='utf-8') as sen_file:
        sen_dict = {line.split(' ')[0]: line.split(' ')[1] for line in sen_file if len(line.split(' ')) == 2}

    with open('否定词.txt', 'r', encoding='utf-8') as not_word_file:
        not_word_list = {line.strip() for line in not_word_file}

    with open('程度副词.txt', 'r', encoding='utf-8') as degree_file:
        degree_dict = {line.split(',')[0]: line.split(',')[1] for line in degree_file}

    sen_word, not_word, degree_word = {}, {}, {}
    for i, word in enumerate(word_list):
        if word in sen_dict and word not in not_word_list and word not in degree_dict:
            sen_word[i] = sen_dict[word]
        elif word in not_word_list:
            not_word[i] = -1
        elif word in degree_dict:
            degree_word[i] = degree_dict[word]

    return sen_word, not_word, degree_word


def score_sentiment(sen_word, not_word, degree_word, seg_result):
    W, score = 1, 0
    sentiment_indices = list(sen_word.keys())

    for i in range(len(seg_result)):
        if i in sen_word:
            score += W * float(sen_word[i])
            if sentiment_indices.index(i) < len(sentiment_indices) - 1:
                for j in range(i + 1, sentiment_indices[sentiment_indices.index(i) + 1]):
                    if j in not_word:
                        W *= -1
                    elif j in degree_word:
                        W *= float(degree_word[j])

    return score


def sentiment_score(sentence):
    seg_list = seg_word(sentence)
    sen_word, not_word, degree_word = classify_words(seg_list)
    return score_sentiment(sen_word, not_word, degree_word, seg_list)


def analyze_file(filepath):
    total_score, line_count = 0, 0
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:
                total_score += sentiment_score(line)
                line_count += 1
    average_score = total_score / line_count if line_count > 0 else 0
    return total_score, average_score


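# Main program
generate_stopwords()  # writes stopwords.txt, which seg_word reads on every call
filepath = 'gpt4_system2_80%.txt'
total_score, average_score = analyze_file(filepath)
print(f"总情感得分: {total_score}")    # total sentiment score
print(f"平均情感得分: {average_score}")  # average sentiment score

# Save the results to Excel
df = pd.DataFrame({
    '组别': ['gpt4_system2_80%'],      # group
    '总情感得分': [total_score],        # total sentiment score
    '平均情感得分': [average_score]     # average sentiment score
})
df.to_excel('gpt4_system2_80%.xlsx', index=False)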