How to Perform Chinese Sentiment Analysis in Python with a Custom Sentiment Lexicon

        Sentiment analysis is the task of automatically determining the emotional polarity of a text, such as positive, negative, or neutral, using natural language processing techniques. For Chinese text, sentiment analysis usually has to account for several factors, including stop words, degree words (intensity adverbs), and negation words. This article describes in detail how to build a simple Chinese sentiment analysis system.

1. First, we need to prepare some necessary tools and resources (sample file formats are sketched right after this list):
        jieba: a popular Chinese word-segmentation library.
        A stop-word list: used to remove uninformative words.
        A sentiment lexicon: positive and negative sentiment words with scores.
        Degree words: intensity adverbs used to scale sentiment scores.
        Negation words: used to handle negated phrases.
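The loader functions below assume plain UTF-8 text files with one entry per line; the lexicon and degree files are tab-separated. The sample lines below are an assumption for illustration rather than a standard format, so adapt them to whatever lexicon files you actually have:

# positiveWords.txt / negativeWords.txt: word<TAB>score (the negative file uses negative scores)
棒	3
糟糕	-3

# degreeWords.txt: word<TAB>multiplier (>1 amplifies, <1 dampens)
非常	2.0
稍微	0.5

# chineseStopWords.txt and negationWords.txt: one word per line
的
不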

2. Load the stop-word list
        A stop-word list contains common words such as 的 and 是 that contribute nothing to sentiment analysis and should be removed. It can be loaded with the following function:

def load_stopwords(file_path):
    # One stop word per line; a set gives fast membership tests
    with open(file_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    return stopwords
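A quick usage sketch (the file name matches the one used in the complete example later; any one-word-per-line file works):

stopwords = load_stopwords('chineseStopWords.txt')
print(len(stopwords))  # number of stop words loaded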

3. Load the sentiment lexicon
        The sentiment lexicon contains positive and negative sentiment words. We merge them into a single dictionary that maps each word to its sentiment score, with positive words carrying positive scores and negative words carrying negative scores. The following function loads it:

def load_sentiment_lexicon(positive_file, negative_file):
    # Merge both files into one dict: word -> integer score.
    # The negative file is expected to carry negative scores.
    lexicon = {}
    for file_path in (positive_file, negative_file):
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                word, score = line.split('\t')
                lexicon[word] = int(score)
    return lexicon
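A usage sketch, with the file names assumed earlier; the printed scores depend on your lexicon files:

lexicon = load_sentiment_lexicon('positiveWords.txt', 'negativeWords.txt')
print(lexicon.get('棒'), lexicon.get('糟糕'))  # e.g. 3 -3 with the sample files above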

4. Load the degree words
        Degree words scale the sentiment score: for example, 非常 ("very") increases it, while 稍微 ("slightly") reduces it. The following function loads them:

def load_degree_words(file_path):
    # word -> multiplier; values > 1 amplify, values < 1 dampen
    degree_words = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            word, multiplier = line.strip().split('\t')
            degree_words[word] = float(multiplier)
    return degree_words
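For example, with the sample degreeWords.txt sketched earlier:

degree_words = load_degree_words('degreeWords.txt')
print(degree_words.get('非常'))  # e.g. 2.0, so a following sentiment word scores double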

5. Load the negation words
        Negation words such as 不 ("not") and 没 ("have not") flip the polarity of the sentiment word that follows. The following function loads them:

def load_negation_words(file_path):
    # One negation word per line
    with open(file_path, 'r', encoding='utf-8') as f:
        negation_words = set(line.strip() for line in f)
    return negation_words

6. Text preprocessing
        Preprocessing consists of word segmentation and stop-word removal. We use jieba for segmentation and then filter out the stop words:

import jieba
import re

def preprocess_text(text, stopwords):
    text = re.sub(r'[^\u4e00-\u9fa5]', '', text)  # keep Chinese characters only
    words = jieba.lcut(text)
    filtered_words = [word for word in words if word not in stopwords]
    return filtered_words
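A usage sketch on the sample sentence used later (the exact tokens depend on jieba's dictionary and on your stop-word list, so the output is not fixed):

print(preprocess_text('这部电影真是太棒了,但是结局有点糟糕。', stopwords))
# a list of Chinese tokens with punctuation and stop words removed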

7. Compute the sentiment score
        Finally, we compute the sentiment score of the text, taking the effects of degree words and negation words into account. Note that the function takes the stop-word set as a parameter, since preprocess_text needs it:

def sentiment_score(text, lexicon, degree_words, negation_words, stopwords):
    words = preprocess_text(text, stopwords)
    score = 0
    modifier = 1           # accumulated degree multiplier for the next sentiment word
    negation_flag = False  # True if a negation word precedes the next sentiment word

    for word in words:
        if word in degree_words:
            modifier *= degree_words[word]
        elif word in negation_words:
            negation_flag = True
        elif word in lexicon:
            if negation_flag:
                score += lexicon[word] * modifier * -1  # negation flips the polarity
                negation_flag = False
            else:
                score += lexicon[word] * modifier
            modifier = 1  # reset the multiplier after each sentiment word

    return score
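To see how degree and negation interact, here is a hand-traced sketch of the scoring loop on the pre-segmented phrase 非常 / 不 / 好, with the lexicon entries assumed for illustration. Note that in the full pipeline jieba may well segment 不好 as a single token, which is why real lexicons often contain multi-character entries as well:

lexicon = {'好': 1}
degree_words = {'非常': 2.0}
negation_words = {'不'}

score, modifier, negation_flag = 0, 1.0, False
for word in ['非常', '不', '好']:
    if word in degree_words:
        modifier *= degree_words[word]      # 非常 -> modifier = 2.0
    elif word in negation_words:
        negation_flag = True                # 不 -> flip the next sentiment word
    elif word in lexicon:
        score += lexicon[word] * modifier * (-1 if negation_flag else 1)
        modifier, negation_flag = 1.0, False

print(score)  # 1 * 2.0 * -1 = -2.0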

8. Complete example code

import jieba
import re

# Load the stop-word list
def load_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    return stopwords

# Load the sentiment lexicon (both files merged into one dict)
def load_sentiment_lexicon(positive_file, negative_file):
    lexicon = {}
    for file_path in (positive_file, negative_file):
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                word, score = line.split('\t')
                lexicon[word] = int(score)
    return lexicon

# Load the degree words
def load_degree_words(file_path):
    degree_words = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            word, multiplier = line.strip().split('\t')
            degree_words[word] = float(multiplier)
    return degree_words

# Load the negation words
def load_negation_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        negation_words = set(line.strip() for line in f)
    return negation_words

# Text preprocessing: keep Chinese characters, segment, drop stop words
def preprocess_text(text, stopwords):
    text = re.sub(r'[^\u4e00-\u9fa5]', '', text)
    words = jieba.lcut(text)
    filtered_words = [word for word in words if word not in stopwords]
    return filtered_words

# Compute the sentiment score
def sentiment_score(text, lexicon, degree_words, negation_words, stopwords):
    words = preprocess_text(text, stopwords)
    score = 0
    modifier = 1
    negation_flag = False

    for word in words:
        if word in degree_words:
            modifier *= degree_words[word]
        elif word in negation_words:
            negation_flag = True
        elif word in lexicon:
            if negation_flag:
                score += lexicon[word] * modifier * -1
                negation_flag = False
            else:
                score += lexicon[word] * modifier
            modifier = 1  # reset the multiplier

    return score

# Sample text
text = "这部电影真是太棒了,但是结局有点糟糕。"

# File paths
stopwords_file = 'chineseStopWords.txt'
positive_file = 'positiveWords.txt'
negative_file = 'negativeWords.txt'
degree_file = 'degreeWords.txt'
negation_file = 'negationWords.txt'

# Load the stop-word list
stopwords = load_stopwords(stopwords_file)

# Load the sentiment lexicon (the function already merges both files)
lexicon = load_sentiment_lexicon(positive_file, negative_file)

# Load the degree words
degree_words = load_degree_words(degree_file)

# Load the negation words
negation_words = load_negation_words(negation_file)

# Compute the sentiment score
score = sentiment_score(text, lexicon, degree_words, negation_words, stopwords)

print(f"The sentiment score of the text is: {score}")

For this kind of sentiment-analysis task, you can also combine the `jieba` library for Chinese word segmentation, the `pandas` library for reading and processing Excel files, and the `snownlp` (or `TextBlob`) library for sentiment analysis. Here is a simple example:

```python
import jieba
import pandas as pd
from snownlp import SnowNLP

# Read the Excel file
df = pd.read_excel('data.xlsx')

# Load a custom user dictionary
jieba.load_userdict('user_dict.txt')

# Word segmentation
df['content_cut'] = df['content'].apply(lambda x: ' '.join(jieba.cut(x)))

# Sentiment analysis
df['sentiment'] = df['content_cut'].apply(lambda x: SnowNLP(x).sentiments)

# Print the result
print(df)
```

Here, `user_dict.txt` is a custom dictionary file whose entries must follow jieba's user-dictionary format. The `sentiments` attribute of a `SnowNLP` object returns a sentiment score between 0 and 1: the closer to 1, the more positive the text; the closer to 0, the more negative.

For analyzing hot-topic trends, you can use the `wordcloud` library to draw a word cloud, combined with `jieba.analyse` to extract keywords:

```python
import jieba
import jieba.analyse
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the text file
with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Extract keywords
keywords = jieba.analyse.extract_tags(text, topK=50)

# Generate the word cloud
wordcloud = WordCloud(font_path='msyh.ttf', width=800, height=600,
                      background_color='white').generate(' '.join(keywords))

# Display the word cloud
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Here, `data.txt` is the text file to analyze, and the `topK` parameter controls how many keywords are extracted. `WordCloud` generates the image from the keywords: `font_path` specifies a font file (needed for Chinese characters) and `background_color` sets the background color. Finally, `matplotlib` displays the word cloud.