Weibo Public Opinion Analysis Based on Web Crawling

苏州河247

Last modified on 2024-08-29 21:29:37

Tags: python, crawler

First published on 2024-08-29 18:42:27

Article link: https://blog.csdn.net/2403_87013898/article/details/141679977

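This exercise uses a web crawler to collect text data under relevant Sina Weibo search terms, applies the SnowNLP model to analyze sentiment polarity, and then runs LDA topic modeling on the comments in each sentiment group.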

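1. Applying the Crawler

Download

The data is crawled with weibo-search, an open-source Scrapy crawler project on GitHub. Download it from: GitHub - dataabc/weibo-search (fetches Weibo search results; the search can be either a keyword search or a topic search).

After downloading the scripts from GitHub, open a terminal in the directory that contains requirements.txt and run pip install -r requirements.txt to install the dependencies.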

Getting the cookie

Log in to Weibo, press F12 to open the developer tools, and type cookie into the search box to find your own cookie value.

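Setting the search keywords

Open settings.py and paste the cookie value obtained above directly into it.

Adjust the remaining parameters by following the project's usage notes and the comments in the file.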
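As a rough illustration, the relevant part of settings.py might look like the fragment below. The setting names follow the weibo-search README as best understood here and every value is a placeholder, so treat all of them as assumptions and check them against your own copy of the project.

```python
# Sketch of entries in weibo/settings.py.
# Names are assumed from the weibo-search README; all values are placeholders.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cookie': 'your_cookie_here',   # paste the cookie copied from the browser
}

KEYWORD_LIST = ['流浪动物']    # keywords or topics to search (example value)
WEIBO_TYPE = 1                # which kind of posts to collect (codes are listed in the README)
CONTAIN_TYPE = 0              # filter by attached media (codes are listed in the README)
REGION = ['全部']              # publishing regions; '全部' means all regions
START_DATE = '2024-08-01'     # start of the search window, yyyy-mm-dd (placeholder)
END_DATE = '2024-08-29'       # end of the search window, yyyy-mm-dd (placeholder)
```

The cookie expires after a while, which is why it is the first thing to refresh when a later run returns nothing.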

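Running

In the terminal, run scrapy crawl search, or scrapy crawl search -s JOBDIR=crawls/search, to start crawling. The run is successful when the terminal shows Scrapy's crawl output.

If you see the error 'weibo' is not a package, try creating a fresh environment. If a run several days later returns no results, try updating the cookie in settings.py again.

If you choose to output the results as an Excel file, a folder containing the crawled data is created under the current directory.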

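2. Data Preprocessing

Checking for missing and duplicate values

Duplicate check.

# data is the pandas DataFrame loaded from the crawler output
duplicates = data.duplicated()   # True for every row that repeats an earlier row
duplicates.sum()                 # number of duplicated rows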

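Missing-value check.

data_cleaned = data.drop_duplicates()    # drop the duplicated rows found above
data_cleaned.isnull().sum()              # number of missing values per column
data_cleaned = data_cleaned.dropna()     # dropna() returns a new DataFrame, so keep the result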
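To see the two checks end to end, here is a minimal sketch on a toy table. The column names and values are made up for illustration; the real crawler output has different columns.

```python
import pandas as pd

# Toy stand-in for the crawled table; column names and values are hypothetical.
data = pd.DataFrame({
    "text": ["good", "good", None, "bad"],
    "ip": ["Beijing", "Beijing", "Shanghai", None],
})

n_duplicates = int(data.duplicated().sum())        # rows identical to an earlier row
data_cleaned = data.drop_duplicates()
missing_per_column = data_cleaned.isnull().sum()   # missing cells per column
# Drop incomplete rows and renumber the index so later positional lookups line up.
data_cleaned = data_cleaned.dropna().reset_index(drop=True)

print(n_duplicates)
print(missing_per_column.to_dict())
print(len(data_cleaned))
```

Resetting the index at the end is optional, but it keeps row labels equal to row positions, which matters later when rows are dropped by position.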

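Data cleaning

The following steps draw on the CSDN blog post 文本聚类（一）—— LDA 主题模型.

Remove emoji and other symbols, keeping only Chinese characters, English letters, and digits.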

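import re

def clear_character(sentence):
    # Keep only Chinese characters, ASCII letters, and digits.
    # Inside [...] only the leading ^ negates; the extra ^ in the earlier pattern were literal carets.
    pattern = re.compile('[^\u4e00-\u9fa5a-zA-Z0-9]')
    # Whitespace is removed by the pattern as well, so no separate split/join is needed.
    return re.sub(pattern, '', sentence)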

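Remove topic tags, i.e. everything between a pair of # markers, using a regular expression. Run this before clear_character, which would otherwise strip the # markers that the pattern relies on.

re.sub(r'#.*?#', '', word)   # Weibo topics are written as #topic#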

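Also remove some fixed terms. For example, I wanted to study public attitudes toward stray animals; words such as 流浪动物 (stray animals), 猫 (cat), and 狗 (dog) are bound to appear throughout the discussion but add little to the analysis.

def remove_words(text, words_to_remove):
    # Strip #topic# spans from the text first, then delete each fixed word
    text = re.sub(r'#.*?#', '', text)
    for word in words_to_remove:
        text = text.replace(word, '')
    return text

Pass the list of words you want to delete as words_to_remove.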
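The three cleaning steps can be combined into one function. The sketch below is one possible arrangement, not the exact code used for this post; the function name clean_post and the sample sentence are made up. The order matters: topic tags go first, while the # markers still exist.

```python
import re

TOPIC_PATTERN = re.compile(r'#.*?#')                        # a Weibo topic tag: #topic#
NON_TEXT_PATTERN = re.compile('[^\u4e00-\u9fa5a-zA-Z0-9]')  # anything except Chinese, letters, digits

def clean_post(text, words_to_remove):
    """Clean one post: drop topic tags, fixed words, then all other symbols."""
    text = TOPIC_PATTERN.sub('', text)       # 1. topic tags, while '#' is still present
    for word in words_to_remove:             # 2. fixed, uninformative words
        text = text.replace(word, '')
    return NON_TEXT_PATTERN.sub('', text)    # 3. emoji, punctuation, whitespace

sample = '#流浪动物救助# 今天在小区看到一只流浪猫，好可怜😭 help!'
cleaned = clean_post(sample, ['流浪', '猫'])
print(cleaned)  # 今天在小区看到一只好可怜help
```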

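Word segmentation

Use jieba for word segmentation.

import jieba

# train_text holds the cleaned post texts, one string per post
jieba_text = [jieba.lcut(s) for s in list(train_text)]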

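In sentiment analysis, words of certain parts of speech contribute little and may even introduce noise, so part-of-speech tags are used to filter out the unimportant words. The code below keeps nouns, verbs, and adjectives (jieba flags starting with n, v, or a) as keywords.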

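import jieba.posseg as pseg

def filter_sentences_by_pos(tagged_sentences, pos_filters=('n', 'v', 'a')):
    # Keep only the words whose POS flag starts with one of pos_filters
    filtered_sentences = []
    for sentence in tagged_sentences:
        filtered_words = [(word, flag) for word, flag in sentence if flag[0] in pos_filters]
        if filtered_words:   # sentences with no remaining words are dropped
            filtered_sentences.append(filtered_words)
    return filtered_sentences

# Assumption: tagged_sentences is built with jieba's POS tagger, one list of (word, flag) pairs per post
tagged_sentences = [pseg.lcut(s) for s in list(train_text)]
filtered_tagged_sentences = filter_sentences_by_pos(tagged_sentences)
filtered_sentences = [[word for word, _ in sentence] for sentence in filtered_tagged_sentences]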
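To make the effect of the filter concrete, here is a small sketch that runs it on hand-tagged input. The tagged sentences are written by hand as a stand-in for POS-tagger output, so the flags are illustrative rather than what jieba would necessarily produce.

```python
def filter_sentences_by_pos(tagged_sentences, pos_filters=('n', 'v', 'a')):
    """Keep (word, flag) pairs whose flag starts with one of pos_filters."""
    kept = []
    for sentence in tagged_sentences:
        words = [(word, flag) for word, flag in sentence if flag[:1] in pos_filters]
        if words:
            kept.append(words)
    return kept

# Hand-written stand-in for POS-tagger output (hypothetical flags).
tagged_sentences = [
    [('我', 'r'), ('喜欢', 'v'), ('小猫', 'n')],
    [('非常', 'd'), ('可怜', 'a')],
    [('的', 'uj'), ('了', 'ul')],   # nothing survives the filter
]

filtered = filter_sentences_by_pos(tagged_sentences)
filtered_sentences = [[word for word, _ in sentence] for sentence in filtered]
print(filtered_sentences)  # [['喜欢', '小猫'], ['可怜']]
```

Because only the first letter of the flag is compared, sub-categories such as nr or vn pass the filter too, while adverbs (flag d) are removed. The third sentence disappears entirely, which is why the indices of dropped sentences need to be tracked when labels are attached later.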

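A bigram model is then built to capture words that frequently occur together.

from gensim.models import phrases

def make_bigrams(texts):
    # min_count: ignore pairs seen fewer than 5 times; threshold: higher values yield fewer phrases
    bigram = phrases.Phrases(texts, min_count=5, threshold=100)
    bigram_mod = phrases.Phraser(bigram)   # lighter object for applying the trained model
    return [bigram_mod[doc] for doc in texts]

data_words_bigrams = make_bigrams(filtered_sentences)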
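To give a feel for what min_count and threshold do, the sketch below re-implements the idea in plain Python. It is a simplified illustration, not gensim's implementation: the score follows the commonly cited formula (bigram_count - min_count) * vocab_size / (count_a * count_b), the vocabulary size here counts distinct single words only, and the function names and toy corpus are made up.

```python
from collections import Counter

def score_bigrams(texts, min_count):
    """Score every adjacent word pair; higher means the pair co-occurs more than chance."""
    unigrams = Counter(w for doc in texts for w in doc)
    bigrams = Counter(pair for doc in texts for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)   # simplification: distinct single words only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

def merge_bigrams(texts, min_count, threshold):
    """Join adjacent pairs whose score exceeds threshold into a single token a_b."""
    scores = score_bigrams(texts, min_count)
    merged = []
    for doc in texts:
        out, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and scores.get(pair, 0) > threshold:
                out.append('_'.join(pair))
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged

# Tiny made-up corpus, so small values of min_count and threshold are used.
texts = [['流浪', '猫', '领养'], ['流浪', '猫', '救助'], ['流浪', '猫', '喂养'], ['小狗', '叫']]
result = merge_bigrams(texts, min_count=2, threshold=0.5)
print(result)
```

Only the pair that appears three times is merged; pairs seen once score below zero after min_count is subtracted. On a real corpus the values min_count=5 and threshold=100 play the same role at a larger scale.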

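Visualize the result as a word cloud.

from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# flatten the per-comment token lists into one list
bigrams = []
for sublist in data_words_bigrams:
    bigrams.extend(sublist)

word_counts = Counter(bigrams)

# font_path must point to a font that contains Chinese glyphs
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', width=500, height=400)
wordcloud.generate_from_frequencies(word_counts)

plt.figure(figsize=(12, 9))
plt.imshow(wordcloud, interpolation='bicubic')
plt.axis('off')
plt.show()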

3. Sentiment Analysis with the SnowNLP Model

This part draws on the CSDN blog post python情感分析：基于jieba的分词及snownlp的情感分析！

Use SnowNLP to run an overall sentiment analysis.

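from snownlp import SnowNLP
pos_num = 0
neg_num = 0
for doc in data_words_bigrams:       # each doc is the token list of one comment
    for word in doc:                 # SnowNLP expects a string, so score word by word
        sl = SnowNLP(word)
        if sl.sentiments > 0.5:      # sentiments is the probability that the text is positive
            pos_num = pos_num + 1
        else:
            neg_num = neg_num + 1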

Compute the share of positive sentiment.

print('Number of positive keywords: {}'.format(pos_num))
print('Number of negative keywords: {}'.format(neg_num))
print('Share of positive sentiment: {}'.format(pos_num/(pos_num + neg_num)))

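Then run the sentiment analysis on each comment.

mid = 0          # number of neutral comments
zheng = 0        # number of positive comments
fu = 0           # number of negative comments
fenshu = []      # per-comment score: share of positive words
keywords1 = []   # per-comment token lists, kept parallel to fenshu
for i in range(len(data_words_bigrams)):
    pos_num = 0
    neg_num = 0
    keywords1.append(data_words_bigrams[i])
    for word in data_words_bigrams[i]:   # only the words of comment i
        sl = SnowNLP(word)
        if sl.sentiments > 0.5:
            pos_num = pos_num + 1
        else:
            neg_num = neg_num + 1
    if pos_num + neg_num == 0:           # no words to score: treat as neutral
        fenshu.append(0.5)
        mid = mid+1
    else:
        fenshu.append(pos_num/(pos_num + neg_num))
        if pos_num/(pos_num + neg_num) >0.6:
            zheng = zheng + 1
        elif pos_num/(pos_num + neg_num)<0.4:
            fu = fu + 1
        else:
            mid = mid + 1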

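Sentiment is divided into three classes: positive (zheng), negative (fu), and neutral (mid). Each comment's score is the share of its words that SnowNLP rates as positive. A comment is counted as positive when the score is above 0.6, as negative when it is below 0.4, and as neutral otherwise; a comment with no scored words gets a score of 0.5 and is counted as neutral.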
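The scoring rule can be isolated into a small function, which makes the thresholds easy to test. This is a sketch under assumptions: the word scorer is passed in as a parameter, and the toy score table below is invented so that the example runs without SnowNLP. In the actual analysis that role is played by SnowNLP(word).sentiments.

```python
def classify_comment(words, word_score, low=0.4, high=0.6):
    """Return (score, label) for one tokenized comment.

    word_score maps a word to a positivity value in [0, 1].
    """
    if not words:
        return 0.5, 'neutral'           # nothing to score
    positive = sum(1 for w in words if word_score(w) > 0.5)
    score = positive / len(words)       # share of positive words
    if score > high:
        return score, 'positive'
    if score < low:
        return score, 'negative'
    return score, 'neutral'

# Hypothetical word scores, used only so the example runs without SnowNLP.
toy_scores = {'喜欢': 0.9, '可爱': 0.8, '可怜': 0.2, '讨厌': 0.1}

def score_word(word):
    return toy_scores.get(word, 0.5)

comments = [['喜欢', '可爱', '可怜'], ['讨厌', '可怜'], ['喜欢', '讨厌'], []]
labels = [classify_comment(c, score_word)[1] for c in comments]
print(labels)  # ['positive', 'negative', 'neutral', 'neutral']
```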

Visualize the result.

import matplotlib.pyplot as plt

x = ['Positive comments', 'Negative comments', 'Neutral comments']
y = [zheng, fu, mid]
colors = plt.colormaps['Pastel1'].colors
# width=0.3 turns the pie into a ring; autopct prints the percentage on each wedge
plt.pie(y, pctdistance=0.85, autopct='%.1f%%', labels=x, colors=colors, wedgeprops=dict(width=0.3, edgecolor='w'))
plt.legend(x, loc='upper left')
plt.show()

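Attach the sentiment label to each sentence.

Note that a sentence may be removed entirely during POS filtering, so filter_sentences_by_pos needs to be modified to record the indices of the removed sentences.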

def filter_sentences_by_pos(tagged_sentences, pos_filters=('n', 'v', 'a')):
    filtered_sentences = []
    deleted_indices = []   # positions of sentences that lost all their words
    sentence_index = 0

    for sentence in tagged_sentences:
        filtered_words = [(word, flag) for word, flag in sentence if flag[0] in pos_filters]
        if filtered_words:
            filtered_sentences.append(filtered_words)
        else:
            deleted_indices.append(sentence_index)
        sentence_index += 1
    # return both so the caller can drop the matching rows
    return filtered_sentences, deleted_indices

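Use data.drop(index=deleted_indices) to delete these rows. This assumes that data has the default 0 to n-1 index in the same order as tagged_sentences; after drop_duplicates() and dropna() the index has gaps, so call reset_index(drop=True) first.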
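The alignment between the table and the scores is easy to get wrong, so here is a small sketch of the whole step. The table contents, the deleted position, and the scores are all made up; the column names 情感 and 评分 are the ones used in the following steps.

```python
import pandas as pd

# Toy table whose index has gaps, as it would after earlier row drops (values are made up).
data = pd.DataFrame({'text': ['a', 'b', 'c', 'd']}, index=[0, 2, 5, 7])
deleted_indices = [1]        # position of a sentence emptied by POS filtering (hypothetical)
fenshu = [0.8, 0.5, 0.1]     # scores of the three surviving sentences (hypothetical)

def label(score, low=0.4, high=0.6):
    """Map a score to a sentiment label using the same thresholds as the counting loop."""
    if score > high:
        return 'positive'
    if score < low:
        return 'negative'
    return 'neutral'

data = data.reset_index(drop=True)                               # row labels now equal positions
data = data.drop(index=deleted_indices).reset_index(drop=True)   # remove the emptied sentence
data['评分'] = fenshu                                             # one score per remaining row
data['情感'] = [label(s) for s in fenshu]
print(data)
```

After the drop, the number of rows matches the length of fenshu, so the scores can be assigned as a plain list.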

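Find the most frequent sentiment for each IP.

grouped = data.groupby('ip')

result = grouped['情感'].value_counts()                   # counts per (ip, sentiment) pair
most_common_emotion = result.groupby(level=0).idxmax()   # (ip, sentiment) pair with the highest count
print("Most frequent sentiment per IP:")
print(most_common_emotion)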

Compute the average score for each IP and visualize it.

average_scores = grouped['评分'].mean()
print(average_scores)
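A quick way to check the two group-by steps is to run them on a handful of rows. The data below is invented; only the column names ip, 情感, and 评分 come from the steps above.

```python
import pandas as pd

# Made-up rows, only to show the shape of the results.
data = pd.DataFrame({
    'ip': ['北京', '北京', '北京', '上海', '上海'],
    '情感': ['positive', 'positive', 'negative', 'neutral', 'neutral'],
    '评分': [0.8, 0.7, 0.2, 0.5, 0.4],
})

grouped = data.groupby('ip')
counts = grouped['情感'].value_counts()            # indexed by (ip, sentiment)
top_pairs = counts.groupby(level=0).idxmax()       # each value is an (ip, sentiment) tuple
most_common = top_pairs.map(lambda pair: pair[1])  # keep only the sentiment part
average_scores = grouped['评分'].mean()

print(most_common.to_dict())
print(average_scores.round(3).to_dict())
```

idxmax returns the full (ip, sentiment) index tuple, so the last map step is what turns the result into a plain sentiment label per IP.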

4. Determining Topics with the LDA Topic Model

# keep the token lists of comments whose score is above 0.5
zhengxiang = [words for words, num in zip(keywords1, fenshu) if num > 0.5]

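corpora.Dictionary builds a numeric mapping for the input data, assigning each distinct word a unique integer ID. The doc2bow method then converts each text into a bag-of-words representation, i.e. a list of (word ID, count) pairs.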

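from gensim import corpora

id2word = corpora.Dictionary(zhengxiang)
id2word.save_as_text("zhengxiang")   # write the dictionary to a text file named zhengxiang
texts = zhengxiang
corpus = [id2word.doc2bow(text) for text in texts]   # Term Document Frequency
print(id2word[2])   # the word stored under ID 2
# show the first document as readable (word, count) pairs
print([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]])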
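For readers unfamiliar with the bag-of-words format, the sketch below mimics the dictionary and doc2bow steps in plain Python. It is an illustration of the idea only: the function names are made up, and gensim may assign IDs in a different order.

```python
from collections import Counter

def build_dictionary(texts):
    """Map each distinct token to an integer ID, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(text, token2id):
    """Convert one tokenized text into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

texts = [['猫', '可爱', '猫'], ['狗', '可爱']]
token2id = build_dictionary(texts)
corpus = [doc2bow(t, token2id) for t in texts]
print(token2id)  # {'猫': 0, '可爱': 1, '狗': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded and only counts remain, which is the input format the LDA model works on.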

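Taking the positive-sentiment data as an example, loop over candidate topic counts, compute the coherence score for each, and choose the number of topics with the highest coherence.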

import gensim
from gensim.models import CoherenceModel

Coherence = []
Perplexity = []

for i in range(1,11):   # try 1 to 10 topics
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=i, 
                                           random_state=200,
                                           update_every=1,
                                           chunksize=100,
                                           passes=20,
                                           alpha='auto',
                                           per_word_topics=True)
    doc_lda = lda_model[corpus]
    # coherence is computed on the same texts that the dictionary and corpus were built from
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    Coherence.append(coherence_lda)
    Perplexity.append(lda_model.log_perplexity(corpus))
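Once the loop has filled the Coherence list, picking the topic count is a one-line lookup. The sketch below shows that step; the helper name and the coherence values are made up for illustration and are not results from this analysis.

```python
def best_topic_count(coherence_scores, first_k=1):
    """Return (k, score) with the highest coherence; scores[j] belongs to k = first_k + j."""
    best_index = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return first_k + best_index, coherence_scores[best_index]

# Made-up coherence values for k = 1..5, only to show the selection step.
example_scores = [0.31, 0.38, 0.45, 0.41, 0.36]
k, score = best_topic_count(example_scores)
print(k, score)  # 3 0.45
```

The first_k offset matters because the loop starts at one topic, so list position 0 corresponds to k = 1.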

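The negative and neutral data are handled in the same way.

Build the LDA model with the optimal number of topics determined above, and finally analyze and summarize the results.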