Sentiment Analysis of Novel Reviews Scraped from 晋江文学城

Latest recommended article published on 2023-12-08 23:23:55

小萝卜丝丝丝丝

Category: python. Tags: python, web scraping, data analysis

Article link: https://blog.csdn.net/weixin_43367971/article/details/115803189

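This post describes how to scrape and analyze novel reviews from 晋江文学城: collecting novel information and reviews; preprocessing the data (deduplication, word segmentation, sentiment labeling); classifying sentiment with TfidfVectorizer combined with naive Bayes and logistic regression; and analyzing the results.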

1. Collecting the data

1.1 Scraping novel information from the first 50 pages of the 晋江文学城 favorites ranking

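Fetch the list of novels on the first 50 pages of the favorites ranking. The URL of the first page is 'http://www.jjwxc.net/bookbase.php?fw0=0&fbsj=0&ycx0=0&xx2=2&mainview0=0&sd0=0&lx0=0&fg0=0&sortType=0&isfinish=0&collectiontypes=ors&searchkeywords=&page=1'; the second page uses page=2, and so on up to page=50 for the 50th page. For each novel, scrape its ID, title, and author, and store the results in the file 晋江排行榜【按收藏数】.txt.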
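The pagination scheme can be sketched as follows. This is a minimal illustration, not the author's code: the helper name `ranking_page_urls` is hypothetical, and the URL template is copied from the text with only the `page` value varying. It builds the URLs without sending any requests.

```python
from urllib.parse import urlsplit, parse_qs

# URL template copied from the text; only the page number varies.
BASE_URL = (
    "http://www.jjwxc.net/bookbase.php?fw0=0&fbsj=0&ycx0=0&xx2=2&mainview0=0"
    "&sd0=0&lx0=0&fg0=0&sortType=0&isfinish=0&collectiontypes=ors"
    "&searchkeywords=&page={}"
)


def ranking_page_urls(first=1, last=50):
    """Return the ranking URLs for pages first..last, both inclusive."""
    # range() excludes its stop value, so last + 1 is needed to reach the last page.
    return [BASE_URL.format(page) for page in range(first, last + 1)]


urls = ranking_page_urls()
# Parse the page number back out of each URL as a sanity check.
pages = [int(parse_qs(urlsplit(u).query)["page"][0]) for u in urls]
print(len(urls), pages[0], pages[-1])
```

Note that `range(1, 50)` would stop at page 49, so covering all 50 pages requires a stop value of 51.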

import requests
from bs4 import BeautifulSoup
import bs4
import re
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba
import seaborn as sns
import xlrd
from xlutils.copy import copy
# Magic commands so that matplotlib plots render inline in the notebook cell instead of opening a new window
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection  import train_test_split

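Scraping the basic novel information. The main approach:
Find the tbody element that holds all of the information to be scraped;
Locate the small td(a) tag that corresponds to each field, and work out its position among all the tags;
When writing to the txt file, store the fields in that order.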
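The steps above can be tried offline on a small fragment. The sketch below is an illustration under stated assumptions: `SAMPLE_HTML` is a made-up fragment (the real ranking page's markup may differ), the names `AnchorCollector` and `extract_rows` are hypothetical, and the standard library's `html.parser` is swapped in for BeautifulSoup so the example runs without third-party packages or network access. It assumes that links inside the tbody alternate between author and title, and that the novel ID is the first run of digits in the title link's URL.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: the real ranking page's markup may differ.
SAMPLE_HTML = """
<table><tbody>
<tr><td><a href="oneauthor.php?authorid=11">Author A</a></td>
    <td><a href="onebook.php?novelid=1001">Book One</a></td></tr>
<tr><td><a href="oneauthor.php?authorid=22">Author B</a></td>
    <td><a href="onebook.php?novelid=2002">Book Two</a></td></tr>
</tbody></table>
"""


class AnchorCollector(HTMLParser):
    """Collect [href, text] for every <a> found inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.in_anchor = False
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "a" and self.in_tbody:
            self.in_anchor = True
            self.anchors.append([dict(attrs).get("href") or "", ""])

    def handle_data(self, data):
        if self.in_anchor:
            self.anchors[-1][1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        elif tag == "tbody":
            self.in_tbody = False


def extract_rows(html):
    """Return (book_id, title, author) tuples, pairing links in page order."""
    parser = AnchorCollector()
    parser.feed(html)
    rows = []
    # Links alternate author, title, so walk the even and odd positions together.
    for (_, author), (book_href, title) in zip(parser.anchors[0::2], parser.anchors[1::2]):
        book_id = re.findall(r"\d+", book_href)[0]
        rows.append((book_id, title, author))
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print("\t".join(row))  # ID, title, author, tab-separated
```

Pairing with `zip` over the even and odd positions avoids hard-coding how many novels a page contains, so a short page does not raise an `IndexError`.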

headers = {"User-Agent": "Mozilla/5.0"}
base_url = 'http://www.jjwxc.net/bookbase.php?fw0=0&fbsj=0&ycx0=0&xx2=2&mainview0=0&sd0=0&lx0=0&fg0=0&sortType=0&isfinish=0&collectiontypes=ors&searchkeywords=&page={}'
id_pattern = re.compile(r'\d+')

# Open the output file once instead of reopening it for every row
with open('./data/晋江排行榜【按收藏数】.txt', 'a+', encoding='utf-8') as f:
    for page in range(1, 51):  # pages 1 to 50 inclusive
        html = requests.get(base_url.format(page), headers=headers)
        html.encoding = html.apparent_encoding
        soup = BeautifulSoup(html.text, 'html.parser')
        for tbody in soup.find_all('tbody'):
            links = tbody('a')  # links alternate: author, title, author, title, ...
            # Step through the links in pairs rather than assuming 100 novels per page
            for i in range(0, len(links) - 1, 2):
                author, book = links[i], links[i + 1]
                book_url = book.get('href')  # URL of the novel's home page
                book_id = id_pattern.findall(book_url)[0]  # novel ID
                print("{0}\t{1}\t{2}".format(book_id, book.string, author.string), file=f)  # ID, title, author

Inspecting the scrape results: view the IDs and titles of the first 8 novels.

# View the IDs of the first 8 novels on the favorites ranking
with open('./data/晋江排行榜【按收藏数】.txt','r',encoding='utf-8',errors
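One way to peek at the first few rows of a tab-separated file like this is sketched below. It is not the author's code: the helper `head_rows` is hypothetical, and the rows are made-up stand-ins written to a temporary file so the example runs without the scraped data. It assumes each line holds ID, title, and author separated by tabs.

```python
import os
import tempfile
from itertools import islice


def head_rows(path, n=8):
    """Return the first n lines of a UTF-8 file, each split on tabs."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n").split("\t") for line in islice(f, n)]


# Hypothetical rows standing in for the scraped file.
sample = ["{}\tBook {}\tAuthor {}".format(1000 + i, i, i) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ranking.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample) + "\n")
    first = head_rows(path)

ids = [row[0] for row in first]
titles = [row[1] for row in first]
print(ids)
print(titles)
```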
