Python Web Scraping in Practice: Analyzing Douban Reviews of Wolf Warrior 2 (《战狼2》)

Latest recommended article published on 2023-04-24 11:57:24

bailixuance

Category: Python web scraping (python爬虫)

Article link: https://blog.csdn.net/bailixuance/article/details/84677515

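This post walks through scraping review data for the film Wolf Warrior 2 from Douban with Python, covering environment setup, page analysis, data extraction, and handling of missing values.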

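I. Introduction

Environment: Windows 10, Jupyter Notebook, Python 3.6, re, bs4, requests

Target: the Douban movie page for Wolf Warrior 2 (《战狼2》)

Main page:

https://movie.douban.com/subject/26363254/

Short-review page:

https://movie.douban.com/subject/26363254/comments?sort=new_score&status=P

In practice, you cannot scrape tens of thousands of reviews.

Without logging in, you can only reach 10 pages, or 200 reviews. Logged in, you can reach roughly 500 reviews. Details follow below.

This post covers page analysis, writing the program, and running the scraper.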

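II. Page Analysis

To scrape data, you first need to know how it is laid out in the page, and then pick a matching extraction method.

1. URL analysis

Page 1 of the short reviews: https://movie.douban.com/subject/26363254/comments?sort=new_score&status=P

Page 2 of the short reviews: https://movie.douban.com/subject/26363254/comments?start=20&limit=20&sort=new_score&status=P

Page 3 of the short reviews: https://movie.douban.com/subject/26363254/comments?start=40&limit=20&sort=new_score&status=P

Page 5 of the short reviews: https://movie.douban.com/subject/26363254/comments?start=80&limit=20&sort=new_score&status=P

...

So only start changes in the URL, increasing by 20 per page. Does the first page work the same way? Yes: setting start to 0 returns exactly the same data as the short-review landing page. We can therefore loop over start to scrape every page: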

for i in range(0,10000,20):
    # i is the start offset; the page number is i // 20 + 1
    print("Scraping page {0} (start={1})......".format(i // 20 + 1, i))
    requrl = "https://movie.douban.com/subject/26363254/comments?start=" + str(i) + "&limit=20&sort=new_score&status=P"
    getContent(requrl,headers,cookie,i)
    time.sleep(3)  # pause between requests
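
The loop above builds the query string by hand. As a minimal sketch of the same pagination, the URL can also be assembled with `urllib.parse.urlencode` from the standard library. The helper name `build_comment_url` is hypothetical and not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/26363254/comments"


def build_comment_url(start, limit=20):
    """Return the short-review URL for the page that begins at offset `start`."""
    params = {"start": start, "limit": limit, "sort": "new_score", "status": "P"}
    return BASE_URL + "?" + urlencode(params)


# Offsets 0, 20, 40 correspond to pages 1, 2, 3
for page_number, start in enumerate(range(0, 60, 20), start=1):
    print(page_number, build_comment_url(start))
```

Keeping the parameters in a dictionary makes it easier to change `limit` or `sort` later without editing a long string.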

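2. Data analysis

The first short-review page shows the total number of reviews. We do not need to scrape that number, but note that it exceeds 200,000. Can we really scrape that many? We will see. Each page holds 20 reviews, which matches the step of 20 in the URL.

Each review contains the user ID, star rating, review time, number of upvotes, and review text.

Next we inspect the page source to see how to extract these fields. Right-click the page, choose Inspect to open the developer tools, and use the element picker (the arrow icon) to locate the source.

Every review sits in a div tag whose class is "comment-item".

Opening one of them shows the structure of the information we need.

Now we locate these five elements in the raw page source. (Why not work directly from the developer-tools view? Because the raw source and the developer-tools view may differ. For example, if a tag has both class and title attributes, the developer tools may show class first while the raw source has title first. When the two differ, a regular expression written from the developer-tools view will not match the raw source. That was a lesson learned the hard way.)

The five corresponding source fragments are: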

<a href="https://www.douban.com/people/z286424115/" class="">俏皮面</a>

 <span class="votes">33120</span>

<span class="allstar20 rating" title="较差"></span>

<span class="comment-time " title="2017-07-23 16:55:44">

<span class="short">首映礼看的。太恐怖了这个电影，不讲道理的，完全就是吴京在实现他这个小粉红的英雄梦。各种装备轮番上场，视物理逻辑于不顾，不得不说有钱真好，随意胡闹</span>

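The user ID link has an empty class value, and the star rating is buried in the tag's attribute text. These two are a little awkward, so we extract them with regular expressions.

First build the regular expressions:

'<a class="" href="(.*?)">(.*?)</a>'

'<span class="allstar(.*?) rating" title="(.*?)"></span>'

The corresponding extraction code is: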

# User ID (name)
pattern_Name = re.compile(r'<a class="" href="(.*?)">(.*?)</a>')
patter_name = pattern_Name.findall(str(item))
if patter_name != []:
    name = str(patter_name[0][1])
else:
    print("start={0}: a row has an empty user ID... ".format(int(page)))

# Star rating (score)
# <span class="allstar20 rating" title="较差"></span>
pattern = re.compile(r'<span class="allstar(.*?) rating" title="(.*?)"></span>')
patter_score = pattern.findall(str(item))
if patter_score == []:
    print("start={0}: a row has no star rating... ".format(int(page)))
    continue
# "allstar20" means 2 stars, so divide by 10
score = str(int(patter_score[0][0])//10)
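
Because the patterns above depend on the exact attribute order, one way to make them less brittle is to match the tag first and then look for each attribute separately. The following is only a sketch under that assumption; the helper names `extract_score` and `extract_user` are hypothetical, and the sample strings are the fragments listed earlier.

```python
import re

# Sample fragments copied from the source listing above
SAMPLE = (
    '<a href="https://www.douban.com/people/z286424115/" class="">俏皮面</a>'
    '<span class="allstar20 rating" title="较差"></span>'
)

# The class attribute may appear anywhere inside the <span ...> tag
STAR_RE = re.compile(r'<span[^>]*\bclass="allstar(\d+) rating"[^>]*>')
# Capture the attribute text and the inner text of every <a> tag
LINK_RE = re.compile(r'<a\b([^>]*)>(.*?)</a>', re.S)
HREF_RE = re.compile(r'\bhref="([^"]*)"')


def extract_score(html):
    """Return the star rating as an int (allstar20 -> 2), or None if absent."""
    match = STAR_RE.search(html)
    return int(match.group(1)) // 10 if match else None


def extract_user(html):
    """Return (profile_url, name) for the first <a> whose class is empty."""
    for match in LINK_RE.finditer(html):
        attrs, text = match.group(1), match.group(2)
        if 'class=""' in attrs:
            href = HREF_RE.search(attrs)
            return (href.group(1) if href else None), text
    return None


print(extract_score(SAMPLE))
print(extract_user(SAMPLE))
```

With this approach it no longer matters whether `class` or `href` (or `title`) comes first in the raw source.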

For the other three fields the data is in the tag text, so BeautifulSoup lookups are enough. The corresponding code is:

# Review time (stored as comment_time so the time module is not shadowed)
time_tag = item.find('span',class_='comment-time')
if time_tag is not None and time_tag.string is not None:
    # strip() removes the surrounding whitespace and newlines
    comment_time = time_tag.string.strip()
else:
    print("start={0}: a row has an empty review time... ".format(int(page)))

# Upvotes
votes_tag = item.find('span',class_="votes")
if votes_tag is not None and votes_tag.string is not None:
    votes = votes_tag.string
else:
    print("start={0}: a row has no upvote count... ".format(int(page)))

# Review text
short_tag = item.find('span',class_="short")
if short_tag is not None and short_tag.string is not None:
    comment = short_tag.string
else:
    print("start={0}: a row has an empty review... ".format(int(page)))
    continue

Note that special values need handling here, mainly missing values. The rule is: if either the review text or the star rating is missing, discard the record.
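
The discard rule can be stated compactly as a small function. This is a sketch only; `build_row` is a hypothetical helper that does not appear in the original code, and the sample values are made up.

```python
def build_row(name, score, votes, comment_time, comment):
    """Return one CSV row, or None when the record should be discarded.

    A record is discarded when the star rating or the review text is missing;
    the other fields may be empty.
    """
    if not score or not comment:
        return None
    return [name or "", score, votes or "", comment_time or "", comment]


print(build_row("user_a", "2", "33120", "2017-07-23", "some review text"))
print(build_row("user_b", "", "5", "2017-07-24", "no rating given"))
```

Returning None for unusable records lets the caller skip them with a single check before writing to the CSV file.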

III. Building the Code

With the extraction rules settled, the next step is to assemble the full program.

1. Import the required libraries:

import re 
from bs4 import BeautifulSoup as bs
import time 
import csv
import requests

Request headers and cookies:

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

# Replace the placeholder with the cookies of your own logged-in session
cookie = {
    'cookies':'your cookies'
}

Without logging in, Douban only serves 10 pages of reviews, so to scrape more you have to log in.

For how to obtain the cookies, see:

https://blog.csdn.net/bailixuance/article/details/84715924
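
The `cookies` argument of `requests.get` takes a dictionary that maps each cookie name to its value. If what you copied from the browser is the raw `Cookie` request header (a single `name=value; name2=value2` string), one option is to split it into such a dictionary first. The helper `parse_cookie_header` below is a hypothetical sketch, and the cookie names and values in the example are placeholders.

```python
def parse_cookie_header(raw):
    """Convert 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, since values may contain '=' themselves
        name, value = part.strip().split("=", 1)
        cookies[name] = value
    return cookies


# Placeholder header string, not real session data
raw_header = 'key1=abc123; key2="1:xyz=="; key3=42'
print(parse_cookie_header(raw_header))
```

The resulting dictionary can then be passed as `cookies=...` in place of the placeholder dictionary shown earlier.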

The complete code is:

import re
from bs4 import BeautifulSoup as bs
import time
import csv
import requests

# Compile the patterns once instead of once per review
# User link and name
pattern_name = re.compile(r'<a class="" href="(.*?)">(.*?)</a>')
# Star rating, e.g. <span class="allstar20 rating" title="较差"></span>
pattern_score = re.compile(r'<span class="allstar(.*?) rating" title="(.*?)"></span>')


def getContent(requrl,headers,cookies,page):
    resp = requests.get(requrl,cookies=cookies,headers=headers)
    html_data = resp.text

    # Parse the page with BeautifulSoup
    soup = bs(html_data, 'html.parser')
    # Every review sits in a <div class="comment-item">
    comment_div_list = soup.find_all('div', class_='comment-item')

    if len(comment_div_list) == 0:
        print("start={0}: no reviews found.....".format(int(page)))
        return

    rows = []
    for item in comment_div_list:
        name = ''
        score = ''
        comment_time = ''  # not named `time`, to avoid shadowing the time module
        comment = ''
        votes = ''

        # User ID (name)
        found_name = pattern_name.findall(str(item))
        if found_name != []:
            name = str(found_name[0][1])
        else:
            print("start={0}: a row has an empty user ID... ".format(int(page)))

        # Star rating (score); discard the record if it is missing
        found_score = pattern_score.findall(str(item))
        if found_score == []:
            print("start={0}: a row has no star rating... ".format(int(page)))
            continue
        # "allstar20" means 2 stars, so divide by 10
        score = str(int(found_score[0][0])//10)

        # Review time
        time_tag = item.find('span',class_='comment-time')
        if time_tag is not None and time_tag.string is not None:
            comment_time = time_tag.string.strip()
        else:
            print("start={0}: a row has an empty review time... ".format(int(page)))

        # Upvotes
        votes_tag = item.find('span',class_="votes")
        if votes_tag is not None and votes_tag.string is not None:
            votes = votes_tag.string
        else:
            print("start={0}: a row has no upvote count... ".format(int(page)))

        # Review text; discard the record if it is missing
        short_tag = item.find('span',class_="short")
        if short_tag is not None and short_tag.string is not None:
            comment = short_tag.string
        else:
            print("start={0}: a row has an empty review... ".format(int(page)))
            continue

        rows.append([name,score,votes,comment_time,comment])

    # Append this page's rows to the CSV file, one row per review
    with open('./zhanlangall.csv','a+',encoding='utf-8',newline='') as f:
        writer = csv.writer(f)
        writer.writerows(rows)

def main():
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
    }
    # Replace the placeholder with the cookies of your own logged-in session
    cookie = {
        'cookies': 'your cookies'
    }

    for i in range(0,10000,20):
        # i is the start offset; the page number is i // 20 + 1
        print("Scraping page {0} (start={1})......".format(i // 20 + 1, i))
        requrl = "https://movie.douban.com/subject/26363254/comments?start=" + str(i) + "&limit=20&sort=new_score&status=P"
        getContent(requrl,headers,cookie,i)
        time.sleep(3)  # pause between requests
    print("All pages requested; scraping finished")

main()

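IV. Scraping Results

From start=500 onward the pages come back with no content. To check, open the page in a browser with start set to 500.

The page is empty. In other words, although there are 200,000+ reviews, you can actually view only about 500 of them. If the site will not even display the rest, there is nothing left to scrape.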
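
Since pages past the limit come back empty, the loop could stop at the first empty page instead of running all the way to 10000. The sketch below assumes a page-fetching function that returns a list of rows (the original `getContent` writes to a file and returns nothing, so this is a variation, not the original design). `crawl_until_empty` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates the site.

```python
def crawl_until_empty(fetch_page, step=20, max_start=10000):
    """Call fetch_page(start) for start = 0, step, 2*step, ... and stop
    as soon as a page returns no rows."""
    rows = []
    for start in range(0, max_start, step):
        page_rows = fetch_page(start)
        if not page_rows:
            print("start={0}: empty page, stopping".format(start))
            break
        rows.extend(page_rows)
    return rows


def fake_fetch(start):
    """Stand-in for a real request: 20 rows per page, nothing from offset 500 on."""
    if start >= 500:
        return []
    return ["review {0}".format(start + k) for k in range(20)]


all_rows = crawl_until_empty(fake_fetch)
print(len(all_rows))
```

In a real run, a `time.sleep` between requests would still be needed inside the loop.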

Let's look at the result. Import the scraped CSV into Excel; if the text appears garbled when opened, see:

https://blog.csdn.net/bailixuance/article/details/84678133

Result:

About 500 reviews.

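V. Summary

Step 1: Inspect the page elements with Chrome

Step 2: Fetch the HTML text of a single page

Step 3: Parse out the required fields with regular expressions and BeautifulSoup, and store them in a list

Step 4: Save the list to a CSV file

Step 5: Use the start parameter to scrape the remaining pages of short reviews

Analyze the page elements and the data structure, and pay attention to the data itself.

A class value may be empty; regular expressions can handle that case.

When the data sits inside a tag's attributes, use regular expressions.

Use cookies; POST may also be worth trying.

Exceptions can be handled with try/except.

No general-purpose framework was used here, but the upside is that the code is quick to learn and understand.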
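
As an illustration of the try/except point, a request can be wrapped so that a temporary failure is retried a few times before giving up. This is a generic sketch, not code from the original post: `fetch_with_retry` and `flaky` are hypothetical names, and `flaky` merely simulates a request that fails twice. With requests, catching `requests.RequestException` rather than the broad `Exception` would be the narrower choice.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url); on an exception, wait `delay` seconds and try again,
    up to `retries` attempts. Re-raise the last error if all attempts fail."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            print("attempt {0} failed: {1}".format(attempt, exc))
            time.sleep(delay)
    raise last_error


calls = {"n": 0}


def flaky(url):
    """Simulated request that fails on the first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok: " + url


print(fetch_with_retry(flaky, "https://example.com"))
```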

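The datasets with tens of thousands of reviews seem to have been scraped from Maoyan (猫眼) instead?

When I first wrote the scraper, I grouped each field into its own list before writing to CSV. For example, the 20 user IDs on a page formed one list, the other four fields were handled the same way, and the five lists were then written to CSV. This caused many index-out-of-range errors, probably because some fields were missing.

I then switched to the current strategy: each reviewer's data forms one list that is written as one CSV row, so a missing field does no harm.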
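
The row-per-review strategy can be sketched as follows. The records and the helper `rows_to_csv` are made up for illustration, and the output goes to an in-memory buffer instead of a file so the example has no side effects.

```python
import csv
import io

FIELDS = ["name", "score", "votes", "time", "comment"]

# Made-up records; the second one is missing several fields
records = [
    {"name": "user_a", "score": "4", "votes": "12", "time": "2017-07-23", "comment": "good"},
    {"score": "2", "comment": "so-so"},
]


def rows_to_csv(records):
    """Write one CSV row per record; absent fields become empty cells."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for rec in records:
        writer.writerow([rec.get(field, "") for field in FIELDS])
    return buffer.getvalue()


print(rows_to_csv(records))
```

Because each row is assembled independently, a missing field only leaves an empty cell and never shifts or truncates the other columns.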

VI. Word Cloud

1. Preprocessing

import pandas as pd
from matplotlib import pyplot as plt
import re
import jieba

filepath = 'zhanlangall_5.csv'
# Add column names: user ID, star rating, upvotes, publish date, review text
data = pd.read_csv(filepath,header=None,names=['用户ID','评分星级','点赞数','发布日期','评论内容'])

# Show overall information about the data
print(data.info())

# Show the first 5 rows
data.head()

# Check for missing values
print(data.isnull().sum())

print(len(data['用户ID']))
print(len(data['评分星级']))
print(len(data['点赞数']))
print(len(data['发布日期']))
print(len(data['评论内容']))

Output:

用户ID    0
评分星级    0
点赞数     0
发布日期    0
评论内容    0
dtype: int64
484
484
484
484
484

2. Merging into a single string

# Join all reviews into one string
comments = ''.join(str(c).strip() for c in data['评论内容'])

print(comments)

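Result:

Removing punctuation and emoji:

# Keep only runs of Chinese characters; this drops punctuation and emoji
# (and also digits and Latin letters)
pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)

print(cleaned_comments)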
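
To see what this pattern keeps and what it drops, here is a small self-contained check on a made-up sentence. The function name `keep_chinese` is hypothetical.

```python
import re

# Same character range as above: common CJK unified ideographs
CJK_RE = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return only the Chinese characters of `text`, concatenated."""
    return ''.join(CJK_RE.findall(text))


sample = "太恐怖了！！2017 wow，不讲道理的。"
print(keep_chinese(sample))
```

Full-width punctuation such as ！ and ， falls outside the range, so it is removed along with the digits and Latin letters.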

Word segmentation:

# Segment the text into words with jieba
segment = jieba.lcut(cleaned_comments)
words_df=pd.DataFrame({'segment':segment})

words_df.head()

Removing stop words:

# Remove stop words
# quoting=3 is csv.QUOTE_NONE: quote characters are read as ordinary text
stopwords=pd.read_csv("stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
words_df=words_df[~words_df.segment.isin(stopwords.stopword)]

words_df.head()

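Word frequency:

# Word frequency
# size() counts the rows per word; the dict-renaming form of agg() used in
# older tutorials is rejected by pandas 1.0 and later
words_stat=words_df.groupby('segment').size().reset_index(name='计数')
words_stat=words_stat.sort_values(by=["计数"],ascending=False)

words_stat.head()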
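
For small inputs, the same stop-word filtering and counting can be done with the standard library alone. This sketch uses `collections.Counter` in place of pandas; the token list and stop-word set are made up, and `count_words` is a hypothetical helper.

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count how often each token occurs, ignoring stop words."""
    return Counter(token for token in tokens if token not in stopwords)


# Made-up segmentation result and stop-word set
tokens = ["电影", "的", "吴京", "电影", "的", "好看"]
stopwords = {"的"}

word_counts = count_words(tokens, stopwords)
print(word_counts.most_common(2))
```

A `Counter` is also a dictionary of word to count, which is the shape that `WordCloud.fit_words` expects.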

Displaying the word cloud:

# Word cloud
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud,ImageColorGenerator  # word cloud package

# Set the font file (needed to render Chinese), background color, and maximum font size
wordcloud=WordCloud(font_path="simhei.ttf",background_color="white",max_font_size=80)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}

wordcloud=wordcloud.fit_words(word_frequence)

#image_colors = ImageColorGenerator(bg_pic)  # derive word colors from an image
plt.imshow(wordcloud)
plt.axis('off')  # hide the axes around the image
wordcloud.to_file('show_Chinese.png')  # save the word cloud to a file

Result:

VII. Data Analysis