Scraping the Top-Voted Douban Comments for 《想见你》 (Someday or One Day)

vasy

Tags: python, data analysis

Article link: https://blog.csdn.net/weixin_37670807/article/details/104470057

[Image]
This is the drama that has been hugely popular lately, although I haven't watched it yet. Sob.

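Analyzing the page

url = 'https://movie.douban.com/subject/30468961/comments?start=0&limit=20&sort=new_score&status=P'
Open this URL in a browser and press F12 to see the structure of the page.
[Image]
Web pages are either static or dynamic. To tell which one you have, right-click the page and choose "View page source". If the content you want (here, the comments) already appears in that source, just as it does in the F12 panel, the page is static. If it only shows up in the F12 panel, it is filled in by JavaScript after loading and the page is dynamic.

This page turns out to be static, so it is easy to scrape.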
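
The same check can be done in code. The sketch below assumes you already have the raw HTML as a string (for example from requests.get(url).text) and a piece of text you can see in the rendered page. `looks_static` and both sample pages are made up for illustration and are not real Douban markup.

```python
def looks_static(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the rendered page is already in the raw HTML.

    If the text is missing from the raw source, the page probably fills it in
    with JavaScript after loading, i.e. it is dynamic.
    """
    return visible_text in raw_html


# Made-up sample pages, not real Douban markup.
static_page = '<div class="comment"><span class="short">Great show</span></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(looks_static(static_page, 'Great show'))   # True
print(looks_static(dynamic_page, 'Great show'))  # False
```

This is only a rough heuristic: a page can ship some content in the source and load the rest later, so it is worth checking with the exact text you plan to scrape.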

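Libraries used

This small project mainly uses the requests, BeautifulSoup, re and pandas libraries.
Here is a quick note on installing them.
On Windows, open the cmd command line
and enter these commands one after another:
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
re is part of the Python standard library, so it needs no installation. lxml is the parser that the code below passes to BeautifulSoup.
If the downloads are too slow, install from a mirror in China instead.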

Temporary use

Add the -i option when running pip: -i https://pypi.tuna.tsinghua.edu.cn/simple
For example, pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas installs pandas from the Tsinghua mirror.

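Permanent use

On Windows, create a pip folder inside your user directory and put a pip.ini file in it, e.g. C:\Users\<your user name>\pip\pip.ini
That is, in your own folder under C:\Users, create a folder named pip, and inside it create a text file with Notepad.
The file content is: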

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn
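Save it and rename the file to pip.ini (make sure the name does not end in .txt).
From then on pip reads this file automatically, so ordinary pip install commands in cmd download from the mirror at full speed.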

Project code

First define a main() function, which asks how many pages to scrape.

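import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def main():
    num = int(input('Number of pages to scrape: '))
    name = []     # scraped user names
    comment = []  # scraped comment text
    time = []     # scraped comment dates
    vote = []     # scraped vote counts
    for i in range(num):
        # each page holds 20 comments, so page i starts at offset i*20
        url = 'https://movie.douban.com/subject/30468961/comments?start={}&limit=20&sort=new_score&status=P'.format(i*20)
        print('Scraping page {}'.format(i+1))
        reg_name, reg_comment, reg_vote, comment_time, item = get_content(url)  # get_content(url) is defined below
        for k in range(len(reg_name)):
            name.append(reg_name[k])
            comment.append(reg_comment[k])
            time.append(comment_time[k])
            vote.append(reg_vote[k])
        if item == 0:  # get_content(url) returns 0 here when the page is empty, which ends the loop
            break
    # column names: user name, comment text, votes, comment date
    frame = pd.DataFrame({'用户名': name, '评论内容': comment, '点赞数': vote, '评论时间': time})
    # Saved as an Excel file. Why not CSV? Because I ran into garbled characters and don't know how to fix that, lol.
    # Note: writing .xls needs the xlwt package and a pandas version older than 2.0.
    frame.to_excel('想见你评论数据.xls')
    print('Scraping finished')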
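
About the garbled CSV mentioned in the code comment: a common cause is that Excel does not recognize plain UTF-8 unless the file starts with a byte order mark. The sketch below uses only the standard library and made-up rows to write a CSV with the utf-8-sig encoding, which adds that mark. With pandas the same idea would be frame.to_csv(path, encoding='utf-8-sig'). Treat it as a likely fix, not a guaranteed one.

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for the scraped data.
rows = [
    {'用户名': 'user_a', '评论内容': '很好看', '点赞数': 12},
    {'用户名': 'user_b', '评论内容': '一般', '点赞数': 3},
]

path = os.path.join(tempfile.mkdtemp(), 'comments.csv')

# 'utf-8-sig' writes a byte order mark first, which lets Excel detect UTF-8.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['用户名', '评论内容', '点赞数'])
    writer.writeheader()
    writer.writerows(rows)

# Check that the file really starts with the UTF-8 byte order mark.
with open(path, 'rb') as f:
    has_bom = f.read(3) == b'\xef\xbb\xbf'
print(has_bom)  # True
```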

Then define a get_content() function that takes the url as its argument and fetches and parses the content of that page.

def get_content(url):
    headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
               'Cookie':'....paste the cookie from your own browser here'}
    response = requests.get(url=url, headers=headers, timeout=10).text
    soup = BeautifulSoup(response, 'lxml')
    info = soup.find_all(class_="comment-info")
    reg_name = []
    for item in info:
        # the user name is the text of the first <a> inside each comment-info block
        reg_name.append(item.find('a').get_text())
    comment_time = []
    pattern = r'[1-9][0-9]*-[0-9][0-9]*-[0-9][0-9]*'  # matches dates such as 2020-02-23
    for tag in soup.find_all(class_='comment-time'):
        match = re.search(pattern, tag.get_text())
        # fall back to 'NaN' when no date is found
        comment_time.append(match.group() if match else 'NaN')
    reg_comment = []
    for tag in soup.find_all(class_='short'):
        # get_text() drops the surrounding <span> tag; newlines are removed as well
        reg_comment.append(tag.get_text().replace('\n', ''))
    reg_vote = []
    for tag in soup.find_all(class_='votes'):
        reg_vote.append(int(tag.get_text()))
    if len(reg_name) == 0:
        flag = 0
        print('This page has no content, stopping the scraper')
    else:
        flag = 1
    return reg_name, reg_comment, reg_vote, comment_time, flag
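
To see what the date pattern in get_content() does, here is a small standalone sketch. The input strings are made up, and `extract_date` is a hypothetical helper that returns 'NaN' when nothing matches, mirroring the fallback above.

```python
import re

# Year, month and day separated by hyphens; the year may not start with 0.
DATE_PATTERN = r'[1-9][0-9]*-[0-9][0-9]*-[0-9][0-9]*'


def extract_date(text: str) -> str:
    """Return the first year-month-day string in text, or 'NaN' if there is none."""
    match = re.search(DATE_PATTERN, text)
    return match.group() if match else 'NaN'


print(extract_date('\n    2020-02-23\n  '))  # 2020-02-23
print(extract_date('no date here'))          # NaN
```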

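Why fill in headers and a cookie? Mainly to get past the site's anti-scraping measures: they make the request look like it comes from an ordinary browser rather than a scraper, which lowers the risk of having your IP blocked.
Here is how to find the headers and the cookie. Be sure to log in before copying the cookie, otherwise some data will not be accessible.
After logging in, press F12 and open the Network tab; if it is empty, refresh the page. Then click a request and look through its request headers for User-Agent and Cookie. Put them into headers as a dict.
[Image]
Then just call the function.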

if __name__ == '__main__':    
    main()

Results

[Image]
The output shows that there are only 25 pages of comments.
Let's take a look at the scraped file.

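Simple data analysis

What is this data good for? Let's make a word cloud from it. I'm a beginner too, so this is only a simple analysis.
The libraries used here are mainly: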

import pandas as pd
import numpy as np
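import matplotlib.pyplot as plt     # plotting library
import jieba               # Chinese word segmentation library
from wordcloud import WordCloud   # word cloud library
import PIL.Image as Image  # image loading (provided by the Pillow package)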
import jieba.analyse

If you don't have them installed, install them with pip:

pip install jieba
pip install pillow
pip install matplotlib
pip install wordcloud

First load the data scraped earlier.

data = pd.read_excel('想见你评论数据.xls')
print(data)

[Image]
One column is not needed: Unnamed: 0, which is the row index that to_excel() saved along with the data.

del data['Unnamed: 0']  # delete that column
print(data)

[Image]
Now it looks like this.

# sort by number of votes ('点赞数'), highest first
frame = data.sort_values(by=['点赞数'],ascending=False)
print(frame[['评论内容','点赞数']][:20])
print()
frame1 = frame['评论内容'][:20]  # text ('评论内容') of the 20 most-voted comments
# write them to a text file, one comment per line
with open('点赞数前20内容.txt','w',encoding='utf-8') as f:
    for i in frame1:
        f.write(i)
        f.write('\n')
# read the text back
with open('点赞数前20内容.txt','r',encoding='utf-8') as f:
    text = f.read()

Next, download a background image for the word cloud from the web. I used this one.
[Image]
Then create the word cloud.

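mask=np.array(Image.open('1.jpg'))  # 1.jpg is the word cloud background image
sep_list=jieba.cut(text,cut_all=False)
sep_list=" ".join(sep_list)  # join the tokens into one space-separated string
print(sep_list)

wc=WordCloud(
    font_path=r'..\simfang.ttf',  # font file to use (it must support Chinese)
    margin=3,
    mask=mask,  # background image
    background_color='white',  # background color
    max_font_size=70,
    max_words=50,
)
wc.generate(sep_list)  # build the word cloud
wc.to_file('2.jpg')  # save to a local file

# show the image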
plt.imshow(wc,interpolation='bilinear')
plt.axis('off')
plt.show()
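
A word cloud sizes words by how often they appear, so it can help to look at the counts directly. The sketch below uses collections.Counter on a made-up, space-separated string shaped like sep_list. The sample words and the rule of skipping single-character tokens are assumptions for illustration, not part of the analysis above.

```python
from collections import Counter

# Made-up stand-in for sep_list: tokens joined by spaces.
sample = '时间 穿越 爱情 的 时间 故事 爱情 时间 了'

# Skip single-character tokens, which are mostly particles and punctuation.
words = [w for w in sample.split() if len(w) > 1]
counts = Counter(words)

for word, n in counts.most_common(2):
    print(word, n)
# 时间 3
# 爱情 2
```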