My Big Data Journey -- Scraping Maoyan Reviews of Avengers: Endgame (复联4)

Latest recommended article published 2021-09-06 10:14:56

小牛头#

Category: Big Data

Article link: https://blog.csdn.net/qq_41562377/article/details/89763224

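A rant first: my battery just died and everything I had written was lost, so I have to write it all over again. CSDN, you're all grown up now; it's time you learned to autosave.

Yesterday I went to see the film with two friends, and overall it was pretty good. The only scenes that really got the audience laughing were the ones with Hulk.
Wearing two pairs of glasses for 3D is really uncomfortable. On top of that, some major life events came up out of nowhere (ones I really didn't want to happen). After the film my head hurt more and more, and last night I had insomnia for the first time.

My head still hurts a little, but the learning has to go on. I'm curious what audiences think of Avengers: Endgame, so today I'm going to scrape Maoyan's reviews of it.
The implementation follows. It is for learning only, and I don't want to put extra load on their servers.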

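First, open the page (link).

Only a handful of comments are shown there, which won't do. On the mobile version of the Avengers: Endgame page, however, you can see all the reviews.

Chrome is a handy tool: it can make the desktop browser behave like a mobile one. Press F12, click the device-toolbar icon (marked with a small red box in the original screenshot), then press F5 to refresh, once or twice.
You can then pick the phone model from the device list; remember to refresh after choosing.

Then scroll all the way down until you find "查看全部 \d+ 条评论" ("View all \d+ comments") and click it.

Keep scrolling after that and the JSON responses containing the comments will show up.

Next, we just need to work out the pattern in the URLs of the review JSON.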

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=15&ts=0&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=15&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=30&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=45&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=60&limit=15&ts=1556790644827&type=3

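Spot the pattern? Apart from ts, which switches from 0 to a fixed value after the first request, the only thing that changes is offset, which grows by 15 with each URL.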
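The pattern can be written down as a small URL builder. This is only a sketch: the helper names `build_comment_url` and `page_urls` are made up for illustration, and the parameter values are simply the ones seen in the URLs above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs listed above.
BASE_URL = "http://m.maoyan.com/review/v2/comments.json"


def build_comment_url(offset, limit=15, ts=0, movie_id=248172):
    """Build one comments URL (hypothetical helper, parameter order as observed)."""
    params = {
        "movieId": movie_id,
        "userId": -1,
        "offset": offset,
        "limit": limit,
        "ts": ts,
        "type": 3,
    }
    return BASE_URL + "?" + urlencode(params)


def page_urls(pages, limit=15, ts=0):
    """URLs for the first `pages` pages; offset grows by `limit` each time."""
    return [build_comment_url(i * limit, limit=limit, ts=ts) for i in range(pages)]


if __name__ == "__main__":
    for url in page_urls(3, ts=1556790644827):
        print(url)
```

Nothing is requested here; the function only builds strings, so it is safe to run while experimenting with the parameters.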

Now let's write the code. I recommend logging in and adding your own cookies.
FuLian4.py

import json
import time

import requests


class FL4:
    def __init__(self):
        self.headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Mobile Safari/537.36',
                      'Referer': 'http://m.maoyan.com/movie/248172/comments?_v_=yes',
                      'Connection': 'keep-alive',
                      'Cookie': 'lxsdk_cuid=16a77029578c8-09b499b0040059-39395704-1fa400-16a77029579c8; uuid_n_v=v1; iuuid=134B71006C9C11E984F25B6CA47A6EB12DA16CDD3CFA49059091A26926EFF957; webp=true; ci=20%2C%E5%B9%BF%E5%B7%9E; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=D41218F06C9A11E991BFF33D07D9D8F114AEFA62DB3048D1A0816CCD72F7EA47; __mta=217338537.1556774819301.1556783927287.1556788162297.7',
                      'Host': 'm.maoyan.com'}
        self.files = open('FuLian42.txt', 'w', encoding='utf-8')

    def req(self, url, retries=3):
        # Try a few times on network errors; return None if every attempt fails
        for _ in range(retries):
            try:
                response = requests.get(url=url, headers=self.headers, timeout=10)
                time.sleep(2)  # pause after each request to go easy on the server
                return response
            except requests.exceptions.RequestException:
                time.sleep(3)
        return None

    def get_json(self, response):
        # 'data' or 'comments' may be missing once the server stops answering
        data = json.loads(response.text).get('data') or {}
        comments = data.get('comments') or []
        for comment in comments:
            infos = {
                'userId': comment.get('userId'),  # user ID
                'nick': comment.get('nick'),  # nickname
                'gender': comment.get('gender'),  # gender
                'content': comment.get('content'),  # review text
                'score': comment.get('score'),  # rating
                'time': comment.get('time'),  # review time
                'userLevel': comment.get('userLevel')  # user level
            }
            info = json.dumps(infos, ensure_ascii=False)
            print(info)
            self.files.write(info)
            self.files.write('\n')

    def main(self):
        urls = ['http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset={}&limit=15&ts=1556792832710&type=3'.format(i * 15) for i in range(0, 100)]
        try:
            for url in urls:
                print(url)
                response = self.req(url)
                time.sleep(2)
                if response is None:
                    continue  # skip pages that could not be fetched
                self.get_json(response)
        finally:
            self.files.close()  # make sure everything is flushed to disk


if __name__ == '__main__':
    fl4 = FL4()
    fl4.main()

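The code above is one way to do it, but as soon as you reach about 1000 records, Maoyan stops serving you.

For data analysis, 1000 records are nowhere near enough.
Let's look at the URL and analyze it again.
The URL is shown below. Besides offset, it also contains a ts parameter, which is presumably a timestamp.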

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=15&limit=15&ts=1556844579617&type=3

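A single comment from the output contains both startTime and time, which at first glance look like timestamps. Pasting one into an online converter confirms it: they are timestamps, in milliseconds.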

{'avatarUrl': 'https://img.meituan.net/maoyanuser/cf2d33a3e16435e47a3c4c8e69fb22ba4458.jpg', 'buyTicket': False, 'content': '不错，有点小感动，希望还有第五部。', 'gender': 2, 'id': 1065529952, 'imageUrls': [], 'likedByCurrentUser': False, 'major': False, 'movie': {'id': 0, 'sc': 0}, 'movieId': 248172, 'nick': '请勿打扰～', 'replyCount': 0, 'score': 8, 'spoiler': False, 'startTime': '1556844600000', 'tagList': [{'id': 1, 'name': '好评'}, {'id': 4, 'name': '购票'}], 'time': 1556844600000, 'upCount': 0, 'userId': 1649520068, 'userLevel': 2, 'vipType': 0}

Convert the timestamps to dates to take a look:

    def get_json(self, response):
        data = json.loads(response.text).get('data')
        comments = data.get('comments')
        for comment in comments:
            times = comment.get('time')  # milliseconds since the epoch
            timeArray = time.localtime(times / 1000)  # localtime expects seconds
            otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
            print(otherStyleTime)
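As a standalone variant of the conversion above, here is a small sketch that makes the time zone explicit instead of relying on the machine's local zone. The helper name `ms_to_datetime` and the `CST` constant are illustrative, and treating the values as UTC+8 wall-clock times is an assumption for display purposes only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: China Standard Time (UTC+8) is the zone we want to display.
CST = timezone(timedelta(hours=8))


def ms_to_datetime(ms, tz=timezone.utc):
    """Convert a millisecond epoch timestamp to a timezone-aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=tz)


if __name__ == "__main__":
    sample = 1556844600000  # the 'time' value from the comment shown earlier
    print(ms_to_datetime(sample).isoformat())
    print(ms_to_datetime(sample, CST).strftime("%Y-%m-%d %H:%M:%S"))
```

For the sample value this gives 2019-05-03 00:50:00 in UTC and 08:50:00 in UTC+8.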

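From the output we can conclude:
1) The backend returns comments in descending order of time, with one-minute granularity.
2) While scraping, we don't know in advance how many comments come back before the timestamp changes.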
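The approach is as follows:
1) Each response contains many timestamps.
2) Send a request.
3) Record the first timestamp.
4) Record the second timestamp.
5) On reaching a third timestamp, set ts to the second timestamp and rebuild the URL.
6) If a single response never reaches a third timestamp, keep fetching by increasing the offset parameter until a third timestamp shows up.
What does that mean?
Let me draw a picture to explain; no complaining that it's ugly.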


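If you can't follow it, work out the pattern yourself from the trace (each "获得到的时间" line prints the time of one fetched comment):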

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=0&type=3
获得到的时间 1556849940000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849880000
获得到的时间 1556849820000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849880000&type=3
获得到的时间 1556849820000
获得到的时间 1556849820000
获得到的时间 1556849820000
获得到的时间 1556849760000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849820000&type=3
获得到的时间 1556849760000
获得到的时间 1556849760000
获得到的时间 1556849760000
获得到的时间 1556849760000
获得到的时间 1556849760000
获得到的时间 1556849760000
获得到的时间 1556849700000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849760000&type=3
获得到的时间 1556849760000
获得到的时间 1556849700000
获得到的时间 1556849700000
获得到的时间 1556849700000
获得到的时间 1556849640000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849700000&type=3
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849640000
获得到的时间 1556849580000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849640000&type=3
获得到的时间 1556849640000
获得到的时间 1556849580000
获得到的时间 1556849580000
获得到的时间 1556849580000
获得到的时间 1556849580000
获得到的时间 1556849580000
获得到的时间 1556849520000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849580000&type=3
获得到的时间 1556849520000
获得到的时间 1556849520000
获得到的时间 1556849400000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849520000&type=3
获得到的时间 1556849400000
获得到的时间 1556849400000
获得到的时间 1556849400000
获得到的时间 1556849400000
获得到的时间 1556849400000
获得到的时间 1556849340000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849400000&type=3
获得到的时间 1556849340000
获得到的时间 1556849340000
获得到的时间 1556849340000
获得到的时间 1556849340000
获得到的时间 1556849340000
获得到的时间 1556849340000
获得到的时间 1556849280000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849340000&type=3
获得到的时间 1556849280000
获得到的时间 1556849280000
获得到的时间 1556849280000
获得到的时间 1556849220000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849280000&type=3
获得到的时间 1556849220000
获得到的时间 1556849220000
获得到的时间 1556849160000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849220000&type=3

Process finished with exit code -1
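The trace above follows a cursor idea: the comment time is fed back as ts, and offset is only needed while a single timestamp fills a whole page. The sketch below simulates that offline against an in-memory list, so it makes no requests. It is a simplified variant, not the exact first/second/third timestamp bookkeeping described above: it moves ts to the oldest time on each page and drops duplicates by id. The fake server's behavior (ts == 0 means newest, otherwise only comments with time <= ts) is an assumption, not confirmed behavior of the real API, and all function names are made up.

```python
def make_fake_server(comments):
    """Return fetch(ts, offset, limit) over an in-memory list, newest first.

    Assumed semantics: ts == 0 starts from the newest comment; otherwise only
    comments with time <= ts are returned.
    """
    ordered = sorted(comments, key=lambda c: c["time"], reverse=True)

    def fetch(ts, offset, limit):
        pool = ordered if ts == 0 else [c for c in ordered if c["time"] <= ts]
        return pool[offset:offset + limit]

    return fetch


def crawl(fetch, limit=21):
    """Collect every comment using ts as a cursor and offset within one timestamp."""
    seen, collected = set(), []
    ts, offset = 0, 0
    while True:
        page = fetch(ts, offset, limit)
        for c in page:
            if c["id"] not in seen:  # pages overlap after ts moves, so deduplicate
                seen.add(c["id"])
                collected.append(c)
        if len(page) < limit:  # a short page means the data is exhausted
            break
        oldest = page[-1]["time"]
        if oldest != ts:
            ts, offset = oldest, 0  # move the cursor, restart the offset
        else:
            offset += limit  # whole page shares one timestamp, page within it
    return collected


def demo_comments():
    """Fake data: (minutes before base, comments in that minute); 30 exceeds limit."""
    comments, next_id = [], 1
    base = 1556849940000
    for minute, n in [(0, 1), (1, 11), (2, 30), (3, 4), (4, 7)]:
        for _ in range(n):
            comments.append({"id": next_id, "time": base - minute * 60000})
            next_id += 1
    return comments


if __name__ == "__main__":
    result = crawl(make_fake_server(demo_comments()))
    print(len(result), "comments collected")
```

Because the cursor keeps moving back in time, offset never grows large, which is what lets this kind of crawl go past the point where plain offset paging gets cut off.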

Rebuilding the URL, I found that the number of comments per request can be raised to 21; at 22 it stops working.
Final code

import csv
import json
import time

import requests


class FL4:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Mobile Safari/537.36',
            'Referer': 'http://m.maoyan.com/movie/248172/comments?_v_=yes',
            'Connection': 'keep-alive',
            'Cookie': 'lxsdk_cuid=16a77029578c8-09b499b0040059-39395704-1fa400-16a77029579c8; uuid_n_v=v1; iuuid=134B71006C9C11E984F25B6CA47A6EB12DA16CDD3CFA49059091A26926EFF957; webp=true; ci=20%2C%E5%B9%BF%E5%B7%9E; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=D41218F06C9A11E991BFF33D07D9D8F114AEFA62DB3048D1A0816CCD72F7EA47; __mta=217338537.1556774819301.1556783927287.1556788162297.7',
            'Host': 'm.maoyan.com'}

        self.count = 1
        # Comments fetched per request; Maoyan accepts at most 21
        self.limit = 21
        self.movieId = '248172'
        self.ts = 0
        self.offset = 0

    def req(self):
        # Build the URL for the current offset and ts cursor
        url = ('http://m.maoyan.com/review/v2/comments.json?movieId=' + self.movieId
               + '&userId=-1&offset=' + str(self.offset)
               + '&limit=' + str(self.limit) + '&ts=' + str(self.ts) + '&type=3')
        print(url)
        return url

    def open_url(self, url, retries=3):
        # Try a few times on network errors; return None if every attempt fails
        for _ in range(retries):
            try:
                response = requests.get(url=url, headers=self.headers, timeout=10)
                time.sleep(2)
                return response
            except requests.exceptions.RequestException:
                time.sleep(3)
        return None

    def get_json(self, response):
        ts_duration = self.ts
        res = json.loads(response.text)
        # 'data' or 'comments' may be missing once the server stops answering
        data = res.get('data') or {}
        comments = data.get('comments') or []
        for comment in comments:
            comment_time = comment['time']
            print('获得到的时间', comment_time)  # "time received"

            # Very first request: the first timestamp seen becomes the cursor
            if self.ts == 0:
                self.ts = comment_time
                ts_duration = comment_time

            # Second distinct timestamp: remember it
            if comment_time != self.ts and self.ts == ts_duration:
                ts_duration = comment_time

            # Third distinct timestamp: restart from the second one at offset 0
            if comment_time != ts_duration:
                self.ts = ts_duration
                self.offset = 0
                return self.req()

            # Here, on the second request, comment_time equals the comment_time of the first request
            else:
                infos = {
                    'userId': comment.get('userId'),  # user ID
                    'nick': comment.get('nick'),  # nickname
                    'gender': comment.get('gender'),  # gender
                    'content': comment.get('content'),  # review text
                    'score': comment.get('score'),  # rating
                    'time': comment.get('time'),  # time
                    'userLevel': comment.get('userLevel')  # user level
                }
                info = json.dumps(infos, ensure_ascii=False)
                print(info)
                with open('FL4.txt', 'a', encoding='utf-8') as f:
                    f.write(info)
                    f.write('\n')

                row = [infos['userId'], infos['nick'], infos['gender'], infos['content'],
                       infos['score'], infos['time'], infos['userLevel']]
                self.save_csv(row)
                self.count += 1

        # No third timestamp on this page: move on to the next page if there is one
        if (res.get('paging') or {}).get('hasMore'):
            self.offset += self.limit  # advance by exactly one page so no comments are skipped
            print('offset', self.offset)
            return self.req()
        else:
            return None

    def save_csv(self, row):
        # Append one row to the CSV file, using ';' as the delimiter
        with open('FL4.csv', 'a', newline='', encoding='utf-8') as c:
            csv.writer(c, delimiter=';').writerow(row)

    def main(self):
        url = self.req()
        # get_json returns None when there is nothing more to fetch
        while url:
            try:
                data = self.open_url(url)
                if data is None:
                    break  # the request kept failing, so stop instead of looping forever
                url = self.get_json(data)
            except Exception as e:
                print('error', e)
                break


if __name__ == '__main__':
    fl4 = FL4()
    fl4.main()

Multiprocessing or multithreading could also be used; more on that another time.