Python Crawler Example (5): Scraping TV Series JSON Data from Site XX

Latest recommended article published on 2024-05-01 16:15:14

199铱


Categories: Crawler, Python Crawler Column. Tags: crawler, cookies, retry, json, session

Article link: https://blog.csdn.net/linzhjbtx/article/details/87628131


Building on the earlier exercises, this post completes the project goal: scraping TV series JSON data from site XX.

Project approach

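First, send a request with cookies to fetch content that is only available after login (see Python Crawler Personal Notes (4): Sending Cookie Requests), and add retry-on-timeout handling to that request (see Python Crawler Personal Notes (3): Error Retry and Timeout Handling). After a successful login, send GET requests, process the TV series data with json.loads and json.dumps, and save it to a local html file.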

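The project uses the following modules, so import them at the top of the script. requests and retrying are third-party packages; json and os are part of the standard library.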

# Load the required modules
import requests
from retrying import retry
import json
import os

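First define a few URLs for later use. tmp_url1 is the login endpoint, tmp_url2 is the personal home page shown after a successful login, and tmp_url3 is the request URL for American TV series, with a {} placeholder for the start offset.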

class DoubanTVSpider:

    def __init__(self):  # initialize the instance
        self.tmp_url1 = 'https://accounts.douban.com/j/mobile/login/basic'
        self.tmp_url2 = 'https://m.douban.com/mine/'
        self.tmp_url3 = 'https://m.douban.com/rexxar/api/v2/subject_collection/tv_american/items?os=android&for_mobile=1&start={}&count=18&loc_id=108288&_=0'

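Next, define the functions that send the POST request and fetch the post-login content. The _post_request function is decorated with @retry. Its GET request is limited by timeout=1, so a request that takes longer than 1 second raises an exception and the whole function is called again. Retries are spaced 1000 milliseconds apart and stop after at most 10 attempts. The function creates a session named post_session, and the cookies set during login are stored in post_session. The request body passed as data=post_data must contain the account name and password for the site.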

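The post_request function wraps that call in exception handling. It first runs the request inside try; if an error is still raised once the retries are used up, except catches it and the function returns a POST failure message instead.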

    # POST request: log in, then fetch the page that is only visible after login
    @retry(stop_max_attempt_number=10, wait_fixed=1000)
    def _post_request(self, post_url, get_url_mine, post_data, post_headers):
        # The session keeps the cookies set by the login response
        post_session = requests.session()
        post_session.post(post_url, data=post_data, headers=post_headers)
        # A GET slower than 1 second raises, which triggers another attempt
        post_response = post_session.get(get_url_mine, headers=post_headers, timeout=1)
        return post_response.content.decode()

    def post_request(self, post_url, get_url_mine, post_data, post_headers):
        try:
            post_res = self._post_request(post_url, get_url_mine, post_data, post_headers)
            # print('POST Request Content:\n', post_res)
        except Exception:
            # Reached only after every retry attempt has failed
            post_res = 'Post Request Failed.'
        return post_res
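
To make the retry behaviour described above concrete, here is a minimal sketch of a fixed-wait retry decorator written with the standard library only. It is a simplified stand-in, not the implementation of the retrying package: the names simple_retry, flaky_fetch and calls are hypothetical, and the wait is given in seconds rather than milliseconds.

```python
import functools
import time


def simple_retry(max_attempts=10, wait_seconds=1.0):
    """Hypothetical stand-in for a retry decorator: re-run func until it succeeds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the last error
                    time.sleep(wait_seconds)  # fixed pause between attempts
        return wrapper
    return decorator


calls = []  # records every attempt so the behaviour can be inspected


@simple_retry(max_attempts=3, wait_seconds=0)
def flaky_fetch():
    """Simulate a request that times out twice and then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError('simulated timeout')
    return 'page content'


result = flaky_fetch()
print(result, 'after', len(calls), 'attempts')
```

The caller sees a single successful call even though the wrapped function ran three times. In the article's code, an outer try/except is still needed for the case where every attempt fails.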

Then save the fetched content, or the error message, to a local html file.

    def save_post_data(self, post_res):
        try:
            with open('douban_post_res.html', 'w', encoding='utf-8') as f:
                f.write(post_res)
            print('Save Post Data Successfully.')
        except (IOError, TimeoutError):
            print('Save Post Data Failed.')

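Next, send the GET request with get_request_2, in a way similar to the POST request above; it returns the decoded content get_str. The code for get_request_2 is not listed in this post. parse_data is the function that parses the decoded content. json.loads() converts the string returned by the GET request into a dict, from which the wanted fields are extracted: "subject_collection_items" holds the information for each American TV series, total is the total number of entries, and count is the number of entries per page. For example, if there are 100 entries in total and each page shows 20, there are 5 pages, so the request has to be sent 5 times in a loop. total and count are the variables that control this loop.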
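
Because get_request_2 is only described and never shown, the following is one possible sketch of it, not the author's code. It assumes the same timeout and attempt limit as the POST helper, and it takes the session as an explicit argument (the original is a method called as self.get_request_2(get_url, headers, cookie_para)). FakeSession, FakeResponse and the example.invalid URL are made up so the sketch runs without network access.

```python
def get_request_2(session, get_url, headers, cookies, timeout=1, max_attempts=10):
    """Hypothetical GET helper: fetch get_url and return the decoded body.

    session is any object with a requests-style .get() method, for example
    the object returned by requests.session().
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            response = session.get(get_url, headers=headers,
                                   cookies=cookies, timeout=timeout)
            return response.content.decode()
        except Exception as error:  # timeouts, connection errors, decode errors
            last_error = error
    raise RuntimeError('GET request failed') from last_error


class FakeResponse:
    """Made-up response carrying the three keys that parse_data reads."""
    content = '{"total": 1, "count": 1, "subject_collection_items": []}'.encode()


class FakeSession:
    """Made-up session that returns the canned response without any network I/O."""
    def get(self, url, headers=None, cookies=None, timeout=None):
        return FakeResponse()


get_str = get_request_2(FakeSession(), 'https://example.invalid/items?start=0', {}, {})
print(get_str)
```

With the real library, a requests session would be passed in place of FakeSession, and the cookie dict built from the browser cookie string would be passed as cookies.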

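Then comes saving the data. Before saving, call file_exit_dec to delete the local file if it already exists, so that data from a previous run is gone before new data is appended. Because list_data is a list, each element must be converted to a string with json.dumps() before it can be written to the local file.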

    def parse_data(self, get_str):
        list_s1 = json.loads(get_str)
        list_s2 = list_s1['subject_collection_items']
        total = list_s1['total']
        count = list_s1['count']
        return list_s2, total, count

    def file_exit_dec(self, file_name):
        try:
            os.remove(file_name)
        except IOError:
            print('File does not exist, now you can append the file by "with open"! ')

    def save_data(self, list_data, file_name):
        try:
            with open(file_name, 'a', encoding='utf-8') as f:
                for list_ele in list_data:
                    str_ele = json.dumps(list_ele, ensure_ascii=False)
                    f.write(str_ele)
                    f.write(',\n')
            print('Save Data Successfully!')
        except Exception:
            print('Failed Saving The Data.')
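
The parse-then-save round trip above can be tried without any network access. The sketch below uses a made-up response body containing only the three keys the article reads, and an in-memory io.StringIO buffer in place of the file opened in append mode; the titles and ids are invented sample values.

```python
import io
import json

# Made-up response body with the three keys read by parse_data.
get_str = json.dumps({
    'total': 2,
    'count': 2,
    'subject_collection_items': [
        {'title': '示例剧集A', 'id': '1'},
        {'title': 'Sample Show B', 'id': '2'},
    ],
})

data = json.loads(get_str)  # str -> dict
items = data['subject_collection_items']

buffer = io.StringIO()  # stands in for open(file_name, 'a', encoding='utf-8')
for item in items:
    # ensure_ascii=False keeps non-ASCII titles readable instead of \uXXXX escapes
    buffer.write(json.dumps(item, ensure_ascii=False))  # dict -> str
    buffer.write(',\n')

lines = buffer.getvalue().splitlines()
print(lines)
```

Each item ends up on its own line followed by a comma, so the saved file is a sequence of JSON objects rather than one valid JSON document.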

Finally, define the main body of the program, call the functions in the order described above, and then run the main program.

    def run(self):  # main program flow

        # ******************************************************************************************************
        # POST request: fetch the page data available after logging in to Douban
        # 1. Prepare the POST request data
        print('*'*100)
        post_url = self.tmp_url1
        get_url_mine = self.tmp_url2
        post_headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'}
        post_data = {'ck': '', 'redir': 'https://m.douban.com', 'name': 'aaa', 'password': 'bbb'}

        # 2. Send the POST request to log in
        post_res = self.post_request(post_url=post_url, get_url_mine=get_url_mine, post_data=post_data, post_headers=post_headers)

        # 3. Save the POST response
        self.save_post_data(post_res)

        # ******************************************************************************************************
        # GET request: fetch the Douban TV series data
        # 1. Prepare the GET request data
        print('\n', '*'*100, '\n')
        headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36',
                   'Referer': 'https://m.douban.com/tv/american'}
        get_cookie = 'bid=aaa; douban-fav-remind=bbb; __utmc=ccc; …………'
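        # Split on the first '=' only, because a cookie value may itself contain '='
        cookie_para = {get_co.split('=', 1)[0]: get_co.split('=', 1)[1] for get_co in get_cookie.split('; ') if '=' in get_co}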
        start = 0
        total = 20
        count = 0
        get_file_name = 'json_douban_tv.html'
        self.file_exit_dec(get_file_name)

        while start < total+count:
            get_url = self.tmp_url3.format(start)

            # 2. Send the request and get the response
            get_str = self.get_request_2(get_url, headers, cookie_para)

            # 3. Extract the data
            list_s, total, count = self.parse_data(get_str)

            # 4. Save the data
            self.save_data(list_s, get_file_name)

            # 5. Prepare the start offset for the next URL
            start += count

if __name__ == '__main__':
    spider = DoubanTVSpider()
    spider.run()
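
The paging loop in run can be studied in isolation with a fake data source. The sketch below is a variant, not the article's exact loop: it stops on start < total and adds a guard for empty pages, whereas the original condition is start < total+count. fetch_page is a hypothetical stand-in for one API call, and it assumes count equals the number of items actually returned.

```python
def fetch_page(start, page_size=18, total=40):
    """Hypothetical stand-in for one API call; returns (items, total, count)."""
    items = list(range(start, min(start + page_size, total)))
    return items, total, len(items)


def crawl_all():
    """Collect every item by advancing the start offset page by page."""
    collected = []
    start = 0
    total = 1  # placeholder so the loop runs once; replaced by the first response
    while start < total:
        items, total, count = fetch_page(start)
        if count == 0:
            break  # an empty page would otherwise leave start unchanged forever
        collected.extend(items)
        start += count
    return collected


all_items = crawl_all()
print(len(all_items))
```

With 40 fake entries and 18 per page, the loop makes three requests at offsets 0, 18 and 36, and the last page holds only 4 items.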
