Scraping Ajax-Loaded Images, Using Toutiao as an Example

泽阳Alex

Column: Python3 Web Crawling. Tags: python, Ajax

Article link: https://blog.csdn.net/qq_38379983/article/details/85269896

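This article uses Toutiao (今日头条) as an example to show how to analyze Ajax requests, scrape the images on a page, and save them locally.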

1. Preparation

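The principle behind scraping Ajax data was covered in the previous post, "Ajax data scraping, using Weibo as an example".

This post focuses on two further steps in handling Ajax data: scraping images and saving them to files.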

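In Chrome, open the Toutiao home page at https://www.toutiao.com/ and search for “街拍” (street snaps) in the search box at the top right. Open the developer tools (for example, right-click the page and choose “检查”, i.e. Inspect), switch to the Network tab, and select the XHR filter. Then scroll down the page so that more results load and the Ajax requests appear in the list.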

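Click one of the requests and open the Preview tab on the right to inspect the data. Entries under the data field have an image_list field: a list of dictionaries covering every image in that image set. Our task is to extract the url field of each entry and download the image. Each image set gets its own folder, named after the set's title.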

2. Observation

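Back in the Headers tab, look at the URL of this GET request (the Request URL). It carries parameters such as offset, keyword, autoload, count, cur_tab, from, and pd. Comparing it with the other requests shows that only offset changes: it is the offset that controls pagination. Next, we will simulate these Ajax requests to fetch the data in batches through the interface, parse it, and download the images.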
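
To make the paging behaviour concrete, here is a minimal sketch that builds the request URLs for the first three pages without sending anything. It assumes each page holds 20 results (the count value) and reuses the endpoint and parameters from the code in section 3; `build_url` is a hypothetical helper name, and the endpoint may well have changed since this post was written.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = 'https://www.toutiao.com/search_content/?'
PAGE_SIZE = 20  # assumption: one request returns `count` = 20 results


def build_url(offset):
    """Build the search URL for a given offset (hypothetical helper)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': str(PAGE_SIZE),
        'cur_tab': '3',
    }
    return BASE_URL + urlencode(params)


urls = [build_url(page * PAGE_SIZE) for page in range(3)]
for url in urls:
    # Only the offset value differs between the three URLs.
    print(parse_qs(urlparse(url).query)['offset'][0], url)
```

Note that urlencode() percent-encodes the Chinese keyword, so the printed URLs will not show “街拍” literally.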

3. Code

First, define a get_page() method that takes offset as its parameter:

import requests
from urllib.parse import urlencode

def get_page(offset):
    # Query parameters of the Ajax request; only offset changes between pages
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '3'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()  # parse the JSON body into a dict
    except requests.ConnectionError:
        pass
    return None  # connection failed or the status code was not 200

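[Code walkthrough] urlencode() builds the query string for the request URL, and requests fetches that URL. If the status code is 200, the json() method of the response parses the JSON body into a Python dictionary, which is returned; otherwise the function returns None.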
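
As a side note, json() does not hand back JSON text; it parses the response body into Python objects (dicts, lists, strings). The sketch below imitates that step offline with the standard-library `json.loads`, which does roughly the same job for a JSON body. The body string is made up, with its shape assumed from the Preview pane described above; the values are placeholders, not real Toutiao data.

```python
import json

# Hypothetical response body; the real payload has many more fields.
body = '{"data": [{"title": "demo", "image_list": [{"url": "//example.com/a.jpg"}]}]}'

parsed = json.loads(body)  # roughly what response.json() does for a JSON body
print(type(parsed).__name__)
print(parsed['data'][0]['title'])
print(parsed['data'][0]['image_list'])
```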

Next, a parsing method extracts every image link from the image_list field of each item and yields it together with the title:

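def get_images(json):
    # Yield one dict per image: its URL plus the title of its image set
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_list') or []  # guard against a missing image_list
            if not title:
                continue  # the title is needed later as the folder name
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }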
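
To see what the generator yields without touching the network, the sketch below runs the same extraction logic on a hand-written sample. The sample data is invented for illustration, and `extract_images` is a hypothetical stand-in for get_images() that skips entries lacking a title or an image_list; that search results can contain such entries is an assumption here, not something the captured data above confirms.

```python
def extract_images(payload):
    """Yield one {'image', 'title'} dict per image URL in the payload."""
    for item in payload.get('data') or []:
        title = item.get('title')
        images = item.get('image_list') or []  # tolerate missing/None lists
        if not title:
            continue  # no title means no folder name, so skip the entry
        for image in images:
            yield {'image': image.get('url'), 'title': title}


# Invented sample mimicking the structure described in the article.
sample = {
    'data': [
        {'title': 'set A', 'image_list': [{'url': '//example.com/1.jpg'},
                                          {'url': '//example.com/2.jpg'}]},
        {'title': 'no images here'},                       # no image_list
        {'image_list': [{'url': '//example.com/3.jpg'}]},  # no title
    ]
}

results = list(extract_images(sample))
for r in results:
    print(r)
```

Only the two images of "set A" come out; the other two entries are skipped instead of raising an error.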

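Next, implement save_image() to store an image; its item argument is one of the dictionaries yielded by get_images(). The method first creates a folder named after the item's title, then requests the image link, obtains the binary content, and writes it to a file in binary mode. The file is named after the MD5 digest of its content, so an image that has already been downloaded is not saved twice. The code is as follows: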

import os
from hashlib import md5

def save_image(item):
    if not os.path.exists(item.get('title')):   # create the folder for this image set
        os.mkdir(item.get('title'))
    try:
        # The url field is assumed to be protocol-relative ("//..."), hence the "http:" prefix
        response = requests.get("http:" + item.get('image'))
        if response.status_code == 200:
            # Name the file after the MD5 of its content so duplicates map to the same path
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image!')
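
The deduplication idea can be checked offline: because the file name is the MD5 digest of the image bytes, identical content always maps to the same path. The sketch below is an illustration under stated assumptions; `image_path` and `safe_dirname` are hypothetical helpers, and replacing characters such as `/` or `:` in the title is an addition not present in the original code, meant to avoid invalid folder names on some systems.

```python
import os
import re
from hashlib import md5


def safe_dirname(title):
    """Replace characters that are invalid in folder names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'


def image_path(title, content, ext='jpg'):
    """Return '<folder>/<md5 of content>.<ext>' for the given image bytes."""
    digest = md5(content).hexdigest()
    return os.path.join(safe_dirname(title), '{0}.{1}'.format(digest, ext))


a = image_path('street: snap 1/2', b'same bytes')
b = image_path('street: snap 1/2', b'same bytes')
c = image_path('street: snap 1/2', b'other bytes')
print(a == b)  # identical content gives an identical path
print(a == c)  # different content gives a different file name
print(os.path.dirname(a))
```

Hashing the content rather than the URL means the same picture served from two different links is still stored only once per folder.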

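Finally, iterate over a range of offsets, extract the image links for each page, and download them. Note that the loop below requests offsets 20 and 40; start the range at 0 to include the first page as well. The code is as follows: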

if __name__ == '__main__':
    for offset in range(1, 3):   # requests offsets 20 and 40
        json = get_page(offset * 20)
        if not json:             # skip pages that failed to load
            continue
        for item in get_images(json):
            print(item)
            save_image(item)

Running the script creates one folder per image set, each holding that set's images.