Python 3 Crawler Study Notes: Analyzing Ajax to Scrape Toutiao Street-Snap Photos (Part 8)

Latest recommended article published on 2022-05-29 13:39:39

不吃鱼的猫~

Categories: python3 crawler, code. Tags: python3 crawler

Article link: https://blog.csdn.net/u012433049/article/details/100895471

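After working through Chapter 6 above, we know how to analyze an Ajax-driven page and extract its data. This chapter goes deeper with a concrete example: scrape the street-snap (街拍) photo sets from Toutiao (今日头条) and download each set into its own local folder.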

1. Preparation

Install the environment: requests, BeautifulSoup, and so on.

2. Analyzing the requests

Before scraping, first work out the scraping logic. Open Toutiao and enter '街拍' (street snaps) in the search box.

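Open the developer tools and switch to the XHR filter tab, where the Ajax requests show up. To check whether a response contains the data shown on the page, expand its data field: it holds many entries. Expanding the first entry reveals a title field, which matches the title of the first result on the page.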

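Each entry has an image_list field, a list holding every image in the photo set. We only need to extract the url field of each list element and then download the image.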
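The extraction described above can be sketched on a small stand-in for the response. The `sample` dictionary and the `extract_urls` helper are hypothetical and only mirror the fields mentioned in the text (`data`, `title`, `image_list`, `url`); a real response carries many more fields.

```python
# Hypothetical, trimmed-down response: only the fields discussed above.
sample = {
    'data': [
        {'title': 'Street snap set A',
         'image_list': [{'url': 'http://example.com/a1.jpg'},
                        {'url': 'http://example.com/a2.jpg'}]},
        # Some entries have neither a title nor images and must be skipped.
        {'cell_type': 'no title or images here'},
    ]
}


def extract_urls(result):
    """Return (title, url) pairs for every image in every titled entry."""
    pairs = []
    for entry in result.get('data') or []:
        title = entry.get('title')
        if title is None:
            continue
        for image in entry.get('image_list') or []:
            url = image.get('url')
            if url:
                pairs.append((title, url))
    return pairs


for title, url in extract_urls(sample):
    print(title, url)
```

Using `or []` keeps the loops safe when `data` or `image_list` is missing or null, which is common for non-photo entries.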

Next, simulate the Ajax request in Python, then extract the image links and download them. First, analyze the URL pattern:

https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=4&from=media&pd=user&timestamp=1568618965291

https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=2&from=video&pd=video&timestamp=1568618964133

https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1568618971945

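Pattern:

(1) aid is always 24, app_name is always web_search, format is always json, keyword is always %E8%A1%97%E6%8B%8D (the URL-encoded form of 街拍), autoload is always true, count is always 20, and en_qc is always 1.

(2) cur_tab is 1, 2, or 4 (1 = general, 2 = video, 4 = user); from is media (user), video (video), or search_tab (general); pd is user (user), video (video), or synthesis (general); offset increases by 20 per page.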
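The pattern above can be turned into a small URL builder. This is a minimal sketch: `BASE_URL`, `TABS`, and `build_url` are hypothetical names, and it assumes the server accepts requests without the timestamp parameter (the script later in this post also leaves it out).

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/'

# (cur_tab, from, pd) for each tab, as observed in the three URLs above.
TABS = {
    'synthesis': ('1', 'search_tab', 'synthesis'),
    'video': ('2', 'video', 'video'),
    'user': ('4', 'media', 'user'),
}


def build_url(tab, page, keyword='街拍'):
    """Build the search URL for the given tab and zero-based page number."""
    cur_tab, from_value, pd_value = TABS[tab]
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': page * 20,  # offset grows by 20 per page
        'format': 'json',
        'keyword': keyword,   # urlencode percent-encodes this as UTF-8
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': cur_tab,
        'from': from_value,
        'pd': pd_value,
    }
    return BASE_URL + '?' + urlencode(params)


print(build_url('synthesis', 0))
```

Letting `urlencode` handle the keyword avoids pasting the percent-encoded string by hand and makes it easy to search for a different term.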

3. Hands-on

Now extract the images from the general (synthesis) tab.

For this tab, cur_tab stays fixed at 1, from at search_tab, and pd at synthesis.

4. Full code: toutiao_images.py

# toutiao_images.py
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 16 16:00:47 2019

@author: Administrator
"""

import requests
from urllib.parse import urlencode

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0',
    'X-Requested-With': 'XMLHttpRequest',
}




def get_page(offset):
    """Request one page of search results; return the parsed JSON, or None on failure."""
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    # urlencode percent-encodes the keyword, giving the same URL analyzed above.
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    print(url)

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
        print(response.status_code)
    except requests.ConnectionError as e:
        print(e.args)
    except ValueError:
        # The response body was not valid JSON.
        print('response is not JSON')
    return None
    
    
def get_images(json):
    """Yield one {'image': url, 'title': title} dict per image in the response."""
    # get_page returns None on failure, and a response may carry no data.
    if not json or not json.get('data'):
        return
    for item in json.get('data'):
        title = item.get('title')
        if title is None:
            continue
        # Entries without pictures have no image_list, so fall back to an empty list.
        for image in item.get('image_list') or []:
            yield {
                'image': image.get('url'),
                'title': title,
            }
               
import os
from hashlib import md5

def save_image(item):
    """Download one image into a folder named after its title."""
    # exist_ok avoids an error when the folder already exists.
    os.makedirs(item.get('title'), exist_ok=True)
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            # Name the file by the MD5 of its content so duplicate images are skipped.
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('already downloaded', file_path)
    except requests.ConnectionError:
        print('failed to save image')
        
def main(offset):
    json = get_page(offset)
    #print(json)
    for item in get_images(json):
        #print(item)
        save_image(item)
        
        
if __name__=='__main__':
    for i in range(0,10):
        print('offset:',i*20)
        main(offset=i*20)
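
One caveat about using the title as the folder name: a title can contain characters that are not valid in a folder name (for example `/`, `?`, or `:` on Windows), in which case creating the folder would fail. A minimal sketch of one way to guard against that, assuming a hypothetical helper `safe_dirname` that is not part of the script above:

```python
import re

# Hypothetical helper (not in the original script): characters that Windows
# and most file systems reject in a folder name are replaced by an underscore.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_dirname(title, max_length=80):
    """Return a version of title that is safe to use as a folder name."""
    cleaned = _INVALID_CHARS.sub('_', title).strip(' .')
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or 'untitled'


print(safe_dirname('街拍: 夏日/街头?'))
```

In `save_image`, the folder name could then be computed once with `safe_dirname(item.get('title'))` and reused for both the folder and the file path.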

If this was useful to you, give it a like (*_*)