Scraping Toutiao street snaps: the Ajax analysis workflow, simulating Ajax pagination, and downloading images

Latest recommended article published on 2020-05-24 15:56:18

狼性书生

Categories: web scraping, python

Article link: https://blog.csdn.net/qq_42278240/article/details/88585031

Scraping Toutiao street snaps

What is Ajax

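Ajax, short for Asynchronous JavaScript and XML, is a technique that uses JavaScript to exchange data with the server and update part of a web page without refreshing the page or changing its URL. For more background, see the material on W3School.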

Inspecting and analyzing the requests

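1. Inspect the requests:
Using Chrome as an example, open the Toutiao street-snap search page: https://www.toutiao.com/search/?keyword=街拍
Right-click and choose Inspect, switch to the Network tab, click XHR, then press F5 to reload.
Among the Request Headers is x-requested-with: XMLHttpRequest, which marks the request as an Ajax request.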
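
As a small illustration of the header check described above, the following sketch tests whether a set of request headers marks an Ajax request. The helper name `is_ajax_request` and the sample header dictionaries are made up for this example. Note that not every asynchronous request sets this header, so it is best treated as a hint rather than a guarantee.

```python
def is_ajax_request(headers):
    """Return True if the headers carry X-Requested-With: XMLHttpRequest."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get('x-requested-with') == 'XMLHttpRequest'


# Hypothetical header sets, as one might copy them by hand from the Network tab.
xhr_headers = {'x-requested-with': 'XMLHttpRequest', 'accept': 'application/json'}
page_headers = {'accept': 'text/html'}

print(is_ajax_request(xhr_headers))   # True
print(is_ajax_request(page_headers))  # False
```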

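2. Analyze the requests:
Look at Query String Parameters under Headers.
GET requests generally carry their parameters as Query String Parameters, while POST requests carry them as Form Data.
Scroll down so the page loads more results, and you will see that every parameter stays the same except timestamp and offset. timestamp is presumably generated anew for each request (it looks random); for a GET endpoint like this one it does not affect the content we get back, whereas for a POST endpoint it could matter a great deal. Comparing several of the Ajax URLs shows that offset grows in steps of 20, so it follows a regular pattern.
We can then build the request with urlencode() from urllib.parse: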

from urllib.parse import urlencode

offset = 0  # example value: the first page; add 20 for each further page

params = {
    'aid': '24',
    'app_name': 'web_search',
    'offset': offset,
    'format': 'json',
    'keyword': '街拍',
    'autoload': 'true',
    'count': '20',
    'en_qc': '1',
    'cur_tab': '1',
    'from': 'search_tab',
    'pd': 'synthesis'
}
base_url = 'https://www.toutiao.com/api/search/content/?'
url = base_url + urlencode(params)
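
To make the pagination pattern concrete, here is a minimal sketch that builds the URLs for the first few pages, using the same parameters as above. The names `build_page_url`, `BASE_URL`, and `PAGE_SIZE` are introduced only for this illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/?'
PAGE_SIZE = 20  # offset grows by 20 per page, as observed in the Network tab


def build_page_url(page):
    """Build the search URL for a zero-based page index."""
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': page * PAGE_SIZE,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': str(PAGE_SIZE),
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    # urlencode() percent-encodes the UTF-8 bytes of non-ASCII values.
    return BASE_URL + urlencode(params)


# Print the URLs for offsets 0, 20 and 40.
for page in range(3):
    print(build_page_url(page))
```

Only the offset value differs between the printed URLs, which mirrors what the browser sends as the page is scrolled.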

Parsing the response

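Click Preview and you can see that the content is organized as key-value pairs.

The images are under the image_list sub-key of each entry in data, and the title is under the title sub-key. The extra entries that have neither images nor a title all contain a cell_type key, so we can filter out this extra data by checking whether cell_type is present.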
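
The filtering rule can be tried offline on a hand-written dictionary. The sketch below assumes the response has the shape described above; the `sample` data, the example.com URLs, and the function name `extract_images` are invented for illustration and are not real API output.

```python
def extract_images(response):
    """Yield {'title', 'image'} dicts from a search response dictionary."""
    for item in response.get('data') or []:
        # Entries that carry cell_type are the extra, non-article records.
        if item.get('cell_type') is not None:
            continue
        title = item.get('title')
        for image in item.get('image_list') or []:
            yield {'title': title, 'image': image.get('url')}


# Hand-written sample that mimics the structure seen in the Preview tab.
sample = {
    'data': [
        {'cell_type': 67, 'display': {}},
        {'title': 'Street snap A',
         'image_list': [{'url': 'http://example.com/a1.jpg'},
                        {'url': 'http://example.com/a2.jpg'}]},
        {'title': 'Street snap B', 'image_list': []},
    ]
}

for record in extract_images(sample):
    print(record)
```

The first entry is skipped because it has cell_type, and the last one contributes nothing because its image list is empty.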

Downloading images:

1. One way is to write the image to disk in binary mode using with open().
First request the image URL:

resp = requests.get(item.get('image'))

Then write the data in binary form:

with open(file_path, 'wb') as f:
    f.write(resp.content)

2. The other way is to call urlretrieve() from urllib.request (this needs import urllib.request):

urllib.request.urlretrieve(item.get('image'), file_path)
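
Either way, the image bytes end up in a file. The sketch below isolates that write step and names each file after the MD5 hash of its content, the same idea the full script uses to avoid saving duplicates. The helper name `save_bytes`, the temporary directory, and the fake image bytes are assumptions made for this example so that it runs without any network access.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(content, directory):
    """Write content under an MD5-based file name; return (path, created)."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        # Identical content was saved before, so skip the write.
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


target_dir = tempfile.mkdtemp()
fake_image = b'not a real image, just sample bytes'
print(save_bytes(fake_image, target_dir))  # first call creates the file
print(save_bytes(fake_image, target_dir))  # second call finds it already there
```

Because the name depends only on the content, downloading the same picture from two different URLs still produces a single file.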

Code implementation

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool



def get_page(offset):
    # Keep the parameters in the same order as in the URL observed in the browser
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        # print(url)
        if codes.ok == resp.status_code:
            # print(resp.json())
            return resp.json()
    except requests.ConnectionError:
        pass
    # Connection error or non-200 status: there is nothing to parse
    return None



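def get_images(json):
    # get_page() returns None on failure, so guard before calling .get()
    if not json or not json.get('data'):
        return
    for item in json.get('data'):
        # Entries with cell_type are extra records without images or a title
        if item.get('cell_type') is not None:
            continue
        title = item.get('title')
        if not title:
            continue
        # Replace colons with "1" so the title is a valid directory name
        title = title.replace(":", "1")
        images = item.get('image_list') or []
        for image in images:
            origin_image = image.get('url')
            yield {
                'image': origin_image,
                'title': title
            }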



def save_image(item):
    # Create the main directory img and a subdirectory named after the title
    img_path = 'img' + os.path.sep + item.get('title')
    os.makedirs(img_path, exist_ok=True)
    try:
        # Request the image URL
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            # Name the file after the MD5 of its content to avoid duplicates
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                # Alternatively, download with urllib's urlretrieve():
                # urllib.request.urlretrieve(item.get('image'), file_path)
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image', item)


def main(offset):
   json = get_page(offset)
   for item in get_images(json):
       save_image(item)



if __name__ == '__main__':
    for i in range(0, 7):
        offset = 20 * i
        main(offset)

'''
# Alternatively, download with a multiprocessing pool
GROUP_START = 0
GROUP_END = 7

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
'''
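
To see how the offsets are handed out to a pool without touching the network, here is a small sketch. It swaps in `ThreadPool` (threads rather than the processes used by `Pool`), which offers the same `map()` interface, and `crawl_page` is a hypothetical stand-in for `main(offset)` that only returns a label.

```python
from multiprocessing.pool import ThreadPool

GROUP_START = 0
GROUP_END = 7


def crawl_page(offset):
    """Stand-in for main(offset): returns a label instead of downloading."""
    return 'page at offset {}'.format(offset)


offsets = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]

# map() blocks until every task has finished and keeps the input order.
with ThreadPool(4) as pool:
    results = pool.map(crawl_page, offsets)

print(results)
```

Note that `range(GROUP_START, GROUP_END + 1)` covers eight offsets (0 through 140), one page more than the sequential loop's `range(0, 7)`.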

Output:

[image]
