I've been reading Cui's chapter on Ajax lately and gave it a try myself. There's a reason the masters are masters... As a beginner, I have my own way of writing code (experts, please go easy on me).
Enough talk!
First, let's look at the response content for Toutiao's 街拍 (street photography) image search:
Click the Network tab and you can see this is a GET request. Since Ajax requests have type XHR, click the XHR filter (I pulled the page down a few times, so there are 7 XHR entries; without scrolling there would be only one). Comparing the URL of each request, only the offset parameter changes, so we can iterate through pages simply by changing offset.
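To make the pagination concrete, here is a minimal sketch (just an illustration of the pattern described above) that prints the first three page URLs:

# The captured request URL, with the only changing parameter left as a placeholder
url_template = ("https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search"
                "&offset={}&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true"
                "&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis")

# count=20 results per page, so each pull-down advances offset by 20
for offset in range(0, 60, 20):
    print(url_template.format(offset))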
Then click Preview:
Here you can see the response content. Clicking through 0, 1, 2... under data shows each topic's title, image_url, and so on (the entries near the front with no title or url are noise). All we have to do is simulate the Ajax request, extract each title and image URL, and download the images through those URLs.
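For orientation, the part of the response we rely on is shaped roughly like this (an illustrative sketch based on the Preview pane; the values are made up and most fields are omitted):

# Illustrative shape only -- real responses carry many more fields
content = {
    'data': [
        {},  # a noise entry: no title or image_list
        {
            'title': 'some street-shot topic',
            'image_list': [
                {'url': 'http://example.com/img/1.jpg'},  # made-up URL
                {'url': 'http://example.com/img/2.jpg'},
            ],
        },
    ],
}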
First we pin down base_url and headers, then write the function that fetches the JSON:
import requests

base_url = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset={}&format=json&keyword=" \
           "%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Fetch one page of results and return the parsed JSON (None on failure)
def get_page(offset):
    url = base_url.format(offset)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error!', e.args)
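A quick sanity check before going further (this assumes the code above; the data key is what the Preview pane showed):

# Fetch the first page and count the entries under 'data'
content = get_page(0)
if content:
    print(len(content.get('data') or []), 'items on the first page')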
Next, let's extract the titles and image URLs from the response:
# Parse one page: yield one {'url', 'title'} dict per image
def parse_page(content):
    # skip empty responses (e.g. a failed request)
    if content:
        items = content.get('data') or []
        for item in items:
            # skip the noise entries mentioned above: no title or no image_list
            if item.get('title') is None or item.get('image_list') is None:
                continue
            # yield every image url in this item, paired with its title
            for ite in item.get('image_list'):
                yield {
                    'url': ite.get('url'),
                    'title': item.get('title')
                }
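Wiring the two functions together gives a stream of dicts (output depends on whatever Toutiao returns at the time):

# Print the (title, url) pairs from the first page
for item in parse_page(get_page(0)):
    print(item['title'], '->', item['url'])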
Then we can download each image through its URL and save it!
# Download one image and save it, named after the last segment of its URL
def write_to_file(dictionary):
    response = requests.get(dictionary.get('url'), headers=headers)
    picture = dictionary.get('url').split('/')[-1]
    if response.status_code == 200:
        with open(picture + '.png', 'wb') as f:
            f.write(response.content)
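One caveat: two different images can share the same last URL segment and would overwrite each other. A common alternative (a hedged sketch, not what the code above does) is to name each file after the MD5 hash of its bytes:

import hashlib

# Alternative naming: the hash of the image bytes, so duplicates collapse
# and distinct images never collide
def write_to_file_md5(dictionary):
    response = requests.get(dictionary.get('url'), headers=headers)
    if response.status_code == 200:
        name = hashlib.md5(response.content).hexdigest()
        with open(name + '.png', 'wb') as f:
            f.write(response.content)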
Full source:
#!/usr/bin/python
# -*- coding:utf-8 -*-
# @Time : ****
# @Author : *******
# @File : toutiao_picture.py
import requests

base_url = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset={}&format=json&keyword=" \
           "%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# get json
def get_page(offset):
    url = base_url.format(offset)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error!', e.args)
# parse page
def parse_page(content):
    if content:
        items = content.get('data') or []
        for item in items:
            if item.get('title') is None or item.get('image_list') is None:
                continue
            for ite in item.get('image_list'):
                yield {
                    'url': ite.get('url'),
                    'title': item.get('title')
                }
# save
def write_to_file(dictionary):
    response = requests.get(dictionary.get('url'), headers=headers)
    picture = dictionary.get('url').split('/')[-1]
    if response.status_code == 200:
        with open(picture + '.png', 'wb') as f:
            f.write(response.content)
# main
def main(offset):
    content = get_page(offset)
    for item in parse_page(content):
        write_to_file(item)

if __name__ == '__main__':
    for i in range(0, 121, 20):
        main(offset=i)
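The pages are independent of each other, so the serial loop could be swapped for a process pool (a hedged sketch on top of the source above, not part of it; Pool comes from the standard library):

from multiprocessing.pool import Pool

# Parallel variant: fetch the seven pages concurrently
if __name__ == '__main__':
    pool = Pool()
    pool.map(main, range(0, 121, 20))
    pool.close()
    pool.join()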
This is all pretty rough, please don't flame me~
Here's the result: