Let's skip the preamble and get straight to the topic.
URL of the page to scrape
https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202108220858130102121920511F8DC562%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22}
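Alternatively, type 街拍美女 (street-style photos of women) into the search box at https://www.toutiao.com/, click Search, then click the area circled in red in the screenshot below to reach the page to scrape.
Scrape link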
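The search URL above is percent-encoded, which makes its parameters hard to read. As a minimal sketch (the helper name decode_query is an assumption made for illustration, not part of the original code), the standard library can decode it back into readable parameters:

```python
import json
from urllib.parse import urlsplit, parse_qs

# The search URL quoted above, split across lines for readability
url = (
    "https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3"
    "&pd=atlas&dvpf=pc&aid=4916&page_num=0"
    "&search_json={%22from_search_id%22:%22202108220858130102121920511F8DC562%22,"
    "%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22,"
    "%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22}"
)


def decode_query(url):
    """Hypothetical helper: return the query parameters as decoded strings."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key appears once here
    return {key: values[0] for key, values in parse_qs(query).items()}


params = decode_query(url)
# search_json is itself a JSON document embedded in the query string
search_json = json.loads(params["search_json"])
print(params["keyword"], params["page_num"])
print(search_json["origin_keyword"])
```

Decoding the URL first makes it easier to decide which parameters to put in the params dictionary later on.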
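Once on the page, right-click and choose Inspect, open the Network tab, and enable the XHR filter, because the images are loaded via Ajax. Scrolling down loads new images and sends new Ajax requests at the same time; the content to scrape can be found under the Preview tab of each request.
Scrape link
The request URL is shown under Headers. Comparing it with the URL of the next Ajax request shows that page_num differs by 1.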
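The comparison described above can also be done programmatically. The following is a rough sketch: diff_query_params is a hypothetical helper, and the two URLs are shortened stand-ins for the real request URLs copied from the Headers tab, keeping only a few of their parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def diff_query_params(url_a, url_b):
    """Hypothetical helper: map each differing parameter to (value_a, value_b)."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Shortened stand-ins for two consecutive Ajax request URLs (assumed values)
first = "https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3&pd=atlas&page_num=1"
second = "https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3&pd=atlas&page_num=2"

changed = diff_query_params(first, second)
print(changed)
```

If only page_num shows up in the result, it is the only parameter that needs to vary between requests.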
Scraping code
# Import the Python libraries and define the request headers
import requests
from urllib.parse import urlencode
import os
from multiprocessing.pool import Pool

# Create the output directory. exist_ok avoids a FileExistsError on a rerun
# or when a worker process re-imports this module.
os.makedirs('美女', exist_ok=True)

# Headers for the Ajax search request
headers = {
    'Cookie': '_S_DPR=1.25; _S_IPAD=0; MONITOR_WEB_ID=6998732191069242893; _S_WIN_WH=1536_754; ttwid=1%7CoFxjodGO-vtY_O_K8G9x4pwu4gz1ICVhReOIH8j_rNI%7C1629523025%7C8a1555a0c15f383da52d213ef8966aba0fd175bb1e07f6d269e4be9aa0c09083',
    'Host': 'so.toutiao.com',
    'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202108211225210101501390121ED0A112%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%E7%BE%8E%E5%A5%B3%22}',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}

# Headers for downloading the image files themselves
headers1 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
This screenshot shows the parameters of the URL; among them, page_num is the one that varies.
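To see how a parameter dictionary turns into a request URL, here is a small offline sketch. build_url is a hypothetical helper that uses only a subset of the parameters (it omits search_json, rawJSON and search_id), so it illustrates the encoding step rather than a request the server is guaranteed to accept.

```python
from urllib.parse import urlencode

BASE_URL = 'https://so.toutiao.com/search?'


def build_url(page_num):
    """Hypothetical helper: build a search URL in which only page_num varies."""
    params = {
        'keyword': '街拍美女',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
    }
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII values
    return BASE_URL + urlencode(params)


for page in range(3):
    print(build_url(page))
```

The printed URLs differ only in their trailing page_num value, which matches what the network panel shows.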
def get_page(page_num):
    # Build the dictionary of query parameters
    params = {
        'keyword': '街拍美女',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        'search_json': '{"from_search_id":"202108211225210101501390121ED0A112","origin_keyword":"街拍美女","image_keyword":"街拍美女"}',
        'rawJSON': '1',
        'search_id': '202108211225520102122020840E88BAED'
    }
    # The base URL is the part of the full URL before the query string
    url = 'https://so.toutiao.com/search?' + urlencode(params)  # assemble the full URL
    try:
        res = requests.get(url, headers=headers)
        if res.status_code == 200:
            return res.json()
    except requests.ConnectionError:
        pass
    # Reached on a non-200 status or a connection error
    return None
def download(data):
    # get_page() returns None on failure, so guard before reading the payload
    if data and data.get('rawData'):
        items = data.get('rawData').get('data') or []
        for item in items:
            image_id = item.get('id')  # image id
            url = item.get('img_url')  # image download URL
            if not url:
                continue  # skip entries without a download URL
            print(url)
            filename = '美女/{}.jpg'.format(image_id)
            # Check first so that existing files are not fetched again
            if os.path.exists(filename):
                print('Already Downloaded', filename)
                continue
            try:
                res = requests.get(url, headers=headers1)
                if res.status_code == 200:
                    with open(filename, 'wb') as fp:
                        fp.write(res.content)  # save the image
            except requests.ConnectionError:
                print('Failed to Save Image')


def main(page_num):
    data = get_page(page_num)
    download(data)
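START = 1  # first page
END = 23  # last page

if __name__ == '__main__':
    # Use a process pool and its map() method to download pages in parallel
    pool = Pool()
    # page_num increases by 1 per request, so pass the page numbers directly
    groups = list(range(START, END + 1))
    pool.map(main, groups)
    pool.close()  # close() must be called before join()
    pool.join()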
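The pool pattern above can be tried without touching the network. This sketch makes two assumptions that differ from the original code: it swaps in ThreadPool (a thread-based pool from the same module with the same map() interface, which is often sufficient for I/O-bound downloads) instead of the process-based Pool, and it uses a hypothetical fake_main in place of main. It also assumes page_num advances by 1, as observed in the network panel.

```python
from multiprocessing.pool import ThreadPool

START = 1   # first page
END = 23    # last page


def fake_main(page_num):
    """Hypothetical stand-in for main(): no network access, just reports the page."""
    return 'page {} done'.format(page_num)


def run_all(worker, start, end, workers=4):
    """Run worker once per page number and return the results in page order."""
    pages = list(range(start, end + 1))
    # The with-block cleans up the pool once map() has returned
    with ThreadPool(workers) as pool:
        return pool.map(worker, pages)


results = run_all(fake_main, START, END)
print(len(results), results[0], results[-1])
```

map() blocks until every page has been processed and keeps the results in input order, so no explicit close() and join() calls are needed in this form.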
This article scrapes data that is loaded via Ajax.
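Because the Ajax response is plain JSON, the extraction step can be tested on its own. The sketch below assumes the response has the shape the code above relies on (a rawData object holding a data list whose items carry id and img_url); extract_images and the sample payload are hypothetical and made up for illustration.

```python
def extract_images(data):
    """Hypothetical helper: yield (image_id, img_url) pairs from a response dict.

    Tolerates a missing response (None) and items without an img_url.
    """
    raw = (data or {}).get('rawData') or {}
    for item in raw.get('data') or []:
        url = item.get('img_url')
        if url:
            yield item.get('id'), url


# Made-up sample payload in the assumed shape, not real Toutiao data
sample = {
    'rawData': {
        'data': [
            {'id': 'a1', 'img_url': 'https://example.com/a1.jpg'},
            {'id': 'a2'},  # no img_url, so it is skipped
        ]
    }
}

print(list(extract_images(sample)))
print(list(extract_images(None)))
```

Keeping the parsing separate from the download makes it easier to notice when the site changes its response format.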
Other articles in this series