Learning by Example: Scraping HD Images from Pexels (Learning About Asynchronous Loading)

JUNECODE

Categories: Python, Web Scraping. Tags: python, web scraping example

Article link: https://blog.csdn.net/JUNECODE/article/details/100590163

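While learning to scrape images from Pexels recently, I found that the code from the book raises ConnectionError. From what I could find, the likely cause is that requests are sent too frequently and the site blocks them. time.sleep() can be used to slow the crawl down, but because there is a lot of data to fetch and the run time would become too long, I chose instead to catch the exception and skip the failed request. The corrected code is given here.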

Development environment: (Windows) Eclipse + PyDev

Target URL: https://www.pexels.com/search/book/

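1. The page keeps loading new content as you scroll down, which shows that it uses asynchronous loading (AJAX).

2. Inspect the page with the developer tools (F12 -> Network -> Headers) to find the request URL.

3. Remove parts of the URL string step by step to shorten it. With "https://www.pexels.com/search/book/?page=2" the server still returns the normal page content.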
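Step 3 can also be sketched in code. The snippet below is a minimal illustration, not the method used in this post: the helper name `keep_params` and the long example URL (including its `format`, `seed` and `type` parameters) are made up for demonstration, and only the `page` parameter comes from the article.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted=("page",)):
    """Return url with every query parameter removed except those in wanted."""
    parts = urlsplit(url)
    # parse_qsl drops parameters with empty values by default
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in wanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical request URL as it might be copied from the Network tab
long_url = "https://www.pexels.com/search/book/?format=js&seed=abc123&page=2&type="
print(keep_params(long_url))
```

Whether the shortened URL still returns usable content has to be checked in the browser or with a request, as described in step 3.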

Code:

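# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

# Search result pages 1 to 19
urls = ['https://www.pexels.com/search/book/?page={}'.format(i) for i in range(1, 20)]

image_urls = []        # collected image URLs (renamed from "list" to avoid shadowing the built-in)

# The folder must already exist. The trailing slash makes path + file name
# point inside the folder instead of producing a "picture..." file next to it.
path = 'D:/Pyproject/pexels/picture/'

for url in urls:
    try:
        wb_data = requests.get(url, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        imgs = soup.select('article > a > img')
        for img in imgs:
            photo = img.get('src')
            image_urls.append(photo)
            print('Loaded')
    # requests raises its own ConnectionError class, which a bare
    # "except ConnectionError" (the built-in) does not catch
    except requests.exceptions.ConnectionError:
        print('Page request failed, skipping')

for item in image_urls:
    try:
        data = requests.get(item, headers=headers)
        # File name: last 10 characters of the URL without its query string
        fp = open(path + item.split('?')[0][-10:], 'wb')
        fp.write(data.content)
        fp.close()
        print('Downloaded')
    except requests.exceptions.ConnectionError:
        print('Download failed, skipping')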
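The introduction mentions time.sleep() as the alternative to skipping failed requests. The following is a minimal sketch of that approach, under stated assumptions: the helper `fetch_with_retry` and its parameters are hypothetical names, not part of requests or of the book's code. It takes the fetch function as an argument so that it can be tried without network access. The default `retry_on=(OSError,)` also covers `requests.exceptions.ConnectionError`, because requests derives its exceptions from `IOError`.

```python
import time


def fetch_with_retry(fetch, url, retry_on=(OSError,), retries=3, delay=1.0,
                     sleep=time.sleep):
    """Call fetch(url); on a listed exception, wait and try again.

    The wait grows with each attempt (delay, 2 * delay, ...).
    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt < retries:
                sleep(delay * attempt)
    return None
```

With requests, it could be called as `fetch_with_retry(lambda u: requests.get(u, headers=headers), url)`. The trade-off is the one described in the introduction: fewer skipped pages in exchange for a longer run time.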

You can add time.time() calls to measure how long the program runs:

import time

start_time = time.time()
# program code
end_time = time.time()
print(end_time - start_time)    # elapsed time in seconds
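As a variation on the timing code, the measurement can be wrapped in a small helper. This is only a sketch: `measure` is a hypothetical name, and it uses `time.perf_counter()`, which the standard library provides for measuring short intervals, instead of the `time.time()` used in the post.

```python
import time


def measure(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a simple computation
total, seconds = measure(sum, range(1_000_000))
print('Elapsed: {:.3f} s'.format(seconds))
```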

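The code that writes the image content can be improved with a with statement, which closes the file automatically:

with open(path + item.split('?')[0][-10:], 'wb') as fp:    # 'wb': the image is binary data
    fp.write(data.content)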
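Taking the last 10 characters of the URL keeps only part of the file name. A possible alternative, shown here as a sketch, is to use the last path segment of the URL. The helper names `filename_from_url` and `save_bytes` and the example image URL are assumptions for illustration and do not come from the original code.

```python
import os
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring the query string."""
    return posixpath.basename(urlsplit(url).path)


def save_bytes(content, url, folder):
    """Write content into folder, using the file name taken from url."""
    os.makedirs(folder, exist_ok=True)   # create the folder if it is missing
    target = os.path.join(folder, filename_from_url(url))
    with open(target, 'wb') as fp:
        fp.write(content)
    return target


# Hypothetical image URL in the style of an img src attribute
example = 'https://images.pexels.com/photos/46274/pexels-photo-46274.jpeg?auto=compress&h=350'
print(filename_from_url(example))
```

In the download loop this would be called as `save_bytes(data.content, item, path)`.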
