Crawling Website Images Concurrently

Latest recommended article published on 2019-06-23 15:29:00

dirac1993

Original link: http://www.cnblogs.com/guxh/p/10351655.html

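Images from a website:

Open a topic through "https://photo.fengniao.com/#p=4" (Portraits).

The page shows a few dozen thumbnails, each with a link to its own page; following a thumbnail's link leads to the full-size image.

The goal is to download the full-size images behind the thumbnails. Visiting each full-size link one after another and saving the result serially is very slow, because every request waits for the previous one to finish.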
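To make the cost of the serial approach concrete, here is a minimal sketch that simulates downloads with `time.sleep` instead of real network requests. The `fake_download` function, the 0.2-second delay, and the worker count are assumptions chosen for illustration; none of them come from the original script.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(index, delay=0.2):
    """Stand-in for one image request; sleeps instead of touching the network."""
    time.sleep(delay)
    return index


def run_serial(count):
    """Fetch items one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_download(i) for i in range(count)]
    return results, time.perf_counter() - start


def run_threaded(count, workers=8):
    """Fetch the same items through a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(fake_download, range(count)))
    return results, time.perf_counter() - start


if __name__ == '__main__':
    serial_results, serial_time = run_serial(8)
    threaded_results, threaded_time = run_threaded(8)
    print('serial:   {:.2f}s'.format(serial_time))
    print('threaded: {:.2f}s'.format(threaded_time))
```

In the serial version the waits add up, while in the threaded version they overlap, which is why I/O-bound work such as image downloads tends to benefit from a thread pool.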

1. Fetching images with multiple threads

import os

import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def get_paths(path, regex, code):
    """
    :param path: URL of the page to fetch
    :param regex: XPath expression used to parse the page
    :param code: encoding used to decode the response
    :return: list of values matched by the XPath expression (empty if the request fails)
    """
    resp = requests.get(path)
    if resp.status_code == 200:
        resp.encoding = code  # apply the requested encoding before reading resp.text
        select = etree.HTML(resp.text)
        return select.xpath(regex)
    return []


def save_pic(path, pic_name, directory):
    """
    :param path: URL of the image
    :param pic_name: file name (without extension) to save the image under
    :param directory: directory to save the image in
    :return: None
    """
    resp = requests.get(path)
    if resp.status_code == 200:
        with open(os.path.join(directory, '{}.jpg'.format(pic_name)), 'wb') as f:
            f.write(resp.content)


if __name__ == '__main__':
    paths = get_paths('https://photo.fengniao.com/#p=4', '//a[@class="pic"]/@href', 'utf-8')
    paths = ['https://photo.fengniao.com/' + p for p in paths]

    # Collect the URL of every full-size image
    p = partial(get_paths, regex='//img[@class="picBig"]/@src', code='utf-8')  # freeze the XPath rule and encoding
    with ThreadPoolExecutor() as executor:
        res = executor.map(p, paths)
    big_paths = [i[0] for i in res if i]  # first match per page; skip pages with no match

    # Save the images
    os.makedirs('fn_pics', exist_ok=True)  # open() fails if the directory is missing
    p = partial(save_pic, directory='fn_pics')  # freeze the target directory
    with ThreadPoolExecutor() as executor:
        res = executor.map(p, big_paths, range(len(big_paths)))
    list(res)  # consume the results so exceptions raised in workers surface here
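The script relies on two details that are easy to miss: `functools.partial` fixes keyword arguments in advance, and `executor.map` accepts several iterables and passes one item from each as positional arguments. The sketch below illustrates both without network access. `build_filename`, `plan_downloads`, and the example.com URLs are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def build_filename(path, pic_name, directory):
    """Hypothetical helper: return the file path a save_pic-style function would write to."""
    return '{}/{}.jpg'.format(directory, pic_name)


def plan_downloads(urls, directory):
    """Pair each URL with its index, mirroring the map() call in the script."""
    task = partial(build_filename, directory=directory)  # freeze the directory
    with ThreadPoolExecutor() as executor:
        # map() takes one item from each iterable: (urls[0], 0), (urls[1], 1), ...
        return list(executor.map(task, urls, range(len(urls))))


if __name__ == '__main__':
    # Placeholder URLs, used only to show how arguments are paired.
    urls = ['http://example.com/a.jpg', 'http://example.com/b.jpg']
    print(plan_downloads(urls, 'fn_pics'))
```

Because `map` yields results in input order, each file index matches the position of its URL in the list, regardless of which thread finishes first.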

Reposted from: https://www.cnblogs.com/guxh/p/10351655.html
