Crawling an Entire Wallpaper Site with an Async Crawler

༺ༀ少年ༀ༻

Last modified 2022-10-18 12:17:30


First published 2022-10-18 10:23:43

Article link: https://blog.csdn.net/qq_73804934/article/details/127382870


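This post uses asyncio, which ships with Python, to do the crawling. Its main advantage over an ordinary synchronous crawler is speed. You can also add multithreading or multiprocessing to go even faster, but crawling too aggressively risks getting your IP banned, after which the site refuses access for a while.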
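To make the speed versus ban-risk trade-off concrete, here is a minimal sketch of running many tasks concurrently while capping how many are active at once. It is not part of the original script: `fake_download`, `run_all`, and `MAX_CONCURRENCY` are hypothetical names, and `asyncio.sleep` stands in for a real network request.

```python
import asyncio

MAX_CONCURRENCY = 3  # assumed limit; tune it to what the site tolerates


async def fake_download(i, sem, state):
    # Stand-in for a real download: asyncio.sleep simulates network latency.
    async with sem:  # at most MAX_CONCURRENCY tasks get past this point at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
    return i


async def run_all(n):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    state = {"active": 0, "peak": 0}
    # gather schedules every task and returns the results in input order
    results = await asyncio.gather(
        *(fake_download(i, sem, state) for i in range(n))
    )
    return results, state["peak"]


results, peak = asyncio.run(run_all(10))
print(results, peak)
```

A lower limit is slower but gentler on the server; the right value depends on the site and is an assumption here.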

First, here is the complete code:

import aiofiles
import aiohttp
import asyncio
import requests
from lxml import etree
import urllib.parse


def get_href(url):
    # Fetch one listing page and return the (relative) link of every detail page on it
    print("Fetching hrefs")
    req = requests.get(url)
    req.encoding = "gbk"
    tree = etree.HTML(req.text)
    href = tree.xpath('//ul[@class="clearfix"]/li/a/@href')
    print("Hrefs fetched")
    return href


def get_src(new_href):
    req = requests.get(new_href)
    req.encoding = "gbk"
    tree = etree.HTML(req.text)
    src = tree.xpath('//div[@class="photo-pic"]/a/img/@src')[0]
    return src


async def down_load(url):
    # Download one image asynchronously and save it under its original file name
    print("Saving image")
    name = url.split("/")[-1]
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as req:
            content = await req.content.read()
            async with aiofiles.open(name, mode="wb") as f:
                await f.write(content)
    print("Download finished")


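async def main():
    for i in range(1, 31):  # this section of the site has 120 pages in total; change 31 to 121 to crawl them all
        url = f"https://pic.netbian.com/4kdongman/index_{i}.html"
        href = get_href(url)
        tasks = []
        for item in href:
            print("Building download URL")
            new_href = urllib.parse.urljoin(url, item)
            src = get_src(new_href)
            new_src = urllib.parse.urljoin(url, src)
            print("Download URL built")
            # Schedule the download without waiting for it to finish
            task = asyncio.create_task(down_load(new_src))
            tasks.append(task)
        # Wait once per page, after all downloads are scheduled, so they run
        # concurrently (awaiting inside the loop would run them one at a time)
        await asyncio.gather(*tasks)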


if __name__ == '__main__':
    # asyncio.run creates the event loop, runs main() to completion, and closes the loop
    asyncio.run(main())
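One thing to keep in mind about the script above: `get_href` and `get_src` use `requests`, which blocks the event loop while it waits for a response. One possible way around this, shown here only as a sketch and not as what the script does, is to push blocking calls onto worker threads with `asyncio.to_thread` (Python 3.9+). `blocking_fetch` and `fetch_all` are hypothetical names, and `time.sleep` stands in for a blocking call such as `requests.get`.

```python
import asyncio
import time


def blocking_fetch(n):
    # Hypothetical stand-in for a blocking call such as requests.get
    time.sleep(0.05)
    return n * 2


async def fetch_all(items):
    # Each blocking call runs in its own worker thread, so the calls overlap
    # instead of running back to back on the event loop
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_fetch, n) for n in items)
    )


doubled = asyncio.run(fetch_all([1, 2, 3]))
print(doubled)
```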

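The script starts with the imports. Of these, aiohttp and aiofiles are third-party modules (as are requests and lxml) and need to be installed first.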

# These can be installed from the PyCharm terminal
pip install aiohttp

pip install aiofiles

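These are the third-party libraries used to send requests (aiohttp) and write the downloaded files (aiofiles) asynchronously.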

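Next, the page source is fetched the traditional (synchronous) way, and the href of every image is extracted from it. Each href links to the sub-page that holds the image's download link. On that sub-page we grab the image's src, which is the actual download address. Neither the href nor the src is a complete URL, so both have to be joined with the page URL using the urllib module. Once we have the full download address, aiohttp requests it asynchronously, which is faster. Both href and src are extracted with XPath. If the text comes out garbled, set encoding to correct it; the page source usually declares which encoding it uses.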
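The joining and encoding steps can be tried on their own. In the sketch below, the `href` and `src` values are made-up examples of the relative paths the XPath queries might return, not values taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://pic.netbian.com/4kdongman/index_2.html"

# Hypothetical relative values of the kind the XPath queries return
href = "/tupian/12345.html"
src = "/uploads/example/picture.jpg"

detail_url = urljoin(page_url, href)      # full URL of the sub-page
image_url = urljoin(page_url, src)        # full URL of the image
file_name = image_url.split("/")[-1]      # same naming rule as down_load()

# Decoding GBK bytes with the wrong codec garbles the text;
# decoding them as GBK restores it.
raw = "壁纸".encode("gbk")
garbled = raw.decode("utf-8", errors="replace")
fixed = raw.decode("gbk")

print(detail_url, image_url, file_name, fixed)
```

`urljoin` resolves a path that starts with `/` against the site root, which is why the page's own directory (`4kdongman`) does not appear in the joined results.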
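The URL being crawled can be changed, but it must belong to the same website, since the XPath expressions are specific to that site's page layout.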
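If you do swap the URL, a small helper can build the page addresses and check that they stay on the same site. This is only a sketch: `page_url` and `is_same_site` are hypothetical helpers, and the pattern simply mirrors the one used in `main()`. Any section name other than `4kdongman` is an assumption you would need to check against the site.

```python
from urllib.parse import urlparse

SITE = "pic.netbian.com"


def page_url(section, page):
    # Same URL pattern as in main(); `section` is the path segment, e.g. "4kdongman"
    return f"https://{SITE}/{section}/index_{page}.html"


def is_same_site(url):
    # True only when the URL's host is the site the XPath expressions were written for
    return urlparse(url).netloc == SITE


print(page_url("4kdongman", 2), is_same_site(page_url("4kdongman", 2)))
```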