A quick comparison of crawl speed: synchronous | multithreaded | coroutine

Latest recommended article published on 2023-10-12 16:51:43

爱吃香菜


Article link: https://blog.csdn.net/wx17343624830/article/details/129246622


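This post does one thing only: a quick comparison of crawler speeds.

Results first. Each figure below is the best result over several runs. The time of day does affect the speed, so treat these numbers as a rough reference only.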

"""
Plain function (synchronous): total time 3.330171585083008 s
Thread pool: total time 1.6058530807495117 s
Multithreading: total time 1.8512330055236816 s
Coroutines (async): total time 1.091230869293213 s
Thread pool + coroutines: total time 0.8936080932617188 s
"""
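The figures above are described as the best of several runs. A minimal sketch of how such a measurement could be taken is shown here; `best_of` and `fake_download` are hypothetical names made up for this illustration, and the sleep stands in for a real crawl so the snippet runs without network access.

```python
import time


def best_of(func, repeats=3):
    """Run func() several times and return the shortest elapsed time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def fake_download():
    # Assumption: a stand-in for one crawl run; it sleeps instead of using the network.
    time.sleep(0.01)


if __name__ == '__main__':
    print("Best of 3: {:.4f} s".format(best_of(fake_download)))
```

Taking the minimum rather than the mean reduces the influence of one-off network stalls, which is probably why the best run is the one reported.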

Plain function

That is, a synchronous crawler.

import os
import time

import requests

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

SAVE_DIR = "爬fabiaoqing网图"  # output folder, created on first run

def run():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    os.makedirs(SAVE_DIR, exist_ok=True)
    # Download the images one after another; each request blocks until it finishes.
    for Url in List_Url:
        response = requests.get(Url, headers=headers)
        # Four characters near the end of the URL serve as the file name.
        with open(os.path.join(SAVE_DIR, Url[-8:-4] + '.jpg'), 'wb') as w:
            w.write(response.content)

if __name__ == '__main__':
    s = time.time()
    run()
    print("Total time {} s".format(time.time() - s))

Nothing much to see here, so on to the next one.

Thread pool and multithreaded crawler

import os
import time
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

SAVE_DIR = "爬fabiaoqing网图"  # output folder, created on first run

def run(Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    response = requests.get(Url, headers=headers)
    with open(os.path.join(SAVE_DIR, Url[-8:-4] + '.jpg'), 'wb') as w:
        w.write(response.content)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    s = time.time()

    # Variant 1: thread pool. map() hands one URL to each worker;
    # list() consumes the results so that worker exceptions surface.
    with ThreadPoolExecutor(max_workers=5) as pool:
        # for Url in List_Url:
        #     pool.submit(run, Url)
        list(pool.map(run, List_Url))

    # Variant 2: plain threads. Comment out the pool above and
    # uncomment the lines below to time this variant instead.
    # threads = []
    # for url in List_Url:
    #     thread = threading.Thread(target=run, args=(url,))
    #     thread.start()
    #     threads.append(thread)
    # for thread in threads:
    #     thread.join()
    print("Total time {} s".format(time.time() - s))

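Generally speaking, these two are the approaches crawlers use most often. The difference in efficiency between them is small; the result depends mainly on how quickly the server responds at the time.

What is worth noting is the difference in how they are used.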
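To make the usage difference concrete, here is a small offline sketch that places the three styles side by side. The `fetch` function and the `names` list are hypothetical stand-ins for a real download, so the snippet returns values instead of writing image files.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(name):
    # Assumption: a stand-in for a download; returns a value instead of saving a file.
    return name.upper()


names = ['a', 'b', 'c']


def with_map():
    # map() yields results in input order and re-raises a worker's
    # exception when that result is consumed.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(fetch, names))


def with_submit():
    # submit() returns one Future per call; as_completed() yields them
    # in completion order, so the results are sorted here for a stable output.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(fetch, n) for n in names]
        return sorted(f.result() for f in as_completed(futures))


def with_threads():
    # threading.Thread gives no return value, so results go into a shared list.
    results = []
    lock = threading.Lock()

    def worker(n):
        value = fetch(n)
        with lock:
            results.append(value)

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == '__main__':
    print(with_map(), with_submit(), with_threads())
```

The pool variants cap the number of threads with `max_workers` and hand results back through the executor, while the plain `threading.Thread` variant starts one thread per item and leaves result collection to the caller.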

Coroutines

import asyncio
import os
import time

import aiohttp

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

SAVE_DIR = "爬fabiaoqing网图"  # output folder, created on first run

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    async with session.get(Url, headers=headers) as response:
        # Await the whole body; other tasks run while this one waits.
        content = await response.read()
    with open(os.path.join(SAVE_DIR, Url[-8:-4] + '.jpg'), 'wb') as w:
        w.write(content)

async def main():
    # One shared session, one task per URL, so the downloads overlap.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        # gather() waits for every task and re-raises the first exception.
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    s = time.time()
    # asyncio.run() creates and closes the event loop for us.
    asyncio.run(main())

    print("Total time {} s".format(time.time() - s))

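If coroutines are unfamiliar, review the basics first and then come back. Running the downloads as concurrent coroutines gives a fairly noticeable speedup here, and when the network is stable it is faster than multithreading.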
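Where the speedup comes from can be seen in a small offline sketch. It assumes a hypothetical `fake_fetch` coroutine that replaces the real HTTP request with `asyncio.sleep`, and compares awaiting the requests one by one against launching them together with `asyncio.gather`.

```python
import asyncio
import time


async def fake_fetch(i, delay=0.1):
    # Assumption: a stand-in for session.get(); awaiting hands control back to the event loop.
    await asyncio.sleep(delay)
    return i


async def sequential(n):
    # Each await finishes before the next request starts.
    return [await fake_fetch(i) for i in range(n)]


async def concurrent(n):
    # All coroutines are scheduled at once; their waits overlap.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro_func, n):
    start = time.perf_counter()
    result = asyncio.run(coro_func(n))
    return result, time.perf_counter() - start


if __name__ == '__main__':
    for func in (sequential, concurrent):
        result, elapsed = timed(func, 5)
        print("{}: {} in {:.2f} s".format(func.__name__, result, elapsed))
```

With five simulated requests the sequential version waits for each delay in turn, while the concurrent version waits for roughly one delay in total, all on a single thread.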

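Thread pool + coroutines

A small note: nothing here was verified, this is only a quick test. Corrections are welcome if anything is wrong. Keep in mind that a single event loop already runs all five downloads concurrently, so placing that loop inside a thread pool adds no extra parallelism; a small difference from the plain coroutine timing is most likely network noise.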

import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor

import aiohttp

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

SAVE_DIR = "爬fabiaoqing网图"  # output folder, created on first run

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    async with session.get(Url, headers=headers) as response:
        content = await response.read()
    with open(os.path.join(SAVE_DIR, Url[-8:-4] + '.jpg'), 'wb') as w:
        w.write(content)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    s = time.time()

    # submit() takes a callable plus its arguments, so the event loop runs
    # inside a pool thread. Writing pool.submit(loop.run_until_complete(main()))
    # would run main() on the main thread and submit only its return value.
    with ThreadPoolExecutor(max_workers=5) as pool:
        future = pool.submit(asyncio.run, main())
        future.result()  # wait for completion and re-raise any exception

    print("Total time {} s".format(time.time() - s))
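For comparison, a more common way to combine a thread pool with coroutines is to keep the event loop on the main thread and push blocking calls into the pool with `loop.run_in_executor`. This is a different technique from the listing above, shown only as a sketch: `blocking_fetch` is a hypothetical stand-in for a blocking call such as `requests.get`, simulated with `time.sleep`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(i):
    # Assumption: a stand-in for a blocking download; sleeps instead of using the network.
    time.sleep(0.1)
    return i * 2


async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=5) as pool:
        # Each blocking call runs in a pool thread; the event loop awaits the wrapped futures.
        futures = [loop.run_in_executor(pool, blocking_fetch, i) for i in range(5)]
        return await asyncio.gather(*futures)


if __name__ == '__main__':
    s = time.perf_counter()
    print(asyncio.run(main()))
    print("Total time {:.2f} s".format(time.perf_counter() - s))
```

This pattern is mainly useful when part of the work has no async library available; for pure HTTP downloads, the aiohttp coroutines alone already overlap the waiting.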