Recently I needed to write a crawler for a site, so I tried implementing it with aiohttp plus asyncio, drawing on some related material I found online.
Round 1: the "concurrent" async code surprisingly runs just like synchronous code
The code is as follows:
import asyncio
import random

import aiohttp
from lxml import etree

# FBASE_URL and parse() are defined elsewhere in the crawler and omitted here.

async def fetch_get(session, url):
    # Random delay before each request; asyncio.sleep must be awaited,
    # otherwise the coroutine is created and discarded without ever pausing.
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        for url in urls:
            # Awaiting inside the loop serializes the requests.
            await result_get(session, url)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
After running it, the URLs are still fetched one by one; there is no concurrency at all. The reason is that each result_get call is awaited inside the for loop, so the next request cannot start until the previous one has finished.
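To see why this is slow, here is a minimal, self-contained sketch (no network; asyncio.sleep stands in for a request): awaiting each coroutine inside the loop makes the total time the sum of the individual delays rather than roughly the longest one.

import asyncio
import time

async def fake_fetch(i):
    await asyncio.sleep(1)  # stand-in for one network request
    return i

async def main():
    start = time.monotonic()
    results = []
    for i in range(3):
        results.append(await fake_fetch(i))  # each await blocks until the previous call finishes
    print(results, round(time.monotonic() - start, 1))  # roughly 3 seconds, not 1

loop = asyncio.get_event_loop()
loop.run_until_complete(main())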
Round 2: add all the tasks first, then run them concurrently
async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        tasks = []
        for url in urls:
            # Schedule every request up front without awaiting it yet.
            task = asyncio.ensure_future(result_get(session, url))
            tasks.append(task)
        # All scheduled tasks now run concurrently.
        await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
In this mode, if urls contains many entries, every task is scheduled immediately, so the crawler keeps firing off requests with no limit on how many are in flight at once.
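If the goal is simply to keep the number of in-flight requests bounded, one alternative sketch (not what this post ends up using) is to wrap each request in an asyncio.Semaphore and gather everything; MAX_CONCURRENCY and the helper names below are illustrative assumptions:

import asyncio
import aiohttp

MAX_CONCURRENCY = 30  # assumed limit, tune for the target site

async def bounded_get(sem, session, url):
    async with sem:  # at most MAX_CONCURRENCY holders at any moment
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(bounded_get(sem, session, u)) for u in urls]
        return await asyncio.gather(*tasks)

All tasks are still created up front, but only MAX_CONCURRENCY of them can be inside the session.get block at the same time.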
Round 3: add tasks in batches and run each batch concurrently
sem = asyncio.Semaphore(30)

async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    async with sem:
        # The semaphore caps the number of requests in flight at 30.
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with sem:
        async with aiohttp.ClientSession() as session:
            shelf_text = await fetch_get(session, FBASE_URL)
            shelf_html = etree.HTML(shelf_text)
            urls = parse(shelf_html)
            tasks = []
            part_tasks = []
            for index, url in enumerate(urls):
                if index % 15 == 0 and index > 0:
                    # Every 15 URLs: wait for the current batch to finish,
                    # pause 240 seconds, then start a fresh batch.
                    await asyncio.wait(part_tasks)
                    await asyncio.sleep(240)
                    part_tasks = []
                task = asyncio.ensure_future(result_get(session, url))
                tasks.append(task)
                part_tasks.append(task)
            await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
This is the mode I am using now: after every 15 URLs have been added, that batch runs asynchronously, the loop then waits 240 seconds before starting the next batch, and the semaphore caps the total number of concurrent connections at 30.
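As a side note, aiohttp can also cap concurrency on its own: TCPConnector takes a limit argument, so the session itself will never open more than that many connections at once. This is only an alternative sketch, not the approach used above:

import asyncio
import aiohttp

async def fetch_all(urls):
    # The connector limits simultaneous connections regardless of how many tasks exist.
    connector = aiohttp.TCPConnector(limit=30)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def get(url):
            async with session.get(url) as response:
                return await response.text(encoding='utf-8')
        return await asyncio.gather(*(get(u) for u in urls))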
Another option is to have one async task collect URLs and store them in a database, while a second async task reads URLs from the database and fetches the results. The result-fetching task can run a while loop that pulls a batch of URLs from the database on each iteration and schedules them as async tasks, until every URL in the database has been processed.
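A rough sketch of that idea, reusing result_get from above and two hypothetical database helpers (db_fetch_pending_urls and db_mark_done), which would need to be implemented against whatever database is actually used:

import asyncio
import aiohttp

BATCH_SIZE = 15  # assumed batch size

async def result_worker():
    async with aiohttp.ClientSession() as session:
        while True:
            urls = await db_fetch_pending_urls(BATCH_SIZE)  # hypothetical: pull a batch of unprocessed URLs
            if not urls:
                break  # nothing left in the database, stop the worker
            tasks = [asyncio.ensure_future(result_get(session, u)) for u in urls]
            await asyncio.wait(tasks)
            await db_mark_done(urls)  # hypothetical: mark this batch as processed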