Python Crawler Performance (the asyncio Module --- High-Performance Crawlers)

From: https://www.cnblogs.com/bravexz/p/7741633.html

Using the asyncio module in crawlers (high-performance crawlers): https://www.cnblogs.com/morgana/p/8495555.html

Python asynchronous programming with asyncio (a million concurrent connections): https://www.cnblogs.com/shenh/p/9090586.html

Understanding Python asynchronous programming in depth (part 1): https://blog.csdn.net/catwan/article/details/84975893

https://mp.weixin.qq.com/s?__biz=MjM5OTA1MDUyMA==&mid=2655439072&idx=3&sn=07ca0046b92998ea216958afa5baff8f

requests + asyncio: https://github.com/wangy8961/python3-concurrency-pics-02

Python high-concurrency module asyncio: https://www.jianshu.com/p/9ea1198beb49

aiohttp official documentation: https://docs.aiohttp.org/en/latest/

Keywords: Python asynchronous programming, asyncio, requests

The performance cost of a crawler lies mainly in waiting on I/O: in single-process, single-threaded mode every URL request blocks until it completes, which slows the whole job down.

Synchronous execution

Example code:

import requests


def fetch_async(url=None):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)

Multi-threaded execution

Example code:

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

Multi-threading + callback execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

Multi-process execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

Multi-processing + callback execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

All of the approaches above improve request throughput, but the drawback of threads and processes is that they sit idle while blocked on I/O, which wastes them. Asynchronous I/O is therefore the preferred choice:

Asynchronous I/O

An introduction to using asynchronous coroutines in Python: https://blog.csdn.net/freeking101/article/details/88119858

Python asynchronous I/O (asyncio) coroutines: https://www.cnblogs.com/ssyfj/p/9219360.html

Because of the GIL (global interpreter lock), Python cannot exploit multiple cores, and its performance has long been criticized. In I/O-bound network programming, however, asynchronous processing can be hundreds or even thousands of times more efficient than synchronous processing, making up for that weakness; the micro web framework japronto, for example, can handle on the order of a million requests per second.

Another strength of Python is its extremely rich collection of third-party libraries, which are very convenient to use. asyncio was added to the standard library in Python 3.4 (it was never backported to Python 2.x; Python 3 is the future), and Python 3.5 added the async/await syntax.

Before learning asyncio, it helps to be clear about the difference between synchronous and asynchronous:

  • Synchronous means the tasks run in order: the first task executes, and if it blocks, execution waits until it finishes before moving on to the second task, and so on.
  • Asynchronous is the opposite: after initiating a task, the caller does not wait for its result but moves straight on to the next task; the result is delivered back via state, notifications, or callbacks.

Call sequence:

  • 1. Adding the async keyword to a function, or decorating it with asyncio.coroutine, turns it into a coroutine function.
  • 2. Every thread has one event loop; calling asyncio.get_event_loop() in the main thread creates it.
  • 3. Bundle the tasks together with asyncio.gather(*args) and pass the result into the event loop.
  • 4. Hand the asynchronous tasks to the loop's run_until_complete method; the event loop then schedules the coroutines and, as its name says, does not return until the tasks have run to completion.

asyncio example 1

# -*- coding: utf-8 -*-

import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
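
The @asyncio.coroutine decorator and yield from syntax used above are the legacy API (deprecated in Python 3.8 and removed in 3.11). A minimal modern equivalent of example 1, using async/await and asyncio.run(), might look like this:

# -*- coding: utf-8 -*-

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)  # non-blocking sleep; the loop can run other coroutines meanwhile
    print('end...func1......')


async def main():
    # run both coroutines concurrently and wait until both have finished
    await asyncio.gather(func1(), func1())


asyncio.run(main())  # creates the event loop, runs main(), then closes the loop (Python 3.7+)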

asyncio example 2

# -*- coding: utf-8 -*-

import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, text)
    writer.close()


tasks = [
    fetch_async('www.cnblogs.com', '/wupeiqi/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

asyncio + aiohttp

Reference: https://www.cnblogs.com/zhanghongfeng/p/8662265.html

Writing a crawler with aiohttp: https://luca-notebook.readthedocs.io/zh_CN/latest/c01/用aiohttp写爬虫.html

aiohttp

  What if we need to issue HTTP requests concurrently? We normally use requests, but requests is a synchronous library; to make asynchronous requests we need aiohttp. Import its ClientSession class (from aiohttp import ClientSession), create a session object first, and then use that session to open pages. A session supports several operations, such as post, get, put and head.

Example:

import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(hello(test_url))

The async def keyword marks the function as asynchronous, and await is placed before operations that must be waited on; response.read() awaits the response body, which is an I/O-heavy operation. The HTTP request itself is issued through the ClientSession class.
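
The examples in this article drive the loop by hand with asyncio.get_event_loop() and run_until_complete(); on Python 3.7+ the same program can be started with asyncio.run(), as in this minimal sketch:

import asyncio
from aiohttp import ClientSession


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            print(await response.read())


# asyncio.run() creates the event loop, runs the coroutine, and closes the loop for you
asyncio.run(hello("https://www.baidu.com/"))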

Asynchronous access to multiple URLs

What if we need to request several URLs? Synchronously, you would just add a for loop around the call. The asynchronous version is not quite that simple: building on the previous example, each hello() call has to be wrapped in an asyncio Future object (a task), and the list of Future objects is then handed to the event loop as the work to run.

import time
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            #            print(response)
            print('Hello World:%s' % time.time())


def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()
    loop.run_until_complete(asyncio.wait(tasks))

Collecting the HTTP responses

The previous section showed how to request different URLs asynchronously, but it only issued the requests. To collect every response into a list and then save it locally or print it, use asyncio.gather(*tasks), which gathers all of the return values, as the example below demonstrates.

import datetime
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            # print(response)
            print(f'Hello World : {datetime.datetime.now().replace(microsecond=0)}')
            return await response.read()


def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)
    result = loop.run_until_complete(asyncio.gather(*tasks))
    print(result)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()

Once the concurrency reaches around 2000 tasks, the program raises ValueError: too many file descriptors in select(). On the surface the error says that the select() call Python uses has a cap on the number of open files; the limit actually comes from the operating system: the default maximum number of open files is 1024 on Linux and 509 on Windows, and exceeding it triggers the error.

There are three ways to deal with this problem:

  • 1. Limit the concurrency (do not queue that many tasks at once, or cap how many run at the same time).
  • 2. Use callbacks.
  • 3. Raise the operating system's limit on open files; the default can be changed in a system configuration file. The exact steps are not covered here, but the sketch right after this list shows one way to inspect and raise the limit from Python.
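
As a supplement to option 3: on Unix systems a process can inspect and raise its own soft open-file limit up to the hard limit with the standard resource module, as in the sketch below; raising the hard limit itself still requires system configuration (ulimit, limits.conf, etc.).

import resource

# query the current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# raise the soft limit for this process up to the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print('new soft limit:', resource.getrlimit(resource.RLIMIT_NOFILE)[0])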

If you would rather not touch the system defaults, I personally recommend limiting the concurrency; with the concurrency capped at 500 the job is processed quickly.

# coding:utf-8
import time, asyncio, aiohttp

test_url = 'https://www.baidu.com/'


async def hello(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                print(f'status:{response.status}')
                return await response.read()


async def run():
    semaphore = asyncio.Semaphore(500)  # cap the concurrency at 500
    # 1000 tasks in total; wrap the coroutines in Tasks, because newer Python
    # versions no longer allow bare coroutines to be passed to asyncio.wait()
    to_get = [asyncio.ensure_future(hello(test_url, semaphore)) for _ in range(1000)]
    await asyncio.wait(to_get)


if __name__ == '__main__':
    # now = lambda :time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()

Example code (legacy @asyncio.coroutine style):

# -*- coding: utf-8 -*-

import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)

    # the request is an I/O-bound operation; yield from suspends here until the response arrives
    # (in real code the session should be created once and closed, e.g. with `async with`)
    # response = yield from aiohttp.request('GET', url)
    response = yield from aiohttp.ClientSession().get(url)
    print(response.status)
    print(url, response)
    # data = yield from response.read()
    return response


tasks = [
    # fetch_async('http://www.google.com/'),
    fetch_async('http://www.chouti.com/')
]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

Two ways to limit coroutine concurrency in Python 3

1. TCPConnector connection pool

import asyncio
import aiohttp

CONCURRENT_REQUESTS = 0


async def aio_http_get(url, session):
    global CONCURRENT_REQUESTS
    async with session.get(url) as response:
        CONCURRENT_REQUESTS += 1
        html = await response.text()
        print(f'[{CONCURRENT_REQUESTS}] : {response.status}')
        return html


def main():
    urls = ['http://www.baidu.com' for _ in range(1000)]
    loop = asyncio.get_event_loop()
    connector = aiohttp.TCPConnector(limit=10)  # cap simultaneous connections; default is 100, limit=0 means unlimited
    session = aiohttp.ClientSession(connector=connector, loop=loop)
    loop.run_until_complete(asyncio.gather(*(aio_http_get(url, session=session) for url in urls)))
    loop.run_until_complete(session.close())  # close the shared session (and its connector) cleanly
    loop.close()


if __name__ == "__main__":
    main()
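
Reusing a single ClientSession for every request is what makes the TCPConnector limit effective: the connector is the session's connection pool, so all requests made through that session compete for the same 10 connections. Opening a new session per request would sidestep the limit and waste connections.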

2. Semaphore

import asyncio
from aiohttp import ClientSession, TCPConnector


async def async_spider(sem, url):
    """异步任务"""
    async with sem:
        print('Getting data on url', url)
        async with ClientSession() as session:
            async with session.get(url) as response:
                html = await response.text()
                return html


def parse_html(task):
    print(f'Status:{task.result()}')
    pass


async def task_manager():
    """Asynchronous task manager."""
    tasks = []
    sem = asyncio.Semaphore(10)  # cap the concurrency

    # (the listing is truncated in the source; a typical continuation follows)
    url_list = ['http://www.baidu.com' for _ in range(100)]
    for url in url_list:
        task = asyncio.ensure_future(async_spider(sem, url))
        task.add_done_callback(parse_html)  # hand each finished response to the callback
        tasks.append(task)
    await asyncio.wait(tasks)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task_manager())
    loop.close()
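
Both approaches throttle the crawler, just at different layers: asyncio.Semaphore limits how many coroutines are inside the `async with sem` block at any moment, while TCPConnector(limit=...) limits how many TCP connections the session will open. The two can also be combined.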