Python Crawler Performance (the asyncio Module --- High-Performance Crawlers)

From: https://www.cnblogs.com/bravexz/p/7741633.html

Using the asyncio module in crawlers (high-performance crawlers): https://www.cnblogs.com/morgana/p/8495555.html

Python asynchronous programming with asyncio (a million concurrent connections): https://www.cnblogs.com/shenh/p/9090586.html

Understanding Python asynchronous programming in depth (part 1): https://blog.csdn.net/catwan/article/details/84975893

https://mp.weixin.qq.com/s?__biz=MjM5OTA1MDUyMA==&mid=2655439072&idx=3&sn=07ca0046b92998ea216958afa5baff8f

requests + asyncio: https://github.com/wangy8961/python3-concurrency-pics-02

Python high-concurrency module asyncio: https://www.jianshu.com/p/9ea1198beb49

aiohttp official documentation: https://docs.aiohttp.org/en/latest/

Keywords: Python asynchronous programming, asyncio, requests

The performance cost of a crawler lies mainly in waiting on I/O: in single-process, single-threaded mode every URL request blocks until it completes, which slows the whole job down.

Synchronous execution

Example code:

import requests


def fetch_async(url=None):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)

Multi-threaded execution

Example code:

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

Multi-threading + callback execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

Multi-process execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

Multi-processing + callback execution

Example code:

# -*- coding: utf-8 -*-

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

All of the approaches above improve request throughput, but the drawback of threads and processes is that they sit idle while blocked on I/O, which wastes them. Asynchronous I/O is therefore the preferred choice:

Asynchronous I/O

An introduction to using asynchronous coroutines in Python: https://blog.csdn.net/freeking101/article/details/88119858

Python asynchronous I/O (asyncio) coroutines: https://www.cnblogs.com/ssyfj/p/9219360.html

Because of the GIL (global interpreter lock), Python cannot exploit multiple cores, and its performance has long been criticized. In I/O-bound network programming, however, asynchronous processing can be hundreds or even thousands of times more efficient than synchronous processing, making up for that weakness; the micro web framework japronto, for example, can handle on the order of a million requests per second.

Another strength of Python is its extremely rich collection of third-party libraries, which are very convenient to use. asyncio was added to the standard library in Python 3.4 (it was never backported to Python 2.x; Python 3 is the future), and Python 3.5 added the async/await syntax.

Before learning asyncio, it helps to be clear about the difference between synchronous and asynchronous:

  • Synchronous means the tasks run in order: the first task executes, and if it blocks, execution waits until it finishes before moving on to the second task, and so on.
  • Asynchronous is the opposite: after initiating a task, the caller does not wait for its result but moves straight on to the next task; the result is delivered back via state, notifications, or callbacks.

Call sequence:

  • 1. Adding the async keyword to a function, or decorating it with asyncio.coroutine, turns it into a coroutine function.
  • 2. Every thread has one event loop; calling asyncio.get_event_loop() in the main thread creates it.
  • 3. Bundle the tasks together with asyncio.gather(*args) and pass the result into the event loop.
  • 4. Hand the asynchronous tasks to the loop's run_until_complete method; the event loop then schedules the coroutines and, as its name says, does not return until the tasks have run to completion.

asyncio example 1

# -*- coding: utf-8 -*-

import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
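
The @asyncio.coroutine decorator and yield from syntax used above are the legacy API (deprecated in Python 3.8 and removed in 3.11). A minimal modern equivalent of example 1, using async/await and asyncio.run(), might look like this:

# -*- coding: utf-8 -*-

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)  # non-blocking sleep; the loop can run other coroutines meanwhile
    print('end...func1......')


async def main():
    # run both coroutines concurrently and wait until both have finished
    await asyncio.gather(func1(), func1())


asyncio.run(main())  # creates the event loop, runs main(), then closes the loop (Python 3.7+)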

asyncio example 2

# -*- coding: utf-8 -*-

import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, text)
    writer.close()


tasks = [
    fetch_async('www.cnblogs.com', '/wupeiqi/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

asyncio + aiohttp

Reference: https://www.cnblogs.com/zhanghongfeng/p/8662265.html

Writing a crawler with aiohttp: https://luca-notebook.readthedocs.io/zh_CN/latest/c01/用aiohttp写爬虫.html

aiohttp

  What if we need to issue HTTP requests concurrently? We normally use requests, but requests is a synchronous library; to make asynchronous requests we need aiohttp. Import its ClientSession class (from aiohttp import ClientSession), create a session object first, and then use that session to open pages. A session supports several operations, such as post, get, put and head.

Example:

import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(hello(test_url))

The async def keyword marks the function as asynchronous, and await is placed before operations that must be waited on; response.read() awaits the response body, which is an I/O-heavy operation. The HTTP request itself is issued through the ClientSession class.
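
The examples in this article drive the loop by hand with asyncio.get_event_loop() and run_until_complete(); on Python 3.7+ the same program can be started with asyncio.run(), as in this minimal sketch:

import asyncio
from aiohttp import ClientSession


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            print(await response.read())


# asyncio.run() creates the event loop, runs the coroutine, and closes the loop for you
asyncio.run(hello("https://www.baidu.com/"))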

Asynchronous access to multiple URLs

What if we need to request several URLs? Synchronously, you would just add a for loop around the call. The asynchronous version is not quite that simple: building on the previous example, each hello() call has to be wrapped in an asyncio Future object (a task), and the list of Future objects is then handed to the event loop as the work to run.

import time
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            #            print(response)
            print('Hello World:%s' % time.time())


def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()
    loop.run_until_complete(asyncio.wait(tasks))

Collecting the HTTP responses

The previous section showed how to request different URLs asynchronously, but it only issued the requests. To collect every response into a list and then save it locally or print it, use asyncio.gather(*tasks), which gathers all of the return values, as the example below demonstrates.

import datetime
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"


async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            # print(response)
            print(f'Hello World : {datetime.datetime.now().replace(microsecond=0)}')
            return await response.read()


def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)
    result = loop.run_until_complete(asyncio.gather(*tasks))
    print(result)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()

Once the concurrency reaches around 2000 tasks, the program raises ValueError: too many file descriptors in select(). On the surface the error says that the select() call Python uses has a cap on the number of open files; the limit actually comes from the operating system: the default maximum number of open files is 1024 on Linux and 509 on Windows, and exceeding it triggers the error.

There are three ways to deal with this problem:

  • 1. Limit the concurrency (do not queue that many tasks at once, or cap how many run at the same time).
  • 2. Use callbacks.
  • 3. Raise the operating system's limit on open files; the default can be changed in a system configuration file. The exact steps are not covered here, but the sketch right after this list shows one way to inspect and raise the limit from Python.
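
As a supplement to option 3: on Unix systems a process can inspect and raise its own soft open-file limit up to the hard limit with the standard resource module, as in the sketch below; raising the hard limit itself still requires system configuration (ulimit, limits.conf, etc.).

import resource

# query the current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# raise the soft limit for this process up to the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print('new soft limit:', resource.getrlimit(resource.RLIMIT_NOFILE)[0])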

If you would rather not touch the system defaults, I personally recommend limiting the concurrency; with the concurrency capped at 500 the job is processed quickly.

# coding:utf-8
import time, asyncio, aiohttp

test_url = 'https://www.baidu.com/'


async def hello(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                print(f'status:{response.status}')
                return await response.read()


async def run():
    semaphore = asyncio.Semaphore(500)  # cap the concurrency at 500
    # 1000 tasks in total; wrap the coroutines in Tasks, because newer Python
    # versions no longer allow bare coroutines to be passed to asyncio.wait()
    to_get = [asyncio.ensure_future(hello(test_url, semaphore)) for _ in range(1000)]
    await asyncio.wait(to_get)


if __name__ == '__main__':
    # now = lambda :time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()

Example code (legacy @asyncio.coroutine style):

# -*- coding: utf-8 -*-

import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)

    # the request is an I/O-bound operation; yield from suspends here until the response arrives
    # (in real code the session should be created once and closed, e.g. with `async with`)
    # response = yield from aiohttp.request('GET', url)
    response = yield from aiohttp.ClientSession().get(url)
    print(response.status)
    print(url, response)
    # data = yield from response.read()
    return response


tasks = [
    # fetch_async('http://www.google.com/'),
    fetch_async('http://www.chouti.com/')
]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

Two ways to limit coroutine concurrency in Python 3

1. TCPConnector connection pool

import asyncio
import aiohttp

CONCURRENT_REQUESTS = 0


async def aio_http_get(url, session):
    global CONCURRENT_REQUESTS
    async with session.get(url) as response:
        CONCURRENT_REQUESTS += 1
        html = await response.text()
        print(f'[{CONCURRENT_REQUESTS}] : {response.status}')
        return html


def main():
    urls = ['http://www.baidu.com' for _ in range(1000)]
    loop = asyncio.get_event_loop()
    connector = aiohttp.TCPConnector(limit=10)  # cap simultaneous connections; default is 100, limit=0 means unlimited
    session = aiohttp.ClientSession(connector=connector, loop=loop)
    loop.run_until_complete(asyncio.gather(*(aio_http_get(url, session=session) for url in urls)))
    loop.run_until_complete(session.close())  # close the shared session (and its connector) cleanly
    loop.close()


if __name__ == "__main__":
    main()
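
Reusing a single ClientSession for every request is what makes the TCPConnector limit effective: the connector is the session's connection pool, so all requests made through that session compete for the same 10 connections. Opening a new session per request would sidestep the limit and waste connections.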

2. Semaphore

import asyncio
from aiohttp import ClientSession, TCPConnector


async def async_spider(sem, url):
    """异步任务"""
    async with sem:
        print('Getting data on url', url)
        async with ClientSession() as session:
            async with session.get(url) as response:
                html = await response.text()
                return html


def parse_html(task):
    print(f'Status:{task.result()}')
    pass


async def task_manager():
    """Asynchronous task manager."""
    tasks = []
    sem = asyncio.Semaphore(10)  # cap the concurrency

    # (the listing is truncated in the source; a typical continuation follows)
    url_list = ['http://www.baidu.com' for _ in range(100)]
    for url in url_list:
        task = asyncio.ensure_future(async_spider(sem, url))
        task.add_done_callback(parse_html)  # hand each finished response to the callback
        tasks.append(task)
    await asyncio.wait(tasks)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task_manager())
    loop.close()
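
Both approaches throttle the crawler, just at different layers: asyncio.Semaphore limits how many coroutines are inside the `async with sem` block at any moment, while TCPConnector(limit=...) limits how many TCP connections the session will open. The two can also be combined.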