From: https://www.cnblogs.com/bravexz/p/7741633.html
Crawler applications of the asyncio module (high-performance crawlers): https://www.cnblogs.com/morgana/p/8495555.html
Python asynchronous programming with asyncio (millions of concurrent connections): https://www.cnblogs.com/shenh/p/9090586.html
Deep dive into Python asynchronous programming (Part 1): https://blog.csdn.net/catwan/article/details/84975893
requests + asyncio: https://github.com/wangy8961/python3-concurrency-pics-02
Python high-concurrency module asyncio: https://www.jianshu.com/p/9ea1198beb49
aiohttp official documentation: https://docs.aiohttp.org/en/latest/
Keywords: Python asynchronous programming, asyncio, requests
When writing a crawler, most of the performance cost lies in I/O requests: in single-process, single-threaded mode, every URL request inevitably blocks and waits, which slows down the whole job.
Synchronous execution
Example code:
import requests

def fetch_async(url=None):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']
for url in url_list:
    fetch_async(url)
Multithreaded execution
Example code:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
Multithreading + callback execution
Example code:
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
Multiprocess execution
Example code:
# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
Multiprocessing + callback execution
Example code:
# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing, however, is that threads and processes sit idle while blocked on I/O, so asynchronous I/O is the preferred choice:
Asynchronous I/O
Introduction to using asynchronous coroutines in Python: https://blog.csdn.net/freeking101/article/details/88119858
Python asynchronous I/O (asyncio) coroutines: https://www.cnblogs.com/ssyfj/p/9219360.html
Because of the GIL (global interpreter lock), Python cannot take advantage of multiple cores, and its performance has long been criticized. In I/O-bound network programming, however, asynchronous processing can be hundreds or even thousands of times more efficient than synchronous processing, which makes up for this shortcoming; the microservice framework japronto, for example, can handle on the order of a million requests per second.
Another advantage of Python is its extremely rich ecosystem of third-party libraries, which are very convenient to use. asyncio was added to the standard library in Python 3.4 (Python 2.x does not have it; after all, Python 3.x is the future), and Python 3.5 added the async/await syntax.
Before learning asyncio, it helps to be clear about the concepts of synchronous vs. asynchronous:
- Synchronous means tasks are completed one at a time: the first task runs, and if it blocks, execution waits until it finishes before moving on to the second task, in strict sequence.
- Asynchronous is the opposite: after initiating a task, the caller does not wait for its result but moves straight on to the next one, and is informed of the outcome later via state, notifications, or callbacks.
Calling steps:
- 1. Adding the async keyword to a function definition (or decorating it with asyncio.coroutine) turns it into an asynchronous function, i.e. a coroutine.
- 2. Each thread has one event loop; calling asyncio.get_event_loop in the main thread creates the event loop.
- 3. Bundle the tasks together with asyncio.gather(*args) and pass them to the event loop as a group.
- 4. Hand the asynchronous tasks to the loop's run_until_complete method; the event loop then schedules the coroutines. As the name suggests, this method does not return until the asynchronous tasks have run to completion. (A minimal sketch of these steps appears right after this list.)
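A minimal sketch of the four steps above using the async/await syntax available since Python 3.5 (the sleep-based tasks are just for illustration; example 1 below shows the equivalent generator-based style):
# A minimal sketch of the four calling steps, assuming Python 3.5+.
import asyncio

async def work(name):                            # step 1: async def makes this a coroutine
    print('start', name)
    await asyncio.sleep(1)                       # yield control to the loop while "blocked"
    print('end', name)

loop = asyncio.get_event_loop()                  # step 2: get/create the event loop
tasks = asyncio.gather(work('a'), work('b'))     # step 3: bundle the tasks together
loop.run_until_complete(tasks)                   # step 4: run until all tasks finish
loop.close()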
asyncio example 1
# -*- coding: utf-8 -*-
import asyncio

@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')

tasks = [func1(), func1()]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
asyncio example 2
# -*- coding: utf-8 -*-
import asyncio

@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)
    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')
    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, text)
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/wupeiqi/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
asyncio + aiohttp
Reference: https://www.cnblogs.com/zhanghongfeng/p/8662265.html
Writing a crawler with aiohttp: https://luca-notebook.readthedocs.io/zh_CN/latest/c01/用aiohttp写爬虫.html
aiohttp
What if you need to make HTTP requests concurrently? Normally you would use requests, but requests is a synchronous library; for asynchronous requests you need aiohttp. It provides a class imported as from aiohttp import ClientSession: first create a session object, then use that session to fetch pages. A session supports several operations, such as post, get, put, head, and so on.
Example:
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(hello(test_url))
First, the async def keyword declares an asynchronous function, and await is placed before operations that need to be waited on; response.read() waits for the response body and is an I/O-heavy operation. The ClientSession class is what actually issues the HTTP request.
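Besides get, the same session object exposes the other HTTP verbs mentioned above. A small sketch (the httpbin.org URLs are just placeholder endpoints chosen for illustration, not part of the original example):
import asyncio
from aiohttp import ClientSession

async def other_verbs():
    async with ClientSession() as session:
        # POST with a JSON body, HEAD for headers only -- the pattern is the same as get
        async with session.post('https://httpbin.org/post', json={'k': 'v'}) as resp:
            print('POST status:', resp.status)
        async with session.head('https://httpbin.org/get') as resp:
            print('HEAD headers:', dict(resp.headers))

asyncio.get_event_loop().run_until_complete(other_verbs())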
Asynchronous access to multiple URLs
What if we need to request multiple URLs? Synchronously, that just means adding a for loop. The asynchronous version is not quite as simple: building on the code above, hello() has to be wrapped into asyncio Future (task) objects, and the list of futures is then passed to the event loop as the set of tasks.
import time
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            # print(response)
            print('Hello World:%s' % time.time())

def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()
    loop.run_until_complete(asyncio.wait(tasks))
Collecting HTTP responses
The example above shows how to fetch different URLs asynchronously, but we only sent the requests. If we want to collect the responses into a list so they can be saved to disk or printed, we can gather them all with asyncio.gather(*tasks), as the following example demonstrates.
import datetime
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            # print(response)
            print(f'Hello World : {datetime.datetime.now().replace(microsecond=0)}')
            return await response.read()

def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)
    result = loop.run_until_complete(asyncio.gather(*tasks))
    print(result)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()
If your concurrency reaches around 2000, the program raises: ValueError: too many file descriptors in select(). On the face of it, the error means the select() call Python uses has a limit on the number of open file descriptors, but this is really an operating-system limit: on Linux the default maximum number of open files is 1024, and on Windows the default is 509; once this value is exceeded the program starts throwing errors.
There are three ways to solve this problem:
- 1. Limit the concurrency (don't queue that many tasks at once, or cap the maximum number of concurrent requests).
- 2. Use callbacks.
- 3. Raise the operating system's limit on open files; there is a system configuration file where the default can be changed (the exact steps are not covered here).
If you don't want to touch the system defaults, I recommend limiting the concurrency; setting it to 500 keeps processing fast.
# coding:utf-8
import time, asyncio, aiohttp

test_url = 'https://www.baidu.com/'

async def hello(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                print(f'status:{response.status}')
                return await response.read()

async def run():
    semaphore = asyncio.Semaphore(500)  # limit concurrency to 500
    to_get = [hello(test_url.format(), semaphore) for _ in range(1000)]  # 1000 tasks in total
    await asyncio.wait(to_get)

if __name__ == '__main__':
    # now = lambda: time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()
Example code:
# -*- coding: utf-8 -*-
import aiohttp
import asyncio

@asyncio.coroutine
def fetch_async(url):
    print(url)
    # the request call is an I/O-blocking operation
    # response = yield from aiohttp.request('GET', url)
    response = yield from aiohttp.ClientSession().get(url)
    print(response.status)
    print(url, response)
    # data = yield from response.read()
    return response

tasks = [
    # fetch_async('http://www.google.com/'),
    fetch_async('http://www.chouti.com/')
]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()
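Note that this generator-based @asyncio.coroutine / yield from style only works with older aiohttp releases; on current versions the same request is written with async/await and a session closed by a context manager. A rough equivalent sketch:
# -*- coding: utf-8 -*-
# Sketch of the same request using the modern async/await aiohttp API;
# the URL is the one used above, and the session is closed automatically.
import aiohttp
import asyncio

async def fetch_async(url):
    print(url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            print(response.status)
            return await response.read()

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(fetch_async('http://www.chouti.com/')))
loop.close()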
Two ways to control coroutine concurrency in Python 3
1. TCPConnector connection pool
import asyncio
import aiohttp

CONCURRENT_REQUESTS = 0

async def aio_http_get(url, session):
    global CONCURRENT_REQUESTS
    async with session.get(url) as response:
        CONCURRENT_REQUESTS += 1
        html = await response.text()
        print(f'[{CONCURRENT_REQUESTS}] : {response.status}')
        return html

def main():
    urls = ['http://www.baidu.com' for _ in range(1000)]
    loop = asyncio.get_event_loop()
    # limit the number of simultaneous connections; the default is 100, limit=0 means unlimited
    connector = aiohttp.TCPConnector(limit=10)
    session = aiohttp.ClientSession(connector=connector, loop=loop)
    loop.run_until_complete(asyncio.gather(*(aio_http_get(url, session=session) for url in urls)))
    loop.run_until_complete(session.close())  # close the session to avoid an "Unclosed client session" warning
    loop.close()

if __name__ == "__main__":
    main()
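A note on the difference between the two methods: TCPConnector(limit=...) caps the number of simultaneous TCP connections at the session level, while a Semaphore caps how many coroutines may enter the request section at once; either way, the number of open sockets (and hence file descriptors) stays bounded.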
2. Semaphore
import asyncio
from aiohttp import ClientSession, TCPConnector

async def async_spider(sem, url):
    """Asynchronous task."""
    async with sem:
        print('Getting data on url', url)
        async with ClientSession() as session:
            async with session.get(url) as response:
                html = await response.text()
                return html

def parse_html(task):
    print(f'Status:{task.result()}')

async def task_manager():
    """Asynchronous task manager."""
    tasks = []
    sem = asyncio.Semaphore(10)  # control the concurrency
url_list = ['http://www.baid