Introduction
asyncio implements protocols such as TCP, UDP, and SSL; aiohttp is an asynchronous HTTP request framework built on top of asyncio. It can be used for the first step of the crawler workflow, replacing the earlier requests calls to speed up requesting; the later parsing and storage steps stay the same.
Chinese documentation: https://www.cntofu.com/book/127/index.html
Installation
pip install aiohttp
Using the Tsinghua mirror is faster:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple aiohttp
Basics
This library is essentially the asynchronous version of requests; its syntax and usage are almost identical.
Session mechanism
Similar to requests' session: first create an aiohttp.ClientSession object, then use that session object to open pages.
import asyncio
import aiohttp
import time
import requests

# aiohttp implementation
async def print_page():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://www.baidu.com") as resp:
            print(resp.status)
            body = await resp.read()
            print("Response body length:", len(body))

startTime = time.time()
loop = asyncio.get_event_loop()
# gather the 100 coroutines into one awaitable and run them concurrently
tasks = asyncio.gather(*[print_page() for i in range(100)])
loop.run_until_complete(tasks)
endTime = time.time()
print("Time for aiohttp to request the Baidu homepage:", endTime - startTime)
The elapsed time is also recorded here, so with a simple requests version of the code you can directly compare aiohttp against requests.
# requests implementation
startTime = time.time()
for i in range(100):
    response = requests.get("http://www.baidu.com")
    print(response.status_code, len(response.content))
endTime = time.time()
print("Time for requests to request the Baidu homepage:", endTime - startTime)
Although the results are affected by many factors, especially the network, aiohttp clearly comes out ahead.
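One practical note when launching this many coroutines at once: it is common to cap concurrency with asyncio.Semaphore so the target server is not overwhelmed. A minimal sketch of the pattern, with asyncio.sleep standing in for a real aiohttp request and an arbitrary limit of 10:

```python
import asyncio

async def fetch(sem, i):
    # at most 10 coroutines execute this body at the same time
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a real aiohttp request
        return i

async def main():
    sem = asyncio.Semaphore(10)  # concurrency cap (arbitrary value)
    # gather preserves argument order, so results come back as 0..99
    return await asyncio.gather(*(fetch(sem, i) for i in range(100)))

results = asyncio.run(main())
print(len(results))  # 100
```

In a real crawler you would replace the sleep with a session.get call inside the `async with sem:` block.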
Adding request headers
This one is simple: pass headers as an option to session.get/post. Note that the headers must be a dict.
import asyncio
import aiohttp

# aiohttp implementation
async def print_page():
    headers = {
        "User-Agent": "my test UA"
    }
    async with aiohttp.ClientSession() as session:
        async with session.get("http://127.0.0.1:5000/", headers=headers) as resp:
            print(resp.status)
            body = await resp.read()
            print("Response body length:", len(body))
            print(body.decode())

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page())
loop.close()
Adding Cookies
import asyncio
import aiohttp

# aiohttp implementation
async def print_page():
    cookies = {
        "keyA": "valueA",
        "keyB": "valueB",
        "keyC": "valueC",
    }
    async with aiohttp.ClientSession() as session:
        async with session.get("http://127.0.0.1:5000/Cookie", cookies=cookies) as resp:
            print(resp.status)
            body = await resp.read()
            print("Response body length:", len(body))
            print(body.decode())

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page())
loop.close()
Adding a proxy
import asyncio
import aiohttp

# aiohttp implementation
async def print_page():
    headers = {
        "User-Agent": "my test UA"
    }
    async with aiohttp.ClientSession() as session:
        # pass the proxy URL via the proxy option of session.get
        async with session.get("http://ip.27399.com/", headers=headers,
                               proxy="http://47.95.249.140:8118") as resp:
            print(resp.status)
            body = await resp.read()
            print("Response body length:", len(body))
            print(body.decode())

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page())
loop.close()
Exercise
Take some of the earlier practice tasks and compare requests against aiohttp for yourself.