Asynchronous crawlers: using async/await and aiohttp, with an example

Latest recommended article published on 2025-03-05 17:00:03

multiangle

Category: python. Tags: python, asynchronous crawler, aiohttp

Article link: https://blog.csdn.net/u014595019/article/details/52295642

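Python 3.5 introduced the async/await keywords, which make asynchronous code more intuitive to write than the earlier callback and generator styles. aiohttp is an asynchronous HTTP library with a server side and a client side; this article mainly uses the client. The article has three parts. The first briefly introduces asynchronous programming in Python 3.5 and the async/await keywords. The second covers the aiohttp client. The third is an example showing how to crawl all the articles of a blog on CSDN.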

1. async/await

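async/await is a feature added in Python 3.5. It frees asynchronous code from the earlier yield-based style and makes it more direct.
Before 3.5, asynchronous code was mainly written as generator-based coroutines using yield from. For example: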

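import asyncio
import threading

# Generator-based coroutine style. asyncio.coroutine was removed in Python 3.11,
# so this example only runs on older interpreters.
@asyncio.coroutine  # decorator, equivalent to hello = asyncio.coroutine(hello)
def hello():
    print('Hello world! (%s)' % threading.current_thread())
    yield from asyncio.sleep(1)  # suspend here so the loop can switch to the next task; resume one second later
    print('Hello again! (%s)' % threading.current_thread())

loop = asyncio.get_event_loop()
tasks = [hello(), hello()]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()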

The example above is already fairly concise, because it uses the asyncio library and the coroutine decorator. With async/await, hello() can be written like this:

async def hello():
    print("Hello world!")
    await asyncio.sleep(1)  # suspend here and let the event loop run other tasks
    print("Hello again!")

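Note that the @asyncio.coroutine decorator is no longer needed. Instead, async is placed before def to mark the function as asynchronous, meaning that it contains asynchronous operations. In addition, await replaces yield from and marks the step as an asynchronous operation.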
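
To see the effect of await in practice, here is a minimal self-contained sketch that runs two copies of a coroutine concurrently. The `name` and `delay` parameters, the `main` function, and the use of `asyncio.gather` and `asyncio.run` (Python 3.7+) are illustrative choices, not part of the original example.

```python
import asyncio
import time


async def hello(name, delay):
    # Runs until the first await, then hands control back to the event loop.
    print("Hello world! (%s)" % name)
    await asyncio.sleep(delay)
    print("Hello again! (%s)" % name)
    return name


async def main(delay=0.2):
    start = time.perf_counter()
    # gather() schedules both coroutines at once and waits for both to finish.
    names = await asyncio.gather(hello("A", delay), hello("B", delay))
    elapsed = time.perf_counter() - start
    return names, elapsed


if __name__ == "__main__":
    names, elapsed = asyncio.run(main())
    print(names, "finished in %.2fs" % elapsed)
```

Because the two sleeps overlap, the total running time should be close to one delay rather than two.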

2. aiohttp

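aiohttp is a library for asynchronous HTTP. Most of the material available online, including the tutorials on Liao Xuefeng's site, is about the server side, and there is little about the client. The aiohttp documentation itself was not very complete at the time of writing, with only a few examples on the official site, so I simply followed those examples.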

2.1 Basic usage

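# aiohttp.get() no longer exists in aiohttp 3.x; requests go through a ClientSession
async with aiohttp.ClientSession() as session:
    async with session.get('https://github.com') as r:
        await r.text()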

r.text() accepts an encoding argument that specifies how the response body is decoded, for example:

await resp.text(encoding='windows-1251')

Alternatively, skip decoding and read the raw bytes, which suits images and other content that cannot be decoded as text:

await resp.read()
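
The difference between text() and read() is essentially the difference between str and bytes. The sketch below illustrates it with plain Python bytes rather than a real response: the `body` value is made-up sample data, and the decode step only approximates what text(encoding=...) does with the response body.

```python
# Sample payload: Russian text encoded as windows-1251, as a server might send it.
body = "Привет".encode("windows-1251")

# read() returns bytes like these unchanged, which suits images and other binary data.
raw = body

# text(encoding="windows-1251") additionally decodes the bytes into a str.
text = body.decode("windows-1251")

# Decoding with the wrong codec does not give back the original text.
wrong = body.decode("utf-8", errors="replace")

print(type(raw).__name__, len(raw))
print(text)
print(wrong == text)
```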

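Note the with ... as syntax used in the request above (in its async with form). It is outside the scope of this article, and readers can look it up on their own.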
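
For readers who have not met it before, async with is the asynchronous counterpart of with: it awaits the object's `__aenter__` method on entry and its `__aexit__` method on exit. The sketch below uses a hypothetical `FakeResponse` class as a stand-in for a response object; it is not part of aiohttp and performs no network access.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a response object; not part of aiohttp."""

    def __init__(self, log):
        self.log = log

    async def __aenter__(self):
        # Awaited on entry; a real client would obtain the connection here.
        self.log.append("enter")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Awaited on exit, even if the body raised; resources are released here.
        self.log.append("exit")
        return False  # do not swallow exceptions

    async def text(self):
        await asyncio.sleep(0)  # pretend to wait for the body
        self.log.append("text")
        return "<html></html>"


async def demo():
    log = []
    async with FakeResponse(log) as r:
        body = await r.text()
    return body, log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```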

2.2 Setting a timeout

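Pass an aiohttp.ClientTimeout to the session. (Older aiohttp releases wrapped the request in a with aiohttp.Timeout(x) block, which current releases no longer provide.)

timeout = aiohttp.ClientTimeout(total=0.001)  # total time budget in seconds; deliberately tiny here
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.get('https://github.com') as r:
        await r.text()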
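
Independently of aiohttp's own timeout options, the standard library can bound any awaitable with asyncio.wait_for. The sketch below assumes a hypothetical `slow_fetch` coroutine that simulates a slow request with asyncio.sleep, so it needs no network access; `fetch_with_timeout` and its return convention are likewise illustrative.

```python
import asyncio


async def slow_fetch(delay):
    # Hypothetical stand-in for a slow HTTP request.
    await asyncio.sleep(delay)
    return "body"


async def fetch_with_timeout(delay, timeout):
    try:
        # wait_for cancels slow_fetch if it does not finish within `timeout` seconds.
        return await asyncio.wait_for(slow_fetch(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # the caller decides whether to retry or skip the URL


async def main():
    fast = await fetch_with_timeout(delay=0.01, timeout=1.0)
    slow = await fetch_with_timeout(delay=1.0, timeout=0.05)
    return fast, slow


if __name__ == "__main__":
    print(asyncio.run(main()))
```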

2.3 Fetching data with a session

This introduces a class, aiohttp.ClientSession. First create a session object, then use it to open pages. A session supports many operations, such as post, get, put and head, as shown below:

async with aiohttp.ClientSession() as session:
    async with session.get('https://api.github.com/events') as resp:
        print(resp.status)
        print(await resp.text())

To use the post method, change the corresponding statement to:

session.post('http://httpbin.org/post', data=b'data')

2.4 Custom headers

This is fairly simple: pass headers as an option to session.get/post. Note that the headers must be given as a dict.

url = 'https://api.github.com/some/endpoint'
headers = {'content-type': 'application/json'}

async with session.get(url, headers=headers) as resp:
    print(resp.status)

The same approach works for post and the other methods as well.

2.5 Using a proxy

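Pass the proxy URL through the proxy argument of the request. (Older aiohttp releases configured it on an aiohttp.ProxyConnector handed to the session; current releases no longer provide that class.)

session = aiohttp.ClientSession()
async with session.get('http://python.org', proxy="http://some.proxy.com") as resp:
    print(resp.status)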

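The session here is not created in a with ... as ... block, but the principle is the same and the code is easy to rewrite in the earlier form (a session created this way should be closed with await session.close() when it is no longer needed).
If the proxy requires authentication, add the proxy_auth option as well.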

session = aiohttp.ClientSession()
async with session.get(
    'http://python.org',
    proxy="http://some.proxy.com",
    proxy_auth=aiohttp.BasicAuth('user', 'pass'),
) as r:
    assert r.status == 200

2.6 Custom cookies

Again, this is configured on the session.

url = 'http://httpbin.org/cookies'
# cookies must be passed by keyword
async with aiohttp.ClientSession(cookies={'cookies_are': 'working'}) as session:
    async with session.get(url) as resp:
        assert await resp.json() == {"cookies":
                                         {"cookies_are": "working"}}
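
On the wire, cookies travel in a single Cookie request header, which is what the httpbin endpoint above echoes back. The following standard-library sketch illustrates that round trip. The helper names `to_cookie_header` and `parse_cookie_header` and the extra `lang` cookie are hypothetical, and the formatting only covers simple values that need no quoting; it is not how aiohttp builds the header internally.

```python
from http.cookies import SimpleCookie


def to_cookie_header(cookies):
    # Join name=value pairs the way a Cookie request header lists them.
    return "; ".join("%s=%s" % (k, v) for k, v in cookies.items())


def parse_cookie_header(header):
    # Roughly what a server does on receipt: parse the header back into a dict.
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


header = to_cookie_header({"cookies_are": "working", "lang": "en"})
print(header)
print(parse_cookie_header(header))
```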

3. Example

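After going through the features in part 2, writing an example is straightforward, since it only combines those features. The simple crawler below fetches all the articles on my blog.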

from bs4 import BeautifulSoup as bs
import asyncio
import aiohttp

async def getPage(url,res_list):
    """Fetch url and append the response body to res_list."""
    print(url)
    headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    # To go through a proxy, pass proxy="http://127.0.0.1:8087" to session.get()
    async with aiohttp.ClientSession() as session:
        async with session.get(url,headers=headers) as resp:
            assert resp.status==200
            res_list.append(await resp.text())


class parseListPage():
    def __init__(self,page_str):
        self.page_str = page_str
    def __enter__(self):
        page_str = self.page_str
        page = bs(page_str,'lxml')
        # collect the article links
        articles = page.find_all('div',attrs={'class':'article_title'})
        art_urls = []
        for a in articles:
            x = a.find('a')['href']
            art_urls.append('http://blog.csdn.net'+x)
        return art_urls
    def __exit__(self, exc_type, exc_val, exc_tb):
        pass


page_num = 5
page_url_base = 'http://blog.csdn.net/u014595019/article/list/'
page_urls = [page_url_base + str(i+1) for i in range(page_num)]

async def main():
    # Stage 1: download the list pages concurrently
    ret_list = []
    await asyncio.gather(*[getPage(host, ret_list) for host in page_urls])

    # Extract the article URLs from every list page
    articles_url = []
    for ret in ret_list:
        with parseListPage(ret) as tmp:
            articles_url += tmp

    # Stage 2: download every article concurrently
    ret_list = []
    await asyncio.gather(*[getPage(url, ret_list) for url in articles_url])
    return ret_list

asyncio.run(main())  # requires Python 3.7+
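
Launching one task per URL can open a large number of connections at once. A common refinement, which the crawler above does not use, is to cap concurrency with asyncio.Semaphore and to collect pages from return values instead of a shared list. In the sketch below `fake_fetch` is a hypothetical stand-in that simulates a download with asyncio.sleep; in a real crawler its body would be the session.get call. The `limit` value, the `stats` bookkeeping and the example URLs are assumptions made for illustration.

```python
import asyncio


async def fake_fetch(url, sem, stats):
    # Hypothetical stand-in for getPage(); replace the sleep with a real request.
    async with sem:  # at most `limit` coroutines are inside this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return "<html>%s</html>" % url


async def crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # gather() returns the results in the same order as `urls`.
    pages = await asyncio.gather(*(fake_fetch(u, sem, stats) for u in urls))
    return pages, stats["peak"]


if __name__ == "__main__":
    urls = ["http://example.invalid/page/%d" % i for i in range(10)]
    pages, peak = asyncio.run(crawl(urls))
    print(len(pages), "pages, peak concurrency", peak)
```

Returning the page from each coroutine also keeps results aligned with their URLs, which the shared-list approach does not guarantee.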