A Web Crawler Using Async I/O and Coroutines

Original article: http://www.aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html

A Web Crawler With asyncio Coroutines

A. Jesse Jiryu Davis and Guido van Rossum

A. Jesse Jiryu Davis is a staff engineer at MongoDB in New York. He wrote Motor, the async MongoDB Python driver, and he is the lead developer of the MongoDB C Driver and a member of the PyMongo team. He contributes to asyncio and Tornado. He writes at http://emptysqua.re.


Guido van Rossum is the creator of Python, one of the major programming languages on and off the web. The Python community refers to him as the BDFL (Benevolent Dictator For Life), a title straight from a Monty Python skit. Guido's home on the web is Guido's Personal Home Page.


Introduction

Classical computer science emphasizes efficient algorithms that complete computations as quickly as possible. But many networked programs spend their time not computing, but holding open many connections that are slow, or have infrequent events. These programs present a very different challenge: to wait for a huge number of network events efficiently. A contemporary approach to this problem is asynchronous I/O, or "async".


This chapter presents a simple web crawler. The crawler is an archetypal async application because it waits for many responses, but does little computation. The more pages it can fetch at once, the sooner it completes. If it devotes a thread to each in-flight request, then as the number of concurrent requests rises it will run out of memory or other thread-related resource before it runs out of sockets. It avoids the need for threads by using asynchronous I/O.


We present the example in three stages. First, we show an async event loop and sketch a crawler that uses the event loop with callbacks: it is very efficient, but extending it to more complex problems would lead to unmanageable spaghetti code. Second, therefore, we show that Python coroutines are both efficient and extensible. We implement simple coroutines in Python using generator functions. In the third stage, we use the full-featured coroutines from Python's standard "asyncio" library, and coordinate them using an async queue.


The Task

A web crawler finds and downloads all pages on a website, perhaps to archive or index them. Beginning with a root URL, it fetches each page, parses it for links to unseen pages, and adds these to a queue. It stops when it fetches a page with no unseen links and the queue is empty.

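As a rough illustration of this loop, here is a minimal synchronous sketch. The fetch_page and parse_links helpers are hypothetical placeholders rather than code from the article, which builds its own versions later:

# Minimal synchronous sketch of the crawl loop described above.
# fetch_page and parse_links are hypothetical stand-ins for real
# HTTP and HTML handling.
def crawl(root_url, fetch_page, parse_links):
    seen = {root_url}           # URLs already queued or fetched
    queue = [root_url]          # URLs waiting to be fetched
    while queue:
        url = queue.pop()
        body = fetch_page(url)              # download one page
        for link in parse_links(body):
            if link not in seen:            # only enqueue unseen links
                seen.add(link)
                queue.append(link)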

We can hasten this process by downloading many pages concurrently. As the crawler finds new links, it launches simultaneous fetch operations for the new pages on separate sockets. It parses responses as they arrive, adding new links to the queue. There may come some point of diminishing returns where too much concurrency degrades performance, so we cap the number of concurrent requests, and leave the remaining links in the queue until some in-flight requests complete.


The Traditional Approach

How do we make the crawler concurrent? Traditionally we would create a thread pool. Each thread would be in charge of downloading one page at a time over a socket. For example, to download a page from xkcd.com:


import socket

def fetch(url):
    # Open a TCP connection and send a plain HTTP/1.0 request.
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))

    # Read the response until the server closes the connection.
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)

    # Page is now downloaded.
    links = parse_links(response)  # parse_links and q are assumed to be defined elsewhere.
    q.add(links)
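
Driving fetch from a thread pool, as the traditional approach suggests, might look like the sketch below. This is an illustration rather than code from the article: ThreadPoolExecutor's max_workers plays the role of the concurrency cap discussed earlier, and urls_to_fetch is a hypothetical iterable of pending URLs.

from concurrent.futures import ThreadPoolExecutor

# Sketch only: one thread per in-flight request, capped by max_workers.
# urls_to_fetch is a hypothetical collection of URLs waiting to be fetched;
# fetch is the blocking function defined above.
with ThreadPoolExecutor(max_workers=10) as pool:
    for url in urls_to_fetch:
        pool.submit(fetch, url)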

To be continued.
