pyspider crawler study - documentation translation - Architecture.md

Architecture
============
This document describes the reason why I made pyspider and its architecture.

Why?
---
Two years ago, I was working on a vertical search engine. We were facing the following crawling needs:

1. Collect 100-200 websites. They may go on/offline or change their templates at any time.

   > We need a really powerful monitor to find out which website is changing, and a good tool to help us write the script/template for each website.

2. Data should be collected within 5 minutes of a website update.

   > We solved this problem by checking the index page frequently, and using something like 'last update time' or 'last reply time' to determine which page has changed. In addition, we recheck pages after X days to prevent omissions.

   > **pyspider will never stop, as the WWW is changing all the time.**

Furthermore, we have some APIs from our partners. These APIs may need POST, proxies, request signatures, etc. Full control from the script is more convenient than some global parameters of components.

Overview
--------
The following diagram shows an overview of the pyspider architecture, with its components and an outline of the data flow that takes place inside the system.

![pyspider](imgs/pyspider-arch.png)

Components are connected by a message queue. Every component, including the message queue, runs in its own process/thread and is replaceable. That means that when processing is slow, you can run many instances of the Processor and make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast ([benchmarking](https://gist.github.com/binux/67b276c51e988f8e2c31#comment-1339242)).
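As a rough illustration of this decoupling, each component can be started as a separate process with pyspider's command-line subcommands (a sketch; queue and database connections are configured separately, and exact options depend on your version, see `pyspider --help`):

```
pyspider scheduler       # only one scheduler instance is allowed
pyspider fetcher         # can be scaled to several instances
pyspider processor       # can be scaled to several instances
pyspider result_worker
pyspider webui
```

For local development, `pyspider all` simply runs every component inside a single process.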

Components
----------

### Scheduler
The Scheduler receives tasks from the newtask_queue sent by the Processor. It decides whether a task is new or requires a re-crawl, sorts tasks according to priority, and feeds them to the Fetcher with traffic control ([token bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm). It also takes care of periodic tasks, lost tasks and failed tasks, and retries them later.

All of the above can be set via the `self.crawl` [API](apis/).
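For example, scheduling-related options such as priority, retry count and re-crawl age can be attached to each request from the script. A minimal sketch using documented `self.crawl` parameters (the URL, callback and values are placeholders):

```python
def on_start(self):
    self.crawl('http://example.com/feed',  # placeholder URL
               callback=self.index_page,
               priority=9,                 # higher-priority tasks are dispatched earlier
               retries=3,                  # retry a failed task up to 3 times
               age=10 * 60)                # treat the page as unchanged for 10 minutes
```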

Note that in the current implementation of the Scheduler, only one scheduler instance is allowed.

### Fetcher
The Fetcher is responsible for fetching web pages and then sending the results to the Processor. For flexibility, the Fetcher supports [Data URI](http://en.wikipedia.org/wiki/Data_URI_scheme) and pages rendered by JavaScript (via [phantomjs](http://phantomjs.org/)). The fetch method, headers, cookies, proxy, etag, etc. can be controlled by the script via the [API](apis/self.crawl/#fetch).
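For instance, fetch behaviour can be tuned per request from the script. A sketch using documented `self.crawl` fetch parameters (the URL, callback and values are placeholders):

```python
def on_start(self):
    self.crawl('http://example.com/api',                 # placeholder URL
               callback=self.parse_api,
               method='POST',
               data={'q': 'pyspider'},                   # form data sent in the POST body
               headers={'X-Requested-With': 'XMLHttpRequest'},
               cookies={'session': 'placeholder'},
               proxy='localhost:8118',                   # username:password@host:port also works
               etag=False)                               # disable etag-based change detection here
```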

### Phantomjs Fetcher 
The Phantomjs Fetcher works like a proxy. It is connected to the general Fetcher, fetches and renders pages with JavaScript enabled, and outputs plain HTML back to the Fetcher:

```
scheduler -> fetcher -> processor
                |
            phantomjs
                |
             internet
```
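From the script side, JavaScript rendering is requested per task. A minimal sketch (the URL is a placeholder, and phantomjs must be running next to the Fetcher):

```python
def on_start(self):
    # js_script is an optional snippet run inside the page, e.g. to trigger lazy loading
    self.crawl('http://example.com/spa',   # placeholder URL of a JavaScript-rendered page
               callback=self.index_page,
               fetch_type='js',            # route this request through the phantomjs fetcher
               js_script='function() { window.scrollTo(0, document.body.scrollHeight); }')
```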

### Processor
The Processor is responsible for running the script written by users to parse and extract information. Your script runs in an unlimited environment. Although we provide various tools (like [PyQuery](https://pythonhosted.org/pyquery/)) for you to extract information and links, you can use anything you want to deal with the response. You may refer to [Script Environment](Script-Environment) and the [API Reference](apis/) for more information about scripts.

The Processor will capture exceptions and logs, send status (task track) and new tasks to the `Scheduler`, and send results to the `Result Worker`.
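A minimal handler sketch, in the style of the default project template, shows what such a script looks like (the start URL is a placeholder; `response.doc` is the PyQuery-wrapped document):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)                 # re-run the entry point once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)          # treat index pages as fresh for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is the result that is passed on to the Result Worker
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```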

### Result Worker (optional)
The Result Worker receives results from the `Processor`. pyspider has a built-in result worker that saves results to `resultdb`. Overwrite it to deal with results according to your own needs.
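A rough sketch of such an override, assuming the hook is `on_result(self, task, result)` as in the built-in worker (check your pyspider version; the output path below is a placeholder):

```python
import json

from pyspider.result import ResultWorker


class JsonLineResultWorker(ResultWorker):
    """Example override: append every result as a JSON line instead of writing to resultdb."""

    def on_result(self, task, result):
        if not result:
            return
        with open('/tmp/pyspider_results.jsonl', 'a') as f:  # placeholder output path
            f.write(json.dumps({
                'taskid': task.get('taskid'),
                'url': task.get('url'),
                'result': result,
            }) + '\n')
```

Such a class can then be used in place of the default worker when launching the result_worker component (see `pyspider result_worker --help` for the exact option in your version).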

### WebUI
The WebUI is a web frontend for everything. It contains:

* script editor, debugger
* project manager
* task monitor
* result viewer, exporter

Maybe the WebUI is the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider does, start or stop a project, find out which project is going wrong and which request has failed, and try it again with the debugger.

Data flow
---------
The data flow in pyspider is just as shown in the diagram above:

1. Each script has a callback named `on_start`. When you press the `Run` button on the WebUI, a new `on_start` task is submitted to the Scheduler as the entry point of the project.
2. The Scheduler dispatches this `on_start` task with a Data URI as a normal task to the Fetcher.
3. The Fetcher makes a request and a response for it (for a Data URI it's a fake request and response, but there is no difference from other normal tasks), then feeds it to the Processor.
4. The Processor calls the `on_start` method and generates some new URLs to crawl. The Processor sends a message to the Scheduler that this task is finished, and sends the new tasks to the Scheduler via the message queue (in most cases there are no results for `on_start`; if there are results, the Processor sends them to the `result_queue`).
5. The Scheduler receives the new tasks, looks them up in the database, and determines whether each task is new or requires a re-crawl; if so, it puts the task into the task queue and dispatches tasks in order (the re-crawl behaviour can be tuned from the script, as shown in the sketch after this list).
6. The process repeats (from step 3) and won't stop till the WWW is dead ;-). The Scheduler will check periodic tasks to crawl the latest data.
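How step 5 distinguishes "new" from "needs re-crawl" can be influenced from the script. A small sketch with the documented freshness parameters (the selector and values are placeholders):

```python
def index_page(self, response):
    for item in response.doc('.post').items():          # placeholder selector
        self.crawl(item('a').attr.href,
                   callback=self.detail_page,
                   age=7 * 24 * 60 * 60,                # within 7 days the task counts as unchanged
                   itag=item('.updated').text(),        # re-crawl when this marker changes
                   auto_recrawl=True)                   # re-submit the task every `age` seconds
```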

Reposted from: https://my.oschina.net/sijinge/blog/1527247
