pyspider Crawler Study - Documentation Translation - Working-with-Results.md

weixin_34403693

Original link: https://my.oschina.net/sijinge/blog/1530053

Working with Results
====================
Downloading and viewing your data from the WebUI is convenient, but may not be suitable for processing by a program.

Working with ResultDB
---------------------
resultdb is designed only for result preview and is not suitable for large-scale storage. If you still want to pull data from resultdb, the following snippet uses the database API to connect and select the data.

```
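from pyspider.database import connect_database

# connect to the result database using its connection URL
resultdb = connect_database("<your resultdb connection url>")
for project in resultdb.projects:
    for result in resultdb.select(project):
        # each result carries the task id, the URL, and the returned object
        assert result['taskid']
        assert result['url']
        assert result['result']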
```
`result['result']` is the object submitted by the `return` statement in your script.
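
As a rough sketch of exporting what the loop above reads, the helper below writes each result as one JSON line. It relies only on the `projects` attribute and the `select(project)` call used in the snippet above; `dump_results` and the in-memory `FakeResultDB` stand-in are hypothetical names made up for illustration, so the sketch can run without a real database.

```python
import io
import json


def dump_results(resultdb, out):
    """Write every result of every project to `out`, one JSON object per line."""
    count = 0
    for project in resultdb.projects:
        for result in resultdb.select(project):
            record = {
                "project": project,
                "taskid": result["taskid"],
                "url": result["url"],
                "result": result["result"],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


class FakeResultDB:
    """Hypothetical in-memory stand-in exposing only `projects` and `select`."""

    def __init__(self, data):
        self._data = data

    @property
    def projects(self):
        return list(self._data)

    def select(self, project):
        return iter(self._data[project])


fake = FakeResultDB({
    "demo": [
        {"taskid": "t1", "url": "http://example.com/", "result": {"title": "Example"}},
    ],
})
buffer = io.StringIO()
written = dump_results(fake, buffer)
```

With a real connection you would pass the object returned by `connect_database(...)` and an open file in place of the stand-in and the `StringIO` buffer.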

Working with ResultWorker
-------------------------
In a production environment, you may want to connect pyspider to your own system or post-processing pipeline rather than store results in resultdb. Overriding ResultWorker is highly recommended.

```
from pyspider.result import ResultWorker

class MyResultWorker(ResultWorker):
    def on_result(self, task, result):
        assert task['taskid']
        assert task['project']
        assert task['url']
        assert result
        # your processing code goes here
```
`result` is the object submitted by the `return` statement in your script.

You can put this script (e.g., `my_result_worker.py`) in the folder where you launch pyspider, then pass the class to the `result_worker` subcommand:

`pyspider result_worker --result-cls=my_result_worker.MyResultWorker`

Or

```
{
  ...
  "result_worker": {
    "result_cls": "my_result_worker.MyResultWorker"
  }
  ...
}
```
if you are using a config file. See [Deployment](/Deployment) for details.

Design Your Own Database Schema
-------------------------------
Results stored in the database are encoded as JSON for compatibility. It's highly recommended to design your own database schema and override the ResultWorker as described above.
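
One possible shape for such a schema, sketched with SQLite from the Python standard library. The table name `results`, its columns, and the `save_result` helper are assumptions for illustration, not part of pyspider. `save_result` takes the same `task` and `result` arguments as `on_result` above, so a custom ResultWorker could call it from there.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    project TEXT NOT NULL,
    taskid  TEXT NOT NULL,
    url     TEXT NOT NULL,
    result  TEXT NOT NULL,  -- JSON-encoded result object
    PRIMARY KEY (project, taskid)
)
"""


def save_result(conn, task, result):
    """Insert a result, replacing any earlier row for the same project/taskid."""
    conn.execute(
        "INSERT OR REPLACE INTO results (project, taskid, url, result) "
        "VALUES (?, ?, ?, ?)",
        (task["project"], task["taskid"], task["url"], json.dumps(result)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
conn.execute(SCHEMA)
task = {"project": "demo", "taskid": "t1", "url": "http://example.com/"}
save_result(conn, task, {"price": 10})
save_result(conn, task, {"price": 12})  # same taskid: replaces the first row
rows = conn.execute("SELECT url, result FROM results").fetchall()
```

Keying on `(project, taskid)` mirrors the overwrite-by-taskid behaviour of resultdb; a schema that should keep history would need a different key, for example an added timestamp column.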

TIPS about Results
-------------------
#### Want to return more than one result in a callback?
Because resultdb de-duplicates results by taskid (URL), the latest result overwrites previous ones.
One workaround is to use the `send_message` API to make a `fake` taskid for each result.

```
def detail_page(self, response):
    for li in response.doc('li').items():
        self.send_message(self.project_name, {
            ...
        }, url=response.url+"#"+li('a.product-sku').text())

def on_message(self, project, msg):
    return msg
```
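
To see why the URL suffix helps, here is a small sketch of the de-duplication. It assumes pyspider's default behaviour of deriving the taskid from an MD5 hash of the URL; the `taskid_for` helper, the example URL, and the SKU values are hypothetical.

```python
import hashlib


def taskid_for(url):
    """Mimic the assumed default taskid: MD5 hex digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


page = "http://example.com/products"
skus = ["A100", "B200"]

# Results are stored per taskid, so the same URL always maps to one slot.
store = {}
store[taskid_for(page)] = {"sku": "A100"}
store[taskid_for(page)] = {"sku": "B200"}  # overwrites the first result
overwritten_count = len(store)

# Appending "#<sku>" gives each result its own URL, hence its own taskid.
distinct = {taskid_for(page + "#" + sku): {"sku": sku} for sku in skus}
distinct_count = len(distinct)
```

The suffix only has to be unique per result on the page; the SKU text works in the snippet above as long as no two items share it.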

See Also: [apis/self.send_message](/apis/self.send_message)
