Why not use the Splash HTTP API directly?

https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly

The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:

import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...

It works and is easy enough, but there are some issues that you should be aware of:

  1. There is a bit of boilerplate.
  2. As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. It affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAINDOWNLOAD_DELAY, etc could behave in unexpected ways since delays and concurrency settings are no longer per-domain.
  3. As seen by Scrapy, response.url is an URL of the Splash server. scrapy-splash fixes it to be an URL of a requested page. "Real" URL is still available as response.real_url.
  4. Some options depend on each other - for example, if you use timeout Splash option then you may want to set download_timeout scrapy.Request meta key as well.
  5. It is easy to get it subtly wrong - e.g. if you won't use sort_keys=True argument when preparing JSON body then binary POST body content could vary even if all keys and values are the same, and it means dupefilter and cache will work incorrectly.
  6. Default Scrapy duplication filter doesn't take Splash specifics in account. For example, if an URL is sent in a JSON POST request body Scrapy will compute request fingerprint without canonicalizing this URL.
  7. Splash Bad Request (HTTP 400) errors are hard to debug because by default response content is not displayed by Scrapy. SplashMiddleware logs content of HTTP 400 Splash responses by default (it can be turned off by setting SPLASH_LOG_400 = False option).
  8. Cookie handling is tedious to implement, and you can't use Scrapy built-in Cookie middleware to handle cookies when working with Splash.
  9. Large Splash arguments which don't change with every request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.
  10. Splash 2.1+ provides a way to save network traffic by caching large static arguments on server, but it requires client support: client should send proper save_args and load_args values and handle HTTP 498 responses.

scrapy-splash utlities allow to handle such edge cases and reduce the boilerplate.


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值