https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly
The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:
```python
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
```
It works and is easy enough, but there are some issues that you should be aware of:
- There is a bit of boilerplate.
- As seen by Scrapy, we're sending requests to `RENDER_HTML_URL` instead of the target URLs. This affects concurrency and politeness settings: `CONCURRENT_REQUESTS_PER_DOMAIN`, `DOWNLOAD_DELAY`, etc. could behave in unexpected ways, since delays and concurrency are no longer applied per target domain.
- As seen by Scrapy, `response.url` is the URL of the Splash server. scrapy-splash fixes it to be the URL of the requested page; the "real" URL is still available as `response.real_url`.
- Some options depend on each other - for example, if you use the `timeout` Splash option, you may want to set the `download_timeout` scrapy.Request meta key as well (a short sketch of this pairing follows the list).
- It is easy to get it subtly wrong - e.g. if you don't pass `sort_keys=True` when preparing the JSON body, the binary POST body content can vary even when all keys and values are the same, which means the dupefilter and cache will work incorrectly (illustrated in a sketch after the list).
- The default Scrapy duplication filter doesn't take Splash specifics into account. For example, if a URL is sent in a JSON POST request body, Scrapy computes the request fingerprint without canonicalizing that URL.
- Splash Bad Request (HTTP 400) errors are hard to debug because, by default, the response content is not displayed by Scrapy. SplashMiddleware logs the content of HTTP 400 Splash responses by default (this can be turned off with the `SPLASH_LOG_400 = False` setting).
- Cookie handling is tedious to implement, and you can't use Scrapy's built-in cookie middleware to handle cookies when working with Splash.
- Large Splash arguments which don't change with every request (e.g. `lua_source`) may take a lot of space when saved to Scrapy disk request queues. `scrapy-splash` provides a way to store such static parameters only once.
- Splash 2.1+ provides a way to save network traffic by caching large static arguments on the server, but it requires client support: the client should send proper `save_args` and `load_args` values and handle HTTP 498 responses.
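To make the `timeout`/`download_timeout` coupling concrete, here is a minimal sketch of how the raw-API request from the example above might set both values; the helper name and the 30/60 second figures are illustrative assumptions, not recommendations:

```python
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"


def make_render_request(url, callback):
    # Ask Splash itself to give up rendering after 30 seconds...
    body = json.dumps({"url": url, "wait": 0.5, "timeout": 30}, sort_keys=True)
    headers = Headers({'Content-Type': 'application/json'})
    # ...and keep Scrapy's own download timeout larger, so Scrapy doesn't
    # drop the connection before Splash has had a chance to respond.
    return scrapy.Request(RENDER_HTML_URL, callback, method="POST",
                          body=body, headers=headers,
                          meta={"download_timeout": 60})
```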
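The `sort_keys=True` point is easy to verify in isolation: two dicts holding the same Splash arguments can serialize to different bytes if key order is left to chance, and since Scrapy's request fingerprint includes the request body, the dupefilter and HTTP cache would then treat identical requests as different ones. A minimal sketch:

```python
import json

args_a = {"url": "http://example.com", "wait": 0.5}
args_b = {"wait": 0.5, "url": "http://example.com"}  # same data, different insertion order

# Without sort_keys the serialized bodies differ byte-for-byte, so requests
# built from them get different fingerprints (and separate cache entries).
print(json.dumps(args_a) == json.dumps(args_b))  # False

# With sort_keys the body is stable regardless of how the dict was built.
print(json.dumps(args_a, sort_keys=True) == json.dumps(args_b, sort_keys=True))  # True
```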
scrapy-splash utilities take care of these edge cases and reduce the boilerplate.
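For comparison, here is roughly what the same spider looks like with scrapy-splash, assuming the package is installed and the middlewares and settings from its setup instructions are enabled; this is a sketch mirroring the `wait` value from the raw-API example, not a drop-in configuration:

```python
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # SplashRequest builds the POST request to the Splash endpoint,
            # keeps fingerprints consistent for the dupefilter/cache, and
            # makes response.url point at the page rather than at Splash.
            yield SplashRequest(url, self.parse, args={"wait": 0.5})

    def parse(self, response):
        # response.body is the HTML rendered by Splash.
        # ...
```

If you need the large-static-argument optimisation mentioned above (e.g. for `lua_source`), look at SplashRequest's `cache_args` option, which is scrapy-splash's interface to Splash's `save_args`/`load_args` mechanism.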