Tutorial
- You can do things like setting a download delay between requests, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically. ---link
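A sketch of what those knobs look like in a project's settings.py — the setting names are Scrapy's, but the values here are only illustrative and should be tuned per target site:

```python
# settings.py -- illustrative values, not recommendations

# Fixed politeness delay (seconds) between requests to the same site
DOWNLOAD_DELAY = 1.0

# Cap parallelism per domain / per IP (the per-IP cap takes precedence
# when CONCURRENT_REQUESTS_PER_IP is non-zero)
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

# Or let the AutoThrottle extension adjust the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```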
- The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback. ---link
- As a shortcut for creating Request objects you can use response.follow. Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
- Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too hard because of a programming mistake. This can be configured by the DUPEFILTER_CLASS setting.
- As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class: a generic spider that implements a small rules engine you can write your crawlers on top of.
- Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
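The "trick" for building one item from several pages is passing partial data forward to the next callback; in Scrapy this is done with Request.cb_kwargs. A minimal sketch of that data-flow using plain-Python stand-ins — FakeRequest and the dict "responses" below are hypothetical, not Scrapy classes, so the mechanism is visible without running a crawl:

```python
# Sketch of the "pass data to the next callback" pattern behind
# Request.cb_kwargs. FakeRequest is a hypothetical stand-in for
# scrapy.Request; responses are plain dicts instead of Response objects.

class FakeRequest:
    """Records a URL, a callback, and the cb_kwargs to forward to it."""
    def __init__(self, url, callback, cb_kwargs=None):
        self.url = url
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}

def parse_quote_page(response):
    # Page 1: start the item, then request the author page, forwarding
    # the partial item via cb_kwargs.
    item = {"text": response["text"]}
    yield FakeRequest(response["author_url"], callback=parse_author_page,
                      cb_kwargs={"item": item})

def parse_author_page(response, item):
    # Page 2: the forwarded partial item arrives as a keyword argument
    # and is completed here.
    item["author_bio"] = response["bio"]
    yield item

# Drive the callbacks by hand to show one item built across two "pages"
request = next(parse_quote_page({"text": "A quote", "author_url": "/author/1"}))
item = next(request.callback({"bio": "Author bio"}, **request.cb_kwargs))
print(item)  # {'text': 'A quote', 'author_bio': 'Author bio'}
```

In a real spider the two functions would be methods, and you would yield `response.follow(href, callback=self.parse_author_page, cb_kwargs={"item": item})` instead of constructing the request by hand.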
- Using spider arguments
You can provide command-line arguments to your spiders by using the -a option when running them:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default, so inside the spider you can read them with a fallback:
tag = getattr(self, 'tag', None)
You can learn more about handling spider arguments here.
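The getattr line works because Scrapy passes -a arguments to the spider's __init__ and sets them as attributes. A runnable sketch of that mechanism — MiniSpider is a hypothetical stand-in for scrapy.Spider, and the URL shape follows the tutorial's quotes.toscrape.com:

```python
# Sketch of how -a arguments surface inside a spider. MiniSpider is a
# hypothetical stand-in for scrapy.Spider, whose default __init__ also
# turns keyword arguments into instance attributes.

class MiniSpider:
    def __init__(self, **kwargs):
        # scrapy crawl quotes -a tag=humor  ->  Spider(tag="humor")
        self.__dict__.update(kwargs)

    def start_urls(self):
        base = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when -a tag=... is omitted
        return [base + f"tag/{tag}/" if tag else base]

print(MiniSpider(tag="humor").start_urls())  # tag filter applied
print(MiniSpider().start_urls())             # no -a: falls back to base URL
```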
Command line tool