Tutorial
- You can do things like setting a download delay between requests, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically. ---link
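A sketch of what those knobs look like in a project's settings.py — the setting names are Scrapy's, but the values here are only illustrative and should be tuned per target site:

```python
# settings.py -- illustrative values, not recommendations

# Fixed politeness delay (seconds) between requests to the same site
DOWNLOAD_DELAY = 1.0

# Cap parallelism per domain / per IP (the per-IP cap takes precedence
# when CONCURRENT_REQUESTS_PER_IP is non-zero)
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

# Or let the AutoThrottle extension adjust the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```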
- The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback. ---link
- As a shortcut for creating Request objects you can use response.follow. Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
- Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too hard because of a programming mistake. This can be configured by the DUPEFILTER_CLASS setting.
- As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class: a generic spider that implements a small rules engine you can write your crawlers on top of.
- Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
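The "trick" for building one item from several pages is passing partial data forward to the next callback; in Scrapy this is done with Request.cb_kwargs. A minimal sketch of that data-flow using plain-Python stand-ins — FakeRequest and the dict "responses" below are hypothetical, not Scrapy classes, so the mechanism is visible without running a crawl:

```python
# Sketch of the "pass data to the next callback" pattern behind
# Request.cb_kwargs. FakeRequest is a hypothetical stand-in for
# scrapy.Request; responses are plain dicts instead of Response objects.

class FakeRequest:
    """Records a URL, a callback, and the cb_kwargs to forward to it."""
    def __init__(self, url, callback, cb_kwargs=None):
        self.url = url
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}

def parse_quote_page(response):
    # Page 1: start the item, then request the author page, forwarding
    # the partial item via cb_kwargs.
    item = {"text": response["text"]}
    yield FakeRequest(response["author_url"], callback=parse_author_page,
                      cb_kwargs={"item": item})

def parse_author_page(response, item):
    # Page 2: the forwarded partial item arrives as a keyword argument
    # and is completed here.
    item["author_bio"] = response["bio"]
    yield item

# Drive the callbacks by hand to show one item built across two "pages"
request = next(parse_quote_page({"text": "A quote", "author_url": "/author/1"}))
item = next(request.callback({"bio": "Author bio"}, **request.cb_kwargs))
print(item)  # {'text': 'A quote', 'author_bio': 'Author bio'}
```

In a real spider the two functions would be methods, and you would yield `response.follow(href, callback=self.parse_author_page, cb_kwargs={"item": item})` instead of constructing the request by hand.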
- Using spider arguments
You can provide command-line arguments to your spiders by using the -a option when running them:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default, so inside the spider you can read them with a fallback:
tag = getattr(self, 'tag', None)
You can learn more about handling spider arguments here.
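The getattr line works because Scrapy passes -a arguments to the spider's __init__ and sets them as attributes. A runnable sketch of that mechanism — MiniSpider is a hypothetical stand-in for scrapy.Spider, and the URL shape follows the tutorial's quotes.toscrape.com:

```python
# Sketch of how -a arguments surface inside a spider. MiniSpider is a
# hypothetical stand-in for scrapy.Spider, whose default __init__ also
# turns keyword arguments into instance attributes.

class MiniSpider:
    def __init__(self, **kwargs):
        # scrapy crawl quotes -a tag=humor  ->  Spider(tag="humor")
        self.__dict__.update(kwargs)

    def start_urls(self):
        base = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when -a tag=... is omitted
        return [base + f"tag/{tag}/" if tag else base]

print(MiniSpider(tag="humor").start_urls())  # tag filter applied
print(MiniSpider().start_urls())             # no -a: falls back to base URL
```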
Command line tool