scrapy_doc_tips

Tutorial

  1. You can do things like setting a download delay between each request, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically (a settings sketch follows this list).
  2. The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback (see the spider sketch after this list).
  3. As a shortcut for creating Request objects you can use response.follow. Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request (see the sketch after this list).
  4. Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the setting DUPEFILTER_CLASS (see the sketch after this list).
  5. As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class, a generic spider that implements a small rules engine you can build your crawlers on top of (see the sketch after this list).
  6. Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks (see the sketch after this list).
  7. Using spider arguments

    You can provide command line arguments to your spiders by using the -a option when running them:

    scrapy crawl quotes -o quotes-humor.json -a tag=humor

    Inside the spider, the argument becomes an attribute of the spider instance, which you can read with:

    tag = getattr(self, 'tag', None)

    You can learn more about handling spider arguments in the Scrapy documentation; a fuller sketch follows this list.
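
A minimal settings sketch for tip 1, assuming a project-level settings.py; the values are illustrative placeholders, not recommendations:

    # settings.py -- illustrative values only
    DOWNLOAD_DELAY = 2                      # wait 2 seconds between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 8      # cap on parallel requests per domain
    CONCURRENT_REQUESTS_PER_IP = 0          # 0 disables the per-IP cap; a value > 0 enables it
    AUTOTHROTTLE_ENABLED = True             # let the AutoThrottle extension adjust delays dynamically
    AUTOTHROTTLE_START_DELAY = 5            # initial download delay AutoThrottle starts from
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests to aim for per server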
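
A minimal spider sketch for tip 2, using the quotes.toscrape.com site from the tutorial: the requests generated from start_urls carry no explicit callback, so Scrapy hands each response to parse().

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # Requests built from start_urls have no callback assigned,
        # so their responses go to the default parse() method.
        start_urls = ["http://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}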
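
A sketch for tip 3, comparing scrapy.Request and response.follow inside a parse() method; the spider name and the pager selector are assumptions based on the tutorial site:

    import scrapy

    class FollowSpider(scrapy.Spider):
        name = "follow_example"  # hypothetical spider name
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            for href in response.css("ul.pager a::attr(href)").getall():
                # scrapy.Request needs an absolute URL, so a relative href
                # would have to be joined manually:
                #   yield scrapy.Request(response.urljoin(href), callback=self.parse)
                # response.follow accepts the relative URL directly, but it only
                # builds the Request -- it still has to be yielded:
                yield response.follow(href, callback=self.parse)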
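
For tip 4, a hedged sketch: the filter is selected globally via DUPEFILTER_CLASS (the value below is Scrapy's default), and a single request can opt out of filtering with dont_filter=True.

    # settings.py -- the default duplicate filter; point this at your own class to change the behaviour
    DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

    # Inside a spider callback, dont_filter=True schedules the request even if
    # its URL has already been seen:
    #   yield scrapy.Request(url, callback=self.parse, dont_filter=True)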
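
A hedged CrawlSpider sketch for tip 5; the spider name, the rule pattern and the callback are placeholders chosen for illustration:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class TagCrawlSpider(CrawlSpider):
        name = "crawl_example"  # hypothetical spider name
        start_urls = ["http://quotes.toscrape.com/"]

        # Each Rule tells the spider which links to extract and follow,
        # and which callback (if any) parses the matched pages.
        rules = (
            Rule(LinkExtractor(allow=r"/tag/"), callback="parse_tag", follow=True),
        )

        def parse_tag(self, response):
            yield {"url": response.url}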
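
For tip 6, one way to carry partial item data into the next callback is the cb_kwargs argument of Request (request.meta is the older equivalent). A sketch of two callbacks inside a spider class, assuming the tutorial site's author markup:

    def parse(self, response):
        for quote in response.css("div.quote"):
            partial = {"text": quote.css("span.text::text").get()}
            author_href = quote.css(".author + a::attr(href)").get()
            # cb_kwargs carries the partially built item to the next callback
            yield response.follow(author_href, callback=self.parse_author,
                                  cb_kwargs={"item": partial})

    def parse_author(self, response, item):
        # Finish the item with data from the author page and emit it
        item["author_birthdate"] = response.css(".author-born-date::text").get()
        yield item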
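
A fuller sketch for tip 7, adapted from the tutorial: the value passed with -a tag=humor shows up as the spider attribute self.tag and is used to build the start URL.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = "http://quotes.toscrape.com/"
            # -a tag=humor on the command line becomes the attribute self.tag
            tag = getattr(self, "tag", None)
            if tag is not None:
                url = url + "tag/" + tag
            yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}

With this spider, scrapy crawl quotes -o quotes-humor.json -a tag=humor crawls only the pages for the humor tag and writes the scraped quotes to quotes-humor.json.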

Command line tool

 

posted on 2019-04-30 17:12 by 我很好u

Reposted from: https://www.cnblogs.com/jyh-py-blog/p/10797176.html
