Scrapy Notes (4): Simple Spider

Spiders:
Spiders are classes which define how to scrape data from a website. You define these spider classes after creating a project.

Scraping cycle:

For spiders, the scraping cycle goes through something like this:

  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method, which (by default) generates Request objects for the URLs specified in start_urls, with the parse method as the callback function for those Requests.

  2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
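A minimal sketch tying these steps together (quotes.toscrape.com is the demo site used by the Scrapy tutorial; the CSS selectors assume its markup):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        # step 1: the default start_requests() turns these into the initial Requests
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # steps 2-3: parse the response with Selectors and yield dicts
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }
            # step 2 again: follow pagination with another Request (same callback)
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

The dicts yielded here are what step 4 persists via an Item Pipeline or Feed exports.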

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. Here we look at the simplest one, scrapy.Spider.

scrapy.Spider

It is the simplest spider. It just provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

name:

This attribute defines the name of the spider. It must be unique within a project. You use it to run the spider from the command line, for example:

scrapy crawl quotes

allowed_domains:

An optional list of strings containing domains that this spider is allowed to crawl.

start_urls:

A list of URLs where the spider will begin to crawl from.

custom_settings:

A dictionary of settings that will be overridden from the project wide configuration when running this spider.
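A sketch pulling these attributes together (the domain and the DOWNLOAD_DELAY value are just illustrative):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                    # unique name, used by `scrapy crawl example`
        allowed_domains = ['example.com']   # optional off-site filter
        start_urls = ['http://www.example.com/']
        # per-spider overrides of the project-wide settings;
        # must be a class attribute, since settings are applied before instantiation
        custom_settings = {
            'DOWNLOAD_DELAY': 2,
        }

        def parse(self, response):
            pass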

crawler:

This attribute is set by the from_crawler() class method after the spider is initialized, and links to the Crawler object this spider instance is bound to.

settings:

A Settings instance with the configuration this spider is being run with.
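For example, you can read values from it inside any spider method (the setting looked up here is just an illustration):

    def parse(self, response):
        # getfloat() is one of the typed getters on the Settings object
        delay = self.settings.getfloat('DOWNLOAD_DELAY')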

logger:

A Python logger created with the spider's name. You can use it to send log messages.
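A trivial example (the message text is arbitrary):

    def parse(self, response):
        self.logger.info('Got a response from %s', response.url)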

start_requests()

This method must return an iterable with the first Requests to crawl for this spider. It is called when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.

The default implementation generates Request(url, dont_filter=True) for each url in start_urls.
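In other words, the default is roughly equivalent to this sketch:

    def start_requests(self):
        # one Request per start URL; parse() is used as the callback by default
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)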

If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you can do it like this:

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return them with another callback
        pass
parse(response)

This is the default callback used by Scrapy to process downloaded responses. This method must return an iterable of Requests and/or dicts or Item objects.
closed(reason)

Called when the spider closes.
Spider arguments

Spiders can receive arguments. Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl spidername -a category=electronics

You can use them in the __init__ method like this:

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]

The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)