Scraping cycle:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those Requests.
In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests may also contain a callback (maybe the same one); they will then be downloaded by Scrapy and their responses handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (though you can also use BeautifulSoup, lxml or whatever mechanism you prefer), and generate items with the parsed data.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.
scrapy.Spider
It’s the simplest spider. It just provides a default start_requests() implementation which sends requests for the URLs in the start_urls spider attribute and calls the spider’s parse method for each of the resulting responses.
name:
This attribute defines the name of the spider. It must be unique within a project. You use it to run the spider from the command line of your Scrapy project, like this:
scrapy crawl quotes
allowed_domains:
An optional list of strings containing domains that this spider is allowed to crawl.
start_urls:
A list of URLs where the spider will begin to crawl from.
custom_settings:
A dictionary of settings that will be overridden from the project wide configuration when running this spider.
crawler:
This attribute is set by the from_crawler() class method after the spider is initialized, and links to the Crawler object this spider is bound to.
settings:
A Settings instance containing the configuration for running this spider.
logger:
A Python logger created with the spider’s name. You can use it to send log messages from the spider.
start_requests()
This method must return an iterable of Requests. It is called when the spider is opened for scraping, and Scrapy calls it only once, so it’s safe to implement start_requests() as a generator.
The default implementation generates Request(url, dont_filter=True) for each url in start_urls.
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you can do it like this:
def start_requests(self):
    return [scrapy.FormRequest("http://www.example.com/login",
                               formdata={'user': 'john', 'pass': 'secret'},
                               callback=self.logged_in)]
parse(response)
This is the default callback used by Scrapy to process downloaded responses. This method must return an iterable of Requests, dicts, or Item objects.
closed(reason)
Called when the spider closes.
Spider arguments
Spiders can receive arguments. Spider arguments are passed through the crawl command using the -a option. For example:
scrapy crawl spidername -a category=electronics
You can use them in the __init__ method like this:
def __init__(self, category=None, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = ['http://www.example.com/categories/%s' % category]
The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:
def start_requests(self):
yield scrapy.Request('http://www.example.com/categories/%s' % self.category)