Following links
Let’s say, instead of just scraping the stuff from the first two pages
from http://quotes.toscrape.com, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links
from them.
The first thing to do is extract the link to the page we want to follow. Examining
our page, we can see there is a link to the next page with the following
markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
We can try extracting it in the shell:
>>> response.css("li.next a").get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:
>>> response.css("li.next a::attr(href)").get()
"/page/2/"
There is also an attrib property available
(see Selecting element attributes for more):
>>> response.css("li.next a").attrib["href"]
"/page/2/"
Let’s see now our spider modified to recursively follow the link to the next
page, extracting data from it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to
the next page, builds a full absolute URL using the
urljoin() method (since the links can be
relative), and yields a new request to the next page, registering itself as
the callback to handle the data extraction there and keep the
crawl going through all the pages.
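The URL-joining step can be illustrated with the standard library's urljoin, which behaves in essentially the same way as response.urljoin with response.url as the base (the URLs below are just the ones from this tutorial):

from urllib.parse import urljoin

# response.urljoin(next_page) resolves a possibly-relative link
# against the URL of the current response, roughly like this:
base = "http://quotes.toscrape.com/page/1/"
next_page = urljoin(base, "/page/2/")
print(next_page)  # http://quotes.toscrape.com/page/2/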
What you see here is Scrapy’s mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it’s
visiting.
In our example, it creates a sort of loop, following all the links to the next page
until it doesn’t find one – handy for crawling blogs, forums and other sites with
pagination.
A shortcut for creating Requests
As a shortcut for creating Request objects you can use
response.follow:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly - no
need to call urljoin. Note that response.follow just returns a Request
instance; you still have to yield this Request.
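This can be seen without running a crawl by building a synthetic TextResponse by hand (something Scrapy normally does for you; the response here is only an illustration, no network is involved):

from scrapy.http import TextResponse

# A hand-built response, just to show how follow() resolves relative URLs.
response = TextResponse(
    url="http://quotes.toscrape.com/page/1/",
    body=b"<html></html>",
    encoding="utf-8",
)
# The relative URL is resolved against response.url:
request = response.follow("/page/2/")
print(request.url)  # http://quotes.toscrape.com/page/2/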
You can also pass a selector to response.follow instead of a string;
this selector should extract necessary attributes:
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href
attribute automatically. So the code can be shortened further:
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use
response.follow_all instead:
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
or, shortening it further:
yield from response.follow_all(css="ul.pager a", callback=self.parse)
More examples and patterns
Here is another spider that illustrates callbacks and following links,
this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }
This spider will start from the main page; it will follow all the links to the
author pages, calling the parse_author callback for each of them, as well as
the pagination links with the parse callback as we saw before.
Here we’re passing callbacks to
response.follow_all as positional
arguments to make the code shorter; it also works for
Request.
The parse_author callback defines a helper function to extract and clean up the
data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don’t need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured by the setting
DUPEFILTER_CLASS.
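If you ever do need to revisit a URL (for example, to re-check a page), you can opt out of this filtering for a single request with the dont_filter argument of Request; a minimal sketch:

import scrapy

# dont_filter=True tells the scheduler not to drop this request
# even if the same URL has already been visited.
request = scrapy.Request(
    "http://quotes.toscrape.com/page/1/",
    dont_filter=True,
)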
Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links,
check out the CrawlSpider class for a generic
spider that implements a small rules engine that you can use to write your
crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page,
using a trick to pass additional data to the callbacks.
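One way to do that is the cb_kwargs argument accepted by Request and response.follow, which delivers extra keyword arguments to the callback. The sketch below combines quote data with the author's birthdate; the spider name and field choices are illustrative, not part of the tutorial:

import scrapy


class QuoteAuthorSpider(scrapy.Spider):
    # Hypothetical spider: builds one item from two pages.
    name = "quote_author"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
            author_link = quote.css(".author + a::attr(href)").get()
            # Pass the partially built item along to the next callback.
            yield response.follow(
                author_link, self.parse_author, cb_kwargs={"item": item}
            )

    def parse_author(self, response, item):
        # The dict passed via cb_kwargs arrives as a keyword argument.
        item["birthdate"] = response.css(".author-born-date::text").get()
        yield item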