Scrapy Tutorial

Following links

Let's say that, instead of just scraping the stuff from the first two pages of http://quotes.toscrape.com, you want quotes from all the pages on the website. Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing to do is extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css("li.next a").get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css("li.next a::attr(href)").get()
'/page/2/'

There is also an attrib property available (see Selecting element attributes for more):

>>> response.css("li.next a").attrib["href"]
'/page/2/'

Let's now see our spider, modified to recursively follow the link to the next page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
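
You can try urljoin() directly in the shell to see what it does; assuming the shell was opened on the first page, the relative href is resolved against the response URL:

>>> response.urljoin("/page/2/")
'http://quotes.toscrape.com/page/2/'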

What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes. Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it's visiting.
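
As a minimal sketch of that idea (the a.detail selector and the parse_detail callback below are hypothetical, not part of the tutorial spider), a callback can route different kinds of links to different callbacks inside the same spider:

    def parse(self, response):
        # Hypothetical: send detail links to a dedicated callback...
        for href in response.css("a.detail::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

        # ...while this callback keeps handling the listing pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        # Extract a different kind of data from the detail page.
        yield {"url": response.url}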

In our example, it creates a sort of loop, following all the links to the next page until it doesn't find one – handy for crawling blogs, forums and other sites with pagination.

A shortcut for creating Requests

As a shortcut for creating Request objects you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
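
In other words, assuming next_page holds the relative href ("/page/2/"), the two forms below queue the same request; the only difference is that scrapy.Request needs the absolute URL:

yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
# ...is equivalent to:
yield response.follow(next_page, callback=self.parse)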

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

yield from response.follow_all(css="ul.pager a", callback=self.parse)

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback, as we saw before.

Here we're passing callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.

The parse_author callback defines a helper function to extract and clean up the data from a CSS query, and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
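
If you ever do need to revisit a URL on purpose, the usual options are to pass dont_filter=True to a single Request, or to change the filter through that setting; a minimal sketch of both, only as an illustration:

# Inside a callback: re-request a URL even if it was already visited.
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)

# In settings.py: replace the default filter with one that filters nothing.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"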

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class: a generic spider that implements a small rules engine that you can write your crawlers on top of.
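
As a rough sketch of what that can look like (the spider name quotes_crawl and the extracted fields are just illustrative), a CrawlSpider declares rules instead of yielding requests by hand:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    start_urls = ["http://quotes.toscrape.com/"]

    rules = (
        # Follow every pagination link and parse the pages it leads to.
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }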

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
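
One common way to do that is Request.cb_kwargs; the sketch below is only an illustration (the selectors are meant to match this site's markup, but the flow and field names are made up), showing how a partially built item can be handed over to the next callback and completed there:

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
            # The "(about)" link next to the author name points to the author page.
            author_href = quote.css("span a::attr(href)").get()
            yield response.follow(
                author_href,
                callback=self.parse_author,
                cb_kwargs={"item": item},  # passed to parse_author as a keyword argument
            )

    def parse_author(self, response, item):
        # Complete the item with data from the author page, then yield it.
        item["birthdate"] = response.css(".author-born-date::text").get()
        yield item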
