Scrapy's name variable: using Scrapy to crawl multiple pages


Scrapy is a fun crawler framework. The basic usage: give it a batch of start URLs, let the spider fetch those pages, then parse each page and pull out whatever you are after.

Using it feels a lot like Django: there are settings, there are Fields, and it auto-generates a pile of scaffolding for you.

Usage: scrapy startproject abc generates a project (very old releases shipped this command as scrapy-admin.py startproject). Try it and you will see what gets generated.

Create a new .py file inside the spiders package and write your custom spider class there.

A custom spider class must define the name and start_urls attributes (very old Scrapy used domain_name instead of name), plus an instance method parse(self, response). In those old versions you also had to define a module-level SPIDER variable, which got instantiated when Scrapy imported the module so the engine could find the spider; newer versions simply locate spiders by their name. How a spider runs:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those Requests. The key to this first step is start_requests(): it combines start_urls and parse to produce the initial requests.
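A simplified sketch of what that default amounts to (not the actual Scrapy source, just the observable behaviour of the base class):

from scrapy.http import Request

# Simplified sketch of the default start_requests() behaviour: one Request
# per entry in start_urls, handled by self.parse. dont_filter=True mirrors
# what make_requests_from_url() does by default.
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, callback=self.parse, dont_filter=True)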

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also carry a callback (maybe the same one), will be downloaded by Scrapy, and their responses handled by that callback. So parse can return Requests, Items, or a generator yielding either; the returned Requests end up with the downloader, and the cycle keeps producing new URLs and items.

In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data. In other words, Scrapy does not care how you build the items; the bundled XPath selector is just a convenience. I have seen people use lxml; I personally prefer BeautifulSoup, even though it is the slowest of the bunch.
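For example, here is a minimal sketch of a callback that extracts the same data twice, once with Scrapy's selector and once with lxml (the //h3 XPath is only an illustration, borrowed from the docs example further down):

import lxml.html
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    # Option 1: Scrapy's bundled XPath selector
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3/text()').extract()

    # Option 2: plain lxml on the raw body; Scrapy only cares about what
    # the callback returns, not how the data was extracted
    doc = lxml.html.fromstring(response.body)
    titles_via_lxml = doc.xpath('//h3/text()')

    self.log("selector found %d, lxml found %d"
             % (len(titles), len(titles_via_lxml)))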

Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports. So in the end the items are handed to the pipeline, where you can process them however you like: store them in a database, write them to a file, and so on.
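A minimal pipeline sketch, assuming you just want every item appended to a JSON-lines file (the class and file names here are mine, not from the original project; hook it up via ITEM_PIPELINES in settings.py):

import json

class JsonWriterPipeline(object):
    """Append every scraped item to items.jl, one JSON object per line."""

    def __init__(self):
        self.f = open('items.jl', 'a')

    def process_item(self, item, spider):
        # scrapy Items behave like dicts, so dict(item) is enough here;
        # return the item so any later pipeline still receives it
        self.f.write(json.dumps(dict(item)) + "\n")
        return item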

Here is the spider I wrote this month to crawl qiushibaike.com:

from scrapy.spider import BaseSpider
import random, uuid
from BeautifulSoup import BeautifulSoup as BS
from scrapy.selector import HtmlXPathSelector

from tutorial.items import TutorialItem

def getname():
    # uuid1().hex is an attribute, not a method (helper not used below)
    return uuid.uuid1().hex

class JKSpider(BaseSpider):
    name = 'joke'
    allowed_domains = ["qiushibaike.com"]
    start_urls = [
        "http://www.qiushibaike.com/month?slow",
    ]

    def parse(self, response):
        root = BS(response.body)   # BeautifulSoup tree (unused below)
        items = []
        x = HtmlXPathSelector(response)

        # every joke sits in a <div class="content" title="...">
        y = x.select("//div[@class='content' and @title]/text()").extract()
        for i in y:
            item = TutorialItem()
            item["content"] = i
            items.append(item)

        return items
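The post does not show tutorial/items.py, but given that the spider only fills a content field, it presumably looks something like this:

from scrapy.item import Item, Field

class TutorialItem(Item):
    content = Field()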

Scrapy comes with some useful generic spiders that you can subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed. In short, Scrapy ships with several ready-made spiders you can inherit from: whole-site crawling by rules, crawling from a sitemap, or pulling URLs out of an XML/CSV feed.
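For instance, whole-site crawling is what CrawlSpider is for: you declare link-following rules and it schedules the follow-up Requests for you. A sketch (the URL pattern and item usage are illustrative; the import paths are the old ones matching the BaseSpider imports in this post, newer Scrapy moved them to scrapy.spiders and scrapy.linkextractors):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem

class WholeSiteSpider(CrawlSpider):
    name = 'wholesite'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # follow every link matching /article/<id> and hand each page to
    # parse_page; CrawlSpider itself generates the follow-up Requests
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/article/\d+', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        item['content'] = hxs.select('//title/text()').extract()
        return item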

class scrapy.spider.BaseSpider

This is the simplest spider, and the one from which every other spider must inherit (both the ones that come bundled with Scrapy and the ones you write yourself). It does not provide any special functionality. It just requests the given start_urls/start_requests, and calls the spider's parse method for each of the resulting responses.

So this is the base class of all spiders. It has no special features: it requests start_urls/start_requests and uses parse as the default callback.

name

A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it is required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

In short: name must be unique, so naming the spider after the domain is a safe choice, since domains are unique by definition.

allowed_domains

An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names in this list will not be followed if OffsiteMiddleware is enabled.

That is, URLs outside these domains are not crawled, provided OffsiteMiddleware is enabled.

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

The list of start URLs; not much more to say.

start_requests()

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping and no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once by Scrapy, so it is safe to implement it as a generator.

So: this method is called when no particular URLs are specified (I take that to mean launching the crawl from the Scrapy command line without extra URL arguments). If specific URLs are given, make_requests_from_url() builds the requests instead. Either way the method runs only once per spider.
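Since it only runs once, writing it as a generator is perfectly fine. A sketch of paginated start requests (the URL pattern is made up purely for illustration):

from scrapy.http import Request

def start_requests(self):
    # one request per listing page; self.parse handles each response
    for page in range(1, 11):
        yield Request("http://www.example.com/list?page=%d" % page,
                      callback=self.parse)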

The default implementation uses make_requests_from_url() to generate a Request for each URL in start_urls.

If you want to change the Requests used to start scraping a domain, this is the method to override; for example, if you need to start by logging in using a POST request, you could do:

# FormRequest is imported from scrapy.http
def start_requests(self):
    return [FormRequest("http://www.example.com/login",
                        formdata={'user': 'john', 'pass': 'secret'},
                        callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass

With that in place you can scrape data that is only visible to a logged-in user.

make_requests_from_url(url)

A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert URLs to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function, and with the dont_filter parameter enabled (see the Request class for more info).

This is what was mentioned above: it turns a URL into a Request and wires parse up as the callback on the Request it creates.
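If you do override it, remember that the default sets parse() as the callback and enables dont_filter. A sketch that keeps those defaults while adding a header (the header value is mine, purely illustrative):

from scrapy.http import Request

def make_requests_from_url(self, url):
    return Request(url, callback=self.parse, dont_filter=True,
                   headers={'User-Agent': 'my-crawler/0.1'})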

parse(response)

This is the default callback used by Scrapy to process downloaded responses, when their requests do not specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the BaseSpider class.

This method, as well as any other Request callback, must return an iterable of Request and/or Item objects.

Parameters: response (scrapy.http.Response) – the response to parse

In short, this method has to return Requests or Items.

log(message[, level, component])

Log a message using the scrapy.log.msg() function, automatically populating the spider argument with the name of this spider. For more information see Logging.
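A quick usage sketch (the message text is mine): inside any callback you can just call self.log() and the spider name is attached for you.

from scrapy import log

def parse(self, response):
    self.log("visited %s" % response.url, level=log.INFO)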

Example:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for h3 in hxs.select('//h3').extract():
            yield MyItem(title=h3)

        for url in hxs.select('//a/@href').extract():
            yield Request(url, callback=self.parse)
