Crawling multiple pages with Scrapy

Latest recommended article published on 2024-07-08 00:01:13

as3166073

Tags: python, crawler, database

Original link: http://www.cnblogs.com/Yeah-come-on/p/3320388.html

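Scrapy is a fun crawling framework. The basic usage is: give it a set of starting URLs, let the spider fetch (GET) those pages, then parse each page and extract whatever you are interested in.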

Using it feels a bit like Django: there are settings and fields, and it also auto-generates a bunch of files for you.

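Usage: run scrapy-admin.py startproject abc to generate a project (current Scrapy releases use scrapy startproject abc instead). Try it and you will see what gets generated.
Then create a new .py file in the spiders package and write your custom spider class in it.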

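A custom spider class must define the attributes domain_name and start_urls, and an instance method parse(self, response). (In current Scrapy releases, domain_name has been replaced by name.)

The spider is instantiated when Scrapy looks up our spiders, and Scrapy's engine finds it automatically.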

How a spider runs:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests. The key to this first step is start_requests(): it builds the first requests from start_urls, with parse as the callback.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback. In other words, parse can return Requests, items, or a generator that yields them. The URLs are eventually handed to the downloader to be fetched, and from there an endless stream of URLs and items gets produced.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data. You can use any parsing mechanism you like; Scrapy does not care how you produce the items, it merely ships with an XPath selector. I have seen people use lxml; I prefer BeautifulSoup, even though BeautifulSoup is the slowest.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports. That is, the items are handed to the pipeline, where you can process them however you like: store them in a database, write them to a file, and so on.
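
The four steps above can be modeled in a few lines of plain Python. This is only a toy sketch of the flow, not Scrapy's actual engine: FAKE_WEB, Request, CollectPipeline, and crawl are all made-up names, the "download" is a dictionary lookup, and the pipeline's process_item takes only the item (Scrapy's real pipelines also receive the spider).

```python
from collections import deque

# Toy stand-in for the web: URL -> page text and outgoing links (made-up data).
FAKE_WEB = {
    "http://example.com/1": {"text": "first", "links": ["http://example.com/2"]},
    "http://example.com/2": {"text": "second", "links": ["http://example.com/1"]},
}


class Request:
    """A URL plus the callback that should handle its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(response):
    """Callback: yield one item, then one Request per outgoing link."""
    yield {"url": response["url"], "text": response["text"]}
    for link in response["links"]:
        yield Request(link, callback=parse)


class CollectPipeline:
    """Stand-in for an item pipeline: upper-cases the text and stores the item."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        item = dict(item, text=item["text"].upper())
        self.stored.append(item)
        return item


def crawl(start_urls, pipeline):
    # Step 1: turn the start URLs into initial requests with `parse` as callback.
    queue = deque(Request(url, parse) for url in start_urls)
    seen = set(start_urls)
    while queue:
        request = queue.popleft()
        # "Download" the page by looking it up in the fake web.
        response = dict(FAKE_WEB[request.url], url=request.url)
        # Steps 2 and 3: the callback yields items and follow-up requests.
        for result in request.callback(response):
            if isinstance(result, Request):
                # Skip URLs already scheduled, so the crawl terminates.
                if result.url not in seen:
                    seen.add(result.url)
                    queue.append(result)
            else:
                # Step 4: hand every item to the pipeline.
                pipeline.process_item(result)
    return pipeline.stored
```

Calling crawl(["http://example.com/1"], CollectPipeline()) visits both pages once, even though they link to each other. The seen set plays the role of Scrapy's duplicate-request filtering, which is what keeps the stream of URLs from being truly endless in practice.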

Here is the spider I wrote this month to crawl Qiushibaike:

from scrapy.spider import BaseSpider
import random, uuid
from BeautifulSoup import
