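Scrapy is a fun crawling framework. The basic usage: give it a set of starting URLs, let the spider GET those pages, then parse them and pull out whatever you like.
Using it feels a bit like Django: there are settings, there are fields, and it auto-generates a bunch of stuff for you.
Usage: scrapy-admin.py startproject abc generates a project. Try it and you'll see what gets generated.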
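Create a new .py file in the spiders package and write your custom spider class in it.
A custom spider class must have the attributes domain_name and start_urls, and an instance method parse(self, response).
The spider is instantiated when Scrapy looks for our spiders, and the Scrapy engine picks it up automatically.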
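To make the shape of such a class concrete, here is a minimal sketch. It is a plain Python stand-in, not the real BaseSpider: the class name ExampleSpider, the example.com URLs, and the idea that response is just an HTML string are all assumptions made so the snippet runs without Scrapy installed. In a real project the class would subclass Scrapy's spider base class and receive a real response object.

```python
# Toy stand-in, NOT Scrapy's real spider base class. It only mirrors the
# three members the text says a custom spider needs.
class ExampleSpider(object):
    domain_name = "example.com"            # hypothetical domain
    start_urls = ["http://example.com/"]   # hypothetical start URL

    def parse(self, response):
        # Assumption: `response` is a plain HTML string here, not a
        # real Scrapy response object.
        return [{"length": len(response)}]


spider = ExampleSpider()
items = spider.parse("<html>hello</html>")
```

Here parse just returns a one-element list holding a dict; in a real spider this is where the items (or further requests) would come from.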
How a spider runs:
-
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
The key to this first step is start_requests(): it builds the first requests from start_urls, with parse as the callback.
-
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So parse can return Requests, or items, or a generator that yields them. The URLs in those Requests are eventually handed to the downloader to fetch, and from there an endless stream of URLs and items comes out.
-
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
You can use any selector you like. Scrapy doesn't care how you produce the items; it just happens to ship an XPath selector. I've seen other people use lxml. I prefer BeautifulSoup, even though bs is the slowest of the bunch...
-
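Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
In the end these items get handed to the pipeline, where you can do all sorts of processing on them: save them to a database, write them to a file, and so on.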
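The four steps above can be sketched as a toy crawl loop in plain Python. This is only a model of the flow, not how Scrapy is actually implemented: Request, ToySpider, FAKE_PAGES, save_item and crawl are all made-up names, and the "downloader" is a dict lookup so the sketch runs offline.

```python
from collections import deque


class Request(object):
    """Toy request: a URL plus the callback that handles its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# Fake "web" so the sketch runs offline; URLs and page format are hypothetical.
# Each page body is "<text>|<next url or empty>".
FAKE_PAGES = {
    "http://example.com/1": "joke one|http://example.com/2",
    "http://example.com/2": "joke two|",
}


class ToySpider(object):
    start_urls = ["http://example.com/1"]

    def start_requests(self):
        # Step 1: one Request per start URL, with parse as the callback.
        return [Request(url, self.parse) for url in self.start_urls]

    def parse(self, body):
        # Steps 2-3: yield an item, plus a follow-up Request if there is a link.
        text, next_url = body.split("|")
        yield {"text": text}
        if next_url:
            yield Request(next_url, self.parse)


def save_item(item, store):
    # Step 4: stand-in for an item pipeline (here it just appends to a list).
    store.append(item)


def crawl(spider):
    queue = deque(spider.start_requests())
    seen = set()
    store = []
    while queue:
        request = queue.popleft()
        if request.url in seen:
            continue  # do not fetch the same URL twice
        seen.add(request.url)
        body = FAKE_PAGES[request.url]  # stand-in for the downloader
        for result in request.callback(body):
            if isinstance(result, Request):
                queue.append(result)      # more pages to fetch
            else:
                save_item(result, store)  # items go to the "pipeline"
    return store


stored = crawl(ToySpider())
```

The loop only stops because the fake site runs out of links; on a real site the queue keeps growing, which is the "endless stream" of URLs and items.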
Here is the spider I wrote this month to crawl Qiushibaike (糗事百科):
from scrapy.spider import BaseSpider
import random, uuid
from BeautifulSoup import