This post continues the previous one, "Scraping Douban Group Data with Scrapy (Part 1)": http://my.oschina.net/chengye/blog/124157
How to make the Spider crawl Douban group pages automatically in Scrapy
1. Import CrawlSpider, another predefined spider class that ships with Scrapy:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
2. Define a new class GroupSpider based on CrawlSpider and add the crawling rules:
class GroupSpider(CrawlSpider):
    name = "Group"          # spider name; the original value is not shown in the post
    allowed_domains = ["douban.com"]
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )), callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )), follow=True, process_request='add_cookie'),
    ]
start_urls predefines all the Douban group category pages; the spider starts from these pages to discover groups.
The rules definition is the most important part of a CrawlSpider. It can be read as: when the spider encounters a certain kind of page, this is how it should handle it.
For example, the following rule handles pages whose URL ends in /group/XXXX/, uses parse_group_home_page as the callback, and calls add_cookie before the request is sent so that cookie information is attached.
Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )), callback='parse_group_home_page', process_request='add_cookie'),
As another example, the following rule fetches matching pages and automatically follows the links found on them for further crawling, but because it has no callback, it does not extract any other data from those pages.
Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )), follow=True, process_request='add_cookie'),
How to add cookies
Define the following function and, as shown in the Rule definitions above, add process_request='add_cookie' to the Rule.
def add_cookie(self, request):
    # Request.replace() returns a copy rather than modifying the request in place,
    # so return the new request for the crawler to use
    return request.replace(cookies=[
        {'name': 'COOKIE_NAME', 'value': 'VALUE', 'domain': '.douban.com', 'path': '/'},
    ])
Most websites store the user's session information in client-side cookies, so attaching cookie information lets the spider scrape data as if it were a logged-in user.
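If you log in to douban.com in a browser, you can copy the Cookie request header from the developer tools and convert it into the list of dicts that add_cookie() expects. The helper below is a minimal sketch of that conversion; cookies_from_header is my own name and is not part of the original code.

def cookies_from_header(header, domain='.douban.com', path='/'):
    # Turn a raw "name1=value1; name2=value2" Cookie header into the
    # list-of-dicts format used by Request.replace(cookies=...)
    cookies = []
    for pair in header.split(';'):
        name, _, value = pair.strip().partition('=')
        cookies.append({'name': name, 'value': value, 'domain': domain, 'path': path})
    return cookies

# e.g. cookies_from_header('COOKIE_NAME=VALUE; OTHER_NAME=OTHER_VALUE')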
How to keep the spider from getting banned
First, try attaching a logged-in user's cookies to your requests. Even when you are scraping public pages, sending cookies may keep the spider from being blocked at the application layer. I have not verified this myself, but it certainly does no harm.
Second, even as an authorized user, your IP may be banned if you visit too frequently, so you generally want the spider to pause for 1-2 seconds between requests.
Finally, configure the User-Agent and, where possible, rotate through different User-Agent strings (see the middleware sketch after the settings below).
In the Scrapy project's settings.py, add the following settings:
DOWNLOAD_DELAY = 2            # base pause between requests; the 1-2 second delay mentioned above
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
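The USER_AGENT setting above only sets a single fixed string. One common way to rotate User-Agents is a small downloader middleware that picks a random string for every request. The sketch below uses my own naming (RandomUserAgentMiddleware in douban/middlewares.py); it is not from the original post.

# douban/middlewares.py (hypothetical file name)
import random

class RandomUserAgentMiddleware(object):
    # add as many User-Agent strings here as you like
    user_agents = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # overwrite the User-Agent header with a randomly chosen one
        request.headers['User-Agent'] = random.choice(self.user_agents)

Then enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgentMiddleware': 400,
}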
================
At this point, the spider that crawls Douban group pages is complete; the full GroupSpider code is listed below. Next, you can follow the same pattern to define a Spider that scrapes the data on group discussion pages (a possible rule for that is sketched after the listing), and then let the spider loose. Have Fun!
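The listing imports DoubanItem from douban.items, a file the post does not show. A minimal sketch of it, assuming only the four fields the spider actually fills in, would be:

# douban/items.py -- minimal sketch, not shown in the original post
from scrapy.item import Item, Field

class DoubanItem(Item):
    groupName = Field()
    groupURL = Field()
    totalNumber = Field()
    RelativeGroups = Field()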
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from douban.items import DoubanItem
import re

class GroupSpider(CrawlSpider):
    name = "Group"          # spider name; the original value is not shown in the post
    allowed_domains = ["douban.com"]
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )), callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )), follow=True, process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        # extract the group id from a URL like http://www.douban.com/group/<id>/
        m = re.search("^http://www.douban.com/group/([^/]+)/$", url)
        if m:
            return m.group(1)
        return None

    def add_cookie(self, request):
        # Request.replace() returns a copy, so return the new request
        return request.replace(cookies=[
            {'name': 'COOKIE_NAME', 'value': 'VALUE', 'domain': '.douban.com', 'path': '/'},
        ])

    def parse_group_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)
        pass

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # group name, taken from the <h1> heading
        item['groupName'] = hxs.select('//h1/text()').re("^\s+(.*)\s+$")[0]

        # group URL and id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # member count, read from the "(NNN)" text of the members link
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)\)")
        item['totalNumber'] = members_text[0]

        # URLs of related ("friend") groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)

        return item
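As a possible next step toward the discussion-page spider mentioned above, a third Rule could follow each group's discussion list and hand every page to parse_group_topic_list. This is my own sketch, not from the post, and the URL pattern is an assumption about Douban's discussion-list URLs:

Rule(SgmlLinkExtractor(allow=('/group/[^/]+/discussion\?start=\d+', )),
     callback='parse_group_topic_list', follow=True, process_request='add_cookie'),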