Here is an analysis of the CrawlSpider source code.
1. URLs in start_urls are handled by parse, then _parse_response, and finally parse_start_url.
2. URLs matched by a Rule with no callback are requested via _requests_to_follow and then pass through _response_downloaded and _parse_response (which checks whether there is a callback; if the call came from parse, the callback is parse_start_url). If the call did not come from parse and the Rule has no callback, the page is only used for following links and its content is never scraped.
3. If the Rule does specify a callback, it is invoked and the page gets scraped.
4. When crawling Kaola, I overrode parse_start_url, set the callback of the nextPage Rule to parse, and put the first page's URL in start_urls. That way every list page goes through parse, then parse_start_url, and the items are finally scraped in parse_item.
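The dispatch logic in points 1-3 can be summarized in miniature. This is a simplified sketch of the flow described above, not the literal Scrapy source; the class and stub bodies are illustrative:

```python
# Simplified sketch of CrawlSpider's callback dispatch (not the real
# Scrapy implementation; method names mirror the ones discussed above).
class CrawlSpiderSketch:
    def parse(self, response):
        # Responses for start_urls land here first and are routed to
        # parse_start_url via _parse_response, with link-following on.
        return self._parse_response(response, self.parse_start_url, follow=True)

    def parse_start_url(self, response):
        # Default is a no-op; override it to scrape the start pages.
        return []

    def _requests_to_follow(self, response):
        # In Scrapy this applies every Rule's LinkExtractor and builds
        # Requests whose callback is _response_downloaded; stubbed here.
        return []

    def _parse_response(self, response, callback, follow=True):
        # 1) If a callback exists (parse_start_url or a Rule callback),
        #    its results are the scraped items/requests.
        if callback:
            for output in callback(response) or []:
                yield output
        # 2) If follow is True, links matched by the Rules become new
        #    Requests; with no callback the page is followed but not scraped.
        if follow:
            for request in self._requests_to_follow(response):
                yield request
```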
Straight to the code:
from scrapy import Request, Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

import items  # the project's items module, which defines KaolaItem


class KaolaSpider(CrawlSpider):
    name = "kaola"
    start_urls = ['http://www.kaola.com/search.html?key=coach&pageNo=1&type=2&pageSize=60&isStock=false&isSelfProduct=false&isDesc=true&brandId=&proIds=&isSearch=0&isPromote=false&backCategory=&country=&lowerPrice=-1&upperPrice=-1&changeContent=type']
    rules = (
        # Follow the "next page" link and route each list page through
        # parse, so it ends up in parse_start_url as well.
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nextPage"]')),
             follow=True,
             callback='parse'),
    )

    def parse_start_url(self, response):
        sel = Selector(response)
        goods = sel.xpath('//ul[@id="result"]/li')
        currPage = ''.join(sel.xpath('//span[@class="num"]/i/text()').extract()).strip()
        i = 1
        for good in goods:
            item = items.KaolaItem()
            # Overall rank: each list page holds 60 goods.
            item['rank'] = i + (int(currPage) - 1) * 60
            i += 1
            item['currentPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[1]/text()').extract()).strip()
            item['marketPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[2]/del/text()').extract()).strip()
            tmp = good.xpath('.//div/div[@class="img"]/a/@href').extract()[0]
            # Detail links may be relative; prefix the host if needed.
            if "http://" not in tmp:
                detailUrl = "http://www.kaola.com" + tmp
            else:
                detailUrl = tmp
            item['goodUrl'] = detailUrl
            # Pass the half-filled item to the detail page via meta.
            r = Request(detailUrl, callback=self.parse_item)
            r.meta['item'] = item
            yield r

    def parse_item(self, response):
        item = response.meta['item']
        sel = Selector(response)
        item['name'] = ''.join(sel.xpath('//dt[@class="product-title"]/text()').extract()).strip()
        item['commentCount'] = ''.join(sel.xpath('//b[@id="commentCounts"]/text()').extract()).strip()
        params = sel.xpath('//ul[@class="goods_parameter"]/li')
        for param in params:
            # Keep the text as unicode so the label checks below work.
            text = ''.join(param.xpath('.//text()').extract()).strip()
            if "商品品牌" in text:
                item['brand'] = text
            elif "产品类型" in text:
                item['proType'] = text
            elif "适用人群" in text:
                item['fitPeople'] = text
        yield item
5. Another approach is to skip parse and parse_start_url entirely: issue the initial requests from start_requests with the callback set to parse_item, and set the callbacks of the Rules to parse_item as well, so all pages are handled uniformly. But how do you issue requests for multiple initial URLs from start_requests?
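Since start_requests is a generator, issuing several initial requests is just a matter of yielding one Request per URL. A minimal sketch of the unified approach from point 5, with a stand-in Request class so the snippet runs without Scrapy installed (in the real spider you would use scrapy.Request, and the page range is illustrative):

```python
# Stand-in for scrapy.Request, so this sketch is self-contained.
class Request(object):
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class UnifiedKaolaSpider(object):
    # start_requests is a generator, so it may yield any number of
    # initial Requests, all pointing at the same callback.
    def start_requests(self):
        for page_no in range(1, 4):  # illustrative: first three list pages
            url = ('http://www.kaola.com/search.html'
                   '?key=coach&type=2&pageSize=60&pageNo=%d' % page_no)
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # same item extraction as parse_item in the spider above
        pass
```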