Here is an analysis of the CrawlSpider source code.
1. URLs in start_urls are handled by parse, then _parse_response, and finally parse_start_url.
2. URLs matched by a Rule with no callback are requested via _requests_to_follow and then pass through _response_downloaded and _parse_response (which checks whether there is a callback; if the call came from parse, the callback is parse_start_url). If the call did not come from parse and the Rule has no callback, the page is only used for following links and its content is never scraped.
3. If the Rule does specify a callback, it is invoked and the page gets scraped.
4. When crawling Kaola, I overrode parse_start_url, set the callback of the nextPage Rule to parse, and put the first page's URL in start_urls. That way every list page goes through parse, then parse_start_url, and the items are finally scraped in parse_item.
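The dispatch logic in points 1-3 can be summarized in miniature. This is a simplified sketch of the flow described above, not the literal Scrapy source; the class and stub bodies are illustrative:

```python
# Simplified sketch of CrawlSpider's callback dispatch (not the real
# Scrapy implementation; method names mirror the ones discussed above).
class CrawlSpiderSketch:
    def parse(self, response):
        # Responses for start_urls land here first and are routed to
        # parse_start_url via _parse_response, with link-following on.
        return self._parse_response(response, self.parse_start_url, follow=True)

    def parse_start_url(self, response):
        # Default is a no-op; override it to scrape the start pages.
        return []

    def _requests_to_follow(self, response):
        # In Scrapy this applies every Rule's LinkExtractor and builds
        # Requests whose callback is _response_downloaded; stubbed here.
        return []

    def _parse_response(self, response, callback, follow=True):
        # 1) If a callback exists (parse_start_url or a Rule callback),
        #    its results are the scraped items/requests.
        if callback:
            for output in callback(response) or []:
                yield output
        # 2) If follow is True, links matched by the Rules become new
        #    Requests; with no callback the page is followed but not scraped.
        if follow:
            for request in self._requests_to_follow(response):
                yield request
```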
Straight to the code:
from scrapy import Request, Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

import items  # the project's items module, which defines KaolaItem


class KaolaSpider(CrawlSpider):
    name = "kaola"
    start_urls = ['http://www.kaola.com/search.html?key=coach&pageNo=1&type=2&pageSize=60&isStock=false&isSelfProduct=false&isDesc=true&brandId=&proIds=&isSearch=0&isPromote=false&backCategory=&country=&lowerPrice=-1&upperPrice=-1&changeContent=type']
    rules = (
        # Follow the "next page" link and route each list page through
        # parse, so it ends up in parse_start_url as well.
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nextPage"]')),
             follow=True,
             callback='parse'),
    )

    def parse_start_url(self, response):
        sel = Selector(response)
        goods = sel.xpath('//ul[@id="result"]/li')
        currPage = ''.join(sel.xpath('//span[@class="num"]/i/text()').extract()).strip()
        i = 1
        for good in goods:
            item = items.KaolaItem()
            # Overall rank: each list page holds 60 goods.
            item['rank'] = i + (int(currPage) - 1) * 60
            i += 1
            item['currentPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[1]/text()').extract()).strip()
            item['marketPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[2]/del/text()').extract()).strip()
            tmp = good.xpath('.//div/div[@class="img"]/a/@href').extract()[0]
            # Detail links may be relative; prefix the host if needed.
            if "http://" not in tmp:
                detailUrl = "http://www.kaola.com" + tmp
            else:
                detailUrl = tmp
            item['goodUrl'] = detailUrl
            # Pass the half-filled item to the detail page via meta.
            r = Request(detailUrl, callback=self.parse_item)
            r.meta['item'] = item
            yield r

    def parse_item(self, response):
        item = response.meta['item']
        sel = Selector(response)
        item['name'] = ''.join(sel.xpath('//dt[@class="product-title"]/text()').extract()).strip()
        item['commentCount'] = ''.join(sel.xpath('//b[@id="commentCounts"]/text()').extract()).strip()
        params = sel.xpath('//ul[@class="goods_parameter"]/li')
        for param in params:
            # Keep the text as unicode so the label checks below work.
            text = ''.join(param.xpath('.//text()').extract()).strip()
            if "商品品牌" in text:
                item['brand'] = text
            elif "产品类型" in text:
                item['proType'] = text
            elif "适用人群" in text:
                item['fitPeople'] = text
        yield item
5. Another approach is to skip parse and parse_start_url entirely: issue the initial requests from start_requests with the callback set to parse_item, and set the callbacks of the Rules to parse_item as well, so all pages are handled uniformly. But how do you issue requests for multiple initial URLs from start_requests?
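Since start_requests is a generator, issuing several initial requests is just a matter of yielding one Request per URL. A minimal sketch of the unified approach from point 5, with a stand-in Request class so the snippet runs without Scrapy installed (in the real spider you would use scrapy.Request, and the page range is illustrative):

```python
# Stand-in for scrapy.Request, so this sketch is self-contained.
class Request(object):
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class UnifiedKaolaSpider(object):
    # start_requests is a generator, so it may yield any number of
    # initial Requests, all pointing at the same callback.
    def start_requests(self):
        for page_no in range(1, 4):  # illustrative: first three list pages
            url = ('http://www.kaola.com/search.html'
                   '?key=coach&type=2&pageSize=60&pageNo=%d' % page_no)
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # same item extraction as parse_item in the spider above
        pass
```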