JD Book Crawler
1. Development Environment
Platform: Windows
Editor: VS Code
Modules: scrapy, scrapy-redis, re, json, time, copy
2. Page Analysis
The only tricky part is the price, which is served as JSON from a separate endpoint:
https://c0.3.cn/stock?skuId=11290882&cat=1713-3258-3303&venderId=1000013489&area=27_2468_2472_0
After inspecting the listing page, every query parameter can be recovered from the original page:
skuId=11290882: item["skuId"] = li.xpath('./@data-spu').extract_first()
cat=1713-3258-3303: item["num"], taken from the category list URL (//list.jd.com/<cat>.html) in parse
venderId=1000013489: item["venderId"] = li.xpath('.//div[@class="p-img"]/div/@data-venid').extract_first()
area=27_2468_2472_0: the same fixed value appears for every page, and the parameter cannot be omitted from the request
A sketch of assembling this URL from those pieces follows the list.
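As a quick check outside Scrapy, here is a minimal sketch of how those pieces combine into the stock URL and how the price is pulled out of the response text. The skuId/cat/venderId values are the ones from the example URL above; the sample response string is illustrative only, not the real API output:

```python
import re

# Values recovered from the listing page (example values from the URL above)
sku_id = "11290882"
cat = "1713,3258,3303"      # item["num"]: dashes in the list URL replaced with commas
vender_id = "1000013489"
area = "27_2468_2472_0"     # fixed value, must stay in the request

price_url = ("https://c0.3.cn/stock?skuId=%s&cat=%s&venderId=%s&area=%s"
             % (sku_id, cat, vender_id, area))
print(price_url)

# Illustrative fragment only -- the real stock API returns a larger JSON document
sample_response = '{"stock":{"jdPrice":{"m":"59.00"}}}'
price = re.findall(r'"m":"(.*?)"', sample_response)[0]
print(price)  # 59.00
```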
3. Main Code
The spider code (jdbook.py) is shown below.
```python
# -*- coding: utf-8 -*-
import scrapy
from copy import deepcopy
import re
import json
import time


class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com', 'c0.3.cn']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        # Top-level book categories on the sort page
        dt_list = response.xpath('//div[@class="mc"]/dl/dt')
        for dt in dt_list:
            item = {}
            item["b_cate"] = dt.xpath("./a/text()").extract_first()
            # Sub-categories live in the <dd> that follows each <dt>
            em_list = dt.xpath("./following-sibling::dd[1]/em")
            for em in em_list:
                item["s_href"] = em.xpath("./a/@href").extract_first()
                item["s_cate"] = em.xpath("./a/text()").extract_first()
                if item["s_href"] is not None:
                    # Extract the cat value needed by the price URL, e.g. "1713-3258-3303"
                    item["num"] = re.findall(r"//list\.jd\.com/(.*?)\.html", item["s_href"])[0]
                    # The stock API expects the comma-separated form
                    item["num"] = re.sub(r"-", ",", item["num"])
                    item["s_href"] = "https:" + item["s_href"]
                    yield scrapy.Request(
                        item["s_href"],
                        callback=self.parse_book_list,
                        meta={"item": deepcopy(item)}
                    )

    def parse_book_list(self, response):
        item = response.meta["item"]
        li_list = response.xpath("//div[@id='J_goodsList']/ul/li")
        for li in li_list:
            item["bookname"] = li.xpath('.//div[@class="p-name"]/a/em/text()').extract_first()
            item["bookimg"] = li.xpath('.//div[@class="p-img"]/a/@href').extract_first()
            item["author"] = li.xpath('.//div[@class="p-bookdetails"]/span[@class="p-bi-name"]/a/text()').extract_first()
            item["datetime"] = li.xpath('.//div[@class="p-bookdetails"]/span[@class="p-bi-date"]/text()').extract_first()
            item["press"] = li.xpath('.//div[@class="p-shopnum"]/a/text()').extract_first()
            item["skuId"] = li.xpath('./@data-spu').extract_first()
            item["venderId"] = li.xpath('.//div[@class="p-img"]/div/@data-venid').extract_first()
            # Build the stock/price API URL from the pieces collected above
            price_url = ("https://c0.3.cn/stock?skuId=" + str(item["skuId"])
                         + "&cat=" + str(item["num"])
                         + "&venderId=" + str(item["venderId"])
                         + "&area=27_2468_2472_0")
            item["price_url"] = price_url
            yield scrapy.Request(
                price_url,
                callback=self.price_detail,
                meta={"item": deepcopy(item)}
            )
        # Pagination: JD list pages use odd page numbers (page=1, 3, 5, ...)
        next_url = response.xpath("//div[@id='J_searchWrap']//a[9]/@href").extract_first()
        if next_url is not None:
            data = re.sub(r",", "%2C", item["num"])  # URL-encode the commas in the cat value
            for i in range(100):
                if i % 2 == 1:
                    next_url = "https://list.jd.com/list.html?cat=" + data + "&page=" + str(i)
                    yield scrapy.Request(
                        next_url,
                        callback=self.parse_book_list,
                        meta={"item": item}
                    )

    def price_detail(self, response):
        item = response.meta["item"]
        html = response.text
        try:
            # The stock response carries the price in an "m" field
            item["book_price"] = re.findall('"m":"(.*?)"', html)[0]
        except IndexError:
            print("Price not collected yet")
        finally:
            print(item)
            time.sleep(0.5)  # crude throttling; DOWNLOAD_DELAY in settings is the cleaner option
            yield item
```
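Since the spider already imports json, the price could also be read by parsing the stock response as JSON instead of running a regex over the raw text. This is only a sketch: the nesting ("stock" -> "jdPrice" -> "m") is an assumption about the response shape and should be verified against a real response before relying on it.

```python
import json

def extract_price(text):
    # Assumed response shape: {"stock": {"jdPrice": {"m": "59.00", ...}, ...}}
    # Returns None if the text is not JSON or a level is missing.
    try:
        data = json.loads(text)
    except ValueError:
        return None
    return data.get("stock", {}).get("jdPrice", {}).get("m")
```

In price_detail this would replace the re.findall line with item["book_price"] = extract_price(response.text).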
settings.py (the scrapy-redis related settings):

```python
# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler (shared request queue)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the queue and fingerprints in Redis so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://127.0.0.1:6379'
```
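Optionally, scrapy-redis can also push the yielded items into a Redis list. This is not part of the original settings, just a possible addition:

```python
# Optional: store scraped items in Redis as well (not in the original settings)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
```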
4. Problems
Because of frequent test runs against the price URL, the site started failing or refusing those requests, so the price JSON could not always be fetched. That is why the try/except was added around the price extraction.
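Rather than the time.sleep(0.5) inside the spider, Scrapy's own throttling settings can slow the crawl down and reduce the chance of the price endpoint rejecting requests. A minimal sketch for settings.py; the exact values are guesses and should be tuned:

```python
# Slow the crawl down instead of sleeping inside callbacks
DOWNLOAD_DELAY = 0.5                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Let Scrapy adjust the delay automatically based on server responsiveness
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
RETRY_TIMES = 3                       # retry failed price requests a few times
```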