Scraping Suning Books with Scrapy

爱吃猫的鱼101

Published 2021-05-16 01:25:55


Article link: https://blog.csdn.net/qq_43369592/article/details/116871023


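1. Project requirements

From each top-level category, collect the sub-categories it contains
From each sub-category, collect the book list, following pagination
From each book list, collect the details of every book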

2. Libraries needed

scrapy (implements the crawler itself)
re (regular expressions for data that XPath cannot reach)
copy (for deepcopy)

3. Getting started

Create the Scrapy project and the spider

scrapy startproject suning
scrapy genspider book suning.com

Set the start URL

start_urls = ['https://book.suning.com/']

Define the required fields in items.py

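import scrapy


class SuningItem(scrapy.Item):
    big_tag = scrapy.Field()     # top-level category name
    small_tag = scrapy.Field()   # sub-category name
    small_href = scrapy.Field()  # sub-category URL
    book_href = scrapy.Field()   # book detail page URL
    book_name = scrapy.Field()   # book title
    book_price = scrapy.Field()  # book price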

Group the top-level categories and sub-categories with XPath, then send a request for each sub-category

# Imports at the top of the spider file
import re
from copy import deepcopy

import scrapy
from suning.items import SuningItem

def parse(self, response):
    # Top-level category groups
    divs = response.xpath('//div[@class="menu-item"]')
    for div in divs:
        item = SuningItem()
        item['big_tag'] = div.xpath('./dl/dt/h3/a/text()').get()
        # Sub-category links inside this group
        dd_as = div.xpath('./dl/dd/a')
        for a in dd_as:
            item['small_tag'] = a.xpath('./text()').get()
            # URL of the sub-category
            item['small_href'] = a.xpath('./@href').get()
            yield scrapy.Request(
                item['small_href'],
                callback=self.parse_small_href,
                meta={'item': deepcopy(item)}
            )

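The item built so far is passed to the next callback through meta. Note that the item must be deep-copied with deepcopy before it is passed on. Why? Scrapy is built on the Twisted asynchronous networking framework, so the loop only schedules requests and their callbacks run later. Without a copy, every request would hold a reference to the same item object, which the loop keeps overwriting, so the callbacks would read the wrong values. deepcopy gives each request its own snapshot of the item.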
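The effect can be reproduced without Scrapy. The following is a minimal sketch, not the spider's code: a plain list stands in for Scrapy's request queue, and the category names and the schedule_requests helper are made up for illustration.

```python
from copy import deepcopy


def schedule_requests(copy_item):
    """Simulate parse(): queue one 'request' per sub-category, read the items later."""
    item = {"big_tag": "Literature"}  # hypothetical top-level category
    pending = []  # stands in for the scheduler queue
    for small_tag in ["Novels", "Poetry", "Essays"]:
        item["small_tag"] = small_tag
        # Either pass a snapshot or a reference to the shared object
        pending.append(deepcopy(item) if copy_item else item)
    # The "callbacks" only look at the items after the loop has finished
    return [queued["small_tag"] for queued in pending]


shared = schedule_requests(copy_item=False)
copied = schedule_requests(copy_item=True)
print(shared)  # ['Essays', 'Essays', 'Essays']
print(copied)  # ['Novels', 'Poetry', 'Essays']
```

Without the copy, all three queued entries point at the same dictionary and show only the last value written by the loop.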

From the book list of each sub-category, extract the link to every book's detail page and send a request for it

def parse_small_href(self, response):
    # Retrieve the item passed in from the previous callback
    item = response.meta['item']
    # Book entries on the list page
    book_list = response.xpath('//div[@id="filter-results"]//ul[@class="clearfix"]/li')
    for book in book_list:
        # The extracted href has no scheme, so 'https:' is prepended
        href = book.xpath('.//div[@class="wrap"]/div[@class="res-img"]//a/@href').get()
        item['book_href'] = 'https:' + href
        yield scrapy.Request(
            item['book_href'],
            callback=self.parse_xiangqing,
            meta={'item': deepcopy(item)}
        )
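Prepending 'https:' by hand works as long as every href starts with '//'. A slightly more forgiving option is URL joining, sketched here with the standard library; inside a spider, Scrapy's response.urljoin(href) does the same against the response URL. The URLs below are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL used as the base for joining
list_page_url = 'https://list.suning.com/1-502325-0.html'

hrefs = [
    '//product.suning.com/0070167435/10542076655.html',  # protocol-relative
    'https://product.suning.com/a.html',                 # already absolute
    '/relative/path.html',                               # host-relative
]

resolved = [urljoin(list_page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

All three forms resolve to absolute https URLs, so the callback does not need to know which form the page happens to use.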

Follow pagination on the book list page

# This block belongs inside parse_small_href
html = response.body.decode()
# Total number of pages in this book list
all_pages = int(re.findall(r'param.pageNumbers = "(.*?)";', html)[0])
# Current page number
page_cont = int(re.findall(r'param.currentPage = "(.*?)";', html)[0])
# Parameter needed to build the next-page URL
url_num = re.findall(r'<a    pagenum="1"  name="ssdln_(.*?)_bottom_page-1"', html)[0]
# Request the next page only if there is one
if page_cont < all_pages:
    next_url = 'https://list.suning.com' + '/1-{}-{}-0-0-0-0-0-0-4.html'.format(url_num, page_cont + 1)
    yield scrapy.Request(
        next_url,
        callback=self.parse_small_href,
        meta={'item': deepcopy(item)},
        dont_filter=False
    )

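Yes, this was the most frustrating part: the next-page URL simply cannot be obtained with XPath, and it took me a long time to work out.

Because XPath cannot reach the next-page link, I looked for the required values in the page source and pulled them out with regular expressions. The values found this way are not a usable URL on their own, so the URL has to be assembled from them before the request is sent.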
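The regex-and-build step can be tried in isolation. This is a minimal sketch that mirrors the logic above on a made-up fragment of page source; the sample text, the category id 502325, and the build_next_url helper are assumptions for illustration only.

```python
import re

# Hypothetical fragment of a list page's source
sample_source = '''
<script>
param.currentPage = "0";
param.pageNumbers = "3";
</script>
<a    pagenum="1"  name="ssdln_502325_bottom_page-1">1</a>
'''


def build_next_url(source):
    """Return the next list-page URL, or None when there is no further page."""
    total_pages = int(re.search(r'param\.pageNumbers = "(.*?)";', source).group(1))
    current_page = int(re.search(r'param\.currentPage = "(.*?)";', source).group(1))
    category_id = re.search(r'name="ssdln_(.*?)_bottom_page-1"', source).group(1)
    if current_page < total_pages:
        return 'https://list.suning.com/1-{}-{}-0-0-0-0-0-0-4.html'.format(
            category_id, current_page + 1)
    return None


next_url = build_next_url(sample_source)
print(next_url)  # https://list.suning.com/1-502325-1-0-0-0-0-0-0-4.html
```

Keeping the extraction in a small function like this makes it easy to check the pattern against saved page source before running a full crawl.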

Extract the information from the book detail page

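def parse_xiangqing(self, response):
    item = response.meta['item']
    # Take the last text node of the title and strip surrounding whitespace
    item['book_name'] = response.xpath('//div[@class="proinfo-title"]/h1/text()').extract()[-1].strip()
    # The price is not reachable with XPath, so match it in the raw page source
    item['book_price'] = re.findall(r'"itemPrice":"(.*?)"', response.body.decode())[0]
    yield item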

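The extracted book_name turned out to contain a lot of useless newlines and whitespace, so strip() is used to clean it up.

The book price cannot be obtained with XPath either; it has to be located in the page source and matched with a regular expression.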
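Both clean-up steps can be checked on their own. The sketch below uses made-up sample data and two hypothetical helpers, clean_title and extract_price; unlike the indexing with [0] and [-1] above, they return None instead of raising when nothing is found, which is one possible way to handle pages that lack the data.

```python
import re

# Hypothetical sample data: title text nodes and a piece of page source
title_nodes = ['\r\n\t\t\t\t', '\r\n  Python Crash Course  \r\n']
page_source = '{"itemName":"Python Crash Course","itemPrice":"59.80"}'


def clean_title(nodes):
    """Take the last text node and strip newlines and padding."""
    return nodes[-1].strip() if nodes else None


def extract_price(source):
    """Match the itemPrice value in the raw source, or return None."""
    match = re.search(r'"itemPrice":"(.*?)"', source)
    return match.group(1) if match else None


print(clean_title(title_nodes))      # Python Crash Course
print(extract_price(page_source))    # 59.80
print(extract_price('<html></html>'))  # None
```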

Configure settings.py

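# Any browser User-Agent string will do
USER_AGENT = ' '
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Delay between downloads, in seconds
DOWNLOAD_DELAY = 0.5

# ITEM_PIPELINES can be left on or off; it is not used in this project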

4. Sample results
(Screenshot of the scraped results; image not available.)
