Python Scrapy Crawler: Scraping All Items from a Selected Dangdang Store and Saving Them Locally (CSV, MongoDB)

Demonslzh6

Article link: https://blog.csdn.net/Demonslzh/article/details/104497255

I. Create the Project

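Before starting, set up the project:
1. Run scrapy startproject dangdang on the command line to create the dangdang project folder.
2. Run cd dangdang to enter the newly created directory.
3. Run scrapy genspider spider store.dangdang.com to generate the spider file. The second argument is the domain; the store URL http://store.dangdang.com/282 goes into start_urls in the generated file.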


II. Scrape the Subpage Links

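With the project created, analyze the page (press F12 in Chrome to open the developer tools). Hover over the field you need to see its source, then copy its XPath:
//*[@id="sidefloat"]/div[2]/div/div/map/area/@href
The leftmost sidebar lists all the categories, so we can crawl the pages for Literature, Fiction, and so on one by one.
Record this XPath for later use.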
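
To make the structure behind that XPath concrete, here is a minimal sketch that runs the same path against a small, hypothetical HTML fragment. The markup, titles, and example.com URLs are assumptions made up for illustration, not the real Dangdang page. It uses the standard library's ElementTree, which supports only a subset of XPath, so the attribute is read with .get() instead of a trailing /@href; Scrapy's own selectors accept the full expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the nesting implied by the XPath.
SAMPLE = """
<html><body>
<div id="sidefloat">
  <div>header</div>
  <div><div><div>
    <map>
      <area href="http://example.com/list?cat=1" title="Literature" alt="" />
      <area href="http://example.com/list?cat=2" title="Fiction" alt="" />
    </map>
  </div></div></div>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same path as in the article, minus the final /@href step.
areas = root.findall(".//*[@id='sidefloat']/div[2]/div/div/map/area")

# Pair each category title with its link.
links = [(area.get("title"), area.get("href")) for area in areas]
print(links)
```

Each area element carries both the link (href) and the category name (title), which is why the spider later extracts both from the same nodes.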

III. Define the Item Fields for Each Book (items.py)

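For each book, the fields to scrape are the price, the number of comments (used here as a stand-in for sales), the title, the author, and the category (obtained in the previous step).
Open items.py in the project and define these fields.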

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # Fields to scrape for each book
    name = scrapy.Field()      # book title
    author = scrapy.Field()    # author
    price = scrapy.Field()     # price
    comments = scrapy.Field()  # number of comments (sales)
    category = scrapy.Field()  # book category

We also need the XPath for each of these fields. Copy each one from the developer tools in the same way and adjust it as needed.

IV. Parse the Pages in the Spider (spider.py)

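The overall implementation has two steps, matching sections II and III: first collect the category links, then parse the books on each category page.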

import re

import scrapy

from dangdang.items import DangdangItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    # allowed_domains takes bare domains, not URLs with a path
    allowed_domains = ['store.dangdang.com']
    start_urls = ['http://store.dangdang.com/282']

    def parse(self, response):
        '''
        Crawl the store home page, collect the link of each entry in the
        floating sidebar on the left, and request every page of each category.
        '''
        # Links to the category listing pages
        urls = response.xpath('//*[@id="sidefloat"]/div[2]/div/div/map/area/@href').extract()
        # Raw <area> elements, from which the category names are extracted
        categories = response.xpath('//*[@id="sidefloat"]/div[2]/div/div/map/area').extract()

        for url, category in zip(urls, categories):
            # Extract the category name from the title attribute with a regex
            text = re.findall('title="(.*?)" alt', category)

            for i in range(1, 20):  # pages 1 to 19
                # Build the URL of each results page
                url_now = url + '&page_index=' + str(i)

                # Parse the page, passing the category along to the callback
                yield scrapy.Request(
                    url=url_now,
                    callback=self.parse_subpage,
                    cb_kwargs={'category': text},
                    dont_filter=True,
                )

    def parse_subpage(self, response, category):
        '''
        Extract the information for every product on one page.
        Each product becomes one item.
        '''
        # Number of books on the page (optional; counting on the site gives 24)
        length = len(response.xpath('//*[@id="component_0__0__8395"]/li/a/img').extract())
        # XPath positions are 1-based, so start at 1 rather than 0
        for i in range(1, length + 1):
            item = DangdangItem()

            item['name'] = response.xpath('//*[@id="component_0__0__8395"]/li[{}]/p[2]/a/text()'.format(i)).extract()
            item['author'] = response.xpath('//*[@id="component_0__0__8395"]/li[{}]/p[5]/text()'.format(i)).extract()
            item['price'] = response.xpath('//*[@id="component_0__0__8395"]/li[{}]/p[1]/span[1]/text()'.format(i)).extract()
            item['comments'] = response.xpath('//*[@id="component_0__0__8395"]/li[{}]/p[4]/a/text()'.format(i)).extract()
            item['category'] = category

            yield item
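
The spider builds each page URL by appending '&page_index=N', which only works when the category link already contains a query string. As a hedged alternative, the sketch below sets the parameter with urllib.parse so it works either way. The helper name page_urls and the example.com URLs are hypothetical; the page_index parameter and the 1 to 19 range come from the spider above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def page_urls(base_url, first_page=1, last_page=19, param="page_index"):
    """Return base_url once per page, with the page parameter set."""
    parts = urlparse(base_url)
    # Keep the existing query parameters, dropping any stale page parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    urls = []
    for page in range(first_page, last_page + 1):
        new_query = urlencode(query + [(param, str(page))])
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls


# Hypothetical category link, used only to show the output shape.
for url in page_urls("http://example.com/list?cat=1", 1, 3):
    print(url)
```

Inside parse, the inner loop could then iterate over page_urls(url) instead of concatenating strings.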

This completes the scraping logic. However, the data is only scraped at this point, not saved locally, so next we set up a way to store it.

V. Save the Scraped Content Locally (pipelines.py)

The pipelines.py file lets us save every item the spider yields to a local destination automatically. Either of the two methods below works; choose one.

1. Saving to MongoDB

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

class DangdangPipeline(object):
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.test = self.client['dangdang']  # database "dangdang" (created on first write)
        self.post = self.test['book']  # collection "book"

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a dict before inserting
        flag = 1  # 1 means no field is empty
        for key, value in data.items():
            if value == []:
                flag = 0
                break
            if isinstance(value, list):
                data[key] = value[0]  # keep only the first extracted value
        if flag == 1:
            self.post.insert_one(data)  # insert() was removed in PyMongo 4
        return item  # pipelines should return the item for later stages
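
The filtering step in process_item can also be factored into a small standalone function, which makes it easy to check without a running MongoDB. This is a sketch under assumptions: the name flatten_item is hypothetical, and the whitespace stripping is an addition that the pipeline above does not perform.

```python
def flatten_item(data):
    """Return a copy with lists reduced to their first value,
    or None if any list field is empty."""
    cleaned = {}
    for key, value in data.items():
        if isinstance(value, list):
            if not value:
                return None  # an empty field means the row is incomplete
            value = value[0]
        # Assumption: trimming whitespace is desirable for scraped text.
        cleaned[key] = value.strip() if isinstance(value, str) else value
    return cleaned


sample = {"name": [" Book A "], "price": ["12.50"], "category": ["Fiction"]}
print(flatten_item(sample))
print(flatten_item({"name": [], "price": ["12.50"]}))
```

In the pipeline, a None result would mean skipping the insert, and any other result would be passed to insert_one.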

2. Saving to CSV

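import csv

class DangdangPipeline(object):

    def __init__(self):
        # Path of the CSV file; it does not need to exist beforehand
        store_file = "dangdang.csv"

        self.file = open(store_file, 'a+', encoding="utf-8", newline='')
        # Create the CSV writer
        self.writer = csv.writer(self.file, dialect="excel")

    def process_item(self, item, spider):
        # Write the row only when the name field is not empty
        if len(item['name']) != 0:
            self.writer.writerow([item['name'], item['author'], item['price'], item['comments'], item['category']])
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes so buffered rows are flushed
        self.file.close()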
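
The CSV pipeline above writes rows without a header. If a header row is wanted, one possible approach is csv.DictWriter, writing the header only when the file is new or empty. The sketch below is standalone and is not the author's pipeline; the helper name append_rows is hypothetical, and the field order is taken from the writerow call above.

```python
import csv
import os

# Column order taken from the pipeline's writerow call.
FIELDS = ["name", "author", "price", "comments", "category"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)
```

Opening the file per call keeps the sketch simple; a pipeline would more likely open it once and close it in close_spider.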

After writing the pipeline, enable it in settings.py by adding it to ITEM_PIPELINES. With that, the scraping work is complete.
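
The original screenshot of the setting is not available, so the following is a minimal sketch of what the entry usually looks like. It assumes the project is named dangdang and the pipeline class lives in dangdang/pipelines.py, as in the steps above. The value 300 is an arbitrary priority; pipelines with lower numbers run first.

```python
# settings.py (excerpt)
# Assumes the module path dangdang.pipelines and the class name DangdangPipeline.
ITEM_PIPELINES = {
    "dangdang.pipelines.DangdangPipeline": 300,
}
```

The crawl can then be started from the project directory with scrapy crawl spider.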

VI. Inspect and Clean the Data, Then Start the Analysis

The whole crawler is now complete.
Next, create a new script to read the scraped data.
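
The original reading script was shown only as an image, so here is a minimal sketch of one way to load and lightly clean the CSV using just the standard library. It assumes the file has no header row and that the columns follow the order written by the CSV pipeline; the function name load_books and the price-parsing rule are assumptions for illustration.

```python
import csv
import re

# Column order written by the CSV pipeline (the file has no header row).
COLUMNS = ["name", "author", "price", "comments", "category"]


def load_books(path):
    """Read the scraped CSV into a list of dicts, parsing the price as a float."""
    books = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) != len(COLUMNS):
                continue  # skip malformed rows
            book = dict(zip(COLUMNS, row))
            # Pull the first number out of the price text, if there is one.
            match = re.search(r"\d+(?:\.\d+)?", book["price"])
            book["price"] = float(match.group()) if match else None
            books.append(book)
    return books
```

Calling load_books("dangdang.csv") returns one dict per book. Because the price is located with a regular expression, the parsing also tolerates values that were written as stringified lists.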
