Scrapy Crawler Framework: Storing Data as JSON with a Custom Pipeline

Latest recommended article published on 2022-12-05 14:43:35

Jay丶萧邦

Category column: Python编程由简到繁

Article link: https://blog.csdn.net/qq_42543244/article/details/81486897


Scrapy ships with a built-in way to save scraped items as JSON. Run the following command in a terminal:

scrapy crawl xxx -o xxx.json -s FEED_EXPORT_ENCODING=utf-8
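
The same built-in export can also be configured in settings.py instead of on the command line. The fragment below is a minimal sketch that assumes Scrapy 2.1 or newer, where the FEEDS setting is available; the output path book.json is only an illustrative choice.

```python
# settings.py fragment (assumption: Scrapy 2.1+, which supports FEEDS)
FEEDS = {
    "book.json": {           # illustrative output path
        "format": "json",    # use the built-in JSON exporter
        "encoding": "utf8",  # write non-ASCII text as-is instead of \uXXXX escapes
    },
}
```

With this in place, running scrapy crawl xxx without -o should write the feed to the configured file.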

In this post we will instead store the data as JSON through a custom pipeline.

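We use a novel site as the example: https://www.hongxiu.com/. The goal is to scrape the first five pages of every category in the site's female-reader section and, for each novel, collect its title, author, genre, word count, number of favorites, number of clicks, synopsis, and cover image. The cover images are downloaded to disk; everything else is stored as JSON.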

The red boxes in the screenshot mark the content we want to extract.

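1. Create the Scrapy project and the spider (for example, scrapy startproject hongxiu followed by scrapy genspider book hongxiu.com inside the project directory; these names match the BOT_NAME and spider name used below).

2. Write the code in book.py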

# -*- coding: utf-8 -*-
import re

import scrapy

from ..items import HongxiuItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['hongxiu.com']
    start_urls = ['https://www.hongxiu.com/all?gender=2&catId=-1']

    # 1. Collect the novel categories listed in the female-reader section
    def parse(self, response):
        type_list = response.xpath('//ul[@type="category"]/li/a/@href').extract()
        # Skip the first entry, which is the "All" category
        for href in type_list[1:]:
            # Build the absolute URL of the category page
            url = 'https://www.hongxiu.com' + href
            # The value after catId= identifies the category; this pattern
            # also matches when catId is the last query parameter
            match = re.search(r'catId=([^&]*)', url)
            if match is None:
                continue
            yield scrapy.Request(url=url, meta={'type': match.group(1)}, callback=self.get_content_with_type_url)
    # 2. Request the first five listing pages of each category
    def get_content_with_type_url(self, response):
        cat_id = response.meta['type']
        for page_num in range(1, 6):
            # Build the listing URL for this page
            url = ('https://www.hongxiu.com/all?pageNum=' + str(page_num)
                   + '&pageSize=10&gender=2&catId=' + cat_id
                   + '&isFinish=-1&isVip=-1&size=-1&updT=-1&orderBy=0')
            yield scrapy.Request(url=url, callback=self.get_book_with_url)
    # 3. Collect the detail-page URL of every novel on a listing page
    def get_book_with_url(self, response):
        book_list = response.xpath('//div[@class="book-info"]/h3/a/@href').extract()
        for book in book_list:
            # Build the absolute URL of the novel's detail page
            url = 'https://www.hongxiu.com' + book
            yield scrapy.Request(url=url, callback=self.get_detail_with_url)
    # 4. Extract the fields we want from the detail page
    def get_detail_with_url(self, response):
        # extract() converts a SelectorList into a plain list of strings.
        # extract_first('') returns the first element of that list, or the
        # default value ('' here) when nothing matched.
        book_type = response.xpath('//div[@class="crumbs-nav center1020"]/span/a[2]/text()').extract_first('')
        print(book_type)
        name = response.xpath('//div[@class="book-info"]/h1/em/text()').extract_first('')
        print(name)
        author = response.xpath('//div[@class="book-info"]/h1/a/text()').extract_first('')
        print(author)
        total = response.xpath('//p[@class="total"]/span/text()').extract_first('') + response.xpath('//p[@class="total"]/em/text()').extract_first('')
        print(total)
        love = response.xpath('//p[@class="total"]/span[2]/text()').extract_first('') + response.xpath('//p[@class="total"]/em[2]/text()').extract_first('')
        print(love)
        click = response.xpath('//p[@class="total"]/span[3]/text()').extract_first('') + response.xpath('//p[@class="total"]/em[3]/text()').extract_first('')
        print(click)
        # Strip every synopsis line and join them, so the item stores the
        # whole synopsis rather than only its last line
        introduce_list = response.xpath('//p[@class="intro"]/text()').extract()
        introduce = '\n'.join(line.strip() for line in introduce_list if line.strip())
        print(introduce)
        # Prepend the scheme to the cover image src and drop stray carriage returns
        url = 'http:' + response.xpath('//div[@class="book-img"]//img/@src').extract_first('')
        url = url.replace('\r', '')
        print(url)
        print('----------------------------------------------------')
        # Create one Item object per novel and store the parsed values in its fields
        item = HongxiuItem()
        item['type'] = book_type
        item['name'] = name
        item['author'] = author
        item['total'] = total
        item['love'] = love
        item['click'] = click
        item['introduce'] = introduce
        # ImagesPipeline expects a list of URLs
        item['url'] = [url]
        yield item

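At this point we can temporarily comment out the code below print('----------------------------------------------------'), then run scrapy crawl book and press Enter. If output like that in the screenshot appears, the fields are being extracted correctly.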
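
The category ID extraction and the listing-URL construction in the spider can also be expressed with the standard library's urllib.parse rather than a regular expression and string concatenation. The following is a minimal sketch, not the code the spider uses: the helper names extract_cat_id and build_page_url are hypothetical, and the category ID 30001 is a made-up value used only for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.hongxiu.com/all"


def extract_cat_id(url):
    """Return the catId query parameter of url, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get("catId")
    return values[0] if values else None


def build_page_url(cat_id, page_num):
    """Build the listing URL for one page of one category."""
    params = {
        "pageNum": page_num,
        "pageSize": 10,
        "gender": 2,
        "catId": cat_id,
        "isFinish": -1,
        "isVip": -1,
        "size": -1,
        "updT": -1,
        "orderBy": 0,
    }
    return BASE_URL + "?" + urlencode(params)


# 30001 is a made-up category ID, used only to exercise the helpers.
sample = "https://www.hongxiu.com/all?gender=2&catId=30001"
cat_id = extract_cat_id(sample)
page_urls = [build_page_url(cat_id, n) for n in range(1, 6)]
```

Parsing the query string this way does not depend on where catId sits in the URL, and urlencode takes care of escaping if a parameter value ever contains special characters.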

3. items.py

import scrapy


class HongxiuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    type = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    total = scrapy.Field()
    love = scrapy.Field()
    click = scrapy.Field()
    introduce = scrapy.Field()
    url = scrapy.Field()

4. pipelines.py: the custom pipeline is defined here

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


# Custom pipeline that writes items to a JSON file
class HongxiuPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts
        print('Spider started')
        # Open the local JSON file; mode w+ creates it, or truncates it if it exists
        self.file = codecs.open(filename='book.json', mode='w+', encoding='utf-8')
        # Wrap the list in an object so that the finished file is valid JSON
        self.file.write('{"book_list":[')
        self.first_item = True

    # Scrapy calls process_item() for every item, so writing to a local
    # file or a database belongs here
    def process_item(self, item, spider):
        # Write the separator before every item except the first, so there is
        # no trailing ',\n' to seek back to and delete when the spider closes
        if not self.first_item:
            self.file.write(',\n')
        self.first_item = False
        # Convert the item to a dict, then to a JSON string
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line)
        # Return the item so that later pipelines can process it
        return item

    def close_spider(self, spider):
        # Called once when the spider closes
        print('Spider finished')
        # Close the list and the enclosing object
        self.file.write(']}')
        self.file.close()
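
One way to check that a pipeline of this kind produces a file that json.load can read is to call its three hooks by hand, without running Scrapy at all. The sketch below is a hypothetical stand-in rather than the pipeline from this post: the names JsonArrayPipeline and run_demo are illustrative, the output path is a temporary file, and the two sample items are invented. It wraps the list in an object and writes the separator before each item except the first, so the file stays valid even when no items arrive.

```python
import json
import os
import tempfile


class JsonArrayPipeline:
    """Hypothetical stand-in exposing the same three hooks as a Scrapy pipeline."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")
        self.file.write('{"book_list":[')
        self.first_item = True

    def process_item(self, item, spider):
        # Separator goes before every item except the first one.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        self.file.write("]}")
        self.file.close()


def run_demo(items):
    """Drive the hooks in the order Scrapy would, then parse the result."""
    path = os.path.join(tempfile.mkdtemp(), "book.json")
    pipeline = JsonArrayPipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Invented sample items, used only to exercise the writer.
result = run_demo([{"name": "书一", "author": "A"}, {"name": "书二", "author": "B"}])
empty = run_demo([])
```

Because ensure_ascii=False is passed to json.dumps, Chinese text is written to the file as readable characters rather than \uXXXX escapes.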

5. settings.py

BOT_NAME = 'hongxiu'

SPIDER_MODULES = ['hongxiu.spiders']
NEWSPIDER_MODULE = 'hongxiu.spiders'

ROBOTSTXT_OBEY = False

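ITEM_PIPELINES = {
    # Custom pipeline that writes the JSON file
    'hongxiu.pipelines.HongxiuPipeline': 300,
    # Built-in pipeline that downloads images
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Directory where downloaded images are stored
IMAGES_STORE = 'IMAGES'
# Item field that holds the list of image URLs
IMAGES_URLS_FIELD = 'url'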

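With everything in place, run scrapy crawl book and press Enter. The run should end as shown in the screenshot.

Open book.json and you will see content like the following.

Open the full folder:

These are the downloaded novel cover images, which completes the task.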
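
The file names inside the full folder look random because, to the best of my understanding of ImagesPipeline's default behavior, each image is saved under the SHA-1 hash of its URL with a .jpg extension. The sketch below reproduces that naming scheme with the standard library; the function name default_image_path is hypothetical and the URL is a made-up placeholder, so treat it as an illustration of the scheme rather than Scrapy's actual code.

```python
import hashlib


def default_image_path(url):
    """Relative path that ImagesPipeline is assumed to use for url by default."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/" + image_guid + ".jpg"


# Placeholder URL, not a real cover image.
path = default_image_path("http://example.com/cover.jpg")
```

The resulting path is relative to IMAGES_STORE, so with the settings above a cover would end up under IMAGES/full/.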

That is how to save scraped data as JSON using a custom pipeline.
