网络爬虫—05Scrapy爬虫框架

最新推荐文章于 2024-09-21 17:56:48 发布

小黑--

最新推荐文章于 2024-09-21 17:56:48 发布

阅读量526

点赞数

分类专栏：网络爬虫文章标签：网络爬虫 python

本文链接：https://blog.csdn.net/qq_41645466/article/details/105641473

版权

网络爬虫专栏收录该内容

5 篇文章 0 订阅

订阅专栏

文章目录

一、Scrapy架构流程
二、Scrapy爬虫步骤
三、三国演义名著定向爬虫项目
四、item详解

一、Scrapy架构流程

1.简介

Scrapy，Python开发的一个快速、高层次的屏幕抓取和web抓取框架，用于抓取web站点并从页面中提取结构化的数据。
Scrapy吸引人的地方在于它是一个框架，任何人都可以根据需求方便的修改。
它也提供了多种类型爬虫的基类，如BaseSpider、sitemap爬虫等，最新版本又提供了web2.0爬虫的支持。
Scrap，是碎片的意思，这个Python的爬虫框架叫Scrapy。

2.优势

用户只需要定制开发几个模块，就可以轻松实现爬虫, 用来抓取网页内容和图片，非常方便;
Scrapy使用了Twisted异步网络框架来处理网络通讯，加快网页下载速度，不需要自己实现异步框架和多线程等，并且包含了各种中间件接口，灵活完成各种需求

3.架构流程图

在这里插入图片描述
绿色为数据流向

只有当调度器中不存在任何request时，整个程序才会停止。(注:对于下载失败的URL，Scrapy也会重新下载. )

4.组件

引擎(Scrapy): 用来处理整个系统的数据流, 触发事务(框架核心)
调度器(Scheduler): 用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL（抓取网页的网址或者说是链接）的优先队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址
下载器(Downloader): 用于下载网页内容, 并将网页内容返回给蜘蛛(Scrapy下载器是建立在twisted这个高效的异步模型上的)
爬虫(Spiders): 爬虫是主要干活的, 用于从特定的网页中提取自己需要的信息, 即所谓的实体(Item)。用户也可以从中提取出链接,让Scrapy继续抓取下一个页面
项目管道(Pipeline):负责处理爬虫从网页中抽取的实体，主要的功能是持久化实体、验证实体的有效性、清除不需要的信息。当页面被爬虫解析后，将被发送到项目管道，并经过几个特定的次序处理数据。
下载器中间件(Downloader Middlewares): 位于Scrapy引擎和下载器之间的框架，主要是处理Scrapy引擎与下载器之间的请求及响应。
爬虫中间件(Spider Middlewares):介于Scrapy引擎和爬虫之间的框架，主要工作是处理蜘蛛的响应输入和请求输出
调度中间件(Scheduler Middewares): 介于Scrapy引擎和调度之间的中间件，从
Scrapy引擎发送到调度的请求和响应。

二、Scrapy爬虫步骤

新建项目(scrapy startproject xxx)
命令行输入：
scrapy startproject 项目名称
cd 项目名称
scrapy genspider 爬虫名称爬取网址
明确目标(编写item.py)
明确你要抓取的目标
制作爬虫(爬虫名称.py)
制作爬虫，开始爬取网页;
存储爬虫(pipelines.py)
设置管道存储爬取内容
运行时，命令行输入:
scrapy crawl 爬虫名称
运行并存储：scrapy crawl 爬虫名称 -o 文件名

三、三国演义名著定向爬虫项目

页面分析：在这里插入图片描述

创建项目:

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.loader.processors import TakeFirst

class ScrapyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class BookItem(scrapy.Item):
    # 定义item类：
    #       继承scrapy.item
    #       所有字段都定义为 scrapy.Field()  不管是什么类型
    # ItemLoader返回列表
    # 输入输出处理器
    name = scrapy.Field(output_processor=TakeFirst())  # 只提取列表中的第一个元素
    content = scrapy.Field(output_processor=TakeFirst())
    bookname = scrapy.Field(output_processor=TakeFirst())

setting.py
在这里插入图片描述
book.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader
from ScrapyProject.items import BookItem

"""
爬虫流程：
1. 确定start_url
2. 引擎将起始的url交给调度器（存储到队列；去重)
3. 调度器将url 地址发送给Downloader,Downloader发起Requests请求，从互联网上下载网页信息，并返回响应Response（就是这个信息）
4. 将下载的页面内容交给Spider,进行解析（parse函数）,yield数据
5. 将处理好的数据items，交给管道pipline进行存储
"""
class BookSpider(scrapy.Spider):
    # 爬虫的名称，必须唯一
    name = 'book'
    base_url = 'http://www.shicimingju.com'
    # 限制：爬取的url地址必须是shicimingju.com
    #allowed_domains = ['shicimingju.com']
    # 起始的url地址，可以有多个   有两种方式指定
    #1. start_urls 属性设置=[]
    #2. 通过start_requests生成起始的url地址
    start_urls = [
        'http://www.shicimingju.com/book/sanguoyanyi.html',
        'http://www.shicimingju.com/book/xiyouji.html',
        'http://www.shicimingju.com/book/hongloumeng.html',
    ]

    def parse(self, response):
        # # 响应的url地址
        # name = response.url.split("/")[-1]
        # self.log('Saved file %s', name)

        """
        如何编写好的解析代码：使用scrapy的交互式工具scrapy shell url
        如何处理解析后的数据：通过yield返回解析数据的字典格式
        如何下载小说章节详情页的链接并下载到本地
        """

        #  0. 实例化item对象
        # item = BookItem()

        # l = ItemLoader(item=BookItem(), response = response)

        #  1.获取所有章节的li标签
        chapters = response.xpath('//div[@class="book-mulu"]/ul/li')
        #  2. 遍历每一个li标签，获取章节的详细网址和章节名称
        for chapter in chapters:
            # 创建ItemLoader对象，将item与   用response/selector根据情况而定
            l = ItemLoader(item=BookItem(), selector=chapter)
            detail_url = chapter.xpath('./a/@href').extract_first()  # 是列表，而且是select对象 转为 字符串，用extract_first()
            # 根据xpath提取数据信息，并填充到item对象的name属性中
            l.add_xpath('name','./a/text()')
            # 将数据信息填充到item对象的bookname属性中
            l.add_value('bookname',response.url.split('/')[-1].strip('.html'))
            # name = chapter.xpath('./a/text()').extract_first()
            # bookname = response.url.split('/')[-1].strip('.html')
            # yield {
            #     'detail_url':detail_url,
            #     'name':name}

            # 存到item里
            # item['name'] = name
            # item['bookname'] = bookname
            # print('item对象: ' ,item)

            # 将章节详情页的url提交到调度器的队列，通过Downloader下载器下载并交给self.parse_detail解析器进行解析数据
            yield Request(url=self.base_url+detail_url,
                          callback=self.parse_chapter_detail,   # 交给哪个解析器去解析
                          # meta={'name':name, 'bookname':bookname}   # 作为原数据传给第二次要解析的函数 parse_chapter_detail
                          # meta={'item':item}
                          meta = {'item':l.load_item()}  # l.load_item()：获取item对象
                          )


    def parse_chapter_detail(self, response):
        # .xpath('string(.)')  获取该标签和子孙标签的所有文本信息
        # 如何将对象转成字符串：
        #       转换一个: extract_first()/ get()
        #       转换列表中的每一个对象：extract()/ get_all()
        content = response.xpath('//div[@class="chapter_content"]')[0].xpath('string(.)').get()
        item = response.meta['item']
        item['content'] = content
        yield item           # item是类似于字典

        # yield{
        #     'name':response.meta['name'],
        #     'content':content,
        #     'bookname':response.meta['bookname']
        # }

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os

class ScrapyprojectPipeline(object):

    def process_item(self, item, spider):
        """ 将章节内容写入对应的章节文件 """
        # books/hongloueng
        dirname = os.path.join('books',item['bookname'])
        if not os.path.exists(dirname):
            os.makedirs(dirname)  # 递归创建目录
        name = item['name']
        # 文件名相对路径用join方法拼接，linux路径拼接符是/ ， windows路径拼接符是\
        filename = os.path.join(dirname, name)
        with open (filename,'w',encoding='utf-8') as f:
            f.write(item['content'])
            print('写入文件%s成功' %(name))
        return item

四、item详解

1、item.py文件中定义item类：

class BookItem(scrapy.Item):
    # 定义item类：
    #       继承scrapy.item
    #       所有字段都定义为 scrapy.Field()  不管是什么类型
    # ItemLoader返回列表
    # 输入输出处理器
    name = scrapy.Field(output_processor=TakeFirst())  # 只提取列表中的第一个元素
    content = scrapy.Field(output_processor=TakeFirst())
    bookname = scrapy.Field(output_processor=TakeFirst())

book.py文件中：
两种做法实现：

方法一、实例化item对象

# 实例化item对象
item = BookItem()

获取信息并存储到item中：

name = chapter.xpath('./a/text()').extract_first()
bookname = response.url.split('/')[-1].strip('.html')

item['name'] = name
item['bookname'] = bookname
print('item对象: ' ,item)

章节详情页的url提交到调度器的队列，通过Downloader下载器下载并交给self.parse_detail解析器进行解析数据

yield Request(url=self.base_url+detail_url,
              callback=self.parse_chapter_detail,   
              meta={'item':item}
              )

方法二、使用ItemLoader

# 创建ItemLoader对象  用response/selector根据情况而定
l = ItemLoader(item=BookItem(), selector=chapters)

根据xpath提取数据信息，并填充到item对象的属性中

l.add_xpath('name','./a/text()')
l.add_value('bookname',response.url.split('/')[-1].strip('.html'))

填充数据有三种方法：

调用xpath选择器：add_xpath
调用css选择器：add_css
直接给字段赋值：add_value

章节详情页的url提交到调度器的队列，通过Downloader下载器下载并交给self.parse_detail解析器进行解析数据

yield Request(url=self.base_url+detail_url,
              callback=self.parse_chapter_detail,  
              meta = {'item':l.load_item()}  # l.load_item()：获取item对象
              )