(Notes) Data Collection Basics 04

五彩斑斓的猫

Last modified: 2024-04-27 23:22:29


First published: 2024-04-24 19:51:08

Article link: https://blog.csdn.net/qq_51372207/article/details/137973375


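This article describes the components of the Scrapy framework, including the Engine, Item, Scheduler, Downloader, Spiders and Item Pipeline, as well as the downloader and spider middlewares. It covers installing Scrapy and demonstrates, through examples, how to create a Spider and an Item and how to process data with a Pipeline.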


20240417

1. Introduction to Scrapy

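Scrapy is an asynchronous processing framework built on Twisted. Its main components are:

Engine: handles the data flow of the whole system and triggers events; it is the core of the framework.

Item: defines the data structure of the scraped results; scraped data is assigned to objects of this type.

Scheduler: accepts requests sent by the engine, adds them to a queue, and hands them back when the engine asks for the next request.

Downloader: downloads page content and returns it, via the engine, to the spiders.

Spiders: define the crawling logic and the page-parsing rules; they are mainly responsible for parsing responses and producing extracted results and new requests.

Item Pipeline: processes the items that spiders extract from pages; its main tasks are cleaning, validating and storing data.

Downloader Middlewares: a hook framework between the engine and the downloader that processes the requests and responses passed between them.

Spider Middlewares: a hook framework between the engine and the spiders that processes the responses fed into spiders and the results and new requests they output.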
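
The way these components hand work to one another can be sketched with a toy model in plain Python. This is not Scrapy's implementation: the names FAKE_PAGES, download, parse, pipeline and run_engine are hypothetical, the "web" is an in-memory dict, and everything runs synchronously rather than on Twisted. It only illustrates who passes what to whom.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (quotes on the page, next page or None).
FAKE_PAGES = {
    "page1": (["quote A", "quote B"], "page2"),
    "page2": (["quote C"], None),
}


def download(url):
    """Downloader: fetch the page body for a request."""
    return FAKE_PAGES[url]


def parse(response):
    """Spider: turn a response into items and follow-up requests."""
    quotes, next_url = response
    for quote in quotes:
        yield {"item": quote}
    if next_url is not None:
        yield {"request": next_url}


def pipeline(item, storage):
    """Item Pipeline: clean the item and store it."""
    storage.append(item.strip().upper())


def run_engine(start_urls):
    """Engine: move requests, responses and items between the components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # never schedule the same URL twice
    storage = []
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if "request" in result:
                if result["request"] not in seen:
                    seen.add(result["request"])
                    scheduler.append(result["request"])
            else:
                pipeline(result["item"], storage)
    return storage


stored = run_engine(["page1"])
print(stored)
```

In real Scrapy the engine also routes every request and response through the two middleware layers, which this sketch leaves out.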

2. Installing and using Scrapy

pip install scrapy

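Creating a project:

A Scrapy project's files can be generated directly with the scrapy command:

scrapy startproject tutorial

The command can be run in any directory. If it reports a permission problem, try running it with sudo. It creates a directory named tutorial with the following structure: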

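scrapy.cfg            # configuration file used when deploying Scrapy
tutorial/             # the project's Python module; import your code from here
    __init__.py
    items.py          # Item definitions: the structure of the scraped data
    middlewares.py    # Middleware definitions used while crawling
    pipelines.py      # Pipeline definitions: the data pipelines
    settings.py       # project settings
    spiders/          # directory that holds the Spiders
        __init__.py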

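Creating a Spider:

A Spider is a class you define yourself; Scrapy uses it to fetch content from web pages and parse the results. The class must inherit from scrapy.Spider, the Spider class that Scrapy provides, and it must define the Spider's name, its initial requests, and a method that handles the crawled results.

You can also create a Spider from the command line. For example, to generate the Quotes Spider, run:

cd tutorial

scrapy genspider quotes quotes.toscrape.com

This enters the tutorial directory created above and then runs the genspider command. The first argument is the Spider's name and the second is the site's domain. Afterwards the spiders directory contains a new file, quotes.py, which is the Spider just created. Its contents are: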

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

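There are three attributes here (name, allowed_domains and start_urls) and one method, parse.

name: the Spider's name, which must be unique within the project and is used to tell Spiders apart.

allowed_domains: the domains that may be crawled. If a follow-up request's URL is not under one of these domains, the request is filtered out.

start_urls: the list of URLs the Spider crawls at startup; the initial requests are defined by it.

parse: a method of the Spider. By default, once the requests built from the links in start_urls have been downloaded, each response is passed to this method as its only argument. The method is responsible for parsing the response, extracting data, and generating further requests to process.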
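
The domain check behind allowed_domains can be sketched with the standard library. The helper is_allowed is hypothetical and much simpler than Scrapy's own offsite filtering; it only illustrates the rule that a listed domain and its subdomains pass while other hosts do not.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["quotes.toscrape.com"]
print(is_allowed("http://quotes.toscrape.com/page/2/", allowed))
print(is_allowed("https://example.com/", allowed))
```

The first call prints True and the second prints False, so a link from a quotes page to an unrelated site would be dropped rather than crawled.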

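Creating an Item:

An Item is a container for scraped data, and it is used much like a dictionary. Compared with a dictionary, however, an Item adds a layer of protection that guards against misspelled or undeclared field names.

To create an Item, inherit from scrapy.Item and declare fields of type scrapy.Field. Looking at the target site, the content we can extract is text, author and tags.

Define the Item by changing items.py as follows: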

import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
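
The protection mentioned above can be illustrated with a small stand-in written with only the standard library. StrictItem and QuoteRecord are hypothetical names and this is not Scrapy's implementation; it only mimics the idea that assigning to an undeclared field fails immediately instead of silently creating a new key. A scrapy.Item likewise raises KeyError when an undeclared field is assigned.

```python
class StrictItem(dict):
    """Dictionary that only accepts the keys listed in `fields`."""

    fields = ()

    def __setitem__(self, key, value):
        # Only item[key] = value assignments are checked in this sketch.
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class QuoteRecord(StrictItem):
    fields = ("text", "author", "tags")


record = QuoteRecord()
record["text"] = "An example quote"

error_message = None
try:
    record["autor"] = "Someone"  # misspelled field name
except KeyError as exc:
    error_message = str(exc)

print(dict(record))
print(error_message)
```

With a plain dict the misspelled key would be stored without complaint and the mistake would only surface later, for example as an empty author column.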

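Parsing the Response:

The response argument of the parse method is the result of crawling a link in start_urls. Inside parse we can therefore parse the contents of response directly: inspect the page source of the result, analyse that source further, or find links in it to build the next request.

The page contains both the results we want and a link to the next page, and both parts need to be handled.

First look at the page structure, shown in a figure in the original post. Each page has several blocks with class quote, and each block contains text, author and tags. So we first find all the quote blocks and then extract the contents of each one.

Extraction can use either CSS selectors or XPath selectors. Here we use CSS selectors, and parse is rewritten as follows: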

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

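Using the Item:

An Item can be thought of as a dictionary, except that it has to be instantiated before use. We then assign the results parsed above to each field of the Item in turn and finally yield the Item.

QuotesSpider is rewritten as follows: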
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

With this, everything on the first page is parsed and assigned to individual QuoteItem objects.

3. Scrapy example: scraping novels

import scrapy


class XsbooktxtCcSpider(scrapy.Spider):
    name = "xsbooktxt.cc"
    # allowed_domains = ["www.xsbooktxt.cc"]
    # One listing page per category.
    start_urls = [
        'https://www.xsbooktxt.cc/html/{}/1/'.format(category)
        for category in ['chuanyue', 'yanqing', 'dushi']
    ]

    def parse(self, response):
        # Each <li> in the listing holds one book; the second <span> links to it.
        for li in response.xpath('//div[@id="newscontent"]/div/ul/li'):
            book_name = li.xpath('span[2]/a/text()').extract_first()
            href = li.xpath('span[2]/a/@href').extract_first()
            if href is None:
                continue  # skip entries without a link
            self.logger.debug('%s -> %s', book_name, href)
            yield scrapy.Request(response.urljoin(href), self.chapter_url_get)

    def chapter_url_get(self, response):
        # Follow every chapter link listed on the book page.
        for chapter_url in response.xpath('//dl/dd/a/@href').extract():
            yield scrapy.Request(response.urljoin(chapter_url), self.content_get)

    def content_get(self, response):
        # Placeholder: print the raw chapter page.
        print(response.text)
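
The spider above has to turn each href it finds into an absolute URL before requesting it. A minimal sketch of that resolution step using urllib.parse.urljoin from the standard library follows; the href values are hypothetical examples, not taken from the site. Scrapy's response.urljoin performs essentially the same resolution with the response's own URL as the base.

```python
from urllib.parse import urljoin

BASE = "https://www.xsbooktxt.cc/html/dushi/1/"

# Hypothetical href values of the kinds a listing page may contain.
hrefs = ["/book/123/", "456.html", "https://www.xsbooktxt.cc/book/789/"]

absolute = [urljoin(BASE, href) for href in hrefs]
for url in absolute:
    print(url)
```

Site-relative, page-relative and already-absolute links all come out as complete URLs, which plain string concatenation with the site root does not guarantee.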

4. Item usage example:

import scrapy
from xiaoshuo1.items import CentosChinaItem


class CentoschinaSpider(scrapy.Spider):
    name = "centoschina"

    def start_requests(self):
        # Start from a listing page; later pages are followed from question_lists.
        every_page_url = "https://www.centoschina.cn/troubleshooting/page/2"
        yield scrapy.Request(every_page_url, self.question_lists)

    def question_lists(self, response):
        hrefs = response.xpath('//h2/a/@href').extract()
        if not hrefs:
            return  # no articles on this page: stop paginating
        # The page number is the last path segment of the current URL.
        current_page = int(response.url.rstrip('/').split('/')[-1])
        next_url = "https://www.centoschina.cn/troubleshooting/page/{}".format(current_page + 1)
        yield scrapy.Request(next_url, self.question_lists)
        for href in hrefs:
            yield scrapy.Request(href, self.get_content)

    def get_content(self, response):
        item = CentosChinaItem()
        title = response.xpath('//h1/text()').extract_first(default='').strip()
        contents = ''.join(response.xpath('//div[@class="content_post"]//text()').extract()).strip()
        print('title', title, 'contents', contents)
        item['title'] = title
        item['contents'] = contents
        return item

# xiaoshuo1/items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CentosChinaItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    contents = scrapy.Field()
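
In question_lists, the URL of the next listing page is derived from the current one by incrementing its last path segment. That arithmetic can be isolated in a small helper, sketched here with a hypothetical name, next_page_url, and the assumption that the URL always ends in a numeric page segment, with or without a trailing slash.

```python
def next_page_url(current_url):
    """Return the URL of the following listing page.

    Assumes the URL ends with a numeric page segment,
    e.g. .../page/2 or .../page/2/.
    """
    base, _, page = current_url.rstrip("/").rpartition("/")
    return "{}/{}".format(base, int(page) + 1)


print(next_page_url("https://www.centoschina.cn/troubleshooting/page/2"))
print(next_page_url("https://www.centoschina.cn/troubleshooting/page/2/"))
```

Stripping the trailing slash first matters because a URL ending in "/" would otherwise yield an empty last segment, and int('') raises ValueError.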

5. Storing items in a database with a Scrapy pipeline:

We can write a custom Item Pipeline simply by implementing the expected methods. One method is mandatory:

process_item(item, spider)

Several other methods are also useful:

open_spider(spider)

close_spider(spider)

from_crawler(cls, crawler)
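
How the four methods fit together can be sketched with a pipeline that writes to SQLite from the standard library, so it runs without a database server. The class name SqlitePipeline and the setting name SQLITE_PATH are hypothetical, and the stand-in crawler object at the bottom exists only to call the methods by hand in the order Scrapy would call them during a crawl.

```python
import sqlite3
from types import SimpleNamespace


class SqlitePipeline:
    """Sketch of an Item Pipeline that stores items in SQLite."""

    def __init__(self, db_path):
        self.db_path = db_path

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from project settings.
        # SQLITE_PATH is a hypothetical setting name used only in this sketch.
        return cls(crawler.settings.get("SQLITE_PATH", ":memory:"))

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS questions (title TEXT, contents TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item. Placeholders let the driver quote the values.
        self.conn.execute(
            "INSERT INTO questions (title, contents) VALUES (?, ?)",
            (item["title"], item["contents"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()


# Drive the pipeline by hand with a stand-in for the crawler object.
fake_crawler = SimpleNamespace(settings={"SQLITE_PATH": ":memory:"})
pipeline = SqlitePipeline.from_crawler(fake_crawler)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": 'He said "hi"', "contents": "body"}, spider=None)
rows = pipeline.conn.execute("SELECT title, contents FROM questions").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

The title deliberately contains double quotes: because the values are passed as parameters rather than formatted into the SQL string, they are stored unchanged.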

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from xiaoshuo1.settings import MYSQL_USER, MYSQL_PASSWORD, MYSQL_DB


class Xiaoshuo1Pipeline:
    def open_spider(self, spider):
        # Open one connection when the spider starts.
        self.client = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, database=MYSQL_DB)
        self.cursor = self.client.cursor()

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes both values.
        sql = 'insert into questions(title,contents) values(%s,%s)'
        self.cursor.execute(sql, (item['title'], item['contents']))
        self.client.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.client.close()
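
The pipeline above only runs if it is enabled in the project settings, and it reads MYSQL_USER, MYSQL_PASSWORD and MYSQL_DB from xiaoshuo1.settings. A possible settings.py fragment is sketched below. ITEM_PIPELINES is Scrapy's own setting; the three MYSQL_* names are custom ones used by this project, and the values shown are placeholders, not the author's.

```python
# xiaoshuo1/settings.py (fragment)

# Enable the pipeline. The number, conventionally in the range 0-1000, sets the
# order when several pipelines are enabled: lower numbers run first.
ITEM_PIPELINES = {
    "xiaoshuo1.pipelines.Xiaoshuo1Pipeline": 300,
}

# Custom settings read by the pipeline. The values are placeholders.
MYSQL_USER = "root"
MYSQL_PASSWORD = "change-me"
MYSQL_DB = "spider"
```

The questions table with title and contents columns must already exist in that database, since the pipeline only inserts rows.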
