Scrapy workflow

debug工具人

Article link: https://blog.csdn.net/qq_53367724/article/details/130992999

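This article walks through the overall Scrapy workflow, including the roles of the engine, scheduler, downloader, spider and pipeline. It covers creating a project, configuring settings, handling static and dynamic pages, storing data (CSV, MySQL, MongoDB) and downloading files. It also stresses the importance of request headers and of strategies for not being identified as a crawler.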

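Overall Scrapy workflow

Official Scrapy documentation: https://docs.scrapy.org/en/latest/

Engine

The core of Scrapy. It connects all the other modules and coordinates the flow of data between them.

Scheduler

In essence this is a queue holding the requests we are about to send; you can think of it as a container of URLs. It decides which URL gets crawled next, and it is usually where duplicate URLs are filtered out.

Downloader

The module that actually sends the requests. Beginners can think of it as a get_page_source() function, except that what it returns is a response object.

Spider

Parses the response object returned by the downloader and extracts the data we need from it.

Pipeline

Mainly responsible for storing the data and for any other persistence work.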
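
To make the division of labour concrete, here is a toy, pure-Python model of the loop described above. It is only a sketch and not how Scrapy is implemented: FAKE_SITE, download, parse and run_engine are made-up names, and the "downloader" reads from a dictionary instead of the network.

```python
from collections import deque

# Hypothetical in-memory "website": url -> (data on the page, links on the page)
FAKE_SITE = {
    "/index": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", ["/index"]),
}


def download(url):
    """Downloader: turn a URL into a response (here just a tuple)."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: extract one item and the follow-up URLs from a response."""
    data, links = response
    return {"title": data}, links


def run_engine(start_url):
    """Engine: move requests and items between the other components."""
    queue = deque([start_url])   # scheduler: URLs waiting to be fetched
    seen = {start_url}           # scheduler-side de-duplication
    stored = []                  # stands in for the pipeline's storage
    while queue:
        url = queue.popleft()
        item, links = parse(download(url))
        stored.append(item)      # pipeline: persist the item
        for link in links:
            if link not in seen:  # skip URLs that were already scheduled
                seen.add(link)
                queue.append(link)
    return stored


items = run_engine("/index")
print(items)
```

Every page links back to a page that was already scheduled, yet each one is fetched exactly once, which is the job of the de-duplication step in the scheduler.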

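1. Create the project

First open a terminal and cd to the folder where you want to create the project

scrapy startproject project_name

Once it reports success,
cd into the new project folder and create a spider (genspider puts the new file under spiders/)

scrapy genspider spider_name domain_to_crawl

Here project_name is the name of the whole crawler project,
and spider_name is the name of the spider file. Keep the two apart.

The two names must not be identical, otherwise you get an error

After the steps above you end up with a set of files and folders like this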

    ├── mySpider_2
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── spider_name.py   # the newly added file
    └── scrapy.cfg

This shows that your crawler project was created successfully

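2. Configure settings

Open settings.py and you will find that many of the parameters are commented out.

My habit is to go through five steps:

(1) Quiet the log: LOG_LEVEL = 'WARNING'

(2) Do not obey the robots.txt protocol: ROBOTSTXT_OBEY = False

(3) Set up proxy IPs; more on this later

(4) Lower the number of concurrent requests: CONCURRENT_REQUESTS = 4. Setting it too high tends to break things

(5) Create a spider_run file in the outermost project folder, so you don't have to type scrapy crawl every time

1. The full log is long and messy. You don't need it while writing the spider, but turn it back on during maintenance, since it makes it much easier to see where something went wrong.

2. Many sites disallow crawlers in robots.txt, so if you obey it you often get nothing back.

3. Covered in detail later.

4. Don't set the request count too high; the site can easily detect that you are a crawler and start rejecting your requests.

5. spider_run simply runs the spider. Without it you have to type scrapy crawl in the terminal every time.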
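
Put together, the settings from the list above might look like the excerpt below. This is a sketch of a settings.py fragment, not a recommended configuration; DOWNLOAD_DELAY is not part of the list above and is included only as an assumption about how one might slow the crawl further.

```python
# settings.py (excerpt)

LOG_LEVEL = "WARNING"      # hide INFO/DEBUG noise while developing
ROBOTSTXT_OBEY = False     # do not fetch and honour robots.txt
CONCURRENT_REQUESTS = 4    # Scrapy's default is 16

# Assumption, not from the list above: pause about one second between requests
DOWNLOAD_DELAY = 1
```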

The code for spider_run is as follows

from scrapy import cmdline

cmdline.execute("scrapy crawl spider_name".split())

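After creating spider_run, the file structure looks like this (spider_run must be created in the outermost folder, next to scrapy.cfg, otherwise it won't work)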

    ├── mySpider_2
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── spider_name.py
    ├── scrapy.cfg
    └── spider_run.py   # the newly added file

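3. Spider for static pages

Before writing a spider, first look at the page you want to crawl (right-click, Inspect) and work out whether it is static or dynamic.

For a static page, read the fields you need straight from the HTML.
For a dynamic page, you need to capture the network requests.

Once the spider has been generated, open the spider file and you will find code like the following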

import scrapy


class YouxiSpider(scrapy.Spider):
    name = 'youxi'  # this name matters: it is what you pass to scrapy crawl
    allowed_domains = ['4399.com']  # domains the spider may crawl; can be commented out for a single-site crawl
    start_urls = ['http://www.4399.com/flash/']  # start page

    def parse(self, response, **kwargs):
        # response.text    -> page source
        # response.xpath() -> extract with XPath
        # response.css()   -> extract with CSS selectors
        # response.json()  -> parse a JSON body

        # Use the approach we know best, XPath, to extract the game name, category and release date
        li_list = response.xpath("//ul[@class='n-game cf']/li")
        for li in li_list:
            name = li.xpath("./a/b/text()").extract_first()
            category = li.xpath("./em/a/text()").extract_first()
            date = li.xpath("./em/text()").extract_first()

            # Collect the fields in a dict
            dic = {
                "name": name,
                "category": category,
                "date": date
            }

            # Hand the extracted data over to the pipeline.
            yield dic   # note: a callback may only yield Request objects, dicts, items, or None

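There are three things to watch out for in this spider

(1) Parsing

For parsing the page, read up on XPath, CSS selectors and regular expressions yourself.

(2) Passing data between parse functions

When a value has to travel from one callback to the next, pair it with a key name in a dict at the end of the first parse function and pass that dict through the meta argument of scrapy.Request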

        for ind, url in enumerate(main_category_list):
            main_category_name = main_category_names[ind]
            # meta carries values from one callback to the next
            yield scrapy.Request(url, callback=self.keshi_detail, meta={'main_category_name': main_category_name})

Having passed the value along, you of course have to pick it up in keshi_detail

    def keshi_detail(self, response):
        main_category_name = response.meta['main_category_name']

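(3) The format of the data handed to the item

Once you have defined an Item class, yield an instance of it rather than an ad-hoc dict, and never a list. Scrapy does accept plain dicts (the example above yields one), but an Item rejects field names you did not declare, so typos surface immediately.

The simplest way is to assign straight to the matching field, as in the following code

item['field_name'] = response.xpath(.........)

Copy the field names from items.py into this position directly. It is tidy and consistent, and it amounts to defining your data structure up front.

This way the data you scrape lands in the matching field of the item you yield.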

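4. Spider for dynamic pages

Dynamic pages have one very obvious trait: as you move between sub-pages of the same site, the URL stays the same no matter how often you refresh. If you request that URL directly, the site refuses the request and returns status code 400.

In that case you need to assemble your headers, that is, the request headers. The page refuses you because your request headers are wrong. Request headers contain many fields, and some sites only accept the request if every one of them matches.

Here is a fairly safe approach: open the page you want to crawl, right-click and choose Inspect, find all the request headers, copy the whole lot, and assemble them into a dict. Whichever field the site decides to check, it will be there.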

request_headers = b"""
                           :authority: www.xxxxx.cn    
                           :method: GET
                           :path: /work/index.htm?from=index
                           :scheme: https
                           accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
                           accept-encoding: gzip, deflate, br
                           accept-language: zh-CN,zh;q=0.9
                           cache-control: max-age=0
                           cookie: _ga=GA1.2.1425286857.1577928625; route=2bf12a7e3d18498383e2970557db636f; dxy_da_cookie-id=4fd84d1297e8a4c397d35355c62f03131578878840013; Hm_lvt_17521045a35cb741c321d016e40c7f95=1578561472,1578620137,1578626115,1578878841; _gid=GA1.2.1118274719.1578878841; JOSESSIONID=459F925E8F455EB53F24BE736297EEF5-n2; _gat=1; Hm_lpvt_17521045a35cb741c321d016e40c7f95=1578894839
                           referer: https://www.jobmd.cn/work/index.htm?from=index
                           sec-fetch-mode: navigate
                           sec-fetch-site: same-origin
                           sec-fetch-user: ?1
                           upgrade-insecure-requests: 1
                           user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
                           """

# headers_raw_to_dict is assumed to be the helper from the third-party copyheaders package
headers = headers_raw_to_dict(request_headers)
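
If you would rather not depend on an extra package for this step, the conversion is easy to sketch with the standard library. The function name parse_raw_headers is made up here, and unlike the bytes-based call above it works on str; it also drops the HTTP/2 pseudo-headers (the lines starting with a colon), which describe the request line and are not ordinary headers.

```python
def parse_raw_headers(raw):
    """Turn a header block copied from DevTools into a dict of str -> str."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and pseudo-headers such as ":authority" or ":path".
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


raw = """
    :method: GET
    accept-language: zh-CN,zh;q=0.9
    referer: https://www.jobmd.cn/work/index.htm?from=index
    upgrade-insecure-requests: 1
"""
headers = parse_raw_headers(raw)
print(headers)
```

The resulting dict can be passed as the headers argument of scrapy.Request.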

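Pay particular attention to one key field here, cookie. It carries the identity of the request: copy the cookie from a logged-in request and you can get in without a password. On some sites the cookie contains a timestamp and you have to assemble it yourself.

With the headers in place, you parse the page and find nothing but garbled text, with none of the data you wanted.

That is because the site needs to know which data you are asking for before it can return it. In other words, the request has to carry parameters (a query string or form data) that say what you want.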
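
As a small illustration of a request that carries its parameters, the sketch below assembles a query string with the standard library. The endpoint and the parameter names are hypothetical; in practice you copy them from the request you captured in the browser's Network panel.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, as they might appear in the Network panel.
base_url = "https://www.example.com/api/jobs"
params = {"from": "index", "page": 2, "keyword": "nurse"}

# urlencode escapes the values and joins them as key=value pairs with "&".
url = f"{base_url}?{urlencode(params)}"
print(url)
```

The resulting URL can be passed to scrapy.Request as usual. For POST endpoints that expect form data, scrapy.FormRequest takes a formdata dict instead.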

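5. items.py

Everything the spiders above collect is eventually yielded as an item. You can think of an item as a custom table in a database, with a slot for every field.
This part is easy because the format is always the same.
It lists the name of every field you collect,

field_name = scrapy.Field()

stores the data for each field, and then passes it on to the pipeline.

The code in items.py looks like this: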

import scrapy

class GameItem(scrapy.Item):
    # Define the data structure
    name = scrapy.Field()
    category = scrapy.Field()
    date = scrapy.Field()

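One thing to note here is what happens when you define several items.

There is a question: everything is yield item, so how does the program know which field goes into which item?

You declare it first. Import the Item class you want and instantiate it. Here I instantiate the one defined in items.py above, so every value I assign afterwards goes into this GameItem

item = GameItem()
item['name'] = response.xpath(.........)
item['category'] = response.xpath(.........)
item['date'] = response.xpath(.........)
yield item

And why does it have to be called item?

大汉堡 = GameItem()
大汉堡['name'] = response.xpath(.........)
大汉堡['category'] = response.xpath(.........)
大汉堡['date'] = response.xpath(.........)
yield 大汉堡

Just kidding. Chinese identifiers (大汉堡 means "big hamburger") look rather unprofessional...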

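6. pipelines.py (pipelines)

If you want to use a pipeline, first enable it in settings. Enable the pipeline, enable the pipeline, enable the pipeline. Important things get said three times: a pipeline that is not enabled does nothing.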

ITEM_PIPELINES = {
   'caipiao.pipelines.CaipiaoFilePipeline': 300,
}

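Data coming out of the item enters the pipeline. The pipeline is mainly responsible for storing the data, in whatever format you like.

(1) Saving to a CSV file

Writing to a file is very simple: just open the file inside the pipeline and write comma-separated lines.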

class CaipiaoFilePipeline:

    def process_item(self, item, spider):
        with open("pig.txt", mode="a", encoding='utf-8') as f:
            # Write one comma-separated line per item
            f.write(f"{item['qihao']}, {'_'.join(item['red_ball'])}, {'_'.join(item['blue_ball'])}\n")
        return item

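But written this way, the file is opened once for every item stored. With tens of millions of items, that means tens of millions of opens.
What I want is to open the file only once. I can add two methods to the pipeline, open_spider() and close_spider(). The names say what they do:

open_spider() runs once when the spider starts

close_spider() runs once when the spider finishes

That way I open the file once, keep writing to it, and close it when the spider is done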

class CaipiaoFilePipeline:

    def open_spider(self, spider):
        self.f = open("caipiao.txt", mode="a", encoding='utf-8')

    def close_spider(self, spider):
        if self.f:
            self.f.close()

    def process_item(self, item, spider):
        # Write to the already-open file
        self.f.write(f"{item['qihao']}, {'_'.join(item['red_ball'])}, {'_'.join(item['blue_ball'])}\n")
        return item
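
Hand-joining values with commas breaks as soon as a value itself contains a comma or a quote. A variant built on the standard csv module avoids that. The sketch below assumes items carry the qihao, red_ball and blue_ball fields used by the database pipelines in this post; the class name, the output path and the sample item are made up for the demo.

```python
import csv
import os
import tempfile


class CaipiaoCsvPipeline:
    """Sketch: write one CSV row per item, opening the file only once."""

    path = "caipiao.csv"  # hypothetical output file

    def open_spider(self, spider):
        # newline="" lets the csv module manage line endings itself
        self.f = open(self.path, mode="a", encoding="utf-8", newline="")
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow(
            [item["qihao"], "_".join(item["red_ball"]), "_".join(item["blue_ball"])]
        )
        return item


# Drive the pipeline by hand, the way Scrapy would during a crawl.
pipeline = CaipiaoCsvPipeline()
pipeline.path = os.path.join(tempfile.mkdtemp(), "caipiao.csv")
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"qihao": "2023001", "red_ball": ["01", "02", "03"], "blue_ball": ["07"]},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(pipeline.path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```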

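(2) Writing to a MySQL database

First create the database connection in open_spider, close the connection in close_spider, and do the actual saving in process_item.

Start by putting the MySQL settings into settings.py

# MySQL configuration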
MYSQL_CONFIG = {
   "host": "localhost",
   "port": 3306,
   "user": "root",
   "password": "test123456",
   "database": "spider",
}

from caipiao.settings import MYSQL_CONFIG as mysql
import pymysql

class CaipiaoMySQLPipeline:

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=mysql["host"], port=mysql["port"], user=mysql["user"], password=mysql["password"], database=mysql["database"])

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # Write to the database
        cursor = self.conn.cursor()
        try:
            sql = "insert into caipiao(qihao, red, blue) values(%s, %s, %s)"
            red = ",".join(item['red_ball'])
            blue = ",".join(item['blue_ball'])
            cursor.execute(sql, (item['qihao'], red, blue))
            self.conn.commit()
            spider.logger.info(f"Saved item {item}")
        except Exception as e:
            self.conn.rollback()
            # Log the error; logging arguments need matching %s placeholders
            spider.logger.error("Failed to save to the database: %s; item: %s", e, item)
        finally:
            cursor.close()
        return item

Don't forget to register the pipeline

ITEM_PIPELINES = {
   'caipiao.pipelines.CaipiaoMySQLPipeline': 301,
}
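
To try the open / commit / rollback shape of such a pipeline without a MySQL server, the same structure can be run against sqlite3 from the standard library. This is a stand-in for experimenting, not the setup used in this post: the class name, the in-memory database and the sample item are assumptions made for the demo.

```python
import sqlite3


class CaipiaoSQLitePipeline:
    """Same open/commit/rollback shape as the MySQL pipeline, on sqlite3."""

    db_path = ":memory:"  # assumption: throwaway in-memory database

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "create table if not exists caipiao (qihao text, red text, blue text)"
        )

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # sqlite3 uses ? placeholders where pymysql uses %s
            self.conn.execute(
                "insert into caipiao (qihao, red, blue) values (?, ?, ?)",
                (item["qihao"], ",".join(item["red_ball"]), ",".join(item["blue_ball"])),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        return item


pipeline = CaipiaoSQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"qihao": "2023001", "red_ball": ["01", "02"], "blue_ball": ["07"]},
    spider=None,
)
rows = pipeline.conn.execute("select qihao, red, blue from caipiao").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

In both versions the values are passed separately from the SQL text, so the driver handles quoting and the statement is not built by string formatting.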

(3) Writing to a MongoDB database

Writing to MongoDB follows exactly the same pattern as writing to MySQL... no more talk, straight to the code

MONGO_CONFIG = {
   "host": "localhost",
   "port": 27017,
   'has_user': True,
   'user': "python_admin",
   "password": "123456",
   "db": "python"
}

from caipiao.settings import MONGO_CONFIG as mongo
import pymongo

class CaipiaoMongoDBPipeline:
    def open_spider(self, spider):
        kwargs = {"host": mongo['host'], "port": mongo['port']}
        if mongo['has_user']:
            # PyMongo 4 removed db.authenticate(); credentials go to MongoClient instead
            kwargs.update(username=mongo['user'], password=mongo['password'], authSource=mongo['db'])
        self.client = pymongo.MongoClient(**kwargs)
        self.collection = self.client[mongo['db']]['caipiao']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.collection.insert_one({"qihao": item['qihao'], 'red': item["red_ball"], 'blue': item['blue_ball']})
        return item

ITEM_PIPELINES = {
    # All three pipelines can coexist
   'caipiao.pipelines.CaipiaoFilePipeline': 300,
   'caipiao.pipelines.CaipiaoMySQLPipeline': 301,
   'caipiao.pipelines.CaipiaoMongoDBPipeline': 302,
}
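
The numbers in ITEM_PIPELINES decide the order: every item passes through the enabled pipelines from the lowest number to the highest, which is also why each process_item has to return the item. The snippet below only sorts the dict to show the resulting order; the entries are deliberately shuffled, and it does not touch Scrapy itself.

```python
ITEM_PIPELINES = {
    "caipiao.pipelines.CaipiaoMongoDBPipeline": 302,
    "caipiao.pipelines.CaipiaoFilePipeline": 300,
    "caipiao.pipelines.CaipiaoMySQLPipeline": 301,
}

# Sort by the priority value, then keep only the class name of each dotted path.
order = [
    path.rsplit(".", 1)[-1]
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(order)
```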

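7. Saving files

The scenario here: a website hosts some files, and I want to download them in bulk.

On the page you click Download and the file downloads. Behind the scenes, the click sends a request and the server responds with the download link, which is what makes the browser download the file. That download link is what we need to get hold of. I once scraped approval documents from the drug administration's site with exactly this routine.

(1) First, set up the project and define the data structure in items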

class MeinvItem(scrapy.Item):
    name = scrapy.Field()
    img_url = scrapy.Field()
    img_path = scrapy.Field()

(2) Flesh out the spider; pay attention to yield scrapy.Request()

import scrapy
from meinv.items import MeinvItem


class TupianzhijiaSpider(scrapy.Spider):
    name = 'tupianzhijia'
    allowed_domains = ['tupianzj.com']
    start_urls = ['https://www.tupianzj.com/bizhi/DNmeinv/']

    def parse(self, resp, **kwargs):
        li_list = resp.xpath("//ul[@class='list_con_box_ul']/li")
        for li in li_list:
            href = li.xpath("./a/@href").extract_first()
            # Why grab the href? To visit the detail page
            yield scrapy.Request(
                url=resp.urljoin(href),  # Scrapy's URL joining
                method='GET',
                callback=self.parse_detail,
            )
        # Next page ("下一页" is the text of the site's "next page" link)
        next_page = resp.xpath('//div[@class="pages"]/ul/li/a[contains(text(), "下一页")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(
                url=resp.urljoin(next_page),
                method='GET',
                callback=self.parse
            )


    def parse_detail(self, resp):
        img_src = resp.xpath('//*[@id="bigpic"]/a[1]/img/@src').extract_first()
        name = resp.xpath('//*[@id="container"]/div/div/div[2]/h1/text()').extract_first()
        meinv = MeinvItem()
        meinv['name'] = name
        meinv['img_url'] = img_src
        yield meinv

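About the parameters of Request():
url: the address to request
method: the HTTP method
callback: the callback function that handles the response
errback: the callback invoked when the request fails
dont_filter: defaults to False, meaning the request goes through the scheduler's duplicate filter; set it to True to skip the filter so that a repeated URL is requested again
headers: the request headers
cookies: the cookies to send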
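Next comes the download itself. How do you download an image inside a pipeline?

Scrapy has it ready for you. Its ImagesPipeline downloads images automatically (it needs the Pillow library installed).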

from scrapy.pipelines.images import ImagesPipeline, FilesPipeline
from scrapy.exceptions import DropItem
import pymysql
from meinv.settings import MYSQL
import scrapy


class MeinvPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=MYSQL['host'],
            port=MYSQL['port'],
            user=MYSQL['user'],
            password=MYSQL['password'],
            database=MYSQL['database']
        )

    def close_spider(self, spider):
        if self.conn:
            self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        try:
            sql = "insert into tu (name, img_src, img_path) values (%s, %s, %s)"
            # The item field is img_url (see MeinvItem); img_src is only the column name
            cursor.execute(sql, (item['name'], item['img_url'], item['img_path']))
            self.conn.commit()
        except Exception:
            self.conn.rollback()
        finally:
            cursor.close()
        return item


class MeinvSavePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Send the request that downloads the image.
        # For several images, loop over the URLs and yield one Request per image
        yield scrapy.Request(item['img_url'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Work out the file name for the image
        filename = request.url.split("/")[-1]
        return f"img/{filename}"

    def item_completed(self, results, item, info):
        # results is a list of (success, info) pairs, one per request
        paths = [res["path"] for ok, res in results if ok]
        if not paths:
            raise DropItem("image download failed")
        # Path where the file was stored, relative to IMAGES_STORE
        item['img_path'] = paths[0]
        return item
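
One caveat with taking the last segment of request.url as the file name: if the image URL carries a query string, the query ends up in the name. A slightly more careful sketch using only the standard library is shown below; the helper name image_path_from_url and the sample URL are made up.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_path_from_url(url, folder="img"):
    """Build 'folder/filename' from the path part of a URL, ignoring the query."""
    filename = PurePosixPath(urlparse(url).path).name
    return f"{folder}/{filename}"


url = "https://img.example.com/uploads/2023/cat.jpg?size=large"
print(url.split("/")[-1])        # the naive approach keeps the query string
print(image_path_from_url(url))  # the path-based approach drops it
```

Inside file_path the same helper could be called with request.url.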

Finally, the following needs to be set in settings:

MYSQL = {
   "host": "localhost",
   "port": 3306,
   "user": "root",
   "password": "test123456",
   "database": 'spider'
}

ITEM_PIPELINES = {
    'meinv.pipelines.MeinvPipeline': 303,
    'meinv.pipelines.MeinvSavePipeline': 301,
}
# Where images are saved -> ImagesPipeline
IMAGES_STORE = './my_tu'
# Where files are saved -> FilesPipeline
FILES_STORE = './my_tu'