Scrapy components
- Engine: responsible for the overall scheduling and coordination
- Scheduler:
  a. receives Request objects sent over by the engine (originating from the spiders) and stores them
  b. pops Request objects and hands them to the engine (to be passed to the downloader)
- Downloader: receives Request objects from the engine (from the scheduler), sends the network request, gets the response, and hands the response back to the engine (to be passed to the spiders)
- Spiders: receive the Response passed over by the engine (from the downloader) and parse it
  a. extracted data is handed to the engine (to be passed to the pipeline)
  b. extracted URLs are built into new Requests and handed to the engine (to be passed to the scheduler)
- Pipeline: saves the data (a minimal sketch of how this flow looks in spider code follows this list)
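To make the flow concrete, here is a minimal sketch (the spider name and URL are made up for illustration and are not part of the demo project below): yielding a Request sends it through the engine to the scheduler, while yielding an item (or dict) sends it through the engine to the pipeline.
import scrapy

class FlowSketchSpider(scrapy.Spider):       # hypothetical spider, for illustration only
    name = 'flow_sketch'
    start_urls = ['http://example.com/']     # fetched by the downloader via the engine/scheduler

    def parse(self, response):
        # yield data -> engine -> pipeline (saved)
        yield {'url': response.url}
        # yield a new Request -> engine -> scheduler (queued, later downloaded)
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)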
Usage
1. Installation
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy
pip installs Scrapy's dependencies automatically; the -i mirror (Tsinghua PyPI) just speeds up the download.
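A quick way to check that the installation worked is to run the Scrapy command-line tool:
scrapy version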
2. Create a project
In a terminal (cmd), run: scrapy startproject project_name
Example: scrapy startproject demo
Change into the project directory:
cd project_name
Create a new spider:
scrapy genspider spidername xxx.com
Example:
cd demo
scrapy genspider spider1 xxx.com
demo                        // demo package folder
├── __init__.py
├── items.py                // defines the structure of the scraped data
├── middlewares.py          // middlewares
├── pipelines.py            // pipelines
├── settings.py             // project settings
└── spiders                 // spiders folder
    ├── __init__.py
    └── spider1.py          // the spider created above
Edit the spider1.py file:
# -*- coding: utf-8 -*-
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['xxx.com']     # allowed domains
    start_urls = ['http://xxx.com/']  # start URLs

    def parse(self, response):  # the response downloaded for each start URL is passed to parse
        pass
The initial requests generated from start_urls are sent with Scrapy's default User-Agent; to customize the initial requests (for example to set your own User-Agent), you can override start_requests as shown below.
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['baidu.com']     # allowed domains
    # start_urls = ['http://xxx.com/']  # start URLs (not needed when start_requests is overridden)

    def start_requests(self):  # override the initial requests
        url = 'https://www.baidu.com/'
        # add a User-Agent header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
        # yield the request object
        yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):  # parsing logic for the initial response
        print(response.text)
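As an alternative to passing headers on every request, a project-wide default User-Agent can also be set in settings.py (a sketch; the UA string is just an example):
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'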
Next, let's grab some simple data from the Baidu homepage and save it.
First, whether to obey the site's robots.txt is configured in settings.py (note that if the target site's robots.txt disallows the page you want, Scrapy will filter the request while this is True):
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Define the fields of the Item that will hold the scraped data in items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    titles = scrapy.Field()
    # pass
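A DemoItem behaves like a dict, but only the fields declared with scrapy.Field() can be assigned; a quick sketch of the expected behaviour:
item = DemoItem()
item['titles'] = ['a', 'b']   # fine: 'titles' is declared above
# item['other'] = 'x'         # would raise KeyError, because 'other' is not a declared field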
In spider1.py, import the DemoItem class from items.py, parse the response, and store the extracted data in the corresponding field:
# -*- coding: utf-8 -*-
import scrapy
from ..items import DemoItem


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['baidu.com']     # allowed domains
    # start_urls = ['http://xxx.com/']  # start URLs (not needed when start_requests is overridden)

    def start_requests(self):  # override the initial requests
        url = 'https://www.baidu.com/'
        # add a User-Agent header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
        # yield the request object
        yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):  # parsing logic for the initial response
        # extract the hot-search titles from the homepage
        titles = response.xpath('//ul[@id="hotsearch-content-wrapper"]/li/a/span[2]/text()').getall()
        item = DemoItem()
        item['titles'] = titles
        yield item  # hand the item to the engine (and on to the pipeline)
        # print(response.text)
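The XPath above can be developed interactively with scrapy shell before putting it in the spider (note that the page served to Scrapy's default client may differ from what a browser sees, and the hot-search markup may change over time):
scrapy shell "https://www.baidu.com/"
>>> response.xpath('//ul[@id="hotsearch-content-wrapper"]/li/a/span[2]/text()').getall()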
Enable the pipeline in settings.py (uncomment the ITEM_PIPELINES setting):
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
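The number (300 here) is the pipeline's order: items flow through enabled pipelines from the lowest value to the highest, with values conventionally kept in the 0-1000 range. For example, with a hypothetical second pipeline:
ITEM_PIPELINES = {
    'demo.pipelines.CleanPipeline': 200,   # hypothetical pipeline, would process items first
    'demo.pipelines.DemoPipeline': 300,    # then the pipeline defined below
}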
Define how the data is saved in pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json


class DemoPipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.file = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False, indent='') + ',\n')
        return item

    def __del__(self):
        self.file.close()
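Relying on __del__ to close the file works, but Scrapy pipelines also support open_spider/close_spider hooks, which make the file's lifetime explicit; an equivalent sketch:
import json

class DemoPipeline(object):
    def open_spider(self, spider):
        # called once when the spider is opened
        self.file = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item

    def close_spider(self, spider):
        # called once when the spider is closed
        self.file.close()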
Flow summary
- Override start_requests to build the initial Request objects, specify the callback that will parse their responses, and yield the requests
- Define the parsing logic, store the extracted data in an item, and yield the item back to the engine
- Enable the pipeline in settings.py; the pipeline receives the item objects from the engine and saves them to the file path you defined
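Finally, run the spider from inside the project directory; the scraped titles are written to data.txt:
scrapy crawl spider1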