Crawler-06: Scrapy Framework
Scrapy framework basics
1. Introduction to the Scrapy framework
- Scrapy is an application framework written for crawling websites and extracting structured data; with only a small amount of our own code it can crawl pages quickly.
- Scrapy is built on the Twisted asynchronous networking framework, which speeds up downloads.
- Asynchronous: once a call is issued it returns immediately, whether or not a result is available yet.
- Non-blocking: describes the state of the program while it waits for a call's result; the call does not block the current thread before the result is ready.
2. How the Scrapy framework works
Component | Role |
---|---|
Engine | The commander-in-chief: passes data and signals between the other components |
Scheduler | A queue that stores the requests sent over by the engine |
Downloader | Downloads the requests handed over by the engine and then sends the responses back to the engine |
Spider | Processes the responses handed over by the engine, extracts data and URLs, and hands them back to the engine |
Pipeline | Processes the data handed over by the engine, e.g. stores it |
Downloader middleware | Customizable download extensions, e.g. setting a proxy |
Spider middleware | Can customize requests and filter requests |
3. Getting started with Scrapy
- Create a Scrapy project
scrapy startproject project_name
- Create a spider
scrapy genspider spider_name demo.com
- Run the spider
scrapy crawl spider_name
Notes:
1. If you use the commands above in a terminal, first cd into the folder where the Scrapy project should live, then run the steps above.
2. If you work inside an IDE, run the creation commands in the IDE's integrated terminal (Terminal). To run the spider from the IDE, create a .py file at the top level of the Scrapy project.
Put the following into that .py file; running the file runs the spider:
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'spider_name'])
or, equivalently:
cmdline.execute("scrapy crawl spider_name".split())
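For reference, scrapy startproject project_name generates a layout roughly like the sketch below (__init__.py files omitted; project_name is a placeholder); the next section walks through each file:
project_name/
├── scrapy.cfg            # project configuration / entry point
└── project_name/
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders created with scrapy genspider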
4. Scrapy project structure
- pipelines: the item pipelines
- 1. Process the data handed over by the engine
- 2. There can be multiple pipelines; the smaller the number assigned to a pipeline, the higher its priority
- 3. The pipeline method name
process_item()
must not be renamed to anything else
- items
- Defines the item objects that the spider's data is packed into (see the items.py sketch after this list)
- settings
- Stores shared configuration variables
- spiders
- The crawler code, which parses the response data returned via the engine
- middlewares
- 1. Define a custom middleware
- 2. Override the
process_request(self, request, spider)
method
- 3. e.g. implement a random User-Agent
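For example, the ChapterspiderItem used by the novel spider in section 9 would be declared in items.py roughly as in this minimal sketch (only the two fields that spider actually uses are shown):
# items.py -- declares the fields an item may carry
import scrapy

class ChapterspiderItem(scrapy.Item):
    chapter_name = scrapy.Field()  # chapter title
    chapter_text = scrapy.Field()  # chapter body text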
5. Other Scrapy settings
Before writing the spider, make a few changes in settings.py:
- Change 1:
add
LOG_LEVEL = 'WARNING'
so that only warnings and above are logged
- Change 2:
ROBOTSTXT_OBEY = False
change True to False
- Change 3:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
add default request headers
- Change 4:
ITEM_PIPELINES = {
    'Poetry.pipelines.PoetryPipeline': 300,
}
enable the pipeline
6. Middleware
- Downloader middleware
- Spider middleware
- Custom middleware
Requirement: add a random User-Agent.
In the downloader middleware, override
process_request(self, request, spider):
Parameters
request: the Request object being sent
spider: the Spider that issued the request
By default this method returns None (return None)
None: the normal case; the request keeps being processed as usual
Response: returning a Response object skips the downloader and process_response is executed directly
Request: returning a Request object stops the current chain; the engine schedules the returned request instead
process_response(self, request, response, spider):
7. More on middleware
Summary
1. Set a random User-Agent in the middlewares file
2. Override the process_request method
def process_request(self, request, spider):
    user_agent = random.choice(self.USER_AGENTS)
    request.headers['User-Agent'] = user_agent
3. Don't forget to enable the middleware in the settings file
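Putting the pieces together, a random User-Agent downloader middleware could look like the sketch below. The class name RandomUserAgentMiddleware and the contents of the USER_AGENTS list are illustrative assumptions, not code from the project:
# middlewares.py -- minimal sketch of a random User-Agent middleware (assumed class name)
import random

class RandomUserAgentMiddleware:
    # sample UA strings; in practice keep a longer list or use fake-useragent
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # let the request continue through the normal flow

To enable it, add the class path to DOWNLOADER_MIDDLEWARES in settings.py (project name assumed to be Poetry, as in the settings file shown later):
DOWNLOADER_MIDDLEWARES = {
    'Poetry.middlewares.RandomUserAgentMiddleware': 543,
}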
Extension
If you do not want the request to be filtered out by the de-duplication check, use:
# dont_filter=True -- defaults to False, which enables de-duplication
yield scrapy.Request(self.start_urls[0], dont_filter=True)
fake-useragent can generate random User-Agents
1. Install: pip3 install fake-useragent
2. Usage
1. get a random UA
2. ua.random
3. generate a UA for a specified browser
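A minimal usage sketch (the attribute names come from the fake-useragent library; exact behavior can vary between versions):
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random User-Agent string
print(ua.chrome)   # a User-Agent for a specific browser, e.g. Chrome
print(ua.firefox)  # ... or Firefox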
8. Simulated login with Scrapy
How to simulate a login
- 1. Carry cookies in the request to simulate a logged-in session
- 2. Find the login API endpoint and send a POST request (submitting the account and password)
- 3. Log in through Selenium (see the sketch below)
- load the driver
- open the login page
- find the corresponding input tags and type in the text
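A minimal Selenium sketch of those three steps; the URL, element locators, and credentials below are placeholders for illustration, not the real Renren login form:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                          # load the driver
driver.get('https://example.com/login')              # open the login page (placeholder URL)
driver.find_element(By.NAME, 'email').send_keys('user@example.com')    # fill the input tags
driver.find_element(By.NAME, 'password').send_keys('secret')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()  # submit the form
cookies = driver.get_cookies()                       # the cookies can then be handed to Scrapy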
Simulated login to Renren
http://www.renren.com/975937712/profile -- the Renren personal profile page used in this example
Find the cookies
1. This can be done through middlewares
2. Analyzing the source code, we find the start_requests() method
# save the page to a file
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(response.body.decode())
Summary
1. The parse function already receives a response, so we need start_requests() to send the request to the start URL and carry the cookies along
the start_requests() method
Inside this method, do not pass the cookies through headers; use the cookies= argument:
# issue the request
yield scrapy.Request(
    url=self.start_urls[0],
    # handle the result
    callback=self.parse,
    # headers=headers
    cookies=cookies
)
2. Converting the cookie string into the right format
Use a dict comprehension:
cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
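As a quick sanity check, here is the same comprehension applied to a short, made-up cookie string:
raw = 'anonymid=abc123; depovince=GW; ver=7.0'   # made-up example values
cookies = {i.split('=')[0]: i.split('=')[1] for i in raw.split('; ')}
# -> {'anonymid': 'abc123', 'depovince': 'GW', 'ver': '7.0'}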
renren.py
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/975937712/profile']  # profile page

    # override the start_requests() method
    def start_requests(self):
        # cookie string copied from the browser (the leading "Cookie: " header name must be stripped)
        cookies = 'anonymid=kn8jumirb7ee2k; depovince=GW; _r01_=1; taihe_bi_sdk_uid=422fb307325599aa438d408d4bb06a37; taihe_bi_sdk_session=60c3421a289deb3afd889f15a9ec0aad; JSESSIONID=abcF3UTXoHDWr75pvwWIx; t=88d15e116307bd1cbad7ed5b0b9aa5d12; societyguester=88d15e116307bd1cbad7ed5b0b9aa5d12; id=975937712; xnsid=4fa3ce39; jebecookies=f2fe4658-7b93-40b5-8ce2-f50fe03ed175|||||; ver=7.0; loginfrom=null; XNESSESSIONID=4400bd141850; wp_fold=0'
        # turn the cookie string into a dict
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
        # header-based alternative (not used; pass cookies= instead)
        # headers = {'Cookie': cookies}
        # issue the request
        yield scrapy.Request(
            url=self.start_urls[0],
            # handle the result
            callback=self.parse,
            # headers=headers
            cookies=cookies
        )

    def parse(self, response):
        # print(response.body.decode())
        # save the page to a file
        with open('renren.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import json
from itemadapter import ItemAdapter


class PoetryPipeline:
    def open_spider(self, spider):
        # called once when the spider opens; open the output file here
        self.gushiwen = open('古诗文.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable instead of escaped
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.gushiwen.write(item_json + '\n')
        # print(item)
        return item

    def close_spider(self, spider):
        # called once when the spider closes; close the file
        self.gushiwen.close()
settings.py
# Scrapy settings for Poetry project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Poetry'
SPIDER_MODULES = ['Poetry.spiders']
NEWSPIDER_MODULE = 'Poetry.spiders'
LOG_LEVEL = 'WARNING'  # set the log level
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Poetry (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middleware
#SPIDER_MIDDLEWARES = {
# 'Poetry.middlewares.PoetrySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middleware
#DOWNLOADER_MIDDLEWARES = {
# 'Poetry.middlewares.PoetryDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipeline (enabled)
ITEM_PIPELINES = {
'Poetry.pipelines.PoetryPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
9. Pagination and detail pages with Scrapy
Approach 1:
- In the starting page, find the URL of the next page and extract it with response.xpath(),
rebuild the absolute URL with response.urljoin(),
and then re-request the new URL with yield scrapy.Request(url=xx, callback=self.parse)
so the new page goes through the same parse method.
Spider source code:
import scrapy
from ChapterSpider.items import ChapterspiderItem


class YinnegzheSpider(scrapy.Spider):
    name = 'yinnegzhe'
    allowed_domains = ['bxwxorg.com']
    start_urls = ['https://www.bxwxorg.com/read/121200/639119.html']

    def parse(self, response):
        # chapter title
        chapter_name = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@class="bookname"]/h1/text()').extract_first()
        # print(chapter_name)
        # chapter body paragraphs
        chapter_contents = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@id="content"]/p/text()').extract()
        # print(chapter_contents)
        chapter_text = '\n'.join(chapter_contents)
        # print(chapter_text)
        item = ChapterspiderItem()
        item['chapter_name'] = chapter_name
        item['chapter_text'] = chapter_text
        yield item
        # link to the next chapter
        chapter_href = response.xpath('//div[@class="content_read"]/div[@class="box_con"]/div[@class="bottem2"]/a/@href').getall()[3]
        print(chapter_href)
        chapter_url = response.urljoin(chapter_href)
        # print(chapter_url)
        if chapter_href == '/read/121200/':
            # the "next" link points back to the table of contents: last chapter reached
            print('Novel crawler finished!')
        else:
            yield scrapy.Request(
                url=chapter_url,
                callback=self.parse
            )
Note: this is the source of a novel spider. The check if chapter_href == '/read/121200/':
detects, after repeated page turning, that the last page has been reached, so the spider ends cleanly instead of raising an error.
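As a side note, Scrapy also provides response.follow(), which accepts a relative URL directly, so the urljoin() step can be skipped. A minimal sketch (the XPath selector here is hypothetical, not from the spider above):
def parse(self, response):
    # ... extract and yield the item as above ...
    next_href = response.xpath('//a[@rel="next"]/@href').get()  # hypothetical selector
    if next_href:
        # response.follow() resolves the relative URL against the current page
        yield response.follow(next_href, callback=self.parse)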
- scrapy.Request knowledge points
scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=