Preparation:
An Ubuntu virtual machine with miniconda3, scrapyd, Redis, and MongoDB installed; the host machine has gerapy==0.9.12 installed.
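As a rough sketch, the Python-side dependencies can be installed like this (Redis and MongoDB themselves are system services installed separately, e.g. via apt; only the gerapy version is pinned in this setup):

# On the Ubuntu VM (inside the miniconda3 environment)
pip install scrapy scrapy-redis scrapyd pymongo

# On the host machine, ideally in its own virtual environment (see Part 2)
pip install gerapy==0.9.12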
Part 1:
The code
First, create a Scrapy project.
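For reference, the project can be scaffolded with the usual Scrapy commands (the project name TX_work matches the settings shown later; the domain passed to genspider is just a starting point, since the generated spider is then rewritten to use RedisSpider):

scrapy startproject TX_work
cd TX_work
scrapy genspider job careers.tencent.com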
The spider code is as follows:
from scrapy.http import HtmlResponse, Request
from scrapy_redis.spiders import RedisSpider


class JobSpider(RedisSpider):
    name = "job"
    redis_key = 'job:start_urls'

    def start_requests(self):
        start_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10'
        self.server.lpush(self.redis_key, start_url)
        # Make sure to return the parent class's generator
        yield from super().start_requests()

    def parse(self, response: HtmlResponse, **kwargs):
        print("Parsing:", response.url)
        job_list = response.json()['Data']['Posts']
        for job in job_list:
            item = dict()
            item['RecruitPostName'] = job['RecruitPostName']
            item['Responsibility'] = job['Responsibility']
            item['RequireWorkYearsName'] = job['RequireWorkYearsName']
            yield item
        yield from self.next_page()

    def next_page(self):
        for page in range(2, 20):
            url = f'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={page}&pageSize=10'
            yield Request(url=url, callback=self.parse)
The overall flow
1. When the spider starts:
The start_requests method is called.
The initial URL is pushed to the Redis list job:start_urls.
The parent class's start_requests is then invoked to start processing URLs fetched from the Redis list.
2. Handling the initial request:
A request is sent for the URL obtained in start_requests, and the response is handled by the parse method.
3. The parse method runs:
The response is parsed as JSON.
The job postings are extracted, and each resulting item is passed to the Scrapy pipeline.
The next_page method is called to generate requests for the following pages.
4. The next_page method generates the paginated requests:
Requests for pages 2 through 19 are generated, each with parse as its callback.
This ensures the data on subsequent pages keeps getting processed.
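For comparison, without the lpush call in start_requests the start URL would have to be seeded by hand, for example with redis-cli against the VM (the key name job:start_urls comes from redis_key in the spider; the IP matches the REDIS_URL in settings):

redis-cli -h 192.168.72.129 lpush job:start_urls "https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10"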
Middleware configuration:
import random


class UserAgentDownloaderMiddleware:
    USER_AGENTS_LIST = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR "
        "3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR "
        "2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET "
        "CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) "
        "Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]

    def process_request(self, request, spider):
        print("------ downloader middleware ------")
        # Pick a random User-Agent for this request
        user_agent = random.choice(self.USER_AGENTS_LIST)
        request.headers['User-Agent'] = user_agent
        return None
This gives every request a randomized User-Agent header.
The pipeline is as follows:
import pymongo


class TxWorkPipeline:
    def __init__(self):
        self.client = None
        self.db = None
        self.collection = None

    def open_spider(self, spider):
        self.client = pymongo.MongoClient()
        self.db = self.client['py_spider']
        self.collection = self.db['tx_data']
        print('MongoDB connection established')

    def close_spider(self, spider):
        self.client.close()
        print('MongoDB connection closed')

    def process_item(self, item, spider):
        self.collection.insert_one(item)
        print('Item written to MongoDB')
        return item
MongoDB here uses the default (local) connection, so the data is stored on the Ubuntu VM, because I will be running the scrapyd service on that VM shortly.
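If the spider ever needs to run on a machine other than the VM, one option is to read the Mongo address from settings instead of relying on the default localhost connection. This is only a sketch of that variant, not the pipeline used in this project; MONGO_URI is an assumed setting name:

import pymongo

class ConfigurableMongoPipeline:
    """Sketch: same behaviour as TxWorkPipeline, but the Mongo URI comes from settings."""

    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. MONGO_URI = 'mongodb://192.168.72.129:27017' in settings.py (assumed setting name)
        return cls(mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client['py_spider']['tx_data']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()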
settings.py configuration:
# Scrapy settings for TX_work project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "TX_work"
SPIDER_MODULES = ["TX_work.spiders"]
NEWSPIDER_MODULE = "TX_work.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "TX_work (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# "TX_work.middlewares.TxWorkSpiderMiddleware": 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
"TX_work.middlewares.UserAgentDownloaderMiddleware": 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"TX_work.pipelines.TxWorkPipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
"""scrapy-redis配置"""
# 持久化配置
SCHEDULER_PERSIST = True
# 使用scrapy-redis调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# scrapy-redis指纹过滤器
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# redis链接地址
REDIS_URL = 'redis://192.168.72.129:6379/0'
# 任务的优先级别
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
At this point the scrapy-redis project is complete!
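Before deploying, it is worth a quick check that the Redis instance on the VM is reachable from outside (adjust the IP to your own VM; this assumes Redis is configured to listen on the VM's network interface rather than only on localhost):

redis-cli -h 192.168.72.129 -p 6379 ping
# expected reply: PONG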
Part 2
Starting scrapyd on the VM and deploying gerapy on the local machine
1. Starting scrapyd on the VM
Note that scrapyd needs a scrapyd.conf file configured here so that the host machine can reach it.
As shown in the figure:
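A minimal sketch of that file; the key point is binding scrapyd to 0.0.0.0 instead of 127.0.0.1 so the host machine can reach it on port 6800 (other options are left at their defaults):

# scrapyd.conf, placed e.g. in the directory where scrapyd is launched
[scrapyd]
bind_address = 0.0.0.0
http_port    = 6800

With that in place, scrapyd is started on the VM by simply running the scrapyd command.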
Once the scrapyd service is up, we start gerapy on the local machine. When using gerapy it is best to create a separate virtual environment, because gerapy pulls in its own pinned version of Scrapy.
gerapy init  # create the gerapy working directory
cd gerapy
gerapy migrate  # sync the SQLite database
gerapy createsuperuser  # create the superuser; set both username and password to admin, skip the email by pressing Enter
gerapy runserver  # start the service, available at 127.0.0.1:8000
The figure shows a gerapy project that is already configured:
This means gerapy has started successfully!
Open http://127.0.0.1:8000/
After logging in, the interface looks like this:
Next, create a host (client):
The name can be anything, the IP is your VM's IP, the port is 6800, leave authentication unchecked, then create it.
Then go to project deployment and upload the whole Scrapy project folder we just wrote, compressed into a zip archive (see the packaging sketch below).
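For reference, the zip can be created from the directory that contains the project folder (a hedged example; the folder name TX_work comes from the settings above, and the exact layout gerapy expects may vary slightly between versions):

# run from the parent directory of the Scrapy project
zip -r TX_work.zip TX_work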
After that, click deploy, and you can then schedule tasks on the host.
Click run to execute the crawl; it can also be paused partway through.
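Under the hood, gerapy's run button drives scrapyd's HTTP JSON API on the VM. As a reference (not something gerapy requires you to do), the same job could be started manually with curl, assuming the project was deployed under the name TX_work:

curl http://192.168.72.129:6800/schedule.json -d project=TX_work -d spider=job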
Part 3
Checking the results:
Viewing in Tiny RDM:
These are the hashes (fingerprints) of all request URLs that went through the dedup filter.
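The same thing can be inspected from the command line. With scrapy-redis's default key naming, the fingerprints for this spider should live in a set called job:dupefilter (the key name is an assumption based on the default SCHEDULER_DUPEFILTER_KEY template):

redis-cli -h 192.168.72.129 scard job:dupefilter        # number of deduplicated request fingerprints
redis-cli -h 192.168.72.129 srandmember job:dupefilter  # sample one fingerprint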
Checking the MongoDB collection:
You can see the data has been saved successfully!
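For a quick check from the VM's shell (the database and collection names come from the pipeline above; mongosh is assumed, use the legacy mongo shell if that is what is installed):

mongosh
use py_spider
db.tx_data.countDocuments()   // total number of stored job items
db.tx_data.findOne()          // inspect one document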
Summary:
What I did differently this time is that the first URL is pushed into Redis automatically instead of being inserted by hand. Since the uploaded file is just a zip archive, nothing inserts the first URL for you; without the automatic push, the spider would simply keep listening on the request queue and wait forever. Along the way, this was also a good refresher on project deployment.