Preparation:
An Ubuntu virtual machine with miniconda3, scrapyd, Redis, and MongoDB installed; the host machine has gerapy==0.9.12 installed.
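As a rough sketch, the Python-side dependencies can be installed like this (Redis and MongoDB themselves are system services installed separately, e.g. via apt; only the gerapy version is pinned in this setup):

# On the Ubuntu VM (inside the miniconda3 environment)
pip install scrapy scrapy-redis scrapyd pymongo

# On the host machine, ideally in its own virtual environment (see Part 2)
pip install gerapy==0.9.12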
Part 1:
The code
First, create a Scrapy project.
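For reference, the project can be scaffolded with the usual Scrapy commands (the project name TX_work matches the settings shown later; the domain passed to genspider is just a starting point, since the generated spider is then rewritten to use RedisSpider):

scrapy startproject TX_work
cd TX_work
scrapy genspider job careers.tencent.com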
The spider code is as follows:
from scrapy.http import HtmlResponse, Request
from scrapy_redis.spiders import RedisSpider


class JobSpider(RedisSpider):
    name = "job"
    redis_key = 'job:start_urls'

    def start_requests(self):
        start_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10'
        self.server.lpush(self.redis_key, start_url)
        # Make sure to return the parent class's generator
        yield from super().start_requests()

    def parse(self, response: HtmlResponse, **kwargs):
        print("Parsing:", response.url)
        job_list = response.json()['Data']['Posts']
        for job in job_list:
            item = dict()
            item['RecruitPostName'] = job['RecruitPostName']
            item['Responsibility'] = job['Responsibility']
            item['RequireWorkYearsName'] = job['RequireWorkYearsName']
            yield item
        yield from self.next_page()

    def next_page(self):
        for page in range(2, 20):
            url = f'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={page}&pageSize=10'
            yield Request(url=url, callback=self.parse)
The overall flow
1. When the spider starts:
The start_requests method is called.
The initial URL is pushed to the Redis list job:start_urls.
The parent class's start_requests is then invoked to start processing URLs fetched from the Redis list.
2. Handling the initial request:
A request is sent for the URL obtained in start_requests, and the response is handled by the parse method.
3. The parse method runs:
The response is parsed as JSON.
The job postings are extracted, and each resulting item is passed to the Scrapy pipeline.
The next_page method is called to generate requests for the following pages.
4. The next_page method generates the paginated requests:
Requests for pages 2 through 19 are generated, each with parse as its callback.
This ensures the data on subsequent pages keeps getting processed.
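For comparison, without the lpush call in start_requests the start URL would have to be seeded by hand, for example with redis-cli against the VM (the key name job:start_urls comes from redis_key in the spider; the IP matches the REDIS_URL in settings):

redis-cli -h 192.168.72.129 lpush job:start_urls "https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10"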
Middleware configuration:
import random


class UserAgentDownloaderMiddleware:
    USER_AGENTS_LIST = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR "
        "3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR "
        "2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET "
        "CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) "
        "Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]

    def process_request(self, request, spider):
        print("------ downloader middleware ------")
        # Pick a random User-Agent for this request
        user_agent = random.choice(self.USER_AGENTS_LIST)
        request.headers['User-Agent'] = user_agent
        return None
This gives every request a randomized User-Agent header.
The pipeline is as follows:
import pymongo


class TxWorkPipeline:
    def __init__(self):
        self.client = None
        self.db = None
        self.collection = None

    def open_spider(self, spider):
        self.client = pymongo.MongoClient()
        self.db = self.client['py_spider']
        self.collection = self.db['tx_data']
        print('MongoDB connection established')

    def close_spider(self, spider):
        self.client.close()
        print('MongoDB connection closed')

    def process_item(self, item, spider):
        self.collection.insert_one(item)
        print('Item written to MongoDB')
        return item
MongoDB here uses the default (local) connection, so the data is stored on the Ubuntu VM, because I will be running the scrapyd service on that VM shortly.
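If the spider ever needs to run on a machine other than the VM, one option is to read the Mongo address from settings instead of relying on the default localhost connection. This is only a sketch of that variant, not the pipeline used in this project; MONGO_URI is an assumed setting name:

import pymongo

class ConfigurableMongoPipeline:
    """Sketch: same behaviour as TxWorkPipeline, but the Mongo URI comes from settings."""

    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. MONGO_URI = 'mongodb://192.168.72.129:27017' in settings.py (assumed setting name)
        return cls(mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client['py_spider']['tx_data']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()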
settings.py configuration:
# Scrapy settings for TX_work project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "TX_work"
SPIDER_MODULES = ["TX_work.spiders"]
NEWSPIDER_MODULE = "TX_work.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "TX_work (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# "TX_work.middlewares.TxWorkSpiderMiddleware": 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
"TX_work.middlewares.UserAgentDownloaderMiddleware": 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"TX_work.pipelines.TxWorkPipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
"""scrapy-redis配置"""
# 持久化配置
SCHEDULER_PERSIST = True
# 使用scrapy-redis调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# scrapy-redis指纹过滤器
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# redis链接地址
REDIS_URL = 'redis://192.168.72.129:6379/0'
# 任务的优先级别
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
At this point the scrapy-redis project is complete!
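Before deploying, it is worth a quick check that the Redis instance on the VM is reachable from outside (adjust the IP to your own VM; this assumes Redis is configured to listen on the VM's network interface rather than only on localhost):

redis-cli -h 192.168.72.129 -p 6379 ping
# expected reply: PONG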
Part 2
Starting scrapyd on the VM and deploying gerapy on the local machine
1. Starting scrapyd on the VM
Note that scrapyd needs a scrapyd.conf file configured here so that the host machine can reach it.
As shown in the figure:
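A minimal sketch of that file; the key point is binding scrapyd to 0.0.0.0 instead of 127.0.0.1 so the host machine can reach it on port 6800 (other options are left at their defaults):

# scrapyd.conf, placed e.g. in the directory where scrapyd is launched
[scrapyd]
bind_address = 0.0.0.0
http_port    = 6800

With that in place, scrapyd is started on the VM by simply running the scrapyd command.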
Once the scrapyd service is up, we start gerapy on the local machine. When using gerapy it is best to create a separate virtual environment, because gerapy pulls in its own pinned version of Scrapy.
gerapy init  # create the gerapy working directory
cd gerapy
gerapy migrate  # sync the SQLite database
gerapy createsuperuser  # create the superuser; set both username and password to admin, skip the email by pressing Enter
gerapy runserver  # start the service, available at 127.0.0.1:8000
The figure shows a gerapy project that is already configured:
This means gerapy has started successfully!
Open http://127.0.0.1:8000/
After logging in, the interface looks like this:
Next, create a host (client):
The name can be anything, the IP is your VM's IP, the port is 6800, leave authentication unchecked, then create it.
Then go to project deployment and upload the whole Scrapy project folder we just wrote, compressed into a zip archive (see the packaging sketch below).
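For reference, the zip can be created from the directory that contains the project folder (a hedged example; the folder name TX_work comes from the settings above, and the exact layout gerapy expects may vary slightly between versions):

# run from the parent directory of the Scrapy project
zip -r TX_work.zip TX_work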
After that, click deploy, and you can then schedule tasks on the host.
Click run to execute the crawl; it can also be paused partway through.
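Under the hood, gerapy's run button drives scrapyd's HTTP JSON API on the VM. As a reference (not something gerapy requires you to do), the same job could be started manually with curl, assuming the project was deployed under the name TX_work:

curl http://192.168.72.129:6800/schedule.json -d project=TX_work -d spider=job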
Part 3
Checking the results:
Viewing in Tiny RDM:
These are the hashes (fingerprints) of all request URLs that went through the dedup filter.
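The same thing can be inspected from the command line. With scrapy-redis's default key naming, the fingerprints for this spider should live in a set called job:dupefilter (the key name is an assumption based on the default SCHEDULER_DUPEFILTER_KEY template):

redis-cli -h 192.168.72.129 scard job:dupefilter        # number of deduplicated request fingerprints
redis-cli -h 192.168.72.129 srandmember job:dupefilter  # sample one fingerprint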
Checking the MongoDB collection:
You can see the data has been saved successfully!
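For a quick check from the VM's shell (the database and collection names come from the pipeline above; mongosh is assumed, use the legacy mongo shell if that is what is installed):

mongosh
use py_spider
db.tx_data.countDocuments()   // total number of stored job items
db.tx_data.findOne()          // inspect one document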
Summary:
What I did differently this time is that the first URL is pushed into Redis automatically instead of being inserted by hand. Since the uploaded file is just a zip archive, nothing inserts the first URL for you; without the automatic push, the spider would simply keep listening on the request queue and wait forever. Along the way, this was also a good refresher on project deployment.