scrapy-splash+docker

pip install scrapy-splash
Install Docker and make sure the daemon is running.
Pull the Splash image: docker pull scrapinghub/splash
Run the container: docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
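Before wiring Splash into Scrapy, it is worth checking that the container is actually reachable. A quick smoke test (assuming Splash is listening on localhost:8050 as in the `docker run` command above) is to ask its `render.html` endpoint for a page:

```shell
# Should return the rendered HTML of example.com if Splash is up
curl 'http://localhost:8050/render.html?url=https://example.com&wait=1'
```

You can also open http://localhost:8050 in a browser to reach the Splash UI.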
Scrapy settings configuration
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Deduplication filter and cache storage that are aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
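The Splash-aware dupefilter is needed because two requests to the same URL with different Splash arguments (e.g. different Lua scripts or wait times) render different pages and must not be treated as duplicates. The sketch below illustrates the idea with the standard library only; it is not scrapy-splash's actual fingerprinting code, and the `fingerprint` helper is a made-up name for illustration:

```python
import hashlib
import json

# Illustration only: a fingerprint that mixes the Splash arguments into
# the hash, so identical URLs with different render options differ.
def fingerprint(url, splash_args=None):
    payload = json.dumps({'url': url, 'splash': splash_args or {}},
                         sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()

same_url = 'https://example.com'
fp_plain = fingerprint(same_url)
fp_splash = fingerprint(same_url, {'wait': 5})
print(fp_plain != fp_splash)  # the two requests get distinct fingerprints
```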
Spider configuration
import scrapy
from scrapy_splash import SplashRequest


class ZhCambridgeSpider(scrapy.Spider):
    name = 'zh_Cambridge'
    custom_settings = {
        'HTTPERROR_ALLOWED_CODES': [503],
        'DOWNLOAD_TIMEOUT': 40,
        'RETRY_TIMES': 3
    }

    # Lua script executed by the Splash 'run' endpoint
    script = '''
    splash:go(args.url)  -- the URL to render
    splash:wait(20)      -- wait time in seconds
    return {
        html = splash:html()
    }
    '''

    def start_requests(self):
        yield SplashRequest(url='https://www.cambridge.org/core/what-we-publish/journals',
                            endpoint='run',
                            args={'lua_source': self.script},
                            callback=self.journal)

    def journal(self, response):
        print(response.text)
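In the `journal` callback you would normally extract data with `response.css()` or `response.xpath()` rather than printing the raw HTML. As a self-contained sketch using only the standard library (the sample HTML and the hrefs are assumptions for illustration, not the real Cambridge markup):

```python
from html.parser import HTMLParser

# Minimal link extractor: collect every <a href="..."> from the HTML
# that Splash returned. In a real spider, response.css('a::attr(href)')
# does the same job in one line.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.links.append(attrs['href'])

html = '<html><body><a href="/core/journal/example">Example Journal</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```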

Splash documentation: https://splash.readthedocs.io/en/stable/
