pip install scrapy-splash
Install Docker and start it, then pull and run the Splash image (8050 is the HTTP port Scrapy will talk to):
docker pull scrapinghub/splash
docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Configure Scrapy's settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware dedup filter and HTTP cache
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Spider configuration:
import scrapy
from scrapy_splash import SplashRequest

class ZhCambridgeSpider(scrapy.Spider):
    name = 'zh_Cambridge'
    custom_settings = {
        'HTTPERROR_ALLOWED_CODES': [503],
        'DOWNLOAD_TIMEOUT': 40,
        'RETRY_TIMES': 3,
    }

    # Lua script executed by Splash (the /run endpoint wraps it in main());
    # note that Lua comments use "--", not "#"
    script = '''
    splash:go(args.url)  -- URL to render
    splash:wait(20)      -- wait for JavaScript to finish loading
    return {
        html = splash:html()
    }
    '''
    def start_requests(self):
        yield SplashRequest(
            url='https://www.cambridge.org/core/what-we-publish/journals',
            endpoint='run',
            args={'lua_source': self.script},
            callback=self.journal,
        )
    def journal(self, response):
        print(response.text)
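For reference, a SplashRequest with endpoint='run' ultimately POSTs a JSON payload to Splash's /run HTTP endpoint (SPLASH_URL + '/run'). A minimal, stdlib-only sketch of that payload is below; the exact field set scrapy-splash sends may include more options, so treat this as an approximation:

```python
import json

# Script body mirroring the spider above; /run wraps it in
# "function main(splash, args) ... end", so args.url works directly.
script = '''
splash:go(args.url)
splash:wait(20)
return { html = splash:html() }
'''

# Approximate JSON body posted to http://localhost:8050/run
payload = {
    'lua_source': script,
    'url': 'https://www.cambridge.org/core/what-we-publish/journals',
}
body = json.dumps(payload)
print(body[:50])
```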
splash官网 https://splash.readthedocs.io/en/stable/
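Once journal() receives the rendered page, the HTML it prints can be parsed for journal links. A stdlib-only sketch using html.parser (in a real spider you would use Scrapy selectors instead; the sample HTML and link paths here are made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Hypothetical fragment standing in for the rendered response.text
sample = ('<ul><li><a href="/core/journals/ajil">AJIL</a></li>'
          '<li><a href="/core/journals/apsr">APSR</a></li></ul>')

parser = LinkExtractor()
parser.feed(sample)
print(parser.links)  # → ['/core/journals/ajil', '/core/journals/apsr']
```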