Using Scrapy and scrapy-splash
-
Installation:
Docker official site: https://www.docker.com/
Configure a registry mirror (optional): https://jingyan.baidu.com/article/47a29f24eb42ca80142399ef.html
Pull the scrapinghub/splash image: run `docker pull scrapinghub/splash` in a terminal
-
Verify the installation
Start Splash: run `docker run -p 8050:8050 scrapinghub/splash` in a terminal
The last line of output should read: Server listening on http://0.0.0.0:8050
Open that address in a browser; if the Splash page appears, the installation succeeded
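Besides opening the page in a browser, Splash's `/render.html` HTTP endpoint can be checked from code. A minimal sketch using only the standard library to build such a request URL (the host and `wait` value are assumptions for the local setup above):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://0.0.0.0:8050"  # assumed local Splash instance

def render_url(target, wait=0.5):
    """Build a /render.html URL that asks Splash to render `target`."""
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

# Fetching this URL (e.g. with urllib.request) returns the rendered HTML
print(render_url("http://example.com"))
# → http://0.0.0.0:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```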
Testing dynamically loaded JS
In the Splash UI at http://0.0.0.0:8050, enter a URL and run the following script to test:

```lua
function main(splash, args)
  assert(splash:go(args.url))
  -- core of handling dynamically loaded JS: scroll the page
  local scroll_to = splash:jsfunc("window.scrollTo")
  scroll_to(0, 2800)
  splash:set_viewport_full()
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```
-
Install scrapy-splash
pip install scrapy-splash
-
Concrete steps (outline only)
-
settings.py configuration

```python
# URL of the Splash rendering service
SPLASH_URL = 'http://0.0.0.0:8050'

# Deduplication filter that is aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# HTTP cache storage that is aware of Splash
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
```
Also remember to set the user agent, the robots.txt policy, and the logging level.
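A minimal sketch of those remaining settings (the values below are illustrative; pick your own UA string and log level):

```python
# settings.py (continued) -- illustrative values
ROBOTSTXT_OBEY = False   # ignore robots.txt while testing
LOG_LEVEL = 'WARNING'    # reduce console noise
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
```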
-
Spider file

```python
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # send the request through Splash and wait 1 second for rendering
        yield SplashRequest(url, self.parse, args={'wait': 1})

def parse(self, response):
    ...
```
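When a page needs the scroll script from earlier, SplashRequest can also target the `execute` endpoint and pass the script as `lua_source`. A sketch (the helper name `splash_args` and the scroll distance are illustrative):

```python
# Lua script for Splash's execute endpoint: scroll so lazy content loads
LUA_SCROLL = """
function main(splash, args)
  assert(splash:go(args.url))
  local scroll_to = splash:jsfunc("window.scrollTo")
  scroll_to(0, 2800)
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""

def splash_args(wait=1.0):
    """Arguments to pass to SplashRequest when using endpoint='execute'."""
    return {"lua_source": LUA_SCROLL, "wait": wait}

# In the spider:
#   yield SplashRequest(url, self.parse, endpoint='execute', args=splash_args())
```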
-
The other files are written as usual
-
Run Docker: docker run -p 8050:8050 scrapinghub/splash
-
Run the spider
-