Using Splash requires Docker. I am using Ubuntu 20.04 here.
1. Install Docker and pull the Splash image
sudo apt install docker.io
sudo docker pull scrapinghub/splash
2. Open the ports
After installation, you need to map the container's ports 8050 and 8051 to the host (8050 is Splash's HTTP API endpoint), otherwise the spider cannot reach Splash.
sudo docker run -p 8050:8050 -p 8051:8051 scrapinghub/splash
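Once the container is running, you can sanity-check the Splash HTTP API by requesting its `render.html` endpoint. The sketch below, using only the Python standard library, builds such a request URL; the `localhost:8050` address matches the `SPLASH_URL` setting used later, and the `wait` value is an assumption:

```python
from urllib.parse import urlencode

# Assumed local Splash instance started by the docker run command above
SPLASH = "http://localhost:8050"

def render_url(target, wait=0.5):
    """Build a Splash render.html URL that renders `target` with JavaScript."""
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH}/render.html?{query}"

print(render_url("http://quotes.toscrape.com"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fquotes.toscrape.com&wait=0.5
```

Opening the printed URL in a browser (or with curl) should return the rendered page if Splash is up.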
3. Use Splash in the spider
3.1 Install scrapy-splash
pip3 install scrapy-splash
3.2 Add the following to Scrapy's settings.py
# Splash server address
SPLASH_URL = "http://localhost:8050"
# Enable scrapy-splash's two downloader middlewares and place
# HttpCompressionMiddleware after them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Use the Splash-aware duplicate request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Optional: enable this spider middleware to support cache_args
#SPIDER_MIDDLEWARES = {
#    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
#}
3.3 Write the spider
Below are my project files (only settings.py and quotes.py were modified).
3.4 Run the spider
scrapy crawl quotes -o quotes.csv
Inspect quotes.csv
cat -n quotes.csv
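If you prefer to inspect the output programmatically rather than with cat, the rows can be read back with Python's csv module. The sample text below stands in for quotes.csv, and the text/author columns match the hypothetical spider fields above, not necessarily your actual output:

```python
import csv
import io

# Sample rows standing in for the quotes.csv produced by
# `scrapy crawl quotes -o quotes.csv`.
sample = """text,author
"The world as we have created it is a process of our thinking.",Albert Einstein
"It is our choices that show what we truly are.",J.K. Rowling
"""

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows), rows[0]["author"])
# 2 Albert Einstein
```

To read the real file, replace `io.StringIO(sample)` with `open("quotes.csv", newline="")`.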
4. Results