Prerequisites
docker
scrapy (plus the scrapy-splash plugin used below)
Pages that are rendered by JavaScript are awkward to scrape with Scrapy alone. Splash is a service that loads and renders such pages, and it integrates with Scrapy. Because both Splash and Scrapy handle requests asynchronously, this combination is much faster than Selenium: when Scrapy is paired with Selenium, each page is rendered inside a Downloader Middleware, so the whole process is blocking and Scrapy must wait for the render to finish before it can schedule other requests, which hurts crawl throughput. Splash therefore gives a considerably higher crawl rate than Selenium.
First install Docker, then pull the image: docker pull scrapinghub/splash
Start Splash: docker run -p 8050:8050 scrapinghub/splash
Then check that it is reachable: curl http://localhost:8050
If the firewall has already been opened up (or disabled), Splash will also be reachable from remote machines.
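Before wiring Splash into Scrapy, you can exercise its HTTP API directly. A minimal sketch (assuming Splash is listening on localhost:8050; the target URL and the helper name are placeholders of mine) that builds a render.html request URL — fetching it with any HTTP client returns the fully rendered page source:

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"

def render_html_url(target: str, wait: float = 0.5) -> str:
    """Build a URL for Splash's render.html endpoint.

    'url' and 'wait' are standard render.html parameters.
    """
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

# e.g. requests.get(render_html_url("https://example.com")).text
```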
Next, configure Scrapy by adding the following to settings.py:
# Splash server URL and the Splash-aware dupe filter
SPLASH_URL = 'http://192.168.99.100:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'splash_163.middlewares.Splash163DownloaderMiddleware': 543,
}
# spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
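If the project also uses Scrapy's HTTP cache, the scrapy-splash README additionally recommends a Splash-aware cache storage; an optional addition to settings.py:

```python
# optional: HTTP cache storage that is aware of Splash requests
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```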
Then yield Splash requests from the spider:
from scrapy_splash import SplashRequest
...
...
yield SplashRequest(url, callback=self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
)
Alternatively, we can build a plain Request object and configure Splash through its meta attribute:
import scrapy
import scrapy_splash  # needed for SlotPolicy below

yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',     # optional; default is render.json
        'splash_url': '<url>',         # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},          # optional; a dict with headers sent to Splash
        'dont_process_response': True, # optional, default is False
        'dont_send_headers': True,     # optional, default is False
        'magic_response': False,       # optional, default is True
    }
})
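With the render.json endpoint and args {'html': 1, 'png': 1}, Splash answers with a JSON object whose 'png' field is a base64-encoded screenshot; scrapy-splash exposes the decoded JSON as response.data in the callback. A small sketch of persisting that screenshot — the helper name is mine, not part of either library:

```python
import base64

def save_screenshot(data: dict, path: str) -> bool:
    """Write the base64 'png' field of a render.json payload to disk.

    Returns False when the payload carries no screenshot.
    """
    png_b64 = data.get("png")
    if not png_b64:
        return False
    with open(path, "wb") as fh:
        fh.write(base64.b64decode(png_b64))
    return True

# inside a callback: save_screenshot(response.data, "page.png")
```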
When we need scripted interaction with the page, a Lua script is required. A Lua script can do what Selenium would: load the page, then simulate clicks to page through results.
script = """
function main(splash, args)
  -- args.url, args.wait and args.page are filled in by the SplashRequest below
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  local js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d;document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:png()
end
"""
from urllib.parse import quote

from scrapy import Spider
from scrapy_splash import SplashRequest


class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['s.taobao.com']
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('PAGE_NUM') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 3})
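The spider above pulls KEYWORDS and PAGE_NUM from the project settings, so they must be defined in settings.py; the values here are examples:

```python
# settings.py -- consumed by TaobaoSpider.start_requests
KEYWORDS = ['羽毛球']   # search terms to crawl
PAGE_NUM = 5            # pages to walk through per keyword
```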
While we are at it, here is a Lua script for making a POST request:
script = """
function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  local response = splash:http_post{url=args.url,
                                    headers={["content-type"]="application/json"},
                                    body=json.encode({keywords="园林"})}
  splash:wait(10)
  return {
    html = treat.as_string(response.body),
    url = response.url,
    status = response.status
  }
end
"""