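The approach is to enable Chrome's performance log through DesiredCapabilities and then read the resource URLs from the page-load log. As before, this is implemented in the DownloaderMiddleware:
(1) Set the prefs on the webdriver: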
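from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Chrome content-setting prefs (2 = block images). They only take effect
# once attached with chrome_options.add_experimental_option("prefs", prefs).
prefs = {
    "profile.managed_default_content_settings.images": 2
}
# Copy the shared capabilities dict before changing it
d = DesiredCapabilities.CHROME.copy()
# Record every network event in the performance log
d['goog:loggingPrefs'] = {'performance': 'ALL'}
spider.driver = webdriver.Chrome(desired_capabilities=d,
                                 chrome_options=chrome_options)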
(2) After the webdriver has opened the page, read the log and extract the URLs of images or other media resources. The code is as follows:
import json

# Fetch every network event recorded in the performance log
lo = spider.driver.get_log('performance')
# Group the response URLs by Content-Type
datalist = {}
for entry in lo:
    try:
        m = json.loads(
            entry['message'])['message']["params"]["response"]
        k = m['headers']['Content-Type']
        url = m['url']
    except (KeyError, ValueError):
        # Not a response event, or no Content-Type header: skip it
        continue
    datalist.setdefault(k, []).append(url)
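The loop above only groups URLs by their raw Content-Type value; picking out the images and other media is left as a separate step. A minimal sketch of that step follows, assuming datalist has the shape built above. The helper name pick_media_urls, the prefix list, and the sample data are all hypothetical and are not part of the original middleware.

```python
# Hypothetical helper, not part of the original middleware.
MEDIA_PREFIXES = ("image/", "video/", "audio/")


def pick_media_urls(datalist, prefixes=MEDIA_PREFIXES):
    """Return the de-duplicated URLs whose Content-Type is a media type."""
    urls = []
    seen = set()
    for content_type, type_urls in datalist.items():
        # Drop parameters such as "; charset=utf-8" and normalize the case
        mime = content_type.split(";", 1)[0].strip().lower()
        if not mime.startswith(prefixes):
            continue
        for url in type_urls:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Made-up sample shaped like the datalist built from the performance log
sample = {
    "text/html; charset=utf-8": ["https://example.com/"],
    "image/png": ["https://example.com/a.png", "https://example.com/a.png"],
    "Image/JPEG": ["https://example.com/b.jpg"],
    "video/mp4": ["https://example.com/c.mp4"],
}
print(pick_media_urls(sample))
```

Normalizing the key before matching matters because servers may append parameters to the Content-Type or vary its capitalization, so the same media type can appear under several keys in datalist.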
The collected data can then be returned to the spider for parsing via an HtmlResponse.