First, the official Splash API documentation: http://splash.readthedocs.io/en/stable/api.html#render-html
1. Installation and preparation
(1) First, install the scrapy-splash library:
pip install scrapy-splash
(2) Then start the Splash Docker container:
docker run -p 8050:8050 scrapinghub/splash
If you have further questions about installing Docker itself, consult the Docker documentation.
2. Configuration
- (1) Put the Splash server address in your settings.py. If you started Splash locally, the address is http://127.0.0.1:8050; mine is:

SPLASH_URL = 'http://192.168.99.100:8050'
- (2) Enable the following middlewares in your downloader middleware setting, DOWNLOADER_MIDDLEWARES, and pay attention to the order in which they are enabled:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Another note:
scrapy_splash.SplashMiddleware (725) must be ordered before the default HttpProxyMiddleware (750); otherwise the scrambled order will break the functionality.
The priority of HttpCompressionMiddleware also has to be changed accordingly (810 here), so that requests are handled correctly.
See https://github.com/scrapy/scrapy/issues/1895 for the issues discussed there.
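The ordering constraint above can be illustrated in plain Python. This is a simplified sketch of the idea (Scrapy sorts download middlewares by ascending priority), not Scrapy's actual code; 750 is HttpProxyMiddleware's default priority in Scrapy's base settings.

```python
# Simplified sketch: Scrapy orders download middlewares by ascending priority.
# The numbers mirror the settings above; 750 is HttpProxyMiddleware's default.
middlewares = {
    'SplashCookiesMiddleware': 723,
    'SplashMiddleware': 725,
    'HttpProxyMiddleware': 750,   # Scrapy default, not set by us
    'HttpCompressionMiddleware': 810,
}
order = [name for name, prio in sorted(middlewares.items(), key=lambda kv: kv[1])]
print(order)
# SplashMiddleware must rewrite the request before HttpProxyMiddleware sees it:
assert order.index('SplashMiddleware') < order.index('HttpProxyMiddleware')
```

If you moved SplashMiddleware above 750, HttpProxyMiddleware would act on the original target URL instead of the rewritten Splash request, which is exactly the "scrambled order" problem mentioned above.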
- (3) Enable the SplashDeduplicateArgsMiddleware spider middleware in settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
- (4) Set a Splash-aware deduplication filter class:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
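Why a Splash-aware dupe filter? Scrapy's default fingerprint would see every Splash request as a request to the same Splash endpoint, so the target URL and rendering arguments must be folded into the fingerprint. The following is a toy illustration of that idea, not scrapy_splash's actual implementation:

```python
import hashlib
import json

def splash_aware_fingerprint(url, splash_args=None):
    # Toy version of the idea behind SplashAwareDupeFilter: the fingerprint
    # covers the target URL *and* the Splash rendering arguments.
    payload = json.dumps({'url': url, 'args': splash_args or {}}, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

a = splash_aware_fingerprint('https://item.jd.com/2600240.html', {'wait': '0.5'})
b = splash_aware_fingerprint('https://item.jd.com/2600240.html', {'wait': '2'})
c = splash_aware_fingerprint('https://item.jd.com/2600240.html', {'wait': '0.5'})
assert a != b   # same URL, different Splash args -> not duplicates
assert a == c   # identical request -> filtered as a duplicate
```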
- (5) If you use Scrapy's HTTP cache system, you also need to enable scrapy-splash's cache storage:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
If you have DEFAULT_REQUEST_HEADERS enabled in your own settings.py, be sure to comment it out. This currently looks like a bug, and I have reported it to the scrapy-splash project.
The bug occurred because the Host in my default_request_headers did not match the sogou site I was crawling, which of course causes errors. I have to say the Scrapy maintainers responded very quickly. Pay attention to these details when adding your own headers.
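The Host mismatch is easy to reproduce in plain Python. This is a hypothetical illustration, not Scrapy code: if DEFAULT_REQUEST_HEADERS pins Host to one site, every request inherits it, even when the target URL points somewhere else.

```python
from urllib.parse import urlparse

# Hypothetical DEFAULT_REQUEST_HEADERS pinned to sogou.com:
DEFAULT_REQUEST_HEADERS = {'Host': 'www.sogou.com', 'Accept': 'text/html'}

target = 'https://item.jd.com/2600240.html'
headers = dict(DEFAULT_REQUEST_HEADERS)  # defaults merged into every request

# The pinned Host no longer matches the host we are actually fetching:
assert headers['Host'] != urlparse(target).netloc
print(headers['Host'], 'vs', urlparse(target).netloc)
```

The safe approach is to set per-request headers (or let the HTTP stack derive Host from the URL) rather than pinning Host globally.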
Code
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class SplashSpider(Spider):
    name = 'scrapy_splash'
    # Main address: a JD product page whose price is generated by Ajax.
    start_urls = [
        'https://item.jd.com/2600240.html'
    ]
    # allowed_domains = [
    #     'sogou.com'
    # ]

    # Requests need to be wrapped as SplashRequest
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.parse,
                                args={'wait': '0.5'},
                                # endpoint='render.json'
                                )

    def parse(self, response):
        print("############" + response.url)
        # Write the rendered page to a file.
        with open("html.txt", "wb") as fo:
            fo.write(response.body)
        # This spider only fetches one JD product page; the Ajax-generated
        # price is present in the rendered HTML saved to html.txt.
        # To keep crawling, use CrawlSpider or uncomment the block below:
        # site = Selector(response)
        # links = site.xpath('//a/@href')
        # for link in links:
        #     linkstr = link.extract()
        #     print("*****" + linkstr)
        #     yield SplashRequest(linkstr, callback=self.parse)
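The commented-out block at the end relies on Scrapy's Selector; the same href extraction can be sketched with the standard library alone. This is only an illustration of what xpath('//a/@href') collects, not the spider's actual code:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href attribute of <a> tags, like xpath('//a/@href')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="https://item.jd.com/2600240.html">item</a>'
               ' <a name="x">no href</a>')
print(collector.links)  # → ['https://item.jd.com/2600240.html']
```

In the real spider each extracted link would be wrapped in a SplashRequest again, so every followed page also goes through the Splash renderer.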