We used to fetch rendered pages with Selenium + PhantomJS, but PhantomJS is no longer maintained and Selenium has dropped support for it. Fortunately we have Headless Chrome, a mode of Chrome that provides the same capability PhantomJS did.
Usage, taking JD.com product listings as an example: a search results page shows 30 products, and scrolling down lazy-loads another 30.
We will use Headless Chrome to simulate that scroll and grab all 60 products on a page.
The core idea of the Selenium integration is that, in the downloader middleware's process_request, we turn our request into a Selenium-driven one.
jd.py
# -*- coding: utf-8 -*-
import scrapy
from jingdong.items import JingdongItem
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E7%BE%8E%E9%A3%9F&enc=utf-8']

    def __init__(self, *args, **kwargs):
        super(JdSpider, self).__init__(*args, **kwargs)
        chrome_options = Options()
        # This one flag is all it takes to get the same behavior as PhantomJS
        chrome_options.add_argument('--headless')
        # chrome_options.add_argument('--disable-gpu')
        # Note: Selenium 4 renamed this keyword argument to `options=`
        self.browser = webdriver.Chrome(chrome_options=chrome_options)
        self.browser.set_page_load_timeout(30)

    def closed(self, reason):
        # Called when the spider finishes; quit() ends the whole browser session
        print("spider closed")
        self.browser.quit()

    def parse(self, response):
        goods_list = response.xpath('//div[@id="J_goodsList"]/ul/li')
        for goods in goods_list:
            item = JingdongItem()
            item['goods_name'] = goods.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()').extract_first()
            item['price'] = goods.xpath('.//div[@class="p-price"]/strong/i/text()').extract_first()
            item['shop'] = goods.xpath('.//div[@class="p-shop"]/span/a/text()').extract_first()
            yield item
middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import time

from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException


class JingdongDownloaderMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == 'jd':
            try:
                spider.browser.get(request.url)
                # Scroll to the bottom so the lazy-loaded second batch of 30 goods renders
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            time.sleep(2)
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
Remember the meaning of process_request's return value: if it returns None, Scrapy continues its normal processing flow; if it returns a Request, Scrapy reschedules that request from the start; if it returns a Response, the remaining process_request and process_exception methods are skipped entirely, as if the download result were already in hand, and the response enters the response-handling flow.
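The dispatch described above can be sketched in plain Python. This is a simplified model of the middleware chain, not Scrapy's actual engine code, and the Request/Response stand-in classes and the selenium_like_mw function are illustrative assumptions:

```python
# Simplified model (NOT Scrapy's real code) of how the downloader-middleware
# chain treats the three possible return values of process_request.

class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, body=""):
        self.url = url
        self.body = body

def run_download(request, middlewares, downloader):
    """None -> keep going; Response -> short-circuit; Request -> reschedule."""
    for mw in middlewares:
        result = mw(request)
        if result is None:
            continue                      # fall through to the next middleware
        if isinstance(result, Response):
            return result                 # skip remaining middlewares and the downloader
        if isinstance(result, Request):
            return run_download(result, middlewares, downloader)  # rescheduled
    return downloader(request)            # no middleware intervened

# A middleware that "renders" jd.com pages itself, like the Selenium one above.
def selenium_like_mw(request):
    if "jd.com" in request.url:
        return Response(request.url, body="<html>rendered</html>")
    return None                           # let other URLs go to the normal downloader

plain_downloader = lambda req: Response(req.url, body="<html>raw</html>")

print(run_download(Request("https://search.jd.com/Search"), [selenium_like_mw], plain_downloader).body)
print(run_download(Request("https://example.com"), [selenium_like_mw], plain_downloader).body)
```

The jd.com request never reaches the plain downloader, while the other URL does, which is exactly the short-circuit behavior the middleware above relies on.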
So what does Selenium do here? It effectively intercepts the original request, performs the request itself, and returns the result; the request is never downloaded by Scrapy's downloader, because Selenium has taken over that step.
Those two files are the key steps; the remaining settings and items are no different from any other spider: enable what needs enabling, write what needs writing.
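As a sketch of those remaining pieces: the item fields below follow what jd.py populates, while the middleware path and priority in settings.py are assumptions based on a standard Scrapy project layout named jingdong, not confirmed by the original.

```python
# items.py -- fields matching what jd.py assigns
import scrapy

class JingdongItem(scrapy.Item):
    goods_name = scrapy.Field()
    price = scrapy.Field()
    shop = scrapy.Field()

# settings.py -- enable the downloader middleware (path and priority are assumptions)
DOWNLOADER_MIDDLEWARES = {
    'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
}
```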