Integrating Selenium ChromeDriver with Scrapy

References:
ChromeDriver official site
chromedriver-downloads
Running Selenium Headless with Chrome

I. Install the Chrome browser

1. Windows
Check the installed Chrome version via Help -> About Google Chrome.
2. Linux
TODO

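II. Download ChromeDriver

Download links:
https://sites.google.com/a/chromium.org/chromedriver/downloads
Mirror for users in mainland China - http://npm.taobao.org/mirrors/chromedriver/
1. Choose the version that corresponds to your installed Chrome
2. Choose the package for your operating system
For example, the win32 package extracts to chromedriver.exe, and the linux64 package extracts to a chromedriver binary.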
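
The version check in step 1 can be sketched in a few lines. This is a minimal sketch, assuming that "corresponding version" means the same major version number (the usual rule for recent ChromeDriver releases). The helper names and the version strings are hypothetical and only serve as illustration.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Hypothetical version strings, for illustration only
print(versions_match("91.0.4472.124", "91.0.4472.101"))  # True
print(versions_match("91.0.4472.124", "92.0.4515.43"))   # False
```

The Chrome version comes from the About page mentioned in section I; the driver version is part of the download directory name.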

III. Test ChromeDriver

Install selenium first:

pip install selenium

Testing chromedriver on Windows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

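chrome_options = Options()
# Headless mode (no browser window)
chrome_options.add_argument("--headless")

# executable_path works on Selenium 3; Selenium 4 deprecates it
# in favor of passing a Service object
browser = webdriver.Chrome(
    executable_path=r"D:\programs\chromedriver_win32\chromedriver.exe",
    options=chrome_options
)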
browser.get("https://www.baidu.com/")
print("Title: %s" % browser.title)
browser.quit()

Output:

Title: 百度一下,你就知道

Note:
If you comment out the chrome_options.add_argument("--headless") line, a Chrome window pops up and is closed automatically once browser.quit() runs.

IV. Reading JSON responses through ChromeDriver

https://stackoverflow.com/questions/37121843/how-to-get-a-json-response-from-a-google-chrome-selenium-webdriver-client
When the response is JSON, Chrome wraps it in body > pre by default:

<html>
 <head>
  <style></style>
  <script src="chrome-extension://mooikfkahbdckldjjndioackbalphokd/assets/prompt.js"></script>
 </head>
 <body>
  <pre>json content...</pre>
  ...
 </body>
</html>
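
Because the JSON arrives wrapped in a pre element, it has to be unwrapped before parsing. The following is a minimal sketch using only the standard library, assuming the page source has the shape shown above. The class and function names and the sample payload are hypothetical.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text inside the first <pre> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def extract_json(page_source: str):
    """Return the parsed JSON from the first <pre>, or None if absent or invalid."""
    parser = PreTextExtractor()
    parser.feed(page_source)
    parser.close()
    text = "".join(parser.chunks).strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical page source in the shape Chrome produces for a JSON response
sample = '<html><head></head><body><pre>{"code": 0, "items": [1, 2]}</pre></body></html>'
print(extract_json(sample))
```

With Selenium, the input would be driver.page_source. Returning None for pages without a pre element lets the caller fall back to treating the page as ordinary HTML.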

V. Disabling image loading in ChromeDriver

Approach 1: https://tarunlalwani.com/post/selenium-disable-image-loading-different-browsers/

from selenium import webdriver

option = webdriver.ChromeOptions()
chrome_prefs = {}
option.experimental_options["prefs"] = chrome_prefs
# 1 - Allow all images
# 2 - Block all images
# 3 - Block 3rd party images 
chrome_prefs["profile.default_content_settings"] = {"images": 2}
chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}

driver = webdriver.Chrome(options=option)
driver.get("http://www.baidu.com")

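In testing, approach 1 had no effect in headless mode, but it did work once the headless option was removed (that is, with a visible browser window).

Approach 2 (recommended): https://stackoverflow.com/questions/48773031/how-to-prevent-chrome-headless-from-loading-images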

from selenium import webdriver

option = webdriver.ChromeOptions()
# Headless mode
# option.add_argument('--headless')
# Disable image loading
option.add_argument('--blink-settings=imagesEnabled=false')

driver = webdriver.Chrome(options=option)
driver.get("http://www.baidu.com")

In testing, approach 2 worked in both headless and windowed modes.
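
The two switches used in approach 2 can be collected in one place so that headless mode and image blocking are toggled independently. This is only a sketch; the function name build_chrome_arguments is hypothetical, and the flags are the ones already used in this article.

```python
def build_chrome_arguments(headless: bool = True, block_images: bool = True) -> list:
    """Return the Chrome command-line flags used in this article."""
    arguments = []
    if headless:
        # No browser window
        arguments.append("--headless")
    if block_images:
        # Works in both headless and windowed modes (approach 2)
        arguments.append("--blink-settings=imagesEnabled=false")
    return arguments


print(build_chrome_arguments())
print(build_chrome_arguments(headless=False))
```

Each returned flag would then be passed to option.add_argument(flag) before the driver is created.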

VI. Integrating Selenium + ChromeDriver into Scrapy

1. Edit settings.py:

# Path to the ChromeDriver executable
CHROME_DRIVER_PATH = 'D:/programs/chromedriver_win32/chromedriver.exe'
# Downloader middleware that integrates ChromeDriver
# (see the implementation below)
DOWNLOADER_MIDDLEWARES = {
    'mx_crawl_spider.middlewares.MxCrawlSpiderDownloaderMiddleware': 543,
}

2. The downloader middleware that integrates ChromeDriver:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By


class MxCrawlSpiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            # Load the page in Chrome (blocking call)
            spider.logger.info(f"Chrome driver get: {request.url}")
            self.driver.get(request.url)
            # self.driver.execute_script("scroll(0, 1000);")
            # time.sleep(1)
            # Return the rendered content as the response
            return HtmlResponse(url=request.url,
                                body=self.convert_resp_body(request, spider),
                                request=request,
                                encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            spider.logger.info('Chrome driver end...')

    def convert_resp_body(self, request, spider):
        # Return the JSON text if Chrome wrapped it in body > pre,
        # otherwise fall back to the page HTML
        try:
            json_text = self.driver.find_element(By.CSS_SELECTOR, "body > pre").text
            spider.logger.info(f"convert {request.url} to json resp")
            return json_text
        except Exception:
            return self.driver.page_source

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info(f'Spider opened: {spider.name}')
        options = webdriver.ChromeOptions()
        # Headless mode
        options.add_argument('--headless')
        # Disable image loading
        options.add_argument('--blink-settings=imagesEnabled=false')
        # Start the Chrome driver.
        # executable_path works on Selenium 3; Selenium 4 deprecates it
        # in favor of passing a Service object
        chrome_driver_path = spider.settings.get("CHROME_DRIVER_PATH")
        self.driver = webdriver.Chrome(options=options, executable_path=chrome_driver_path)

    def spider_closed(self, spider):
        # Quit Chrome so no browser process is left running
        self.driver.quit()

Getting past the Cloudflare firewall

References:
https://stackoverflow.com/questions/33247662/how-to-bypass-cloudflare-bot-ddos-protection-in-scrapy
https://stackoverflow.com/questions/55480924/how-to-enable-javascript-in-selenium-webdriver-chrome-using-python
https://stackoverflow.com/questions/64842858/selenium-app-redirect-to-cloudflare-page-when-hosted-on-heroku
In the newly opened Chrome window, check whether JavaScript is enabled:
chrome://settings/content/javascript

Example of the Cloudflare challenge page returned:

<!DOCTYPE HTML>
<html lang="en-US">
<head>
  <meta charset="UTF-8" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
  <meta name="robots" content="noindex, nofollow" />
  <meta name="viewport" content="width=device-width,initial-scale=1" />
  <title>Just a moment...</title>
  <style type="text/css">
    html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
    body {background-color: #ffffff; color: #000000; font-family:-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Helvetica Neue",Arial, sans-serif; font-size: 16px; line-height: 1.7em;-webkit-font-smoothing: antialiased;}
    h1 { text-align: center; font-weight:700; margin: 16px 0; font-size: 32px; color:#000000; line-height: 1.25;}
    p {font-size: 20px; font-weight: 400; margin: 8px 0;}
    p, .attribution, {text-align: center;}
    #spinner {margin: 0 auto 30px auto; display: block;}
    .attribution {margin-top: 32px;}
    @keyframes fader     { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
    @-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
    #cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
    #cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
    #cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
    .bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
    a { color: #2c7cb0; text-decoration: none; -moz-transition: color 0.15s ease; -o-transition: color 0.15s ease; -webkit-transition: color 0.15s ease; transition: color 0.15s ease; }
    a:hover{color: #f4a15d}
    .attribution{font-size: 16px; line-height: 1.5;}
    .ray_id{display: block; margin-top: 8px;}
    #cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
    #cf-hcaptcha-container { text-align:center;}
    #cf-hcaptcha-container iframe { display: inline-block;}
  </style>

      <meta http-equiv="refresh" content="12">
  <script type="text/javascript">
    //<![CDATA[
    (function(){

      window._cf_chl_opt={
        cvId: "2",
        cType: "non-interactive",
        cNounce: "17094",
        cRay: "6785deb3cd503aec",
        cHash: "1ca301a3b67a5a3",
        cFPWv: "g",
        cTTimeMs: "4000",
        cRq: {
          ru: "aHR0cHM6Ly9jbi5pbnZlc3RpbmcuY29tL2luc3RydW1lbnRzL0hpc3RvcmljYWxEYXRhQWpheA==",
          ra: "Y3VybC83LjU1LjE=",
          rm: "UE9TVA==",
          d: "joJnQtza+iRnbn5WVV6f6IRmR9EmES9mV6n7g4Yw1ovTCDkJNh3sAWK5tMO4VX1WOCbHGXNhKqplhgQc+BsUDYg8ug29nFRVt+Szx4Ms6zp0KTDtlJnycpjOZXua7InOYjEgD03WfZHGzUiXMRSLKgbzLYXOZNBUj2g428irH0Ldhe6d89LPYRSYDuXWtvI7YO5FL7n550l8HVbC13Wyi9FnMbsnfwn4ZvLn952yZQgwwxcP5vKN+dvhIPJF6nPUGJlcfTJNiPd4dRqE+0cC0YBmlsrKFvh0iuDcJVRGD04gtHhDB6kApcAaeCTwn8V+OXd/xb90k9UsjqbMyVvA4z+Ji5UxdMFYvHh3keWhhpAWCX8METoZ0z5t8f2dYEmXO8GW0CSdc8w0hedlJK1CalDKbwiz8ZdEhfmoop5JAKiQH4vwBDzOOK2Bc9j5p30anU8WdafUHOxnFJJ8SBaOUROJb+XodTY59hv/KCTGjX7aqsd0tMnbwX1NA+BzHCvQtRHxH/SWha9SNIH4M8nb6T7I2qSbo3xBiLKn0UL9hdSQePW9Om++oeRyBPHPJNxokd5tTOTC6yTx/sa0e03hePhc1tXZQhfD1Uk69I4d4SlKVDB04Urs3KFRMIMGeeQC+88SlKbkSlqL1su0WwNeFbCPmKndkyVpNpYEUfMPD6JSfE0lO1E5PUesflYCqiOG",          t: "MTYyNzg5MjI0Ny42NjEwMDA=",
          m: "K+89ac3LYJNEWJFtk9ohWKrbb2Ovzl8Oo2w81fr8HPA=",
          i1: "/3OUSzHGzSBW2Er1g1CHDw==",
          i2: "OHyyOn1ApyQjIyRm+kGpsA==",
          zh: "JJQg2KI/+bPgJbLHlLjmrs/mnno8aAGH5k3tm8QDk4c=",
          uh: "yJo4Yz2g40fRnUbkl+3xumbT2Zvi1Q9/8tEG1FzQ5ro=",
          hh: "T2hi97JJ3TXBbbaDfe4fVaGfimFjucUPtz+gmsc9Zq0=",
        }
      }
      window._cf_chl_enter = function(){window._cf_chl_opt.p=1};

    })();
    //]]>
  </script>


</head>
<body>
  <table width="100%" height="100%" cellpadding="20">
    <tr>
      <td align="center" valign="middle">
          <div class="cf-browser-verification cf-im-under-attack">
  <noscript>
    <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
  </noscript>
  <div id="cf-content" style="display:none">

    <div id="cf-bubbles">
      <div class="bubbles"></div>
      <div class="bubbles"></div>
      <div class="bubbles"></div>
    </div>
    <h1><span data-translate="checking_browser">Checking your browser before accessing</span> investing.com.</h1>

    <div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
      <p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
    </div>
    <p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
    <p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds&hellip;</p>
    <p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting&hellip;</p>
  </div>

  <form class="challenge-form" id="challenge-form" action="/instruments/HistoricalDataAjax?__cf_chl_jschl_tk__=pmd_b44e6894f26381ec65b7ce23b86e8129a364b849-1627892247-0-gqNtZGzNAjijcnBszQbO" method="POST" enctype="application/x-www-form-urlencoded">
    <input type="hidden" name="md" value="451c537d80e1bea80c69dc737c5769b09b670734-1627892247-0-AaEUgkAqbIiw24JsE46HbzTyxeOEhQmi8EbAWnTlRKLCuY_9e8BK9vsyTztGDpOgOKGeHJ6c2nGZlaC9GD7yXQgT3yayH4hyysTGh0qAA8Cohn_rXsoVKEy5sQELH-n4w4O7ueEyvl-qpKfI1OcD-NwUfpABwRYqlgZ8IFAtYpWfWSWJZV6-a_jc_KTmZWYEsQgcrL7ymfTZ9GRGWrWe0gXh6Nd_3Lnix8qjaq-D2PfTLLdnMMK6dR2QRfEGsDndMrYjd0wITStlRWRn-vIbWqlDgUe6ZquuYNAinXRsaRS0pFGBkrmgLAgCEWD2Qucurcw5chay51tk_bKrFsSuAKEf2j8EG4x47F_QKQjhrCpKwdAGymcrxlrzuKY4iTL4RVfXOD-1oG6OkC-hLxfUriL41rHF7n069gS_7CPGruyQudaZT-G7JSP6ziEB7ewhlg_0wesnWQvLRS-38NXv4FKxPXFh-y7yVf4M5CW1qsEudiiYr7IllPoURvr-jEmMVg" />
    <input type="hidden" name="r" value="79ae4aef921ee4d89184402e18dc0eef7ef2799e-1627892247-0-ATyPW3cpfiZHnKMzYYz6kWN5wLP8bh69u0RAqk0c0NVdhf5rZYvFYEI1XErZIsXclkv+OiQk3wyP4UCWpqGdHkl34vCx28J66C3QHxcXHWmltizpewOrPNzIV39l0t3tos+LRohlQVGEd7CD2DN+2w3eNIzmj8IcxhRDIWa4kXvLdGF10zxehh9dB/zaRLJtUPnPk3fKshXcQbRTT9Uz207nUrk6N3qoCh6baJwAccK6tYPAsuf8jYesH+oKWGT1ZavzujhFvaPAMqEmOELZGRCq/Cq9s3HJdg3njBknHmKIkBoYaecpaewpGqBIZeXPOvgdr11FEPhvamjJATEyhss6r8/P3UooX3OiPgix0ePXcIzhtXaNTY5bftVmVyTiHKcwpLQx2SQH4lKzC273CxsIGQH91SI0/TmFqJx+e1cz8K9SrPcDq4nBJX9NuwP56NM9jkZA2hPjOkf9kNOD/sF9KAEXPIumhP1/k5Gnyp5O+u3gxiAfJHHeAclxUsFLXA/TBGdP4+qXb/2D6/wRPoBQKuXPK3QaiBf6KhzmvaEhktTgXqBX7E8m7tWetIlpXSsNGW7oVEabVd14BJfXNh7wnFhbxadYBrL7jPs8F7fiTlyekw4omUcTq52kpgW5KanEcuTksGn4yldb4O9C086LTasGLPkd0Qz25RIuGvXUmXfwnfoJQ8pg2mNV0GykXcyIVs09r0Wz5+IgclodF4WhZ2GMaL1ZDeGJbRQQHrmsF8a74cEgm4/HaZW0rn0xrAxNovhwPkWTaz7UxMNQaxVa0uoJ7c1g5j83wHkrKwnX12P9TSI8X375B8l7P4PS56i3iDd7edcsyqgtq8F8FKfh3BLl+MU4ZNJu/nKa43GqlD+YEkfK2aK7MXfuhT+vuVF5fSUkOo05TQ+td8VxYDmwTjN8vl5XXTEgLCKBIN1QTaxQN+YSwbGsmHy9brqRKcvCTFzJjneQZ8XDWuLh2FUIeqD6N8viDlFcJB4VMT9p5hNEKQLV7bEg84o3UKtOtBYj3M4k+hfz664fGg/giI7gYhQ7l9W+FbT3zKnni6wlxjgCWwo8h78b/S/4UPXqaT9p8eKIpOZVpIe9AXQRQNtl7uQntf1xiW00XupiAHC0N95rRQy5KrSAtwMiuFbxPP+ttfodwESvakbQ0rzQk5t2huYKljcNF4rzdexJe4c1iPetB97VmOo9vXixvwwQGds5iQIMqSrRxi/PCjowSK8JjReH1qLeGU0I/9RKJZA30Sz8jMQM7S2FC/kqSsr2rUAH7Ku/UIjblXUpaoCxEsC57YnhhMUxC5f8wNWuxMZz2+IfZfHqZXMtv2L6APd2LEnOaSk9DIlUectu4kwhsQ+59x9zWicVOPDoPQJ5DvB0TP5BYj+jvxmXUi3vls3TsMNaDSzhxv6IOQorhkBVuu5Md973zksLLUy7kH9E9ffH1jyEo/4G6/MkDTFpfqRBcES7s6zdHJgyuFO5rVC72SWTmj7bWvKZHLs3UFA7vWyOxdBSbuYidY6fKS/qgR1CkcipEy+5YsPYkSTpqqllGJrKV6fWWBT2hLG2DFkoBfgSKdoPEJ/pvMia1qBOI7t4G5Cfvcb+G4j4g3AdPUF1axJwPPXjQeHK7xpzJ9baNE5gGzxLihp3JUcdoFmdZaZ5Sv1qMq5UdKrKqZBSQ1j52Bo="/>
    <input type="hidden" value="62f859a005e28f79ef2069d9de198947" id="jschl-vc" name="jschl_vc"/>
    <!-- <input type="hidden" value="" id="jschl-vc" name="jschl_vc"/> -->
    <input type="hidden" name="pass" value="1627892251.661-yYtA6C6xg3"/>
    <input type="hidden" id="jschl-answer" name="jschl_answer"/>
  </form>

    <script type="text/javascript">
      //<![CDATA[
      (function(){
          var a = document.getElementById('cf-content');
          a.style.display = 'block';
          var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
          var trkjs = isIE ? new Image() : document.createElement('img');
          trkjs.setAttribute("src", "/cdn-cgi/images/trace/jschal/js/transparent.gif?ray=6785deb3cd503aec");
          trkjs.id = "trk_jschal_js";
          trkjs.setAttribute("alt", "");
          document.body.appendChild(trkjs);
          var cpo=document.createElement('script');
          cpo.type='text/javascript';
          cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/jsch/v1?ray=6785deb3cd503aec";
          document.getElementsByTagName('head')[0].appendChild(cpo);
        }());
      //]]>
    </script>



  <div id="trk_jschal_nojs" style="background-image:url('/cdn-cgi/images/trace/jschal/nojs/transparent.gif?ray=6785deb3cd503aec')"> </div>
</div>


          <div class="attribution">
            DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
            <br />
            <span class="ray_id">Ray ID: <code>6785deb3cd503aec</code></span>
          </div>
      </td>

    </tr>
  </table>
</body>
</html>
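
A crawler can at least recognize when it has received this challenge page instead of real content. The sketch below is a simple heuristic that looks for strings visible in the page above. The marker list and the function name are assumptions made for illustration, and Cloudflare may change its challenge markup at any time.

```python
# Strings taken from the challenge page shown above (assumed markers)
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    "cf-browser-verification",
    'id="challenge-form"',
)


def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic: True if the page contains any known challenge marker."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


challenge = "<html><head><title>Just a moment...</title></head><body></body></html>"
normal = "<html><head><title>Historical data</title></head><body></body></html>"
print(looks_like_cloudflare_challenge(challenge))  # True
print(looks_like_cloudflare_challenge(normal))     # False
```

In the middleware from section VI, such a check could run on self.driver.page_source so that a challenge page is logged or retried rather than returned with status 200.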

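VII. From Getting Started to Giving Up 💔

The selenium + chromedriver combination handles page rendering (JavaScript execution) well,
but using it inside Scrapy has the following problems:
(1) The Python + scrapy + selenium + chromedriver + chrome environment is tedious to set up;
(2) It blocks Scrapy: the driver.get() call in process_request is blocking, so HTTP requests run serially, crawling is far too slow 😭, and Scrapy's performance goes unused;
(3) Running several spiders at once starts several Chrome instances, which puts too much load on the system.
Given the current requirement to crawl 500+ sites at the same time, I ended up dropping the Selenium + ChromeDriver combination 😓
After further research, I decided to use a Scrapy + Splash architecture…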