Python Web Scraping: Configuring Scrapy with Selenium and Attaching Selenium to an Existing Browser

一壶清玖

Tags: python selenium chrome

Article link: https://blog.csdn.net/weixin_44072750/article/details/113117439

Configuring Scrapy with Selenium and Attaching Selenium to an Existing Browser

In this post I continue introducing Selenium and show how to configure it inside Scrapy.

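Contents

Configuring Scrapy with Selenium and Attaching Selenium to an Existing Browser
Preface
1. Selenium Gets Detected
2. Countermeasures When Selenium Is Detected
- Attaching to an existing browser
3. Configuring Scrapy with Selenium
Summary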

Preface

Keep learning, keep moving forward, and don't stop!

1. Selenium Gets Detected

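When writing crawlers for pages that load much of their content through JavaScript, many people reach for Selenium + WebDriver. However, led by Taobao, many sites now run JavaScript checks aimed specifically at Selenium, inspecting properties such as window.navigator.webdriver, navigator.languages, and navigator.plugins.length.
For example:
When we visit Taobao or a similar site in an ordinary browser, the value of window.navigator.webdriver is undefined.
When the browser is driven by Selenium, the value of window.navigator.webdriver is true.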
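The flag described above can be read from Python through the driver's `execute_script` method. The following is only a minimal sketch: `reports_webdriver` and `FakeDriver` are hypothetical names introduced here for illustration, and the fake driver exists solely so the helper can be tried without launching a browser.

```python
def reports_webdriver(driver):
    """Return True if the page's navigator.webdriver flag is exactly true.

    `driver` is assumed to be a Selenium WebDriver, but any object that
    offers an execute_script(script) method will do.
    """
    # A JavaScript `undefined` is returned to Python as None.
    value = driver.execute_script("return window.navigator.webdriver")
    return value is True


class FakeDriver:
    """Hypothetical stand-in for a real driver, used only for illustration."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # Ignore the script and return the preset flag.
        return self.flag


if __name__ == "__main__":
    print(reports_webdriver(FakeDriver(True)))  # session launched by Selenium
    print(reports_webdriver(FakeDriver(None)))  # ordinary browser (undefined)
```

With a real session, passing the `driver` object created by `webdriver.Chrome()` instead of `FakeDriver` should report what a site's detection script would see.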

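What effect does this have on us? Taking Meituan as an example, here is what happens when we request the Meituan page directly with the following code and print the result: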

from selenium import webdriver
import time

# Drive the Chrome browser
driver = webdriver.Chrome()
# Maximize the window
# driver.maximize_window()
# Open the URL
driver.get("https://bj.meituan.com/")
# Pause to let the page finish loading
time.sleep(2)
html_goods = driver.page_source
print(html_goods)

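We copy the printed page source into PyCharm and open it.

Compared with the Meituan page opened normally, a large amount of data is missing. Only the part that links to the next level remains, so we did not get the complete page we need. The rest of this post sets out to solve that problem.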

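2. Countermeasures When Selenium Is Detected

Attaching to an existing browser

If Selenium launches the browser itself, the value of window.navigator.webdriver is true. To avoid this, we can have Selenium attach to a browser that is already running, which neatly sidesteps the problem. The code is as follows: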

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
import time

# Start Chrome with a remote debugging port (raw string keeps the backslashes literal)
os.popen(r'"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"')
options = Options()
# Attach to the running browser instead of launching a new one
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=options)
driver.get('https://bj.meituan.com/')
# Pause to let the page finish loading
time.sleep(5)
html = driver.page_source
print(html)

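Here the os module is used to launch Chrome with a specified debugging port.
In the command, "C:\Program Files\Google\Chrome\Application\chrome.exe" is the location of your chrome.exe.
--remote-debugging-port=9222 sets the debugging port.
--user-data-dir="C:\selenum\AutomationProfile" sets the directory in which the browser stores its user data and cache.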
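The three parts of the command can also be assembled as an argument list, which avoids nested quoting, and the script can wait until the debugging port accepts connections instead of relying on a fixed pause. This is a sketch under assumptions: `build_chrome_command` and `wait_for_port` are hypothetical helpers written for this post, not part of Selenium or Chrome.

```python
import socket
import time


def build_chrome_command(chrome_path, port, profile_dir):
    """Build the argument list for launching Chrome with remote debugging."""
    return [
        chrome_path,
        "--remote-debugging-port={}".format(port),
        "--user-data-dir={}".format(profile_dir),
    ]


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(interval)
    return False


# Usage sketch (not executed here; paths are the ones used in this post):
#   import subprocess
#   subprocess.Popen(build_chrome_command(
#       r"C:\Program Files\Google\Chrome\Application\chrome.exe",
#       9222, r"C:\selenum\AutomationProfile"))
#   wait_for_port("127.0.0.1", 9222)
```

Calling `wait_for_port` before creating `webdriver.Chrome(options=options)` should reduce the chance of attaching before Chrome has opened the port.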

Note: because the browser keeps writing cached page data to the profile directory while it runs, I wrote a small separate script to delete that data.

import shutil

# Delete the profile directory; ignore files that cannot be removed
shutil.rmtree(r"C:\selenum", ignore_errors=True)
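An alternative to a separate cleanup script is to give each run its own throwaway profile directory and delete it when the run ends. The sketch below uses only the standard library; `temporary_profile` is a hypothetical helper name, and whether a fresh profile per run suits your site (for example, if you rely on saved logins) is an assumption you would need to check.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_profile(prefix="chrome-profile-"):
    """Yield a fresh directory for --user-data-dir and delete it afterwards."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # ignore_errors matches the cleanup script above: some files may
        # still be locked if the browser has not fully exited.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with temporary_profile() as profile_dir:
        print(os.path.isdir(profile_dir))
    print(os.path.exists(profile_dir))
```

Inside the `with` block, `profile_dir` would be passed as the `--user-data-dir` value when launching Chrome.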

3. Configuring Scrapy with Selenium

We only need to change Scrapy's middlewares and settings.
First, we create a Selenium middleware class:

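import os

import scrapy
# `retry` is assumed to come from the `retrying` package, whose decorator
# accepts stop_max_attempt_number and wait_fixed
from retrying import retry
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


# Downloader middleware that drives Selenium
class SeleniumMiddleware(object):
    def __init__(self):
        os.popen(r'"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"')
        options = Options()
        options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.close()

    # Retry up to 40 times, waiting 1000 ms (1 second) between attempts
    @retry(stop_max_attempt_number=40, wait_fixed=1000)
    def retry_load_page(self, request, spider):
        # If the expected data is found, the page has rendered and execution continues
        try:
            # Use the presence of an //h3 node to decide whether the page has loaded
            self.driver.find_element(By.XPATH, "//h3")
        except Exception:
            self.count += 1
            spider.logger.info("<{}> retry {} times".format(request.url, self.count))
            # Re-raise so that retry sees the failure and tries again
            raise Exception("<{}> page load failed.".format(request.url))

    def process_request(self, request, spider):
        self.count = 0
        self.driver.get(request.url)
        # Fixed wait (disabled)
        # time.sleep(2)
        try:
            # Wait until the page has rendered: keep waiting if it has not,
            # continue immediately if it has
            self.retry_load_page(request, spider)
            # page_source is a Unicode string
            html = self.driver.page_source
            # Return a response object to the engine, which treats it as the
            # downloader's response and hands it to the spider for parsing
            return scrapy.http.HtmlResponse(url=self.driver.current_url, body=html.encode("utf-8"),
                                            encoding="utf-8", request=request)
        except Exception as e:
            spider.logger.error(e)
            return request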
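The retry logic in the middleware amounts to polling a condition a bounded number of times. For readers who prefer not to depend on a retry decorator, the same idea can be sketched in plain Python. `wait_until` is a hypothetical helper, and the default of 40 attempts one second apart simply mirrors the values used above.

```python
import time


def wait_until(predicate, attempts=40, interval=1.0, sleep=time.sleep):
    """Call predicate() until it returns a truthy value or attempts run out.

    Returns the number of attempts used; raises TimeoutError on failure.
    """
    for attempt in range(1, attempts + 1):
        if predicate():
            return attempt
        if attempt < attempts:
            # Pause before the next check, but not after the final one.
            sleep(interval)
    raise TimeoutError("condition not met after {} attempts".format(attempts))


if __name__ == "__main__":
    state = {"checks": 0}

    def page_ready():
        # Stand-in condition that becomes true on the third check.
        state["checks"] += 1
        return state["checks"] >= 3

    print(wait_until(page_ready, attempts=5, interval=0.01))
```

In the middleware, the predicate could be something like `lambda: bool(self.driver.find_elements(By.XPATH, "//h3"))`. Selenium also ships its own explicit-wait helper, `WebDriverWait`, which covers the same need.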

Then enable the middleware in settings:

DOWNLOADER_MIDDLEWARES = {
    'cwk1.middlewares.SeleniumMiddleware': 543,
}

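Summary

That is my brief summary of using Selenium. If anything here is wrong or unclear, corrections are welcome. Let's learn from each other and improve together!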
