Web scraping with Selenium + Chrome + BeautifulSoup

shuxiaohua

Last modified 2024-02-21 14:13:33

First published 2020-09-08 20:29:18

Article link: https://blog.csdn.net/shuxiaohua/article/details/108476173

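Background

Ordinary web pages and REST APIs can usually be scraped with urllib, but urllib falls short in the following situations:

The site has anti-bot mechanisms, and imitating browser behavior in code is very tedious
The content to scrape is rendered in real time by JavaScript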
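
The second situation can be illustrated without a browser. The sketch below uses only the standard library; `SAMPLE_PAGE`, `fetch_html` and `count_elements` are hypothetical names introduced here, and the markup is made up to stand in for what a server might return for a script-rendered page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Made-up response body: the <p class="price"> element only exists
# after a browser has executed the script.
SAMPLE_PAGE = """
<html><body>
<div id="app"></div>
<script>
var p = document.createElement("p");
p.className = "price";
p.textContent = "42";
document.getElementById("app").appendChild(p);
</script>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Download a page the way urllib sees it: raw markup, no JavaScript executed."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class _TagCounter(HTMLParser):
    """Counts start tags that have a given name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    """Return how many <tag class="css_class"> elements the static markup contains."""
    parser = _TagCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# The static markup has the empty placeholder but not the rendered element.
print(count_elements(SAMPLE_PAGE, "p", "price"))
```

Calling `fetch_html` on such a page would return markup like `SAMPLE_PAGE`, so any selector aimed at the rendered element comes back empty. A real browser is needed to run the script first.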

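To scrape pages by driving a real browser, the following are needed:

Selenium: an automation tool that drives chrome driver
chrome driver: the driver for Chrome. It controls Chrome and acts as the bridge between Selenium and Chrome, much like a program talks to hardware through a device driver.
BeautifulSoup: Selenium's webdriver tooling is fairly limited when working with the page HTML, whereas BeautifulSoup offers a richer API, for example CSS selectors
chrome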
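
As a rough illustration of the BeautifulSoup side, the sketch below runs CSS selectors over an HTML string. `PAGE_SOURCE` is made-up markup standing in for what Selenium's `driver.page_source` would return, and `select_links` is a hypothetical helper. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for driver.page_source.
PAGE_SOURCE = """
<ul id="news">
  <li class="item"><a href="/a">First</a></li>
  <li class="item hot"><a href="/b">Second</a></li>
  <li class="ad"><a href="/c">Sponsored</a></li>
</ul>
"""


def select_links(page_source, css_selector):
    """Return (text, href) pairs for every <a> element matched by the selector."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select(css_selector)]


# Links inside list items with class "item", skipping the "ad" entry.
print(select_links(PAGE_SOURCE, "#news li.item > a"))
# Links inside the item that also carries the "hot" class.
print(select_links(PAGE_SOURCE, "li.hot a[href]"))
```

`select` always returns a list, so an empty result simply means nothing matched yet.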

Preparation

python: download the latest version
chrome: download the latest version
chrome driver: download the version that matches your Chrome:

Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/
Official site: https://chromedriver.storage.googleapis.com/index.html
Chrome for Testing (Chrome 115 and later): https://googlechromelabs.github.io/chrome-for-testing/
Chrome for Testing version index (JSON): https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json
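
Picking a matching driver from the JSON index can be sketched as follows. The layout assumed here (a top-level `versions` list whose entries carry `version` and `downloads.chromedriver[].platform/url`) should be checked against the live file before relying on it. `load_index` and `find_chromedriver_url` are hypothetical helpers, and `SAMPLE_INDEX` uses invented version numbers and placeholder URLs. Matching on the first three version components is also an assumption.

```python
import json
from urllib.request import urlopen

INDEX_URL = ("https://googlechromelabs.github.io/chrome-for-testing/"
             "known-good-versions-with-downloads.json")

# Invented data in the assumed layout, so the example runs offline.
SAMPLE_INDEX = {"versions": [
    {"version": "200.0.1000.10", "downloads": {"chromedriver": [
        {"platform": "win64", "url": "https://example.invalid/200.0.1000.10/win64.zip"},
        {"platform": "linux64", "url": "https://example.invalid/200.0.1000.10/linux64.zip"}]}},
    {"version": "200.0.1000.25", "downloads": {"chromedriver": [
        {"platform": "win64", "url": "https://example.invalid/200.0.1000.25/win64.zip"}]}},
    {"version": "201.0.1100.5", "downloads": {"chrome": []}},
]}


def load_index(url=INDEX_URL, timeout=30):
    """Download and parse the version index (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def _as_tuple(version):
    """Turn '200.0.1000.25' into (200, 0, 1000, 25) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def find_chromedriver_url(index, chrome_version, platform):
    """Return the URL of the newest chromedriver sharing major.minor.build
    with chrome_version for the given platform, or None if there is none."""
    wanted = _as_tuple(chrome_version)[:3]
    best = None
    for entry in index.get("versions", []):
        version = _as_tuple(entry["version"])
        if version[:3] != wanted:
            continue
        for download in entry.get("downloads", {}).get("chromedriver", []):
            if download["platform"] == platform and (best is None or version > best[0]):
                best = (version, download["url"])
    return best[1] if best else None


print(find_chromedriver_url(SAMPLE_INDEX, "200.0.1000.99", "win64"))
```

With network access, `find_chromedriver_url(load_index(), <your Chrome version>, <your platform>)` would perform the same lookup against the real index.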

Selenium: pip install selenium
BeautifulSoup and the html5lib parser used by the code below: pip install beautifulsoup4 html5lib

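Note: chrome must be on the PATH. If chrome driver is not on the PATH, its location can be set in code. Selenium 4 takes it through a Service object (Selenium 3 accepted the path as the first argument together with chrome_options=):
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service({path to chrome driver}), options=chromeOptions)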
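
One possible way to decide which driver path to pass is sketched below. `resolve_chromedriver` is a hypothetical helper that uses only the standard library: an explicit path wins if it points at a file, otherwise the PATH is searched for an executable named `chromedriver`.

```python
import os
import shutil


def resolve_chromedriver(explicit_path=None):
    """Return a usable chromedriver path, or None if none is found.

    An explicit path is used only if it points at an existing file;
    without one, fall back to searching the PATH.
    """
    if explicit_path:
        return explicit_path if os.path.isfile(explicit_path) else None
    return shutil.which("chromedriver")


# Usage sketch (assumes Selenium 4 and a local Chrome install):
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# path = resolve_chromedriver('C:/Users/Administrator/Desktop/chromedriver.exe')
# driver = webdriver.Chrome(service=Service(path)) if path else webdriver.Chrome()
```

Recent Selenium 4 releases can also locate or download a driver on their own when no path is given, so the fallback branch may be all that is needed in practice.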

Source code

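import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


def selectElement(url, cssSelector):
    totalSleepTime = 0
    maxSleepTime = 120  # stop polling after 120 seconds
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument('--headless')
    chromeOptions.add_argument('--disable-gpu')
    # Suppress DevTools log output
    chromeOptions.add_experimental_option('excludeSwitches', ['enable-logging'])
    # On Linux, when running as root, chrome only starts successfully with --no-sandbox
    chromeOptions.add_argument('--no-sandbox')
    # Ignore certificate errors
    chromeOptions.add_argument('--ignore-certificate-errors')
    # Selenium 4: the driver path goes through Service, the options through options=
    service = Service('C:/Users/Administrator/Desktop/chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=chromeOptions)
    try:
        driver.get(url)

        while True:
            # Get the page source
            source = driver.page_source
            soup = BeautifulSoup(source, 'html5lib')
            elements = soup.select(cssSelector)
            # Some sites have anti-scraping measures and go through several redirects before
            # the real page appears, so poll until the page requested by driver.get(url) is there
            # chrome driver and the browser are separate processes that exchange page data via
            # inter-process communication, so the loop polls to check whether the redirects
            # have finished; I have not found a synchronous alternative so far
            if len(elements) == 0 and totalSleepTime < maxSleepTime:
                time.sleep(10)
                totalSleepTime += 10
            else:
                break
    finally:
        # quit() ends the browser and the chrome driver process; close() only closes the window
        driver.quit()
    return elements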
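
The polling loop above can be factored into a small reusable helper. This is a minimal sketch, not part of the original code: `poll_until` is a hypothetical name, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def poll_until(probe, timeout=120, interval=10, sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns a non-empty result or timeout seconds pass.

    Returns the last probe result, which is empty if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        result = probe()
        if result or clock() >= deadline:
            return result
        sleep(interval)


# Usage sketch inside selectElement, after driver.get(url):
# def probe():
#     soup = BeautifulSoup(driver.page_source, 'html5lib')
#     return soup.select(cssSelector)
# elements = poll_until(probe)
```

Selenium also ships explicit waits (`WebDriverWait` together with `expected_conditions`), which poll the live page in a similar way and may be worth trying in place of a hand-written loop.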