Python + Selenium multithreaded background crawler example

Article link: https://blog.csdn.net/wg2627/article/details/127380184

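Selenium multithreaded background crawler
I. Preface:
Some websites cannot be crawled from their page source, or the content you want is not in the page source at all.
In those cases you need Selenium, which drives a real browser, to do the crawling.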
II. Preparation:
Install selenium and a matching version of the Google Chrome browser.
How to install: see an installation tutorial.
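
Before running the crawler it can help to confirm that the pieces are actually there. The following is only a minimal sketch using the standard library; the function name check_environment and the list of binary names are my own assumptions (the names differ between operating systems), not something from the tutorial.

```python
import importlib.util
import shutil


def check_environment(binaries=("google-chrome", "chrome", "chromium", "chromedriver")):
    """Report whether the selenium package and some Chrome-related binaries can be found.

    The binary names are assumptions; adjust them for your operating system.
    """
    return {
        # find_spec returns None when the package is not importable
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        # shutil.which returns the full path, or None when the name is not on PATH
        "binaries": {name: shutil.which(name) for name in binaries},
    }


report = check_environment()
print("selenium installed:", report["selenium_installed"])
for name, path in report["binaries"].items():
    print(name, "->", path or "not found on PATH")
```

A "not found on PATH" line does not always mean the crawler will fail (Chrome is often installed outside PATH), so treat the output as a hint rather than a verdict.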
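III. How the multithreading works:
1. Open several pages (tabs) in the same browser. The browser counts as one thread, and reusing it for many pages speeds up crawling.
2. Open several browsers at the same time, which is equivalent to running several threads. The threads work in parallel and speed up crawling further.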
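
The second idea (one browser per thread, all threads sharing one queue of addresses) can be sketched as follows. This is a sketch under my own assumptions, not the author's code: the names crawl, make_browser and fetch are hypothetical, and the browser is passed in as a factory so the pattern can be tried without Chrome installed.

```python
import queue
import threading


def crawl(urls, make_browser, fetch, workers=5):
    """Run `workers` threads; each owns one browser and pulls addresses from a shared queue.

    make_browser() -> browser object (hypothetical factory, e.g. one that starts Chrome)
    fetch(browser, url) -> whatever you want to keep for that address
    Returns {url: result}; if fetch raises, the exception object is stored instead.
    """
    todo = queue.Queue()
    for u in urls:
        todo.put(u)
    results = {}
    lock = threading.Lock()  # protects `results` across threads

    def worker():
        browser = make_browser()  # one browser per thread, never shared
        try:
            while True:
                try:
                    u = todo.get_nowait()  # non-blocking: stop when the queue is empty
                except queue.Empty:
                    return
                try:
                    value = fetch(browser, u)
                except Exception as exc:  # keep going if a single page fails
                    value = exc
                with lock:
                    results[u] = value
        finally:
            close = getattr(browser, "quit", None)
            if close is not None:
                close()  # always release the browser

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:  # wait for every worker, not just the last one
        t.join()
    return results
```

With Selenium, make_browser could return webdriver.Chrome(options=...) and fetch could call browser.get(url) and return browser.title. Taking addresses with get_nowait() means that a thread exits once the queue is empty instead of blocking forever.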
The key parts of the code are as follows:

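import time
import re
import threading
import queue
from browsermobproxy import Server  # not used in this excerpt
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = queue.Queue()  # queue of addresses to crawl; put them in before the threads start

def jiexi1():    # crawl one address with Chrome running in the background
    option = Options()
    option.add_argument('--headless')  # run in the background, without a window
    option.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, for speed
    option.add_argument('--disable-gpu')  # disable GPU acceleration
    web = webdriver.Chrome(options=option)  # options= replaces the deprecated chrome_options=
    try:
        web.get('http://www.baidu.com')  # initial page

        地址 = url.get()  # take an address; blocks until one is available

        js1 = "window.open('%s')" % 地址  # open the new address in a new tab
        web.execute_script(js1)
        web.switch_to.window(web.window_handles[-1])  # switch to the new tab
        jb1 = web.window_handles[-1]  # handle of the new tab
        # ... parse the page here (not shown in this excerpt) ...
    finally:
        web.quit()  # always close the browser, even if an error occurred

threads = []
for i in range(5):    # start several browsers, one per thread
    t1 = threading.Thread(target=jiexi1)
    t1.start()
    threads.append(t1)
    #time.sleep(1)
for t1 in threads:    # wait for every thread, not only the last one started
    t1.join()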
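
The code above opens a single address in a new tab. The first idea from section III (several pages in one browser) might be sketched like this. The helper names open_in_tabs and visit_tabs are hypothetical, and I am assuming that the new window handle is already available right after window.open returns, which may not hold on a slow machine.

```python
def open_in_tabs(web, addresses):
    """Open each address in its own tab of one browser; return {window handle: address}."""
    tabs = {}
    for address in addresses:
        before = set(web.window_handles)
        # Passing the address as arguments[0] avoids building JavaScript by string formatting
        web.execute_script("window.open(arguments[0])", address)
        new = [h for h in web.window_handles if h not in before]
        if not new:
            raise RuntimeError("no new tab appeared for %r" % (address,))
        tabs[new[0]] = address
    return tabs


def visit_tabs(web, tabs, parse):
    """Switch to every tab in turn and collect parse(web) for each address."""
    results = {}
    for handle, address in tabs.items():
        web.switch_to.window(handle)  # the driver now talks to this tab
        results[address] = parse(web)
    return results
```

Here web would be the webdriver.Chrome(...) instance, and parse could be something like lambda w: w.title. A single driver should be used from one thread only, so this pattern combines with the threaded version by giving each thread its own browser and its own set of tabs.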