Reverse-Engineering Web Scraping 11: selenium Basics

一个小黑酱

Last modified 2022-03-24 18:05:07

First published 2022-01-22 00:26:50

Article link: https://blog.csdn.net/weixin_40743639/article/details/122631986

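I. What is selenium?

selenium is a browser automation tool built for automated testing. It can launch a fresh browser instance, drive it from code, and extract the content you want from the pages that browser loads.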

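II. Why learn selenium?

The requests module covered earlier can already fetch a page's source code, so why learn selenium, an automated testing tool that was never designed for scraping? Because many sites now encrypt their data and decrypt it in the browser with JavaScript. requests only ever receives the encrypted payload, so the techniques covered so far cannot scrape these sites. selenium provides a real browser environment: the browser loads and runs the JavaScript that decrypts the data, and selenium then extracts the target content from the rendered page. selenium can therefore cope with most data-encryption schemes (sites run by large companies are the exception).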
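
To make the difference concrete, here is a minimal sketch that needs no network and no browser. It assumes a made-up page (RAW_HTML) whose visible value is filled in by a script, with base64 standing in for whatever encryption a real site would use. The element id "price" and the helper static_text are illustrative names, not taken from any real site.

```python
import base64
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the raw HTML and is only
# filled in when a browser runs the script.
RAW_HTML = (
    '<html><body>'
    '<div id="price"></div>'
    '<script>document.getElementById("price").textContent = atob("NDI=");</script>'
    '</body></html>'
)


class DivText(HTMLParser):
    """Collect the text found inside <div id=target_id> in static HTML."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def static_text(html, target_id):
    """What a requests-style client can see: the HTML exactly as sent."""
    parser = DivText(target_id)
    parser.feed(html)
    return parser.text


# A plain HTTP client never runs the script, so the div stays empty.
static_value = static_text(RAW_HTML, "price")

# A browser would run atob("NDI="); decoding by hand shows what it would display.
rendered_value = base64.b64decode("NDI=").decode()

print(repr(static_value))
print(rendered_value)
```

The first value is what a plain HTTP client gets from the raw HTML. The second is what ends up on screen once the script has run, and that rendered state is the one selenium reads.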

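III. How do you install selenium?

1. Run the command pip install selenium

2. Download the browser driver (chromedriver) that matches your installed Chrome version from https://npm.taobao.org/mirrors/chromedriver

3. After downloading, put the driver in the Python interpreter's directory so that selenium can find it without an explicit path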

IV. How do you use selenium?

1. Open the browser, enter a URL, and press Enter

from selenium.webdriver import Chrome

web = Chrome()  # the browser driver is located automatically at this point
url = "http://www.baidu.com"
web.get(url)
print(web.title)  # built-in attribute: the content of the page's <title> tag
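
The snippet above never closes the browser, so a script that crashes midway can leave a browser process behind. One possible way to guard against that is a small context manager that always calls quit(). This is only a sketch: managed_browser is a hypothetical helper, and FakeBrowser is a stand-in so the example runs without a real browser. It assumes the real object exposes get(), title, and quit(), as selenium's Chrome does.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Yield a browser created by factory() and always call quit() on it."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body raises


class FakeBrowser:
    """Stand-in for a real driver, used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False
        self.title = "demo page"
        self.url = None

    def get(self, url):
        self.url = url

    def quit(self):
        self.closed = True


with managed_browser(FakeBrowser) as web:
    web.get("http://www.baidu.com")
    page_title = web.title

print(page_title, web.closed)
```

With selenium installed, passing the Chrome class as the factory (managed_browser(Chrome)) should behave the same way, since Chrome() takes no required arguments and provides quit().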

2. A tour of handy selenium operations

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

web = Chrome()
url = "https://shanghai.zbj.com/"
web.get(url)
time.sleep(1)

# Click the "outsourcing requests" entry in the page header
print("Selecting outsourcing requests")
btn = web.find_element(By.XPATH, '//*[@id="utopiacs-zp-header-v1"]/div/div/div[3]/div[3]/div[1]/a/span')
btn.click()
time.sleep(1)

# Switch windows: the click opened a new tab
print("Switching window")
web.switch_to.window(web.window_handles[-1])    # jump to the last window in the list

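# Close the ad banner by removing its node from the DOM
print("Closing the ad banner")
web.execute_script("""
    var a = document.getElementsByClassName("hall-top-xw")[0];
    if (a) { a.parentNode.removeChild(a); }  // do nothing when no banner is present
""")

# Find the search box, type "python", and press Enter
print("Typing python and searching")
web.find_element(By.XPATH, '//*[@id="utopia_widget_2"]/div/div[2]/div/input').send_keys("python", Keys.ENTER)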
time.sleep(1)

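# Switch windows: the search results open in another new tab
print("Switching window")
web.switch_to.window(web.window_handles[-1])    # jump to the last window in the list

# Collect the task descriptions and rewards
print("Collecting task descriptions and rewards")
for i in range(2):  # scrape two pages of results
    # Close the ad banner, which reappears on each page
    print("Closing the ad banner")
    web.execute_script("""
        var a = document.getElementsByClassName("hall-top-xw")[0];
        if (a) { a.parentNode.removeChild(a); }  // do nothing when no banner is present
    """)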
    div_list = web.find_elements(By.XPATH, '//*[@id="utopia_widget_6"]/div/div[1]/div')
    for div in div_list:
        name = div.find_element(By.XPATH, './div[1]/h4/a').text
        detail = div.find_element(By.XPATH, './div[2]').text
        salary = div.find_element(By.XPATH, './div[4]/span').text
        print(name, detail, salary)
    # Go to the next page (named next_btn to avoid shadowing the built-in next)
    next_btn = web.find_element(By.XPATH, '//*[@id="utopia_widget_8"]/a[9]')
    next_btn.click()
    time.sleep(1)

print("Closing the current window")
web.close()

print("Switching back to the first window")
web.switch_to.window(web.window_handles[0])
time.sleep(1)
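
The script above picks the new tab with window_handles[-1]. That works in practice with Chrome, but the WebDriver specification describes the order of window handles as arbitrary, so a more defensive approach is to compare the handle list before and after the click. The helper below is a sketch of that idea. newly_opened is a hypothetical name, and the handles are plain strings, which is how selenium reports them.

```python
def newly_opened(before, after):
    """Return the single handle present in `after` but not in `before`."""
    known = set(before)
    fresh = [handle for handle in after if handle not in known]
    if len(fresh) != 1:
        raise RuntimeError(f"expected exactly one new window, found {len(fresh)}")
    return fresh[0]


# Simulated handle lists; with selenium these would come from web.window_handles.
handles_before = ["CDwindow-A"]
handles_after = ["CDwindow-B", "CDwindow-A"]  # the new tab is not last here
target = newly_opened(handles_before, handles_after)
print(target)
```

In the script this would mean saving web.window_handles before btn.click(), then calling web.switch_to.window(newly_opened(saved, web.window_handles)) afterwards.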

3. How to read content inside an iframe

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
import time

web = Chrome()
web.get("http://www.wbdy.tv/play/30288_1_1.html")
time.sleep(5)

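# Switch into the iframe; elements inside it are invisible from the outer page
iframe = web.find_element(By.XPATH, '//*[@id="mplay"]')
web.switch_to.frame(iframe)

# Read an attribute of a tag (named input_box to avoid shadowing the built-in input)
input_box = web.find_element(By.XPATH, '//*[@id="dplayer"]/div[4]/div[1]/input')
placeholder = input_box.get_property("placeholder")
print(placeholder)

# Leave the iframe and go back to the parent document
web.switch_to.parent_frame()
content = web.find_element(By.XPATH, '/html/body/div[2]/div[3]/div[2]/div/div[2]')
print(content.text)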
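
Forgetting to leave an iframe is an easy mistake: every later lookup then searches the wrong document. One way to pair the two switch calls is a context manager, sketched below. inside_frame is a hypothetical helper, and FakeDriver only records calls so the sketch runs without a browser. It mirrors the two methods used above, switch_to.frame() and switch_to.parent_frame().

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, and always switch back to the parent afterwards."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.parent_frame()


class _FakeSwitchTo:
    """Records frame switches instead of talking to a browser."""

    def __init__(self, log):
        self.log = log

    def frame(self, ref):
        self.log.append(("frame", ref))

    def parent_frame(self):
        self.log.append(("parent_frame",))


class FakeDriver:
    """Stand-in for a real driver, used only to make this sketch runnable."""

    def __init__(self):
        self.log = []
        self.switch_to = _FakeSwitchTo(self.log)


driver = FakeDriver()
with inside_frame(driver, "mplay"):
    driver.log.append(("read placeholder",))  # work done inside the iframe

print(driver.log)
```

Because the switch back sits in a finally clause, it also runs when a lookup inside the iframe raises.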

4. Switching a drop-down list and getting the rendered page code (not the original source)

from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
from selenium.webdriver.support.select import Select  # wrapper for <select> drop-down lists
import time

web = Chrome()
web.get("https://www.endata.com.cn/BoxOffice/BO/Year/index.html")

sel = web.find_element(By.XPATH, '//*[@id="OptionDate"]')
sel_new = Select(sel)

# selenium returns the text of a tag together with all of its descendants,
# so grab the table body directly and print its text
for i in range(len(sel_new.options)):
    sel_new.select_by_index(i)  # switch option by position
    time.sleep(3)
    div = web.find_element(By.XPATH, '//*[@id="TableList"]/table/tbody')
    print(div.text)

# Get the page code (not the source as downloaded, but the DOM shown in the F12 Elements panel)
page_source = web.page_source
print(page_source)
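
The loop above sleeps a fixed 3 seconds after each selection, which is either too long or, on a slow connection, too short. A polling wait returns as soon as a condition holds. selenium ships WebDriverWait and expected_conditions for this purpose. The sketch below shows the same idea in plain Python so it runs without a browser. wait_until and the simulated table_text function are illustrative names.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Simulated table that only has content from the third poll onwards.
calls = {"n": 0}


def table_text():
    calls["n"] += 1
    return "2021 box office" if calls["n"] >= 3 else ""


result = wait_until(table_text, timeout=2.0, interval=0.01)
print(result)
```

In the scraping loop, the predicate could compare the current text of the table with the text captured before select_by_index(i) was called, so the loop moves on once the table has actually refreshed.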

5. Hiding the browser (headless mode)

from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
from selenium.webdriver.support.select import Select
import time

# Configure headless mode: the browser runs without opening a window
from selenium.webdriver.chrome.options import Options
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")
web = Chrome(options=opt)

web.get("https://www.endata.com.cn/BoxOffice/BO/Year/index.html")

sel = web.find_element(By.XPATH, '//*[@id="OptionDate"]')
sel_new = Select(sel)

for i in range(len(sel_new.options)):
    sel_new.select_by_index(i)
    time.sleep(3)
    div = web.find_element(By.XPATH, '//*[@id="TableList"]/table/tbody')
    print(div.text)

# Get the page code (not the source as downloaded, but the DOM shown in the F12 Elements panel)
page_source = web.page_source
print(page_source)

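V. How to handle CAPTCHAs

1. Why do CAPTCHAs exist?

CAPTCHAs were originally designed as a human-verification step to stop brute-force password cracking. A bank PIN is usually 6 digits, which gives 10^6 = 1,000,000 possible values. If someone learns your card number and writes an enumeration script that hits the bank's site over and over, at most 1,000,000 attempts will get them into your account, and that is no hard task for a program. CAPTCHAs were introduced to counter this: each login requires a person to read the content of the CAPTCHA and type it in, and the login only proceeds once that check passes. With this mechanism in place, a plain enumeration script can no longer crack the password.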
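
The arithmetic can be checked in a few lines. The attempt rate below is an assumption chosen only for illustration; real sites and networks will differ.

```python
PIN_LENGTH = 6             # digits in the PIN, as in the example above
ATTEMPTS_PER_SECOND = 100  # assumed request rate, purely illustrative

keyspace = 10 ** PIN_LENGTH                     # every possible 6-digit PIN
worst_case_seconds = keyspace / ATTEMPTS_PER_SECOND
worst_case_hours = worst_case_seconds / 3600

print(keyspace)
print(round(worst_case_hours, 2))
```

At the assumed rate, the whole keyspace is exhausted in under three hours, which is why a check that a script cannot pass is added in front of the login.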

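2. Solving CAPTCHAs with Chaojiying

Register a Chaojiying account and top up its points (each CAPTCHA recognition costs points)
Go to the User Center, generate a software ID, and copy it
Download the sample code, fill in your Chaojiying username, password, and software ID, then run it to get the recognition result for the sample CAPTCHA image

Screenshots of the process are omitted here; the official documentation explains the usage in detail. The code is given below.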

#!/usr/bin/env python
# coding:utf-8

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result was wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


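if __name__ == '__main__':
    # User Center >> Software ID: generate one and use it in place of 96001
    chaojiying = Chaojiying_Client('xxxxxx', 'xxxxxx', '96001')
    # Replace a.jpg with the path of a local image; on Windows the path may need //
    with open('a.jpg', 'rb') as f:
        im = f.read()
    # 1902 is the CAPTCHA type (official site >> pricing); Python 3 needs parentheses after print
    print(chaojiying.PostPic(im, 1902))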

3. Using Chaojiying to log in to Chaojiying

from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
from chaojiying import Chaojiying_Client  # the client class above, saved as chaojiying.py

web = Chrome()
web.get("http://www.chaojiying.com/user/login/")

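# Screenshot only the CAPTCHA image element; the result is PNG bytes
png = web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/div/img').screenshot_as_png

# Fill in your own username, password, and software ID (User Center >> Software ID)
chaojiying = Chaojiying_Client('xxxxxx', 'xxxxxx', 'xxxxxx')
result = chaojiying.PostPic(png, 1902)      # 1902 is the CAPTCHA type (official site >> pricing)
v_code = result['pic_str']                  # the recognised text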

# Fill in the login form: the first two inputs take the account credentials
# (placeholders here), the third takes the recognised code, the fourth submits
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input').send_keys("xxxxxxxxxx")
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input').send_keys("xxxxxxxxxx")
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input').send_keys(v_code)
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input').click()
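
The script reads result['pic_str'] directly, which raises a KeyError or submits an empty code if recognition failed, for example when the account has run out of points. A small validation step makes that failure explicit. The sketch below assumes the response also carries err_no and err_str fields alongside pic_str. Only pic_str appears in the code above, so treat the other two names as assumptions to check against the official documentation. extract_code and CaptchaError are hypothetical names.

```python
class CaptchaError(Exception):
    """Raised when the recognition service did not return a usable code."""


def extract_code(result):
    """Return the recognised text from a response dict, or raise CaptchaError."""
    if not isinstance(result, dict):
        raise CaptchaError(f"unexpected response type: {type(result).__name__}")
    err_no = result.get("err_no", 0)  # assumed field: 0 means success
    if err_no != 0:
        # err_str is an assumed field holding a readable error message
        raise CaptchaError(f"service error {err_no}: {result.get('err_str', '')}")
    code = result.get("pic_str", "")
    if not code:
        raise CaptchaError("response contained no recognised text")
    return code


# Simulated responses; a real one would come from chaojiying.PostPic(png, 1902).
ok_response = {"err_no": 0, "err_str": "OK", "pic_str": "7261"}
v_code = extract_code(ok_response)
print(v_code)
```

When the login page rejects a code that the service returned without an error, the ReportError method of the client class is the place to report the wrong answer.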