Web Crawling, Lecture 4: js2py and selenium

Latest recommended article published 2024-06-06 22:45:00

Author: 加油小羽哥

Article link: https://blog.csdn.net/yangyusir/article/details/116269003

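I. js2py

1. Introduction to js2py

When crawling, we often run into sites whose requests depend on encrypted or obfuscated JavaScript, for example a signature or token computed in the browser, so a plain HTTP request fails. js2py does not decrypt JS files; it lets us run the site's own JS code from Python, so the same values can be computed without re-implementing the logic by hand.

2. Using the js2py module

Two libraries are commonly used to execute JS code from Python: js2py and pyexecjs.
js2py is a pure-Python library for running JS code in Python; in essence it translates the JS code into Python code.
Install js2py: pip install js2py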

3. js2py quick start

import js2py

# Equivalent to print('hello world')
js2py.eval_js('console.log("hello world")')  # prints 'hello world'
# func_js holds the JS source of an add() function
func_js = """
function add(a,b){
    return a+b
}
"""
add = js2py.eval_js(func_js)  # returns the JS function as a Python callable
print(add(1, 2))  # 3

print(js2py.eval_js('var a = "python";a'))  # python

add = js2py.eval_js('function add(a,b){return a + b}')
print(add(2, 3))  # 5
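
In a crawler, the JS to run is usually a helper copied from the target site, loaded once and then called many times. The sketch below assumes a made-up sign() function (it is not taken from any real site) and loads it into a js2py.EvalJs context with execute(). The Python twin sign_in_python() exists only so the JS result can be cross-checked.

```python
# Hypothetical JS helper standing in for a site's signing function.
JS_SOURCE = """
function sign(keyword, timestamp) {
    // join the inputs and reverse the string; purely illustrative
    return (keyword + "|" + timestamp).split("").reverse().join("");
}
"""


def make_js_context(js_source):
    """Load JS source into a js2py context and return the context."""
    import js2py  # imported lazily so this module loads without js2py

    context = js2py.EvalJs()
    context.execute(js_source)  # defines sign() inside the context
    return context


def sign_in_python(keyword, timestamp):
    """Pure-Python equivalent of the JS sign(), used only for cross-checking."""
    return (keyword + "|" + str(timestamp))[::-1]
```

With js2py installed, ctx = make_js_context(JS_SOURCE) followed by ctx.sign("python", "1620885803349") should agree with sign_in_python("python", "1620885803349").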

4. Translating JS code

Translating a JS snippet into Python source

# Translate a JS snippet into Python source
import js2py

print(js2py.translate_js("console.log('hello world')"))
'''
from js2py.pyjs import *
# setting scope
var = Scope(JS_BUILTINS)
set_global_object(var)

# Code follows:
var.registers([])
var.get('console').callprop('log', Js('hello world'))'''

from js2py.pyjs import *

# setting scope
var = Scope(JS_BUILTINS)
set_global_object(var)

# Code follows:
var.registers([])
var.get('console').callprop('log', Js('hello world'))
'''
'hello world'
'''

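Translating a JS file into a Python script
(1) Create a file test.js containing: console.log("hello world")
(2) Write and run the following code

import js2py

# Translate the JS file into a Python script
js2py.translate_file('test.js', 'test.py')

(3) This generates test.py; open it and run it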

__all__ = ['test']

# Don't look below, you will not understand this Python code :) I don't.

from js2py.pyjs import *
# setting scope
var = Scope(JS_BUILTINS)
set_global_object(var)

# Code follows:
var.registers([])
var.get('console').callprop('log', Js('hello world'))


# Add lib to the module scope
test = var.to_python()
'''
'hello world'
'''

5. Calling Python functions from JS code

Example 1

import js2py

# Run the function in plain Python first
print('sum:', sum([1, 2, 3]))  # sum: 6
# Expose Python's sum() to JS under the name python_sum
context = js2py.EvalJs({'python_sum': sum})
print(context)  # <js2py.evaljs.EvalJs object at 0x0000000002834408>
js_code = '''
python_sum([1,2,6])
'''
# Run the JS code, which calls back into Python
print('Result of js_code:', context.eval(js_code))  # Result of js_code: 9

Example 2

import js2py

# Import a Python module inside JS code and use it,
# via the pyimport statement
js_code = """
pyimport requests
console.log('import succeeded');
var response = requests.get('http://www.baidu.com');
console.log(response.url);
console.log(response.content);
"""
js2py.eval_js(js_code)
'''
'import succeeded'
'http://www.baidu.com/'
PyObjectWrapper(b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a>&nbsp;\xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n')
'''

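II. Using selenium

1. The battle between crawlers and anti-crawling measures

Advice for writing crawlers

Keep the number of requests as low as possible
Save the HTML you fetch, for debugging and reuse
Look at every type of page the site offers
    H5 pages
    APP
Disguise your requests
    Proxy IPs
    Random request headers
Use multithreading and distributed crawling
    Go as fast as you can without being detected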
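
The "random request headers" and "proxy IPs" advice can be sketched as follows. This is a minimal illustration, not a hardened setup: the proxy addresses are placeholders, the helper names are hypothetical, and the two User-Agent strings are simply the ones used elsewhere in this article.

```python
import random

# Pool of User-Agent strings; extend it with more real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
]

# Placeholder proxy addresses (assumption): replace with proxies you control.
PROXIES = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]


def random_headers():
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def random_proxies():
    """Return a proxies mapping in the format requests expects."""
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}


def fetch(url):
    """Fetch a page with a random User-Agent through a random proxy."""
    import requests  # imported lazily; only needed when a request is made

    response = requests.get(url, headers=random_headers(),
                            proxies=random_proxies(), timeout=10)
    return response.text
```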

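2. A brief introduction to ajax

Static pages: you send a request to a site, and all the data in the response is already in the page source.
Dynamic HTML technologies

JS
The most widely used scripting language on the web. It can collect user-tracking data, submit forms without reloading the page, embed multimedia in a page, and even drive entire web pages.
jQuery
jQuery is a fast, concise JavaScript framework that wraps commonly used JavaScript functionality.
ajax
ajax lets a web page update asynchronously: part of the page can be refreshed without reloading the whole page.

3. Ways to get ajax data

1. Analyze the interface that the ajax call hits, then request that interface from code.
2. Use Selenium + chromedriver to simulate browser behavior and get the data.

Approach	Pros	Cons
Analyze the interface	Requests the data directly, with little parsing needed. Less code, better performance.	Analyzing the interface can be complicated, especially when it is obfuscated with JS, and requires some JS skill. Easier to be detected as a crawler.
selenium	Simulates browser behavior directly; anything the browser can request, selenium can too. The crawler is more stable.	More code, lower performance.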

Example 1: analyzing the ajax interface to scrape wallpapers from the Baidu Tieba 海贼王 (One Piece) forum
Home page: 'https://tieba.baidu.com/p/1934517161#!/l/p1', 98 images in total
The data we want is images, so we only need to find the src of every image.
Scroll the mouse wheel and watch the data loaded under XHR in the Network panel: three list requests

The image URLs are in these responses.
Target Request URLs:
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&wall_type=h&_=1620885803349
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=40&pe=79&wall_type=h&_=1620885803349
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=79&pe=118&wall_type=h&_=1620885805542

for i in range(1, 80, 39):
    response = requests.get('https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=' + str(i) + '&pe=' + str(i + 39),
        headers=headers).text
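
The string concatenation in the loop above can also be written with a parameter dict and urllib.parse.urlencode, which makes the paging parameters (ps, pe) easier to see. This is a sketch: the parameter names come from the captured requests, while build_list_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/photo/g/bw/picture/list"


def build_list_url(ps, pe):
    """Build the list URL for images ps..pe."""
    params = {
        "kw": "海贼王",  # urlencode percent-encodes this as UTF-8
        "alt": "jview",
        "rn": 200,
        "tid": 1934517161,
        "pn": 1,
        "ps": ps,
        "pe": pe,
    }
    return BASE_URL + "?" + urlencode(params)


# Same three ranges as the loop: 1-40, 40-79, 79-118
urls = [build_list_url(start, start + 39) for start in range(1, 80, 39)]
for url in urls:
    print(url)
```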

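In my own testing, https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=118
also returns all the data in a single request.
Parsing the data: you can convert the JSON into a Python dict and look up each image URL by key.
You can also use a regular expression: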

img_list += re.findall('"murl":"(.*?)"', response)
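
For the JSON route, a small recursive helper avoids hard-coding the nesting of the response. The sample payload below is invented for illustration; the real response layout may differ, and only the "murl" key is taken from the regular expression above.

```python
import json


def find_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Made-up sample; the real response structure is an assumption.
sample = '{"data": {"pic_list": [{"murl": "https://example.com/a.jpg"}, {"murl": "https://example.com/b.jpg"}]}}'
img_list = find_values(json.loads(sample), "murl")
print(img_list)
```

One advantage of json.loads over the regular expression is that it unescapes sequences such as \/ inside strings, so the URLs come out ready to use.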

Summary:
First, check whether the data you need is in the page source; if not, find the real interface through the Network panel.
Then, request that URL and get the response.
Finally, parse the data with a suitable method and save it.

import os
import re
from urllib.request import urlretrieve
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}
os.makedirs('./image', exist_ok=True)  # urlretrieve does not create the directory
img_list = []
for i in range(1, 80, 39):
    response = requests.get('https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=' + str(i) + '&pe=' + str(i + 39),
        headers=headers).text
    # print(response)
    img_list += re.findall('"murl":"(.*?)"', response)  # match the image URLs
for num, img in enumerate(img_list):
    urlretrieve(img, './image/%d.jpg' % num)  # save the image

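4. Getting dynamic data with Selenium + chromedriver

Introduction to Selenium

selenium is a web automation testing tool, originally developed for automated website testing. It drives a real browser and supports all mainstream browsers; it can take instructions to load pages automatically, extract the data you need, and even take screenshots of pages.
Selenium does not include a browser of its own and provides no browser functionality; it has to be paired with a third-party browser. Since we sometimes want it to run embedded in code, we drive the browser through a tool such as PhantomJS or chromedriver.
PhantomJS and chromedriver are operated the same way and offer the same features.
The main difference: PhantomJS has no UI, which saves memory;
chromedriver drives a full Chrome browser, which uses more memory.

PhantomJS quick start

PhantomJS is a WebKit-based browser with a JavaScript API. It uses QtWebKit for its core browser functionality and WebKit's engine to run JavaScript. Anything you can do in a WebKit-based browser, it can do too. Besides being an invisible browser with support for CSS selectors, web standards, DOM manipulation, JSON, HTML5, Canvas, SVG and so on, it also provides file I/O, so you can read and write files on the operating system. PhantomJS has many uses, such as network monitoring, page screenshots, browserless web testing, and page automation.
PhantomJS is a WebKit-based "headless" browser: it loads a site into memory and executes the JavaScript on the page. Because it renders no graphical interface, it runs more efficiently than a full browser.
Combining Selenium with PhantomJS gives a very capable crawler that can handle JavaScript, cookies, headers, and anything else a real user would do.
Note that PhantomJS development was suspended in 2018 and Selenium 4 removed webdriver.PhantomJS, so the PhantomJS example in this section needs Selenium 3.x; headless Chrome is the usual replacement today.

Official download: https://phantomjs.org/download.html
Mirror in China: https://npm.taobao.org/dist/phantomjs/
Download the PhantomJS package and unzip it:

On Windows, copy phantomjs.exe from the bin folder of the downloaded phantomjs folder into the same directory as python.exe (alternatively, pass the path in code: webdriver.PhantomJS("path to phantomjs.exe"));
On Mac, copy the phantomjs file from the bin folder of the downloaded phantomjs folder into "Library/Python/2.7/site-packages".
Now webdriver and phantomjs can be used from a Python file (here phantomjs only serves as a windowless browser).

Headless browser: a complete browser core, including the JS engine, rendering engine, and request handling, but without the display and user-interaction UI.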
PhantomJS example

# PhantomJS quick start (needs Selenium 3.x: webdriver.PhantomJS was removed in Selenium 4)
# Import modules
import time
from selenium import webdriver

# driver = webdriver.PhantomJS("install path")  # or put the extracted phantomjs.exe next to python.exe
# 1. Create the driver
driver = webdriver.PhantomJS()
# Open Baidu
driver.get("https://www.baidu.com")

# 2. Locate and interact
driver.find_element_by_id("kw").send_keys("羽哥")  # type '羽哥' into the search box
driver.find_element_by_id("su").click()  # click the "百度一下" (search) button

time.sleep(3)

# Screenshot
driver.save_screenshot("baidu.png")

# 3. Inspect the request
print(driver.current_url)  # URL of the current page
print(driver.page_source)  # page source
print(driver.get_cookies())  # cookies


# 4. Quit
driver.quit()
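
The same windowless behavior is available from Chrome itself. The sketch below is one way to set it up under current Selenium, assuming Chrome 109 or newer (older versions take plain --headless) and a matching chromedriver. The helper names and the window size are illustrative choices, not part of Selenium.

```python
def headless_chrome_args(width=1920, height=1080):
    """Command-line switches for a windowless Chrome session."""
    return ["--headless=new", "--disable-gpu", f"--window-size={width},{height}"]


def make_headless_chrome():
    """Start Chrome without a window; needs selenium and Chrome installed."""
    from selenium import webdriver  # imported lazily

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Usage mirrors the PhantomJS example: driver = make_headless_chrome(), then driver.get(...), driver.save_screenshot("baidu.png"), and finally driver.quit().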

chromedriver is the driver program for the Chrome browser; the browser can only be automated through it, and its version must match the installed Chrome version. Each browser has its own driver. Browsers and their drivers:
Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Downloading chromedriver
Search Baidu for the Taobao mirror (https://npm.taobao.org/)
Installation notes: https://www.jianshu.com/p/a383e8970135
Install Selenium: pip install selenium
chromedriver download: https://npm.taobao.org/mirrors/chromedriver/

chromedriver quick start

# chromedriver quick start
from selenium import webdriver
import time
driver = webdriver.Chrome()

# Open Baidu
driver.get("https://www.baidu.com")
# Maximize the window
driver.maximize_window()
# Get the cookies
cookie = driver.get_cookies()
print(cookie)

time.sleep(3)
# Close the current window
driver.close()

time.sleep(1)
# Quit the browser
driver.quit()

5. Locating elements

from selenium import webdriver
from selenium.webdriver.common.by import By
'''
class By(object):
    """
    Set of supported locator strategies.
    """

    ID = "id"
    XPATH = "xpath"
    LINK_TEXT = "link text"
    PARTIAL_LINK_TEXT = "partial link text"
    NAME = "name"
    TAG_NAME = "tag name"
    CLASS_NAME = "class name"
    CSS_SELECTOR = "css selector"
'''
driver = webdriver.Chrome()
driver.get("http://www.baidu.com")

find_element_by_id: find an element by its id

submitTag = driver.find_element_by_id('su')
submitTag1 = driver.find_element(By.ID,'su')

find_element_by_class_name: find an element by class name

submitTag = driver.find_element_by_class_name('s_btn')
submitTag1 = driver.find_element(By.CLASS_NAME,'s_btn')

find_element_by_name: find an element by the value of its name attribute

submitTag = driver.find_element_by_name('wd')
submitTag1 = driver.find_element(By.NAME,'wd')

find_element_by_tag_name: find an element by tag name

submitTag = driver.find_element_by_tag_name('div')
submitTag1 = driver.find_element(By.TAG_NAME,'div')

find_element_by_xpath: find an element using XPath syntax

submitTag = driver.find_element_by_xpath('//div')
submitTag1 = driver.find_element(By.XPATH,'//div')

Note that find_element returns the first matching element, while find_elements returns all matching elements.
The find_element_by_* helpers were deprecated in Selenium 4 and removed in 4.3.0; on current versions only the find_element(By.ID, ...) form works.
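
Because a locator is just a (strategy, value) pair, locators can be kept in one table and passed to find_element or find_elements as needed. The sketch below uses the plain string values of By.ID and By.TAG_NAME from the class listing above. The LOCATORS table and helper names are hypothetical, and the ids are the Baidu ones used in this article.

```python
# (strategy, value) pairs; "id" and "tag name" are the string values
# behind By.ID and By.TAG_NAME.
LOCATORS = {
    "search_box": ("id", "kw"),
    "search_button": ("id", "su"),
    "links": ("tag name", "a"),
}


def find_one(driver, name):
    """Return the first element matching the named locator."""
    by, value = LOCATORS[name]
    return driver.find_element(by, value)


def find_all(driver, name):
    """Return every element matching the named locator (possibly empty)."""
    by, value = LOCATORS[name]
    return driver.find_elements(by, value)
```

When nothing matches, find_element raises NoSuchElementException, whereas find_elements returns an empty list, so find_all is the safer choice for checking whether an element exists.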

6. Working with form elements

Input boxes: two steps.
Step 1: find the element.
Step 2: call send_keys(value) to fill in the data

inputTag = driver.find_element(By.ID, 'kw')
inputTag.send_keys('羽哥')  # type the text

Use the clear method to clear the contents of an input box

inputTag.clear()

Buttons
There are many ways to operate a button: click, right-click, double-click, and so on. The most common is a plain click, done by calling click()

inputTag = driver.find_element(By.ID, 'su')
inputTag.click()

select elements
A select element cannot simply be clicked, because after clicking you still have to choose an option. selenium provides a dedicated class for select tags: from selenium.webdriver.support.ui import Select. Pass the located element to this class to create a Select object, then use that object to choose options. Demo page: https://www.17sucai.com/boards/53562.html

Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome()
driver.get("https://www.17sucai.com/pins/demo-show?id=5926")
# Switch into the iframe
driver.switch_to.frame(driver.find_element(By.ID, 'iframe'))
# ------------------------------------ 1. Operating on a real select tag --------------------------------------------
# selectTag = Select(driver.find_element(By.CLASS_NAME, 'nojs'))
# # Ways to select
# # 1. Select by value
# # selectTag.select_by_value('JP')
# # 2. Select by index
# selectTag.select_by_index(6)
# ------------------------------------ 2. Operating on a non-select dropdown --------------------------------------------
driver.find_element(By.ID, 'dk_container_country-nofake').click()
key = int(input('Enter a number (1-6): '))
if 1 <= key <= 6:
    # key 1 maps to li[2], key 6 to li[7]
    driver.find_element(By.XPATH, f'//*[@id="dk_container_country-nofake"]/div/ul/li[{key + 1}]').click()

7. Action chains

Sometimes an operation on a page takes several steps; the mouse action chain class ActionChains handles this. For example, moving the mouse onto an element and then clicking it.

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag,'python')
actions.move_to_element(submitTag)
actions.context_click()
actions.click(submitTag)
actions.perform()

More mouse operations

click_and_hold(element): press the mouse button without releasing it.
context_click(element): right-click.
double_click(element): double-click.

Example:

# Action chain docs: https://selenium-python.readthedocs.io/api.html
# Mouse actions: double_click(), click(), context_click() (right-click), drag_and_drop()
# drag_and_drop_by_offset(source, xoffset, yoffset): hold the left button on the source element, move by the offset, then release. xoffset: x offset to move by; yoffset: y offset to move by
# key_down(value, element=None): press a modifier key (such as Keys.CONTROL) without releasing it

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.baidu.com')

# Locate the input box
inputTag = driver.find_element(By.ID, 'kw')

# Locate the "百度一下" button
buttonTag = driver.find_element(By.ID, 'su')

# Create the ActionChains object
actions = ActionChains(driver)

# Move the mouse to the input box
actions.move_to_element(inputTag)

# Type the text
actions.send_keys_to_element(inputTag, '羽哥')

# Move the mouse to the button and click
actions.move_to_element(buttonTag).click()
# buttonTag.click()  # a direct element click runs immediately, not as part of the chain; use actions.move_to_element(buttonTag).click() as above, or call it after perform()

# Right-click to show the context menu
actions.context_click()

# Execute the queued actions
actions.perform()

# buttonTag.click()
'''
1. Never forget to call perform() on the action chain
2. Pay attention to how clicks are issued inside an action chain
'''

More methods: http://selenium-python.readthedocs.io/api.html

8. Cookies and page waits in Selenium

(1) Cookie operations

Get all cookies

cookies = driver.get_cookies()

Get a cookie by its name

value = driver.get_cookie(name)

Delete a cookie

driver.delete_cookie('key')
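
Cookies can also be written back with driver.add_cookie, which makes it possible to reuse a saved login in a later browser session. A minimal sketch, with hypothetical helper names and file path:

```python
import json


def save_cookies(driver, path):
    """Dump the driver's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add cookies from a JSON file to the driver; return how many were added."""
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

add_cookie only accepts cookies for the domain of the page currently loaded, so call driver.get() on the target site before load_cookies, then refresh the page.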

https://xui.ptlogin2.qq.com/cgi-bin/xlogin?proxy_url=https%3A//qzs.qq.com/qzone/v6/portal/proxy.html&daid=5&&hide_title_bar=1&low_login=0&qlogin_auto_login=1&no_verifyimg=1&link_target=blank&appid=549000912&style=22&target=self&s_url=https%3A%2F%2Fqzs.qzone.qq.com%2Fqzone%2Fv5%2Floginsucc.html%3Fpara%3Dizone&pt_qr_app=手机QQ空间&pt_qr_link=http%3A//z.qzone.com/download.html&self_regurl=https%3A//qzs.qq.com/qzone/v6/reg/index.html&pt_qr_help_link=http%3A//z.qzone.com/download.html&pt_no_auth=0

Example: logging in to QQ Zone by rebuilding cookies captured with selenium

# Log in to QQ Zone by rebuilding cookies captured with selenium
import json
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
url = 'https://xui.ptlogin2.qq.com/cgi-bin/xlogin?proxy_url=https%3A//qzs.qq.com/qzone/v6/portal/proxy.html&daid=5&&hide_title_bar=1&low_login=0&qlogin_auto_login=1&no_verifyimg=1&link_target=blank&appid=549000912&style=22&target=self&s_url=https%3A%2F%2Fqzs.qzone.qq.com%2Fqzone%2Fv5%2Floginsucc.html%3Fpara%3Dizone&pt_qr_app=%E6%89%8B%E6%9C%BAQQ%E7%A9%BA%E9%97%B4&pt_qr_link=http%3A//z.qzone.com/download.html&self_regurl=https%3A//qzs.qq.com/qzone/v6/reg/index.html&pt_qr_help_link=http%3A//z.qzone.com/download.html&pt_no_auth=0'
driver.get(url)
driver.find_element(By.CLASS_NAME, 'face').click()  # click the element with class 'face'
time.sleep(5)
# Print the QQ Zone URL
# print(driver.current_url)  # https://user.qzone.qq.com/515151091

# Get the cookies (they let us simulate a logged-in session): request the URL with the cookies attached
cookies = driver.get_cookies()
# print(type(cookies))  # <class 'list'>
# print(cookies)

# json.dumps() converts a Python data structure into a JSON string
# json.loads() converts a JSON string into a Python data structure

# Convert the cookies to a JSON string
jsonCookie = json.dumps(cookies)
# print('=='*30)
# print(type(jsonCookie),jsonCookie)  # class 'str'

# Save the data to a JSON file
with open('qqzone.json', 'w') as f:
    f.write(jsonCookie)

# Reformat the cookie data

with open('qqzone.json', 'r') as f:
    ListCookie = json.loads(f.read())
# print(ListCookie)

# Build the Cookie header from the cookies loaded back from the file
cookie = [item['name'] + '=' + item['value'] for item in ListCookie]
# print(cookie)
cookieStr = '; '.join(cookie)
# print(cookieStr)

url1 = 'https://user.qzone.qq.com/515151091'
headers = {
    'cookie':cookieStr,
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
}
html = requests.get(url=url1,headers=headers,verify=False).text
print(html)

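(2) Page waits

More and more pages use Ajax, so a program cannot know when a given element has finished loading. If the page loads slowly and a DOM element has not appeared yet when your code uses it, selenium raises NoSuchElementException. To handle this, Selenium provides two kinds of wait: implicit waits and explicit waits.

Implicit wait: call driver.implicitly_wait. After that, when an element is not yet available, the driver keeps looking for up to the given time (10 seconds here) before raising.

driver.implicitly_wait(10)

driver.implicitly_wait(5)
driver.find_element(By.ID, 'gb_closeDefaultWarningWindowDialog_id').click()  # close the popup

Explicit wait: an explicit wait proceeds only once a given condition holds. You also give a maximum time; if it is exceeded, a TimeoutException is raised. Explicit waits combine the conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait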

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.baidu.com/")

element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
 )

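# Wait until the departure station is filled in
WebDriverWait(driver, 1000).until(
    EC.text_to_be_present_in_element_value((By.ID, 'fromStationText'), '成都')
)
# Wait until the destination station is filled in
WebDriverWait(driver, 1000).until(
    EC.text_to_be_present_in_element_value((By.ID, 'toStationText'), '长沙')
)

Some other wait conditions

presence_of_element_located: an element is present in the DOM.
presence_of_all_elements_located: at least one element matching the locator is present; returns all matches.
element_to_be_clickable: an element is visible and enabled, so it can be clicked.

More conditions: http://selenium-python.readthedocs.io/waits.html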
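
To make the idea behind an explicit wait concrete, here is a pure-Python sketch of the polling loop. It is not Selenium's implementation, only the same pattern: call a condition repeatedly until it returns something truthy or the timeout expires. wait_until and element_loaded are illustrative names, and the "page" is simulated with a counter.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulated page state: the "element" appears on the third check.
checks = {"count": 0}


def element_loaded():
    checks["count"] += 1
    return "element" if checks["count"] >= 3 else None


print(wait_until(element_loaded, timeout=2, poll=0.01))  # element
```

WebDriverWait(driver, timeout).until(condition) works along the same lines: it passes the driver to the condition, polls every 0.5 seconds by default, and raises selenium's TimeoutException instead of TimeoutError.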

9. Opening multiple windows and switching between pages

Sometimes a browser window holds several tabs, and you need to switch between them. selenium provides switch_to.window for this; which page to switch to can be looked up in driver.window_handles

# Open a new page
driver.execute_script("window.open('url')")

print(driver.current_url)

# Switch to the new page
driver.switch_to.window(driver.window_handles[1])

Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.baidu.com')  # Baidu
driver.execute_script('window.open("https://www.douban.com")')  # open Douban in a new tab

# driver.close() closes the current window

# driver.quit() shuts down the driver and closes every window

driver.find_element(By.ID, 'kw').send_keys('python')
print(driver.current_url)  # https://www.baidu.com/

# Switch windows
driver.switch_to.window(driver.window_handles[1])
print(driver.current_url)  # https://www.douban.com/
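
Indexing window_handles[1] assumes exactly one extra tab. With more tabs, it can be more robust to remember the handles that existed before opening the new one and switch to whichever handle is new. A sketch with a hypothetical helper:

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle not in known_handles.

    Returns the handle switched to, or None if no new window exists.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None
```

Typical use: before = driver.window_handles, then driver.execute_script('window.open("https://www.douban.com")'), then switch_to_new_window(driver, before).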

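III. Recognizing image captchas

1. Tesseract: introduction and installation

What blocks a crawler is sometimes the image captcha shown when logging in or requesting certain data. So here we cover a technique for turning images into text, generally called Optical Character Recognition, or OCR. There are not many OCR libraries, especially open-source ones, because the field has real technical barriers (it needs large amounts of data, algorithms, and machine learning and deep learning knowledge) and a good implementation has high commercial value. Here we introduce a good open-source one: Tesseract.

Tesseract is an OCR engine that converts images to text, and its development has been sponsored by Google. It is widely regarded as one of the best and most accurate open-source OCR libraries. It recognizes text well and is flexible: it can be trained to recognize new fonts.
On Windows, download the installer from: https://github.com/tesseract-ocr/

Calling Tesseract from Python:

pip install pytesseract

Setting environment variables
After installation, to use Tesseract from the command line you need to set an environment variable. On Mac and Linux this is normally already done by the installer. On Windows, add the directory containing tesseract.exe to the PATH environment variable.
One more environment variable is needed: the path to the trained data files. Add

TESSDATA_PREFIX=D:\Tesseract-OCR\tessdata

Open cmd and run the command below to check the version; if it runs normally, the installation succeeded

tesseract --version

Using tesseract on the command line
tesseract <image path> <output file base name>

tesseract demo.png a

This writes the recognized text to a.txt.
To recognize Chinese text, download the language data
URL: https://github.com/tesseract-ocr/tessdat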

2. Recognizing an image with tesseract in code

import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'
tessdata_dir_config = r'--tessdata-dir "D:\Tesseract-OCR\tessdata"'
image = Image.open('demo.png')
print(pytesseract.image_to_string(image, lang='eng',  config=tessdata_dir_config))

3. Handling image captchas with pytesseract

Captcha URL: https://passport.lagou.com/vcode/create?from=register&refresh=1513081451891
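
Captcha images usually need cleaning before OCR. One common approach, sketched here under assumptions, is to convert the image to grayscale and binarize it with a fixed threshold before passing it to pytesseract. The threshold of 140, the --psm 7 setting (treat the image as a single line of text), and the helper names are illustrative choices, and the captcha is assumed to have been saved to a local file first.

```python
def threshold_table(threshold=140):
    """Lookup table: gray levels below `threshold` become 0 (black), others 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def recognize_captcha(path, threshold=140):
    """Grayscale, binarize, then OCR an image file.

    Needs Pillow, pytesseract, and the Tesseract executable.
    """
    import pytesseract
    from PIL import Image

    image = Image.open(path).convert("L")                  # grayscale
    image = image.point(threshold_table(threshold), "1")   # binarize
    text = pytesseract.image_to_string(image, lang="eng", config="--psm 7")
    return text.strip()
```

Heavily distorted captchas will still defeat this; in practice the threshold has to be tuned per site, which is one reason the cloud platforms in the next section exist.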

4. Cloud captcha-solving platforms

http://www.ttshitu.com/

import json
import requests
import base64
from io import BytesIO
from PIL import Image


def base64_api(uname, pwd, img):
    # Encode the image as a base64 JPEG and submit it for recognition
    img = img.convert('RGB')
    buffered = BytesIO()
    img.save(buffered, format="JPEG")
    b64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
    data = {"username": uname, "password": pwd, "image": b64}
    result = json.loads(requests.post("http://api.ttshitu.com/base64", json=data).text)
    if result['success']:
        return result["data"]["result"]
    # On failure, return the platform's error message
    return result["message"]


if __name__ == "__main__":
    img_path = "captcha.png"
    img = Image.open(img_path)
    result = base64_api(uname='', pwd='', img=img)
    print(result)
