Selenium CAPTCHA recognition: crawler basics (cookie, proxies, CAPTCHA recognition, selenium, headless browser)...

Latest recommended article published on 2024-07-01 23:27:54

weixin_39562185

cookie
Proxies
CAPTCHA recognition
selenium
Headless browser

cookie

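There are two ways to handle cookies:

Manual handling
- Copy the cookie from the relevant packet in your packet-capture tool into headers.
Automatic handling
- session object = requests.Session()
- What the session does: it sends GET and POST requests just like requests itself. Its distinguishing feature is that any cookie produced while a request is being sent is stored in the session object automatically.
  - A session therefore has to be used at least twice: once to store the cookie, once to send it.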
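Manual handling usually starts from the raw Cookie string copied out of the packet-capture tool. The following is a minimal sketch, assuming a made-up cookie value and a hypothetical helper named cookie_string_to_dict. It turns that string into a dict (which requests accepts through its cookies= parameter) and also shows the string placed directly in headers:

```python
from http.cookies import SimpleCookie


def cookie_string_to_dict(raw_cookie):
    """Split a raw 'k1=v1; k2=v2' Cookie header into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up value standing in for what you would copy from the capture tool.
raw = "token=abc123; uid=42"

cookies = cookie_string_to_dict(raw)
# Manual handling: the copied string goes straight into the request headers.
headers = {"User-Agent": "Mozilla/5.0", "Cookie": raw}
print(cookies)
```

Either form can then be passed along with the request, for example requests.get(url, headers=headers). A copied cookie expires eventually, which is why the automatic Session approach is usually less fragile.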

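Example: Xueqiu (雪球网)

Approach: requesting the data directly fails. When the code itself is correct, a failure like this usually means the request does not imitate a browser closely enough, so try adding the cookie, that is, use requests.Session().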

https://xueqiu.com/

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
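# Create a session; cookies set by the server are stored on it automatically
s = requests.Session()
first_url = 'https://xueqiu.com/'
# First request: visit the home page so the session captures the cookie
s.get(first_url,headers=headers)
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=110605&size=15'
# Second request: the session sends the stored cookie along
json_text = s.get(url,headers=headers).json()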
print(json_text)
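The store-then-use pattern can be reproduced offline. Below is a minimal sketch that uses only the standard library in place of requests.Session(): a throwaway local server (hypothetical, standing in for the real site) sets a cookie on / and rejects /data without it, while a cookie-aware opener plays the role of the session. All names here are illustrative assumptions.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: '/' sets a cookie, '/data' requires it."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(200)
            self.send_header("Set-Cookie", "token=abc123; Path=/")
            self.end_headers()
            self.wfile.write(b"home")
        elif "token=abc123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"data ok")
        else:
            self.send_response(403)
            self.end_headers()
            self.wfile.write(b"forbidden")

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_like_a_session(base_url):
    """First request stores the cookie, second request sends it back."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any system proxy settings
        urllib.request.HTTPCookieProcessor(jar),
    )
    opener.open(base_url + "/").read()  # store
    body = opener.open(base_url + "/data").read().decode()  # use
    return body, [cookie.name for cookie in jar]


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

body, cookie_names = fetch_like_a_session(base)
server.shutdown()
server.server_close()
print(body, cookie_names)
```

Skipping the first request would leave the jar empty, and the local server would answer /data with 403, which mirrors why the home page is visited before the JSON endpoint.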

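Proxies

A proxy here means a proxy server.
Why a crawler needs proxies:
- If you send a burst of requests to one site within a short time, the server bans the IP that the abnormal requests come from.
What a proxy does:
- It intercepts and forwards requests and responses.
Proxy types:
- http
- https
Anonymity levels:
- Transparent
- Anonymous
- High anonymity (the most useful)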
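In a requests proxies dict, the http and https keys refer to the scheme of the URL being fetched, so a request only goes through a proxy when the dict has an entry for that URL's scheme. Here is a simplified sketch of that lookup. pick_proxy is a hypothetical helper, not the requests implementation (which also honors keys such as all and per-host entries), and the addresses are placeholders.

```python
from urllib.parse import urlsplit


def pick_proxy(url, proxies):
    """Return the proxy registered for the URL's scheme, or None."""
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)


# Placeholder addresses from the documentation range 203.0.113.0/24.
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.11:8080",
}

print(pick_proxy("https://example.com/page", proxies))
print(pick_proxy("http://example.com/page", proxies))
print(pick_proxy("ftp://example.com/file", proxies))
```

This is why a dict that only contains an https entry has no effect on a plain http:// URL.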

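Example: proxy servers

Kuaidaili (快代理): https://www.kuaidaili.com/free/

Daili Jingling (代理精灵): http://http.zhiliandaili.cn/

You can buy proxy servers from Daili Jingling and then use them to crawl Kuaidaili.

Approach: pass one extra parameter, proxies, to the requests call; its value is a dict.

First fetch the proxy servers we need
Pack them into a list of dicts
Then call requests.get(url, headers=headers, proxies=random.choice(list_of_ip_dicts))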
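The three steps above can be sketched without touching the network. This sketch assumes the provider returns one ip:port per line and that the proxies speak plain HTTP. build_proxy_pool is a hypothetical helper and the addresses are placeholders. The requests documentation notes that proxy URLs must include the scheme, so the sketch prefixes http://.

```python
import random


def build_proxy_pool(raw_text):
    """Turn 'ip:port' lines into a list of requests-style proxies dicts."""
    pool = []
    for line in raw_text.splitlines():
        address = line.strip()
        if not address:
            continue  # ignore blank lines
        proxy_url = "http://" + address
        pool.append({"http": proxy_url, "https": proxy_url})
    return pool


# Placeholder addresses (documentation range), not real proxies.
sample = """
203.0.113.10:8080
203.0.113.11:3128

"""
pool = build_proxy_pool(sample)
chosen = random.choice(pool)  # a different proxy can be drawn for every request
print(len(pool), chosen)
```

The chosen dict is what would be handed to requests.get(url, headers=headers, proxies=chosen).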

import requests
from lxml import etree
import random
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# url = 'https://www.sogou.com/web?query=ip'
# page_text = requests.get(url,headers=headers).text
# tree = etree.HTML(page_text)
# ip = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
# print(ip)

# The same IP check, this time through a proxy
# url = 'https://www.sogou.com/web?query=ip'
# page_text = requests.get(url,headers=headers,proxies={'https':'110.18.152.241:30554'}).text
# tree = etree.HTML(page_text)
# ip = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
# print(ip)

# Build the proxy pool
ips = []
ip_url = 'http://t.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=50&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(url=ip_url,headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
for ip in ip_list:
    ip = ip.strip()
    if not ip:
        continue  # skip whitespace-only text nodes
    dic = {'https':ip}
    # ips collects the ip:port of every proxy server
    ips.append(dic)


url = 'https://www.kuaidaili.com/free/inha/%d/'
all_data = []
for page in range(2,10):
    print('Crawling page %d'%page)
    new_url = url%page
    page_text = requests.get(new_url,headers=headers,proxies=random.choice(ips)).text
    tree = etree.HTML(page_text)
    # Note: the tbody tag must not appear in the xpath expression
    ip_list = tree.xpath('//*[@id="list"]/table//tr/td[1]/text()')
    all_data.append(ip_list)
print(len(all_data))
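Free proxies fail often, so a single random.choice can land on a dead one. The following is a minimal sketch of retrying with a different proxy, assuming a hypothetical fetch(url, proxy) callable. Here it is a fake that never touches the network; in practice it could wrap requests.get, whose exceptions derive from OSError.

```python
import random


def fetch_with_rotation(fetch, url, pool, max_tries=3):
    """Call fetch(url, proxy), switching to another proxy after a failure."""
    candidates = list(pool)
    random.shuffle(candidates)
    last_error = None
    for proxy in candidates[:max_tries]:
        try:
            return fetch(url, proxy)
        except OSError as exc:  # connection-level failures
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error


def fake_fetch(url, proxy):
    """Stand-in for a real request: only the :3128 proxy 'works'."""
    if proxy["https"].endswith(":8080"):
        raise ConnectionError("proxy refused the connection")
    return "page fetched via " + proxy["https"]


# Placeholder addresses, not real proxies.
pool = [
    {"https": "http://203.0.113.10:8080"},
    {"https": "http://203.0.113.11:3128"},
]
page = fetch_with_rotation(fake_fetch, "https://example.com/", pool)
print(page)
```

With real traffic, fetch could be something like lambda url, proxy: requests.get(url, headers=headers, proxies=proxy, timeout=5).text, where the timeout keeps a dead proxy from stalling the loop.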

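CAPTCHA recognition

CAPTCHA-solving platforms

Chaojiying (超级鹰): http://www.chaojiying.com/about.html
Register and log in
Create a software ID
Check the price list and the CAPTCHA type codes
Read the API documentation
Yundama (云打码)

Go to the Chaojiying site, log in, and create a software ID.

Then download the Chaojiying documentation package.

After the download finishes, unzip the archive to see its contents.

Open the chaojiying file, or simply drop it into your PyCharm project; once you import it, it is ready to use.

The Chaojiying workflow requires the image to be saved locally first and recognized afterwards.
Example: scraping page data that requires login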

from Chaojiying import Chaojiying_Client
import requests
from lxml import etree
s = requests.Session()
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def get_text(imgPath,imgType):
    # User Center >> Software ID: generate one to replace 96001
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    # Local image path (replaces a.jpg); on Windows you may need //
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']

# print(get_text('./a.jpg',1902))

url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = s.get(url,headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.cn'+tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(img_data)
# Parse out the request parameters that change dynamically
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

result = get_text('./code.jpg',1902)
print(result)

login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': '13102165156@163.com',
    'pwd': '123456',
    'code': result,
    'denglu': '登录',
}
page_text = s.post(url=login_url,headers=headers,data=data).text
with open('./login.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

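Handling request parameters that change dynamically:

They are either generated dynamically by JS
or hidden in the front-end page

Example: the __VIEWSTATE and __VIEWSTATEGENERATOR parameters in the example above change dynamically

They can be parsed out with the code below (these two are embedded in the front-end page)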

__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
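When a form carries several hidden parameters, collecting all of them at once avoids writing one xpath per field. This is a minimal sketch that uses the standard library's html.parser instead of lxml. The sample form is made up, and HiddenInputCollector is a hypothetical helper.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if tag == "input" and input_type == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


# Made-up login form resembling the page in the example above.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C93BE1AE" />
  <input type="text" name="email" />
</form>
"""
collector = HiddenInputCollector()
collector.feed(sample_html)
print(collector.fields)
```

The resulting dict can be merged into the POST data before the visible fields (email, password, CAPTCHA) are added. Parameters generated by JS at runtime will not appear in the page source and need a different approach.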

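selenium (for fetching dynamic data)

https://www.cnblogs.com/bobo-zhang/p/9685362.html

A module for browser automation.
Install: pip install selenium (the listings below use Selenium 3 calls such as executable_path and find_element_by_xpath, which later Selenium 4 releases removed; pin selenium<4 to run them unchanged)
Download the browser driver: http://chromedriver.storage.googleapis.com/index.html
How it relates to crawling
- Conveniently scrapes dynamically loaded data (what you see is what you get)
- Simulates login

Pyppeteer: https://www.cnblogs.com/bobo-zhang/p/11113388.html

Strengths: it can click CAPTCHAs; some sites only show their data after you scroll down, so JS injection is needed as well

Basic operations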

from selenium import webdriver
from time import sleep

# Instantiate a browser object (any browser works)
bro = webdriver.Chrome(executable_path='../chromedriver.exe')

# Request the given URL
bro.get('https://www.jd.com/')

# Locate elements
search_box = bro.find_element_by_xpath('//*[@id="key"]')
search_box.send_keys('mac pro')

btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Inject JS: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)

bro.quit()

Scraping dynamically loaded data with selenium

from selenium import webdriver
from time import sleep

# Instantiate a browser object (any browser works)
bro = webdriver.Chrome(executable_path='../chromedriver.exe')

# Request the given URL
bro.get('https://www.jd.com/')

# Locate elements
search_box = bro.find_element_by_xpath('//*[@id="key"]')
search_box.send_keys('mac pro')

btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Inject JS: scroll to the bottom so the lazily loaded items appear
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)

bro.quit()
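A single scroll is not always enough: pages that load in chunks keep growing as you scroll. Below is a minimal sketch of scrolling until the page height stops changing. scroll_until_stable is a hypothetical helper that relies only on the driver's execute_script method, and FakeDriver is a stand-in so the loop can be run without a browser.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll triggers loading of the next chunk, if any remains.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


rounds = scroll_until_stable(FakeDriver(), pause=0)
print(rounds)
```

With a real browser the call would be scroll_until_stable(bro), after which bro.page_source contains the fully loaded page. The max_rounds cap guards against pages that scroll forever.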

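Action chains

from selenium import webdriver
from time import sleep
from lxml import etree
from selenium.webdriver import ActionChains
# Instantiate a browser object (any browser works)
bro = webdriver.Chrome(executable_path='../chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
sleep(1)
# Note: an element that lives inside a nested page (iframe) cannot be located directly
# Fix: switch into the iframe first
bro.switch_to.frame('iframeResult') # the argument is the id attribute of the iframe tag
div_tag = bro.find_element_by_xpath('//*[@id="draggable"]')


# Click and hold
action = ActionChains(bro) # create the action chain object
action.click_and_hold(div_tag)

for i in range(7):
    # perform() runs the queued actions immediately
    action.move_by_offset(15,21).perform()
    sleep(0.5)

# Release the mouse button and close the browser
action.release().perform()
bro.quit()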
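The loop above drags in seven equal steps of (15, 21), that is, (105, 147) in total. Here is a minimal sketch of deriving such steps from a total distance, so the same loop works for any drag. split_offset is a hypothetical helper that keeps every step a whole number of pixels.

```python
def split_offset(total_x, total_y, steps):
    """Split one drag into `steps` integer moves that sum to the total."""
    moves = []
    done_x = done_y = 0
    for i in range(1, steps + 1):
        # Target position after step i, rounded to whole pixels.
        target_x = round(total_x * i / steps)
        target_y = round(total_y * i / steps)
        moves.append((target_x - done_x, target_y - done_y))
        done_x, done_y = target_x, target_y
    return moves


moves = split_offset(105, 147, 7)
uneven = split_offset(100, 0, 3)
print(moves)
print(uneven)
```

Each pair would then be fed to the chain, for example for dx, dy in moves: action.move_by_offset(dx, dy).perform().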

Example: simulated login on 12306

The site uses a click CAPTCHA, so the CAPTCHA image has to be cropped out of a screenshot.
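Cropping means converting the element's position and size into the (left, upper, right, lower) box that PIL's Image.crop expects. Here is a minimal sketch with made-up coordinates. crop_box is a hypothetical helper, and its scale parameter is an assumption for displays whose scaling is not 100%, where the screenshot has more pixels than the page coordinates suggest.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower) in screenshot pixels."""
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    return (int(left), int(upper), int(right), int(lower))


# Made-up values in the shape selenium returns for .location and .size.
location = {"x": 260, "y": 280}
size = {"width": 293, "height": 190}

box_100 = crop_box(location, size)
box_125 = crop_box(location, size, scale=1.25)
print(box_100)
print(box_125)
```

The result is passed straight to Image.open('./main.png').crop(box). If the cropped image is offset or too small, the scale factor is the first thing to check.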

from selenium import webdriver
from Chaojiying import Chaojiying_Client
from time import sleep
from lxml import etree
from PIL import Image
from selenium.webdriver import ActionChains
# pip install Pillow

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def get_text(imgPath,imgType):
    # User Center >> Software ID: generate one to replace 96001
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    # Local image path (replaces a.jpg); on Windows you may need //
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']

# Instantiate a browser object (any browser works)
bro = webdriver.Chrome(executable_path='../chromedriver.exe')
login_url = 'https://kyfw.12306.cn/otn/login/init'
bro.get(login_url)
sleep(1)

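# Take a screenshot of the current browser page
bro.save_screenshot('./main.png')

# Crop the CAPTCHA region out of main.png and save the cropped image locally
code_img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = code_img_tag.location # coordinates of the CAPTCHA image's top-left corner
size = code_img_tag.size # width and height of the CAPTCHA image

# Define the crop region
# Note: if the crop comes out wrong, try setting the display scaling to 100%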
rangle = (int(location['x']),int(location['y']),int(location['x']+size['width']),int(location['y']+size['height']))
i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('./imgCode.png')


#11,22|33,44 ==>[[11,22],[33,44]]
result = get_text('./imgCode.png',9004)
all_list = []
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)


for loc in all_list:
    x = loc[0]
    y = loc[1]
    ActionChains(bro).move_to_element_with_offset(code_img_tag,x,y).click().perform()
    sleep(2)

bro.find_element_by_id('username').send_keys('111111')
sleep(1)
bro.find_element_by_id('password').send_keys('111111')
sleep(1)
bro.find_element_by_id('loginSub').click()
sleep(3)

bro.quit()
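The coordinate parsing in the listing above can be written more compactly, because str.split('|') also handles a result with a single point. Below is a minimal sketch with hypothetical helper names. The second helper rests on an assumption to verify against your installed version: recent Selenium 4 releases document the offsets of move_to_element_with_offset as relative to the element's center rather than its top-left corner, so points measured from the top-left corner would need converting there.

```python
def parse_click_points(result):
    """Parse 'x1,y1|x2,y2' into [(x1, y1), (x2, y2)]."""
    points = []
    for pair in result.split("|"):
        x_text, y_text = pair.split(",")
        points.append((int(x_text), int(y_text)))
    return points


def to_center_offset(point, size):
    """Convert a top-left-based point into an offset from the element center."""
    x, y = point
    return (x - size["width"] // 2, y - size["height"] // 2)


points = parse_click_points("11,22|33,44")
single = parse_click_points("55,66")
# Made-up image size in the shape selenium returns for .size.
offset = to_center_offset((11, 22), {"width": 293, "height": 190})
print(points, single, offset)
```

With the Selenium 3 API used in this article, the parsed points can be passed to move_to_element_with_offset unchanged.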

Headless browser

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# bro = webdriver.Chrome(executable_path='../chromedriver.exe')
#
# bro.get('https://www.chaojiying.com/price.html#table-item6')
# Get the cookies
# print(bro.get_cookies())


# Headless browser: a browser with no visible UI
# PhantomJS was a well-known headless browser early on
# Headless Chrome is the usual choice now

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')


bro = webdriver.Chrome(executable_path='../chromedriver.exe',options=chrome_options)
bro.get('https://www.baidu.com/')
print(bro.page_source)
bro.save_screenshot('./baidu.png')
bro.quit()
