Web Scraping 08: Handling CAPTCHAs

Author: qwerLoL123456

Category: Web scraping

Article link: https://blog.csdn.net/qwerLoL123456/article/details/82532577

1. Simulating a login with a cookie

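Below is an example of simulating a login with a cookie. The request in the code targets renren.com, and the cookie must be copied from a browser session that is already logged in.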

from urllib import request
import chardet
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                         ' Chrome/68.0.3440.106 Safari/537.36',
           'Accept-Language': 'zh-CN,zh;q=0.9',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'Cookie': 'anonymid=jl4m5fxn-3yrdyq; _r01_=1; _ga=GA1.2.55598029.1534921231; ln_uact=17752558702; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; depovince=HEN; JSESSIONID=abcr2TE_NKy2ZhDXjMYww; jebe_key=fd550795-812c-4587-bc70-8e3f15107ea7%7C077a3e2b1c00096d5c13732ceee74ce5%7C1534925077691%7C1%7C1536299679609; ick_login=08481884-827c-4c60-b431-03bc7e808af1; first_login_flag=1; wp_fold=0; wp=0; jebecookies=4047de36-5dab-44ec-ad91-07a21a72c724|||||; _de=32B20555AD3784A6BF2D3D01B72FE013; p=e3d71d28d2a54983c9f23fa49425047f2; t=fb2811d9a2c4767edcd48c5922ee28062; societyguester=fb2811d9a2c4767edcd48c5922ee28062; id=966924492; xnsid=4d654f12; ver=7.0; loginfrom=null'}
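# Build the request with the logged-in headers and fetch the page
req = request.Request('https://www.renren.com/', headers=headers)
response = request.urlopen(req)
html = response.read()
# chardet may report None as the encoding, so fall back to UTF-8
charset = chardet.detect(html)['encoding'] or 'utf-8'
print(charset)
print(html.decode(charset, errors='replace'))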
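
Pasting a long cookie string into the headers dict by hand is error-prone. The following is a minimal sketch, not part of the original example, that splits a raw Cookie header into a dict and rebuilds a request from it. The helper names `parse_cookie_header` and `build_request` are hypothetical, and the two cookie values are taken from the header above only for illustration.

```python
from urllib import request


def parse_cookie_header(raw):
    """Split a raw 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because cookie values may contain '='
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies


def build_request(url, cookies, user_agent='Mozilla/5.0'):
    """Build a urllib Request that carries the given cookies (hypothetical helper)."""
    cookie_header = '; '.join(f'{k}={v}' for k, v in cookies.items())
    headers = {'User-Agent': user_agent, 'Cookie': cookie_header}
    return request.Request(url, headers=headers)


# Constructing the Request does not send anything over the network
req = build_request('https://www.renren.com/',
                    parse_cookie_header('id=966924492; ver=7.0'))
print(req.get_header('Cookie'))
```

Passing `req` to `request.urlopen` then works exactly as in the example above. Keeping the cookies in a dict makes it easy to replace a single expired value without touching the rest.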

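2. Traditional CAPTCHA recognition

This requires a library: pip install pytesseract. pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed on the system.

The simple case is black text on a plain white background: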

import pytesseract
from PIL import Image
image = Image.open('./images/tesseracttest.jpg')
text = pytesseract.image_to_string(image)
print(text)
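
The string returned by OCR often contains spaces, line breaks, or other stray characters around the recognized text. As a small sketch under the assumption that the CAPTCHA contains only letters and digits, the hypothetical helper `clean_captcha_text` below strips everything else before the value is submitted.

```python
import re


def clean_captcha_text(raw, allowed=r'A-Za-z0-9'):
    """Keep only the characters a CAPTCHA is assumed to contain.

    `allowed` is a regex character-class body; the default (letters and
    digits) is an assumption and should match the target site's CAPTCHA.
    """
    return re.sub(f'[^{allowed}]', '', raw)


# Simulated OCR output with a space, a newline, and a form feed
cleaned = clean_captcha_text(' a7 Kq\n\x0c')
print(cleaned)
```

In the example above, this would be applied as `clean_captcha_text(pytesseract.image_to_string(image))`.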

A background with several colors, where the font is not heavily distorted:

import pytesseract
from PIL import Image
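img = Image.open('./images/recaptcha.png')
img.show()
# CAPTCHA text is usually black while the background is brighter, so the text
# can be separated by checking whether each pixel is black. This process is
# called thresholding, and Pillow makes it easy to implement.
gray = img.convert('L')  # convert to grayscale
gray.show()
# Only pure-black pixels (value 0) stay black; every other pixel becomes white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.show()
print(pytesseract.image_to_string(bw))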
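
To make the effect of the threshold visible without an image file, here is a pure-Python sketch of the same per-pixel logic. It assumes the standard luma weights (299, 587, 114) that Pillow documents for `convert('L')`; Pillow's own rounding may differ slightly, and the function names `to_gray` and `binarize` are hypothetical.

```python
def to_gray(rgb):
    """Approximate grayscale value of an (R, G, B) pixel using luma weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def binarize(pixels, threshold=1):
    """Map each pixel to 0 (black) if its gray value is below threshold, else 255."""
    return [[0 if to_gray(p) < threshold else 255 for p in row]
            for row in pixels]


# One row: pure black text, dark gray (anti-aliased) text edge,
# a colored background pixel, and a white pixel
row = [(0, 0, 0), (40, 40, 40), (200, 180, 90), (255, 255, 255)]

strict = binarize([row], threshold=1)[0]    # same cutoff as the example above
loose = binarize([row], threshold=128)[0]   # a more tolerant cutoff

print(strict)
print(loose)
```

With the cutoff of 1 used above, the dark gray edge pixel is turned white along with the background, so only pure-black strokes survive. If the text in a given CAPTCHA is not perfectly black, raising the threshold may keep more of each character at the cost of letting in more background noise.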

3. Capturing a CAPTCHA from a web page

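from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
# Save a screenshot of the whole page
browser.save_screenshot('./images/baidu.png')
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
element = browser.find_element(By.XPATH, '//div[@id="lg"]/img[1]')
# element.location may be slightly offset, but it pins down the CAPTCHA's position consistently on every run, so apply a small correction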
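
The original code is cut off at this point. As a hedged sketch of one common way to continue, and not necessarily the author's, the element's `location` and `size` dicts can be turned into a crop box for the saved screenshot. The function `crop_box` and its `scale` and `offset` parameters are hypothetical; `scale` is an assumption that covers displays where the screenshot has more pixels than the page has CSS pixels.

```python
def crop_box(location, size, scale=1.0, offset=(0, 0)):
    """Compute a (left, upper, right, lower) box for PIL's Image.crop.

    location: dict with 'x' and 'y', as returned by element.location
    size:     dict with 'width' and 'height', as returned by element.size
    scale:    assumed ratio of screenshot pixels to CSS pixels
    offset:   manual (dx, dy) correction in CSS pixels
    """
    left = (location['x'] + offset[0]) * scale
    top = (location['y'] + offset[1]) * scale
    right = left + size['width'] * scale
    bottom = top + size['height'] * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))


# Made-up values standing in for element.location and element.size
box = crop_box({'x': 100, 'y': 50}, {'width': 272, 'height': 128},
               scale=1.25, offset=(0, 2))
print(box)

# With the browser session above, the crop could then look like:
#   shot = Image.open('./images/baidu.png')
#   shot.crop(crop_box(element.location, element.size)).save('./images/captcha.png')
```

The cropped image can then be passed through the thresholding and OCR steps from section 2.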
