Web Scraping: Solving Character CAPTCHAs

狄鸠

Category: Python Web Scraping (Python爬虫)

Article link: https://blog.csdn.net/weixin_44038881/article/details/106738141

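Character CAPTCHAs

I. Introduction to Character CAPTCHAs

1. What is a CAPTCHA

CAPTCHAs are a common anti-scraping measure that you will run into when developing crawlers. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart": a public, fully automated program that determines whether a user is a computer or a human.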

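2. Types of CAPTCHA

Image CAPTCHA: in most cases the server generates a random string and renders it into an image, adding noise dots, interference lines, distortion, overlapping characters, varied colors, and warping to make recognition harder.
Slider CAPTCHA: also called a behavioral CAPTCHA. This popular type verifies the user through the way they interact with the page; the best-known provider is GeeTest (极验).

Slider CAPTCHAs generally rely on machine-learning models (deep learning in particular) to tell real users from bots based on behavioral features. They record the overall sliding speed, the instantaneous speed in each short time slice, the mouse clicks, and how well the slider matches the gap at the end. Dragging the slider to the right position does not by itself pass the check. The decision is made from the behavioral features, and reaching the right position is only a necessary condition.
Click CAPTCHA: the user is shown an image containing characters and a text prompt, and must click the positions of the matching characters in the image.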
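
To make the image CAPTCHA description above concrete, here is a minimal sketch of how such an image could be produced with Pillow. The function name `make_captcha`, the image size, the character set, and the amount of noise are all assumptions chosen for illustration, and real sites use far stronger distortion.

```python
import random
import string

from PIL import Image, ImageDraw


def make_captcha(length=4, size=(120, 40), seed=None):
    """Return (text, image) for a simple noisy character CAPTCHA (illustrative only)."""
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    text = "".join(rng.choice(alphabet) for _ in range(length))

    width, height = size
    img = Image.new("RGB", size, (255, 255, 255))
    draw = ImageDraw.Draw(img)

    # Draw each character at a jittered height in a random dark color.
    for i, ch in enumerate(text):
        x = 10 + i * (width - 20) // length
        y = rng.randint(5, height - 20)
        color = tuple(rng.randint(0, 120) for _ in range(3))
        draw.text((x, y), ch, fill=color)

    # Interference lines across the image.
    for _ in range(3):
        start = (rng.randint(0, width - 1), rng.randint(0, height - 1))
        end = (rng.randint(0, width - 1), rng.randint(0, height - 1))
        draw.line([start, end], fill=(80, 80, 80), width=1)

    # Random noise dots.
    for _ in range(100):
        draw.point((rng.randint(0, width - 1), rng.randint(0, height - 1)), fill=(0, 0, 0))

    return text, img
```

Generating your own samples this way is handy for testing a recognition pipeline, because the correct answer for every image is known.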

This article focuses on how to handle image CAPTCHAs programmatically.

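II. The Pillow Image Processing Library

1. Overview

Official documentation: https://pillow.readthedocs.io/en/latest/installation.html

Pillow is a fork of PIL. PIL only supports Python 2, while Pillow builds on PIL and supports Python 3.

2. Installing and importing

Install: pip install pillow

Import: from PIL import Image (note: although the package is installed as pillow, it is still imported from the PIL namespace in Python 3)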

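3. Basic usage

Creating an image

from PIL import Image
# Image is the most important class in Pillow

# Create an image: 1. mode ('RGB', 'L', ...)  2. size in pixels (width, height)  3. color (an RGB tuple, or a string such as '#333')
img = Image.new('RGB', (20, 20), (255, 0, 0))

# Save the image
img.save('red.png')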

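Reading an image

img = Image.open('one.png')
img.show()            # open the image in the system's default viewer
print(img.filename)   # file name (only set for images loaded with open())
print(img.mode)       # image mode (RGB, L, RGBA)
print(img.size)       # image size as (width, height) in pixels
print(img.info)       # dict of extra metadata read from the file; its contents depend on the format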

Common methods

- Cropping

img = Image.open('black.jpg')
a = img.crop((0, 0, 5, 5))    # crop the box (left, upper, right, lower); returns a new image
a.show()
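
One common use of `crop` in CAPTCHA work is cutting an image into per-character slices. The sketch below assumes the characters are evenly spaced, which is often not true of real CAPTCHAs. The helper name `split_characters` is hypothetical, and the blank demo image stands in for a real one.

```python
from PIL import Image


def split_characters(img, count):
    """Cut an image into `count` vertical slices of (almost) equal width."""
    width, height = img.size
    pieces = []
    for i in range(count):
        left = i * width // count
        right = (i + 1) * width // count
        # crop takes a (left, upper, right, lower) box and returns a new image
        pieces.append(img.crop((left, 0, right, height)))
    return pieces


demo = Image.new("L", (80, 30), 255)   # stand-in for a real CAPTCHA image
pieces = split_characters(demo, 4)
print([p.size for p in pieces])
```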

- Pasting one image onto another

img1 = Image.open('red.jpg')
img2 = Image.open('black.png')

img2.paste(img1, (0, 0))    # upper-left corner where img1 is pasted; img2 is modified in place
img2.show()

- Converting to grayscale

img = Image.open('code.jpg')
img.show()
img = img.convert('L')	# convert to a grayscale image
img.show()

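- Reading pixel values

img = Image.open('code.jpg')
img = img.convert('L')      # convert to a single channel (grayscale)
print(list(img.getdata()))  # all pixel values, row by row

a = img.getpixel((10, 10))	# value of the single pixel at (x, y) = (10, 10)
print(a)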
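
Grayscale conversion and pixel access are usually combined to clean up a CAPTCHA before recognition. The following is a minimal sketch of two typical preprocessing steps, binarization and removal of isolated noise pixels. The function names, the threshold of 128, and the neighbor count of 2 are assumptions that need tuning for each CAPTCHA style.

```python
from PIL import Image


def binarize(img, threshold=128):
    """Convert to grayscale, then map every pixel to 0 (black) or 255 (white)."""
    gray = img.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)


def remove_isolated_pixels(img, min_neighbors=2):
    """Whiten black pixels that have fewer than `min_neighbors` black 8-neighbors.

    Expects a binarized mode "L" image (pixel values 0 or 255).
    """
    width, height = img.size
    src = img.load()
    cleaned = img.copy()
    dst = cleaned.load()
    for x in range(width):
        for y in range(height):
            if src[x, y] != 0:
                continue
            black = 0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    inside = 0 <= nx < width and 0 <= ny < height
                    if (dx or dy) and inside and src[nx, ny] == 0:
                        black += 1
            if black < min_neighbors:
                dst[x, y] = 255
    return cleaned
```

Reading from the original pixels and writing to a copy keeps the result independent of the scan order.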

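III. Character Recognition

1. Installing the tools

Installing the Tesseract OCR engine

GitHub download page: https://github.com/UB-Mannheim/tesseract/wiki
Installing Python-tesseract

Official documentation: https://github.com/madmaze/pytesseract

Python-tesseract is an optical character recognition (OCR) tool for Python: it recognizes and "reads" the text embedded in images. It is a Python wrapper around Google's Tesseract OCR engine. It is also useful as a stand-alone invocation script for Tesseract, because it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, and tiff.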

Installation
```
pip install pytesseract
pip install pytesser3	# (for use from Python)
```
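
With the engine and the wrapper installed, recognition comes down to one call to `pytesseract.image_to_string`. The sketch below is a minimal example under stated assumptions: the helper names, the expected length of 4, and the commented Windows path are hypothetical, and plain Tesseract often misreads heavily distorted CAPTCHAs unless the image is preprocessed first.

```python
import re

from PIL import Image


def clean_ocr_text(raw, length=4):
    """Keep only letters and digits from OCR output; return None if the length is wrong."""
    text = re.sub(r"[^0-9A-Za-z]", "", raw)
    return text if len(text) == length else None


def recognize_captcha(path, length=4):
    """Run Tesseract on an image file and return the cleaned text, or None."""
    # Imported here so clean_ocr_text can be used without Tesseract installed.
    import pytesseract

    # On Windows you may need to point pytesseract at the installed binary, e.g.
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    img = Image.open(path).convert("L")
    # --psm 7 tells Tesseract to treat the image as a single line of text.
    raw = pytesseract.image_to_string(img, config="--psm 7")
    return clean_ocr_text(raw, length)
```

Rejecting results of the wrong length is a cheap sanity check: when `recognize_captcha` returns None, the crawler can request a fresh CAPTCHA instead of submitting a guess that is certain to fail.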

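Weibo login

The script below logs in to Weibo with Selenium and sends the CAPTCHA image to the Chaojiying (超级鹰) recognition service.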

# -*- coding: utf-8 -*-
import time
from hashlib import md5
from io import BytesIO

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id, img_type):
        self.username = username
        password = password.encode('utf8')
        # The service expects the MD5 hex digest of the password, not the plain text
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.img_type = img_type
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im):
        """
        im: image bytes (or a file-like object)
        The CAPTCHA type code (codetype) is taken from self.img_type; see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': self.img_type,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post(
            'http://upload.chaojiying.net/Upload/Processing.php',
            data=params,
            files=files,
            headers=self.headers,
        )
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result was wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


class WeiBoLogin:
    # Local paths to Chrome and chromedriver; adjust them for your machine
    binary_location = r"D:\All_Work_App\Google\Google\Chrome\Application\chrome.exe"
    chromedriver_path = r"D:\Python\CREATE_PYTHON_ENV\Spider_env\chromedriver.exe"

    def __init__(self):
        self.opt = webdriver.ChromeOptions()
        self.opt.binary_location = self.binary_location
        # Selenium 4 style: the driver path is passed via Service and the options via `options`
        self.driver = webdriver.Chrome(service=Service(self.chromedriver_path), options=self.opt)
        self.wait = WebDriverWait(self.driver, timeout=10)
        self.url = 'https://weibo.com/'
        # Chaojiying user name, password, software ID, and CAPTCHA type code (fill in your own account)
        self.super_eagle = Chaojiying_Client('your_chaojiying_username', 'your_chaojiying_password', '901157', '1006')

    def input_and_click(self, username, password):
        # 1. Wait for the page to finish loading
        self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'W_unlogin_v4')))

        # 2. Enter the user name and password
        username_input = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginname')))
        username_input.send_keys(username)
        password_input = self.wait.until(EC.element_to_be_clickable((By.NAME, 'password')))
        password_input.send_keys(password)
        time.sleep(2)

        # 3. Enter the CAPTCHA text
        verify_door = self.get_verify_door()
        verify_input = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//input[@name="verifycode"]')))
        verify_input.send_keys(verify_door)

        # 4. Click the login button
        submit = self.driver.find_element(By.XPATH, '//a[@class="W_btn_a btn_32px"][1]')
        submit.click()

        # 5. Check whether the login succeeded
        try:
            WebDriverWait(self.driver, timeout=5).until(EC.presence_of_element_located((By.CLASS_NAME, 'nameBox')))
        except Exception:
            print('Login failed')
            return None

        # 6. Collect the cookies and return them
        cookies = {cookie['name']: cookie['value'] for cookie in self.driver.get_cookies()}
        print('Login succeeded')
        return cookies

    def get_verify_door(self):
        verify_img = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@class="code W_fl"]/img')))
        verify_img_ = BytesIO(verify_img.screenshot_as_png)
        res = self.super_eagle.PostPic(verify_img_)
        print(res)
        return res['pic_str']

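    def main(self, username, password):
        self.driver.get(self.url)
        self.driver.maximize_window()
        cookies = self.input_and_click(username, password)
        print(cookies)
        return cookies

    def shutdown(self):
        time.sleep(5)
        self.driver.quit()


if __name__ == '__main__':
    username = 'xxxxxxx'
    password = 'xxxxxxx'
    wb = WeiBoLogin()
    try:
        wb.main(username, password)
    finally:
        # Always close the browser, even if the login raised an exception
        wb.shutdown()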
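
The script returns the login cookies as a plain dict, and the usual next step in a crawler is to hand them to `requests` so later pages can be fetched without the browser. A minimal sketch follows. The helper name `session_from_cookies` and the default User-Agent string are assumptions, and whether Weibo accepts cookies replayed this way is not something the original confirms.

```python
import requests


def session_from_cookies(cookies, user_agent="Mozilla/5.0"):
    """Build a requests.Session that carries the cookies captured by Selenium."""
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    # `cookies` is the {name: value} dict returned by input_and_click
    session.cookies.update(cookies)
    return session
```

Assuming the login succeeded, `session_from_cookies(cookies).get('https://weibo.com/')` would then send the same cookies the browser held.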
