Data Mining in Practice (9): Breaking CAPTCHAs with a Neural Network

Most recently recommended article published on 2024-07-22 15:15:45

bb8886

Categories: Data Mining, Data Analysis, python. Tags: data mining, algorithms, machine learning, python

Article link: https://blog.csdn.net/bb8886/article/details/129579214

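Chapter overview: this chapter shows how to use a neural network to recognise the letters in a CAPTCHA image from its pixel values, so that the CAPTCHA can be read automatically.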

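Steps:

(1) Split the large image into four small images, each containing a single letter.
(2) Classify each letter.
(3) Recombine the letters into a word.
(4) Use a dictionary to correct word recognition errors.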

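1. Introduction to Neural Networks

A neural network consists of a set of interconnected neurons. Each neuron is a simple function that takes some inputs and produces an output. A neuron can use any standard function to process its data, for example a linear function; these functions are collectively called activation functions. Because the neurons are densely connected and work together, the network as a whole can learn a model, which makes neural networks one of the most powerful concepts in machine learning.

In neural networks used for data mining, the neurons are arranged in layers. The first layer, the input layer, receives its input from the dataset. Each neuron in the first layer computes on that input and passes its result to the neurons in the second layer. This arrangement is called a feed-forward neural network: the output of one layer is the input of the next, until the last layer, the output layer, is reached. The output represents the classification given by the neural network classifier. All layers between the input and output layers are called hidden layers, because the way they represent the data is hard for people to interpret. Most neural networks have at least three layers, and the networks used in most applications today have many more.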
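
To make the layer-by-layer flow concrete, here is a minimal sketch of a forward pass in plain Python. The network shape, the weights and the choice of a sigmoid activation are all made up for illustration; they are not the model trained later in this article.

```python
import math

def sigmoid(x):
    # A common activation function that squashes any value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases, activation):
    # Each neuron: weighted sum of the inputs plus a bias, then the activation
    return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def forward(inputs, layers):
    # The output of one layer becomes the input of the next
    values = inputs
    for weights, biases, activation in layers:
        values = layer_forward(values, weights, biases, activation)
    return values

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], sigmoid)
output = ([[1.0, -1.0]], [0.0], sigmoid)
result = forward([1.0, 1.0], [hidden, output])
print(result)
```

Training consists of adjusting the weights and biases so that the output matches the known targets; the sketch only covers the prediction direction.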

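2. Predicting Single Letters

Drawing the CAPTCHA

We write a function that creates a CAPTCHA: it draws an image containing a word and applies a shear transform to it.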

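About Draw.text():
ImageDraw.Draw.text(xy, text, fill=None, font=None, anchor=None, spacing=4, align="left")
Parameters:
xy - position of the text.
text - the text to draw. If it contains any newline characters, the text is passed on to multiline_text().
fill - colour to use for the text.
font - an ImageFont instance.
spacing - if the text is passed on to multiline_text(), the number of pixels between lines.
align - if the text is passed on to multiline_text(), one of "left", "center" or "right".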

import numpy as np
from PIL import Image, ImageDraw, ImageFont
from skimage import transform as tf

# Build a CAPTCHA image (text: the word to draw; shear: shear value, usually between 0 and 0.5; size: image dimensions)
def create_captcha(text, shear=0, size=(100, 30)):
    im = Image.new("L", size, "black")
    # Create an ImageDraw instance for drawing on the image
    draw = ImageDraw.Draw(im)
    # Font used for the CAPTCHA text; the font file is Coval-Regular.otf
    font = ImageFont.truetype('./data/Coval-Regular.otf', 22)
    # Draw the text with PIL
    draw.text((0, 0), text, fill=1, font=font)
    # Convert the PIL image to a numpy array
    image = np.array(im)
    # Apply the shear transform
    affine_tf = tf.AffineTransform(shear=shear)
    image = tf.warp(image, affine_tf)
    # Normalise the pixel values so that they fall between 0 and 1
    return image / image.max()

image = create_captcha("GENE", shear=0.5)
from matplotlib import pyplot as plt
plt.imshow(image, cmap='Greys')
plt.show()
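
As a rough illustration of what a shear does geometrically, the sketch below shifts each point horizontally in proportion to its vertical position. The simple form x' = x + k*y and the helper name `shear_point` are assumptions made for this example; skimage's AffineTransform takes the shear as an angle, and its exact matrix and sign convention may differ.

```python
def shear_point(x, y, k):
    # Horizontal shear: x moves in proportion to y, y is unchanged
    return x + k * y, y

# Corners of a hypothetical 10x10 square, sheared with k = 0.5
corners = [(0, 0), (10, 0), (10, 10), (0, 10)]
sheared = [shear_point(x, y, 0.5) for x, y in corners]
print(sheared)  # [(0.0, 0), (10.0, 0), (15.0, 10), (5.0, 10)]
```

The square becomes a parallelogram: the two corners at y = 0 stay where they are while the two at y = 10 slide sideways, which is the slanting effect applied to the CAPTCHA text.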

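Splitting the image into single letters

Next we split the word to find its letters. The approach is to write a function that finds contiguous blocks of letter pixels (drawn in black in the plot) and extracts each block as a new sub-image. These sub-images are, or at least should be, the letters we are looking for. The function takes an image and returns a list of sub-images, one per letter of the word.

Image segmentation functions:

(1) label(): finds blocks of pixels that have the same value and are connected to each other. It takes the image array and returns an array of the same shape. In the returned array each connected region is marked with its own value (> 0), and pixels outside these regions are marked 0.

(2) regionprops(): extracts the connected regions. The segmentation gets worse when the font is slanted.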

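from skimage.measure import label, regionprops

def segment_image(image):
    # Label connected regions of non-background pixels
    labeled_image = label(image > 0)
    subimages = []  # extract each sub-image and collect them in a list
    # Note: regions come back in label order, which may not be left to right;
    # sorting by region.bbox[1] would enforce reading order.
    for region in regionprops(labeled_image):
        # bbox is (min_row, min_col, max_row, max_col)
        start_x, start_y, end_x, end_y = region.bbox
        subimages.append(image[start_x:end_x, start_y:end_y])
    # If no region was found, fall back to the whole image
    if len(subimages) == 0:
        return [image, ]
    return subimages

subimages = segment_image(image)
f, axes = plt.subplots(1, len(subimages), figsize=(10, 3))
for i in range(len(subimages)):
    axes[i].imshow(subimages[i], cmap='gray')
plt.show()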
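
To show what connected-region labelling does, here is a minimal pure-Python sketch using a breadth-first flood fill. It is not skimage's implementation; the function names are hypothetical, and it assumes 8-connectivity (diagonal neighbours count), which is my understanding of the default for 2D images in label().

```python
from collections import deque

def label_regions(grid):
    # Label 8-connected regions of truthy cells; background stays 0
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue
            current += 1
            labels[r][c] = current
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] and not labels[nr][nc]):
                            labels[nr][nc] = current
                            queue.append((nr, nc))
    return labels, current

def bounding_boxes(labels, count):
    # (min_row, min_col, max_row, max_col) per label, max values exclusive
    boxes = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab:
                r0, c0, r1, c1 = boxes.get(lab, (r, c, r + 1, c + 1))
                boxes[lab] = (min(r0, r), min(c0, c), max(r1, r + 1), max(c1, c + 1))
    return [boxes[i] for i in range(1, count + 1)]

grid = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
labels, count = label_regions(grid)
boxes = bounding_boxes(labels, count)
print(count, boxes)  # 2 [(0, 0, 2, 2), (0, 4, 3, 5)]
```

Slicing the image with each bounding box gives one sub-image per region. If two sheared letters touch, they end up in a single region, which is one way the segmentation can fail.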

Creating the dataset

from sklearn.utils import check_random_state
random_state = check_random_state(14)
letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
shear_values = np.arange(0, 0.5, 0.05)
# Generate one training sample (a single letter)
def generate_sample(random_state=None):
    random_state = check_random_state(random_state)
    letter = random_state.choice(letters)
    shear = random_state.choice(shear_values)
    return create_captcha(letter, shear=shear, size=(20, 20)), letters.index(letter)

image, target = generate_sample(random_state)
plt.imshow(image, cmap="Greys")
plt.show()
print("The target for this image is: {0}".format(target))

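Creating the full dataset

Calling this function a few thousand times produces enough training data. We then put the data into numpy arrays, because arrays are easier to work with than lists.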

dataset, targets = zip(*(generate_sample(random_state) for i in range(3000)))
dataset = np.array(dataset, dtype='float')
targets = np.array(targets)

Adjusting the dataset

There are 26 classes, and each class (letter) is represented by an integer from 0 to 25.

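from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
# Convert the class indices into a one-hot matrix
y = onehot.fit_transform(targets.reshape(targets.shape[0], 1))
# Sparse matrix -> dense array (toarray() returns a plain ndarray; recent scikit-learn versions reject np.matrix)
y = y.toarray()
# Adjust the training set: resize every letter image to 20x20 pixels
from skimage.transform import resize
# skimage.transform.resize takes (image, output_shape)
dataset = np.array([resize(segment_image(sample)[0], (20, 20)) for sample in dataset])
# Flatten each 20x20 image into a 400-element row so that X is two-dimensional
X = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))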
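
The one-hot step can be sketched by hand as follows. This is an illustration in plain Python, not what OneHotEncoder does internally, and the helper name `one_hot` is hypothetical. One difference worth noting: OneHotEncoder only creates columns for the classes it actually sees in the data, whereas this sketch always creates all 26.

```python
def one_hot(targets, n_classes):
    # One row per sample: all zeros except a 1 at the class index
    rows = []
    for t in targets:
        row = [0.0] * n_classes
        row[t] = 1.0
        rows.append(row)
    return rows

letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
encoded = one_hot([letters.index("A"), letters.index("C")], len(letters))
print(encoded[1][:4])  # [0.0, 0.0, 1.0, 0.0]
```

With this encoding the network has one output per letter, and the output with the highest value is taken as the predicted class.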

Splitting the dataset and training the single-letter model

# Split the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9)
# Train the single-letter model
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100,), random_state=14)
clf.fit(X_train, y_train)

Evaluating the neural network

y_pred = clf.predict(X_test)
from sklearn.metrics import f1_score
score = f1_score(y_pred=y_pred, y_true=y_test, average='macro')
print(score)
# Inspect the per-class results
from sklearn.metrics import classification_report
# print(classification_report(y_pred=y_pred, y_true=y_test))
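
For reference, macro-averaged F1 computes an F1 score per class and then takes the unweighted mean. The sketch below works on plain class indices rather than the one-hot rows used above, and it scores a class with no true or predicted samples as 0; both choices are assumptions made for this illustration.

```python
def macro_f1(y_true, y_pred, n_classes):
    scores = []
    for k in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 here when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score_by_hand = macro_f1(y_true, y_pred, 3)
print(score_by_hand)
```

Here the per-class scores are 0.5, 0.8 and 2/3, so the macro average is about 0.656. Because every class counts equally, a letter that is rarely seen but often misclassified lowers the score as much as a common one.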

3. Predicting Words

Building the prediction function

# Parameters: the CAPTCHA image and the trained neural network
def predict_captcha(captcha_image, neural_network):
    # plt.imshow(captcha_image, cmap='Greys')
    # (1) Segment the word: split the large image into small images of one letter each
    subimages = segment_image(captcha_image)
    # Resize every letter image to 20x20 pixels, as in training
    dataset = np.array([tf.resize(subimage, (20, 20)) for subimage in subimages])
    X_test = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
    # (2) Predict letter by letter and pick the most likely class for each
    y_pred = neural_network.predict_proba(X_test)
    predictions = np.argmax(y_pred, axis=1)
    # (3) Convert the predictions to letters and join them into a word
    predicted_word = ''.join([letters[prediction] for prediction in predictions])
    return predicted_word

# Test helper for the neural network
def test_prediction(word, net, shear = 0.2):
    captcha = create_captcha(word, shear=shear)
    prediction = predict_captcha(captcha, net)
    return word == prediction, word, prediction

print(test_prediction("GENERAL", clf, shear=0))
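
The decoding step (2) and (3) can be looked at in isolation. In the sketch below the probability rows are fabricated by a hypothetical helper, `fake_row`, instead of coming from a trained network; the point is only to show how an argmax per row turns class probabilities into a word.

```python
letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def decode(probabilities):
    # For each row, take the index of the largest probability and map it to a letter
    word = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        word.append(letters[best])
    return "".join(word)

def fake_row(letter, confidence=0.9):
    # Made-up probabilities: most of the mass on one letter, the rest spread evenly
    rest = (1.0 - confidence) / (len(letters) - 1)
    row = [rest] * len(letters)
    row[letters.index(letter)] = confidence
    return row

probs = [fake_row(ch) for ch in "GENE"]
decoded = decode(probs)
print(decoded)  # GENE
```

The decoded word has exactly as many letters as there are rows, so a segmentation that returns too many or too few sub-images yields a word of the wrong length no matter how good the classifier is.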

Loading the test set and running the test

# Load the test set (the corpus must be downloaded once with nltk.download('words'))
from nltk.corpus import words
# Build a word dataset with NLTK, keeping only words of length 4
valid_words = [word.upper() for word in words.words() if len(word) == 4]

# Run the test
num_correct = 0
num_incorrect = 0
for word in valid_words:
    correct, word, prediction = test_prediction(word, clf, shear=0.2)
    if correct:
        num_correct = num_correct + 1
    else:
        num_incorrect = num_incorrect + 1
print("Total words in test set: {0}, correct: {1}, incorrect: {2},\nAccuracy: {3:.1f}".format(num_correct + num_incorrect, num_correct,num_incorrect,num_correct/(num_correct+num_incorrect)))

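Analysis: single-letter accuracy is 0.9883, yet word accuracy is only 0.1. There are three main reasons:

(1) Errors compound. If a single letter is recognised correctly 97% of the time, all four letters are correct only about 88% of the time (0.97 to the fourth power).

(2) The shear value affects accuracy. The larger the shear, the lower the accuracy.

(3) The training letters were chosen at random, but letters are not randomly distributed in real words. Letters that occur frequently yet are often misrecognised also push the error rate up.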
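
The compounding effect in reason (1) is simple arithmetic. The sketch below assumes that errors on the individual letters are independent, which is a simplification, and the function name `word_accuracy` is introduced only for this example.

```python
def word_accuracy(letter_accuracy, n_letters=4):
    # Probability that every letter in the word is right, assuming independent errors
    return letter_accuracy ** n_letters

results = {p: round(word_accuracy(p), 3) for p in (0.97, 0.9883)}
print(results)  # {0.97: 0.885, 0.9883: 0.954}
```

Under this assumption, compounding alone cannot account for a word accuracy as low as 0.1, so the other two reasons must carry most of the weight.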

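Improving accuracy with a dictionary

So far we returned the prediction directly. Before returning it, we can first check whether the dictionary contains the word. If it does, we return the prediction as it is. If it does not, we look for the dictionary word most similar to the prediction and return that as the updated prediction.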

from nltk.metrics import edit_distance
steps = edit_distance("STEP", "STOP")
print("The number of steps needed is : {0}".format(steps))

# Distance function: the number of positions at which the two words differ
def compute_distance(prediction, word):
    return len(prediction) - sum(prediction[i] == word[i] for i in range(len(prediction)))
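
The two distances used here measure different things. The sketch below is a textbook Levenshtein implementation, not NLTK's code, shown next to a position-by-position count with the hypothetical name `positional_distance`, to illustrate where they agree and where they diverge.

```python
def levenshtein(a, b):
    # Classic dynamic programme, keeping only the previous row of the table
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def positional_distance(a, b):
    # Count the positions at which the two strings differ
    return sum(1 for x, y in zip(a, b) if x != y)

print(levenshtein("STEP", "STOP"), positional_distance("STEP", "STOP"))  # 1 1
print(levenshtein("ABCD", "BCDA"), positional_distance("ABCD", "BCDA"))  # 2 4
```

For CAPTCHA errors, where one letter is mistaken for another in place, the positional count is the more natural fit and is cheaper to compute across a whole dictionary.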

# Improved prediction function
from operator import itemgetter
def improved_prediction(word, net, dictionary, shear=0.2):
    captcha = create_captcha(word, shear=shear)
    prediction = predict_captcha(captcha, net)
    # Keep at most four letters, in case segmentation produced extra regions
    prediction = prediction[:4]

    if prediction not in dictionary:
        # Rank every dictionary word by its distance to the prediction and take the closest
        distances = sorted([(candidate, compute_distance(prediction, candidate)) for candidate in dictionary],key=itemgetter(1))
        best_word = distances[0]
        prediction = best_word[0]
    return word == prediction, word, prediction

# Run the test
num_correct = 0
num_incorrect = 0
for word in valid_words:
    shear = 0
    correct, word, prediction = improved_prediction(word, clf, valid_words, shear)
    if correct:
        num_correct += 1
    else:
        num_incorrect += 1
print("Number correct is {0}".format(num_correct))
print("Number incorrect is {0}".format(num_incorrect))
print("Total words in test set: {0}\nAccuracy: {1:.1f}".format(num_correct+num_incorrect, num_correct/(num_correct+num_incorrect)))