Deep Learning Based Automatic CAPTCHA Solver

Computer Vision, Cybersecurity, Deep Learning

Disclaimer: The following work was created as an academic project. The work was not used, nor was it intended to be used, for any harmful or malicious purposes. Enjoy!

Introduction

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are something nearly every internet user has had to deal with. In the process of logging in, making an account, making an online purchase, or even posting a comment, many people are confronted with strange-looking, stretched, blurred, color- and shape-distorted images resembling a Dali painting more than English text.

For a long time, the 10-character blurred-text CAPTCHA was used in deployment because contemporary computer vision methods had difficulty recognizing letters against a non-uniform background, whereas humans have no problem doing so. This remains a significant issue in the field of character recognition to this day.

Therefore, in order to develop a program capable of reading the characters in this complex optical character recognition task, we will develop a custom solution specific to the problem. In this article, we will go over the start-to-finish pipeline for developing a CAPTCHA solver for a specific class of CAPTCHAs.

The Data

10,000 CAPTCHA images were provided from a kaggle dataset (scraped by Aadhav Vignesh, at https://www.kaggle.com/aadhavvignesh/captcha-images). Each image has the name NAME.jpg, where NAME is the solution to the puzzle. In other words, NAME contains the correct letters in the correct sequence for the image, such as 5OfaXDfpue.

Getting a computer program to read the characters inside the image presents a significant challenge. Open source OCR software, such as PyTesseract, failed when tested on the CAPTCHAs, often not even picking up a single character from the whole image. Training a neural network with the raw images as input and solutions as output might succeed in a dataset of tens of millions, but is not very likely to succeed with a dataset of 10,000 images. A new method has to be developed.

The issue could be greatly simplified by analyzing one character at a time, rather than the entire image at once. If a dataset of alphanumerics and their corresponding labels could be obtained, a simple MNIST-like neural network could be trained to recognize characters. Therefore, our pipeline looks like this:

  1. Invert the CAPTCHA, segment the characters, and save each alphanumeric character separately to disk (with its label).
  2. Train a neural network to recognize the characters of each class.
  3. To use our model, feed in any CAPTCHA, invert the image, and segment the characters. Apply the machine learning model on each character, get the predictions, and concatenate them into a string. Voila.

Preprocessing

For those of you following my code on Kaggle (https://github.com/sergeiissaev/kaggle_notebooks), the following section refers to the notebook titled captchas_eda.ipynb. We begin by loading a random image from the dataset and then plotting it.

Figure: A randomly selected CAPTCHA from our dataset.

Next, using OpenCV (a very useful computer vision library available in Python and C++) the image is converted to grayscale, and then a binary threshold is applied. This helps transform the complex, multicolored image into a much simpler black and white image. This is a very crucial step — contemporary OCR technology often fails when the text is against a nonuniform background. This step removes that problem. However, we aren’t ready to OCR just yet.

Figure: Thresholded and color-inverted CAPTCHA image.

The next step is to locate the contours of the image. Rather than dealing with the CAPTCHA as a whole, we wish to split the CAPTCHA up into 10 separate images, each containing one alphanumeric. Therefore, we use OpenCV’s findContours function to locate the contours of the letters.

Figure: Image with contours selected. Notice the small green dots around the top and bottom edges of the image, which get appended to our list of contours.

Unfortunately, most images end up with over 20 contours, even though we know that each CAPTCHA contains precisely 10 alphanumerics. So what are the remaining contours that get picked up?

The answer is that they are mostly pieces of the checkered background. According to the docs, “contours can be explained simply as a curve joining all the continuous points (along the boundary), having the same color or intensity”. Some spots in the background match this definition. In order to parse the contours and only retain the characters we want, we are going to have to get creative.

Since the background specks are much smaller than the alphanumerics, I determined the 10 largest contours (again, each CAPTCHA has precisely 10 alphanumerics in the image), and discarded the rest.

Figure: Individually selected alphanumerics. The largest 10 contours were selected so as to remove small background contours. Notice the incorrect order of the alphanumerics.

However, there is a new problem. I have 10 contours sorted in order of size. Of course, when solving CAPTCHAs, order matters. I needed to organize the contours from left to right. Therefore, I built a dictionary mapping each contour number to the bottom-right corner of its bounding box, sorted it by horizontal position, and plotted the contours from left to right in a single plot.

Figure: The alphanumerics, each one its own image, and in the correct order. The correct order lets us display the label (true class) above each image, as well as the percentage of the image taken up by white pixels.

However, I noticed that many of the outputs were nonsensical: the images did not correspond to the label. Even worse, one image failing to correspond to its label often disrupted the order of all the downstream letters, meaning that not just one but many of the displayed letters had incorrect labels. This could prove disastrous for training the neural network, since incorrectly labeled data makes learning much more error-prone. Imagine showing a 4-year-old child letters but occasionally reciting the wrong sound for a letter; this would pose a significant challenge to the child.

I implemented two error checks. The first was for contours that are complete nonsense, which I noticed usually had very different white-to-black pixel ratios than ordinary letters. Here is an example:

Figure: The third contour in the top row is an example of an image that would be skipped by my first error checking algorithm.

The third contour in the top row is actually the inner circle of the previous “O”. Notice how all the downstream letters are now incorrectly labeled as well. We cannot afford to have these incorrectly labeled images appended to our training data. Therefore, I wrote a rule saying that any alphanumeric with a white to black ratio above 63% or below 29% would be skipped.
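This first error check reduces to a white-pixel-ratio test with the 29% and 63% bounds given above (the function name is my own):

```python
import numpy as np

WHITE_MIN, WHITE_MAX = 0.29, 0.63  # bounds from the rule described above

def plausible_character(binary_crop: np.ndarray) -> bool:
    """Reject crops whose white-pixel ratio falls outside the 29%-63% band."""
    white_ratio = (binary_crop == 255).mean()
    return WHITE_MIN <= white_ratio <= WHITE_MAX

letter = np.zeros((10, 10), dtype=np.uint8)
letter[2:8, 2:8] = 255           # 36% white: kept as a plausible letter
ring_fragment = np.zeros((10, 10), dtype=np.uint8)
ring_fragment[0, 0] = 255        # 1% white: skipped, like the inner "O" circle
```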

Here is an example of my second error check:

Figure: The rightmost image on the top row, as well as the middle image on the bottom row, exemplifies error type #2.

Sometimes, separate alphanumerics are touching each other slightly, and get picked up as one alphanumeric. This is a big problem because, just like with error check #1, all the downstream alphanumerics are affected. To solve this issue, I checked whether each image was significantly wider than it was tall. Most alphanumerics are taller than they are wide, or at least square; very few are much wider than they are tall unless they contain an error of type #2, meaning they contain more than one alphanumeric.
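The second error check is an aspect-ratio test. The 1.2 cutoff below is purely illustrative, since, as described later, no single threshold turned out to work for every case:

```python
def looks_merged(width: int, height: int, ratio: float = 1.2) -> bool:
    """Flag crops significantly wider than tall as likely merged characters.

    The 1.2 cutoff is an assumption for this sketch; the article tuned
    (and ultimately gave up on) this hyperparameter.
    """
    return width > ratio * height
```

For example, a 60x30 crop is flagged as a probable two-character merge, while a 25x35 crop passes.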

Figure: This is the same CAPTCHA as above, except this time error checking was used.

It seems the error checking works great! Even very difficult CAPTCHA images are being successfully labeled. However, now a new issue arose:

Figure: An example of a false positive picked up by error check #2.

Here, the “w” in the top row matched the error check #2 criteria — it was very wide, and so the program split the image. I spent a long time trying out different hyperparameter settings wherein wide “w” and “m” images wouldn’t get split, but real cases of two alphanumerics in one image would. However, I found this to be an impossible task. Some “w” letters are very wide, and some image contours containing two alphanumerics can be very narrow.

Figure: A very narrow error case #2 can be seen in the middle image of the bottom row.

There was no secret formula to get what I wanted done, so I (shamefully) had to do something I'm not really proud of — take care of things manually.

Obtaining the Training Data

For those following on Kaggle, this is now the second file, named captchas_save.ipynb. I transferred the data from Kaggle to my local computer and set up a loop to run all the preprocessing steps from start to finish. At the end of each loop, the evaluator was prompted to perform a safety inspection of the data and labels ("y" to keep it, "n" to discard it). This isn't ordinarily the way the author prefers to do things, but after spending many hours trying to automate the process, the efficiency scales started to tip toward manual inspection.

I got through evaluating 1,000 CAPTCHAs (thereby creating a dataset of roughly 10,000 alphanumerics) in the span of watching The Social Network, switching off with my friend, whom I paid a salary of one beer for his help.

Figure: Example of the user being prompted to confirm the dataset.

All the files that received a passing grade were appended to a list. The preprocessing pipeline looped through all the files in the list and saved all the alphanumerics into their respective folders, and thus the training set was created. Here is the resulting directory organization, which I uploaded and published as a dataset on kaggle at https://www.kaggle.com/sergei416/captchas-segmented.

Figure: Each alphanumeric was suffixed with "upper" or "lower". All numbers got "lower" as a suffix.

Ultimately, I was left with 62 directories, and a total of 5,854 images in my training dataset (meaning 10,000 − 5,854 = 4,146 alphanumerics were discarded).

Finally, let us begin the machine learning section of the pipeline! Let's hop back onto Kaggle, where we will train a character recognition network on the segmented alphanumerics from our training set. Again, existing open-source OCR libraries fail on the images we have obtained so far.

Training the Model

To build the machine learning system, we use a vanilla PyTorch neural network. The complete code is available on Kaggle, at https://www.kaggle.com/sergei416/captcha-ml. For brevity, I will only discuss the interesting/challenging sections of the machine learning code in this article.

Thanks to PyTorch’s ImageFolder function, the data was easily loaded into the author’s notebook. There are 62 possible classes, meaning a baseline accuracy for predicting completely random labels is 1/62, or 1.61% accuracy. The author set a test and validation size of 200 images. All the images were reshaped to 128 x 128 pixels, and channels normalization was subsequently applied. Specifically for the training data, RandomCrop, ColorJitter, RandomRotation, and RandomHorizontalFlip were randomly applied to augment the training dataset. Here is a visualization of a batch of training data:

Figure: One batch of training data.

Transfer learning from wide_resnet101 was used, and training was done using Kaggle’s free GPU. A hyperparameter search was used to identify the best hyperparameters, which resulted in a 99.61% validation accuracy. The model was saved, and the results were visualized:

Figure: Loss versus number of epochs.
Figure: Accuracy versus number of epochs.
Figure: Visualization of the learning rate scheduler.

The final results demonstrate a 99.6% validation accuracy and a 98.8% test accuracy. This honestly exceeded my expectations, as many of the binary, pixelated images would be hard for even a human to decipher correctly. A quick look at the batch of training images above will show at least several questionable images.

Evaluation of our Model

The final step of this pipeline was to test the accuracy of our ready-built machine learning model. We previously exported the model weights for our classifier, and now the model can be applied to any new, unseen CAPTCHA (of the same type as the CAPTCHAs in our original dataset). For the sake of completeness, the author set up code to evaluate the model's accuracy.

If even a single character is incorrect when solving a CAPTCHA, the entire CAPTCHA is failed. In other words, all 10 predicted alphanumerics must exactly match the solution for the CAPTCHA to be considered solved. A simple loop comparing the prediction to the solution was set up for 500 images randomly taken from the dataset, checking whether each prediction was equivalent to its solution. The code for this is in the fourth (and final) notebook, which can be found at https://www.kaggle.com/sergei416/captchas-test/.
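That evaluation loop reduces to comparing the concatenated predictions against the filename stem. In the sketch below, solve_captcha is a hypothetical stub standing in for the full segment-and-classify pipeline:

```python
from pathlib import Path

def solve_captcha(image_path: Path) -> str:
    """Hypothetical stand-in for the real pipeline: segment the image,
    classify each character, and concatenate the predictions."""
    return "5OfaXDfpue"

def captcha_accuracy(paths) -> float:
    """A CAPTCHA counts as solved only if all 10 characters match its name."""
    paths = list(paths)
    solved = sum(solve_captcha(p) == p.stem for p in paths)
    return solved / len(paths)

# One CAPTCHA the stub "solves" and one it does not.
paths = [Path("5OfaXDfpue.jpg"), Path("aaaaaaaaaa.jpg")]
acc = captcha_accuracy(paths)
```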

This equivalency held for 125/500 CAPTCHA images tested, or 25% of the test set*. While this number does seem low, it is important to keep in mind that, unlike many other machine learning tasks, if one attempt fails, another is immediately available. Our program only needs to succeed once but can fail indefinitely. Since our program has a 1 − 0.25 = 0.75 probability of failure on any given try, then given n attempts, we will pass the CAPTCHA as long as we do not fail all n attempts. So what is the likelihood of passing within n attempts? It is

P(solved within n attempts) = 1 − (1 − 0.25)^n = 1 − 0.75^n

Therefore, the likelihood of n attempts is as follows:

n = 1: 25.0%, n = 3: 57.8%, n = 5: 76.3%, n = 10: 94.4% (computed as 1 − 0.75^n)

*EDIT: The accuracy on the kaggle link is 0.3, or 30% for any single CAPTCHA (97.1% with n=10).
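Both the 25% and the updated 30% figures plug into the same geometric-failure formula; a quick check:

```python
def pass_probability(per_try: float, n: int) -> float:
    """Probability of solving at least once in n independent attempts."""
    return 1 - (1 - per_try) ** n

p25 = pass_probability(0.25, 10)  # per-try accuracy measured in this article
p30 = pass_probability(0.30, 10)  # per-try accuracy reported on the kaggle link
```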

Conclusion

Overall, the attempt to build a machine learning model capable of solving 10-character CAPTCHAs was a success. The final model can solve the puzzles with an accuracy of 30%, meaning there is a 97.1% probability a CAPTCHA image will be solved within the first 10 attempts.

Thank you to all who made it to the end of this tutorial! What this project showed is that the 10-character CAPTCHA is not suitable for differentiating human from non-human users, and that other classes of CAPTCHA should be used in production.

Links:

Linkedin: https://www.linkedin.com/in/sergei-issaev/

Twitter: https://twitter.com/realSergAI

Github: https://github.com/sergeiissaev

Kaggle: https://www.kaggle.com/sergei416

Jovian: https://jovian.ml/sergeiissaev/

Medium: https://medium.com/@sergei740

Translated from: https://medium.com/towards-artificial-intelligence/deep-learning-based-automatic-captcha-solver-12cd49191c58
