Solving Tree Paths with Depth-First Search
Computer Vision, Cybersecurity, Deep Learning
Disclaimer: The following work was created as an academic project. The work was not used, nor was it intended to be used, for any harmful or malicious purposes. Enjoy!
Introduction
CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are something nearly every internet user has had to deal with. While logging in, creating an account, making an online purchase, or even posting a comment, many people are confronted with strange-looking images, stretched, blurred, and distorted in color and shape, resembling a Dalí painting more than English text.
For a long time, the 10-character blurred-text CAPTCHA was used in deployment because contemporary computer vision methods had difficulty recognizing letters against a non-uniform background, whereas humans have no problem doing so. This remains a significant issue in the field of character recognition to this day.
Therefore, to develop a program capable of reading the characters in this complex optical character recognition task, we will build a custom solution specific to the problem. In this article, we will go through the pipeline for developing a CAPTCHA solver for a specific class of CAPTCHAs, from start to finish.
The Data
10,000 CAPTCHA images were provided by a Kaggle dataset (scraped by Aadhav Vignesh, at https://www.kaggle.com/aadhavvignesh/captcha-images). Each image has the name NAME.jpg, where NAME is the solution to the puzzle. In other words, NAME contains the correct letters of the image in the correct sequence, such as 5OfaXDfpue.
Getting a computer program to read the characters inside the image presents a significant challenge. Open-source OCR software such as PyTesseract failed when tested on the CAPTCHAs, often not picking up even a single character from the whole image. Training a neural network with the raw images as input and the solutions as output might succeed with a dataset of tens of millions of images, but is unlikely to succeed with a dataset of 10,000. A new method has to be developed.
The issue could be greatly simplified by analyzing one character at a time, rather than the entire image at once. If a dataset of alphanumerics and their corresponding labels could be obtained, a simple MNIST-like neural network could be trained to recognize characters. Therefore, our pipeline looks like this:
- Invert the CAPTCHA, segment the characters, and save each alphanumeric character separately to disk (with its label).
- Train a neural network to recognize the characters of each class.
- To use our model, feed in any CAPTCHA, invert the image, and segment the characters. Apply the machine learning model to each character, get the predictions, and concatenate them into a string. Voila.
Preprocessing
For those of you following my code on Kaggle (https://github.com/sergeiissaev/kaggle_notebooks), the following section refers to the notebook titled captchas_eda.ipynb. We begin by loading a random image from the dataset and plotting it.
Next, using OpenCV (a very useful computer vision library available in Python and C++), the image is converted to grayscale, and then a binary threshold is applied. This transforms the complex, multicolored image into a much simpler black-and-white image. This is a crucial step: contemporary OCR technology often fails when text sits against a nonuniform background, and this step removes that problem. However, we aren't ready to OCR just yet.
The next step is to locate the contours of the image. Rather than dealing with the CAPTCHA as a whole, we wish to split the CAPTCHA up into 10 separate images, each containing one alphanumeric. Therefore, we use OpenCV’s findContours function to locate the contours of the letters.
Unfortunately, most images end up with over 20 contours, despite us knowing that each CAPTCHA contains precisely 10 alphanumerics. So what are the remaining contours that get picked up?
The answer is that they are mostly pieces of the checkered background. According to the docs, “contours can be explained simply as a curve joining all the continuous points (along the boundary), having the same color or intensity”. Some spots in the background match this definition. In order to parse the contours and only retain the characters we want, we are going to have to get creative.
Since the background specks are much smaller than the alphanumerics, I determined the 10 largest contours (again, each CAPTCHA has precisely 10 alphanumerics in the image), and discarded the rest.
However, there is a new problem. I have 10 contours sorted by size, but when solving CAPTCHAs, order matters: I needed to organize the contours from left to right. Therefore, I sorted a dictionary mapping each contour number to the bottom-right corner of that contour, which allowed me to plot the contours from left to right in a single figure.
However, I noticed that many of the outputs were nonsensical: the images did not correspond to the label. Even worse, one image failing to correspond to its label often disrupted the order of all the downstream letters, meaning that not just one but many of the displayed letters had incorrect labels. This could prove disastrous for training the neural network, since incorrectly labeled data makes learning much more error-prone. Imagine showing a 4-year-old child letters but occasionally reciting the wrong sound for a letter; this would pose a significant challenge to the child.
I implemented two error checks. The first was for contours that were complete nonsense, which I noticed usually had very different white-to-black pixel ratios than ordinary letters. Here is an example:
The third contour in the top row is actually the inner circle of the previous “O”. Notice how all the downstream letters are now incorrectly labeled as well. We cannot afford to have these incorrectly labeled images appended to our training data. Therefore, I wrote a rule saying that any alphanumeric with a white to black ratio above 63% or below 29% would be skipped.
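That filter can be sketched as below; the 63% and 29% cutoffs come from the text, and the crop is assumed to be a 0/255 binarized image:

```python
import numpy as np

def white_ratio(char_img: np.ndarray) -> float:
    """Fraction of white (255) pixels in a binarized character crop."""
    return float((char_img == 255).mean())

def passes_ratio_check(char_img: np.ndarray,
                       lo: float = 0.29, hi: float = 0.63) -> bool:
    """Reject crops whose white-to-black balance looks nothing like a letter."""
    return lo <= white_ratio(char_img) <= hi
```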
Here is an example of my second error check:
Sometimes, separate alphanumerics touch each other slightly and get picked up as one alphanumeric. This is a big problem because, just like with error check #1, all the downstream alphanumerics are affected. To solve this issue, I checked whether each image was significantly wider than it was tall. Most alphanumerics are taller than they are wide, or at least square; very few are much wider than tall unless they contain an error of type #2, meaning they contain more than one alphanumeric.
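A sketch of that second filter; the cutoff aspect ratio here is an assumption, since (as the article notes below) no single value cleanly separated all cases:

```python
import numpy as np

def passes_width_check(char_img: np.ndarray, max_aspect: float = 1.2) -> bool:
    """Flag crops that are much wider than tall (likely two merged characters)."""
    h, w = char_img.shape[:2]
    return w / h <= max_aspect
```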
It seems the error checking works great! Even very difficult CAPTCHA images are being successfully labeled. However, now a new issue arose:
Here, the "w" in the top row matched the error check #2 criteria: it was very wide, and so the program split the image. I spent a long time trying out different hyperparameter settings under which wide "w" and "m" images wouldn't get split but genuine cases of two alphanumerics in one image would. However, I found this to be an impossible task. Some "w" letters are very wide, and some image contours containing two alphanumerics can be very narrow.
There was no secret formula for getting what I wanted done, so I (shamefully) had to do something I'm not really proud of: take care of things manually.
Obtaining the Training Data
For those following along on Kaggle, this is the second file, named captchas_save.ipynb. I transferred the data from Kaggle to my local computer and set up a loop to run all the preprocessing steps from start to finish. At the end of each loop, the evaluator was prompted to perform a sanity check on the data and labels ("y" to keep, "n" to discard). This isn't ordinarily how the author prefers to do things, but after spending many hours trying to automate the process, the efficiency scales tipped away from automation and toward manual work.
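A minimal sketch of that review loop; the `ask` parameter is a hypothetical hook (not in the notebook) so the loop can be driven by `input` interactively or by a plain function in tests:

```python
def review(samples, ask=input):
    """Keep only the (image, label) pairs the human evaluator approves."""
    kept = []
    for image, label in samples:
        # in the notebook, the segmented characters are displayed here first
        answer = ask(f"Keep CAPTCHA labeled '{label}'? (y/n): ").strip().lower()
        if answer == "y":
            kept.append((image, label))
    return kept
```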
I got through evaluating 1,000 CAPTCHAs (thereby creating a dataset of roughly 10,000 alphanumerics) over the course of watching The Social Network, switching off with a friend whom I paid a salary of one beer for his help.
All the files that received a passing grade were appended to a list. The preprocessing pipeline looped through all the files in the list and saved all the alphanumerics into their respective folders, and thus the training set was created. Here is the resulting directory organization, which I uploaded and published as a dataset on Kaggle at https://www.kaggle.com/sergei416/captchas-segmented.
Ultimately, I was left with 62 directories and a total of 5,854 images in my training dataset (meaning 10,000 − 5,854 = 4,146 alphanumerics were discarded).
Finally, let us begin the machine learning section of the pipeline! Let's hop back onto Kaggle, where we will train a character recognition model on the segmented alphanumerics we obtained in the training set. Again, existing open-source OCR libraries fail on the images we have obtained so far.
Training the Model
To build the machine learning system, we use a vanilla PyTorch neural network. The complete code is available on Kaggle, at https://www.kaggle.com/sergei416/captcha-ml. For brevity, I will only discuss the interesting/challenging sections of the machine learning code in this article.
Thanks to torchvision's ImageFolder class, the data was easily loaded into the author's notebook. There are 62 possible classes, meaning the baseline accuracy for predicting completely random labels is 1/62, or 1.61%. The author set a test and validation size of 200 images. All the images were reshaped to 128 × 128 pixels, and channel normalization was subsequently applied. For the training data specifically, RandomCrop, ColorJitter, RandomRotation, and RandomHorizontalFlip were randomly applied to augment the dataset. Here is a visualization of a batch of training data:
Transfer learning from wide_resnet101 was used, and training was done on Kaggle's free GPU. A hyperparameter search identified the best hyperparameters, resulting in a 99.61% validation accuracy. The model was saved, and the results were visualized:
The final results demonstrate a 99.6% validation accuracy and a 98.8% test accuracy. This honestly exceeded my expectations, as many of the binary, pixelated images would be hard for even a human to decipher correctly. A quick look at the batch of training images above will show at least several questionable images.
Evaluation of Our Model
The final step of this pipeline was to test the accuracy of our finished machine learning model. We previously exported the model weights for our classifier, and now the model can be applied to any new, unseen CAPTCHA (of the same type as those in our original dataset). For the sake of completeness, the author set up code to evaluate the model's accuracy.
If even a single character is incorrect when solving a CAPTCHA, the entire CAPTCHA fails. In other words, all 10 predicted alphanumerics must exactly match the solution for the CAPTCHA to be considered solved. A simple loop comparing the predictions to the solutions was set up for 500 images randomly taken from the dataset. The code for this is in the fourth (and final) notebook, which can be found at https://www.kaggle.com/sergei416/captchas-test/.
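That evaluation loop can be sketched generically; `predict` here stands in for the full segment-then-classify pipeline, which is hypothetical in this snippet:

```python
def captcha_accuracy(predict, samples) -> float:
    """Fraction of CAPTCHAs solved exactly.

    predict: callable mapping an image to a predicted character string.
    samples: iterable of (image, solution) pairs.
    """
    samples = list(samples)
    # a CAPTCHA counts as solved only if every character matches
    solved = sum(1 for image, solution in samples if predict(image) == solution)
    return solved / len(samples)
```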
This equivalency held for 125 of the 500 CAPTCHA images tested, or 25% of the test set*. While this number does seem low, it is important to keep in mind that unlike many other machine learning tasks, if one attempt fails, another is immediately available. Our program only needs to succeed once but can fail indefinitely. Since our program has a 1 − 0.25 = 0.75 probability of failure on any given try, we will pass the CAPTCHA within n attempts as long as we do not fail all n of them. So the likelihood of passing within n attempts is 1 − 0.75^n.
For example, within n = 10 attempts, the probability of success is 1 − 0.75^10 ≈ 94.4%.
*EDIT: The accuracy on the kaggle link is 0.3, or 30% for any single CAPTCHA (97.1% with n=10).
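The arithmetic behind these figures is easy to check directly, using the 30% per-attempt accuracy from the edit above:

```python
def pass_probability(p_single: float, n: int) -> float:
    """Probability of solving at least one of n independent CAPTCHA attempts."""
    return 1.0 - (1.0 - p_single) ** n

# With a 30% per-attempt accuracy, ten attempts almost always suffice
print(f"{pass_probability(0.30, 10):.3f}")
```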
Conclusion
Overall, the attempt to build a machine learning model capable of solving 10-character CAPTCHAs was a success. The final model can solve the puzzles with an accuracy of 30%, meaning there is a 97.1% probability a CAPTCHA image will be solved within the first 10 attempts.
Thank you to all who made it to the end of this tutorial! What this project showed is that the 10-character CAPTCHA is not suitable for differentiating human from non-human users, and that other classes of CAPTCHA should be used in production.
Links:
Linkedin: https://www.linkedin.com/in/sergei-issaev/
Twitter: https://twitter.com/realSergAI
Github: https://github.com/sergeiissaev
Kaggle: https://www.kaggle.com/sergei416
Jovian: https://jovian.ml/sergeiissaev/
Medium: https://medium.com/@sergei740