Notes: a general approach to solving text-based captchas
Generic Solving of Text-based CAPTCHAs
- Introduction
- Algorithm
- Optimization
- Evaluation
Introduction
captcha
Many websites use CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) to block automated interaction with their sites. Captchas are sometimes called reverse Turing tests because they are intended to let a computer determine whether a remote client is a human or a machine.
Negative kerning:
The captcha uses negative spacing between characters to resist segmentation, so that each character is partly occluded by its neighbors.
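To make the effect of negative kerning concrete, the following minimal sketch computes the horizontal span of each character for a given spacing and reports how much neighboring characters overlap. The function names, the fixed 20-pixel character width, and the left-to-right layout rule are illustrative assumptions, not details from the paper.

```python
def layout(char_widths, spacing):
    """Return (left, right) pixel spans for characters placed left to right.

    spacing is the gap in pixels between neighbors; a negative value
    (negative kerning) makes consecutive spans overlap.
    """
    spans = []
    x = 0
    for width in char_widths:
        spans.append((x, x + width))
        x += width + spacing
    return spans


def overlaps(spans):
    """Overlap in pixels between each pair of neighboring spans (0 if none)."""
    return [max(0, left[1] - right[0]) for left, right in zip(spans, spans[1:])]


widths = [20] * 6                      # six characters, 20 px wide (assumed)
print(overlaps(layout(widths, 0)))     # [0, 0, 0, 0, 0]
print(overlaps(layout(widths, -7)))    # [7, 7, 7, 7, 7]
```

With zero spacing a vertical cut between characters is always clean. With negative spacing every cut passes through pixels that belong to two characters, which is what makes segmentation hard.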
The segment-then-recognize approach
The figure shows how a captcha goes from segmentation to recognition.
Improvement
The segment-then-recognize approach is limited by the attacker's ability to find new segmentation flaws in each captcha scheme.
The work in this paper overcomes this limitation by segmenting and recognizing the captcha simultaneously, which removes the need for manually discovered segmentation heuristics.
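One way to picture joint segmentation and recognition is a search over candidate cut points in which a recognizer scores every possible slice, so the cuts are chosen by how well the resulting slices are recognized. The sketch below uses plain dynamic programming for that search. It is a generic illustration and not the paper's algorithm: `solve`, `score`, `max_span`, and the toy score table are all hypothetical names and values.

```python
import math


def solve(n_cuts, score, max_span=3):
    """Choose cuts and labels together.

    n_cuts: number of candidate cut points, indexed 0..n_cuts-1 left to right.
    score(i, j): hypothetical recognizer returning (char, prob) for the
        image slice between cut i and cut j.
    max_span: largest number of consecutive gaps one character may cover.
    Returns (text, log_prob) for the best path from the first to the last cut.
    """
    best = {0: (0.0, "")}              # cut index -> (log prob, decoded text)
    for j in range(1, n_cuts):
        for i in range(max(0, j - max_span), j):
            if i not in best:
                continue
            char, prob = score(i, j)
            if prob <= 0:
                continue               # slice is not a plausible character
            candidate = (best[i][0] + math.log(prob), best[i][1] + char)
            if j not in best or candidate[0] > best[j][0]:
                best[j] = candidate
    if n_cuts - 1 not in best:
        raise ValueError("no segmentation reaches the last cut")
    log_prob, text = best[n_cuts - 1]
    return text, log_prob


# Toy recognizer: a lookup table standing in for a trained classifier.
TOY = {
    (0, 2): ("a", 0.9), (2, 3): ("b", 0.8),
    (0, 1): ("c", 0.3), (1, 3): ("d", 0.4), (1, 2): ("l", 0.2),
}


def toy_score(i, j):
    return TOY.get((i, j), ("?", 0.0))


text, log_prob = solve(4, toy_score)
print(text)                            # ab
```

No cut is fixed in advance here: the cut at index 2 is kept only because the slices on both sides of it are recognized with high confidence. Summing log probabilities favors paths with fewer slices, so a real scorer would need some form of normalization.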
dataset
The datasets are shown above.
The reCAPTCHA dataset also took on part of the workload of digitizing old printed books.
Algorithm
Overview of the algorithm's four components
How the algorithm works
Example of the algorithm successfully applied to a Yahoo captcha
Reinforcement learning
Occluding lines
Optimization
Reducing the number of cuts
Algorithm recognition rate (recognition), average solve time (time), and average number of segments (segments) with and without the cut point reduction heuristic
Sequential recognition
The best approach when making a local decision is to consider a window of two letters at a time.
The sequential and right-left recognition processes.
The closer a character is to the center of the captcha, the less accurate the algorithm is. We also observed that the sequential recognition approach is less accurate on the right side of the captcha than on the left side.
Recognition rate per letter for each learning approach on the Baidu 2011 scheme
The idea is to perform the sequential recognition from both directions and then combine the two recognition scores to improve the overall accuracy.
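A minimal sketch of combining two recognition passes is shown here. It averages the per-position scores from a left-to-right pass and a right-to-left pass and keeps the best label at each position. The averaging rule, the `combine` function, and the example scores are assumptions for illustration. These notes do not record how the paper combines the two scores.

```python
def combine(left_to_right, right_to_left):
    """Average per-position scores from two passes and pick the best label.

    Each argument is a list with one dict per character position, mapping a
    candidate character to the score that pass assigned to it.
    """
    if len(left_to_right) != len(right_to_left):
        raise ValueError("both passes must cover the same positions")
    result = []
    for fwd, bwd in zip(left_to_right, right_to_left):
        chars = set(fwd) | set(bwd)
        avg = {c: (fwd.get(c, 0.0) + bwd.get(c, 0.0)) / 2 for c in chars}
        # sorted() makes ties resolve deterministically
        result.append(max(sorted(avg), key=avg.get))
    return "".join(result)


# Made-up scores: the left-to-right pass is unsure about the last letter,
# the right-to-left pass is unsure about the first one.
ltr = [{"a": 0.9, "o": 0.1}, {"b": 0.6, "h": 0.4}, {"c": 0.4, "e": 0.6}]
rtl = [{"a": 0.5, "o": 0.5}, {"b": 0.7, "h": 0.3}, {"c": 0.9, "e": 0.1}]
print(combine(ltr, rtl))               # abc
```

In this toy example the left-to-right pass alone would read "abe". The right-to-left pass is more confident about the rightmost letter, and the average recovers "abc".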
Recognition rate per letter for the various approaches for the Yahoo! scheme
Recognition rates for real-world schemes.
Each captcha scheme was evaluated on a test set of approximately 1000 captchas that were not used during training.
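The two metrics that appear in these figures, the recognition rate for whole captchas and the recognition rate per letter, can be computed from a held-out test set as in the sketch below. The function name and the rule that a missing predicted letter counts as an error are assumptions. The sketch also assumes every answer in a scheme has the same length.

```python
def recognition_rates(predictions, answers):
    """Return (whole-captcha rate, list of per-position letter rates).

    predictions and answers are equal-length lists of strings; all answers
    are assumed to have the same number of characters.
    """
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need one prediction per answer and a non-empty set")
    total = len(answers)
    solved = sum(p == a for p, a in zip(predictions, answers))
    per_letter = []
    for k in range(len(answers[0])):
        # a prediction that is too short counts as wrong at position k
        hits = sum(len(p) > k and p[k] == a[k]
                   for p, a in zip(predictions, answers))
        per_letter.append(hits / total)
    return solved / total, per_letter


answers = ["abcd", "wxyz"]
predictions = ["abcd", "wxqz"]
print(recognition_rates(predictions, answers))   # (0.5, [1.0, 1.0, 0.5, 1.0])
```

A captcha counts as solved only when every letter is right, so the whole-captcha rate is never higher than the lowest per-letter rate.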
Experiment
Evaluation
Learnability
Recognition rate as a function of the number of examples in each class.
It does not take very many examples to achieve a sufficient accuracy rate.
Human accuracy
The spacing between letters ranged from 0 pixels to -7 pixels. The captchas were 6 characters long.
Human and algorithm accuracy vs. spacing between letters in pixels