Keywords: python captcha
Most people don’t know this, but my honours thesis was about using a computer program to read text out of web images. My theory was that if you could get a high rate of successful extraction, you could use the text as another source of data to improve search engine results. I was even quite successful in doing it, but never really followed my experiments up. My honours advisor, Dr Junbin Gao (http://csusap.csu.edu.au/~jbgao/), had suggested that after writing my thesis I should write some form of article on what I had learnt. Well, I finally got around to doing it. While what follows is not exactly what I was studying, it is something I wish had existed when I started looking around.
I am posting this article solely for information and educational purposes. I do not condone the use or act of ethical or unethical "hacking" nor circumvention of copy protection etc... The use of this code for any unethical or illegal purposes is not condoned by myself nor any other person(s) mentioned in this article.
This work is licensed under a Creative Commons Licence.
As I mentioned, what I attempted to do was take standard images on the web and extract the text out of them as a way of improving search results. Interestingly, I based most of my research and ideas on methods of cracking CAPTCHAs. A CAPTCHA, as you may well know, is one of those annoying "Type in the letters you see in the image above" things you see on many website signup pages or comment sections.
A CAPTCHA image is designed so that a human can read it without difficulty while a computer cannot. In practice this has never really worked: pretty much every CAPTCHA published on the web gets cracked within a matter of months. Knowing this, my theory was that since people can get a computer to read something it shouldn’t be able to, normal images such as website logos should be much easier to read using the same methods.
I was actually surprisingly successful in my goal, with recognition rates of over 60% for most of the images in my sample set. That is rather high considering the variety of images on the web.
What I did find while doing my research, however, was a lack of sample code or applications showing how to crack CAPTCHAs. While there are some excellent tutorials and many published papers on the subject, they are very light on algorithms and sample code. In fact, I didn't find any beyond some non-working PHP scripts and some Perl fragments which strung together a few unrelated programs and gave reasonable results when presented with very simple CAPTCHAs. None of them helped me very much. What I needed was some detailed code with examples I could run and tweak to see how it worked. I think I am just one of those people who can read the theory and follow along, but without something to prod and poke I never really understand it. Most of the papers and articles said they would not publish code due to the potential for misuse. Personally I think withholding it is a waste of time, since in reality building a CAPTCHA breaker is quite easy once you know how.
So because of the lack of examples, and the problems I had getting started, I thought I would put together this article with full, detailed explanations and working code showing how to go about breaking a CAPTCHA.
Let’s get started.
Here is a list in order of things I am going to discuss.
- Technology Used
- CAPTCHAs, what are they anyway
- How to identify text in images/How to extract text in images
- Image Recognition using AI. Neural Networks, Vector Space.
- Building a training set
- Putting it all together
- Results and conclusion
Technology used
All of the sample code is written in Python 2.5 using the Python Imaging Library (PIL). It will probably work in Python 2.6, but 2.5 is what I had installed. To get started, just install Python and then install the Python Imaging Library.
Python: http://www.python.org/
Python Imaging Library: http://www.pythonware.com/products/pil/
Install them in the above order and you should be ready to run the examples.
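If you want to confirm the install worked before going further, a quick check along these lines may help. This is only a sketch: the function name pil_available is made up for illustration, and the second import form (from PIL import Image) is included on the assumption that you might be using a newer PIL layout or the Pillow fork rather than the classic Python 2.5 setup described above.

```python
import sys


def pil_available():
    """Return True if the Python Imaging Library can be imported."""
    try:
        import Image  # classic PIL layout on Python 2.x
    except ImportError:
        try:
            # Assumption: package-style layout, as used by newer PIL and Pillow
            from PIL import Image
        except ImportError:
            return False
    return True


if __name__ == "__main__":
    print("Python %s" % sys.version.split()[0])
    print("PIL importable: %s" % pil_available())
```

If the second line prints False, the imaging library is not visible to the Python interpreter you are running, and the later examples will fail on their first import.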
Preface
I am going to hardcode a lot of the values in this example. I am not trying to create a general CAPTCHA solver, but one specific to the examples given. This is just to keep the examples short and concise.
CAPTCHAs, What are they Anyway?
A CAPTCHA is basically an implementation of a one-way function: a function where it is easy to take an input and compute the result, but difficult to take the result and work out the input. What makes a CAPTCHA different is that the reverse direction only needs to be difficult for a computer; it should still be easy for a human. In simple terms, a CAPTCHA can be thought of as an "Are you a human?" test. Essentially they are implemented by showing an image which has some word or letters embedded in it.
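To make the "easy one way, hard the other" idea concrete, here is a toy sketch of the easy direction only: turning known text into a noisy bitmap. It is not how any real CAPTCHA is generated. The 3x5 FONT table, the render_captcha function and the noise parameter are all invented for this illustration, and the "image" is just a list of strings so that it runs without any imaging library.

```python
import random

# Hypothetical 3x5 bitmap font covering only the characters used below.
FONT = {
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "C": ["###", "#..", "#..", "#..", "###"],
}


def render_captcha(text, noise=0.1, seed=None):
    """Forward direction: turn known text into a noisy bitmap (list of strings)."""
    rng = random.Random(seed)
    rows = []
    for y in range(5):
        # Join row y of each glyph, separated by one blank column.
        row = ".".join(FONT[ch][y] for ch in text)
        # Flip each pixel with probability `noise` to add speckle.
        noisy = "".join(
            ("#" if px == "." else ".") if rng.random() < noise else px
            for px in row
        )
        rows.append(noisy)
    return rows


if __name__ == "__main__":
    for line in render_captcha("CA7", noise=0.1, seed=1):
        print(line)
```

Producing the bitmap takes a dictionary lookup and a loop. Going the other way, from a grid of noisy pixels back to "CA7", is the part a program has to work for, and it is the part the rest of this article is concerned with.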
CAPTCHAs are used to prevent automated spam on many websites. An example can be found on the Windows Live ID signup page.