To actually get a good sample set, I used a web-scraping Chrome plugin to visit the login page and download the page's CAPTCHA image. This left me with roughly 300 CAPTCHA images to aid me in cracking Reddit's security system.
From here, I decided to buy a textbook on Artificial Intelligence/Machine Vision. I ended up buying this book, which seems to be pretty popular and influential when it comes to AI. I read over algorithms related to machine vision such as edge detection, character recognition, ellipse detection, and more. With these algorithms in my back pocket, I decided to try my hand at the supposedly impossible.
Step 1: Remove Extra Colors and Background Noise
To give you an idea of how the program worked, I'll be providing an example here. The following image is a real CAPTCHA that came from Reddit.
To begin, I decided to write an application that converted the image to 4 bpp (16-color) color. I figured that reducing the number of colors would hopefully make it easier to detect what was in the image. As luck would have it, the conversion really helped! The following is the image with 4 bpp color.
4 BPP CAPTCHA
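For the curious, a 4 bpp conversion like this can be done in Java with an indexed color model. This is only a sketch of the idea under my own assumptions (a 16-level grayscale palette; the original application's code and palette aren't shown in the post):

```java
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;

public class Quantize4bpp {
    // Snap every pixel to the nearest entry of a 16-level grayscale
    // palette, i.e. 4 bits per pixel. IndexColorModel performs the
    // nearest-color matching for us inside setRGB.
    static BufferedImage quantizeTo4bpp(BufferedImage src) {
        byte[] levels = new byte[16];
        for (int i = 0; i < 16; i++) {
            levels[i] = (byte) (i * 17); // 0, 17, 34, ..., 255
        }
        IndexColorModel palette = new IndexColorModel(4, 16, levels, levels, levels);
        BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY, palette);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                dst.setRGB(x, y, src.getRGB(x, y));
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        src.setRGB(0, 0, 0xFEFDFE); // near-white letter pixel
        src.setRGB(1, 0, 0x7F8081); // near-gray grid pixel
        BufferedImage dst = quantizeTo4bpp(src);
        // Near-white collapses to pure white; near-gray snaps to one gray level.
        System.out.println(Integer.toHexString(dst.getRGB(0, 0) & 0xFFFFFF));
        System.out.println(Integer.toHexString(dst.getRGB(1, 0) & 0xFFFFFF));
    }
}
```

This is why the quantized grid becomes mostly one solid gray: many slightly different grays all map to the same palette entry.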
You might not notice much of a difference, but there's actually one very large difference between this and the previous image: the grid is now mostly a solid gray with a bit of white, rather than many different shades spanning white to gray. This also meant that the foreground letters were basically the only white objects in the image. So, with this knowledge in hand, I wrote a BFS-like algorithm that traverses the image's pixels, replacing every white pixel (-1 as a packed ARGB int in Java) with black and every other color with white. This led to an amazing milestone; I had successfully removed the grid and created an image with only the letters and some artifacts!
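The heart of that pass boils down to a per-pixel test. A minimal sketch (my reconstruction, not the original code):

```java
import java.awt.image.BufferedImage;

public class WhiteToBlack {
    // Swap the roles of foreground and background: pure white pixels
    // (-1 as a packed ARGB int in Java) become black, and every other
    // color (the gray grid and its remnants) becomes white.
    static BufferedImage keepWhiteAsBlack(BufferedImage img) {
        BufferedImage out = new BufferedImage(img.getWidth(), img.getHeight(),
                BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                boolean isWhite = img.getRGB(x, y) == -1; // 0xFFFFFFFF
                out.setRGB(x, y, isWhite ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

The -1 check works because getRGB packs a pixel as 0xAARRGGBB: fully opaque white is 0xFFFFFFFF, which is -1 as a signed int.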
Step 2: Removing Artifacts
Obviously, the next step was to remove those extraneous artifacts. To do this, I wrote a few algorithms to determine whether a black pixel was extraneous or not. One of these checked every black pixel and counted how many of its surrounding pixels were black: if there were many, the pixel most likely belonged to a letter, and if there were very few, it was most likely extraneous. Another counted how many black pixels were in each row. If there were a lot, the word was presumably in that row. If a row i had only a few, I would erase rows 0 through i or rows i through the bottom of the image, depending on where in the image the algorithm was.
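The two heuristics might look roughly like this; the neighbour threshold and what counts as "a lot" per row are my assumptions, since the post doesn't give exact values:

```java
import java.awt.image.BufferedImage;

public class Despeckle {
    // Heuristic 1: erase black pixels with too few black neighbours.
    // minNeighbors is a hypothetical threshold, not the post's value.
    static void despeckle(BufferedImage img, int minNeighbors) {
        int w = img.getWidth(), h = img.getHeight();
        // Snapshot the black mask first so erasures don't skew later counts.
        boolean[][] black = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                black[y][x] = (img.getRGB(x, y) & 0xFFFFFF) == 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!black[y][x]) continue;
                int n = 0; // black pixels among the 8 neighbours
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        if ((dx != 0 || dy != 0)
                                && y + dy >= 0 && y + dy < h
                                && x + dx >= 0 && x + dx < w
                                && black[y + dy][x + dx]) n++;
                if (n < minNeighbors) img.setRGB(x, y, 0xFFFFFF);
            }
        }
    }

    // Heuristic 2: black-pixel count per row. Rows with low counts are
    // probably above or below the word and can be cleared by the caller.
    static int[] rowBlackCounts(BufferedImage img) {
        int[] counts = new int[img.getHeight()];
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++)
                if ((img.getRGB(x, y) & 0xFFFFFF) == 0) counts[y]++;
        return counts;
    }
}
```

A lone speck has at most one or two black neighbours and gets wiped, while pixels inside or on the edge of a letter stroke keep enough black neighbours to survive.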
Alright, getting there...
Alright, this ends part 2. Sorry about the crazy number of posts, but I'm a student and I don't have a lot of time to write this all out at once. The next part should cover separating the letters and possibly recognizing them. Thanks for reading and I hope you'll stick around for the next post!