Today's pointless hack was brought on by the fact that happened to be in possession of the complete dilbert comics up until 2005. Often I remember some particular strip, but have no way to sensible search thousands of GIF files. (I *could* join comics.com, but that would be less fun, and I already have nearly all the dilbert books, so this info is sort of 'mine' already, ahem). Also inspired by my recent re-reading of Hofstadter's Letter Spirit, I set to work. Note that there are good commercial solutions for this, and probably lots of well known algorithms etc., but I wanted to discover this myself. I did a quick google for tutorials on OCR, but nothing sensible came up.)
So, looking at some strips, the dilbert text seems to have many nice features for automatic extraction:
- The font is always the same size/type
- the letters are all capitals
- the lines of text are always straights
- there is always white-space behind the text
- There is very little punctuation or digits.
- with some exceptions to all of these, but these bits i can live without:
So with the gimp, python, ImageMagick, Numeric and PIL I set to work. First I cut out one of each letter, and auto-cropped in the gimp:
The first plan was to scan every pixel of the big image for a match with every letter, a match being defined as the sum of the errors for the overlapping pixels being below a certain threshold. Since most dilberts (apart from sunday strips) are black and white anyway, I did everything in gray-scale mode and the pixel values are simply byte values. The threshold for a "match" was set at 40 after some trial and error.
This process was slow as cancer, even when I changed from python lists to Numeric arrays. (later it turned, using a flat list and computing the offset as y*w+x is actually quicker than Numeric arrays, even without psyco, odd…) The first attempt, for a single letter only, looked like this:
Tweaking the numbers for the "match" threshold and removing testing for overlapping matches etc. speeded it up a bit, and I soon discovered another problem. Here illustrated by looking for matches for i's in the first frame:
The problem was that comparing the tightly cropped I image was matching lots of sub-sections of other letters. Since the thing was still horribly slow I decided to try a slightly different approach. I would detect the base-lines of each line of the text in picture, again, this should be pretty easy apart from a few comics where there is content next to the text, but I would worry about that later. First attempt at finding lines that were largely empty:
then group together blocks of lines and remove the ones to close to the top to fit a whole letter:
Now I retried the above matching of letters, trying each letter on every possible X position along each line, then order them by X value.
This produces the first real result, i.e. it spat out some text:
visiting a customer
our office was
designed with the
science of feng shui
Correct answers in italics. Interestingly this produces more letters than in the original :) So I try grouping the letters that are duplicates, i.e. both were detected in the same place:
v<it><it>ing a cu<i t>mer
<od>ur <od>f<tf><it>o as
<iod>es<it>gne<od> w<it>h lt<fh>e
c<it>ence <ocg><ep> fe<cg> hu<itl>
It's not quite useful yet… even if I generated all possible words for each ambiguous character I wouldn't get anything sensible.
Also, there is clearly a big problem in recognising S's and in telling Is and Ts apart. Maybe I shouldn't disallow overlapping boxes… this made it slightly slower again, but didn't improve results.
So after two evenings of hacking I now had some random text that was completely useless. My instinct told me I had probably taken the simple sum-of-error approach as far as it would go, and now – there is a fork in the path:
1. Keep considering the letters independently, but let the program learn, i.e. sit through a few sessions of : "i reckon this is an 'S', no that's a 'Z', try harder…. "
2. Make the matching pattern of each letter more aware of the special features of each letter, i.e. ignore the bits the I and T have in common, focus on the difference. Not sure how to do this when 3 (or more) letters match the same thing… I would probably just try a dodgy hack and see where I get to.
3. Neural networks does 2 much better than I can ever hope to.
I've done some more work on this now, and I will shortly follow it up with a chapter II! :)
I was a bit late in typing up this part I — and I needed a Saturday to find the time (as well as the need to prepare 150 slides on Semantic Web Services for Monday made it easy to want to do other things…)
PS: Leo was convinced I was wasting my time (BUT IT'S THE JOURNEY NOT THE DESTINATION!), and tried OmniPage Pro on some dilbert cartoons, and it sucked! Ha! This isn't completely pointless after all!