Sunday, November 20, 2011
Indulging a long-standing interest in OCR, I begin an adventure in the realm of free OCR software.
TWAIN supports my ancient scanner very well (better than it did on Windows or Mac, actually; Linux seems very kind to old hardware), so I begin. After scanning my ancient copy of Gods, Demi-Gods and Heroes at 600 dpi, I start searching through apt-cache for OCR software.
I tried a few packages with fairly dismal results, although unpaper seems quite handy for improving the readability of the raw scans. It doesn't seem to fix as much as I'd expect, so I imagine there is still much to figure out there.
ocrad didn't work well, and tesseract (from the Debian distro) didn't work at all with the PNM files produced by TWAIN.
Time to dig up the utilities to convert the files.
unpaper does handle splitting multiple pages into single pages nicely, though.
So:
mkbitmap -x -f 50 -t 0.02 $file
convert $base.pnm $base.tif
tesseract $base.tif $base
After building the latest version of tesseract from Google, the results improved markedly, but they are still not very satisfactory. Ah, I have since read that 300 dpi is really much better (and quite sufficient), and a scan of a Traveller supplement worked much better with tesseract.
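For a whole stack of scans, the three commands above could be wrapped in a small loop. This is only a rough sketch, not the exact procedure used here: it assumes the scans are .pnm files in the current directory, and it assumes the mkbitmap output (written to a .pbm file with -o) is what should be handed to convert, which the original commands leave ambiguous. The run and ocr_page helpers and the DRY_RUN variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: run the mkbitmap / convert / tesseract steps over every scan.
# Set DRY_RUN=1 to print the commands instead of executing them.

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: process one scanned page.
ocr_page() {
    file=$1
    base=${file%.pnm}    # strip the .pnm extension
    # Assumption: write the filtered bitmap to $base.pbm and convert that file.
    run mkbitmap -x -f 50 -t 0.02 -o "$base.pbm" "$file"
    run convert "$base.pbm" "$base.tif"
    # tesseract appends .txt to the output base name itself.
    run tesseract "$base.tif" "$base"
}

for f in *.pnm; do
    [ -e "$f" ] || continue    # skip the literal pattern when nothing matches
    ocr_page "$f"
done
```

Running it once with DRY_RUN=1 shows the commands it would issue for each page, which is a cheap way to check the file names before spending time on the actual OCR.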