In case anyone wonders: I tested whether Google could solve its own CAPTCHAs. It can, if each character is separated, but once they overlap, as they usually do, it doesn't work.
I was unable to find any reference regarding this. Personally, gocr worked better for me than tesseract/pytesseract. Google Docs' built-in OCR gives pretty satisfactory results too.
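For anyone who wants to run the same comparison, here's a rough sketch of feeding one scan through both tools from Python. It assumes tesseract and gocr are already installed and on PATH, that pytesseract and Pillow are available, and that "scan.png" is just a placeholder filename.

```python
# Minimal sketch: run the same scanned image through tesseract and gocr.
import subprocess
from PIL import Image
import pytesseract

image = Image.open("scan.png")  # placeholder filename

# Tesseract, via the pytesseract wrapper.
tesseract_text = pytesseract.image_to_string(image)

# gocr only reads netpbm formats, so convert to PGM first.
image.convert("L").save("scan.pgm")
gocr_text = subprocess.run(
    ["gocr", "scan.pgm"], capture_output=True, text=True, check=True
).stdout

print("tesseract:", tesseract_text)
print("gocr:     ", gocr_text)
```

Which one wins depends heavily on the scan quality and fonts, so it's worth trying both on your own documents.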
I find it tremendously frustrating that so many people are creating this problem for themselves.
Anything that needs to be data should be data, not images. Except for some very specific cases, you're not doing anybody any favors by outputting PDF. That format is a data black hole. It lets you transmit very well-formatted output, but it makes it nearly impossible to reliably extract anything from that content afterwards.
I beg you all: if it's anything that contains data, or really, if it's anything for which layout and formatting is not absolutely critical, please don't use PDF. Send data as data.
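To make the plea concrete, here's a minimal sketch of what "send data as data" can look like: emit the machine-readable record (JSON here, though CSV or XML would do just as well) alongside, or instead of, the nicely formatted PDF. The field names are made up for illustration, not any particular standard.

```python
# Minimal sketch: a purchase order as structured data instead of a PDF.
import json

purchase_order = {
    "po_number": "PO-12345",   # hypothetical example values
    "currency": "USD",
    "lines": [
        {"sku": "ABC-100", "qty": 12, "unit_price": 4.50},
        {"sku": "XYZ-200", "qty": 3, "unit_price": 19.99},
    ],
}

# The receiving system can load this directly: no OCR, no layout heuristics.
with open("po-12345.json", "w") as f:
    json.dump(purchase_order, f, indent=2)
```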
Obviously. But if we could make everyone understand, then we'd be covered.
Every few months here, we get a customer asking why we can't automatically handle purchase orders that they send us in PDF format, and every time they get the same explanation.
If we could make everyone understand, we wouldn't need computer programmers. We could just have computers talk to each other, and all their formats would be magically compatible, and the vast body of data conversion code wouldn't exist.
The problem is that computers are made for humans, and humans are often wantonly illogical. You're not going to change this, short of Skynet and the rise of the machines. So it makes sense to put up with a fair amount of coding pain to make things easier for your users. It's lucrative, at least.
Think of it as a full-employment theorem for data-miners.
I scan in all my documents as PDFs. The text, including tables, gets scanned in as text. I can copy from the document and paste it anywhere else. Spotlight and Google can index the PDFs.
The only thing I can't do easily is edit the documents, but I don't need to do that with scanned documents (e.g. tax statements).
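For what it's worth, that embedded text layer is exactly what makes these PDFs useful downstream: it can be pulled straight back out without re-running OCR. Here's a minimal sketch using the pypdf package; "statement.pdf" is just a placeholder filename.

```python
# Minimal sketch: extract the text layer from a searchable (OCR'd) PDF.
from pypdf import PdfReader

reader = PdfReader("statement.pdf")  # placeholder filename
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text)
```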
Has anyone checked to see if this works with Japanese, Korean, or Chinese? What about Arabic or Hindi? That would shed some light on whether it's likely to be Tesseract or OCRopus under the hood.
Incidentally, I noticed that if you try to use tesseract on an image taken from a Google Books page, you get terrible OCR accuracy. Anyone know why that is?
Trying it on some scanned forms I have, I got an average of 5 recognized characters per page. The form's ruled boxes also came out as "1 1 1 1 1 1 1 1 1 1 1 1 1".
I may not rely entirely on Google Docs for my OCR needs in the future ;)