I think this is one of the few practical applications of LLMs that is genuinely, undeniably useful.
OCR has always been “untrustworthy” (meaning you cannot expect it to be 100% correct and must account for that), and we have long used ML algorithms for it.
OCR is not to blame: with garbage in, you should not expect high-quality output, especially with handwriting, tables, and mixed languages. Even humans fail to read some documents (see doctors' prescriptions).
For example, lowercase l and capital I often look identical, which is a classic problem for OCR. The ideal case is a PDF with the data also embedded as XML, but unfortunately that is usually not what you get.
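As a rough illustration of how such confusions can be handled downstream (this is my own minimal sketch, not anything from a specific OCR library; the confusion map and vocabulary are made up), a dictionary-based post-correction pass might look like:

```python
# Minimal sketch of dictionary-based post-correction for OCR
# confusion pairs like l/I/1 and O/0. Assumes a small known vocabulary.
from itertools import product

# Hypothetical confusion map: characters OCR commonly swaps.
CONFUSIONS = {"l": "lI1", "I": "Il1", "1": "1lI", "0": "0O", "O": "O0"}

def candidates(word):
    """Generate every variant of `word` under the confusion map."""
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    return {"".join(combo) for combo in product(*options)}

def correct(word, vocabulary):
    """Return a vocabulary match among the variants, else the word unchanged."""
    for cand in candidates(word):
        if cand in vocabulary:
            return cand
    return word

vocab = {"Illinois", "install"}
print(correct("lllinois", vocab))  # prints "Illinois"
```

Real systems use language models or weighted edit distances rather than brute-force enumeration, which blows up on long words, but the principle is the same: OCR output needs a correction layer, not blind trust.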