This is mostly what I worked on for many years at Apple, with reasonable success. The main secret was to accept that everything was geometry, and to use cluster analysis to distinguish between word gaps and letter gaps. On many PDF documents it works really well, but there are so many different kinds of PDF documents that there are always cases where the results are not that great. If I were to do it today, I would stick with geometry and avoid OCR completely, but use machine learning. One big advantage of machine learning is that I could use existing tools to generate PDFs from known text, so the training phase could be completely automatic.
(Here is Bertrand Serlet announcing the feature at WWDC in 2009: https://youtu.be/FTfChHwGFf0?si=wNCfI9wZj1aj9rY7&t=308)
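
To make the gap-clustering idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not what shipped at Apple: the glyph tuples, the function names `split_gaps` and `join_glyphs`, and the choice of a simple 1-D two-means clustering are all assumptions made for the example.

```python
def split_gaps(gaps, iterations=20):
    """Cluster 1-D gap widths into two groups (1-D two-means).

    Returns a threshold between the "letter gap" and "word gap" clusters,
    or None when all gaps are the same width and there is nothing to split.
    """
    lo, hi = min(gaps), max(gaps)
    if lo == hi:
        return None
    for _ in range(iterations):
        mid = (lo + hi) / 2
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break  # cluster centers stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def join_glyphs(glyphs):
    """Rebuild the text of one line from positioned glyphs.

    glyphs: list of (char, x_start, x_end), sorted left to right.
    (Hypothetical input format; a real PDF gives you much messier data.)
    """
    if not glyphs:
        return ""
    # Horizontal distance between the end of one glyph and the start of the next.
    gaps = [b[1] - a[2] for a, b in zip(glyphs, glyphs[1:])]
    threshold = split_gaps(gaps) if gaps else None
    out = [glyphs[0][0]]
    for (ch, _, _), gap in zip(glyphs[1:], gaps):
        if threshold is not None and gap > threshold:
            out.append(" ")  # gap falls in the "word gap" cluster
        out.append(ch)
    return "".join(out)


line = [("h", 0, 5), ("i", 6, 8), ("y", 14, 19), ("o", 20, 25)]
print(join_glyphs(line))
```

Note that this toy version always splits into two clusters whenever the gaps differ at all, so a line containing a single word with uneven kerning would get a spurious space. Handling cases like that is exactly where the real problem gets hard.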