This is a wonderful idea, but while I appreciate the venerable Tesseract I also think it's time to move on.

I personally use PaddlePaddle and get much better results, which I then correct with LLMs.

With PPOCRv3 I wrote a custom Python implementation that cuts book pages at word level by playing with whitespace thresholds. It works great for the typesetting generally found in books, where the whitespace between words is predictable. All of this is needed because PPOCRv3 is restricted to 320 x 240 pixels, if I recall correctly, and produces garbage if you downsample a big image and run it in a single pass.
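To make the whitespace-threshold idea concrete, here is a minimal sketch. It is not the actual implementation, just one plausible way to do it. It assumes a single text line that has already been binarized (ink = 0, background = 255), and the names `split_words` and `gap_threshold` are made up for illustration. It marks which pixel columns contain ink and starts a new word whenever the run of blank columns is at least `gap_threshold` wide.

```python
import numpy as np

def split_words(line_img, gap_threshold, ink_value=0):
    """Return (start, end) column ranges of words in a binarized text line.

    line_img: 2D array, one text line, ink pixels equal to ink_value.
    gap_threshold: minimum number of blank columns that separates two words
                   (hypothetical tuning knob, depends on the typesetting).
    The end column is exclusive, so line_img[:, start:end] is the word crop.
    """
    # True for every column that contains at least one ink pixel
    has_ink = (line_img == ink_value).any(axis=0)

    words = []
    start = None      # first ink column of the word being built
    last_ink = None   # most recent ink column seen
    for x, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= gap_threshold:
            # The blank run is wide enough: close the word and open a new one
            words.append((start, last_ink + 1))
            start = x
        last_ink = x

    if start is not None:
        words.append((start, last_ink + 1))
    return words
```

Narrow gaps (between letters) stay inside a word, wide gaps split it. Each resulting crop is small enough to hand to the recognizer on its own, instead of downsampling the whole page.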

Later on I ported the Python code to C so it could run on the Rockchip RK3399Pro NPU. It works wonderfully. I first used PaddleOCR2Pytorch to convert the models for use with rknn-api, then wrote the C implementation that cuts words on top of rknn-api.

But with PPOCRv4 I think this isn't even needed. It's a newer architecture and I don't think it is bound by the pixel size restriction, so it should work "out of the box", so to speak. One caveat: PPOCRv3 detection always worked better for me, while the PPOCRv4 detection model gave me big headaches.