This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection.
The OCR is slow on CPU (working on it), but on GPU it is faster than tesseract (which is CPU-only).
You could probably replace pymupdf, tesseract, and some layout heuristics with this.
Happy to discuss more, feel free to email me (in profile).
It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.
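A minimal sketch of the separate-process idea, assuming Python on the calling side: the caller never imports the GPL library and only launches its command-line tool as a child process, then reads the output. The helper name `run_tool` and the stand-in command are hypothetical; the real surya or ghostscript command line would go in `cmd`, and its arguments are not shown here because they depend on the installed version.

```python
import subprocess
import sys
from typing import Sequence


def run_tool(cmd: Sequence[str], timeout: float = 600.0) -> str:
    """Run an external CLI tool as a child process and return its stdout.

    The tool's code is never imported into this process; the only
    interface is the command line and the text it prints.
    """
    result = subprocess.run(
        list(cmd),
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout,      # avoid hanging forever on a stuck job
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command (hypothetical): replace with the actual OCR CLI call.
    out = run_tool([sys.executable, "-c", "print('ocr output placeholder')"])
    print(out.strip())
```

Batching many pages into one invocation, as the CLI allows, keeps the process start-up cost from dominating.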
Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.
The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.
There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google Cloud OCR, Azure OCR).
Still, ranking surya against doctr, textract, and tesseract would be a really nice baseline. As a research user, business user, or open source contributor, those are the results I need to quickly understand surya's potential.