This looks great! You might be interested in surya - https://github.com/VikParuc...

nicklo · 2024-04-08T16:28:30 1712593710

OP: please don't poison your MIT license w/ surya's GPL license

vikp · 2024-04-11T15:49:43 1712850583

It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.

barfbagginus · 2024-04-09T15:11:04 1712675464

Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.

The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.

There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google cloud ocr, Azure ocr)

Still, ranking surya with doctr, textract, and tesseract would be really nice baseline. As a research user, business user or open source contributor, those are the results I need to quickly understand surya's potential.

vikp · 2024-04-11T15:49:02 1712850542

I've benchmarked against google cloud ocr, but the results are on Twitter, not the repo yet - https://twitter.com/VikParuchuri/status/1765440195124691339 . The reason I didn't benchmark against doctr is language support.