
That looks like a pretty good starting point, thanks. I've been dabbling in vision models but need a much higher degree of accuracy than they seem able to provide, opting instead for more traditional techniques and handling errors manually.


For non-table documents, a fine-tuned YOLOv8 + Tesseract with _good_ image pre-processing has basically a zero percent error rate on monolingual texts. I say basically because, in the cases I double-checked manually, the training data's labels were worse than what the multi-model system put out.

But no one reads the manual on Tesseract, and everyone ends up feeding it garbage, with predictable results.
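Roughly, a minimal sketch of the pipeline I mean (the checkpoint name, region classes and pre-processing thresholds are placeholders for my setup, not anything universal): detect regions with the fine-tuned YOLOv8, upscale and binarize each crop, then hand it to Tesseract with an explicit page segmentation mode, which is the part of the manual everyone skips.

    # Sketch: fine-tuned YOLOv8 for layout detection, Tesseract for OCR.
    # "layout_yolov8.pt" is a hypothetical fine-tuned checkpoint; tune the
    # target height and thresholding to your scan quality.
    import cv2
    import pytesseract
    from ultralytics import YOLO

    model = YOLO("layout_yolov8.pt")  # fine-tuned on your document classes

    def preprocess(crop, target_height=1200):
        # Upscale small crops, convert to grayscale, binarize with Otsu.
        # Tesseract degrades badly on low-resolution, noisy input.
        scale = max(1.0, target_height / crop.shape[0])
        crop = cv2.resize(crop, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_CUBIC)
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return binary

    def ocr_page(path):
        image = cv2.imread(path)
        regions = model(image)[0]
        texts = []
        for box in regions.boxes.xyxy.cpu().numpy().astype(int):
            x1, y1, x2, y2 = box
            crop = preprocess(image[y1:y2, x1:x2])
            # --psm 6: assume a single uniform block of text per region.
            texts.append(pytesseract.image_to_string(crop, config="--psm 6"))
        return texts

    print("\n".join(ocr_page("page_001.png")))

The point is that Tesseract's accuracy is mostly decided before it ever sees the image.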

Tables are an open research problem.

We started training a custom version of this model: https://arxiv.org/pdf/2309.14962, but there wasn't a business case, since the BERT search model dealt well enough with the word soup that came out of EasyOCR. If you're interested, drop a line. I'd love to get a model like that trained, since it's very low-hanging fruit that no one has done right.
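For context, the "word soup" path looks roughly like this sketch (the encoder name, confidence cutoff and chunking-by-page are my assumptions, not a description of our exact setup): concatenate EasyOCR output per page, embed it with a BERT-style sentence encoder, and do cosine search over the embeddings.

    # Sketch: EasyOCR text embedded with a BERT-style sentence encoder
    # for search. Model name and thresholds are placeholders.
    import easyocr
    from sentence_transformers import SentenceTransformer, util

    reader = easyocr.Reader(["en"])
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def page_text(path):
        # readtext returns (bbox, text, confidence) triples; keep the text.
        return " ".join(text for _, text, conf in reader.readtext(path)
                        if conf > 0.3)

    pages = [page_text(p) for p in ["page_001.png", "page_002.png"]]
    page_vecs = encoder.encode(pages, convert_to_tensor=True)

    query_vec = encoder.encode("termination clause", convert_to_tensor=True)
    scores = util.cos_sim(query_vec, page_vecs)[0]
    best = int(scores.argmax())
    print(f"best match: page {best}, score {float(scores[best]):.3f}")

The embeddings are tolerant enough of OCR noise that fixing the OCR never paid for itself.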


The first thing I did when I saw this thread was ctrl-f for doclaynet :)

I've been at this problem since 2013, and a few years ago I turned my findings into more of a consultancy than a product. See https://pdfcrun.ch

However, due to various events, I burned out recently and took a permie job. I'd love to stick my head in the sand and play video games in my spare time, but I was secretly hoping you'd see this so I could hear about your work.


There's not much to say.

DocLayNet is the easy part, and at triple the usual resolution the previous generation of YOLO models has solved document segmentation for every document I've looked at.
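Concretely, the resolution trick is nothing more than this sketch (the checkpoint name is a placeholder for a YOLO model fine-tuned on DocLayNet): pass an imgsz well above the detection default so footnotes, thin rules and dense text blocks survive the downscale.

    # Sketch: DocLayNet-style layout inference at high resolution.
    from ultralytics import YOLO

    model = YOLO("doclaynet_yolov8.pt")  # placeholder fine-tuned checkpoint

    # Default detection inference runs at imgsz=640; push the long side
    # up to ~1920 so small page elements are still detectable.
    results = model("page_001.png", imgsz=1920, conf=0.25)[0]

    for box, cls in zip(results.boxes.xyxy, results.boxes.cls):
        print(results.names[int(cls)], [round(float(v)) for v in box])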

The hard part is the table segmentation. I don't have the budget to do a proper exploration of hyperparameters for the GridFormer models before starting a $50,000 training run.

This is a back-burner project, along with speaker diarization. I have no idea why those haven't been solved, since they're very low-hanging fruit that would release tens of millions in productivity when deployed at scale, but regardless I can't justify buying an Nvidia DGX H200 and spending two months exploring architectures for each.


Thanks, that's interesting research, I'll look into it.



