
I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with its transcription as part of the prompt. This reduced GPT's mistakes and improved output accuracy.
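A minimal sketch of what that few-shot setup can look like with the OpenAI Python SDK (the file names, example transcription, and system prompt are placeholders of mine, not Hotseat's actual prompt):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def to_data_url(path):
        # Encode a page image as a data URL for the vision input
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    EXAMPLE_TRANSCRIPTION = "## Example heading\n\nExample body text..."

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Transcribe the page image to markdown."},
            # One worked example: the image, then the transcription we expect for it
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("example_page.png")}},
            ]},
            {"role": "assistant", "content": EXAMPLE_TRANSCRIPTION},
            # The page we actually want transcribed
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("target_page.png")}},
            ]},
        ],
    )
    print(response.choices[0].message.content)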

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap), I would log a warning. This helped detect cases where GPT omitted entire paragraphs of text.
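The check could look something like this (my own sketch, not their code; the names and the 0.9 threshold just follow the description above):

    import logging
    import re
    from collections import Counter

    def trigrams(text):
        # Lowercase, strip non-alphanumerics, then count character triples
        s = re.sub(r"[^a-z0-9]", "", text.lower())
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def trigram_overlap(source_text, gpt_text):
        src, out = trigrams(source_text), trigrams(gpt_text)
        shared = sum((src & out).values())  # intersection keeps the min count per trigram
        return shared / max(sum(src.values()), 1)

    pdf_text = "text extracted from the PDF page"        # placeholder
    gpt_output = "GPT's transcription of the same page"  # placeholder
    if trigram_overlap(pdf_text, gpt_output) < 0.9:
        logging.warning("possible omitted text: low trigram overlap with source")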




One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. Especially useful if you've got tables that span pages. The flow is pretty much this (sketched in code below the list):

- Request #1 => page_1_image

- Request #2 => page_1_markdown + page_2_image

- Request #3 => page_2_markdown + page_3_image
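Roughly, in code (a rough sketch of that flow, not the library's actual implementation; the prompt text and helper names are mine):

    from openai import OpenAI

    client = OpenAI()

    def transcribe_page(image_url, prior_markdown=None):
        content = []
        if prior_markdown:
            # Feed the previous page's markdown back in as formatting context
            content.append({"type": "text",
                            "text": "Markdown of the previous page:\n" + prior_markdown})
        content.append({"type": "image_url", "image_url": {"url": image_url}})
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Convert this page to markdown, keeping the formatting consistent with the previous page."},
                {"role": "user", "content": content},
            ],
        )
        return resp.choices[0].message.content

    page_image_urls = ["https://example.com/page_1.png", "https://example.com/page_2.png"]  # placeholders
    pages = []
    for url in page_image_urls:
        prior = pages[-1] if pages else None
        pages.append(transcribe_page(url, prior))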


> frequency of character triples

What are character triples? Are they trigrams?


I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g. for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.
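As a tiny illustration (my own snippet) of that normalization:

    import re

    def trigrams(text):
        s = re.sub(r"[^a-z0-9]", "", text.lower())    # "What now?" -> "whatnow"
        return [s[i:i + 3] for i in range(len(s) - 2)]

    print(trigrams("What now?"))  # ['wha', 'hat', 'atn', 'tno', 'now']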


> I extracted the embedded text from the PDF

What did you use to extract the embedded text for this step? Presumably something other than another OCR tool?


PyMuPDF, a PDF library for Python.
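For reference, the extraction can be as simple as this (minimal sketch; the file name is a placeholder):

    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    embedded_text = "\n".join(page.get_text() for page in doc)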


A different approach from vanilla OCR/parsing is ColPali [1], which combines a purpose-built small vision model with ColBERT-style indexing for retrieval. So, if search is the intended use case, it can skip the whole OCR step entirely.

[1] https://huggingface.co/blog/manu/colpali



