
I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with its transcription as part of the prompt. This reduced GPT's mistakes and improved output accuracy.
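A minimal sketch of what that few-shot setup can look like with the OpenAI Python SDK (the file names, example transcription, and system prompt are placeholders of mine, not Hotseat's actual prompt):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def to_data_url(path):
        # Encode a page image as a data URL for the vision input
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    EXAMPLE_TRANSCRIPTION = "## Example heading\n\nExample body text..."

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Transcribe the page image to markdown."},
            # One worked example: the image, then the transcription we expect for it
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("example_page.png")}},
            ]},
            {"role": "assistant", "content": EXAMPLE_TRANSCRIPTION},
            # The page we actually want transcribed
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("target_page.png")}},
            ]},
        ],
    )
    print(response.choices[0].message.content)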

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap), I would log a warning. This helped detect cases where GPT omitted entire paragraphs of text.
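The check could look something like this (my own sketch, not their code; the names and the 0.9 threshold just follow the description above):

    import logging
    import re
    from collections import Counter

    def trigrams(text):
        # Lowercase, strip non-alphanumerics, then count character triples
        s = re.sub(r"[^a-z0-9]", "", text.lower())
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def trigram_overlap(source_text, gpt_text):
        src, out = trigrams(source_text), trigrams(gpt_text)
        shared = sum((src & out).values())  # intersection keeps the min count per trigram
        return shared / max(sum(src.values()), 1)

    pdf_text = "text extracted from the PDF page"        # placeholder
    gpt_output = "GPT's transcription of the same page"  # placeholder
    if trigram_overlap(pdf_text, gpt_output) < 0.9:
        logging.warning("possible omitted text: low trigram overlap with source")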




One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. Especially useful if you've got tables that span pages. The flow is pretty much this (sketched in code below the list):

- Request #1 => page_1_image

- Request #2 => page_1_markdown + page_2_image

- Request #3 => page_2_markdown + page_3_image
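Roughly, in code (a rough sketch of that flow, not the library's actual implementation; the prompt text and helper names are mine):

    from openai import OpenAI

    client = OpenAI()

    def transcribe_page(image_url, prior_markdown=None):
        content = []
        if prior_markdown:
            # Feed the previous page's markdown back in as formatting context
            content.append({"type": "text",
                            "text": "Markdown of the previous page:\n" + prior_markdown})
        content.append({"type": "image_url", "image_url": {"url": image_url}})
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Convert this page to markdown, keeping the formatting consistent with the previous page."},
                {"role": "user", "content": content},
            ],
        )
        return resp.choices[0].message.content

    page_image_urls = ["https://example.com/page_1.png", "https://example.com/page_2.png"]  # placeholders
    pages = []
    for url in page_image_urls:
        prior = pages[-1] if pages else None
        pages.append(transcribe_page(url, prior))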


> frequency of character triples

What are character triples? Are they trigrams?


I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g. for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.
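As a tiny illustration (my own snippet) of that normalization:

    import re

    def trigrams(text):
        s = re.sub(r"[^a-z0-9]", "", text.lower())    # "What now?" -> "whatnow"
        return [s[i:i + 3] for i in range(len(s) - 2)]

    print(trigrams("What now?"))  # ['wha', 'hat', 'atn', 'tno', 'now']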


> I extracted the embedded text from the PDF

What did you use to extract the embedded text for this step? Presumably something other than another OCR tool?


PyMuPDF, a PDF library for Python.
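For reference, the extraction can be as simple as this (minimal sketch; the file name is a placeholder):

    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    embedded_text = "\n".join(page.get_text() for page in doc)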


A different approach from vanilla OCR/parsing is ColPali [1], which combines a purpose-built small vision model with ColBERT-style indexing for retrieval. So, if search is the intended use case, it can skip the whole OCR step entirely.

[1] https://huggingface.co/blog/manu/colpali



