I'm trying to OCR a very large book: 45 volumes of ~500 pages each. The digitization has been done (not very good but not too bad either), but the pages have comments in the margin and lots of footnotes.

Just doing plain OCR doesn't really work because the notes in the margin and the footnotes get mingled with the text, which results in gibberish.

But, when sent to Google Vision API, each page results in a json file that has an object for each word and the four coordinates of its bounding box.
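For readers who want to work with that output directly, here is a minimal sketch of flattening it into a list of words with box extents. It assumes the file follows the usual text-detection response layout (a `textAnnotations` list whose first entry is the whole page, each entry having `description` and `boundingPoly.vertices`); the actual files may be wrapped differently, and `load_words` is a hypothetical helper name.

```python
import json

def load_words(vision_json):
    """Return one dict per word with its text and bounding-box extents."""
    response = json.loads(vision_json)
    # Assumption: entry 0 of "textAnnotations" is the full page text, so skip it.
    annotations = response.get("textAnnotations", [])[1:]
    words = []
    for ann in annotations:
        vertices = ann["boundingPoly"]["vertices"]
        # Assumption: a coordinate equal to 0 may be omitted from the JSON.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        words.append({
            "text": ann["description"],
            "left": min(xs), "right": max(xs),
            "top": min(ys), "bottom": max(ys),
        })
    return words

sample = json.dumps({"textAnnotations": [
    {"description": "Le roi", "boundingPoly": {"vertices": []}},
    {"description": "Le", "boundingPoly": {"vertices": [
        {"x": 10, "y": 20}, {"x": 40, "y": 20}, {"x": 40, "y": 45}, {"x": 10, "y": 45}]}},
    {"description": "roi", "boundingPoly": {"vertices": [
        {"x": 50, "y": 21}, {"x": 90, "y": 21}, {"x": 90, "y": 46}, {"x": 50, "y": 46}]}},
]})
words = load_words(sample)
```

Reducing each box to left/right/top/bottom loses the rotation of skewed boxes, which may matter for pages that are askew.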

That JSON file is pretty big (around 1.5 MB when pretty-printed, or 500 KB with no indents or line breaks) but it can then be fed to Gemini, taking advantage of its large context window.

Gemini is pretty good at identifying each section of the page (headers, main text, margin comments, footnotes) but it takes a looong time to respond (2-5 minutes per page).

So another approach is to ask Gemini to write a Python script that analyzes the JSON result and groups words into sections depending on their coordinates, and then run that script against the JSON output by the OCR phase.

But it's quite difficult to have a script that works for any page. Comments in the margin are always in the margin, so that's pretty easy, but footnotes can start at any height on the page (some pages contain only footnotes running over from previous pages). And Gemini likes to be pretty specific, giving hard 'y' coordinates for where footnotes should start, which obviously only works for the one page it's working on.
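One way to avoid hard-coded coordinates is to derive every threshold from the page itself. As a rough sketch, the function below groups words into lines using a tolerance of half the median word height on that page. The word dicts (`text`, `left`, `top`, `bottom`) and the name `group_into_lines` are assumptions for illustration, and margin comments would need to be filtered out first so they don't merge into the main-text lines.

```python
from statistics import median

def group_into_lines(words):
    """Group word boxes into lines, top to bottom, words left to right."""
    if not words:
        return []
    # Page-relative tolerance instead of a fixed pixel value.
    tolerance = 0.5 * median(w["bottom"] - w["top"] for w in words)
    lines = []
    for w in sorted(words, key=lambda w: (w["top"] + w["bottom"]) / 2):
        center = (w["top"] + w["bottom"]) / 2
        if lines and abs(center - lines[-1]["center"]) <= tolerance:
            line = lines[-1]
            line["words"].append(w)
            # Keep a running mean of the vertical centers on this line.
            line["center"] += (center - line["center"]) / len(line["words"])
        else:
            lines.append({"center": center, "words": [w]})
    return [sorted(l["words"], key=lambda w: w["left"]) for l in lines]

page = [
    {"text": "roi", "left": 50, "top": 102, "bottom": 122},
    {"text": "Le", "left": 10, "top": 100, "bottom": 120},
    {"text": "dit", "left": 10, "top": 130, "bottom": 150},
]
lines = group_into_lines(page)
```

Strongly skewed pages would probably need the boxes straightened before this kind of grouping is reliable.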

I'm iterating and making some progress, but I feel like I'm missing a big breakthrough and that it should all be simpler than it currently is. Information about OCR is pretty scarce online. Any pointer is welcome!




Are the footnotes in a different font or fontsize? If so, then the bounding box for footnote words should be smaller. Perhaps that can help with categorization.


Yes, thank you, that's one piece of information that I examined. The font size of the footnotes is a little bit smaller, but the difference is very small. On a single word it's not really obvious enough. But maybe the average height over a whole line would show a significant difference.
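The line-average idea could look something like the sketch below. It assumes the words are already grouped into lines (lists of dicts with `top` and `bottom`), and the 0.93 ratio is a made-up starting value that would need tuning on real pages. Both function names are hypothetical.

```python
from statistics import mean

def mean_line_heights(lines):
    """Average bounding-box height of the words on each line."""
    return [mean(w["bottom"] - w["top"] for w in line) for line in lines]

def first_smaller_line(heights, ratio=0.93):
    """Index of the first line clearly shorter than the lines above it, or None.

    ratio is a hypothetical threshold: a line counts as "smaller" when its
    mean height drops below ratio times the mean of all preceding lines.
    """
    for i in range(1, len(heights)):
        if heights[i] < ratio * mean(heights[:i]):
            return i
    return None

# Three main-text lines followed by two slightly smaller footnote lines.
lines = [
    [{"top": 0, "bottom": 20}, {"top": 0, "bottom": 20}],
    [{"top": 30, "bottom": 51}, {"top": 30, "bottom": 51}],
    [{"top": 60, "bottom": 80}],
    [{"top": 100, "bottom": 117}, {"top": 100, "bottom": 117}],
    [{"top": 125, "bottom": 142}],
]
heights = mean_line_heights(lines)
start = first_smaller_line(heights)
```

A page that contains only footnotes has no taller lines to compare against, so this returns `None` there and such pages would need separate handling.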


The width of words would differ more than the height of characters, so use the width. I would

1. Manually categorize a few thousand words of normal and footnote text. Then solve a linear system to figure out the width of each letter in normal and footnote. Now, you are able to compute the expected width of any word in normal or footnote font.

2. Now, when you get a fresh page, go down line by line. For every word in the line, compare the actual word width with the expected normal width and expected footnote width. Whichever is closer categorize the word as that. Then for the whole line, take the majority vote on whether to categorize it as normal or footnote line. Once you hit a footnote line, you are done.
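The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it fits per-character widths by least squares (normal equations plus a tiny ridge term so the system stays solvable, which is my addition), treats a word's width as the plain sum of its character widths (ignoring kerning and inter-letter spacing), and uses hypothetical names throughout.

```python
from collections import Counter

def solve(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def fit_letter_widths(samples, ridge=1e-6):
    """samples: (word, measured_width) pairs from ONE font class."""
    chars = sorted({c for word, _ in samples for c in word})
    idx = {c: i for i, c in enumerate(chars)}
    n = len(chars)
    # Normal equations (A^T A + ridge * I) x = A^T b, built word by word.
    ata = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    atb = [0.0] * n
    for word, width in samples:
        counts = Counter(word)
        for c1, k1 in counts.items():
            atb[idx[c1]] += k1 * width
            for c2, k2 in counts.items():
                ata[idx[c1]][idx[c2]] += k1 * k2
    return dict(zip(chars, solve(ata, atb)))

def expected_width(word, widths):
    # Characters never seen in training fall back to the mean fitted width.
    fallback = sum(widths.values()) / len(widths)
    return sum(widths.get(c, fallback) for c in word)

def classify_line(words, normal_widths, footnote_widths):
    """words: (text, measured_width) pairs; majority vote over the line."""
    votes = 0
    for text, width in words:
        d_normal = abs(width - expected_width(text, normal_widths))
        d_foot = abs(width - expected_width(text, footnote_widths))
        votes += 1 if d_foot < d_normal else -1
    return "footnote" if votes > 0 else "normal"

# Toy training data: 'a' is 10 px and 'b' 12 px in normal text, 8 and 10 in footnotes.
normal = fit_letter_widths([("a", 10), ("b", 12), ("ab", 22), ("aab", 32), ("bba", 34)])
foot = fit_letter_widths([("a", 8), ("b", 10), ("ab", 18), ("abb", 28)])
label = classify_line([("ab", 18.5), ("ba", 17.5), ("aab", 27)], normal, foot)
```

On real scans the measured widths also depend on resolution and skew, so the training words and the pages being classified should come from comparable scans.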


You're right that all I have to do is find the first line of the footnotes, because everything above is the text and everything under it is footnotes.

For now I have selected a crude approach: there is a gap in the page between the text and the notes, of about one line height. So if one simply takes all the first words of each line and compares their vertical distance, when that distance grows significantly, it's where the footnotes start.
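A minimal sketch of this gap heuristic, under a few assumptions of mine: the input is the `y` coordinate of the first word of each line, sorted top to bottom; the "significant" gap is taken to be the largest one on the page; and it only counts if it exceeds 1.5 times the median gap (a made-up factor to tune).

```python
from statistics import median

def find_footnote_start(line_tops, factor=1.5):
    """Return the index of the first footnote line, or None if no clear gap.

    line_tops: y coordinate of the first word of each line, top to bottom.
    factor: hypothetical threshold relative to the median line spacing.
    """
    gaps = [b - a for a, b in zip(line_tops, line_tops[1:])]
    if len(gaps) < 2:
        return None
    typical = median(gaps)
    # Pick the largest gap on the page, then check that it stands out.
    widest = max(range(len(gaps)), key=lambda i: gaps[i])
    if gaps[widest] > factor * typical:
        return widest + 1
    return None

# Four main-text lines 30 px apart, a 60 px gap, then three footnote lines.
start = find_footnote_start([100, 130, 160, 190, 250, 275, 300])
```

Paragraph breaks or headings inside the main text could produce a larger gap than the one before the footnotes, so this may need an extra check (for example, on line height or word width) before trusting the split.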

I have tested this method on a dozen pages and it works, but it remains to be seen whether it will stand the test of many pages, especially those that are askew.

Using the average width of letters instead of their height is a neat idea though; visually it's undeniable that there is a greater difference of width than of height between the footnotes and the main text. I may resort to that if the crude approach proves too simple!


Good luck


Can you talk about what book you’re trying to digitize?


It's an early 20th century edition of 18th century memoirs, in French. The project is not secret by any means but I'd rather not name it directly so as to not generate expectations that I may not satisfy.



