I'm trying to OCR a very large book: 45 volumes of ~500 pages each. The digitiza...

abdullahkhalids · 2024-10-28T17:42:53 1730137373

Are the footnotes in a different font or fontsize? If so, then the bounding box for footnote words should be smaller. Perhaps that can help with categorization.

bambax · 2024-10-29T09:15:08 1730193308

Yes, thank you, that's one piece of information I examined. The footnote font is slightly smaller, but the difference is very small: on any single word it isn't obvious enough. Averaging the height over a whole line might make it significant, though.

abdullahkhalids · 2024-10-29T17:08:42 1730221722

Word widths should differ more than character heights, so use the width. I would:

1. Manually categorize a few thousand words of normal and footnote text. Then solve a linear system to estimate the width of each letter in the normal font and in the footnote font. With those, you can compute the expected width of any word in either font.

2. When you get a fresh page, go down it line by line. For every word in the line, compare the actual word width with the expected normal width and the expected footnote width, and categorize the word as whichever is closer. Then take a majority vote over the whole line to categorize it as a normal line or a footnote line. Once you hit a footnote line, you are done.
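A rough sketch of the two steps above, in plain Python. Everything here is an assumption for illustration: the function names, the input shape (a list of `(word, width_in_pixels)` pairs taken from the OCR bounding boxes), and the small `ridge` term added only to keep the system solvable when some letters are rare. It also assumes a word's width is simply the sum of its letters' widths, ignoring kerning and inter-letter spacing.

```python
from collections import Counter


def fit_letter_widths(samples, ridge=1e-6):
    """Least-squares estimate of per-letter widths from (word, width) samples.

    Each word gives one equation: sum(count[ch] * width[ch]) = measured width.
    We solve the normal equations (A^T A + ridge * I) x = A^T b.
    """
    alphabet = sorted({ch for word, _ in samples for ch in word})
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    ata = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    atb = [0.0] * n
    for word, width in samples:
        counts = Counter(word)
        for a, ca in counts.items():
            atb[index[a]] += ca * width
            for b, cb in counts.items():
                ata[index[a]][index[b]] += ca * cb
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        atb[col], atb[pivot] = atb[pivot], atb[col]
        for row in range(col + 1, n):
            f = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= f * ata[col][k]
            atb[row] -= f * atb[col]
    x = [0.0] * n
    for row in reversed(range(n)):
        s = sum(ata[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (atb[row] - s) / ata[row][row]
    return dict(zip(alphabet, x))


def expected_width(word, widths):
    """Expected width of a word; unseen letters fall back to the mean width."""
    fallback = sum(widths.values()) / len(widths)
    return sum(widths.get(ch, fallback) for ch in word)


def classify_word(word, actual, normal, footnote):
    """Label a word by whichever expected width is closer to the actual one."""
    d_normal = abs(actual - expected_width(word, normal))
    d_footnote = abs(actual - expected_width(word, footnote))
    return "footnote" if d_footnote < d_normal else "normal"


def first_footnote_line(lines, normal, footnote):
    """Index of the first line where most words look like footnote text.

    `lines` is a list of lines, top to bottom, each a list of (word, width).
    Returns None if no line is voted a footnote line.
    """
    for i, line in enumerate(lines):
        votes = [classify_word(w, width, normal, footnote) for w, width in line]
        if votes and votes.count("footnote") > len(votes) / 2:
            return i
    return None
```

In use, `fit_letter_widths` would be called twice, once on the hand-labelled normal words and once on the hand-labelled footnote words, and the two resulting tables passed to `first_footnote_line` for each new page.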

bambax · 2024-10-30T05:29:49 1730266189

You're right that all I have to do is find the first line of the footnotes, because everything above is the text and everything under it is footnotes.

For now I have chosen a crude approach: there is a gap on the page between the text and the notes, of about one line height. So if one simply takes the first word of each line and compares the vertical distance between consecutive ones, the point where that distance grows significantly is where the footnotes start.

I have tested this method on a dozen pages and it works, but it remains to be seen whether it will stand the test of many pages, especially those that are askew.
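The gap heuristic could be sketched roughly as follows. This is an illustration under stated assumptions, not the code actually used: the function name, the `factor` threshold, and the input (the y coordinate of the first word of each line, sorted top to bottom) are all hypothetical, and no skew correction is attempted.

```python
from statistics import median


def find_footnote_start(line_tops, factor=1.5):
    """Return the index of the first line after an unusually large vertical gap.

    `line_tops` holds the y coordinate of the first word of each line,
    ordered top to bottom. A gap counts as "large" when it exceeds
    `factor` times the median gap on the page. Returns None if no such
    gap exists or if there are too few lines to judge.
    """
    if len(line_tops) < 3:
        return None
    gaps = [below - above for above, below in zip(line_tops, line_tops[1:])]
    typical = median(gaps)
    if typical <= 0:
        return None
    for i, gap in enumerate(gaps):
        if gap > factor * typical:
            return i + 1  # the line just below the large gap
    return None
```

For example, `find_footnote_start([100, 130, 160, 190, 250, 274, 298])` returns `4`: the body lines are 30 units apart, and the jump of 60 before the fifth line marks the start of the notes. Pages where paragraph breaks also add vertical space would need a higher `factor` or a different rule, such as taking the last large gap instead of the first.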

Using the average width of letters instead of their height is a neat idea though; visually it's undeniable that the footnotes and the main text differ more in width than in height. I may resort to that if the crude approach proves too simple!

abdullahkhalids · 2024-10-30T05:39:06 1730266746

Good luck

infoseek12 · 2024-10-28T16:59:43 1730134783

Can you talk about what book you’re trying to digitize?

bambax · 2024-10-29T12:28:21 1730204901

It's an early-20th-century edition of 18th-century memoirs, in French. The project is not secret by any means, but I'd rather not name it directly so as not to create expectations that I may not satisfy.