
PDF is a display format. It is optimised for eyeballs and printers. There has been some feature creep. It is a rubbish mechanism for transferring data between machines, but really good for humans and for, say, storing a page of A4 (Letter in the US).

So, you start off with the premise that a .pdf stores text and you want that text. Well that's nice: grow some eyes!

Otherwise, you are going to have to get to grips with some really complicated stuff. For starters, is the text ... text, or is it an image? Your eyes don't care and will just work (especially when you pop your specs back on), but your parser is probably segfaulting madly. It just gets worse.

PDF is for humans to read. Emulate a human to read a PDF.
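
For what it's worth, here is a rough sketch of the "is the text ... text or is it an image?" check. It assumes you have already pulled a page's text layer out with whatever extractor you like, and it only decides whether that string looks usable or whether you should render the page and OCR it instead. The function names and both thresholds are made up for illustration, not taken from any library.

```python
import string


def looks_like_real_text(extracted: str,
                         min_chars: int = 20,
                         min_good_ratio: float = 0.8) -> bool:
    """Guess whether a page's extracted text layer is usable.

    Hypothetical heuristic: thresholds are arbitrary starting points.
    """
    stripped = extracted.strip()
    # Scanned pages usually yield an empty or near-empty text layer.
    if len(stripped) < min_chars:
        return False
    # Broken font encodings tend to yield replacement or control characters.
    good = sum(
        1 for ch in stripped
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )
    return good / len(stripped) >= min_good_ratio


def choose_strategy(extracted: str) -> str:
    """Return "text" to trust the text layer, "ocr" to read it like a human."""
    return "text" if looks_like_real_text(extracted) else "ocr"
```

Run it per page rather than per document, since a single PDF can mix real text pages with scanned ones.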




That's it. I could not have said it better.



