Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

When you use PDF.js from Mozilla to render a PDF file in DOM, I think you might actually get something pretty close. For example I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose it must be very faithful to the original document to make it work.



Indeed! I've used it to parse documents I've received through FOIA -- sometimes it's just easier to write beautifulsoup code compared to having to deal with PDF's oddities.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: