kccqzy 48 days ago | on: PDF to Text, a challenging problem

When you use PDF.js from Mozilla to render a PDF file into the DOM, I think you might actually get something pretty close. For example, I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose the output must be very faithful to the original document for that to work.

chaps 48 days ago

Indeed! I've used it to parse documents I've received through FOIA -- sometimes it's just easier to write BeautifulSoup code against the rendered HTML than to deal with PDF's oddities.
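
To make that concrete, here is a rough sketch of the idea, not the actual code I used. It assumes you have saved the rendered page as HTML and that each text <span> carries inline `left` and `top` styles; the exact positioning markup differs between PDF.js versions, so check your own output first. It uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so it runs without extra installs. The names `TextLayerParser` and `extract_lines` and the `tolerance` value are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextLayerParser(HTMLParser):
    """Collect (top, left, text) for each <span> with inline left/top styles."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._pos = None  # (top, left) of the span currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        # Assumption: positions appear as plain numbers, e.g. "left: 72px".
        left = re.search(r"(?<![\w-])left:\s*([\d.]+)", style)
        top = re.search(r"(?<![\w-])top:\s*([\d.]+)", style)
        if left and top:
            self._pos = (float(top.group(1)), float(left.group(1)))
            self._buf = []

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._pos is not None:
            self.spans.append((*self._pos, "".join(self._buf)))
            self._pos = None


def extract_lines(html, tolerance=2.0):
    """Group spans into lines by vertical position, then order left to right."""
    parser = TextLayerParser()
    parser.feed(html)
    parser.close()
    lines = []  # each entry: (top of first span, [(left, text), ...])
    for top, left, text in sorted(parser.spans):
        if lines and abs(top - lines[-1][0]) <= tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append((top, [(left, text)]))
    return [" ".join(t.strip() for _, t in sorted(parts)) for _, parts in lines]


# Hypothetical markup; note the first two spans are out of reading order.
sample = """
<div class="textLayer">
  <span style="left: 150px; top: 100px;">Records</span>
  <span style="left: 72px; top: 100.5px;">FOIA</span>
  <span style="left: 72px; top: 130px;">Page 1</span>
</div>
"""
print(extract_lines(sample))  # ['FOIA Records', 'Page 1']
```

Joining spans with a single space is a simplification: a word split across two spans will come out with a gap in it, and nested spans are not handled, so real documents usually need a smarter merge step.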
