PDF is a display format. It is optimised for eyeballs and printers. There has been some feature creep. It is a rubbish mechanism for moving data between machines but really good for humans and for, say, storing a page of A4 (Letter for the US).
So, you start off with the premise that a .pdf stores text and you want that text. Well that's nice: grow some eyes!
Otherwise, you are going to have to get to grips with some really complicated stuff. For starters, is the text ... text, or is it an image? Your eyes don't care and will just work (especially when you pop your specs back on) but your parser is probably segfaulting madly. It just gets worse.
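To give a flavour of the "text or image?" question, here is a rough, standard-library-only Python sketch. It is a heuristic, not a parser: it pulls out whatever sits between `stream` and `endstream`, tries to inflate it, and looks for the text-showing operators `Tj`/`TJ` inside a `BT` block, and it separately looks for image XObjects. The function name `classify_pdf_bytes` and its return labels are made up for illustration.

```python
import re
import zlib

# Heuristic patterns only; a real PDF parser has to do far more than this.
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.S)
TEXT_OP = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.S)   # text-showing operators
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")      # embedded raster image


def classify_pdf_bytes(data: bytes) -> str:
    """Crudely guess whether a PDF holds 'text', 'image', 'mixed' or 'unknown'.

    Hypothetical helper: the name and the labels are assumptions made
    for this sketch, not part of any library.
    """
    has_image = IMAGE_XOBJECT.search(data) is not None
    has_text = False
    for match in STREAM.finditer(data):
        raw = match.group(1)
        candidates = [raw]  # the stream may be stored uncompressed
        try:
            # FlateDecode is plain zlib; other filters are simply not handled.
            candidates.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            pass
        if any(TEXT_OP.search(body) for body in candidates):
            has_text = True
            break
    if has_text and has_image:
        return "mixed"
    if has_text:
        return "text"
    if has_image:
        return "image"
    return "unknown"
```

You would call it as `classify_pdf_bytes(open("some.pdf", "rb").read())`, where the path is a placeholder. Expect it to be wrong fairly often: it ignores object streams, encryption and every filter other than Flate, it can be fooled by binary data that happens to contain those byte sequences, a scan with a hidden OCR layer comes back as "mixed", and text drawn as vector outlines has no text operators at all.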
PDF is for humans to read. Emulate a human to read a PDF.