I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days but maintaining a global structure to the document is much trickier. Consistent HTML seems still out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-pass extraction.
The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.
tiktoken downloads token models the first time you use them, but it does not mention that. It does cache the models, so you shouldn't see more of those connections, if I'm understanding the code correctly.