I've been hacking away at trying to process PDFs into Markdown, having encounter...

dstryr · 2025-05-13T23:44:45 1747179885

Give this project a try. I've been using it with promising results.

aorth · 2025-05-14T05:05:45 1747199145

I tried with one PDF and was surprised to see it connect to some cloud service:

  2025-05-14 07:58:49,373 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
  2025-05-14 07:58:50,446 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/o200k_base.tiktoken HTTP/1.1" 200 361 3922

The project's README doesn't mention that anywhere...

degamad · 2025-05-14T06:26:37 1747203997

The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.

tiktoken downloads its token encodings the first time you use them, but the README does not mention that. It does cache them, so you shouldn't see more of those connections, if I'm understanding the code correctly.

[0] <https://github.com/openai/tiktoken>
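
To make the "download once, then cache" behaviour concrete, here is a minimal sketch of the general pattern. It is not tiktoken's actual code: `load_cached`, the `fetch` callback and the cache layout are hypothetical names chosen for illustration, and the only assumption carried over from the discussion above is that the file is fetched on first use and read from disk afterwards.

```python
import hashlib
import os
from typing import Callable


def load_cached(url: str, fetch: Callable[[str], bytes], cache_dir: str) -> bytes:
    """Return the contents of `url`, calling `fetch` only if no cached copy exists.

    Hypothetical helper for illustration; `fetch` stands in for the HTTPS download.
    """
    # Key the cache file on a hash of the URL so any URL maps to a safe filename.
    cache_key = hashlib.sha1(url.encode()).hexdigest()
    cache_path = os.path.join(cache_dir, cache_key)

    # Cache hit: read from disk, no network connection is made.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()

    # Cache miss: fetch once, then write via a temp file and rename,
    # so an interrupted write cannot leave a partial cache entry behind.
    data = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data
```

With a pattern like this, only the first call for a given URL reaches the network, which would match seeing a single connection in the debug log. As far as I can tell, tiktoken also lets you choose the cache location through a `TIKTOKEN_CACHE_DIR` environment variable, so pre-populating that directory should avoid the download entirely, but check the tiktoken source before relying on that.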

varunneal · 2025-05-14T14:29:32 1747232972

I'll check it out!