Standard datasets can no longer be used for benchmarking LLMs, since those datasets have already been fed into the models' training data and are thus too well-known to serve as a fair comparison against lesser-known documents.
Oh, you meant just a single benchmarked document. I thought you meant reporting that for every document you process. I wouldn't want to mislead people by giving stats for a particular kind of scan/document, because they likely wouldn't carry over in general.
OCR evaluation has been a thing for decades.
edit: Better than a single document, process a standard OCR dataset: https://paperswithcode.com/task/optical-character-recognitio...
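For what it's worth, the standard metric such an evaluation usually reports is character error rate (CER): total edit distance between ground truth and OCR output, divided by the number of ground-truth characters. A minimal sketch (the helper names and the tiny sample pairs here are illustrative, not from any standard dataset):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(pairs):
    """Character error rate over (ground_truth, ocr_output) pairs."""
    errors = sum(edit_distance(gt, out) for gt, out in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return errors / chars

# Hypothetical sample: one substitution ('o' -> '0') across 20 characters.
sample = [("hello world", "hell0 world"), ("benchmark", "benchmark")]
print(cer(sample))  # -> 0.05
```

Running this over a whole standard dataset instead of a single document is exactly what avoids the "stats for one kind of scan" problem above.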