LLMs are bad databases, so for something like the bible, which is so easily and precisely referenced, why not just... look it up?
This is playing against their strengths. By all means ask them for a summary, or some analysis, or textual comparison, but please, please stop treating LLMs as databases.
A year or so ago there was a complaint from the NY Times (IIRC) that, by asking about some event, they were able to get back one of their articles almost verbatim, alleging that this was a copyright violation. This appears to be a similar outcome, where you do get back the verbatim text. That to me is a good reason to do tests like this, although feel free to do it with the WaPo or some other news outlet instead.
Not sure why you are so upset about a small and neat study ("please, please stop").
If you ask it to summarize (without feeding the entire bible), it needs to know the bible. Knowledge and reasoning are not entirely disconnected.
The ChatGPT chat interface has impressed me when going beyond the scope of TFA, e.g. asking about predestination, the biblical passages for and against, theologians' and scholars' takes on the debate, and then exploring the details in follow-ups. The LLMs have been fed the Bible and all manner of discussion of Bible-related matters. As the grandparent comment suggests, they are much more impressive at interpreting biblical passages, finding passages related to specific topics, and presenting the variety of opinions about them.
> Not sure why you are so upset about a small and neat study
This article is yet another example of someone misunderstanding what an LLM is at a fundamental level. We are all collectively doing a bad job at explaining what LLMs are, and it's causing issues.
Only recently I was talking to someone who loves ChatGPT because it "takes into account everything I discuss with it", only, it doesn't. They think that it does because it's close-ish, but it's literally not at all doing a thing that they are relying upon it to do for their work.
> If you ask it to summarize (without feeding the entire bible), it needs to know the bible.
There's a difference between "knowing" the bible and its many translations/interpretations and being able to reproduce them word for word. I would imagine most biblical scholars can produce better discourse on the bible than ChatGPT, but that few if any could reproduce exact verbatim content. I'm not arguing that testing ChatGPT's knowledge of the bible isn't valuable; I'm arguing that LLMs are the wrong tool for verbatim reproduction, and that testing that (while ignoring the actual knowledge) is a bad test, in the same way that asking students to regurgitate content verbatim is a much less effective way of testing understanding than testing their ability to use that understanding.
To add a little context here, I (the author) understand LLMs aren't the right tool (at least by themselves) for verbatim verse recall. The trigger for me doing the tests was seeing other people in my circles blindly trusting that ChatGPT was outputting verses accurately. My background allowed me to understand why that is a sketchy thing to do, but many people do not, so I wanted to see how worried we really should be.
Thanks for the response, it does sound like you've seen similar treatment of LLMs by others to what I've observed.
I think, though, that an important part of communicating about LLMs is talking about what they are designed to do and what they aren't. This is important because humans want to anthropomorphise, and LLMs are way past good enough for this to be easy, but, much like pets, not being human means they won't live up to those expectations. While your findings show that current large models are quite good at verbatim answers (for one of the most widely reproduced texts in the world), this is likely in no small part down to luck and the current way these models are trained.
My concern is that the takeaway from your article is somewhere between "most models reproduce text verbatim" and "large models reproduce popular text verbatim", where it should probably be that LLMs are not designed to be able to reproduce text verbatim and that you should just look up the text, or at least use an LLM that cites its references correctly.
Check out this video at the 22:20 mark. The goal he’s pursuing is to have the LLM recognize when it’s attempting to make a factual statement, and to quote its training set directly instead of just going with the most likely next token.
Gemini cites web references, NotebookLM cites references in your own material, and the Gemini APIs have features around citations and grounding in web search content. I'm not familiar with OpenAI's or Anthropic's APIs, but I imagine they offer something similar, although I don't think ChatGPT cites content.
All these are doing, however, is fact-checking and linking out to those fact-checking sources. They aren't extracting text verbatim from a database. You could probably get close with RAG techniques, but you still can't guarantee it, in the same way that if you ask an LLM to repeat your question back to you exactly, you can't guarantee that it will do so verbatim.
Verbatim reproduction would be possible with some form of tool use, where, rather than generating the verse itself, the LLM returns a structured request asking the orchestrator to run a tool that inserts the verse from a database.
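To make that concrete, here's a minimal, runnable sketch of the orchestration loop. The tool-call JSON shape and `fake_model` are made up for illustration, not any particular vendor's API; the point is only that the model picks the reference and the verse text itself is copied straight out of a lookup table.

```python
# Toy tool-use loop: the model only decides *which* verse is needed; the exact text
# is spliced in by the orchestrator from a database, so it's verbatim by construction.
# `fake_model` stands in for a real chat API; its shape is an assumption for illustration.

VERSES = {("KJV", "John 11:35"): "Jesus wept."}  # stand-in for a real verse database

def lookup_verse(translation: str, reference: str) -> str:
    """The quoted text always comes from the database, never from model weights."""
    return VERSES.get((translation, reference), f"[{translation} {reference} not found]")

def fake_model(messages):
    # First pass: pretend the model recognised a verse request and asked for the tool.
    if messages[-1]["role"] == "user":
        return {"tool": "lookup_verse",
                "arguments": {"translation": "KJV", "reference": "John 11:35"}}
    # Second pass: the model wraps the tool result rather than regenerating it.
    return {"content": f'"{messages[-1]["content"]}" (John 11:35, KJV)'}

def answer(user_prompt: str, call_model=fake_model) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    reply = call_model(messages)
    if "tool" in reply:
        text = lookup_verse(**reply["arguments"])            # exact text from the database
        messages.append({"role": "tool", "content": text})   # handed back unmodified
        reply = call_model(messages)
    return reply["content"]

print(answer("Quote the shortest verse in the KJV."))
# "Jesus wept." (John 11:35, KJV)
```

If you need a hard guarantee, the orchestrator can return the looked-up text directly instead of handing it back to the model for wrapping, since anything the model re-emits is, again, just generated tokens.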