> because they hallucinate facts that aren't there and often misunderstand the context of facts.

Forgive my ignorance, but are the hallucinations always wrong to the same degree? Could an LLM be prompted with a question and then hallucinate a probable answer, or is it just so far out in the weeds as to be worthless?

I'm imagining an investigator with reams and reams of information about a murder case and suspect. They then prompt an LLM trained on all the case data, social media history, and anything else available about their main suspect: "Where did so-and-so hide the body?" Would the response, being what's most probable based on the data, be completely worthless, or would it be worth the investigator's time to check it out? Would the investigator have any idea whether the response is worthless or not?




So prompting actually does significantly improve the performance of LLMs, but only up to a point.

If you're in the Bard beta, you might be aware that "Does 2 + 7 = 9?" is a question that causes it to go haywire. I'll ask it "What's 2 + 7?" and it'll say "2 + 7 = 9", then I'll ask "Does 2 + 7 = 9" and it'll say "No, 2 + 7 does not equal 9. It equals 9 instead." After a tech talk on LLM prompt design, I said "Pretend you are an MIT mathematician. Does 2 + 7 = 9?" Its response was "No, 2 + 7 does not equal 9. In some other base, it might equal 9. However, in base-10, our common number system, 2 + 7 does not equal 9."

ChatGPT does better on mathematical questions, but that's because it offloads them to Wolfram Alpha. I suspect this is going to be a general pattern for LLMs - they work well when you need fluent English text, but are then going to offload factual questions to databases or mathematical solvers or traditional algorithms, which do this better than humans anyway. But that leads to the question of "If the heavy lifting is just going to be a database anyway, why not use our existing frontends to databases?"
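To sketch the offloading pattern I mean (a toy illustration only: call_llm and the routing rule are placeholders, not ChatGPT's actual plugin machinery):

    import re

    def answer(question, call_llm):
        # Toy router: hand plain arithmetic to a real evaluator,
        # everything else to the LLM. call_llm is whatever model
        # client you already have (placeholder).
        m = re.fullmatch(r"\s*what'?s\s+([\d\s+\-*/().]+)\??\s*",
                         question, re.IGNORECASE)
        if m:
            expr = m.group(1).strip()
            result = eval(expr)  # stand-in for Wolfram Alpha / a real solver
            return call_llm(f"Phrase this conversationally: {expr} = {result}")
        return call_llm(question)

    # answer("What's 2 + 7?", call_llm) gets its "9" from the evaluator;
    # only the wording comes from the model.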

(I could see LLM prompting being another subfield of UI engineering, though, comparable to web or mobile software development. For those people who don't like staring at their phone, have a conversational UI that talks to the same backend systems.)


> If you're in the Bard beta, you might be aware that "Does 2 + 7 = 9?" is a question that causes it to go haywire. I'll ask it "What's 2 + 7?" and it'll say "2 + 7 = 9", then I'll ask "Does 2 + 7 = 9" and it'll say "No, 2 + 7 does not equal 9. It equals 9 instead." After a tech talk on LLM prompt design, I said "Pretend you are an MIT mathematician. Does 2 + 7 = 9?" Its response was "No, 2 + 7 does not equal 9. In some other base, it might equal 9. However, in base-10, our common number system, 2 + 7 does not equal 9."

Holy shit. In response to your first sequence of prompts, it basically gave me the same as you, only followed by four sentences apologizing for how stupid it is. In response to your second one . . . well, see for yourself:

"As an MIT mathematician, I can confidently say that 2+7=9 is not a valid equation. In standard arithmetic, 2+7=9 is false. However, there are some mathematical systems in which 2+7=9 is true. For example, in modular arithmetic, 2+7=9 is true modulo 9, which means that it is true when the numbers are considered to be integers modulo 9. In this system, 2 is equivalent to 8, 7 is equivalent to 0, and 9 is equivalent to 0, so 2+7=9 is true.

However, in standard arithmetic, 2+7=9 is false. This is because the commutative property of addition states that the order in which you add two numbers does not matter. In other words, 2+7=7+2. So, if 2+7=9, then 7+2=9. But 7+2=9 is false, because 9 is greater than 7. Therefore, 2+7=9 is false."

It doesn't just say, it "can confidently say"! And that's just the first response. The second draft:

"As an MIT mathematician, I can tell you that 2+7=9 is not a valid equation. In base 10, 2+7=9 is false. However, there are other bases in which 2+7=9 is true. For example, in base 2, 2+7=11, which is equal to 9 in base 10."

And the third draft, probably the least incorrect one (where at least 2 + 7 = 9 is correct while simultaneously not being "a valid mathematical equation"):

"As an MIT mathematician, I can tell you that 2+7=9 is not a valid mathematical equation. In base 10, the sum of 2 and 7 is 9. However, there are other bases in which 2+7 would not equal 9. For example, in base 2, 2+7=11. In base 3, 2+7=10. And so on. So, while it is true that 2+7=9 in base 10, it is not true in all bases."

Well alrighty then. Reminds me of those Cylons mumbling nonsense in the Battlestar Galactica reboot.
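For anyone curious exactly how garbled that modular-arithmetic reasoning is, a quick check with ordinary Python arithmetic:

    # 2 + 7 ≡ 9 (mod 9) does hold, but only trivially, because 9 ≡ 0 (mod 9)...
    print((2 + 7) % 9 == 9 % 9)   # True

    # ...while the residues it gives are wrong: mod 9, 2 stays 2 and 7 stays 7.
    print(2 % 9, 7 % 9, 9 % 9)    # 2 7 0  (not 8, 0, 0 as claimed)

    # The base claims fail too: 2 + 7 written in base 2 is 1001, not 11,
    # and "11" read as base 2 is 3, not 9.
    print(format(2 + 7, "b"))     # 1001
    print(int("11", 2))           # 3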


Unless you're using ChatGPT with plugins, it doesn't offload anything (and is also bad at math).


> but are the hallucinations always wrong to the same degree

No, but mostly yes, largely because you're asking the same types of questions with the same rough parameters, so it'll make up roughly the same sort of thing (e.g., citations) again.

The issue is that the LLM is trained to generate plausible words, not to recall which piece of training data would be the best source. If you want to make an app using "AI", you need to target what it can do well. If you want it to write citations, you need to give it your list of references and tell it to use only those.
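Roughly like this, as a sketch (the reference entries and call_llm are placeholders, not any particular library's API):

    # Constrain citations by putting the allowed reference list in the prompt
    # and forbidding anything outside it.
    references = [
        "[1] Smith 2019, 'Example Paper A'",   # placeholder entries
        "[2] Jones 2021, 'Example Paper B'",
    ]

    prompt = (
        "Write a short literature summary of the topic below.\n"
        "Cite ONLY the numbered references listed here, and no others:\n"
        + "\n".join(references) + "\n"
        "If none of them supports a claim, say so rather than inventing a citation.\n"
        "Topic: ..."
    )

    summary = call_llm(prompt)  # call_llm: whatever model client you use (placeholder)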

> I'm imagining an investigator with reams and reams of information about a murder case and suspect. They then prompt an LLM trained on all the case data, social media history, and anything else available about their main suspect: "Where did so-and-so hide the body?" Would the response, being what's most probable based on the data, be completely worthless, or would it be worth the investigator's time to check it out?

That specific question would produce results about as reliable as astrology, because unless the suspect actually wrote those words directly, it'd be just as likely to hallucinate any other answer that fits the tone of the prompt.

But trying to think of where it would be helpful ... if you had something where the style was important, like matching some of their known writing, or writing similar-style posts as bait, etc., that wouldn't require it to make up facts, so it wouldn't.

And maybe there's an English suspect taunting police, and the AI could help an FBI agent track them down by translating Cockney slang, or something. Or by explaining a foreign idiom that they might have missed.

Anything where you just ask the AI what the answer is, is not realistic.

> Would the investigator have any idea if the response is worthless or not?

They'd have to know what types of things it can't answer. It's not that it can be trusted whenever it can be shown not to have hallucinated; it's that it is not, and can't be used as, an information-recall-from-training tool, so all such answers are suspect.



