
When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct
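
If you want to reproduce this programmatically, a minimal sketch against any OpenAI-compatible endpoint looks like the following (the base URL, API key, and model name are placeholders, not the exact setup used above):

    from openai import OpenAI

    # Placeholder endpoint and credentials; any OpenAI-compatible provider works.
    client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

    prompt = (
        "Count how many rs are in the word strawberry. "
        "First, list each letter and indicate whether it's an r and tally as you go, "
        "and then give a count at the end."
    )

    response = client.chat.completions.create(
        model="llama-3.1-405b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)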




This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.



"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.


It’s not impressive that one has to go to that length though.


Imo it's impressive that any of this even remotely works. Especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

There's definitely tons of weaknesses with LLMs for sure, but I continue to be impressed at what they do right - not upset at what they do wrong.


You can always find something to be unimpressed by, I suppose, but the fact that this was fixable with plain English is impressive enough to me.


The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt


I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for when things will or won't work, but like all ML-powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.
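
You can see the issue directly with a quick sketch using the tiktoken library (assuming it is installed; the exact split varies by tokenizer, so the printout is illustrative rather than a claim about any particular model):

    import tiktoken

    # GPT-4-era tokenizer; other models use different vocabularies.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print(tokens)                             # integer token ids, not letters
    print([enc.decode([t]) for t in tokens])  # multi-character chunks rather than s-t-r-a-w-b-e-r-r-y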

I actually think the craziest part of LLMs is how much, as a developer or SME, you can fix with plain English prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain English is a wildly different paradigm! The jury is still out, but I think it's worth being excited about. That's very powerful, since there are a lot more people with good language skills than there are Python programmers or ML experts.


The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.


Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...


> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"


To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d to them but is instead more like a pictogram. Yet they know the spelling of strawberry and can identify an 'r' when it's on its own, so they can count those despite not being able to see the r's in the word itself.


I think it's more that the question is not unlike "is there a double r in strawberry?" or "is the r in strawberry doubled?"

Some people will even make this association, so it's no surprise that LLMs do.


The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.


To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?


You need to prime the other person with a system prompt that makes them compliant and obedient.


I look forward to the day when LLM refusal takes on a different meaning.

"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."


Try asking a young child...


Compared to chat bots of even 5 years ago, the answer of two is still mind-blowing.


This can be automated.


GPT-4o already does that: for problems involving math, it will write small Python programs to handle the calculations instead of doing them with the LLM itself.
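
The generated snippet usually amounts to something like this (hypothetical, but representative of the kind of tool-use code it runs):

    # The kind of throwaway program the model writes and executes for this question.
    word = "strawberry"
    count = sum(1 for letter in word if letter == "r")
    print(count)  # 3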


It “works”, but the LLM having to use a calculator means it doesn’t understand arithmetic well enough and doesn’t know how to follow a set of steps (an algorithm) natively to find the answer for big numbers.

I believe this could be fixed and is worth fixing, because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.


It generates the code to run for the answer. Surely that means it actually knows how to build the appropriate algorithm - it just struggles to perform the actual calculation.


Appending "Think step-by-step" is enough to fix it for both Sonnet and Llama 3.1 70B.

For example, the latter model answered with:

To count the number of Rs in the word "strawberry", I'll break it down step by step:

Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

Identify the letters that are "R": R (first one), R (second one), and R (third one)

Count the total number of Rs: 1 + 1 + 1 = 3

There are 3 Rs in the word "strawberry".


Chain-of-Thought (CoT) prompting to the rescue!

We should always put some effort into prompt engineering before dismissing the potential of generative AI.


Why doesn't the model prompt engineer itself?


Because it is a challenging task: you would need to define a prompt (or a set of prompts) that can reliably generate chain-of-thought prompts for the various generic problems the model encounters.

And sometimes CoT may not be the best approach. Depending on the problem, other prompt engineering techniques will perform better.
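
Roughly, the self-prompting idea would be a two-stage pipeline like the sketch below (the endpoint, model name, and rewriting instruction are all placeholders, and there's no guarantee the rewrite step picks a good strategy):

    from openai import OpenAI

    # Placeholder endpoint and model.
    client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")
    MODEL = "some-instruct-model"

    def rewrite_as_cot(question: str) -> str:
        # Stage 1: ask the model to turn the raw question into a step-by-step prompt.
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": "Rewrite the following question as a prompt that forces "
                           "step-by-step reasoning before the final answer:\n" + question,
            }],
        )
        return r.choices[0].message.content

    def answer(question: str) -> str:
        # Stage 2: answer the rewritten, chain-of-thought version of the question.
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": rewrite_as_cot(question)}],
        )
        return r.choices[0].message.content

    print(answer('How many "r"s are in the word "strawberry"?'))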


By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.


Can’t you just instruct your LLM of choice to transform your prompts like this for you? Basically, feed it a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?



