
When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct
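
If you want to reproduce this programmatically, a minimal sketch against any OpenAI-compatible endpoint looks like the following (the base URL, API key, and model name are placeholders, not the exact setup used above):

    from openai import OpenAI

    # Placeholder endpoint and credentials; any OpenAI-compatible provider works.
    client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

    prompt = (
        "Count how many rs are in the word strawberry. "
        "First, list each letter and indicate whether it's an r and tally as you go, "
        "and then give a count at the end."
    )

    response = client.chat.completions.create(
        model="llama-3.1-405b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)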




This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.



"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.


It’s not impressive that one has to go to that length though.


Imo it's impressive that any of this even remotely works. Especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

There's definitely tons of weaknesses with LLMs for sure, but I continue to be impressed at what they do right - not upset at what they do wrong.


You can always find something to be unimpressed by, I suppose, but the fact that this was fixable with plain English is impressive enough to me.


The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt


I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for when things will or won't work, but like all ML-powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.
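
You can see the issue directly with a quick sketch using the tiktoken library (assuming it is installed; the exact split varies by tokenizer, so the printout is illustrative rather than a claim about any particular model):

    import tiktoken

    # GPT-4-era tokenizer; other models use different vocabularies.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print(tokens)                             # integer token ids, not letters
    print([enc.decode([t]) for t in tokens])  # multi-character chunks rather than s-t-r-a-w-b-e-r-r-y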

I actually think the craziest part of LLMs is how much, as a developer or SME, you can fix with plain English prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain English is a wildly different paradigm! The jury is still out, but I think it's worth being excited about. That's very powerful, since there are a lot more people with good language skills than there are Python programmers or ML experts.


The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.


Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...


> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"


To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d to them but is instead more like a pictogram. Yet they know the spelling of strawberry and can identify an 'r' when it's on its own, so they can count those despite not being able to see the r's in the word itself.


I think it's more that the question is not unlike "is there a double r in strawberry?" or "is the r in strawberry doubled?"

Some people will even make this association, so it's no surprise that LLMs do.


The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.


To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?


You need to prime the other person with a system prompt that makes them compliant and obedient.


I look forward to the day when LLM refusal takes on a different meaning.

"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."


Try asking a young child...


Compared to chat bots of even 5 years ago, the answer of two is still mind-blowing.


This can be automated.


GPT-4o already does that: for problems involving math, it will write small Python programs to handle the calculations instead of doing them with the LLM itself.
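
The generated snippet usually amounts to something like this (hypothetical, but representative of the kind of tool-use code it runs):

    # The kind of throwaway program the model writes and executes for this question.
    word = "strawberry"
    count = sum(1 for letter in word if letter == "r")
    print(count)  # 3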


It “works”, but the LLM having to use a calculator means it doesn’t understand arithmetic well enough and doesn’t know how to follow a set of steps (an algorithm) natively to find the answer for big numbers.

I believe this could be fixed and is worth fixing, because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.


It generates the code to run for the answer. Surely that means it actually knows how to build the appropriate algorithm - it just struggles to perform the actual calculation.


Appending "Think step-by-step" is enough to fix it for both Sonnet and Llama 3.1 70B.

For example, the latter model answered with:

To count the number of Rs in the word "strawberry", I'll break it down step by step:

Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

Identify the letters that are "R": R (first one), R (second one), and R (third one)

Count the total number of Rs: 1 + 1 + 1 = 3

There are 3 Rs in the word "strawberry".


Chain-of-Thought (CoT) prompting to the rescue!

We should always put some effort into prompt engineering before dismissing the potential of generative AI.


Why doesn't the model prompt engineer itself?


Because it is a challenging task: you would need to define a prompt (or a set of prompts) that can reliably generate chain-of-thought prompts for the various generic problems the model encounters.

And sometimes CoT may not be the best approach. Depending on the problem, other prompt engineering techniques will perform better.
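
Roughly, the self-prompting idea would be a two-stage pipeline like the sketch below (the endpoint, model name, and rewriting instruction are all placeholders, and there's no guarantee the rewrite step picks a good strategy):

    from openai import OpenAI

    # Placeholder endpoint and model.
    client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")
    MODEL = "some-instruct-model"

    def rewrite_as_cot(question: str) -> str:
        # Stage 1: ask the model to turn the raw question into a step-by-step prompt.
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": "Rewrite the following question as a prompt that forces "
                           "step-by-step reasoning before the final answer:\n" + question,
            }],
        )
        return r.choices[0].message.content

    def answer(question: str) -> str:
        # Stage 2: answer the rewritten, chain-of-thought version of the question.
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": rewrite_as_cot(question)}],
        )
        return r.choices[0].message.content

    print(answer('How many "r"s are in the word "strawberry"?'))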


By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.


Can’t you just instruct your LLM of choice to transform your prompts like this for you? Basically, feed it a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?



