I feel like it almost always starts well, given the full picture, but then for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs, and before you know it you have spent an hour chasing that last mile of connectivity.
For complex questions, I now only use it to get the broad picture and, once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) pushing it to do the entire thing. I guess 80/20 etc.
- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)
- can you add onto this with "abc"
- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.
- what if you just use the language's built-in feature here in "xyz"? doesn't that mean we can do it with just one line of code?
- yes, you're absolutely right. I'm sorry for making this over complicated.
If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting the conversation in a new chat or rewriting my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.
> If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff.
If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.
Humans do this intuitively.
But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.
What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of a subsequent corrected statement that still follows the rest of the conversational history)
If you talk for a while and the facts don't add up and make sense, an intelligent human will notice that, and get upset, and will revisit and dig in and propose experiments and make edits to make all the facts logically consistent. An LLM will just happily go in circles respinning the garbage.
I want to hang out with the humans you've been hanging out with. I know so many people who can't process basic logic or evidence that for my pandemic project a few years ago I did a year-long podcast about it, and even made up a new word to describe people who can't process evidence: "Dysevidentia".
People who have been taught by various forms of news/social media that any evidence presented is fabricated to support only one side of a discussion... and that there's no such thing as impartial, factually based reality, only one that someone is trying to present to them.
Some good suggestions here. I have also had success asking things like, “is this a standard/accepted approach for solving this problem?”, “is there a cleaner, simpler way to do this?”, “can you suggest a simpler approach that does not rely on X library?”, etc.
Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.
It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.
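A rough sketch of what "editing the question" looks like if you're driving the model through a chat-completions-style API rather than a web UI: keep the message list yourself, and when an answer goes wrong, drop the bad assistant turn (and everything after it) before re-asking, instead of appending a correction for the model to argue with. `call_model` below is a hypothetical stand-in for whatever client call you actually use, not a real library function.

```python
# Sketch: prune bad turns from the transcript instead of correcting them in place.
# `call_model` is a hypothetical placeholder for your actual chat-completion client.

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("swap in your LLM client call here")

def retry_without_bad_turns(messages: list[dict], bad_index: int,
                            revised_question: str) -> str:
    """Drop everything from the first bad assistant answer onward,
    replace the question that triggered it, and ask again with a clean context."""
    pruned = messages[:bad_index]                      # keep only the good prefix
    pruned.append({"role": "user", "content": revised_question})
    return call_model(pruned)

# Usage: the assistant's first answer (index 1) already had the wrong error
# handling, so don't keep piling corrections on top of it -- rewind to before
# that turn and re-ask with a sharper question.
history = [
    {"role": "user", "content": "How do I do xyz?"},
    {"role": "assistant", "content": "Here's xyz ... (subtly wrong error handling)"},
    {"role": "user", "content": "Now add abc."},
    {"role": "assistant", "content": "To add abc we also need lmn, ijk, qrs ..."},
]
answer = retry_without_bad_turns(
    history, bad_index=1,
    revised_question="How do I do xyz using the language's built-in feature?")
```

The point of the design is the same as the advice above: the "wrong answers" pattern never makes it into the context you send, so the model has nothing to double down on.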