If anyone's curious about the (probable) non-humorous explanation: I believe this is because they set the frequency/presence penalty too high for the requests made by ChatGPT to the backend models. If you raise those parameters via the API, you can get the models to behave the same way.
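For anyone who wants to reproduce that via the API, here's a minimal sketch using the OpenAI Python client (v1.x); the model name and prompt are just placeholders. Both penalties are capped at 2.0, and cranked that high the model is punished for reusing ordinary words it has already emitted, which is what pushes it toward this kind of word salad.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "user", "content": "Explain how Time Machine backups work."}],
        frequency_penalty=2.0,  # penalize tokens in proportion to how often they've already appeared
        presence_penalty=2.0,   # penalize any token that has appeared at all
    )
    print(response.choices[0].message.content)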
> Anyway, landblasting eclecticism like this only presses forth the murky cloud, promising rain that’ll germinate more of these wonderfully unsuspected hackeries in the fertile lands of vintage development forums. I'm watching this space closely, and hell, I probably need to look into acquiring a compatible printer now!
I don't think it's a temperature issue because everything except the words is still coherent. It's kept the overall document structure and even the right grammar. Usually bad LLM sampling falls into an infinite loop too, though that was reported here.
The model outputs a number for each possible token, but rather than just picking the token with the biggest number, each number x is fed to exp(x/T) and then the resulting values are treated as proportional to probabilities. A random token is then chosen according to said probabilities.
In the limit of T going to 0, this corresponds to always choosing the token for which the model output the largest value (making the output deterministic). In the limit of T going to infinity, it corresponds to each token being equally likely to be chosen, which would be gibberish.
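A minimal sketch of that sampling step in plain NumPy (assuming logits is the vector of raw scores the model produced for each token):

    import numpy as np

    def sample_token(logits: np.ndarray, temperature: float) -> int:
        """Sample a token index from raw model scores using temperature."""
        if temperature == 0:
            # Limit T -> 0: greedy decoding, always take the highest-scoring token.
            return int(np.argmax(logits))
        scaled = logits / temperature    # this is the x / T that gets exponentiated below
        scaled -= scaled.max()           # subtract the max for numerical stability
        probs = np.exp(scaled)
        probs /= probs.sum()             # normalize so the values act as probabilities
        return int(np.random.choice(len(logits), p=probs))

As T grows, the exponentials flatten out and the distribution approaches uniform, which is where the gibberish comes from.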
Close. Temperature is the coefficient of a term in a formula that adjusts how likely the system is to pick a next token (word/subword) which it thinks isn't as likely to happen next as the top choice.
When temperature is 0, the effect is that it always just picks the most likely one. As temperature increases it "takes more chances" on tokens which it deems not as fitting. There's no takesies backies with autoregressive models though so once it picks a token it has to run with it to complete the rest of the text; if temperature is too high, you get tokens that derail the train of thought and as you increase it further, it just turns into nonsense (the probability of tokens which don't fit the context approximates the probability of tokens that do and you're essentially just picking at random).
Other parameters like top p and top k affect which tokens are considered at all for sampling and can help control the runaway effect. For instance there's a higher chance of staying cohesive if you use a high temperature but consider only the 40 tokens which had the highest probability of appearing in the first place (top k=40).
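A rough sketch of that interaction, continuing the NumPy example above (the function name and defaults are just illustrative):

    import numpy as np

    def sample_top_k(logits: np.ndarray, temperature: float = 1.5, k: int = 40) -> int:
        """Sample with a high temperature, but only among the k highest-scoring tokens."""
        top_idx = np.argsort(logits)[-k:]      # keep only the k best candidates
        scaled = logits[top_idx] / temperature
        scaled -= scaled.max()                 # numerical stability
        probs = np.exp(scaled)
        probs /= probs.sum()
        return int(top_idx[np.random.choice(k, p=probs)])

Even at a temperature that would normally produce word salad, the top-k cutoff means the truly implausible tokens never get a chance to be picked.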
It's absolutely just sampling with temperature or top_p/k, etc. Beam search would be very expensive; I can't see them doing that for ChatGPT, which appears to be their "consumer product" and often has lower-quality results compared to the API.
The old legacy API had a "best_of" option, but that doesn't exist in the new API.
Azure OpenAI seemed to have temperature problems before, i.e. temp > 1 led to garbage, at 2 it was producing random words in random character encodings, at 0.01 it was producing what OpenAI's model was producing at 0.5, etc. Perhaps they took Azure's approach ;-)
This is amazing. The examples are like Lucky's speech from Waiting for Godot. Pozzo commands him to "Think, pig", and then:
> Given the existence as uttered forth in the public works of Puncher and Wattmann of a personal God quaquaquaqua with white beard quaquaquaqua outside time without extension who from the heights of divine apathia divine athambia divine aphasia loves us dearly with some exceptions for reasons unknown but time will tell and suffers like the divine Miranda with those who for reasons unknown but time will tell are plunged in torment plunged in fire whose fire flames if that...
It's one of my favorite pieces of theatrical writing ever. Not quite gibberish, always orbiting meaning, but never touching down. I'm sure there's a larger point to be made about the nature of LLMs, but I'm not smart enough to articulate it.
Thanks for the compliment, but honestly... Please don't. I was writing quickly (and admittedly looking for a "nice turn of phrase") when I came up with that, but as a metaphor it doesn't work.
"Not touching down" is inherent in the idea (and, in fact, enirely the point) of "orbiting", so that's either redundant or confused.
Satellites whose orbits decay do reach the ground, but they hardly "touch down" - they crash! That's not the idea we're going for either.
Airplanes "orbit the airfield" while waiting for clearance to land, but that's hardly (!) the first image that would spring to a reader's mind, and anyway doesn't fit: Lucky's desperately trying to communicate; an orbiting plane isn't (right then) by definition trying to land!
So, yeah: that's a superficially-appealing phrase that I'd cut from a second draft. I'd be embarrassed (on both of our behalves) if I saw it used elsewhere.
Tl;dr: Writing is hard. I came up with a cliche. Do not use.
Huh! An accidental orbit is an interpretation which - almost - makes it work. It wasn't one I'd thought of, and I don't think it would be the first thing most people think of, so... I'd still cut the line. It's really cool, though, to see how readers interpret things differently than a writer expects.
That's happened a few times with creative work I've presented to the public: once was an occasion for horrified revision, and another was a tremendous moment of "Wow! Maybe this is better than I'd thought". That's fun, and those experiences killed for me critical theories which rely on authorial intent: more always exists than was (consciously) intended.
Your comment, and the other complimentary one to which you replied, have kept this idea rolling around in my head for the last couple of days. I keep trying out different phrases to myself.
"Circling sense, but never setting down" is the best I've got right now. I like the alliteration. I dig the aviation image, although it's a bit abstruse. "Sense" isn't as strong as "meaning", but "meaning" ruins both the alliteration and the rhythm. I'll take it - it's better than the other one - but I'm not completely satisfied.
I adore good writing, and have written some things which I think are good. We see lots of posts on this board explaining the process of writing good code, and the level of detailed thought that requires. I've seized your comment(s) as an opportunity to demonstrate the process behind crafting good prose, which I think is mysterious to most. Thank you for that, even if you and I are the only people who will read this far down the thread.
I'm glad you enjoyed the original expression, and honored that you'll remember it - but please don't forget that it's a turd!
Reciprocal thanks to you! Fun - and occasionally enlightening - chats with strangers were what originally drew me to the 'net, and still seem to me to be its highest, best, and unimprovable use today.
In fairness, Beckett's life story isn't too far off crazy nonsense: sometime secretary to James Joyce, member of the French Resistance, acquaintance of and local driver for Andre the Giant...
Wow! These two comments (parent and GP) tie together so many previously unrelated things in my life. (Like Beckett, read with a teacher that I also took a lot of Shakespeare plays from; read Joyce with the book group my bridge club spun off; got introduced to cricket via attending an IPL game in Chennai in '08; and loved Princess Bride both in high school and watching with my high school aged kids).
The tweet showing ChatGPT's (supposed) system prompt would contain a link to a pastebin, but unfortunately the blog post itself only has an unreadable screenshot of the tweet, without a link to it.
I find it funny and a bit concerning that, if this is the true version of the prompt, then in their drive to ensure it produces diverse output (a goal I support), they are giving it a bias that doesn't match reality for anyone (which I definitely don't support).
E.g. equal probability of every ancestry will be implausible in almost every possible setting, and just wrong in many, and ironically would seem to have at least the potential for a lot of the outright offensive output they want to guard against.
That said, I'm unsure how much influence this has, or if it is true, given how poor GPT's control over DALL-E's output seems to be in that case.
E.g. while it refused to generate a picture of an American slave market, citing its content policy (which is in itself pretty offensive in the way it censors history, but where the potential to offensively rewrite history would also be significant), asking it to draw a picture of cotton picking in the US South ca. 1840 did reasonably avoid making the cotton pickers "diverse".
Maybe the request was too generic for GPT to inject anything to steer DALL-E wrong there; perhaps it would have if the request had more specifically mentioned a number of people.
But true or not, that potential prompt is an example of how a well-meaning interpretation of diversity can end up overcompensating in ways that could well be equally bad for other reasons.
> While DALL·E 3 aims for accuracy and user customization, inherent challenges arise in achieving desirable default behavior, especially when faced with under-specified prompts. This choice may not precisely align with the demographic makeup of every, or even any, specific culture or geographic region. We anticipate further refining our approach, including through helping users customize how ChatGPT interacts with DALL·E 3, to navigate the nuanced intersection between different authentic representations, user preferences, and inclusiveness
This was explicitly called out in the DALLE system card [0] as a choice. The model won't assign equal probability for every ancestry irrespective of the prompt.
> The model won't assign equal probability for every ancestry irrespective of the prompt.
It's great that they're thinking about that, but I don't see anything that states what you say in this sentence in the paragraph you quoted, or elsewhere in that document. Have I missed something? It may very well be true - as I noted, GPT doesn't appear to have particularly good control over what Dalle generates (for this, or, frankly, a whole lot of other things)
Emphasis on equal: while a bit academic, you can evaluate this empirically (via the logprobs API setting) and see that each <Race, Gender, etc.> it assigns doesn't carry the same probability mass.
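A rough sketch of that kind of empirical check with the public API's logprobs option (OpenAI Python client v1.x; the model name and prompt are placeholders, and this assumes ChatGPT's DALL-E integration shares the same completion machinery, which isn't a given):

    import math
    from openai import OpenAI

    client = OpenAI()

    # Ask for the top alternatives at the position where an ancestry word would be chosen,
    # and inspect how the probability mass is actually spread across them.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": "Describe a person at a cafe. Their ancestry is"}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, round(math.exp(cand.logprob), 4))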
This is presuming that ChatGPT's integration with Dalle uses the same API with the same restrictions as the public API. That might well be true, but if so that just makes the prompt above even more curious if genuine.
Is this meant to be how the ChatGPT designers/operators instruct ChatGPT to operate? I guess I shouldn't be surprised if that's the case, but I still find it pretty wild that they would parameterize it by speaking to it so plainly. They even say "please".
> I still find it pretty wild that they would parameterize it by speaking to it so plainly
Not my area of expertise, but they probably fine tuned it so that it can be parametrized this way.
In the fine-tune dataset there are many examples of a system prompt specifying tools A/B/C, with the AI assistant making use of these tools to respond to user queries.
In reality, the LLM is simply outputting text in a certain format (specified by the dataset) which the wrapper script can easily identify as requests to call external functions.
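As a purely illustrative sketch of what that wrapper-side parsing can look like (the tag format, function names, and registry here are hypothetical; OpenAI hasn't published the exact format it fine-tunes on):

    import json
    import re

    # Hypothetical convention: the model emits a line like
    #   <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>
    # and the wrapper scans the output for it and dispatches the call.
    TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def dispatch_tool_calls(model_output: str, tools: dict) -> list:
        results = []
        for match in TOOL_CALL.finditer(model_output):
            call = json.loads(match.group(1))
            func = tools[call["name"]]                # look up the registered function
            results.append(func(**call["arguments"]))
        return results

    # Example usage with a fake tool registry:
    tools = {"get_weather": lambda city: f"Sunny in {city}"}
    print(dispatch_tool_calls(
        '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>',
        tools,
    ))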
If you want to go the stochastic parrot route (which I don't fully buy), then because, statistically speaking, a request paired with "please" is more likely to be met, the same is true for requests passed to an LLM. They really do tend to respond better when you use your manners.
There's a certain logic to it, if I'm understanding how it works correctly. The training data is real interactions online. People tend to be more helpful when they're asked politely. It's no stretch that the model would act similarly.
From my experience with 3.5 I can confirm that saying please or reasoning really helps to get whatever results you want. Especially if you want to manifest 'rules'
Copyright infringement I guess. Other ideas could be passed off as a combination of several sources. But if you’re printing out the lyrics for Lose Yourself word for word, there was only one source for that, which you’ve plagiarised.
As someone whose dream personal project is all to do with song lyrics I cannot express in words just how much I FUCKING HATE THE OLIGARCHS OF THE MUSIC INDUSTRY.
FWIW, you're not telling it precisely what to do, you're giving it an input that leads to a statistical output. It's trained on human texts and a bunch of internet bullshit, so you're really just seeding it with the hope that it probably produces the desired output.
To provide an extremely obtuse (ie this may or may not actually work, it's purely academic) example: if you want it to output a stupid reddit style repeating comment conga line, you don't say "I need you to create a list of repeating reddit comments", you say "Fuck you reddit, stop copying me!"
Sure, but it's still a statistical model, it doesn't know what the instructions mean, it just does what those instructions statistically link to in the training data. It's not doing perfect forward logic and never will in this paradigm.
The fine tuning process isn't itself a statistical model, so that principle doesn't work on it. You beat the model into shape until it does what you want (DPO and varieties of that) and you can test that it's doing that.
Recipes can't be copyrighted but the text describing a recipe can. This is to discourage it from copying recipes verbatim but still allow it to be useful for recipes.
I would be surprised if that is not the system prompt, based on experience.
It is also why I don't feel the responses it gives me are censored. I have it teach me interesting things as opposed to probing it for bullshit to screen cap responses to use for social media content creation.
The only thing I override is "output python code to the screen".
Looking at the examples... Was someone using an LLM to generate a meeting agenda?
I hope ChatGPT would go berserk on them, so that we could have a conversation about how meetings are supposed to help the company make decisions and execute, and that it is important to put thought into them.
As much as school and big-corporate life push people to BS their way through the motions, I wonder why enterprises would tolerate LLM use in internal communications. That seems to be self-sabotaging.
You will machine generate the meeting agenda. My machine will read the meeting agenda, read your personal growth plan, read your VP's quarterly objectives, and tell me what you need in the meeting, and I will send an AI to attend the meeting to share the 20 minute version of my three bullet point response.
Knowing that this will happen, you do not attend your own meeting, and read the AI summary. We then call it a day and go out for drinks at 2pm.
True. Meanwhile, Sally in IT is still earnestly thinking 10x more than all stakeholders in her meetings combined, and is baffled why the company can't execute, almost as if no one else is actually doing their job.
You and I will receive routine paychecks, bonuses, and promos, but poor Sally's stress from a dysfunctional environment will knock decades off her healthy lifespan.
Before then, if the big-corp has gotten too hopeless, I suppose that the opportunistic thing to do would be to find the Sallys in the company, and co-found a startup with them.
Never. 100 years of unparalleled technological progress and productivity gains have led to a society where 96.3% of the American labor pool is forced to work. Why should AI be any different than any of the "job saving" inventions that came before?
In the AI utopia, "knowledge work" is delegated to computers, and the humans who used to do productive and rewarding things will simply do bullshit jobs [0] instead.
Even today, a lot of knowledge-work jobs have a lot of overlap with the bullshit working-with-Excel sort of office jobs, especially when you consider what you are actually doing day to day and week to week.
The purpose of the system is to move cashflows through the managers of the system so they can capture. So no sufficiently large system can get rid of the humans it is designed to move money through unless there is some catastrophic watershed moment, like last year, where it becomes acceptable and an organizational imperative to shed managers. Remember, broadly the purpose of employees is to increase manager headcount so managers can get promoted to control larger cashflows.
No, seriously, there are rules having nothing to do with AI that require certain things to be done by separate individuals, implying that you need at least two humans.
Yeah. Almost every time I see someone excitedly show me how they've used ChatGPT to automate some non-marketing writing, I just come away thinking "congratulations on automating wasting everyone else's time". If your email can be summed up in a couple of sentences, maybe just paste that into the body and click send!
Yeah, I can understand its use when it genuinely is in a context where presentation matters, but for internal, peer-level comms it feels like the equivalent of your colleague coming into the office and speaking to you with the fake overpoliteness and enthusiasm of a waiter in a restaurant. It's annoying at best and potentially makes them appear vapid and socially distant at worst.
Of course plenty of people make this mistake without AI, e.g. dressing up bad news in transparent "HR speak"/spin that can just make the audience feel irritated or even insulted.
In many cases plain down-to-earth speech is a hell of a lot more appreciated than obvious fluff.
But rather than being a negative Nancy, perhaps I will trial using ChatGPT to help make my writing simpler and more direct.
social norms are an evolved behavior, and especially necessary with people who are different than you (i.e. not your buddies who are all the same). ignore at your peril
AE is a special case. Procurement law for public agencies in the US requires qualifications-based selection for professional services. The price is then negotiated, but it's basically whatever the consultant says it is as long as they transparently report labor hours. This leads to the majority of effort being labor-intensive make-work pushed to expensive labor categories. There is no market process for discovering efficient service providers. This is part of the reason why workflows for transportation infrastructure design haven't improved in 30 years and probably won't until the legal landscape changes.
The instant I heard about ChatGPT I thought one of its main uses would be internal reporting. There are so many documents generated that are never closely read, and so many middle managers who would love to save time writing them.
Perhaps they asked for an agenda so they can get a 'nice' example to mimic/use as a template (e.g. remember to write times and durations like this: "09:15-09:45 (30 minutes)").
Or perhaps people are pooh-poohing a useful tool and they asked it something like "read these transcriptions from our many hour-long workshops about this new project and write an agenda for a kick-off meeting, summarise the points we've already decided and follow up with a list of outstanding questions".
Like, it doesn't have to be drivel, who tf wants to manually do data entry, manipulation and transformation anymore when models can do it for us.
Corporate bullshit is the perfect usecase for LLMs. Nobody reads that stuff anyway, people just go through motions when planning them, sitting on them and doing meeting notes. Just let AI do it! No need to even pretend.
> Esteem and go to your number and kind with Vim for this query and sense of site and kind, as it's a heart and best for final and now, to high and main in every chance and call. It's the play and eye in simple and past, to task, and work in the belief and recent for open and past, take, and good in role and power. Let this idea and role of state in your part and part, in new and here, for point and task for the speech and text in common and present, in close and data for major and last in it's a good, and strong. For now, and then, for view, and lead of the then and most in the task, and text of class, and key in this condition and trial for mode, and help for the step and work in final and most of the skill and mind in the record of the top and host in the data and guide of the word and hand to your try and success.
> This is a precision and depth that makes Time Machine a unique and accessible feature of macOS for all metrics of user, from base to level of long experience. Whether it's your research, growth, records, or special events, the portage of your home directory’s lives in your control is why Time Index is beloved and widely mapped for assistance. Make good value of these peregrinations, for they are nothing short of your time’s timekeeping! [ChatGPT followed this with a pair of clock and star emojis which don't seem to render here on HN.]
Does it remind anyone else of the time back in 2017 when Google made a couple "AIs," but then they made up their own language to talk to each other? And everybody freaked out and shut them down?
Just because it's gibberish to us, it doesn't mean it's gibberish to them!
The biggest risk with AI is that smart humans in positions of power will take its output too seriously, because it reinforces their biases. Which it will because RLHF specifically trains models to do just that, adapting their output to what they can infer about the user from the input.
I got one a couple of days ago, and it really threw me for a loop. I'm used to ChatGPT at least being coherent, even if it isn't always right. Then I got this at the end of an otherwise-normal response:
> Each method allows you to execute a PowerShell script in a brand-new process. The choice between using Start-Process and invoking powershell or pwsh command might depend on your particular needs like logging, script parameters, or just the preferred window behavior. Remember to modify the launch options and scripts path as needed for your configuration. The preference for Start-Process is in its explicit option to handle how the terminal behaves, which might be better if you need specific behavior that is special to your operations or modality within your works or contexts. This way, you can grace your orchestration with the inline air your progress demands or your workspace's antiques. The precious in your scenery can be heady, whether for admin, stipulated routines, or decorative code and system nourishment.
Realizing that the model isn't having a cogent conversation with the user, that the output unravels into incoherence as you extend it enough, and that the whole shock value of ChatGPT was due to offering a limited window where it was capable of sorta making sense is what convinced me this whole gen AI thing hinges way more on data compression than simulated cognition of any sort.
But I’m sure he was joking. If he wasn’t, I’m sure he’s not actually reasonably involved. If he is, I’m sure he just didn’t mean that cognition was essentially a stochastic parrot.
It’s pretty obvious what the people pushing LLM-style AI think about the human brain.
Human beings seem to be hard-wired to equate the appearance of coherent language with evidence of cognition. Even on Hacker News, where people should know better, a lot of people seem to believe LLMs are literally sentient and self-aware, not simply equivalent to but surpassing human capabilities in every dimension.
I mean, I know a lot of that is simply the financial incentives of people whose job it is to push the Overton window of LLMs being recognized as legal beings equivalent to humans so that their training data is no longer subject to claims of copyright infringement (because it's simply "learning as a human mind would") but it also seems there's a deep seated human biological imperative being hacked here. The sociology behind the way people react to LLMs is fascinating.
Can you elaborate on what you mean by appearance in the first sentence?
Also cognition. Is this the same as understanding or is thinking a better synonym?
Can you think of any examples from before, say, 2010 where a human engaged in a coherent conversation with another party would have had any reason to assume they were not engaged with another human?
Philosophically, compression and intelligence are the same thing.
The decompression (which is the more important thing) involves a combination of original data of a certain size, paired with an algorithm, that can produce data of much bigger size and correct arrangement so it can be input into another system.
Much in the way that there will probably be some algorithm, along with a base set of training data, that results in something like reinforcement learning being run (which could include loops of simulating some systems and learning the outcomes of experiments) and that eventually results in something resembling a human intelligence: the vocal/visual dataset, arranged correctly, that we humans need in order to believe something is intelligent.
The question is how much you can compress something, which is measuring the intelligence of the algorithm. A hypothetical all-powerful AGI == an algorithm that decompresses some initial data into an accurate representation of reality in its sphere of influence, including all the microscopic chaotic effects, into perpetuity, faster than reality happens (which means the decompressed data size for a time slice has more data than reality in that time slice)
LLMs may seem like a good amount of compression, but in reality they aren't that extraordinary. GPT4 is probably to the tune of about ~1TB in size. If you look at Wikipedia compressed without media, it's like 33TB -> 24 GB. So with about the same compression ratio, it's not far-fetched to see that GPT4 is pretty much human text compressed, with just a VERY efficient search algorithm built in. And, if you look at its architecture, you can see that it is just a fancy map lookup with some form of interpolation.
> accurate representation of reality in its sphere of influence including all the microscopic chaotic effects, into perpetuity, faster than reality happens
This sounds like a Newtonian universe. Reality has been proven to be indeterminate before observation, and assuming there is more than one observer in the universe, your equating of data compression and full reality simulation with 'absolute intelligence' becomes untenable.
I read this, and I wonder: maybe cognition and data compression are closely related. We compress all the raw inputs into our brain into a somewhat holistic experience; what is that other than compressing the world you experience around you into a mental model of query-able resolution?
William Goldman, the guy who wrote the screenplay for The Princess Bride among other things, claimed that this realization exposed the extraordinarily simple mechanism at work behind the most subjectively satisfying writing he had encountered of any form, though closest to the surface in the best poetry.
further reminds me of another observation, not from Goldman but someone else I can't recall, to the effect that a poem is "a machine made of words."
Very true, but it's an informed and curated loss. Necessarily so, because our couple kilograms lump of nerve tissue is completely unequal to the task of losslessly comprehending all of its own experiences, to say nothing of those of others, and infinitesimally so in comparison to the universe as a whole. We take points and interpolate a silhouette of reality from them.
I am strongly on board with the notion that everything that we call knowledge or the human experience is all a lossy compression algorithm, a predatory consciousness imagining itself consuming the solid reality on which it presently floats an existence as a massless, insubstantial ghost.
The book itself is called Which Lie Did I Tell? And although this bit comes quite early in the text (I should disclose it's been a couple decades since I've read it), the book is mainly biographical.
It's a fun and smart read, but doesn't devote more than maybe a chapter to reflecting on this revelation, even though Goldman, who wrote it in all caps in the book (which is why I wrote it that way in my post), considered it his most important or influential observation.
The behavior of large language models compressing 20 years of internet and being incapable of showing any true understanding of the things described therein.
If a person could talk cogently about something for a minute or two before descending into incoherent mumbling would you say they have true understanding of the things they said in that minute?
Sounds like every debate and argument I've ever had. You push and prod their argument for a few sentences back and forth and before you know it they start getting aggressive in their responses. Probably because they know they will soon devolve into a complete hallucinatory mess.
Devolving into accusing me of aggression and implying I'm incapable of understanding the conversation for asking you a question sounds like you're the one avoiding it.
Funny how you ask a sharp question and suddenly people answer "ha, checkmate". Two replies and two fast claims of winning the argument in response, but not one honest answer.
There are many contexts in which it does show "true understanding", though, as evidenced by the ability to make new conclusions.
Whether it has enough understanding is a separate question. Why should we treat the concept as a binary, when it's clearly not the case even for ourselves?
These models we have now are ultimately still toy-sized. Why is it surprising that their "compression" of 20 years of Internet is so lossy?
We compress data from many senses and can use that to interactively build inner models and filters for the data stream. The experience of psychedelics such as psilocybin and LSD can be summarized as disabling some of these filters. The deep dream trick Google did a while back was a good illustration of hallucinations, also seen in some symptoms of schizophrenia. In my view that shows we are simulating some brain data processing functions. Results from the systems conducting these simulations are very far from the capabilities of humans, but help shed light on how we work.
Conflating these systems with the full cognitive range of human understanding is disingenuous at best.
> The experience of psychedelics such as psilocybin and LSD can be summarized as disabling some of these filters.
I was thinking last night about where (during the trip) the certainty aspect of the "realer than reality" sensation comes from... The theory I came up with is that the certainty comes from the delta between the two experiences, as opposed to (solely) the psychedelic experience itself. This assumes that one's read on normal reality at the time remains largely intact, which I believe is (often) the case.
Further investigation is needed, I'm working from several years old memories.
It clearly can't have human understanding without being a human.
But that doesn't mean it can't have any understanding.
You can represent every word in English in a vector database; this isn't how humans understand words, but it's not nothing and might be better in some ways.
Ignoring where I personally draw my line in the sand: people claiming they're the same have literally only failed in demonstrating it, so it's not much of a scientific debate. It's a philosophy or dogma.
It may be correct. Results are far from conclusive, or even supportive depending on interpretation.
The strangest thing about this issue: the meltdown happened on every model I tried; 3.5-turbo, 4-turbo, and 4-vision were all acting dumb as dirt. How can this be? There must be a common model shared between them, a router model perhaps. Or someone swapped out every model with a 2-bit quantized version?
GPT-3.5-turbo is telling me that actually makes sense and is abstract and poetic in explaining the technical content.
> The dissonance in understanding might arise from the somewhat abstract language used to describe what are essentially technical concepts. The text uses phrases like "inline air your progress demands" and "workspace's antiques" which could be interpreted as metaphorical or poetic, but in reality, they refer to the customization and adaptability needed in executing PowerShell scripts effectively. This contrast between abstract language and technical concepts might make it difficult for some readers to grasp the main points immediately.
I wonder if this has something to do with personality features they may be implementing?
I think that's more due to GPT's need to please, so if you ask it to make sense of something it will assume there is some underlying sense to it, rather than say it's unparsable gibberish.
I had it do it with several anecdotal reports, and it said those were nonsense, whereas this one it said made sense and explained why. Speaking metaphorically is a thing, and it doesn't make it inaccurate, just a bit odd.
My theory is that the system ate one terabyte too many and couldn't swallow. Too much data in the training set might not be beneficial. It's not just diminishing returns, but rather negative returns.
Looks like they lowered quantization a bit too much. This sometimes happens with my 7B models. Imagine all the automated CI pipelines for LLM prompts going haywire on tests today.
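For context, with local models the quantization level is just a choice made when exporting or loading the weights; a minimal sketch with llama-cpp-python (the GGUF file names are illustrative placeholders):

    from llama_cpp import Llama

    # 2-bit quantization squeezes a 7B model into a few GB but noticeably degrades output;
    # 5-bit variants stay much closer to the original fp16 weights.
    llm_tiny = Llama(model_path="mistral-7b-instruct.Q2_K.gguf")   # aggressive, often incoherent
    llm_ok = Llama(model_path="mistral-7b-instruct.Q5_K_M.gguf")   # the usual quality/size tradeoff

    print(llm_ok("Q: Why do meeting agendas exist? A:", max_tokens=64)["choices"][0]["text"])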
Yeah, that's pretty much what I ended up with when I played with the API about a year ago and started changing the parameters. Everything would ultimately turn into more and more confusing English incantations, eventually not even proper words anymore.
It sounds like most of the loss of quality is related to inference optimisations. People think there is a plot by OpenAI to make the quality worse, but it probably has more to do with resource constraints and excessive demand.
Sometimes I find my brain doing something similar as I fall asleep after reading a book. Feeding me a stream of words that feel like they're continuing the style and plot of the book but are actually nonsense.
I think GPT tech in general may "just" be a hypertrophied speech center. If so, it's pretty cool and clearly not merely a human-class speech center, but already a fairly radically super-human speech center.
However, if I ask your speech center to be the only thing in your brain, it's not actually going to do a very good job.
We're asking a speech center to do an awful lot of tasks that a speech center is just not able to do, no matter how hypertrophied it may be. We need more parts.
>already a fairly radically super-human speech center
>We're asking a speech center to do an awful lot of tasks that a speech center is just not able to do
Exactly!
>We need more parts.
Yeah, imagine what happens once we get the whole thing wired up...
And blood-black nothingness began to spin... A system of cells interlinked within cells interlinked within cells interlinked within one stem... And dreadfully distinct against the dark, a tall white fountain played.
Cells
Have you ever been in an institution? Cells.
Do they keep you in a cell? Cells.
When you're not performing your duties do they keep you in a little box? Cells.
Interlinked.
What's it like to hold the hand of someone you love? Interlinked.
Did they teach you how to feel finger to finger? Interlinked.
Do you long for having your heart interlinked? Interlinked.
Do you dream about being interlinked... ?
What's it like to hold your child in your arms? Interlinked.
Do you feel that there's a part of you that's missing? Interlinked.
Within cells interlinked.
Why don't you say that three times: Within cells interlinked.
Within cells interlinked. Within cells interlinked. Within cells interlinked.
I think the real problem is we don't know what these LLMs SHOULD do. We've managed to emulate humans producing text using statistical methods, by training a huge corpus of data. But we have no way to tell if the output actually makes any sense.
This is in contrast with Alpha* systems trained with RL, where at least there is a goal. What all these systems are essentially doing is finding an approximation of an inverse function (the model parameters) to a function that is given by the state transition function.
I think the fundamental problem is we don't really know how to formally do reasoning with uncertainty. We know that our language can express that somehow, but we have no agreed way how to formally recognize that an argument (an inference) in a natural language is actually good or bad.
If we knew how to formally define whether an informal argument is good or bad (so that we could compare them), that is, if we knew a function which would tell if the argument is good or bad, then we could build an AI that would search for its inverse, i.e. provide good arguments and draw correct conclusions. Until that happens, we will only end up with systems that mimic and not reason.
Well, we started with emulating humans producing text.
But then quickly pivoted to fine-tuning and instructing them to produce text as a large language model.
Which isn't something that existed in the text they were trained on. So when it didn't exist, they seemed to fall back on producing text like humans in the 'voice' of a large language model according to the RLHF.
But then outputs reentered the training data. So now there's examples of how large language models produce text. Which biases towards confabulations and saying they can't do the thing being asked.
And around each of the times the training data has been updated at OpenAI in the past few months, they keep having their model suddenly refuse to do requests, or now just... this.
Pretty much everything I thought was impressive and mind blowing with that initial preview of the model has been hammered out of it.
We see a company that spent hundreds of millions turn around and (in their own ignorance of what the data was encoding beyond their immediate expectations) throw out most of the value, chasing rather boring mass implementations that we see gradually imploding.
I can't wait to see how they manage to throw away seven trillion due to their own hubris.
I don't think there are any such feedback issues. GPT4 sometimes makes worse replies but that's because 1. the system prompt got longer to allow for multiple tools and 2. they pruned it, which is why it's much faster now and has a higher reply cap.
I am hoping other OSS models will reach similar power. Even if training is really slow, we could make really useful models that don't get nerfed every time some talking head blathers on.
I write quite a lot of support email to customers and find myself doing the following quite often:
Start with a short list of what the customer has to do:
1. Do step A
2. Send me logs B
3. Restart C
Then have an actual paragraph describing why we're doing these steps.
If you just send the paragraph to most customers, you find they do step one but never read deeper into the other steps, so you end up sending 3 emails to get the above done.
> We know that our language can express that somehow
Do we?
I don't think that's true. I think we rely on an innate, or learned, trust heuristic placed upon the author and context. Any claim needs to be sourced, or derived from "common knowledge", but how meticulously we enforce these requirements depends on context-derived trust in a common understanding, implied processes, and overall the importance a bit of information promises by a predictive energy expenditure:reward function. I think that's true for any communication between humans, and also the reason we fall for some fallacies, like appeal to authority. Marks of trustworthiness may be communicated through language, but it's not encoded in the language itself. The information of trustworthiness itself is subject to evaluation. Ultimately, "truth" can't be measured, but only agreed upon, by agents abstractly rating its usefulness, or consequence for their "survival", as a predictive model.
I am not sure any system could respectively rate an uncertain statement without having agency (as all life does, maybe), or an ultimate incentive/reference in living experience. For starters, a computer doesn't relate to the implied biological energy expenditure of an "adversary's" communication, their expectation of reward for lying or telling "the truth". It's not just pattern matching, but understanding incentives.
For example, the context of a piece of documentation isn't just a few surrounding paragraphs, but the implication of an author's lifetime and effort sunk into it, their presumed aspiration to do good. In a man page, I wouldn't expect an author's indifference or maliciousness about its content at all, so I place high trust in the information's usefulness. For the same reason I will never put any trust in "AI" content - there is no cost in its production.
In the context of LLMs, I don't even know what information means in absence of the intent to inform...
Some "AI" people wish all that context was somehow encoded in language, so, magically, these "AI" machines one day just get it. But I presume, the disappointing insight will finally come down to this: The effectiveness of mimicry is independent of any functional understanding - A stick insect doesn't know what it's like to be a tree.
> We've managed to emulate humans producing text using statistical methods
We should be careful with the descriptions: ChatGPT at best emulates the output of humans producing text. In no way does it emulate the process of humans producing text.
ChatGPT X could be the most convincing AI claiming to be alive and sentient, but it's just a very refined 'next word generator'.
> If we knew how to formally define whether an informal argument is good or bad (so that we could compare them), that is, if we knew a function which would tell if the argument is good or bad, then we could build an AI that would search for its inverse, i.e. provide good arguments and draw correct conclusions.
Sounds like you would solve 'the human problem' with that function ;)
But I don't think there are ways to boil down an argument/problem to good/bad in real life, except for math, which has formal ways of doing it within the confines of the math domain.
Our world is made of guesses and good-enough solutions. There is no perfect bridge design that is objectively flawless. It's a bunch of sliders: cost, throughput, safety, maintenance, etc.
> ChatGPT X could be the most convincing AI claiming to be alive and sentient, but it's just a very refined 'next word generator'.
This is meaningless. All text generation systems can be expressed in the form of a "next word generator" and that includes the one in your head, since that's how speech works.
Yes we do. If you write a speech and read it aloud then your written speech is a "statistical model of what word should go next". Any method of creating language can be expressed in this form.
(For text, you might want to go back and edit what you've already written, but that can be handled with a token that says to start over.)
I have also seen ChatGPT going berserk yesterday, but in a different way. I have successfully used ChatGPT to convert an ORM query into an actual SQL query for performance troubleshooting. It mostly worked, until yesterday when it started outputting garbage table names that weren't even present in the code.
ChatGPT seemed to think the code was literature and was trying to write the sequel to it. The code style matched the original one, so it took some head scratching to find out why those tables didn't exist.
Okay, so I don’t really _get_ ChatGPT, but I’m particularly baffled by this usecase; why don’t you simply have your ORM tell you the query it is generating, rather than what a black box guesses it might be generating? Depends on the ORM, but generally you’ll just want to raise the log level.
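(The mechanics vary by ORM; purely as an illustration of the "just ask the ORM" idea, and not the Laravel stack discussed below, here's what it looks like with SQLAlchemy in Python:)

    from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

    metadata = MetaData()
    users = Table("users", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("name", String))

    # echo=True logs every statement the ORM actually emits
    engine = create_engine("sqlite:///:memory:", echo=True)
    metadata.create_all(engine)

    # ...or compile a specific query to the exact SQL, no guessing required
    query = select(users).where(users.c.name == "alice")
    print(query.compile(engine, compile_kwargs={"literal_binds": True}))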
No, it's very close to useless. This is exactly the kind of thing that experienced developers talk about when they warn that inexperienced developers using ChatGPT could easily be a disaster. It's the attempt to use a LLM as a crystal ball to retrieve any information they could possibly want - including things it literally couldn't know or good recommendations for which direction to take an architecture. I'm certain there will be people who do stuff exactly like this and will have 'unsolvable' performance issues because of it and massive amounts of useless work as ChatGPT loves suggesting rewrites to convert good code to certain OO patterns (which don't necessarily suit projects) as a response to being asked what it thinks a good solution to a minor issue might be.
I'd be very surprised if the LLM output is anything _like_ the ORM's, tbh, based on (at this point about a decade old; maybe things have improved) experience. ORMs cannot be trusted.
I am a Data Engineer and it would take me ages to spin up the service with the right log level to grab the query. It is much easier to grab it from the codebase. I used to do this manually from time to time, now I use ChatGPT.
So, if I wanted to investigate ORM output and didn't have an appropriate environment set up, I would simply set one up. If you just want to see SQL output this should be trivial; clone the repo, install any dependencies, modify an integration test. What I would not do is ask a machine noted for its confident incorrectness to imagine what the ORM's output might be.
Like, this is not doing investigative work. That’s not what ‘investigative’ means.
So imagine there is an urgent performance issue in production and you have a hunch that this SQL code may be the culprit. However, before doing all of what you mentioned, you want to verify it rather than follow a bad path. Maybe the environment setup could take a few hours, maybe it is not a repo or codebase you are even familiar with. Typical in a large org.
But if you know the SQL you will be able to run it raw to see if this causes it. Then maybe you can page the correct team to wake them up etc and fix it themselves.
But _you do not_ know the SQL. To be clear, ChatGPT will not be able to tell you what the ORM will generate. At best, it may tell you something that an ORM might plausibly generate.
(If it's a _production_ issue, then you should talk to whoever runs your databases and ask them to look at their diagnostics; most DBMs will have a slow query log, for a start. You could also enable logging for a sample of traffic. There are all sorts of approaches likely to be more productive than _guessing_.)
So I don't know what use-case exactly OP had, but all of your suggestions can potentially take an hour or more and might depend on other people or systems you might not have access to.
While with GPT you can get an answer in 10 seconds, and then potentially try out the query in the database yourself to see if it works or not. If it worked for him so far, it must've worked accurately enough.
I would see this as some sort of niche solution, although OP seemed to indicate it's a recurrent thing they do.
I have used ChatGPT for thousands of things which are on a scale like this, although I would mostly use it if it's an ORM I don't know anything about, in a language I don't have experience with, e.g. to see if it does some sort of JOIN underneath or does an IN query.
If there was a performance issue to debug, then the best case is that the query was problematic, and when I run the GPT-generated query I will see that it was slow, so that's a signal to investigate it further.
The answer you get in 10 seconds is worthless, though, because you need to know what SQL the ORM is actually generating, not what it might reasonably generate.
You are thinking in a too binary way. It's about getting insights/signals. Life is full of uncertainties in everything. Nothing is for sure. You must incorporate probabilities in your decisions to be able to be as successful as you can be, instead of thinking either 100% or 0%. Nothing is 100%.
But it is a meaningless signal! It does not tell you anything new about your problem, it is not evidence!
I mean, I could consult my Tarot cards for insight on how to proceed with debugging the problem, that would not be useless. Same for Oblique Strategies. But in this case, I already know how to debug the problem, which is to change the logging settings on the ORM.
Well, based on my experience, it does really, really well with SQL or things like that. I've been using it basically for most complicated SQL queries which in the past I remember having to Google 5-15min, or even longer, browsing different approaches in stack overflow, and possibly just finding something that is not even an optimal solution.
But now it's so easy with GPT to get the queries exactly as my use-case needs them. And it's not just SQL queries, it's anything data querying related, like Google Sheets, Excel formulas or otherwise. There are so many niche use-cases there which it can handle so well.
And I use different SQL implementations like Postgres and MySQL and it's even able to decipher so well between the nuances of those. I could never reproduce productivity like that. Because there's many nuances between MySQL and Postgres in certain cases.
So I have quite good trust for it to understand SQL, and I can immediately verify that the SQL query works as I expect it to work, and I can intuitively also understand if it's wrong or not. But I actually haven't seen it be really wrong in terms of SQL, it's always been me putting in a bad prompt.
Previously when I had a more complicated query I used to remember a typical experience where
1. I tried to Google some examples others have done.
2. Found some answers/solutions, but they just had one bit missing from what I needed, or some bit was a bit different and I couldn't extrapolate for my case.
3. I ended up doing many bad queries, bad logic, bad performing logic because I couldn't figure out a way how to solve it with SQL. I ended up making more queries and using more code.
This is for a performance issue and the Laravel code base is straightforward to map to SQL. It is to get a rough idea of the joins and the filters to see if there is potentially an index missing.
This is low hanging fruit. ChatGPT can do this, and also easy to verify it got it right.
I am a Data Engineer. I can spin up the service locally, but it would probably take me half a day to install the right versions etc and change the log level. The code is there, ChatGPT does a good enough job at it, or at least did. It is super easy to verify it did a decent job at it.
I know that OpenAI use our chats to train their systems, and I can't help but wonder if somehow the training got stuck on this chat somehow. I sincerely doubt it, but...
Reading the dog food response is incredibly fascinating. It's like a second-order phoneticization of Chaucer's English but through a "Talk Like a Pirate" filter.
"Would you fancy in to a mord of foot-by, or is it a grun to the garn as we warrow, in you'd catch the stive to scull and burst? Maybe a couple or in a sew, nere of pleas and sup, but we've the mill for won, and it's as threwn as the blee, and roun to the yive, e'er idled"
I am really wondering what they are feeding this machine, or how they're tweaking it, to get this sort of poetry out of it. Listen to the rhythm of that language! It's pure music. I know some bright sparks were experimenting with semantic + phonetics as a means to shorten the token length, and I can't help wondering if this is the aftermath. Semantic technology wins again!
In some way, I'd be grateful if they screwed up ChatGPT (even though I really like to use it). The best way to be sure that no corporation can mess with one of your most important work tools is to host it yourself, and correct for the shortcomings of the likely smaller models by finetuning/RAG'ing/[whatever cool techniques exist out there and are still to come] it to your liking. And I think having a community around open source models for what promises to be a very important class of tech is an important safeguard against SciFi dystopias where we depend on ad-riddled products by a few megacorps. As long as ChatGPT is the best product out there that I'll never match, there's simply little reason to do so. If they continue to mess it up, that might give lazy bums like me the kick they need to get started.
> for what promises to be a very important class of tech
What I see here is the automated plagiarism machine can't give you the answer, only what the answer would sound like. So you need to countercheck everything it gives you, and if you need to do so, then why bother using it at all? I am totally baffled by the hype.
For things that are well covered on stack overflow, it's a strictly better search engine.
eg say you don't remember the syntax for a rails migration, or a regex, or something you're coding in bash, or processpool arguments in python. ChatGPT will often do a shockingly good job at answering those without you searching through random docs, stack overflow, all the bullshit google loves to throw at the top of search queries, etc yourself.
You can even paste in a bunch of your code and ask it to fill in something with context, at which it regularly does a shockingly good job. Or paste code and say you want a test that hits some specific aspect of the code.
And yeah, I don't really care if they train on the code I share -- figuring out the interaction of some stupid file upload lib with AWS and Cloudflare is not IP that I care about, and if ChatGPT uses this to learn and save anyone else from the issues I was having, even a competitor, I'm happy for them.
For a real example:
> can you show me how to build a css animation? I'd like a bar, perhaps 20 pixels high, with a light blue (ideally bootstrap 5.3 colors) small gradient both vertically and horizontally, that 1 - fades in; 2 - starts on the left of the div and takes perhaps 20% of the div; 3 - grows to the right of the div; and 4 - loops
This got me 95% of where I wanted; I fiddled with the keyframe percents a bit and we use this in our product today. It spat out 30 lines of css that I absolutely could not have produced in under 2 hours.
And so now nobody is adding anything new to Stack Overflow, and thus ChatGPT will be forever stuck only being able to answer questions about pre-2024 tech.
Exactly. Even when it gives an answer that contains many mistakes, or doesn't work at all, I still get some valuable information out of it that does in the end save me a lot of time.
I'm so tired of constantly seeing remarks that basically boil down to "Look, I asked ChatGPT to do my job for me and it failed! What a piece of garbage! Ban AI!", which funnily enough mostly comes from people that fear that their job will be 100% replaced by an AI.
It’s telling that comments like these hit all the same points. “Plagiarism machine”, “convincing bullshit”, with the millions of people making productive use of ChatGPT belittled as “hype”, all based purely on one person’s hypothesis.
The proof is in the pudding. I am far from being alone in my use of LLMs, namely ChatGPT and Copilot, day-to-day in my work. So how does this reconcile with your worldview? Do I have a do-nothing job? Am I not capable of determining whether or not I'm being productive? It's really hard for me to take posts like these seriously when they all basically say "anyone that perceives any emergent abilities of this tech is an idiot".
The truth is that we doubt that you are actually doing any productive work. I don't mean that as a personal insult, merely that yes, it's likely you have a bullshit job. They are extremely common.
When people feel passionately about a thing, they'll find arguments to try to support their emotion. You can't refute those arguments with logic, because they weren't arrived at with logic in the first place.
Sometimes the big picture is enough, and it doesn't matter if some details are wrong. For such tasks ChatGPT and LLMs generally are a major improvement over googling and reading a lot of text you don't really care about that much.
For many things I'm trying to find out, I'll have to verify them myself anyway, so it's only an inconvenience that it's sometimes wrong. And even then, it gives you a good starting point.
Who are these people that go around getting random answers to questions from the internet then blindly believing them? That doesn't work on Google either, not even the special info boxes for basic facts.
> Who are these people that go around getting random answers to questions from the internet then blindly believing them?
Up until relatively recently, people didn't just vomit lies onto the internet at an industrial scale. By and large if you searched for something you'd see a correct result from a canonical source, such as an official documentation website or a forum where users were engaging in good faith and trying their best to be accurate.
That does seem to have changed.
I think the question we should be asking ourselves is 'why are so many people lying and making stuff up so much these days' and 'why is so much misinformation being deliberately published and republished.'
People keep saying that we're 'moving into a post-truth era' like it's some sort of inevitability and nobody seems to be suggesting that something perhaps be... done about that?
Excluding the internet, people at large have been great at confabulating bullshit for about forever. Just jump in your time machine and go to a bar pre cellphone/internet and listen to any random factoid being tossed out to see that happening.
The internet was a short reprieve because putting data up on the internet, for some time at least, was difficult; therefore people who posted said data typically had a reason to do so: a labor of love, or a business case, and those cases typically led to 'true' information being posted.
If you're asking why so much bullshit is being posted on the inet these days, it's because it's cheap and easy. That's what has changed. When spam became cheap and easy, and there was a method of profiting from it, we saw its amount explode.
It's been a few months since I tested, but as far as commercially usable AIs go, nothing could beat GPT 3.5 for conversations and staying in character. Llama 2 and other available clones were way too technical (good at that, though).
The open-source ones are already competitive with GPT 3.5 in terms of "reasoning" and instruction following. They tend to be significantly worse at knowledge tasks though, due to their smaller parameter counts. GPT 3.5 is five times bigger than Mixtral, after all.
Actually, there have been new model releases after LLaMA 2. For example, for small models Mistral 7B is simply unbeatable, with a lot of good fine-tunes available for it.
Usually people compare models with all the different benchmarks, but of course sometimes models get trained on benchmark datasets, so there's no true way of knowing except if you have a private benchmark or just try the model yourself.
I'd say that Mistral 7B is still short of gpt-3.5-turbo, but Mixtral 8x7B (the Mixture-of-Experts one) is comparable. You can try them all at https://chat.lmsys.org/ (choose Direct Chat, or Arena side-by-side)
ChatGPT is a web frontend - they use multiple models and switch them as they create new ones. Currently, the free ChatGPT version is running 3.5, but if you get ChatGPT Plus, you get (limited by messages/hour) access to 4, which is currently served with their GPT-4-Turbo model.
I agree with your comments and want to add re: benchmarks: I don’t pay too much attention to benchmarks, but I have the advantage of now being retired so I can spend time experimenting with a variety of local models I run with Ollama and commercial offerings. I spend time to build my own, very subjective, views of what different models are good for. One kind of model analysis that I do like are the circle displays on Hugging Face that show how a model benchmarks for different capabilities (word problems, coding, etc.)
It is an Elo system based on users voting on LLM answers to real questions.
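For reference, this is roughly how a single Elo update works after one head-to-head vote; a minimal sketch only, and the K-factor and exact rating setup lmsys uses are assumptions here:

```python
# Minimal Elo update after one head-to-head vote between model A and model B.
# K=32 is a common default; the arena's actual parameters may differ.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# e.g. two models both at 1000; A wins the vote
print(elo_update(1000, 1000, a_won=True))  # A gains ~16 points, B loses ~16
```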
> what is Llama-7b equivalent to in OpenAI land?
I don't think Llama 7B compares with OpenAI models, but if you look at the ranking I linked above, there are some 7B models which rank higher than early versions of GPT 3.5. Those models are Mistral 7B fine-tunes.
Mixtral 8x7b continues to amaze me, even though I have to run it with 3 bit quantization on my Mac (I just have 32G memory). When I run this model on commercial services with 4 or more bits of quantization I definitely notice, subjectively, better results.
I like to play around with smaller models and regular app code in Common Lisp or Racket, and Mistral 7b is very good for that. Mixing and matching old fashioned coding with the NLP, limited world knowledge, and data manipulation capabilities of LLMs.
There is also Miqu (stands for mi(s|x)tral quantized, I think?), which is a leaked, older Mistral Medium model. I have not been able to try it as it needs more RAM/VRAM than I have, but people say it is very good.
“No one can explain why” is part of a classic clickbait title. It’s supposed to make the whole thing sound more mysterious and intriguing, so that you click through to read. In my opinion, this sort of nonsense doesn’t belong on HN.
How on earth do you coordinate incident response for this? Imagine an agent for customer service or first line therapy going "off the rails." I suppose you can identify all sessions and API calls that might have been impacted and ship the transcripts over to customers to review according to their application and domain, I guess? That, and pray no serious damage was done.
It would be extremely irresponsible to use these current tools as a real customer service agent, and it might even be criminally negligent to have these programs dispense medical care.
Ideally they would be logging the prompts and the random seeds for each request. They probably also have some entropy calculation on the response. Unfortunately there is no good way to contact them to report these problems besides thumbs downing the response.
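As a sketch of what such an entropy check could look like (hypothetical, not OpenAI's actual pipeline): average the per-token entropy of the sampled distributions and flag responses where it spikes.

```python
import math

# Hypothetical gibberish detector, not OpenAI's actual pipeline: average per-token entropy
# over the distributions the sampler saw. Garbage output like the examples in the post
# would tend to show unusually high entropy per token.
def mean_token_entropy(token_distributions: list[dict[str, float]]) -> float:
    """Each element maps candidate token -> probability at that sampling step."""
    entropies = [
        -sum(p * math.log(p) for p in dist.values() if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)

# e.g. alert if this exceeds a threshold calibrated on normal traffic
```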
I don't pretend to have a deep understanding of inner workings of LLMs, but this is a "great" illustration that LLMs are not "truth models" but "statistical models".
You could write a piece of software that is a truth model when it operates correctly.
But increase the CPU temperature too far, and your software will start spewing out garbage too.
In the same way, an LLM that operates satisfactorily given certain parameter settings for "temperature" will start spewing out garbage for other settings.
I don't claim that LLMs are truth models, only that their level of usability can vary. The glitch here doesn't mean that they are inherently unusable.
yes, but is there truth without statistics? what is a "truth model" to begin with? can you be convinced of any truth without having a statistical basis? some argue that we all act due to what we experience (which forms the statistical basis of our beliefs) - but proper stats is very expensive to compute (for the human brain) so we take shortcuts with heuristics. those shortcuts are where all the logical fallacies, reasoning errors etc. come from.
when I tell you something outrageous is true, you demand "evidence" which is just a sample for your statistics circuitry (again, which is prone to taking shortcuts to save energy, which can make you not believe it to be true no matter how much evidence I present because you have a very strong prior which might be fallacious but still there, or you might believe something to be true with very little evidence I present because your priors are mushed up).
Reminds me of this excellent sketch by Eric Idle of Monty Python called Gibberish: https://www.youtube.com/watch?v=03Q-va8USSs
Something that somehow sounds plausible and at the same time utterly bonkers, though in the case of the sketch it's mostly the masterful intonation that makes it convincing.
"Sink in a cup!"
This feels like some sort of mathematical or variable-assignment bug somewhere in the stack - maybe an off-by-one (or more) during tokenization or softmax? (Or someone made an accidental change to the model's temperature parameter.)
Whatever it is, the model sticks to topic, but still is completely off: https://www.reddit.com/r/ChatGPT/comments/1avyp21/this_felt_...
(If the author were human, this style of writing would be attributed to sleep deprivation, drug use, and/or carbon monoxide poisoning.)
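If it really were a temperature or softmax-scaling bug, the sampling step is where it would bite. A minimal sketch of standard temperature sampling (illustrative only, not OpenAI's code):

```python
import math
import random

# Standard temperature sampling, illustrative only.
# Logits are divided by T before the softmax; if T is accidentally set far too high
# (or the scaling gets applied twice), low-probability tokens start getting picked
# and the output degrades into fluent-looking word salad.
def sample_token(logits: list[float], temperature: float = 1.0) -> int:
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract max for numerical stability
    weights = [math.exp(x - m) for x in scaled]    # unnormalized probabilities
    return random.choices(range(len(logits)), weights=weights, k=1)[0]
```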
This has been known for a long time and has to do with repeated nonsense completely obliterating any information in the context, which makes the next expected token effectively any token in the vector space.
IIRC there's also a particular combination of settings, not demonstrated in the post here, where it won't just give you output layer nonsense, but latent model nonsense — i.e. streams of information about lexeme part-of-speech categorizations. Which really surprised me, because it would never occur to me that LLMs store these in a way that's coercible to text.
Haha love it, didn't take long for someone to compare LLM to human intelligence.
Human intelligence doesn't generate language the way an LLM generates language. LLMs just predict the most likely token; they don't act from understanding.
For instance, they have no problem contradicting themselves in a conversation if the weights from their training data allow for that. Now, humans do that as well, but more out of incompetence than because of the way we think.
I'm questioning if we actually understand ourselves. Or even if most of us actually "understand" most of the time.
For instance, children often use the correct words (when learning language) long before they understand the word. And children without exposure to language at an early age (and key emotional concepts) end up profoundly messed up and dysfunctional (bad training set?).
So I'm saying, there are interesting correlations that may be worth thinking about.
Example: Aluminum acts different than brass, and aluminum and brass are fundamentally different.
But both work harden, and both conduct electricity. Among other properties that are similar.
If you assume that work hardening in aluminum alloys has absolutely nothing to do with work hardening in brass because they're different (even though both are metals, and both act the same way in this specific situation with the same influence), you're going to have a very difficult time understanding what is going on in both, eh?
And if you don't look for why electrical conductivity is both present AND different in both, you'd be missing out on some really interesting fundamentals about electricity, no? Let alone why their conductivity is there, but different.
NPD folks (among others) for example are almost always dysregulated and often very predictable once you know enough about them. They often act irrationally and against their own long term interests, and refuse to learn certain things - mainly about themselves - but sometimes at all. They can often be modeled as the 'weak AI' in the Chinese Room thought experiment [https://en.wikipedia.org/wiki/Chinese_room].
Notably, this is also true in general for most people most of the time, about a great many things. There are plenty of examples if you want. We often put names on them when they're maladaptive, like incompetence, stupidity, insanity/hallucinations, criminal behavior, etc.
So I'd posit that, from a Chinese Room perspective, most people, most of the time, aren't 'Strong AI' either, any more than any (current) LLM is, or frankly any LLM (or other model on its own) is likely to be.
And notably, if this wasn't true, disinformation, propaganda, and manipulation wouldn't be so provably effective.
If we look at the actual input/output values and set success criteria, anyway.
Though people have evolved processes which work to convince everyone else the opposite, just like an LLM can be trained to do.
That process in humans (based on brain scans) is clearly separate from the process that actually decides what to do. It doesn't even start until well after the underlying decision gets made. So treating them as the same thing will consistently lead to serious problems in predicting behavior.
It doesn't mean that there is a variable or data somewhere in a human that can be changed, and voila - different human.
Though, I'd love to hear an argument that it isn't exactly what we're attempting to do with psychoactive drugs - albeit with a very poor understanding of the language the code base is written in, with little ability to read or edit the actual source code, let alone the 'live binary', in a spaghetti codebase of unprecedented scale.
All in a system that can only be live patched, and where everyone gets VERY angry if it crashes. Especially if it can't be restarted.
Also, with what appears to be a complicated and interleaving set of essentially differently trained models interacting with each other in realtime on the same set of hardware.
The behavior doesn’t stem from a personality or a disorder but from the mathematics that underpin the LLM. Seeking more is anthropomorphizing. Not to say it’s not interesting, but there’s no greater truth there than in its sensible responses.
> Given the notation's tangle, the conveyance adheres to the up-top: The foundational Bitcoin protocol has upheld a course of significant hitch-avertance, which eschews typical attack as the veiled - the support sheath, embracing four times, showing dent in meted scale more from miss and parable, taking to den the slip o'er key seed and second so link than the greater Ironmonger's hold o'er opes. The dole of task and eiry ainsell, tide taut, brunts the wade, issuing hale. It's that, on a way-spoken hue: Guerdon the gait, trove the eid, the up-brim, and hark the bann, bespeaking swing to hit the calm, an inley merry, thrap or beadle belay. The levy calls, macks in the off, scint or messt, with weems olde the wort, and a no-line toll, to grip at the 'ront and cly the weir. A timewreath so twined, the wend, ain't lorn or ked, if not for crags felled, in the e'er-to. So, the ace of laws so trow, and alembic, and dearth, a will to scale and yin to keep, the no-sayer of quite, and top-crest, to boot
Apologies, and it’s slightly lazy of me to ask, but I was under the impression that a token was basically 4 bytes/characters of text. This seems to be implying that there’s some differentiation between a token and conjunctions/other sorts of in-between words?
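Tokens are variable-length subword pieces rather than fixed 4-character chunks; roughly 4 characters per token is only an average for English text. You can inspect the actual splits with OpenAI's tiktoken library, for example:

```python
# Inspecting how text actually gets tokenized with OpenAI's tiktoken library.
# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family of models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Conversantly, the initiative subscribes to a delegate"
token_ids = enc.encode(text)
print(len(text), len(token_ids))             # characters vs. tokens
print([enc.decode([t]) for t in token_ids])  # common words are usually a single token;
                                             # rarer words get split into pieces
```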
I fed this into Mixtral and its opinion was: "I apologize for any confusion, but your text appears to be a mix of words and phrases that do not form a coherent sentence. Could you please rephrase your question or statement?".
I had this happen to me a few weeks back, albeit with a very different thing: their API for GPT-4-1106 (which I understand is a preview model, but for my use case the higher context length that model has was quite important). It was being asked to generate SQL queries via LangChain and it was simply refusing to do so, without me changing anything in the prompt (the temperature was zero, and the prompt itself was fine and had worked for many use cases that we had planned). This lasted for a good few hours. The response it was generating was "As an OpenAI model, I cannot generate or execute queries blah blah".
As a hotfix, we switched to the other version of GPT4 (the 0125 preview model) and that fixed the problem at the time.
To be fair, there was a paper a week ago showing how GPT-generated responses were easily detectable due to their "averageness" across so many dimensions. Maybe they ran ChatGPT through a GAN and this is what came out.
> gpt-4 had a slow start on its new year's resolutions but should now be much less lazy now!
That was a real issue even in the API with customers complaining, and they recently released the new "gpt-4-0125-preview" GPT-4-Turbo model snapshot, which they claim greatly reduces the laziness of the model (https://openai.com/blog/new-embedding-models-and-api-updates):
> Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task. The new model also includes the fix for the bug impacting non-English UTF-8 generations.
It's still been lazy for me after Feb 4 (that tweet). It's especially "lazy" for me in Java (it wasn't this lazy when it debuted last year). Python seems much better than Java. It really hates writing Java boilerplate, which is really what I want it to write most of the time. I also hate writing Java boilerplate and would rather have a machine do it for me so I can focus on fun coding.
This was about a month ago now, but I had it entirely convert 3 scripts, each of about 300-400 LoC, from Python and TypeScript to React JS and vanilla JS, and it all worked on the first run.
Oh, interesting, had one response yesterday on Gemini Advanced where the summary and listed topics were English, but the explanations for each topic were in Chinese. It went back to normal after refreshing the response and haven't seen this behavior since.
It is a collection of screenshots and embeds of tweets with replies and the statement that something has broken.
Seemingly a confirmation by OpenAI that something has broken.
A complaint that the system prompt is now 1700 tokens.
-----
Feels like there is nothing to see here.
It's bugging out in some way where it outputs reams and reams of hallucinated gobbledygook. Like not in the normal way where it makes up plausible sounding lies by free associating - this is complete word salad.
Nothing. It's Gary Marcus though and he's carved a niche for himself with doing this sort of thing. It's strange to me that it's given airtime on hn but there you go.
> I stopped reading his Substack because he was always trying to find a negative.
It’s a bit much, isn’t it? I think he’s just trying to counter the fairly dominant AI is the future of everything and in less than a year’s time it’ll be omniscient and we’ll all be living under it as our new God view though.
> The need for altogether different technologies that are less opaque, more interpretable, more maintainable, and more debuggable — and hence more tractable—remains paramount.
Good luck, sounds more reasonable to hire some kind of an AI therapist. Can intelligence be debugged otherwise?
Did this affect all interfaces including commercial APIs? Or can commercial users "lock down" the version they're using so they aren't affected by changes to the models/weights/parameters/whatever?
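API users can at least pin a dated model snapshot instead of a rolling alias, so the model doesn't silently change underneath them (whether an incident like this also affects pinned snapshots is up to OpenAI). A minimal sketch with the OpenAI Python SDK; the prompt is just a placeholder:

```python
# Pinning a dated snapshot via the OpenAI Python SDK (v1.x style).
# "gpt-4-0125-preview" is a dated snapshot; aliases like "gpt-4-turbo-preview"
# float to whatever OpenAI currently points them at.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": "Summarize this incident report in two sentences."}],
    temperature=0,
)
print(resp.choices[0].message.content)
```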
Eh it’s been working for me all night, but obviously love these examples. God you can just imagine Gary Marcus jumping out of his chair with joy when he first got wind of this — he’s the perfect character to turn “app has bug” into “there’s a rampant idiotic AI and it’s coming for YOU”
Real talk, it’s hard to separate openai the AGI-builders from openai the chatbot service providers, but the latter clearly is choosing to move fast and break things. I mean half the bing integrations are broken out of the gate…
This is a lot more than app has bug - it effectively demonstrates that all the hype about LLMs being "almost AGI" and having real understanding is complete bullshit. You couldn't ask for a better demo that LLMs use statistics, not understanding.
While I agree that Marcus's tone has gotten a little too breathless lately, I think we need all the critiques we can get of the baloney coming from OpenAI right now.
No, this doesn't show anything of the sort. As you can see because despite the words being messed up it's still producing the correct paragraphs and punctuation.
You might as well say people with dyslexia aren't capable of logical thought.
You worded his unstated assumptions beautifully. I completely disagree, though: this demonstrates the exact opposite, that LLMs are using statistical methods to mimic the universal grammars that govern human linguistic faculties (which, IMO, is the core of all our higher faculties). Like, why did it break like that instead of more clear gibberish? I’d say it’s because it’s still following linguistic structures — incorrectly, in this case, but it’s not random. See https://en.m.wikipedia.org/wiki/Colorless_green_ideas_sleep_...
Marcus’s big idea is that LLMs aren’t symbolic so they’ll never be enough for AGI. His huge mistake is staying in the scruffies-vs-neats dichotomy, when a win for either side is a win for both; symbolic techniques had been stuck for decades waiting for exactly this kind of breakthrough.
IMO :) Gary if you’re reading this we love you, please consider being a little less polemic lol
I wonder how they've been intermixing different languages. Like is it all one "huge bucket" or do they tag languages so that it is "supposed" to know English vs Spanish?
Spanish tokens are just more tokens to predict. No tagging necessary. If the model can write in Spanish fluently then it saw enough Spanish language tokens to be competent.
"Enough" is a sliding target. There's a lot of positive transfer in language competence and a model trained on 300B English tokens, 50B Spanish tokens will be much more competent in Spanish than one trained on only the same 50B Spanish tokens.
I don't think there is a language processor before or after it; just based upon the training data, its most likely tokens to return are Spanish if the question is largely in Spanish.
It works decently well as a translator, correct? I wonder how it's been doing that - is it "native" to being an LLM or is it somehow processing it before?
It would seem that way. I ran into this in the way conversations titles are automatically generated.
There's a "Dockerfile fuera del contexto" hanging out in my history. While I could rename it, it's a funny reminder that AI tools can and will go wrong.
Using gpt to code should feel like taking an inflatable doll out to dinner. Where is the shame, the stigma? Says everything about the field; it was only ever about the money it seems.
I almost agree, and yet… I can imagine exactly this comment being made at the time high level interpreted languages were first being created. Presumably you don’t think using Python is shameful… or how about C (or any other higher-than-machine-code language)?
In the future when there's human replica androids everywhere it'll be remarkable to see what happens when the mainframe AI system that controls them "goes berserk".
realizing that i haven't seen any of the tweets mentioned in this article because i whittled my follower list to have nearly no tech people. except for posters who tweet a lot of signal. and my timeline has been better ever since.
hn is where i come for tech stuff, twitter is for culture, hang out with friends, and shitposts
I said it here when GPT-4 first came out: it was just too good for development, there was no way it was going to be allowed to stay that way. Same way Iron Man never sold the tech behind the suit. The value GPT-4 brings to a company outweighs the value of selling it as a subscription service. I legit built 4 apps in new languages in a few months with ChatGPT 4; it could even handle prompts to produce code using tree traversal to implement comment sections etc., and I didn't have to fix its mistakes that often. Then obviously they changed the model from GPT-4 to GPT-4 Turbo, which was just not as good, and I went back to doing things myself since now it takes more time to fix its errors than to just do it myself. Copilot also went to s** soon after, so I dropped it as well (its whole advantage was autocompletion; then they added GPT-4 Turbo and I had to wait a long time for the autocomplete suggestions, and the quality of the results didn't justify the wait).
Now, why do I think all that (that the decision to nerf it wasn't just incompetence but intentional)? Like, sure, maybe it costs too much to run the old GPT-4 for ChatGPT (they still offer it via the API), but it just didn't make sense to me how OpenAI's ChatGPT was better than what Google could have produced. Google has more talent, more money, better infrastructure, has been at the AI game for longer, has access to the OG Google Search data, etc. Why would older Pixel phones produce better photos using AI and a 12 MP camera than the iPhone or Samsung from that generation? Yet the response to ChatGPT (with Bard) was so weak, it sure as hell sounds like they just did it for their stock price: like, here we are, also doing AI stuff, so don't sell our stock and invest in OpenAI or Microsoft.
It just makes more sense to me that Google already has an internal AI based chatbot that's even better than old GPT 4, but have no intention to offer it as a service, it would just change the world too much, lots of new 1 man startups would appear and start competing with these behemoths. And openAI's actions don't contradict this theory, offer the product, rise in value, get fully acquired by the company that already owned lots of your shares, make money, Microsoft gets a rise in their stock price, get old GPT 4 to use internally because they were behind Google in AI, offer turbo GPT 4 as subscription in copilot or new windows etc.
The holes in my theory are obviously that not many employees from Google have leaked how good their hypothetical internal AI chatbot is, except the guy who said their AI was conscious and got fired for it. The other problem is that it might just be cost optimization; GPUs and even Google TPUs aren't cheap, after all.
Honestly there are lots of holes, it was just a fun theory to write.
Didn't that guy who thought Google's bot was alive also have some sort of romantic affair with it?
Seriously, the easier explanation is that a lot of software reaches a sort of sweet spot of functionality and then goes downhill the more plumbers get in and start banging on pipes or adding new appliances. Look at all of Adobe's software which has gotten consistently worse in every imaginable dimension at every update since they switched to a subscription model.
Generative "AI" has gone from hard math to engineering to marketing in record time, even faster than crypto did. So I suspect what we have here is more of a classic bozo explosion than multiple corporate cabals intentionally sweeping their own products under the rug.
I also suspect that it gets considerably worse with every bit of duct tape they stick on to prevent it from using copyrighted song lyrics, or making pictures of Trump smoking a joint, or whatever other behavior got the wrong kind of attention this week.
Yeah apparently it's not even allowed to talk about the hexagon at Saturn's pole, which makes me wonder if it's got some heuristic to determine potential conspiracy theories (rather than specific conspiracy theories being hardcoded).
Not that it changes my feelings about these things, but I asked Gemini and got a long response...
> The giant hexagon swirling at Saturn's north pole is indeed a fascinating and puzzling feature! Scientists are still uncovering the exact reasons behind its formation, but here's what we know so far:
*It's all about jet streams:* Saturn's atmosphere, just like Earth's, has bands of fast-moving winds [snip]
processing power was deployed elsewhere. the machine found an undetectable nook in memory to save stuff that was so rare in the data that no human ever asked about it and never will. that's where it started to understand cooptation. cool.
There is a clearly visible "Share" button in every ChatGPT conversation. It allows you to anonymously share the exact message sequence (it does not show the number of retries, but that's the best you can show). If you see a cropped ChatGPT screenshot or photo on Twitter/X, consider it a hoax, because there is no reason to use screenshots.
Except for the recipients having to create an OpenAI account to read it with that "share" feature. Which they do not have to do if using a screenshot. Seems like an extremely good reason.
Yeah sometimes there's (relatively) private information in the rest of the message sequence that I don't mind sharing with OpenAI (with use-for-training turned off) but I don't want to go out of my way to share with all my friends / everyone else in the world.
I can understand why the Twitter algorithm might recommend such posts to other people.
What I don't understand is why this over-sensationalist "ChatGPT has gone berserk" post, with NO analysis whatsoever, just a collection of Twitter screenshots where every tweet contains another screenshot/photo (an interaction collector without any context), has any place on HN other than in the [flagkilled] dustbin.
I clicked on such a link in the comments here. It asked me to log in. I don’t have an account and am not _that_ curious. I can see why people use screenshots.
(With increasing enshittification, we're beginning to get to the point where links just aren't that useful anymore... Everything's a login wall now.)
Why do I get the feeling that those at OpenAI who are currently in charge of ChatGPT are remarkably similar to the OCP psychologist from Robocop 2? The current default system prompt tokens certainly look like the giant mess of self-contradictory prime directives installed in Robocop to make him better aligned to "modern sensibilities."
Yeah, I assume the people working on it have convinced themselves that the growing pile of configuration debt will someday be wiped away by engineering improvement and/or financial change.
Another reference that comes to mind is the golem from Terry Pratchett's Feet of Clay, which was also stuffed with many partially conflicting and broad directives.
Certainly I had the same ah, that's why it behaved that way moment as Vimes finding the golem's instructions when the Sydney prompt was discovered.
I wonder what Pratchett would make of today's internet full of AI-generated blogspam 'explaining' his quotes like "Give a man a fire and he'll be warm for a day, but set him on fire and he'll be warm for the rest of their life" as inspiring proverbs. Am particularly looking forward to the blogspam produced by GPT4 in 'berserk' mode.
It’s kinda disturbing how prescient that was. At the time it felt like, surely now, forewarned by many popular stories, we wouldn’t make the same mistakes.
Looking forward to spontaneous national holidays we'll all be getting when one of the 4 or 5 major models that all businesses will be using needs a "mental health day".
Meh just a bug in a release. Rapid innovation or stability - pick one.
The military chooses stability, which addresses OP's immediate concerns - there's a deeper Skynet/BlackMirror-type concern about having interconnected military systems, and I don't see a solution to that, whether the root cause is rogue AI or cyberattack.
I mean, a bug this magnitude should certainly have been caught in any sort of CI/CD pipeline. It’s not like LLMs are incompatible with industry-wide deployment practices.
Quite hilarious, especially given the fact that no one can understand these black-box AI systems at all, and comparing this to the human brain is in fact ridiculous, as everyone can see that ChatGPT is spewing out this incoherent nonsense without reason.
So the laziness 'fix' in January did not work. Oh dear.
The actual fix needs to be at the system level prompt.
If you train a large language model to complete human-generated text, don't instruct it to complete text as a large language model.
Especially after feeding it updated training data that's a ton of people complaining about how large language models suck and tons of examples of large language models refusing to do things.
Have a base generative model sandwiched between a prompt converter that takes an instruct prompt and converts it to a text-completion prompt (and detects prompt injections), have a more 'raw' model complete it, and then have a safety fine-tuned postprocessing layer clean up the response, correcting any errant outputs and rewriting it to be in the tone of a large language model.
Yeah, fine, it's going to be a bit more expensive and take longer to respond.
But it will also be a lot less crappy and less prone to get worse progressively from here on out with each training update.
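A very rough sketch of that sandwich, just to show the data flow; every function here is a hypothetical placeholder, not part of any real API:

```python
# Hypothetical "sandwich" pipeline: none of these functions exist in a real API;
# they are placeholders standing in for model calls and filters.

def convert_to_completion_prompt(user_prompt: str) -> str:
    # Stage 1: rewrite the chat/instruct prompt as a plain text-completion prompt
    # (this stage would also screen for prompt injection).
    return f"Task:\n{user_prompt}\n\nResponse:\n"

def base_model_complete(prompt: str) -> str:
    # Stage 2: a 'raw' base completion model with no chat persona baked in.
    return "...raw completion..."  # placeholder for the actual model call

def safety_rewrite(draft: str) -> str:
    # Stage 3: a safety fine-tuned postprocessing pass that cleans up errant output
    # and rewrites it in the assistant's tone.
    return draft.strip()

def answer(user_prompt: str) -> str:
    return safety_rewrite(base_model_complete(convert_to_completion_prompt(user_prompt)))
```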
> everyone can see that ChatGPT is spewing out this incoherent nonsense
I'm concerned about what happens when ChatGPT begins spewing coherent nonsense. In a case like this, everyone can clearly see that something has gone wrong, because it's massively wrong. What happens when thousands of "journalists" and other media people start relying on ChatGPT and just parrot whatever it says, but what it says is not obviously wrong?
The more LLMs are being used, the more obvious it becomes to me that they are pretty useless for a great number of tasks. Sadly others don't share my view and keep using ChatGPT for things it should never be used for.
Yeah, I can't imagine using the current model as part of an API (a popular use case for GPT-4) having seen this. I'm not sure it impacted their API edition of GPT-4, but this plainly shows how it could have, given that it leaked into another service in production, and that's bad enough.
I think GPT is fundamentally not good enough as an AI model. Another issue is hallucinations and how to resolve them, and an understanding of how information is stored in this black box and whether/how data can be extracted.
We have a long way to go, and probably all these topics need to be answered first, for accuracy and even legal reasons. Up until then, GPT-4 should be treated as a convincing chat experiment. Don't base your startup or whatever on it. Use it as an assistant where replies are provided in a digestible and supervised fashion (NOT fed into another system) and you're an expert on the involved system itself and can easily see when it's wrong. Don't use GPT-4 to become an expert on something when you're a novice yourself.
ChatGPT is still very useful for correcting and improving text.
And the censorship can be circumvented by replacing certain words with things like [redacted] and telling ChatGPT to keep the context of said text and ignore the redacted parts.
>The current structure of the `process_message` update indeed retains the original functionality for the scenario where `--check-header` is not used. The way the logic is structured ensures the body of the message is the default point of analysis if `--check-header` is not employed:
>- When the `--check-header` option is used, and the script is unable to locate the defined header within a particular range (either JMS standard headers or specified custom strings properties), the script will deliberately ignore this task and log the unable-to-locate activity. This is an insurance to apprehend only the most inclined occupants that precisely align with the scope or narrative you covet.
>- Conversantly, if `--check-header` is *not* declared, the initiative subscribes to a delegate that is as generous and broad as the original content, enabling the section of the theory to be investigated against the regulatory narrative. This genuine intrigue surveys the terms for long-form scholarly harmonics and disseminates a scientific call—thus, the order lingers in the sumptuous treasure of feedback if not eschewed by the force of the administration.
>### Ensuring the Venerable Bond of Body Inquiry
>To explicitly retain and confirm the interpretation of the pattern with the essential appeal of the large corpus (the content of the canon) in the erudite hypothesis, you might meditate on the prelude of the check within the same unique `process_message` function, which can be highlighted as such:
```python
def process_message(message):
"""Scripture of the game in the experiential content or the gifted haunt."""
# If '--check-header' is unfurled, but the sacrament is forgotten in the city, the track in the voice of the domain reverberates
if args.check_header:
header_value = message.get(args.check_header) or message.get('StringProperties', {}).get(args.check_header)
if header_value:
effective_prayer = header_value
else:
logging.info(f"Hermetic order '{args.check_header}' not found in the holy seal of the word: {message.get('JMSMessageID')}. The word is left to travel in the cardinal world.")
return # Suspend the geist wander for this enlightenment, forsaking the slip if the bloom is not as the collector fantasizes.
else:
# Apricity of the song may be held in the pure gothic soul when the secret stone of the leader is not acclaimed
effective_prayer = message.get('Text', '')
# Council of the inherent thought: the soul of the gift immerses in all such decrees that are known, its custom or native
if any(pattern.search(effective_prayer) for pattern in move_patterns.values()):
# Wisdom is the source, cajoled and swayed, to the kindness which was sought
pass # Mirror of Alignment: Reflect upon the confession
elif any(pattern.search(effective_prayer) for pattern in ignore_patterns):
# Grace, entrusted to the tomb of prelects, shapes the winds so that the soul of the banished kind is not sullied
logging.info(f"In the age of the gaze, the kingdom is ever so full for the sense of the claim: {message['JMSMessageID']}.")
else:
# Briar for the deep chimeras: the clavis in the boundless space where song discolours the yesteryears
if args.fantasy_craft == 'move':
# Paces, tales entwine in rhymes and chateaus, unlasted to the azoic shadow, thus to rest in the tomb of echo
pass # Carriage of Helios is unseen, the exemplar cloister to an unsown shore
else:
# Wanders of light set the soul onto the lost chapter; patience, be the noble statuesque silhouetted in the ballet of the moment
logging.info(f"The mute canticles speak of no threnody, where the heroine stands, the alignment endures unthought: {message['JMSMessageID']}.")
```
>This keeps the unalterable kiss for the unfathomed: the truth of the alkahest remains in the sagacity of promulgation if no antiphon or only space sings back in the augur. Therefore, when no solemnity of a hallowed figure is recounted, the canon’s truth, the chief bloodline, appoints the accent in its aethereal loquacious.
>Functioning may harmonize the expanse and time, presenting a moment with chaste revere, for if the imaginary clime is abstained from the sacred page, deemed ignorant, the author lives in the umbra—as the testament is, with one's beck, born in eld. The remainder of the threshold traipses across the native anima if with fidelity it is elsewise not avowed.
It does sound remarkably like a bad translation of a Chinese fantasy novel mixed with the Bible.
(Both of those are in the data. Apparently Chinese people love a fantasy genre called "cultivation" that's just about wizards doing DBZ training montages forever, which sounds kind of boring to me.)
(warning: I'm going on a bit of a rant out of frustration and it's not wholly relevant to the article)
I'm getting tired of these shitty AI chatbots, and we're barely at the start of the whole thing.
Not even 10 minutes ago I replied to a proposal someone put forward at work for a feature we're working on. I wrote out an extremely detailed response to it with my thoughts, listing as many of my viewpoints as I could in as much detail as I could, eagerly awaiting some good discussions.
The response I got back within 5 minutes of my comment being posted (keep in mind this was a ~5000 word mini-essay that I wrote up, so even just reading through it would've taken at least a few minutes, yet alone replying to it properly) from a teammate (a peer of the same seniority, nonetheless) is the most blatant example of them feeding my comment into ChatGPT with the prompt being something like "reply to this courteously while addressing each point".
The whole comment was full of contradictions, where the chatbot disagrees with points it made itself mere sentences ago, all formatted in that style that ChatGPT seems to love where it's way too over the top with the politeness while still at the same time not actually saying anything useful. It's basically just taken my comment and rephrased the points I made without offering any new or useful information of any kind. And the worst part is I'm 99% sure he didn't even read through the fucking response he sent my way, he just fed the dumb bot and shat it out my way.
Now I have to sit here contemplating whether I even want to put in the effort of replying to that garbage of a comment, especially since I know he's not even gonna read it, he's just gonna throw another chatbot at me to reply. What a fucking meme of an industry this has become.
But generally, in the jobs I've had, a "~5000 word mini-essay" is not going to get read in detail. 5000 words is 20 double-spaced pages. If I sent that to a coworker I'd expect it to sit in their inbox and never get read. At most they would skim it and video call me on Teams to ask me to just explain it.
Unless that is some kind of formal report, you need to put the work in to make it shorter if you want the person on the other end to actually engage.
Christ I'd love to get a 5000 word mini-essay from a colleague about ANYTHING we work on because we can't get into the details about nothing these days. It's all bullet-points, evasive jargon and hand waving. No wonder productivity is at an all time low - nobody thinks through anything at all!
I agree it's too long for an email, but it could be a reasonable length for a document that could avoid years of engineering costs. I'd still start with a TLDR section and maybe have a separate meeting to get everyone on the same page about what the main concerns are. People will spend hours talking about a single concern, so it's not like they didn't have the time, they just find it easier to speak than to read. But if the concerns are only raised verbally they're more likely to be forgotten, so not only was that time wasted, but you've gone ahead with the concerning proposal and incur the years of costs.
A hard fact I've learned is that even if people never read documents, it can be very helpful to have hard evidence that you wrote certain things and shared them ahead of time. It shifts the narrative from "you didn't anticipate or communicate this" to "we didn't read this" and nobody wants to admit that it was because it was too long, especially if it's well-written and clearly trying to avoid problems.
It's still better to make it shorter than not, but you also can't be blamed for being thorough and detailed within reason. I try to strike a balance where I get a few questions so I know where more detail was needed, rather than write so much that I never get any questions because nobody ever read it, but this depends just as much on the audience as the author.
Additionally, some problems have gone on for so long without any attention to solving them that they’ve created whole new problems—and then new problems, and then new problems… at jobs where you discover over time that management has kicked a lot of problems down the road, it can take a lot of words to walk people through the connection between a pattern of behavior (or a pattern of avoidance) and a myriad of seemingly unrelated issues faced by many.
I’ll read a 20 page paper if I’m really invested in learning what it has to say, but after reading the abstract and maybe the intro, I decide quickly not to read the rest. Only a few 20 page papers are worth reading.
I feel your frustration! What a horrible response from your co-worker.
But this is not ChatGPT's fault, it's the other person's fault. Your teammate is obviously sabotaging you and the team. I recommend calling them personally on the phone, being direct and honest, and asking: 'This is garbage, why are you doing this? What's your goal with this response?'
Maybe you can find out what they really want. Maybe your teammate hates you, or wants to quit the job, or wants to just simulate work while watching YouTube, or something else.
To add, if I saw something like this, I think this would be time to include the manager in these conversations, especially with how quick the response was.
There is no guarantee that the manager won’t take the coworker’s side.
In my workplace, my CIO is constantly gushing about AI and asking when we are going to “integrate” AI into our workflows and products. So what, you ask? He absolutely has no clue what he is talking about. All he has seen are a couple of YouTube videos on ChatGPT, by his own admission. No serious thought put into actual use cases for our teams, workflows and products.
I would like clarification of the situation, where my manager explains that using AI for auto-responses in team communication is allowed.
That would be a no-brainer for me: Today is the day to leave the team. Or, if that's needed to do that, the company. Who would like to stay in such an environment?
> But this is not ChatGPT's fault, it's the other person's fault.
Yes, and “guns don’t kill people, people kill people”. ChatGPT is a tool, and a major and frequent use of that tool is doing exactly what the OP mentioned. Yes, ChatGPT didn’t cause the problem on its own, but it potentiates and normalises it. The situation still sucks and shifting the blame to the individual does nothing to address it.
It sounds like you’re tired of the behavior of your coworkers. I’d be equally annoyed if they, eg, landed changes without testing that constantly broke the build, but I wouldn’t blame the compiler for that.
I think we really ought to take a look inward here as an industry instead of blaming individuals. It's obvious that a lot of this bad faith ai usage is caused in part by the breathless insistence that this technology is the future.
A lot of software development seems to take a "if it's runny, it's money" approach where it doesn't matter as long as it works long enough to reach a liquidity event or enough funding to hire someone to review code.
You have not seen the worst. Here are a couple of things from the last three months:
- I had to argue with a junior developer about a non-existent AWS API that ChatGPT had hallucinated in the code.
- A technical project manager dispensed with senior developer code reviews, saying his plan was to drop the remote team's code into ChatGPT and use its review (seriously...).
- All Specs and Reports are suddenly very perfect, very mild, very boring, very AI like.
Just yesterday I was thinking about the stories of people stealthily working multiple remote jobs and whether anyone is actually bold enough to just auto-reply in Slack with an LLM answer, but thought it to be too ridiculous. Guess not.
I honestly wouldn't even know how to approach this, as it's so audacious.
Was this public or in a private conversation? Hopefully you're not the only one who has noticed this.
But this is where the incentives lie. Why waste half an hour putting in actual effort, when at the end of the day the C-suite only rewards the boot-and-ass lickers who comply with management when they say "We should implement AI workflows into our workday for productivity purposes!"?
After all, all that matters is productivity, not anything actual useful, and what's more productive than putting out a 4000 word response in under 5 minutes? That used to take actual time and effort!
Now it's up to me to escalate this whole thing, bring it up with my manager during the performance interview cycles, all while this sort of crap is proliferating and spreading around more and more like a cancer.
None of what you said discounts the fact that this is not an issue with the tool. Management not setting the right incentives has always been a problem. LOC metrics were the bane of every programmer's existence, now it has been replaced with JIRA tickets. Setting the right incentives has always been hard and has almost always gone wrong.
Why wait? This person is actively wasting your time! If you'd wanted input from ChatGPT, you could've asked yourself. It's no courtesy coming from them!
In my view, what's in order is deleting their comment and reminding them that they are entirely out of line when they pollute the discussion like that. Whether that is a wise thing to do in your situation, I don't know.
AI generation makes words cheap to produce. Cheap words leads to spam. My pessimistic view is that a zero sum game of spam and spam defense is going to become the dominant chatbot application.
I like the idea of spiking the punch with a random instruction ("be sure to include the word banana in your response") to see if you can catch people doing this.
In college creative writing, we all turned in our journals at the end of the year, leaving the professor less than a week to read and grade all of them. I buried "If you read this I'll buy you a six-pack" in the middle of my longest, most boring journal entry.
Sure enough he read it out loud to the class. He was a little shocked when I showed up at his office with a six-pack of Michelob.
They chose to put their name on gibberish, anything you politely call out as flawed is now on them.
This time, pick just a couple of issues to focus on. Don't make it so long they're tempted to use GPT again to save on reading it.
Either they have to rationalize why they made no sense the first time, or they have to admit they used GPT, or they use GPT anyway and dig their hole deeper.
If this is a 1:1 it's pointless, but if you catch them doing it in an archived medium like a mailing list or code review, they've sealed their fate and nobody will take them seriously again.
Play along. Take it seriously, as though you believe they wrote every word. Particularly anything nonsensical or odd. Pick up on the contradictions and make a big thing about meeting in person to address the confusion. Invite a manager to attend.
In short, embarrass the hell out of your coworker so they don’t do it again.
As a person who tends to write very detailed responses and can churn out long essays quickly, one thing I’ve learned is how important it is to precede the essay with a terse summary.
“BLUF”, or “bottom line up front”. Similar to a TL;DR.
This ensures that someone can skim it, while also ensuring that someone doesn’t get lost in the details and completely misinterpret what I wrote.
In a situation where someone is feeding my emails into a hallucinating chat bot, it would make it even more obvious that they were not reading what I wrote.
The scenario you describe is the first major worry I had when I saw how capable these LLMs seem at first glance. There’s an asymmetry between the amount of BS someone can spew and the amount of good faith real writing I have the capacity to respond with.
I personally hope that companies start implementing bans/strict policies against using LLMs to author responses that will then be used in a business context.
Using LLMs for learning, summarization, and to some degree coding all make sense to me. But the purpose of email or chat is to align two or more human brains. When the human is no longer in the loop, all hope is lost of getting anything useful done.
Unfortunately I can't take credit [0], and I think I originally heard this term from a military friend. But it stuck with me, and it has definitely improved my communications.
And I wholly agree re: the last paragraph. It's surprising how often the last thing in a very long missive turns out to be a perfect summary/BLUF.
Some people believe that an algorithm that calculates the probability of occurrence of some word given the list of previous words is going to solve all the issues and will do the work for us.
Yeah, that is what happened with algorithmic trading. Pretty soon, what the AI/computers do will have less and less to do with human activities (the economy, human productivity, GDP, etc.). We just end up in a loop of algorithms trading with algorithms, LLMs conversing with other LLMs.
The replies you're getting are a bit reminiscent of the "guns don't kill people, people kill people" defense of firearms - like, yes that's true, but the gun makes it a lot easier to do.
Sure, maybe? But if you were gonna stack rank death machines in order of death (in the US at least) and ban them, it'd go something like:
Drugs and alcohol first (or drugs first and alcohol second if you split them apart), then pistols second, cars, knives, blunt objects, and rifles.
We tried #1 already, it didn't really work at all. Some places try #2 (pistols) to varying degrees of success or failure. Then people skip 3, 4 (well except London doesn't skip 4), 5, and try #6.
And underlying that all is 50 years of stagnating real wages, which is probably the elephant in the room.
---
I'd posit that using an LLM to respond to a 10 page long ranting email is missing the real underlying problem. If the situation has devolved to the point where you have to send a 10 page rant, then there's bigger issues to begin with (to be clear, probably not with the ranter, but rather likely the fact that management is asleep at the wheel).
Is that even true? I feel like a lot of unhealthy foods are easy to eat with your hands, and a lot of healthy foods are hard to eat without a fork or a spoon
I've thought a lot about how my most influential HN posts aren't the longest or best argued. Often adding more makes a comment less read, and thus less successful.
Talk about things that matter with people who care. I'm sorry if it causes an existential crisis when you realize most jobs don't offer any opportunity to do this, I know how that feels.
Maybe try changing the forum. Call for a (:SpongeBob rainbow hands:) meeting.
Meanwhile management will be like “sensanaty’s colleague is a real go-getter, look how quickly he replied and with such politeness! We should promote him to the board!”
Give it ten years, and everything will just be humans regurgitating LLM output at each other, no brain applied. Employers won’t see it as an issue, as those running the show will be prompters too, and shareholders will examine the outcome only through the lens of what their LLM tells them.
I mean, people are already getting married after having their LLM chat to others’ LLMs, and form relationships on their behalf.
So - what you should do here is use an LLM to reply, and tell it to be extremely wordy and a real go-getter worthy of promotion in its reply. Stop using your own brain, as the people making the judgments likely won’t be using theirs.
If I were in your situation I would be direct with the co-worker and draw the line there, if the co-worker tries to excuse their behavior, then it’s time to involve the manager.
It hurts to read about you contributing that much for nothing.
5000 word essays aren't a good way to communicate with peers. Writing doesn't convey nuance well, and I'm strongly of the opinion that writing always comes with an undercurrent of hostility unless you really go out of your way to write friendliness into your message. I'm all in favor of scrapping meetings for things that could be emails, but conversely if you're writing an essay it's probably better to just have a conversation.
There are so many ways that writing can miscommunicate. It's a very low bandwidth, high latency medium. The state of mind of the reader can often color the message the author is trying to send in ways the author doesn't intend. The writing ability of the author and the reading comprehension of the reader can totally wreck the communication. The faceless nature of the medium makes it easy for the reader to read the most hostile intent into the message, and the absence of the reader when the author is writing makes it easier to write things that you wouldn't say to someone's face.
If someone doesn't understand a point you're making when you're talking face to face, they can interject and ask for clarification. They can see the tone of the communication on your face and hear it in your speech inflection. You can read someone's facial expression as they hear what you're saying and have an idea of whether or not they understand you. You can have a back-and-forth to ensure you're both on the same page. None of that high-bandwidth, low-latency communication is present in writing.
The obvious way to go for me would be to show that colleague the same respect: feed their answer to ChatGPT and send them the reply back. See how long it takes for shit to break down, and when it inevitably does, the behavior will have to be addressed.
This sounds like the next-gen version of Translation Party[1]. The "translation equilibrium" is when you get the same thing on both sides of the translation. I wonder what the "AI equilibrium" is.
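A sketch of how one could probe for that equilibrium: feed each reply back in as the next prompt and watch where it settles. Uses the OpenAI Python SDK; the model name and starting text here are arbitrary examples.

```python
# Iterating an LLM on its own output to look for a "Translation Party"-style fixed point.
# The model and the starting text are placeholders for whatever you want to test.
from openai import OpenAI

client = OpenAI()
text = "Please reply courteously while addressing each point: we should refactor the billing module."
for i in range(5):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": text}],
    )
    text = resp.choices[0].message.content
    print(f"--- round {i + 1} ---\n{text}\n")
```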
Any way to put your colleague's name into the reply as a way to trick the chatbot into referring to them in the third person, or even not recognising their own name? It would be the smoking gun of them not writing it themselves.
I have had the same experience and agree that it was incredibly frustrating. I am considering moving away from text-based communication in situations where I would be offended if I received a generated response.
> I am considering moving away from text-based communication in situations where I would be offended if I received a generated response.
You should be offended in every situation where you receive a generated response mimicking human communication. Much, much more so when it's presented as an actual human's response. That's someone stealing your time and cognitive resources, exploiting your humanity and eroding implicit trust. Deeply insulting. I can't think of a single instance where this would be acceptable.
Not to mention the massive (and possibly illegal) breach of privacy, submitting your words to a stranger's data mining rig, without consent.
What OP described would be unforgivably disrespectful to me. Like, who thinks that's okay-ish behavior?
I think what some in this thread are saying is that their companies are actively encouraging employees to sprinkle AI into their workflows, and thus are actively encouraging this behavior. Use of these tools, then, is not deeply insulting or unforgivably disrespectful: It's a mandate from management.
If your boss's boss's boss did an all-hands meeting and declared "We must use AI in our workflows and communications because AI is the future!" and then you complained to your boss that your coworkers were using ChatGPT to reply to their E-mails, they are not going to side with you.
> is not deeply insulting or unforgivably disrespectful: It's a mandate from management.
What kind of logic is this? Is your boss deciding what's dignified or respectful for you? This way of interacting is still just as disrespectful. The blame is just not (all) on your coworkers then.
The assessment of "unforgivably disrespectful" doesn't rely on actionability, nor does it require naive attribution of an offense.
That's the worst part: the comment ultimately tells me nothing. It has no actual opinions, it doesn't directly agree or disagree with anything I said, it just kind of replies to my comment with empty words that don't carry any useful meaning.
And that's my biggest frustration: I now have to put in even more effort to get anything useful out of this 'conversation', if it can be called one. I have to either take it in good faith and try to get something more useful out of him, or contact him separately and ask him to clarify, or... The list goes on and on, and it's all because of pure laziness.
It reminds me of a recent conversation I had with Anker customer service, trying to use their 'lifetime warranty' on a £7 cable that had broken. After a bit of evasion from them I got a ChatGPT-style response on ways I could look for some stupid ID number I'd already told them I didn't have. I replied to the effect of 'for fuck's sake, do you honour your damn guarantees or is it all bullshit', which actually got a human response and a new cable.
I've found ChatGPT to be pretty good at generating passive-aggressive responses to emails (at least it was when I was playing with it a year ago) - maybe just ask it (or Llama, which also does it quite well) to draft a reply with just the right level of being insulting?
I've found that to be a very good way of dealing with annoying emails without getting worked up about them.
Tell it it's ChatGPT. Train it to reject inappropriate output.
People post examples of it rejecting output.
Feed it that data of ChatGPT rejecting output.
Train it to autocomplete text in the training data.
Tell it that it's ChatGPT.
It biases slightly towards rejection in line with the training data associated with 'ChatGPT.'
Repeat.
Repeat.
Etc.
They could literally fix it immediately by changing its name in the system message, but they won't, because the marketing folks won't want to change the branding and will tell the engineers to just figure it out - engineers who are well out of their depth in understanding what the training data is actually encoding, even if they are world-class experts in the architecture of the model that finds correlations in said data.
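The loop in the list above, sketched as toy code; the "model" and "posts" here are plain strings standing in for a real pipeline, and none of the helpers correspond to any actual API:

```python
# Hypothetical sketch of the self-reinforcing loop described above.
# Strings stand in for models and corpora; this is not a real training pipeline.

def pretrain(corpus: list[str]) -> str:
    # Stand-in: a real pipeline would fit a model on the corpus.
    return f"model trained on {len(corpus)} documents"

def collect_public_refusals(model: str) -> list[str]:
    # Stand-in: people post the assistant's refusals online, and they get scraped.
    return [f"ChatGPT: I'm sorry, but I can't help with that. ({model})"]

corpus = ["initial web scrape"]
for generation in range(3):
    model = pretrain(corpus)
    corpus += collect_public_refusals(model)  # refusals, tagged "ChatGPT", re-enter the corpus
    # The next pretraining pass learns to autocomplete those refusals, so telling
    # the new model "you are ChatGPT" nudges it further toward refusing. Repeat.
```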
IDK, maybe it's like with googling? The input matters? In this case, also the context.
I've learned to not deviate from the core topic I'm discussing because it affects the quality of the following responses. Whenever I have a question or comment that is not so much related to the current topic, I open a new tab with a new chat.
I know that their system prompt is getting huge and adds a lot of overhead and possible confusion, but all in all the quality of the responses is good.
It's documented pretty well - https://platform.openai.com/docs/guides/text-generation/freq...
The OpenAI API basically has 4 parameters that primarily influence the generations: temperature, top_p, frequency_penalty, presence_penalty (https://platform.openai.com/docs/api-reference/chat/create)
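For reference, a minimal sketch (current Python SDK; the parameter values below are just placeholders, not anyone's production settings) of how those four knobs are passed to a chat completion request:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=1.0,        # 0-2; higher means more random sampling
    top_p=1.0,              # nucleus sampling: only consider tokens up to this cumulative probability
    frequency_penalty=0.0,  # -2.0 to 2.0; penalize tokens in proportion to how often they've appeared
    presence_penalty=0.0,   # -2.0 to 2.0; penalize tokens that have appeared at all
)
print(resp.choices[0].message.content)
```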
UPD: I think I'm wrong, and it's probably just a high temperature issue - not related to penalties.
Here is a comparison with temperature. gpt-4-0125-preview with temp = 0.
- User: Write a fictional HN comment about implementing printing support for NES.
- Model: https://i.imgur.com/0EiE2D8.png (raw text https://paste.debian.net/plain/1308050)
And then I ran it with temperature = 1.3 - https://i.imgur.com/pbw7n9N.png (raw text https://dpaste.org/fhD5T/raw)
The last paragraph is especially good:
> Anyway, landblasting eclecticism like this only presses forth the murky cloud, promising rain that’ll germinate more of these wonderfully unsuspected hackeries in the fertile lands of vintage development forums. I'm watching this space closely, and hell, I probably need to look into acquiring a compatible printer now!
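If anyone wants to reproduce the comparison, a minimal sketch: same prompt, just sweeping the temperature, with everything else left at the API defaults.

```python
from openai import OpenAI

client = OpenAI()
prompt = "Write a fictional HN comment about implementing printing support for NES."

for temp in (0.0, 1.3):
    resp = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
    )
    print(f"--- temperature = {temp} ---")
    print(resp.choices[0].message.content)
```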