This seems to be a bit of an unpopular opinion, but I would actually bet that current prompt engineering is just a short-term thing.
As the performance of LLMs continues to improve, I actually expect that they will become much better at understanding not-so-well-formed prompts. Especially when you take into consideration that they are now trained with RLHF on _real_ users' input.
So it will probably become less of an engineering problem and more a matter of articulating what exactly you want.
Learning to say what you want is a skill. Much like you can get better at searching, you can get better at saying what you want.
The framework described in the blog post seems like a more formal way to do it, but there are other ways to iterate in conversation. After seeing the first result, you can explain better what you want. If you're not expecting to repeat the query then maybe that's good enough?
I expect there will be better UIs that encourage iteration. Maybe you see a list of suggested prompts that are similar and decide which one you really want?
True, but “learning to figure out what someone wanted to say but wasn’t able to express themselves” is also a skill, I expect LLMs will be able to learn that pretty well.
Imagine you prompt ChatGPT4.5 and it doesn't give you what you want. You click the thumbs down. ChatGPT says "Hold on, let me try again". (Behind the scenes, OpenAI runs your prompt through their "prompt improver", replaces your prompt with that prompt, and just shows you the output; no point showing the optimized prompt since it might be gibberish to a human.) The new response actually is what you want, so you click thumbs up on the new output. That "thumbs down, thumbs up" pattern generates high-quality labeled data for training the prompt improver, at very little cost.
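Roughly, the data-collection loop being described might look like the sketch below. Everything here, llm(), improve_prompt(), handle_session(), is a made-up stand-in for a hypothetical system, not any real OpenAI API:

    # Hypothetical sketch of the "thumbs down, thumbs up" data-collection loop.
    # llm() and improve_prompt() are stubs standing in for model calls.
    def llm(prompt: str) -> str:
        return f"<model output for: {prompt}>"

    def improve_prompt(prompt: str) -> str:
        return f"Rewrite the request clearly, then answer it: {prompt}"

    training_pairs = []  # (raw_prompt, improved_prompt) pairs labeled by user feedback

    def handle_session(raw_prompt: str, first_feedback: str, second_feedback: str) -> str:
        answer = llm(raw_prompt)
        if first_feedback == "thumbs_down":
            better_prompt = improve_prompt(raw_prompt)  # user never sees this prompt
            answer = llm(better_prompt)
            if second_feedback == "thumbs_up":
                # the down/up pattern yields a cheap labeled example for the improver
                training_pairs.append((raw_prompt, better_prompt))
        return answer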
Yes, "thumbs up, thumbs down" voting is a pretty good way to collect directionally unambiguous feedback that can easily be aggregated across many users to serve as training data.
But it's terrible interaction design for communicating what you want. Imagine trying to fulfill the 1001st most common user need and pressing "thumbs down" 1000 times until you finally get there.
In such cases, being able to construct a more specific prompt would save you a lot of time.
Remember when Google search actually found the words you were searching for, instead of mapping them to some quasi-common search keywords over a shallow pool of top-ranked websites?
Right now, most of these AI systems are no better than current Google search. Weird and ironic.
The next major leap in LLMs (in the next year) is probably going to be the prompt context size. Right now we have 2k, 4k, 8k ... but OpenAI also has a 32k model that they're not really giving access to unfortunately.
The 8k model is nice but it's GPT4 so it's slow.
I think the thing that you're missing is that zero-shot learning is VERY hard, but anything > GPT3 is actually pretty good once you give it some real-world examples.
I think prompt engineering is going to be here for a while just because, on a lot of tasks, examples are needed.
Doesn't mean it needs to be a herculean effort of course. Just that you need to come up with some concrete examples.
This is going to be ESPECIALLY true with Open Source LLMs that aren't anywhere near as sophisticated as GPT4.
In fact, I think there's a huge opportunity to use GPT4 to train the prompts of smaller models, come up with more examples, and help improve their precision/recall without massive prompt engineering efforts.
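A rough sketch of that idea, using GPT-4 to label examples that then get baked into a smaller model's few-shot prompt. This assumes the pre-1.0 openai Python client, and the sentiment task, labels and unlabeled_texts are invented placeholders:

    import openai  # assumes the pre-1.0 client with openai.ChatCompletion.create

    LABELS = ["positive", "negative", "neutral"]  # invented example task

    def gpt4_label(text: str) -> str:
        """Ask GPT-4 to label one example for later reuse as a few-shot demo."""
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Classify the sentiment as one of {LABELS}.\nText: {text}\nLabel:",
            }],
        )
        return resp["choices"][0]["message"]["content"].strip()

    def few_shot_prompt(demos, new_text: str) -> str:
        """Bake the GPT-4-labeled examples into the prompt for a smaller model."""
        shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in demos)
        return f"{shots}\nText: {new_text}\nLabel:"

    # demos = [(t, gpt4_label(t)) for t in unlabeled_texts]
    # then feed few_shot_prompt(demos, ...) to whatever open-source model you run locally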
>> The next major leap in LLMs (in the next year) is probably going to be the prompt context size. Right now we have 2k, 4k, 8k ... but OpenAI also has a 32k model that they're not really giving access to unfortunately.
Saw this article today about a different approach that opens up orders of magnitude larger contexts
This is just a ToS violation, which will just result in loss of access to OpenAI. There is nothing they can do to stop you from commercially competing, given there is no copyright law precedent.
It can be argued that if you build a model using their outputs such that you can then stop using their API, your model is effectively competing with theirs.
Let’s just say that if you’re a startup or SMB, you do not want to be the one dragged to court to iron out whether this holds or not.
They're probably talking about the ToS a user would've had to agree to when using their services. It's actually a lot more permissive than I expected:
> Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI;
> use output from the Services to develop models that compete with OpenAI
Sure, but the ways of acquiring those outputs legally have vampiric licensing that binds you to those ToS, since the re-licenser is bound by the original ToS.
It's like distributing GPL code in a nonfree application. Even if you didn't "consent to [the original author's] ToS," you are still going to be bound to it via the redistributor's license.
There’s no license. OpenAI is not an author of their models’ outputs, and they know it.
OpenAI can’t just start suing random people on the street without any legal basis. That’s how lawyers become not-lawyers.
There's only a (somewhat dubiously enforceable) ToS contract between OpenAI and the user of OpenAI's website. This is probably bullshit too (what legitimate interest does OpenAI have in a model output that doesn't even belong to them?), but it's less obviously bullshit.
> Even if you didn't "consent to [the original author's] ToS," you are still going to be bound to it via the redistributor's license.
In the context of the GPL, are there real examples of judgements which bind defendants to a license they never saw or knew anything about, because of the errant actions of an intermediary?
It gives OpenAI a legal basis to launch a lawsuit if they want to.
Would it succeed? Is it right? Do they care? Eh.
…but, if I as some random reddit user say I might sue you for training an LLM on data that may or may not have my posts in it, you can probably safely ignore me.
If you go and build a massive LLM using high-quality data that couldn't possibly have come from anywhere other than OpenAI, and they have a log of all the content that API key XXX generated, then they both know and have a legal basis for litigation.
There’s a difference, even if you’re a third party (not the owner of the api key) or don’t care.
(And I'm not saying they would, or even that they would win; but it's sufficient cause for them to be able to make a case if they want to.)
Just being able to make a case doesn't mean they will consider the legal fees and resulting judgment to be valuable enough to their business, nor that the suit will even make it into the courts resulting in a final judgment.
A lot of behavior that rides this line is rationalized via a careful cost-benefit analysis.
Sure, I'm just saying that in that cost-benefit analysis the 'risk of case failing and getting nothing from it' is significantly lower; it's your call as a consumer to do your due diligence and decide:
"I don't think they'll be bothered chasing after me"
vs.
"If it came to it I think the court will rule that they don't have a case after we play the pay-the-lawyers-all-the-money game"
vs.
"how screwed am I if they do, I lose and I have clearly, blatantly and provably violated their terms of service"
^
...because, and this is the point I'm making. There is no question; it is very very obvious if you do this, and it's not very difficult for them to prove it.
All they need is to slap you with a discovery order, look at your training data, and compare it to their output logs.
I remember being a "good google querier" before autocomplete rendered that mostly irrelevant. While I think you're right to some degree, you still have to articulate exactly what you want and need from this machine, and no amount of the LLM guessing at your intent will ever replace specifically and explicitly stating your needs and goals. I see a continuing relationship between the complexity of the task and the required complexity of the request.
They still "work" in a sense but it soft ignores them and will also guess at what else you might mean, polluting your results. Some things like site: still work as expected. Been this way for years now.
> They still "work" in a sense but it soft ignores them and will also guess at what else you might mean, polluting your results
Doesn't it usually ask the user if they want to Search instead for "X" and then give you the results? It's annoying when Google thinks it knows best when I use ", but it seems to work as it ever has after clicking the Search instead link.
Being able to compose a good query is still relevant I think! My peer once asked me for help with a mathematical problem, for which they could not find help online - after not much searching I could find a relevant page, given the same information/problem statement.
Google autocomplete using your query history also reduces the information you learn from suggestions as you do the searching...
While in the past "indexDB.set undefined in " might autocomplete to show safari first, indicating a vendor-specific bug, it'll often now prefill with some noun from whatever you last searched (e.g. "main window") to "help" you.
Haven't found a way to disable that, annoying for understanding bugs, situations/context and root causes.
I can imagine the response including questions. Eg “do you mean X or Y, here are the implications”. That way the machine gets better at helping us clarify our thinking. Right now it’s just blurting out a response that fits your instruction with no attempt to clarify.
And still _a lot of people_ cannot effectively use search engines. It is less about technical capabilities of the search engine, and more about (trained) skills related to finding and filtering information.
To talk to other humans, we literally have a whole writing field: courses that teach how to write technical documentation or research grants, and so much more.
There's already a whole industry on how to talk to the human language model, and humans are currently way smarter.
The models are quite good at this already, so while (of course) they will be getting better, the (much) larger gain in performance will be from users giving up more and more privacy ultimately (or allowing local models more access).
Writing better prompts is not as big of a deal as people keep making it, and it exposes how lazy people have really gotten in light of these new tools. If you ask your friend to make a website and then are mad that they used Python on the backend instead of Rust... well, you didn't specify, so it's not really your friend's fault. The fact that specifications are needed to fulfill tasks, or that you have to share your availability when you're planning to do something, should not be heralded as some sort of amazing "engineering". The term is sickeningly stupid.
Having domain knowledge and expertise helps with communication and how to correctly identify good design - there is nothing interesting really going on here.
I don't think so. It might improve a little but not to the point of making it unnecessary.
The problem is not the LLM; the problem is on the other side of the keyboard. No matter how good they are, LLMs can't read minds; they can only guess from what you write, and they won't be able to help you unless you give enough information to express the problem. And it is not an issue specific to LLMs: we already do "prompt engineering" when we are talking to humans. We don't call it that, but it is the same idea: write your message carefully to make sure the person on the other side understands your request.
Maybe, but there is a balance to be found: if the AI asked you for perfect clarity, it would be more like a programming language than a natural language model.
The whole point of GPTs is that they are able to guess what you need based on incomplete prompts and how people usually react to these incomplete prompts. Prompt engineering is the art of telling just enough information for the AI to understand the specificity of your request, but still let it fill in the blanks.
Even as LLMs get better over time at understanding ill-formed prompts, I expect that API prices will still continue to depend on the number of tokens used. That’s an incentive to minimize tokens, so “prompt engineering” might stick around, even if just for cost optimization.
Do you not expect a trend of token prices decreasing over time? There will be businesses using a less cutting-edge model, and the difference in how many words a prompt is won't be a big contributing factor to the total spend of the business.
Good point. On the other hand, for every business that sticks to a less advanced model, there might be a competitor around the corner running the cutting-edge one in an attempt to serve customers better.
I held your opinion a few months ago. I thought this was a short-term thing and that models would just get easier to use, like many other tools. But that won't stop someone from having to use them with prompt engineering, even if it seems pretty trivial.
It's like the Community Manager role: social networks and creating a community seem to be something very easy to do, but still, someone needs to do it.
Not so sure about that. The biggest part of prompt engineering I am seeing is of the kind that sets up context to bootstrap a discussion on a predetermined domain.
As I've said elsewhere, in most knowledge work context is key to getting viable results. I don't think something like this is ever going to get automated away, especially in the cases where the context comes from proprietary knowledge.
It depends on how you define "short term". If you mean until AGI, then sure. Until then, however, for anything that is going to potentially generate revenue you will need to consider the points raised by the article to keep costs manageable, to avoid performance regressions, etc.
Interesting. My vision of the future LLM interface is also one where more bits of information per second are required per interaction to operate it to exactly the spec that you want. But that's exactly because it'll just be a plain old engineering problem.
I think that fundamentally the UIs will become more realtime. The models will - because of much lower latencies and more efficient inference throughput - become realtime autosuggest: prompt tuning; i/o feedback at:
(reading wpm)/(”ui wpm”)
In fact it might be interesting to just have a model optimize for "likely comprehensible and as concise as possible" rather than "most alike the human dataset after RLHF alignment", just for this bandwidth idea.
Articulating what you want is prompt engineering. Techniques will adapt to technological progression but the people really getting everything they can out of these systems will still be considered engineers.
> Especially when you take into consideration that they now are trained with RLHF on _real_ users input.
When I started to see people post their amazement at how good the pricing was I began to realize that once again, we are the product right now. We are the new training data and are even paying a nominal fee to be it.
People spent years and years learning how to get the best answers with the least possible effort, and search engines evolved with them. Seems pretty insane to me that we have now devolved into asking insanely specific and obtuse questions just to receive obtuse answers.
I expect the exact opposite. As more rules and regulations get put in, prompt engineering is going to be the new software development. "I would like you to pretend i need a lawyer dealing in a commercial lease that..."
How does that make sense? LLMs are machines that produce output from input, and the position and distribution of that input in the latent space is highly predictive of the output. It seems fairly uncontroversial to expect that some knowledge of the tokens and their individual contribution to that distribution in combination with the others, plus some intuition for the multivariate nonlinear behavior of the hidden layers, is exactly what would let you utilize this machine for anything useful.
Regular people type all sorts of shit into Google, but power users know how to query Google effectively to work with the system. Knowing the right keywords is often half the work. I don't understand how the architecture of current LLMs is going to work around that feature.
This nonsense illustrates the most typical way that people misunderstand the word "engineering" in a software context. Software engineering and prompt engineering are not about the self-proclaimed level of rigor or formality that you apply. They're about the actual knowledge and processes used, and especially their _effectiveness_ as measured in closed feedback loops.
But the starting point for this is that the term "prompt engineering" is an obvious exaggeration that people are using to promote a skill set which is real and very useful but a big stretch to describe as a whole new engineering discipline.
Regardless of what you call it, like software engineering, it really is a process of trial and error for the most part. With the capabilities of the latest OpenAI models, you should be aiming for a level of generality where most tasks are not going to have a simple answer that you can automatically check to create an accuracy score. EDIT: after thinking about it, there certainly are tasks that you could check for specific answers to create an accuracy score, but I still think it would make more sense in most cases to instead spend time iterating on user feedback rather than trying to think of comprehensive test cases on your own. There are a few things to know, such as the idea of providing examples, the necessary context, and telling the model to work step-by-step.
Actually, I would say that there are two major things that could be improved in the engineering described in this article, both related to actually closing the feedback loops he mentions. He really should at least mention the possibility of coming up with a new prompt candidate after the first round of tests, and again after users find problem cases.
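For the subset of tasks that do have checkable answers, that kind of closed loop can be as small as the sketch below; call_model() is a placeholder for whichever API or local model you use, and the candidate prompts and test cases are invented:

    # Minimal sketch: score prompt candidates against a small labeled test set.
    def call_model(prompt: str) -> str:
        raise NotImplementedError  # placeholder for an actual API or local model call

    def accuracy(prompt_template: str, test_cases) -> float:
        correct = 0
        for user_input, expected in test_cases:
            output = call_model(prompt_template.format(input=user_input))
            correct += int(output.strip().lower() == expected.strip().lower())
        return correct / len(test_cases)

    candidates = [
        "Extract the meeting date from: {input}",
        "You are a scheduling assistant. Reply with only the ISO date.\nText: {input}",
    ]
    # best = max(candidates, key=lambda p: accuracy(p, labeled_test_cases))
    # rerun this after each round of user-reported problem cases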
I believe it is a bit too much to call the article nonsense. The process described mirrors what we have been doing in Machine Learning for a long time: you set up a training set and a validation set, put them through the system under test, and draw conclusions from the statistical analysis of the results.
writers are now word engineers, artists are image engineers, and politicians are now bullshit engineers.
Prompt engineering is just people putting effort into studying the behaviour of ML models and how input affects the output. They're more like ML psychologists than engineers. Calling themselves engineers just makes them feel better about being glorified prompt testers.
100% agree (especially the politician part). Engineer became a status symbol and people apply it to all kinds of stuff, diluting the meaning of the term. Engineering isn't the right tool for the job in a lot of cases, and there's no shame in that. It reminds me of "physics envy" in academia, where other fields apply methods and models that aren't appropriate just because physics has high status.
OTOH, we already have social engineering and reverse engineering, which are now accepted terms, even though the engineering aspects are relatively weak. However, I must say I still really mind the term "data scientist". Honestly, "alchemy" would be more appropriate. It would also be quite hilarious. I wouldn't mind being called a software alchemist.
I object. Please don’t take the word alchemy away from us amateur alchemists before we even figure out an economical way to transmute nuclear waste into gold.
Can we hold off on the cultural dilution until there’s an extant culture to dilute? :’(
Perhaps "prompt doctor", "prompt physician', or "medical prompt professional" is a better term since they're making prompts better. People will probably like that much better since physicians seem to have high status.
It's actually significantly more appropriate as well, considering the lack of understanding of cause and effect in both biology (physicians) and NNs (LLMs), compared to the better fundamental understanding from physics that informs design in engineering.
Sounds like more copium to me. These guys aren't doctors because they don't fix anything. They just study it to use it the best way to accomplish their goal. There's no engineering or doctoring involved. They're just looking for prestigious titles to co-opt for their ultra-mundane, completely replaceable job.
> These guys aren't doctors because they don't fix anything. They just study it to use it the best way to accomplish their goal.
By that logic, most doctors aren't doctors. The primary job performed by many medical professionals is simply that of convincing the patient to leave the office happier than when they came in, actual medical practice be damned. They "just study" the superficial aesthetics of medicine to "accomplish their goal" of making money and maintaining status (and of course, less cynically, making other people happy).
At the end of the day, these are all silly word games. Titles are meaningless in the search for truth. They exist only to flatter us as the infinitely fallible humans we like to be, and there isn't any point in picking fights over them.
No, it means prompt engineering is not engineering; that's just a euphemism used by those people to pretend to be doing something prestigious. They're doing engineering the same way Subway's sandwich assemblers do sandwich engineering, which is to say they aren't.
Of course. LLMs are like toddlers; prompt engineering is holding their hand so they get from A->B as well as possible. It's providing important context to frame the output, guiding it along to "think" out its process, and breaking bigger problems down into smaller pieces. Essentially turning prompts into a series of smaller feedback loops, rather than dumping a big query and hoping the baseline toddler is smart enough to figure it out.
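A trivial sketch of that "series of smaller feedback loops" idea, where llm() is just a stub standing in for any model call and the step-splitting is deliberately naive:

    # Sketch: break one big request into a chain of smaller prompts.
    def llm(prompt: str) -> str:
        return f"<model reply to: {prompt[:40]}...>"  # stub for a real model call

    def answer_in_steps(task: str, context: str) -> str:
        plan = llm(f"Context:\n{context}\n\nBreak this task into 3-5 concrete steps:\n{task}")
        partials = []
        for step in plan.splitlines():
            if not step.strip():
                continue
            so_far = "\n".join(partials)
            partials.append(llm(
                f"Context:\n{context}\n\nWork so far:\n{so_far}\n\n"
                f"Do this step, thinking it through out loud:\n{step}"
            ))
        return llm("Combine these partial results into one answer:\n" + "\n".join(partials))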
It's just about knowing the person you're interacting with extremely well, knowing their quirks and what motivates them, and carefully guiding them along, so they do what you want. Much like social engineering.
Lmao, nobody's equating a prompt engineer with a civil or rail etc engineer.
Stop being pedantic. "X engineer" has become a common phrase to describe someone who works on something related to the tech industry; the word engineer has expanded far beyond its original meaning.
The same pedantry could apply to the word "developer", too. It's really a weird sort of gatekeeping of a word.
The techniques in this article are good practice for general model tuning and testing with a correct answer. So for tasks like extraction, labelling, classification, this is a great guide.
The challenge comes when the response is a subjective answer. Tasks like summarization, open question answering, generation, and search query/question/result generation are the hard things to test. Those typically need another manual step in the process to grade the success of each result, and then you need to worry about the bias/subjectivity of your expert graders. So then you might need multiple graders and consensus metrics. In short, it makes the process very, very slow, expensive, and tedious.
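Even a crude consensus check across graders helps quantify how noisy that manual grading step is; a minimal sketch, where the grader names and the 1-5 scores are invented:

    # Crude consensus metrics over multiple human graders (invented 1-5 scores).
    from statistics import mean, stdev

    grades = {  # grader -> one score per model output
        "grader_a": [4, 5, 2, 4],
        "grader_b": [4, 4, 3, 5],
        "grader_c": [5, 4, 2, 4],
    }

    per_output = list(zip(*grades.values()))
    consensus = [mean(scores) for scores in per_output]      # average grade per output
    disagreement = [stdev(scores) for scores in per_output]  # how much graders differ
    print(f"mean quality {mean(consensus):.2f}, mean disagreement {mean(disagreement):.2f}")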
I pretty much agree. The "scientific" approach the author pushes for in the article -- running experiments with multiple similar prompts on problems where you desire a short specific answer, and then running a statistical analysis -- doesn't really make much sense for problems where you want a long, detailed answer.
For things like creative writing, programming, summaries of historical events, producing basic analyses of countries/businesses/etc, I've found the incremental, trial-and-error approach to be best. For these problems, you have to expect that GPT will not reliably give you a perfect answer, and you will need to check and possibly edit its output. It can do a very good job at quickly generating multiple revisions, though.
My favourite example was having GPT write some fictional stories from the point of view of different animals. The stories were very creative but sounded a bit repetitive. By giving it specific follow-up prompts ("revise the above to include a more diverse array of light and dark events; include concrete descriptions of sights, sounds, tastes, smells, textures and other tangible things" -- my actual prompts were a lot longer) the quality of the results went way up. This did not require a "scientific" approach but instead knowledge of what characterized good creative writing. Trying out variants of these prompts would not have been useful. Instead, it was clear that:
- asking an initial prompt for background knowledge to set context
- writing quite long prompts (for creative writing I saw better results with 2-3 paragraph prompts)
- revising intelligently
Consistently led to better results.
On that note, this was the best resource I found for more complex prompting -- it details several techniques that you can "overlap" within one prompt:
I recall there used to be a school of thought that argued that making programming languages more like natural language was a futile effort, as the benefits of having a precise, limited, deterministic, if abstract, language for describing our ideas were far superior to any "close enough" approximation we could achieve with natural language. Where have those people gone?
When I step back and think about this LLM craze, the only stance I'm left with is that I find it baffling that people are so excited about what is ultimately a stochastic process, and what will always be a stochastic process. It's like the world has suddenly shifted from valuing deterministic, precise, behaviors to preferring this sort of "close enough, good enough" cavalier attitude to everything. All it took was for something shiny and new to gloss over all our concerns around precision and certainty. Sure, LLMs are great for getting approximations quickly, but approximations are still just approximations. Where have the lovers of certainty and deduction gone? I can't help but think our general laziness and acceptance of "close enough" fast solutions is going to bite us in the end.
I think you're just not thinking hard enough of ways to use it -- use cases where "close enough" can be augmented by deterministic validation, cleanup and iteration to perform real-world work that is "all the way".
I'm currently littering my platform with small, server-side decisions made by LLM prompts and it's doing real work that is working. There are a ton of other people doing this right now. You can be as angry as you want about it, but in a year or two you'll be using the result of this work every day.
Eh, it can be as easily said that you can promote “good enough” all you like, but “good enough” is never going to cut it for critical applications where large money or real lives are at stake. Programming languages need to accommodate such large-scale scenarios, and not only the individual-level use cases that make exactly zero money such as scripts for self-hosted servers that some Joe has in his mom’s basement.
Fair. I don't mean to suggest LLMs have no use, but I do mean to suggest that we are in a period of gross overfitting.
At the end of this all, I think they'll be best suited as massive noise generators. Human beings already generate waaaaaay more data than we know what to do with. With LLMs at our disposal, we'll be able to do so to the Nth degree. It will be the age of spam.
As for real applications, sure it can get you started, but you still have to review and massage the output. In some cases this may help you get past the "fear of the blank page" but in others just doing it yourself to begin with instead of finding clever ways to prompt the LLM to get what you really want would have been more efficient in the long run anyway when you factor in output reworking.
Yeah, I agree with you on the overfitting, but I think when the dust settles the working uses for it will emerge and the rest will be trashed. To your point, I think its fully open-ended uses such as a chatbot will relatively fade. I predict the real value will come out in converting user-intent in some subdomain into a deterministic set of options. The schedule conversion in the article is a common example of this.
If you build these small, domain-specific use cases into a user interface, things just get easier for the user. A big benefit of this is that you already have a ton of context, so the user doesn't have that burden.
Relevant video. I agree with your observations; I think they are pretty interesting. Chomsky has an interesting bit here about how the epistemic implications of statistical models like these imply a shift in our values apropos of science and its aims. For instance, as I understand it, he describes how before we would have considered a theory that made accurate predictions as satisfying some standard of validity, whereas now it seems people are aiming to see who can simulate the whole thing to the highest precision, even if they do it black-box style. I also like seeing Steven Pinker squirm a bit, so here's the vid.
Totally. I think one of the major dynamics of our age has been a general shift toward a new "epistemic regime" in which the vast majority of "truth making" procedures have become statistical in nature. If you look at history, we tend to move through different dominant modes of reasoning about the world: deduction, induction... I think our particular contemporary experience is one in which the new dominant mode of reasoning about everything is statistical and probabilistic. In many areas it's a boon, in others it's tragic. (The statistical treatment of human beings is one obvious case: trying to orient education and psychological care according to what the mean demands pretty much ensures everyone loses. Gold-star students lose out on opportunities, while less gifted students start to get some help, but not nearly the case-based, customized help they really ought to get.)
Deterministic processes are great at dealing with objective data, but less great at dealing with free-form text produced by humans.
Each tool should be used for the right job. Until now, we had only cheap plastic tools for language processing. Suddenly, we have a turbo power tool that can parse through pages of English like a hot knife through butter.
We’re all excited by the shiny new tool in the workshop, and we’re putting everything through it just to see what it can do. Eventually the exuberance will subside and we’ll put it to work where it is the most applicable.
That doesn’t mean we’ll abandon other tools and methods.
I've got some bad news for you: natural language has been used to specify programs since before there were computers to run them on.
Design documents, for one. Specifications. Standards documents. Business requests in emails, bug tickets, etc...
It goes on and on. Fundamentally, the process of programming is to convert something from a squishy human language into a purely mathematical one. Right now, programmers perform this task essentially as "manual labour". Until now, they've had no power tools. That's what LLMs are. They're not autonomous robots -- not yet. Right now, they're levers for the mind, industrial machinery for developers that lets them spit out more lines of code per day.
PS: Just as industrial machinery improved product quality, I'm starting to suspect LLMs will be used in the same way. Imagine a coding-specific variant of GPT 4 as a "pair programmer" constantly reading your code and looking for correctness issues, security vulnerabilities, unexpected gotchas, etc...
Yeah well I’ve got bad news for you too: design documents are not computer programs, and even as they exist today, they are prone to misinterpretation and somewhere in the world there’s always a programmer who’s misunderstanding the meaning of a requirement. Plus, the fact that “prompt engineering” is an emerging topic in HN discourse suggests that you need to structure your prompts in a more rigid way so that LLMs can parse them better and potentially faster, which, again, is what programming languages are: constrained and structured means of communicating logic from human to machine.
> Where have the lovers of certainty and deduction gone? I can't help but think our general laziness and acceptance of "close enough" fast solutions is going to bite us in the end.
They've always been a small group and naturally they congregate in spaces that value certainty and deduction, such as computing.
If we look backward at the history of the sciences, we'll see physics and math figuring prominently. Physics was and still is all about making mathematical models about messy, inconsistent phenomena. Math is certainly "correct" but not deterministic in the same way that computing is and proofs are still largely (but not always anymore) human endeavors of creatively applying theorems, lemmas, and factorings to mathematical objects and relations. The theory and practice of statistics predates computing and is all about reasoning under uncertainty. Underneath all the "1s and 0s" in your computer are very real analog voltages that gates are reacting to to process your logic.
It's pretty much just computing and logics that really value certainty and deduction. So really it's the opposite: the computing craze added value to certainty and deduction when there really wasn't much before.
> It's like the world has suddenly shifted from valuing deterministic, precise, behaviors to preferring this sort of "close enough, good enough" cavalier attitude to everything.
Framing this as a tradeoff is a mistake; rather, we have two different tools for two different jobs. For example: there are scenarios where an LLM can do part of Google's job or Wikipedia's job sufficiently well, but when being deterministic is what matters, then obviously we still have Google and other manual knowledge-bank processes.
They aren't going away or being abandoned. They are being supplemented. And since LLMs are so new it's mostly just a matter of the market figuring out where each tool fits into our lives. It's mostly a misconception/marketing that it is a replacement for stuff like search engines or human-backed processes.
But even doing 25% of what Google can do but better + another 25-50% it couldn't do before is a massive business and a huge boon for society.
When the mysterious black box is sufficiently intelligent and its output domain is _also_ natural language, precision and even correctness of the input carries less importance.
You can give an LLM a barely comprehensible query and you stand a decent chance of getting something useful back. Try the same in any conventional programming language and chances are it would not even have a valid AST and thus fail to compile.
So it's a question of necessity--there wasn't one. LLMs are capable enough to deal with the ambiguities of natural language.
That said I've seen some forms of DSL used to program LLMs in the chatbot character.ai scene, so I wouldn't be surprised if a more general DSL purpose built for efficient prompt engineering is discovered (efficient use of tokens, standardized forms, etc).
(I say discovered because at the moment you can for example literally make up your own DSL if you so desire, and the LLM will just roll with it and mostly do what you intend)
Follow the trajectory from manually keying in machine code instructions, to punchcards with ASM, terminals with C, GUIs with Smalltalk… have we not been watching the progression towards natural language?
Formalization of requirements will always be necessary. But allowing for an errant semicolon or misspelled word seems to be the direction we’re headed. That is, explicitly cover all cases and requirements, just don’t worry about exact machine syntax or function names.
I see your point but it's a difference in kind. The general trend in the development of higher level languages has been to allow user to encode common patterns in ever terser sequences of symbols (e.g. no need to write your own iterator, the compiler will do it for you). However, these terse descriptions of a pattern are still deterministic and precise.
This isn't quite the same as using natural languages to try and get a machine to guess at the program you want. I think it'd be a shame if we abandoned the trajectory of ever-higher programming languages in favor of relying on LLM's to generate roughly what we want in existing, "boilerplate heavy" languages. There are quite literally lines of research in programming languages right now that formally and in an incredibly terse fashion, allow you to generate provably correct software from a short, formal meta-specification. To abandon that in favor of LLM generation strategies would be a clear loss in my opinion.
LLM's have their uses of course, but at the moment we're definitely in a period of major overfitting and absolute lack of reason when it comes to the application and development of these tools.
When I read guides like this one, I wonder if "prompt engineering" is a misguided effort to pigeonhole a formal language that by necessity is precise and unambiguous (like a programming language) into natural language, which by necessity has evolved to be imprecise and ambiguous.
It's like trying to fit a square peg inside an irregularly shaped hole, without leaving any space unfilled around the edges of the square.
I use GPT-4 pretty consistently (set up a discord bot for myself). What I found myself doing was tending towards the simplest prompt that the LLM would still understand. If I asked a human expert the types of prompts I was giving GPT, I most likely would've gotten a clarifying question rather than an answer like the LLM was giving me, simply because I'm talking in such short and concise sentences.
I think the interesting thing is that the more concise a message is to a fellow human, the more work needs to be done by the other party in order to actually decode my message, even if it is ultimately understandable. Whereas with LLMs, shorter token length doesn't really matter: matrices of the same size are being multiplied anyways.
I think it's because a human actually wants to figure out what you want, whereas you're just going to keep prompting that ML model until you get something similar to what you want, something that would annoy a human and probably waste their time or make the endeavor extremely expensive.
I don't think it's really fundamental to LLMs; it's just that you don't treat a human the same way you treat an unthinking, unfeeling computer system whose transactions are cheap and relatively near-instant compared to requesting from a human.
This touches on what I think is a main separator between the GPT models and humans: If a human is unsure of the instructions, they will ask clarifying questions. GPT-4 cannot do that currently.
It can. I've had it ask questions without explicitly being told to do so. Combine Chain of Thought and ReAct with asking a question as an available action, and it will happen regularly.
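For illustration only, the prompt skeleton for that can be as simple as making "ask the user" one of the legal actions in a ReAct-style loop; the action names below are made up, not any official spec:

    # Illustrative ReAct-style prompt skeleton where asking the user is a legal action.
    # Action names (Search, AskUser, Finish) are made up for the example.
    REACT_PROMPT = """Answer the task by interleaving Thought, Action and Observation lines.
    Allowed actions:
      Search[query]      - look something up
      AskUser[question]  - ask the user a clarifying question when the task is ambiguous
      Finish[answer]     - give the final answer

    Task: {task}
    Thought:"""

    # A driver loop would parse the model's Action line; on AskUser[...] it relays the
    # question to the user and feeds the reply back in as the next Observation.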
Maybe should be called Prompt Science or Prompt Discovery or even Prompt Craft.
I have a 40 million BERT-embedding spotify-annoy index that I keep experimenting with to make a better query vector.
One way I'm doing this is taking only the token vectors with the highest sum across the whole vector and averaging those top vectors to use as the query vector.
Another way is zeroing many dimensions randomly on the query vector to introduce diversity.
But after experimenting with "prompt engineering" I found out that prefixing the sentences for the query vectors with "prompts" yields very interesting results.
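For the curious, those query-vector tricks look roughly like the sketch below with numpy and annoy; the dimensions, the random "token vectors" and the tiny throwaway index are all stand-ins for the real 40-million-item setup:

    # Rough sketch of the query-vector tricks described above (all numbers arbitrary).
    import numpy as np
    from annoy import AnnoyIndex

    DIM = 768  # BERT-base embedding size
    index = AnnoyIndex(DIM, "angular")
    for i in range(1000):  # tiny throwaway index standing in for the real 40M one
        index.add_item(i, np.random.randn(DIM).tolist())
    index.build(10)

    def query_from_tokens(token_vecs: np.ndarray, top_k: int = 5) -> np.ndarray:
        """Average the token vectors with the highest sum across the whole vector."""
        sums = token_vecs.sum(axis=1)
        top = token_vecs[np.argsort(sums)[-top_k:]]
        return top.mean(axis=0)

    def add_diversity(query: np.ndarray, drop_frac: float = 0.2) -> np.ndarray:
        """Randomly zero out some dimensions to diversify the returned neighbours."""
        mask = np.random.rand(query.shape[0]) > drop_frac
        return query * mask

    token_vecs = np.random.randn(32, DIM)  # stand-in for real BERT token vectors
    q = add_diversity(query_from_tokens(token_vecs))
    print(index.get_nns_by_vector(q.tolist(), 10))  # ids of the 10 nearest neighbours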
But I don’t see much engineering. It’s more trial, feedback and trying again. Maybe even Prompt Art. Just like on chatGPT.
Nobody understands the emergent properties of LLMs. Trying to understand how it works is science or research, whereas using it to produce something that’s useful is alchemy.
Even “tuning”, as my sibling comment suggests, is imo a stretch, because it implies some form of finite set of knobs that can be adjusted. Prompts aren’t something that you can simply map to knobs without pushing the analogy beyond reason.
I'm struggling to understand how the two ideas are different in any way other than intent. Sure, I'm not likely to throw an <|endoftext|> into a tailored context, but anybody who, for example, lies about what "assistant" says in the API calls is surely attempting to coerce behavior out of the model that isn't in line with OpenAI's intentions.
I thought you were suggesting renaming "prompt engineering" - the activity of designing prompts to solve specific problems - to "prompt injection", which means deliberately attacking prompts using input designed to subvert their planned behaviour.
To me, that's like rebranding "software engineering" to "exploit engineering" - sure, one is a subset of the other but they are not the same thing.
I don't think "prompt engineering" was ever a clearly-defined practice. The way I see it, it's just some over-eager noobs both prompting and prompt-injecting until they get results close to what they want, and then subsequently pretending like they're engaging in some new branch of mathematical reasoning. Hence why I called the moniker "pretentious".
Personally, I've never liked the title of "software engineer" or even "data engineer" (my own title). However, those are more rooted in engineering-like practices than any of this "prompt engineering" nonsense.
Not sure what counts as a "business problem" for you, but personally I couldn't have gotten as far as I have with game development without it, as I really struggle with the math and I don't know many people locally who develop games that I could get help from. GPT4 has been instrumental in helping me understand concepts I've tried to learn before but couldn't, and it helps me implement algorithms I don't really understand the inner workings of, but I understand the value of the specific algorithm and how to use it.
In the end, it sometimes requires extensive testing as things are wrong in subtle ways, but the same goes for the code I write myself too. I'm happy to just get further than have been possible for the last ~20 years I've tried to do it on my own.
Ultimately, I want to finish games and sell them, so for me this is a "business problem", but I could totally understand that for others it isn't.
Sounds like you need to learn to search. There are tons of resources on game dev. I can sort of see the value of using GPT here, but have you tried using it in an area you're an expert in? The rate of convincing bullshit vs correct answers is astonishing. It gets better with Phind/Bing, but then it's a roulette whether it will hit valid answers in the index fast enough.
My point is: learning with GPT at this point sounds like setting yourself up for failure. You won't know when it's bullshitting you, and you're missing out on learning how to actually learn.
By the time LLMs are reliable enough to teach you, whatever you're learning is probably irrelevant since it can be solved better by LLM.
Of course I've searched and tried countless avenues to pick this up. I'm not saying it's absolutely not possible without GPT, just that I found it the easiest way of learning.
And it's not "Write a function that does X" but more employing the Socratic method to help me further understand a subject, that I can then dive deeper into myself.
But having a rubber duck is of infinite worth; if you happen to be a programmer, you can probably see the value in this.
> have you tried using it in an area you're an expert in? The rate of convincing bullshit vs correct answers is astonishing. It gets better with Phind/Bing, but then it's a roulette whether it will hit valid answers in the index fast enough.
Yes, programming is my expertise, and I use it daily for programming and it's doing fine for me (GPT4 that is, GPT3.5 and models before are basically trash).
Bing is probably one of the worst implementations of GPT I've seen in the wild, so it seems like our experience already differs quite a bit.
> you won't know when it's bullshiting you and you're missing out on learning how to actually learn.
Yeah, you can tell relatively easily if it's bullshitting and making things up, if you're paying any sort of attention to what it tells you.
> By the time LLMs are reliable enough to teach you, whatever you're learning is probably irrelevant since it can be solved better by LLM.
Disagree. I'm not learning in order to generate more money for myself or whatever; I'm learning because the process of learning is fun, and I want to be able to build games myself. An LLM will never be able to replace that, as part of the fun is that I'm the one doing it.
I have personally found the rubber-ducking to be really helpful, especially for more exploratory work. I find myself typing "So if I understand correctly, the code does this this and this because of this" and usually get some helpful feedback.
It feels a bit like pair programming with someone who knows 90% of the documentation for an older version of a relevant library: definitely more helpful than me by myself, and with somewhat less communication overhead than actually pairing with a human.
>Yeah, you can tell relatively easily if it's bullshitting and making things up, if you're paying any sort of attention to what it tells you.
It's trained on generating the most likely completion to some text, it's not at all easy to tell if it's bullshitting you if you're a newbie.
Agreed that I was condescending and dismissive in my reply; I've been dealing with people trying to use ChatGPT to get a free lunch without understanding the problem recently, so I just assume at this point. My bad.
> It's trained on generating the most likely completion to some text, it's not at all easy to tell if it's bullshitting you if you're a newbie.
I don't think many people (at least not myself and others I know who use it) use GPT4 as a source of absolute truth, but more in an "iterate together until we reach a solution" way, taking everything it says with a grain of salt.
I wouldn't make any life-or-death decisions based on just a chat with GPT4, but I could use it to help me look up specific questions and find out more information that then gets verified elsewhere.
When it comes to making games (with Rust), it's pretty easy to verify when it's bullshitting as well. If I ask it to write a function, I copy-paste the function and either it compiles or it doesn't. If it compiles, I test it out in the game, and if it works correctly, I write tests to further solidify my own understanding and verify that it works correctly. Once that's done, even if I have no actual idea of what's happening inside the function, I know how to use it and what to expect from it.
> Sounds like you need to learn to search. There are tons of resources on game dev.
I have been making games in Flash, HTML5, and Unity, and on classic consoles using ASM (NES / SNES / Gameboy). Tons of resources are WRONG, tutorials are incomplete, engines are buggy, answers you find on stackoverflow are outdated, and even official documentation can be littered with gaping holes and unmentioned gotchas.
I have found GPT incredibly valuable when it comes to spitting out exact syntax and tons of lines that I otherwise would have spent hours and hours writing, combing through dodgy forum posts, arrogant SO douchebags, and the questionable word salad that is the "official documentation"; and it just does it instantly. What a godsend!
> you won't know when it's bullshitting you and you're missing out on learning how to actually learn.
Have you tried ...compiling it? You can challenge, question, and iterate with GPT at a speed that you cannot with other resources: I doubt you are better off combing pages and pages of Ctrl+F'd PDFs / giant repositories or crafting Just The Right Google Query to get exactly what you need on page 4. GPT isn't perfect, but god damn it is a hell of a lot better and faster than anything that has ever existed before.
> whatever you're learning is probably irrelevant since it can be solved better by LLM.
Not true. It still makes mistakes (as of Apr '23) and still needs a decent bit of hand-holding. Can / should you take what it says as fact? No. But my experience says I can say that about any resource, honestly.
>I have found GPT incredibly valuable when it comes to spitting out exact syntax and tons of lines that I otherwise would have spent hours and hours writing, combing through dodgy forum posts, arrogant SO douchebags, and the questionable word salad that is the "official documentation"; and it just does it instantly. What a godsend!
IMO if you're learning from GPT you have to double-check its answers, and then you have to go through the same song and dance. For problems that are well documented you might as well start with those. If you're struggling with something, how do you know it's not bullshitting you? Especially for learning: I can see "copy, paste and test if it works" flying if you need a quick fix, but for learning I've seen it give right answers with wrong reasoning and wrong answers with right reasoning.
I'm not disagreeing with you on code part, my no.1 use case right now is bash scripting/short scripts/tedious model translations - where it's easy to provide all the context and easy to verify the solution.
I'd disagree on the fastest tool part, part of the reason I'm not using it more is because it's so slow (and responses are full of pointless fluff that eats tokens even when you ask it to be concise or give code only). Iterating on nontrivial solutions is usually slower than writing them out on my own (depending on the problem).
Funny enough, I’d been wanting to learn some assembly for my M1 MacBook but had given up after attempts at googling for help as I ran into really basic issues and since I was just messing around and had plenty of actually productive things to work on.
A few sessions with ChatGPT sorted out various platform specific things and within tens of minutes I was popping stacks and conditionally jumping to my heart’s delight.
Yup, ChatGPT is, paradoxically, MOST USEFUL in areas you already know something about. It's easy to nudge it (chat) towards the actual answers you're looking for.
Nontrivial problem solutions are wishful-thinking hallucinations, e.g. I ask it for some way to use AWS service X and it comes up with a perfect solution, which I spend 10 minutes desperately trying to uncover, only to find out that it doesn't exist and I've wasted 15 minutes of my life. "Nudging it" with follow-ups about how its described solution violates some common patterns on the platform, it doubles down on its bullshit by inventing other features that would support the functionality. It's the worst when what you're trying to do can't really be done within the constraints specified.
It gives out bullshit reasoning and code too, e.g. I wanted it to shorten some function I spitballed and it made the code both subtly wrong (by switching to an unordered collection) and slower (switching from a list to a hash map with no benefit). And then it even claims its solution is faster because it avoids allocations! (Where my solution was adding a new KeyValuePair to the list, which is a value type and doesn't actually allocate anything.) I can easily see a newbie absorbing this BS; you need background knowledge to break it down. Or another example: I wanted to check the rationale behind some lint warning, and not only was it off base but it even stated some blatantly wrong facts in the process (like the default equality comparison in C# being ordinal-ignore-case?!).
In my experience working with junior/mid members, the amount of half-assed/seemingly-working solutions that I've had to PR in the last couple of months has increased a lot (along with "shrug, ChatGPT wrote it").
Maybe in some areas like ASM for a specific machine there's not a lot of newbie-friendly material and ChatGPT can grok it correctly (or it's easy to tweak the outputs because you know what they should look like), but that's not the case for gamedev. There are multiple books titled "math for game developers" (the OP's use case).
Oh, ChatGPT is terrible at actually doing things with ASM in general. It was just good at the boilerplate.
If anyone can get ChatGPT to write the ASM to reverse a string, please show me an example! I’m still having to get out a pad and paper or sit in lldb to figure out how to do much of anything in ASM, same as it has always been!
With respect to writing I've used it for things I know enough to write--and will have to look up some quotes, data, etc. in any case. GPT gives me a sort of 0th draft that saves me some time but I don't need to check every assertion to see if it's right or reasonable because I already know.
But it doesn't really solve a business problem for me. Just saves some time and gives me a starting point. Though on-the-fly spellchecking and, to a lesser degree grammar checking, help me a lot too--especially if I'm not going to ultimately be copyedited.
> By the time LLMs are reliable enough to teach you, whatever you're learning is probably irrelevant since it can be solved better by LLM.
For solving the really common problem of working in a new area, LLMs being unreliable isn't actually a big deal. If I just need to know what some math is called or understand how to use an equation, it's often very easy to verify an answer, but it can be hard to find it through Google. I might not know the right terms to search, or my only options might be hard-to-locate documentation or SEO spam.
This is fair, using it as a starting point to learning could be useful if you're ready/able to do the rest of the process. Maybe I was too dismissive because it read to me like OP couldn't do that and thought he found the magic trick to skip that part.
I don't particularly have a big problem with math at the level that AIs tend to be useful for, and find that it tends to hallucinate if you ask it anything which is moderately difficult.
There's sort of a narrow area where, if you ask it for something fairly common but moderately complicated like a translation matrix, it usually can come up with it, and can write it in the language that you specify. But guarding against hallucinations is almost as much trouble as looking it up on Wikipedia or something and writing it yourself.
The language model really needs to be combined with the hard rules of arithmetic/algebra/calculus/dimensional-analysis/etc in a way that it can't violate them and just mash up some equations that it's been trained on even though the result is absolute nonsense.
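As a tiny example of bolting hard rules onto the output, you can at least mechanically check any algebraic identity the model claims before trusting it; a sketch with sympy, where the identity itself is a made-up example:

    # Check an LLM-claimed algebraic identity with sympy instead of trusting it.
    import sympy as sp

    x = sp.symbols("x")
    claimed_lhs = sp.sympify("(x + 1)**2")      # e.g. parsed out of the model's answer
    claimed_rhs = sp.sympify("x**2 + 2*x + 1")  # made-up example identity

    holds = sp.simplify(claimed_lhs - claimed_rhs) == 0
    print("identity holds" if holds else "model is mashing up equations")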
This vibes with my experience as well. In terms of actual long term value, this kind of educational exploration is promising. It acts, in a way, as a domain expert who understands your language and can help you orient yourself in a domain which is otherwise difficult to penetrate.
I also am happy that language-as-gate-keeping, which is plaguing so many fields both in academia and business, is going to quickly be “democratized”, in the best sense of the word. LLMs can help you decipher text written in eg law speak, and it also can translate your own words into a form that will get grand poobah to take you seriously. Kind of like a spell/grammar checker on steroids.
ChatGPT with GPT4 has made me much better and faster at solving programming problems, both at work and for working on personal projects.
Many people are still sleeping on how useful LLMs are. There's a lot of related things to be skeptical about (big promises, general AI, whether it replaces jobs, all the new startups that are basically dressed-up API calls...), but if you do any kind of knowledge work, there's a good chance that you could do it much better if you also used an LLM.
I'm really trying to do the same, for both my work, and personal projects. But the type of answers I need for work (enterprise software, large codebase built over 20+ years) requires a ton of context that I simply cannot provide to ChatGPT, not only for legal reasons, but just due to the amount of code that would be required to provide enough context for the LLM to chew on.
Even personal projects, where I'm learning new languages and libraries, I've found that the code that gets generated in most cases is incorrect at best, and won't compile at worst. So I have to go through and double-check all of its "work" anyway- just like I'd have to do if I had a junior engineer sidekick who didn't know how to run the compiler.
I think for the work problems, if our company could train and self-host an LLM system on all of our internal code, it would be interesting to see if that could be used to assist building out new features and fixes.
This is the use-case that OpenAI on Azure is trying to solve. It supports finetuning, too.
Not cheap, though. I think most companies will end up hosting an LLM on local code rather than using OpenAI on Azure, but for now, there is no equivalently-capable replacement.
An eye opening example for me was that I was working with a Ruby/Rails class that was testing various IP configurations [1] and I was able to just copy and paste it into chatgpt and say "write some tests for this".
It wasn't really anything I couldn't have written in a half hour or so but it was so much faster. The real kicker is that by default chatgpt wrote Rspec and I was able to say "rewrite that in minitest" and it worked.
Both ChatGPT and StackOverflow suffer from content becoming outdated. So some highly-upvoted answer on StackOverflow has been out of date since 2011, and now ChatGPT is trained on it.
I see the future as writing test cases (perhaps also with ChatGPT), and separately using ChatGPT to write the implementation. Perhaps we will just give it a bunch of test cases and it will return code (or submit a PR) that passes those tests.
For fun I tried asking ChatGPT to create a simple example using an open-source project I maintain. The generated answer was sort of correct but not correct enough to copy and paste. It missed including a plugin, used a version of the project that doesn't exist yet, and generated data that wasn't valid datetimes.
Yep, exactly. I guess I haven't hit the 25-messages-in-3-hours limit, but whenever there's an API or library I'm not familiar with, I can get my exact example in about 10 seconds from ChatGPT 4.
I've found Copilot useful when writing greenfield code, but very unhelpful generating code that uses APIs not popular enough to have significant coverage on StackOverflow. Even if I have examples of correct usage in the same file it still guesses plausible but wrong types.
I haven't bought GPT 4 but I'm curious if it's much better at this.
If you don't mention a library by name it is liable to make something up by picking a popular library in another language and converting the syntax to the language you asked for.
If you ask for something impossible in a library it will also frequently make up functions or application settings. If you ask for something obscure and hard to do, it might reply that it's impossible, even though it is possible if you know how and walk it through it.
I sort of compare prompt engineering to Googling - you sometimes have to search for exactly the right terms that you want to appear in the result in order to get the answer you're looking for. It's just that the flexibility of ChatGPT in writing a direct response sometimes means it will completely make up an answer.
There's also a limitation that the web interface doesn't actually let you upload files and has a length limit for inputs. For Copilot, I'm looking forward to Copilot X: https://www.youtube.com/watch?v=3surPGP7_4o
Maybe a difference between prompts or v3.5 vs v4. I asked ChatGPT v4 and it gave an answer that used capnproto::serialize_packed::read_message. I asked it about the difference between read_message and deserialize_message and this is the response it gave (truncated, as it got some parts wrong):
> In summary, you should use read_message when working with packed messages and deserialize_message when working with unpacked messages. Make sure you choose the appropriate serialization and deserialization functions based on the format in which your messages are stored or transmitted.
Google was surprisingly little help on the topic. At best it pointed me to https://capnproto.org/encoding.html#packing which covers the general idea but glosses over parts.
The problem with ChatGPT is you can never be certain or confident that its answers are correct. It’s as useful as rolling dice in guessing a number sometimes. With the training from the internet, the dice are loaded, but the answer is still likely to be wrong because it just assembles words together.
It speeds up one off throw away software work that I need to do. Here's a concrete example from work recently:
> I need help writing a SQL statement. I'm using Postgresql and the database has a table called `<name>` with dozens of columns such as: `..._id`, `..._id`, `..._id`, `..._id`, etc. The columns are of type uuid. Separately I also have a list of thousands of uuids and I want to check if any values in my list are used in any of the fields on the `<name>` table. Is there a compact SQL statement that allows me to do this? I would like to avoid listing out all the columns by name.
GPT-4 responded:
> Yes, you can achieve this by querying the information_schema.columns to get the list of columns dynamically and then use a combination of string_agg, EXECUTE, and FORMAT functions to build and execute the SQL query. Here's an example:
I ran it as written and it worked great!
Could I have done this on my own? Sure, but it would have taken a few google searches, reading, trial and error to build up the query, probably 30 or 40 minutes more than just asking GPT-4 to write the query.
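For the curious, the shape of the solution was roughly the following. This is my rough Python re-sketch of the same idea, not the exact SQL GPT-4 returned (which built the statement inside Postgres with string_agg/EXECUTE/FORMAT); the table name and connection string are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
    uuid_list: list[str] = []                # fill with the thousands of uuids to check

    with conn.cursor() as cur:
        # Pull the uuid-typed columns from the catalog so we don't have to
        # list dozens of *_id columns by hand.
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_name = %s AND data_type = 'uuid'",
            ("my_table",),                   # placeholder table name
        )
        columns = [row[0] for row in cur.fetchall()]

        # Build one WHERE clause checking every uuid column against the list.
        predicate = " OR ".join(f"{col} = ANY(%s::uuid[])" for col in columns)
        cur.execute(f"SELECT * FROM my_table WHERE {predicate}",
                    [uuid_list] * len(columns))
        matches = cur.fetchall()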
Actually that model seems particularly good at writing SQL queries FYI, I could totally see a chat layer on top of a relational database writing all the SQL with no humans in the loop, just natural language --> LLM --> SQL.
Name one business problem solved with any tool that can only be solved by that tool and nothing else.
It's not about uniqueness, the name of the game is efficiency/scaling already solvable problems by multiple or skilled humans and reducing one of those dimensions.
The most actually useful results I've gotten from ChatGPT4 is essentially a replacement for calling software or hardware tech support.
E.g., with a CAD system I'm already familiar with (both the system and searching its docs), I still find it helpful to ask about obscure or complicated situations such as 'I remember reading it can do something like X, what is that and what is the command?' or 'what is a way to find and repair this kind of failure?' — the sorts of things for which I'd call tech support (if it didn't involve long hold times, etc.).
ChatGPT4 has saved me already a number of long multi-dead-end hunts for finding the correct obscure command or operation name, or for building a sequence to debug and repair a model issue. Sometimes the ChatGPT4 answer is just another BS hallucinated waste of time, but it often enough does point me in the right direction far more quickly than I would have on my own.
As any coding AI should. GPT allows me to add some detailed context, with some chain of thought back and forth, to make sure the comments are "on topic". I can give more detailed instructions, especially about things like type hints, and when they should be avoided.
But there are fairly good models for doing NER that are not LLMs. Models that are open source and you can even run on a CPU, with parameter counts in the hundreds of millions, not billions.
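For example (my own sketch, using spaCy as a stand-in for any of these smaller models), something like this does passable NER on a laptop CPU:

    import spacy

    # A small pretrained pipeline; runs comfortably on a CPU.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tim Cook visited Berlin in March to meet Deutsche Bank.")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. "Tim Cook PERSON", "Berlin GPE"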
The article you linked says that GPT4 performed better than crowdsourced workers, not than experts. The experts performed better than GPT4 in all but 1 or 2 cases. And in my experience with Mechanical Turk, the workers from MT are often barely better than random chance.
While true, GPT-4 kinda just gets a lot of the classic NLP tasks, such as NER, right with zero fine-tuning or minimal prompt engineering (or whatever you want to call it). I haven't done an extensive study, but I do NLP daily as part of my current job. I often reach for GPT-4 now, and so far it does a better job than any other pretrained models or ones I've trained/fine-tuned, at least for data I work on.
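Roughly what that looks like in practice, as a sketch using the openai Python SDK (pre-1.0 interface) with made-up prompt wording:

    import openai

    text = "Tim Cook visited Berlin in March to meet Deutsche Bank."

    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract the named entities from the user's text. "
                        "Respond with a JSON list of {\"text\": ..., \"label\": ...} "
                        "objects, using the labels PERSON, ORG, LOC or DATE."},
            {"role": "user", "content": text},
        ],
    )
    print(response.choices[0].message.content)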
But what about cost? There was a recent article saying that Doordash makes 40 billion predictions per day, which would result in 40 million dollars per day if using GPT4.
Sure, GPT4 is great for experimenting with and I often try it out, but at the end of the day, for deploying a widely used model, the cost benefit analysis will favor bespoke models a lot of the time.
Aha, for the time being (and probably going into the future without major architectural changes to transformers) prompt engineering is important.
For a one-shot question, it's probably fine. But when asking a transformer to create a completion for a conversation history, it seems to be very important to:
* Start with a clear prompt establishing the conversation, rules and goals, format of responses
* Continue to re-establish these rules throughout the conversation to remind the transformer (roughly as in the sketch below)
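Something like this (a rough sketch, not tied to any particular API wrapper) is what I mean by re-establishing the rules:

    RULES = ("You are the narrator of a fantasy role-playing game. Keep each "
             "character's name, gender and clothing consistent, and never "
             "break character.")

    def build_messages(history, user_turn, reinject_every=4):
        # history is a list of (role, content) tuples from earlier turns.
        messages = [{"role": "system", "content": RULES}]
        for i, (role, content) in enumerate(history):
            messages.append({"role": role, "content": content})
            # Repeat the rules every few turns so they sit near the end of
            # the context instead of only at the very start.
            if (i + 1) % reinject_every == 0:
                messages.append({"role": "system", "content": RULES})
        messages.append({"role": "user", "content": user_turn})
        return messages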
They say "attention is all you need" but transformers seem to be like a child hopped up on sugar: too much input and it starts forgetting things from the start of the input. Hell, even from one sentence (or message in a chat history) to the next it can forget, presumably because the last chunk of input has its keywords mentioned several times through the text, and those end up weighted more strongly than certain rules (that are only mentioned once, at the start).
I'll ask it to play a role-playing game and it'll be all like: "The man X walks up to you and introduces himself, he's wearing a green tunic"
Then after I interact it's all suddenly: "X looks down at his red tunic and also suddenly he's a dragon and actually a woman".
I think it's because attention correctly identifies all the various topics and keywords of the conversation, but it doesn't seem to create a hierarchical tree of "attention"...I wonder if the same principles of finding similarity between embeddings could also apply to the attention mechanism.
If transformers had something like that, it would be nice to reliably set an outline that forms the root node of an attention hierarchy, with the current topic at hand (towards the end of the input) as a leaf node.
This article reads into so much nonsense, I would not be surprised to see that some of the content has been generated by ChatGPT. I mean just look at this:
> Citations required! I'm sorry, I didn't cite the experimental research to support these recommendations. The honest truth is that I'm too lazy to look up the papers I read about them (often multiple per point). If you choose not to believe me, that's fine, the more important point is that experimental studies on prompting techniques and their efficacy exist. But, I promise I didn't make these up, though it may be possible some are outdated with modern models.
This person appears to be in the hype phase of the LLM and prompt mania, attempting to justify this new snake oil with jargon, when not even they understand the inner workings of an AI model that hallucinates frequently.
"Prompt Engineering" and "Blind Prompting" is different branding of the same snake oil.
The problem with this is that it requires the software to know what the target is when the question is asked, and I don't see that as reliable: there are many ways to ask, and a question could have many targets.
I don't really understand your criticism, but I'd be happy to continue a dialog to find out what you mean!
There’s probably a little too much going on with that project, including generating datasets for fine-tuning, which is the reason for comparing with a known answer.
It is very similar to the approach used by the Toolformer team.
But teaching an agent to use a tool like Wikipedia or Duck Duck Go search dramatically reduces factual errors, especially those related to exact numbers.
Here’s a more general overview of the approach:
From Prompt Alchemy to Prompt Engineering: An Introduction to Analytic Augmentation
Because I wanted to run it in the browser and have a document object in context of the LLM response being eval’d!
Also, having the exemplars typed has saved me from sending broken few-shots many times!
That is, I keep all of the few-shot exemplars in TypeScript and then compile them into an array of system/user/assistant message strings at some point before making any calls to an LLM.
Nice! Although did you do that so you can avoid having an API when running some web application you make for yourself, or am I misunderstanding you? Sorry.
Because the other distribution paradigms are...
- sharing your key with the user on client side is risky, so you have API side requests.
- one day LLMs might be local and can then run off-browser
The approach I’ve been using is to keep the API requests server-side and to expose a client interface, thus keeping the keys safe, but the response is eval’d client-side so when OpenAI starts referencing document.body in a completion it affects the browser runtime directly.
Yeah it's a smart idea I see now, you can use it as a universal database as such for all clients, like having a Python dict for all the outputs or something, but you can also easily spin up the UIs enabling your cool examples.
I almost always prefer the old-school way of prompting, keywords and commands only. That has been working well for fine-tuning Google search results for the last 20 years. Why do I suddenly have to talk to computers in natural language?
I’ll bet using ChatGPT for this will be more accurate at this point than human analysis and assumptions. So as not to waste time and effort, here you go; I’m sure this is much faster and more efficient, and I wouldn’t be surprised if others are already doing this:
Using a therapist ai as an example:
ChatGPT, I want you to help me create a series of prompts that I can use to better tune your model in order to serve humans.
The role I am trying to instill for your model would be that of an ai friend who cares about the interactive user. You will be friendly, caring, supportive but most of all, serve as a teacher in order for the user to grow intelligence.
Prompt engineering is a kind of prompting that is (in a sense) a kind of an engineering, but it's impossible to understand in what sense it is a kind of what engineering, that's why it is so hard to understand _coherent texts_ if they are not completely meaningful.
All this is so exciting and it promises many new jobs that require accelerated education of prompt engineers.
"Engineering" makes it sound as if there was a solid understanding of of LLMs actually works - there isn't. We're still finding out how LLMs work and what they are capable of.
"Prompt engineering" is just reasonable guesses and trial and error.
Isn't much of engineering applying tried and true principles and then working with a trial and error/experimentation mindset when combining or extending these principles in untried ways?
You speak as if no EE has ever had to respin an updated set of PCBs, or a ME never had to CNC another part using a stronger material.
Given that most available models are instruction tuned now (they directly respond to instructions rather than just predicting text), not much "prompt engineering" is necessary anymore. It's remarkable how quickly this has been forgotten, the article doesn't even mention it.
There's also no instances of "sample" or "random" in the article.
If you say "can be developed based on real experimental methodologies" but don't talk about randomness or temp (or top_p, though personally I haven't played with that one as much) then I'm going to be very skeptical.
Once you're beyond the trivial ("a five sentence prompt worked better than a five word one"), if you get different outputs for the same prompt, then you need to do a LOT of work to be sure that your modified prompt is "better" than your first, vs. just "had a better outcome that time."
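Concretely, the minimum bar looks something like the following, where call_llm and is_correct are hypothetical stand-ins for your model call and your grading function:

    def accuracy(prompt, n=50, temperature=0.7):
        # Run the same prompt n times at a fixed temperature and score each run.
        hits = 0
        for _ in range(n):
            output = call_llm(prompt, temperature=temperature)  # hypothetical model call
            hits += int(is_correct(output))                     # hypothetical grader
        return hits / n

    # One lucky completion proves nothing; compare accuracy over many samples.
    print(accuracy("Classify this ticket: ..."))
    print(accuracy("You are a support triage expert. Classify this ticket: ..."))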
Author here. As I noted in the post, this is an elementary post to help people understand the very basics. I didn't want to bring in anything more than a "101"-level view.
I do mention output sampling briefly (Cmd-F "self-consistency"). And yes, there are a lot of good techniques on the validation set too. At the most basic, you can sample, of course, but you can also perform uncertainty analysis on each individual test case so that future tests sample either the most uncertain cases, or a diverse mix of uncertain and not-so-uncertain cases. I also didn't go into few-shot very much, since choosing the exemplars for few-shot is a whole thing unto itself. And this benefits from "sampling" (of sorts) as well. But again, a whole topic on its own. And so on.
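By self-consistency I mean, roughly: sample several completions at non-zero temperature and take the majority answer. A bare-bones sketch (sample_llm is a placeholder for the API call):

    from collections import Counter

    def self_consistent_answer(prompt, k=5):
        # Draw k samples at non-zero temperature and majority-vote the answers.
        answers = [sample_llm(prompt, temperature=0.7) for _ in range(k)]  # hypothetical call
        return Counter(answers).most_common(1)[0][0]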
As for top_p, for classification this is a very good tool, and I do talk about top_p as well (Cmd-F "confusion matrix")! I again, felt it was too specific or too advanced to dive in more deeply in this blog post, but I linked to various research if people are interested.
To the grandparent re: temperature: when I first tweeted about this, I noted in a tweet that I ran all these tests with some fixed parameters (i.e. temp) but in a realistic environment and depending on the problem statement, you'd want to take those into account as well.
There's a lot that could be covered! But the post was getting long so I wanted to keep this really as... baby's first guide to prompt eng.
Thanks for this 101 article! The entire LLMOps field is developing so fast and is being defined as we speak.
Somehow, this time feels to me like the early days of computer science, when Don Knuth was barely known and a Turing award was only known to Turing award winners. I met Don Knuth in Palo Alto in March and we talked about LLMs. His take: „Vint Cerf told me he was underwhelmed when he asked the LLM to write a biography on Vinton Cerf.“
There are also tools being built and released for Prompt engineering [1]. Full transparency: I work at W&B
LangChain and other connecting elements will vastly increase the usability and combinations of different tools.
Try following the links in the article. They give much more detailed information. For example, your temperature explanation can be found here [1] (Ctrl+F), which is also linked in the article.
My personal next step with LLMs is to use them as completion engines versus just asking them questions. Few shot prompting is another intermediate skill I want to incorporate more.
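For instance, a toy few-shot completion (made-up example) looks like this:

    # Instead of asking a question, show the pattern and let the model complete it.
    prompt = """Extract the city from each sentence.

    Sentence: I flew into Tokyo last night.
    City: Tokyo

    Sentence: We're meeting at the Berlin office.
    City: Berlin

    Sentence: Our next offsite will be in Lisbon.
    City:"""
    # Sent to a completion endpoint, this should come back with something like "Lisbon".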
It's a good start. Also, it's good to use a toy problem for explaining how to do it. It would be great if more people published the results of careful experiments like this, perhaps for things that aren't toy problems? It would be so much better than sharing screenshots!
However, when you do have such a simple problem, I wonder if you couldn't ask ChatGPT to write a script to do it? Running a script would be a lot cheaper than calling an LLM in production.
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."
Ok, but in that case it would be better not to post anything—not to add an additional post that lacks substance.
I don't know what you mean by gatekeeper but I'm a moderator here and it's my job to tell people this stuff and try to persuade them to follow HN's rules: https://news.ycombinator.com/newsguidelines.html. In case it helps at all, such moderation posts are even more tedious to write than they are to read.