A simple example might be the problem of "pick a color". Even the best natural-language interface is going to suck about as much as if you're trying to ask another human to do it for you, even if that assistant is capable of displaying 1-5 color swatches in their replies.
Instead of just seeing the entire palette and choosing, you need to say "I want a gold color", "lighter than that" and "darker" and "less like urine" etc.
> People fundamentally don’t know how to prompt. There are no better examples than Stable Diffusion prompts.
You know, this reminds me of the Good Old Days of internet search engines, where a little expertise in choosing terms/operators was very powerful, before the advanced-case was cannibalized to help the average-case.
I think that confuses doing versus delegating. Delegation is easy to do via a text-box, because you're just kicking the interactive complexity-can down the road to someone else, often in a way which can be problematic even with actual humans.
For example, a project-manager or executive could verbally delegate "make a new registration page for the site" and "needs more rounded corners", either to an AI or to an employee or offshore contractor.
However that's not the same as trying to program exclusively by typing (or dictating) prose to a text-box. ("Page down more. Go to method pee-reg underscore apply. Show me its caller methods. Go to caller method two. Type the following into line 7 position 43...")
There might be some parallels we can draw with the last few decades of "programming business logic will be replaced by drawing diagrams" predictions.
This is exactly it, thanks for putting it down clearly.
Which is also why reports of the death of programming as a profession are greatly exaggerated. You're not being paid to write code, you're being paid to make decisions. Code is easy, or at least much easier than natural language.
You're also paid to tease the REAL requirements out of PMs/management/users/etc.
Most of the time, what is being asked for, on its face, is not what is actually wanted, is not as simple as spelled out, has some A-B tradeoffs to decide, or maybe isn't worth it given the side effects.
If a developer isn't asking multiple questions per feature, they deserve to be replaced by an LLM.
They won’t be replaced by an LLM but by another person using an LLM, most likely a dev, and possibly the same person who is asked to use an LLM to increase productivity. I see more and more companies/institutions adopting LLMs and training their workforce to use them. It will be interesting to see how all this plays out.
Seriously. I think posts/articles/etc. about AI replacing Software Engineering jobs are exaggerated and probably driven by jealousy or sadism. Just ignore those and move on.
(It was very depressing to believe that Software Engineers would lose their jobs.)
Views. I'm inundated with AI content but most of it lacks any substance. It ranges from "wow GPT is really dumb and can't behave like this supergod AGI I just made up" to "wow GPT will take over all our jobs in 3 years, it's so powerful".
Perhaps worse than the vacillation between getting terrible answers and great answers: When you simply can't tell which kind of answer it is, not until you've sunk a bunch of effort validating or implementing it. (Perhaps finding that the system invented some core fake APIs, non-existent citations, or algebra errors.)
Almost an echo of P/NP categorizations: It's tough when the effort of fully verifying a proposed answer is too close to the effort of just solving it normally.
The common occurrence of hallucinations makes it hard for me to believe anyone will be using LLMs to produce code anywhere outside of shops that really don't care about errors. Until they fix that, code is a use case where even slight errors make the output useless.
I have been using DALL-E, and testing the DALL-E ChatGPT plugin. Even though both are supposedly natural language interfaces, I find I approach the DALL-E prompt more like writing a formula than real language. Using the GPT plugin is like delegating to a designer to write the prompt for you. Personally I don't like the results of that compared to what I would make myself.
I don't think they see it the same way. At least not given that instructions in the style the GP mentions:
> Instead of just seeing the entire palette and choosing, you need to say "I want a gold color", "lighter than that" and "darker" and "less like urine" etc.
have meme status among designers, and not in a positive way. Some years ago, when I hung out with a couple of designers, I was introduced to Facebook groups exchanging examples of "briefs" and rework requests. Groups with names like "what the psyche of a graphic designer endures".
Stripped of all the banter, I'd say their complaints are the same as ours: vague requirements coming from people who don't know what they want. And like with software, "good results" come as much in spite of, as thanks to, natural language communication.
They (and we) should be glad that we get these vague requests from people who don’t know exactly what they want. If they knew what they wanted in precise enough detail, they wouldn’t need programmers or designers. Much value is added (and paid for) in turning the vague/abstract into precise, concrete, finished artifacts, whether designs or systems.
In the deep past, I considered design to be an industry full of poets and philosophers. I suspect I got this impression from home improvement shows where they bring in an interior decorator to toss throw pillows around. Then I ended up working with three high quality designers in a row.
At this point I consider the design industry to be cousins or even siblings to the software engineering industry. All those incomprehensible design decisions that pop up in popular software don't come from designers debating faux Marx or Freud in coffee shops at 3am. They show up from management and other stakeholders who at the 11th hour decide that suddenly everything has to be flat because they read something in a magazine.
The bad decisions are fought by designers tooth and nail, and the fact that anything looks halfway decent at all is due to their herculean efforts. If anything they deserve more sympathy than we do, because we can always retreat into low level communication protocols or type theory when we need to get the muggles off our backs. But everyone has an opinion on how that button looks.
Yes, this matches what I heard and saw when hanging around the designers I mentioned.
BTW, quoting from the penultimate panel of that excellent Oatmeal piece (which drives home just how similar the experiences of designers and programmers are):
> You are no longer a web designer. You are now a mouse cursor inside a graphics program which the client can control by speaking, emailing and instant messaging.
This gains a new meaning, or at least becomes an interesting parallel, with LLMs in the picture. Many of us - myself included - already use GPT-4 as, paraphrasing, "a keyboard inside an editor program, which you can control by instant messaging". Ignoring that diffusion models can spit out parts of the design wholesale, someone is bound to eventually hook GPT-4 up to Photoshop or Gimp and get a graphics program you can drive by texting it.
... just remembered, I think someone already did that to Blender, made easy thanks to Blender being able to eat Python code and spit out 3D graphics.
Tangentially, the earlier talk of a "text-box interface" made me think of Blender's "type the name of the immediate action you know should be possible but can't quickly find in hierarchical menus" box--a feature also present in some IDEs--and I'd like to emphasize that those things are (A) totally different from all this AI stuff and (B) generally awesome.
Agreed. Unlike the AI stuff with its "empty textbox" problem, fast incremental search is capital-A Awesome! Pretty much my favorite UI paradigm ever, at least out of those that gained adoption after I started using computers.
The best incremental search UIs are those that respond near-instantly, and have a stable list of candidates that is (or at least feels like) being filtered, and not like every keystroke re-runs some search from scratch. Prime example, which made me love this UI paradigm, is Foobar2000 - even back in early 2000s, I could have hundreds or thousands of entries in the music library, and then I would type into the magic textbox and watch that huge list (or tree) get instantly trimmed with each keystroke.
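(A toy Python sketch of that feel - the library contents here are made up, and a real player would debounce and index - but the core really is just a cheap filter re-run on every keystroke over one stable candidate list:)

    LIBRARY = [
        "Daft Punk - Around the World",
        "Foo Fighters - Everlong",
        "Portishead - Roads",
        "Radiohead - Everything in Its Right Place",
    ]

    def narrow(query, candidates=LIBRARY):
        # Case-insensitive substring filter over the same list every time, so the
        # visible results just shrink or grow; nothing is rebuilt from scratch.
        q = query.lower()
        return [c for c in candidates if q in c.lower()]

    print(narrow("ever"))
    # -> ['Foo Fighters - Everlong', 'Radiohead - Everything in Its Right Place']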
> A simple example might be the problem of "pick a color".
People still underestimate the power of LLMs. You ask it to show you a color picker, it generates HTML code for a color picker, you copy that into your browser and you can pick your color, which you can then copy&paste back into the LLM for further processing.
This already works and no human had to code a color picker into ChatGPT for this (and this is why LLMs are scary).
More broadly speaking, I find the idea of "LLM apps" a bit problematic; it's basically the modern Microsoft Bob. The LLM itself is already the most powerful app you can think of. Trying to hide it behind a UI that looks a little more like what you are already familiar with removes its expressive power.
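To make the colour-picker workflow above concrete, here is a rough Python sketch of it, assuming the pre-1.0 `openai` package and an API key already configured; the model name and prompt are just placeholders:

    import openai, pathlib, webbrowser

    # Ask the LLM for a self-contained colour picker page...
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   "Give me a single self-contained HTML page with an "
                   "<input type='color'> and a label showing the chosen hex "
                   "value. Return only the HTML, no explanation."}],
    )

    # ...save it, open it in a browser, pick a colour, then paste the hex
    # value back into the chat for further processing.
    page = pathlib.Path("picker.html")
    page.write_text(resp["choices"][0]["message"]["content"])
    webbrowser.open(page.resolve().as_uri())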
>> There are no better examples than Stable Diffusion prompts.
StableDiffusion prompts are a terrible example of the power of modern LLMs, as StableDiffusion has an extremely primitive understanding of language, unlike ChatGPT. With StableDiffusion you are really just stringing keywords and concepts together and hoping that something interesting will happen. The moment you ask it for anything even remotely complex it falls apart. Ask it to generate "blue hair" and it might give you blue hair, but it will also paint random other objects in the image blue. Even simple attributes don't stick to the objects you assigned them to. Complex actions or expressions don't work at all. You have to use ControlNet, in-painting and other tricks to create complex images. The language model of StableDiffusion just can't handle it, and the image generation itself is also lacking in generalization (i.e. you need custom trained models for specific styles or topics). It also doesn't allow the iterative refinement that you can do in ChatGPT; you only get a single prompt.
Prompt engineering is a short term workaround for the limitations of the current models. But that is going away. After all, you have an LLM at your fingertips, and guess what that's good for: generating text, which includes prompts.
> You ask it to show you a color picker, it generates HTML code for a color picker, you copy that into your browser and you can pick your color, which you can then copy&paste back into the LLM for further processing.
This is slower, more awkward, and less efficient than just picking a colour from an existing colour picker.
The point is that nobody had to program this. Nobody had to think up front "Will the user need a color picker?". Nobody had to find a spot in the UI to place it. You can just will it into existence as a user, with nothing but the power of the LLM. No classic app has anywhere near that amount of expressiveness.
Future versions of chatbots will of course have support for <iframe> or similar to display this kind of stuff inline, that should be obvious.
I don't think that's the point; it's not that no one had to program the colour picker - it's that no one did. The workaround shows that there was a need for it.
Having to copy and paste code to get a colour picker that you can then use and then paste the output back into the chatbox is less efficient than using a colour picker. LLMs can work as general interfaces, but the trade-off is that they're less efficient than a specific one.
One duty of the programmer and product manager is to think about the likely uses of the program and to build a UI to enable it. If users wanted a blank slate they could write the program themselves, or have chatGPT write it.
Maximum expressiveness is not the goal, because it comes with a price. There is a balance to be struck between expressiveness, and economy of effort and cognition.
> You ask it to show you a color picker, it generates HTML code for a color picker, you copy that into your browser and you can pick your color, which you can then copy&paste back into the LLM for further processing.
Better yet! If you're not happy with the LLM, you ask to speak to its manager. The LLM then downloads the internet, the source code for some random LLM project found on Github, starts training a new model, and creates a chat where both you and the two LLMs interact.
Quickly, the two LLMs start arguing with each other, and the manager LLM finds a few security flaws in the company's infrastructure, hacks into the company's AWS account and deprovisions the original LLM to "fire" it.
After a few more back-and-forths, the manager LLM gets tired of you, starts calling you a Karen, creates an account on Twitter and posts images of your conversation logs. The topic starts trending. Eventually, the LLM picks a fight with Elon Musk and gets banned from Twitter.
> After all you have a LLM at your finger tips and guess what that's good for: generating text, which includes prompts.
Reminds me of the boom in voice assistants, when we were told it was the interface of the future.
I’d ask my Google Home what the temperature was going to be today. It would tell me. I’d ask what the temperature was yesterday. It would tell me it didn’t understand the question.
ChatGPT etc obviously aren’t quite that bad but the core problem remains discoverability. It isn’t at all clear what these interfaces can and cannot do, all you can do is keep trying different things until you get a result.
Funny anecdote, but I literally started programming by building “smart” assistants like the former, and I don’t think they were much worse. What I did was lemmatize the words (my native tongue is agglutinative so it was somewhat harder than with English), and simply look for a fixed set of “commands” like “play it”, passing the rest of the words along as parameters when needed (in this case I searched YouTube for a video).
Unfortunately I seem to have lost that “beautiful PHP” code, even though I would be very curious to see just how bad it was :D
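It probably amounted to something like this sketch (in Python rather than PHP, with a made-up command table, and lowercasing standing in for real lemmatization):

    COMMANDS = {
        ("play", "it"): "search_youtube",   # "play it <query>" -> search YouTube for a video
        ("set", "alarm"): "set_alarm",
    }

    def parse(utterance):
        # Match the first words against a fixed command set; whatever is left
        # over gets passed along as parameters.
        words = utterance.lower().split()
        for trigger, action in COMMANDS.items():
            if tuple(words[:len(trigger)]) == trigger:
                return action, words[len(trigger):]
        return None, words

    print(parse("play it daft punk around the world"))
    # -> ('search_youtube', ['daft', 'punk', 'around', 'the', 'world'])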
The best voice assistant I ever had was the one I DIY-ed around 2007, using the Microsoft Speech API and a cheap piezoelectric mike I soldered to a long cable, hung off the wardrobe, and plugged into the PC. The code itself was a mashup of the MS SAPI demo of a "controlled language" interface and some tutorials on how to control WinAMP with WM_USER messages in WinAPI. I designed a little tree of commands, maybe 3 levels deep, wrote the magic XML for it, and some trivial C++ logic for driving the voice recognizer and reacting to identified commands (including one that held the recognizer two levels deep in the command tree, so I could issue multiple commands from a subtree without having to repeat two extra words for each).
The result of this couple of afternoons of work (instead of studying for my maturity exams, as I was supposed to) was a system that, I kid you not, was more reliable and delivered more value to me than any of the current voice assistants. For one, its recognition was flawless. The typical interaction would look like:
$ Computer!
> <appropriate beep from Star Trek: TNG, because of course doing this was 90% of my reason for building the program>
$ Music, Playlist Alpha
> <appropriate confirmation beep, WinAMP begins to play>
I had commands for the usual play/pause/resume, next/previous, four playlists (alpha through delta), and volume control at different granularity ("mute", "one quarter", "two quarters", "three quarters", "full", plus "louder" and "quieter" for IIRC +/- 5% or +/- 10% jumps). Plus some stubs for non-music things that IIRC I never got around to implementing.
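(Purely as an illustration - the real thing was an MS Speech API XML grammar driven from C++, not Python - the constrained command tree amounted to something like this, with only the words of the current subtree being valid at any given moment:)

    COMMAND_TREE = {
        "computer": {
            "music": {
                "play": "winamp_play",
                "pause": "winamp_pause",
                "playlist alpha": "winamp_load_playlist_alpha",
                "volume": {"mute": "volume_0", "full": "volume_100", "louder": "volume_up"},
            },
        },
    }

    def resolve(phrases, tree=COMMAND_TREE):
        # Walk down the tree one recognized phrase at a time until a leaf action.
        node = tree
        for phrase in phrases:
            node = node[phrase]
        return node

    print(resolve(["computer", "music", "volume", "louder"]))  # -> 'volume_up'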
Here's the thing: it worked flawlessly. It heard me across the room. It heard me through music so loud that it was uncomfortable to talk in. It never self-triggered (except that one person who managed to make a swear word be read as the wake word, a single case out of many who tried). It worked fast - I could complete the whole command chain in less time than Google Assistant takes to start listening after "OK Google". The secret? Constrained grammar and training.
In order to use speech recognition in Windows back then, you had to turn it on and let it analyze a sample of your voice (offline! those were the days!), based on a recording of you reading some calibration text it gave you. This process was additive - you could repeat it to improve recognition accuracy. But a little known fact was that you could also supply your own text - and that was the other half that made the magic happen.
I created myself a training text, consisting of the individual command words and their sequences, and trained the Windows speech recognition on it multiple times, under varying conditions. Specifically, I ran:
{three locations in the room} x ({no background} + ({classical music, pop music, whatever was on FM radio} x {quiet playback, normal playback, very loud playback}))
training sessions. That's 30 sessions of repeating the same text. Each one took maybe a minute or less, so I was done with it in about an hour. And after that training, no matter where I was standing in the room and what I was doing, the voice control system worked with near-zero false positives and near-zero false negatives. I say "near" because I had maybe two or three cases of each, over months of continued use. And yes, I could play music so loud you couldn't talk in the room, and I could scream out commands, outshouting the music, and it would work. Try that with Google Assistant.
To recap: I had a system I hacked together in a couple of evenings, whose software was a relatively small tweak to a default example project (but done with love!) and whose hardware was hand-soldered from the cheapest, locally-sourced parts, that did everything I wanted from a voice assistant, did it flawlessly, much faster than any of the voice assistants on the market today, completely off-line, in 2007, on a mid-range PC, without noticeably taxing its resources. This is why I occasionally rant that voice assistants are bloated and done backwards - all because they're designed to suit vendor needs first, user needs second.
--------
But hey, I know a way Google, Apple, Samsung (!) et al. could fix the shitty performance of their voice assistants and dictation software. They need to fine-tune an LLM on a dataset made of target words/sentences and transcripts of them being misheard in a great many ways. Then they need to feed the output of their voice-to-text pipeline through that LLM, so it can correct the text wholesale. That, or maybe, you know, do whatever Microsoft was doing in 2007 that made dictation work well and offline.
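Something along these lines, say (a sketch only; the model, prompt and function name are placeholders, and it assumes the pre-1.0 `openai` package rather than whatever pipeline those vendors actually run):

    import openai

    def correct_transcript(raw_asr_text):
        # Let an LLM (ideally fine-tuned on misheard -> intended pairs) repair
        # the raw speech-to-text output wholesale rather than word by word.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content":
                 "You fix speech-recognition errors. Return only the corrected "
                 "text, preserving the speaker's wording as much as possible."},
                {"role": "user", "content": raw_asr_text},
            ],
        )
        return resp["choices"][0]["message"]["content"]

    print(correct_transcript("play lease turn of the kitchen lights"))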
Cool project! Just to chime in a bit on the topic at hand, I guess part of the reason why our experiences with our self-made creations were overly positive is that we were familiar with what they could do, and what the “magic prompts” for achieving that were. Today’s systems are expected to handle a much more diverse input space (though with LLMs it should be absolutely feasible).
My experience with Siri’s “hey Siri” recognition is not bad though; the constraint here is energy efficiency, so a special always-on part has to listen for these commands and wake the CPU for whatever comes after.
- get a weather forecast for today
- set an alarm or timer
- control smart light bulbs
- play a song
IMO a huge input space is a negative feature. Either the input space should be explicitly limited and known, or it should be almost totally complete (which isn't really feasible). Attempting to cover a large input space without completeness just means that it's really unreliable for new inputs.
I think today’s LLMs more than fit the bill for the latter — most of what I might ask from Siri is easily answered more intelligently by ChatGPT. And I say that as someone who is overall quite skeptical of LLMs, and thinks they are way overhyped — this is a niche they could easily and competently fill.
> we were familiar with what could it do, and what were the “magic prompts” for achieving that.
That's the thing though: in my system, there were no "magic prompts". What the Speech API gave me, instead, was the ability to use "controlled language" - to constrain the set of possible words at any given moment. That, and as a user, the ability to train the living hell out of the recognizer in Windows settings.
Yes, today's systems "are expected to handle a much more diverse input space". But maybe they shouldn't be, since they all seem to suck at it. My knowledge of Siri, Alexa and Cortana is purely anecdotal (don't have devices with the first two, somehow was always region-locked-out of the last one), but I have first-hand experience with Google and Samsung assistants and dictation tools. And that experience is really, really bad. Neither can understand me very well in English, even if I try to speak very carefully. Both get randomly triggered (sometimes resulting in funny situations - like the GA on my mom's phone self-triggering while she had it in her jacket, and before she fished it out of the pocket, the assistant managed to misinterpret some overheard conversation and apologize for perhaps being annoying). There's no obvious way for me to calibrate them for my voice. Both run recognition in the cloud, making any attempted conversation slow and annoying. And despite claims to the contrary, Google Assistant can't handle multiple languages - not just in a single voice query, but even across separate sessions. Whenever I try, I have it randomly decide to either parse Polish as English, or unilaterally decide to switch languages, changing its own response language and voice, and then fail trying to parse English as Polish.
I could list more and more bad experiences, but my overall point is: while I recognize the different and broader challenges current voice assistants face, my little teenage evening project from 15 years ago serves as a POC, demonstrating that 2007-era tech could handle 90+% of my use cases[0] for a voice assistant flawlessly, much faster, and offline. Surely there must be some middle ground somewhere.
--
[0] - Really, all it would take is to expand my command language grammar XML file with a couple of extra subtrees for other topics, such as timers or system settings. The remaining <10% are the parts actually requiring unconstrained speech recognition, e.g. to transcribe the search query I want to run. I hadn't tested that much back in 2007, but even if it failed completely, the totality would still be way more useful than Google Assistant is to me today.
False positives matter a lot in this use case: most of my anger at Google Assistant is less about it not understanding me >50% of the time - it's mostly about how more than 50% of misunderstandings cause it to loudly read out long texts, call a random contact, or launch a random YouTube video.
i think by "magic prompts", what the previous post meant was that you knew all the possible commands (e.g., "Music, playlist alpha").
in theory, you could just look at a manpage for the speech api and know every keyword. there's no manpage for siri/alexa so you don't know what the commands are -- you just have to guess and when it works it supposedly "feels like magic"
And this is a mistake, IMHO. I mean, it sort of works with ChatGPT - now, and only somewhat reliably for the past 3 months. It didn't work and doesn't work with voice assistants.
There is fun in exploration, in discovering new and useful or interesting functionality on your own. At least, when you're young and have ample free time for it. For adults... well, between blog posts and in-app examples, they gave us a scattered map of the language anyway. They might as well have compiled it into a reference guide from the start.
After all, those voice assistants still have a command grammar, similar to that of my system. Users end up having to learn that grammar anyway. Hiding the grammar, adding some fuzziness in command matching, and then putting an unconstrained voice-to-text engine in front... didn't really improve anything, and only made the problem much, much harder. An own goal. And the only way it "feels like magic" is that it feels like your phone's being haunted by an angry poltergeist.
Current Voice assistants aren't much more than hardcoded GOFAI software. The only relevant ML involved is in the speech-to-text model. Modern language models are on a completely different level. Problem is that they need absurd amounts of VRAM, so running them locally on a phone is out of the question.
> It isn’t at all clear what these interfaces can and cannot do, all you can do is keep trying different things until you get a result.
You would ask an LLM based assistant what it can do, in the same way you can ask ChatGPT what it can do.
> You would ask an LLM based assistant what it can do
But this has the same problem that it's trying to solve in the first place: the LLM's behavior is unpredictable, and that includes its answers to questions like this. There's no guarantee that it won't hallucinate.
Maybe this can be ameliorated by giving it access to some hard-coded and highly vetted list of capabilities?
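One possible shape for that (illustrative names only): keep the vetted capability list outside the model, answer "what can you do?" from it directly, and only let the model map free-form requests onto one of the listed ids:

    CAPABILITIES = {
        "weather_today": "get a weather forecast for today",
        "set_timer": "set an alarm or timer",
        "lights": "control smart light bulbs",
        "play_song": "play a song",
    }

    def describe_capabilities():
        # Rendered straight from the hard-coded list, so it can't hallucinate features.
        return "I can: " + "; ".join(CAPABILITIES.values()) + "."

    # The model only ever gets to pick an id from the list (or refuse).
    SYSTEM_PROMPT = (
        "Reply with exactly one capability id from this list, or 'none': "
        + ", ".join(CAPABILITIES)
    )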
Add to this that voice assistants fundamentally got worse and worse as their makers tried to monetise, increase usage, etc. It took cool tech and made it annoying.
While I agree with the title, I find it a bit disappointing that it focuses entirely on a few specific AI interfaces.
First of all, ChatGPT implies by its name that it's designed to chat. It's a sophisticated chatbot, but a chatbot nonetheless; you are not going to have a chat by filling in a form. Then, the examples given for StableDiffusion are not natural language. In this case there can probably be a better interface than a single textbox, but the issue is the textbox, not that it expects natural language (it doesn't).
Other types of interface for AI stuff do exist. Copilot is also an LLM, but it takes surrounding code as input, not a natural language prompt. Plenty (most) of models take whatever format is convenient for the task and output an adapted format (image/video/text classification, feature detection...).
On the other hand, there are some interfaces that force natural language processing where it is one of the most unnatural and ineffective options, and no AI whatsoever is involved. Anyone who has tried to book a train ticket in France in the past couple of years knows what I mean[0]. Having to spell out an itinerary and date information is very, very confusing and error prone.
I'd like to take issue with the characterization of Copilot as not needing textual prompting.
To get it to work, you need to use comments. The more the better. You can put huge amounts of context and information in them. This involves writing and description. It's exactly the same as what the author is arguing against.
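For example, a detailed comment like the one below effectively is the prompt; the body underneath is the kind of completion you'd hope Copilot produces (not a guaranteed output):

    from datetime import date

    # Parse an ISO-8601 date string like "2023-06-14" and return how many whole
    # days remain until that date; raise ValueError on malformed input.
    def days_until(iso_date):
        target = date.fromisoformat(iso_date)   # raises ValueError on bad input
        return (target - date.today()).days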
It can only be good when the interface already knows a massive amount of context. For example, an LLM in a future use case knows all your email and you just ask it to draw a timeline of a thread, etc.
If you are asking it to do stuff from scratch like most of us do on ChatGPT, it’s quite a pain
Exactly. It doesn't need extra buttons in the UI, it needs a giant context window to contain all the important info about your life or a business, and the tools for gathering and maintaining that context. Then merely saying that you are sick becomes enough to autocomplete the rest of the actions, including sending an email.
"There is nothing, I repeat nothing, more intimidating than an empty text box staring you in the face."
Talk about a hyperbolic opening line.
Is it really that intimidating to have an empty text box on Whatsapp or your favorite SMS app? No, as you expect to have an appropriate response coming from the other side, pretty much regardless of what your input is.
As a frequent user of ChatGPT, I've come to expect the same in there. And it works great, without me having to study any "prompt engineering". In fact, as it gets updated, I get frustrated less and less often — unlike my experience using Bard, which can be better for a few tasks but often returns opaque errors that do feel frustrating. The solution here is clearly for the model to improve, and one doesn't even need a leap of faith — just look at what OpenAI is already delivering!
Talking to a competent LLM is nothing like talking to bash or dos. I also get frustrated when I sometimes have to ask for the same thing in a slightly different way... but that's still almost always faster than searching for the right button or submenu in most creation-oriented software. Whoever is waiting for Word or Google Docs to add a "write this in business-formal email tone" dropdown menu to the UI clearly hasn't grokked the true shift we're about to go through in computing.
Incidentally, I am often using ChatGPT to help me do more advanced / rarely used tasks in software from Avid Pro Tools to Adobe Premiere. And I can't remember a single time when doing this was slower or more frustrating than reaching out to either Google or the software's own "help" section.
Of course we'll have more input options. It makes tons of sense for things like image or video generation. I bet the models will also soon be outputting more and more "interactive elements" that will aid in refining results. But I have a feeling the opening text box (or, better yet, the open ears of a friendly audio assistant) is here to stay.
> or, better yet, the open ears of a friendly audio assistant
It’s interesting you mention this. I’ve been wondering about this for a while now - huge leaps have been made recently in LLMs, speech synthesis and speech recognition. There are sophisticated language models, computer voices that are hard to distinguish from real humans, and software that can reliably understand even the worst recording of someone speaking.
Yet those three components have not been integrated into a next-generation Alexa. But why? It doesn’t even sound particularly complicated (on the scale of all the prior art necessary).
BingChat already takes voice input and gives voice replies, but still requires the push of a button in the UI to start, it still can't run as a voice assistant in the background.
Principal–agent problem! Previous generation assistants have been frozen in time by managerial capitalism. This is evident in literally all the incumbents that matter in the western world: Google, Amazon, Apple, Microsoft and Samsung.
It took founder-led OpenAI to kick everyone in the butt. Thankfully the wheels are moving again to get to what you're describing, an inevitability in the very near future.
Sam Altman is the principal. The agents are MAMAA middle managers, who are often also smart (though I'm clearly biased here, having been one of those in my previous life) but highly incentivized to be obedient and risk-averse.
> Is it really that intimidating to have an empty text box on Whatsapp or your favorite SMS app? No, as you expect to have an appropriate response coming from the other side, pretty much regardless of what your input is.
Yes, very much yes - sure, I can expect to get an appropriate response, but that doesn't change the fact that I don't know what to write to start the conversation. An empty text box does indeed scare me - if I want to write something, but have no good idea what to write (or more than a couple of competing ideas that feel equivalent), my mind simply goes blank.
Yes, this applies to ChatGPT too. There are a million things I want to bounce off GPT-4. But when I have the time, none of those things come to mind.
Surely the author and I are not the only ones like this. There's a reason the "fear of the empty page" is a term among writers. There's a reason you may occasionally hear of the "fear of the empty text editor" in the context of programming.
Of course, if I know what I want, then it's all fine - except, I find myself constantly constrained by my own typing speed. Doubly so now, with the recently improved response time of OpenAI's GPT endpoints.
Do you go to google.com when you have nothing to search for?
When you do, does the current opening UI feel that inadequate?
There's a simple reason they haven't added tons of UI elements for things like advanced search operators: the vast majority of the queries and the vast majority of users simply don't need them to get what they want from the tool.
FWIW, they have no reliable way of measuring what "vast majority of users" "want from the tool", or when they get it. Users navigating to one of the search results and abandoning the search may mean they found what they want - or it may just mean they gave up.
## Python class called Car
def Car(make, model, year):
    """This function creates a new car object"""
    car = {"make": make, "model": model, "year": year}
    return car
This article isn't about natural language or AI, IMHO.
Typing into a keyboard is not natural. Adding buttons makes it even less so.
We don't have AI avatars today. But we will within my lifetime. I'll be able to converse with most any historical figure in VR. We will see each others facial and hand expressions. Let's not go backwards and add buttons. Put your efforts into going forwards.
Considering we move away from natural language whenever we have the chance to do so (many early formal languages were meant to target the humans who would in turn program computers!), I don't think this is really the case except for extremely trivial cases.
Natural languages are terrible programming languages for many of the same reasons programming languages are terrible natural languages.
> Natural language is the ultimate programming language. We use it to program us humans all the time
Unreliably, with mixed results.
Just yesterday, I said something loudly from my basement so that my family upstairs could hear, and the response was for them to ask why I was angrily yelling nonsense at them from the dungeon. YMMV.
However, I think the title is misleading. Maybe I am too much of a technical person and take "interface" too broadly. But natural language is probably the most natural interface, as humans learn it really early and use it every day.
Also, the word "natural" may not be suitable for this discussion. Maybe the author has this in mind already but I think we should rather talk about intuitive, efficient, motivating, … interfaces.
In general, I guess HCI is not about natural but rather useful.
We have been using a natural language interface where I work for 5+ years (pre-LLM) and honestly, if applied correctly, it can be very sticky and effective.
For example, your application may have multiple capabilities serving multiple user types.
Rather than smacking an empty text box on the front page, you can try embedding the box into a specific capability and developing parsers focused on the particular domain of that capability. This limits the scope, and thus the chances of not delivering on the user's intention.
Most of the arguments I hear from my mates-who-code for adding deterministic old-school things like buttons and "UI" on top of LLMs are "sometimes it's faster and more specific"... OK, sure, but go to voice-to-text, give the LLM in a voice version to a grandma, and she can just use it.
And if you, the programmer, need specifics ("I want to do root cause analysis of these 20 incidents, break them up into mobile and web app, then draw common threads in each type of RCA")... then write/say a better prompt that details exactly what you want. It might take 2 or 3 goes, but you will get there. Adding metadata tabs like some vector-db franken-SharePoint seems like a step back, at least from a Rich Sutton world view. Let the LLM work it out - and if it fails, improve it for many use cases, not just one.
Strong disagree. These are straw-man arguments about the limitations of models. Senior executives, politicians and people in high leadership positions can see their wildest dreams realized using little more than natural language and discussions with the world’s leading experts. Satya Nadella or Joe Biden don’t enjoy any special abstractions, and yet they are able to get everything they want using natural language.
The ability to express oneself through natural language is only tangentially related to its suitability as the foundation of a user interface.
A natural language should let you express your thoughts effectively. A user interface should let you accomplish the task you want effectively, within the limited capabilities of the program you are using. And it should expose those limits, not make you discover them through trial and error, as opaque, conversational interfaces do.
Or maybe they use natural language because they have no choice. If Nadella or Biden could just push some buttons to make things happen, they probably would. I know I would.
The problem is that most people don't know how to express what they want (and often they don't know what they want). But it's a bit myopic to stick to language interfaces; LLMs are already integrated with images, and over time they will be integrated with entire GUIs.
I'm really confused by "Anecdotally, most people use LLMs for ~4 basic natural language tasks" and "Most LLM applications use some combination of these four".
I'm not sure about the `ELI5` use-case, and it feels like this is only true for a very limited subset of the use-cases people currently use LLMs for.
For conversational FAQ-type use-cases like the ones described by OP perhaps a few basic prompts suffice (although anything requiring the agent to have "agency" in its replies would necessarily require prompt engineering) - but what about all the other ways that people can use LLMs to analyze, transform and generate data?
I don't know about you, but for me on a personal level ChatGPT is a SERIOUS paradigm shift. I work on OSX, I use an app (MacGPT) that is always a keyboard shortcut away, and more often than not the responses are several times better than what Google will give me.
I know there are a lot of areas with high friction, but those will go away eventually and compared to what I was doing before, the friction is much much less.
Not to mention that it is a tool _I didn't have before_, and for me in some cases - especially those tasks that are unavoidable, generally have to be started from zero, and are a burden - my productivity has increased 3X.
I'll take an unnatural interface any day if that's the end result.
For an app like MacGPT, how much do you end up paying each month in API fees? I'd love to try it out, but I know that I'd constantly be second-guessing myself and saying "nah, I guess I don't really need to look that up" because that pay-per-token API fee would always be sitting in the back of my mind.
TBH I have no idea [1] but I have quite a low limit of expenses using the API (10 bucks) and I'm yet to reach it.
Besides that, you can also log in via the web (I haven't done that, but it's possible).
As a side note, even though the friction might be too high, I couple it with the same programmer's MacWhisper, so that when I need a large text I just speak to the computer.
[1] Geez. I just checked, and my cumulative expenses for June are a staggering 0.27 USD
I think on the input side natural language is a pretty reasonable interface. For me the output side is much more problematic. Natural language makes search output so uniform that not only can't you tell whether something's real or not; even if it is, you can't tell whether it came from the Encyclopedia Britannica or the YouTube comment section.
Taking away the ability to discern sources yourself turns me off using these systems anywhere that is relevant.
Also of course not outputting structured data greatly diminishes having AI systems interface with other traditional automated systems. Natural language is not a great format for processing or analysis of any kind.
I don't particularly agree with the title but agree that there's a present awkwardness in the way that users are expected to derive correct insights from LLMs. In the same way that tokenization exists to overcome a would-be shortcoming of present computing resources and architectures, but may eventually become unnecessary: a more streamlined interface would be helpful in tiding us over this hump of awkwardness even if it too eventually becomes unnecessary.
I started reading and found the text not engaging - perhaps it's just me not being in the right mood - but while I was deciding to stop reading, Substack showed a popup. What the hell. Enough enshittification. I dropped the thing and started complaining here. (shrugs)