Every post that claimed to use ChatGPT to achieve non-trivial tasks turned out to involve non-trivial human intervention.
> (from the original article) In fact, I found it better to let ChatGPT generate a toy-ish version of the code first, then let it add things to it step-by-step. This resulted in much better output than, say, asking ChatGPT to generate production-quality code with all features in the first go. This also gave me a way to break down my requirements and feed them one at a time - as I was also acting as a code-reviewer for the generated output, and so this method was also easier for me to work with.
It takes a human who really knows the area to instruct ChatGPT, review the output, point out the silly mistakes in the generated nonsense, and start the next iteration. These kinds of curated posts always cut out most of the conversation and the failed attempts, then string together the successful attempts that produced quality output. Sure, it will be helpful as a super-IntelliSense. But not as helpful as the post suggests.
I've tried to do something like what's in the post, but I quickly got bored with waiting for output, reviewing it, and all the iterations. One important aspect of programming is that reading code may not be easier than writing code. And in my case, it's more painful.
IMO this leaves out some salient details. For example, I'd say ChatGPT is a very, very good junior developer. The kind of junior developer that loves computer science, has been screwing around with miscellaneous algorithms and data structures its whole life, has a near-perfect memory, and is awake 24/7/365, but has never had to architect a data-intensive system, write future-proof code, or write code for other developers. Of course, these last three things are a big deal, but the rest of the list makes for a ridiculously useful teammate.
It also has a very broad knowledge of programming languages and frameworks. It's able to onboard you with ease and answer most of your questions. The trick is to recognize when it's confidently incorrect and hallucinating API calls.
What do you mean when you say this? Most people use "hallucinate" to mean "writes things that aren't true". It clearly and demonstrably is able to write at least some code that is valid and write some things that are true.
These models don't have a frame of reference to ground themselves in reality with, so they don't really have a base "truth". Everything is equally valid if it is likely.
A human in a hallucinogenic state could hallucinate a lot of things that are true. The hallucination can feature real characters and places, and could happen to follow the normal rules of physics, but they are not guaranteed to do so. And since the individual has essentially become detached from reality, they have no way of knowing which is which.
It's not a perfect analogy, but it helps with understanding that the model "writing things that aren't true" is not some statistical quirk or bug that can be solved with a bandaid, but rather is fundamental to the models themselves. In fact, it might be more truthful to say that the models are always making things up, but that often the things they are making up happen to be true and/or useful.
Precisely. The model is just regurgitating and pattern-matching over a large enough training set that the outputs happen to look factual or logical. Meaning we're just anthropomorphizing these concepts onto statistical models, so it's not much different than Jesus Toast.
I think this is a great way to think about it. Hallucinations are the default and an LLM app is one that channels hallucinations rather than avoids them.
Yeah, the junior dev analogy misses on the core capabilities. The ability to spit out syntactically correct blocks of code in a second or two is a massive win, even if it requires careful review.
Yup, it's been a help for me. Had a buddy who asked me if I could automate his workflow in WordPress for post submissions he had to deal with. I asked ChatGPT, with a little prodding, to create me a little workflow. I cleaned it up a bit and threw it in AWS Lambda for literally $0. He was super thankful and hooked me up with a bunch of meat (his dad is a butcher), and I spent maybe an hour on it.
The whole thing is a really accurate expansion on the analogy. It even extends further to explain how it tends to forget certain requirements it was just told and tends to hallucinate at times.
Well, besides the prose, ChatGPT generates a perfectly valid looking code mashup of e.g. Qt, wxWidgets and its hallucinations on top of that. Humans don't do that :)
I'm actually not so sure about that statement. For example, knowing whether the code will be executed on a Raspberry Pi, an HPC node with 10TB of RAM and 512 CPUs, or a home desktop with 128GB of RAM and an 8-core CPU will greatly affect how the task may be done. The same goes for whether code aesthetics matter and the dependencies allow for them, or fewer dependencies are required, or performance is more important, or saving disk space is paramount, etc.
All of these considerations (or the need to run easily on any of them) heavily change the direction of what should be written, even after the language and such have been chosen.
So, yeah - effectively you do need to specify quite a bit to a senior dev, if you want specific properties in the output - so it's obvious that this needs to be specified to a linguistic interface to coding like these LLMs.
I guess it depends on how you'd define "senior" in this context: someone who knows lots of tech stacks, or someone who has ideas. Of course that doesn't directly map to people's skills, because most people develop skills in various dimensions at once.
> Every post that claimed to use ChatGPT to achieve non-trivial tasks turned out to involve non-trivial human intervention.
That means full autonomy reached in 0% of applications. How do we go from 0 to 1? By the way, until we remove the human from the loop the iteration speed is still human speed, and number of AI agents <= number of human assistants.
The productivity boost from current-level AI is just 15%, as reported in some papers. The percentage of code written by Copilot is about 50%, but it just helps with writing out the easy parts and not much with debugging, designing, releasing, etc., which take the bulk of the time. So it's probably back to a 15% boost.
I don't think so. Ultimately there's not enough information in prompts to produce "correct" code. And any attempt to deliver more information will result in a worse programming language, or as it is now, more iterations.
Many high quality human programmers could go off and make a very good program from a simple description/prompt. I see no reason an LLM couldn’t do the same.
On top of that, there’s no reason an AI couldn’t ask additional questions to clarify certain details, just like a human would. Also as this tech gets faster, the iteration process will get more rapid too, where a human can give small bits of feedback to modify the “finished product” and get the results in seconds.
English is a programming language now. That is what is being demonstrated here. Code is still being written; it just looks more like instructions given to a human programmer.
Eventually, human languages will be the only high-level programming languages. Everything else will be thought of the way we currently think of assembly code: a tool of last resort, used only in unusual circumstances when nothing else will do.
And it looks like "Eventually" means "In a year or two."
English is a programming language once you stop looking at or storing the output of the LLM. Like a binary. I'm not seeing anybody store their prompts in a source repo and hooking it directly up to their build pipeline.
The point is that the roles are reversed not that you give ChatGPT to the stakeholders. ChatGPT is a programmer you hire for $30/month and you act as its manager or tech lead.
This is pointless to argue though since it’s apparent there are people for which this just doesn’t fit into their workflow for whatever reason. It’s like arguing over whether to use an IDE.
But when the code doesn't meet the requirements, the AI needs to know what's incorrect and what changes it needs to make, and that still requires a human. Unless you just put it into a loop and hope that it produces a working result eventually.
So what if you don't "just put it into a loop and hope" but actually build a complex AI agent with static code analysis capabilities, a graph DB, a working memory, etc.?
I'm doing just that and it works surprisingly well. Currently it's as good as people with 2-3 years of experience. Do you really believe it's not going to improve?
Now I'm making a virtual webcam so it has a face and you can talk to it on a Zoom meeting...
This is not really the same, but may be interesting to some: I subscribed to ChatGPT Plus for a month to check out GPT-4. The rate limits were cumbersome, though, and it can be easy to waste a prompt, so I started to bootstrap:
I would explain my problem to 3.5 and ask it to suggest comprehensive prompts to use with 4 to maximize my limited quota. It worked very well.
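In API terms, the bootstrap looks roughly like this. This is just a sketch, assuming the pre-1.0 `openai` Python package; the problem text and prompt wording are placeholders:

```python
import openai  # assumes the pre-1.0 openai package and an API key in OPENAI_API_KEY

def chat(model, prompt):
    """Send a single user message and return the assistant's reply text."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

problem = "I need to add rate limiting to a Flask API."  # hypothetical problem statement

# Step 1: have the cheaper model draft a single, comprehensive prompt.
draft_prompt = chat(
    "gpt-3.5-turbo",
    "I only get a handful of GPT-4 prompts. Write one comprehensive prompt "
    f"that will get GPT-4 to fully solve this problem in a single answer:\n{problem}",
)

# Step 2: spend one of the limited GPT-4 prompts on the drafted prompt.
answer = chat("gpt-4", draft_prompt)
print(answer)
```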
In the long years to come, the most advanced AIs may become so far removed from us that the best intermediaries will be their less advanced brethren.
I use GPT-4 with ChatGPT daily for coding. Here's what I've found works for making the most out of the limited prompts OpenAI gives you.
- When you first start, tell ChatGPT about the limitation. Example: "I'm only allowed to send you 25 prompts every 3 hours. As a result, I need you to keep track of the messages I send you. At the end of each response you give me, please tell me how many prompts I have left, e.g. 'X/25'."
- Combine multiple questions into one prompt, and tell ChatGPT how to handle it. Example: "I need help doing X, Y and Z. Please provide all of the details I need for each task, separating your answer for each with '—————'." Your prompts can be quite long, so don't hesitate to jam a lot into a single prompt. Just be sure to tell ChatGPT how to handle it, or you'll end up with shallow answers for each (see the example after these tips).
- Provide an excruciating amount of detail in your prompts to avoid having to waste other prompts clarifying. For example, let's say I'm encountering an issue in an app I'm building. I will tell ChatGPT what the issue is, share the relevant parts of my code (stripping out any details I don't want to share), and tell it the expected behavior. This could be dozens of lines of code, multiple files, etc. It can all be one prompt.
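For instance, a combined prompt following the second tip might look something like this (the tasks here are just made-up placeholders):

```
I'm only allowed to send you 25 prompts every 3 hours, so I'm combining tasks.
I need help doing the following. Please provide all of the details I need for
each task, separating your answer for each with '—————':
1. Write a Python function that parses ISO-8601 timestamps into datetime objects.
2. Explain why my Dockerfile layer cache keeps getting invalidated (Dockerfile below).
3. Review the SQL query below for performance issues.
```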
I didn't know you could make a request like that. However, for my own personal privacy and my particular use case in my more professional context, I can't use it for certain things. It is also highly limited in the ChatGPT interface: if I give it a small pseudo data set with plausible data (6 columns, 8 rows) and ask it to do something basic like format it into a table, it will start to do that but cut off when it gets to row 6, column 5.
I then remind it that it didn't complete the table; it apologizes, reattempts, and still truncates things. Its output is still far shorter than what I receive from some other prompts, so it's not purely a length issue. I'd need control to tweak its parameters through the API to get it to respond appropriately.
However, while it may not format the table well, it will still keep the full data set in memory and answer questions on it. For example, I told it the data context. I then asked it why one category (row) saw a decline for a specific year (the columns), and it gave a cogent, insightful answer that only my boss and I, the domain experts in my organization, would be able to identify so quickly.
I then asked it to make a projection for a following year, where the data was not given, based on the observed data. It did so. The values were reasonable, but I didn't know why it used them, so I asked it where it got them from and why it chose them:
The answers were incredibly cogent, again on the level that an entry-level junior domain expert would give. Here's one nearly verbatim, from memory: "I noticed that the values were consistent for most years but dropped significantly for one year, but the most recent year recovered to an even higher level. So I projected a value that was less than the most recent year but still mostly approximated the average outside of the anomalous year." It gave a bullet-point explanation like this for each of the eight rows.
I asked it WHY the drop may have occurred that year (again, I had told it the data context, so it knew what the data was about), and it gave 5 bullet-pointed paragraphs that, again, would be very solid answers for a junior practitioner in my area of work.
I asked it what specific formula calculations it used for the projections it mentioned. It then apologized for not clarifying earlier (lol, it does that a lot) and proceeded to tell me that its initial projections were qualitative in nature, based on observational criteria. That alone is amazing. It then went further, though, without more prompting, to say something very much like "However, if you would like a more quantitative approach, the following formula would be a reasonable approach."
It then went on to describe in great detail the formula I could use to calculate the difference from year to year for each row of data and apply that average to the next projected year's #, along with explaining its reasoning in detail for each step it took.
ChatGPT 3.5 gave very basic answers that I suppose might be useful for a basic user looking for possible basic trends, as long as they understood that it could be total BS and they needed to vet the answers. GPT-4's analysis was spot on.
I can't fully express the extreme utility of this. Giving GPT-4 a pre-aggregated data set and having it produce some decent insights automatically could save me hours of work reviewing and finding the most obvious trends: trends that would be obvious to me when I see them, but that would require me to look through 10 to 50 columns across 20 to 100 rows of data. (Keep in mind it would be pre-digested, cleaned and validated data, so work has to be done to get to that point.)
But then GPT-4's preliminary observations would bootstrap my ability to digest the rest of it and have a jumping off point to perform more complex analysis. Then I could give it bullet points and have it summarize my findings in a digestible way for my less data-literate audience, all told saving me hours and getting me out from under a huge backlog of work.
It's a use case that would in no way threaten my job, but make me more productive. And I have enough work that I'm hiring a junior data analyst, and would need to do so even with this increased productivity, and so it would not deprive them of a job either.
It truly would (will!) be a game changer in my day-to-day work. But I do fully acknowledge that, in some other areas of work, it would reduce the # of employees required for the available work. And also, to my fear, future versions could make me less relevant as well, though I think that's further off. Domain expertise is an enormous part of my job.
I see this as an example of the reverse: AI is still stupid enough that it takes humans a degree of skill to craft a request which generates the desired output.
Kolmogorov complexity would like a word! It seems intractable for AI to read minds, there should always be some degree of skill involved in prompt writing.
Agreed. But I think we may reach a point where it is difficult to prompt the most advanced AI (which may or may not be an LLM, though maybe containing an LLM mode) in a productive way, especially due to computational demands on resources. And so AIs with lesser, cheaper capabilities may be reasonably competent at "understanding" (a term I use very loosely) the problem of dealing with more advanced but resource-constrained systems, and can assist with best practices for prompting them.
I don't know how true it is vs. how much is PR, but Khan Academy's use of LLMs was interesting in that they apparently craft a prompt from the AI itself. I.e., a two-step process where the AI generates the steps, and then the AI reasons about the result of the steps to try to judge the accuracy of the data. This was just a blurb from the TED talk[1], but I'd be interested in seeing a slightly more in-depth explanation of this strategy.
Made me wonder if you could have a recursive cycle of thinking, where the "AI" prompts itself, reasons about the output, and does that repeatedly, with guards such that it stops once it judges there's no more advancement.
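Something like this, maybe. A rough, hypothetical sketch (again assuming the pre-1.0 `openai` package), with the stopping guard being the hand-wavy part:

```python
import openai

def ask(messages, model="gpt-4"):
    reply = openai.ChatCompletion.create(model=model, messages=messages)
    return reply["choices"][0]["message"]["content"]

def self_refine(task, max_rounds=5):
    """Generate an answer, critique it, and revise until the critic is satisfied."""
    answer = ask([{"role": "user", "content": task}])
    for _ in range(max_rounds):
        critique = ask([
            {"role": "system", "content": "You are a strict reviewer. Reply APPROVED if the "
             "answer is correct and complete; otherwise list concrete problems."},
            {"role": "user", "content": f"Task:\n{task}\n\nAnswer:\n{answer}"},
        ])
        # Guard: stop once the model judges there's no more advancement to be had.
        if critique.strip().startswith("APPROVED"):
            break
        answer = ask([
            {"role": "user", "content": f"Task:\n{task}\n\nPrevious answer:\n{answer}\n\n"
             f"Reviewer feedback:\n{critique}\n\nRewrite the answer addressing the feedback."},
        ])
    return answer
```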
From what I've seen, the problem is not with the GPT models, but with the people themselves. Almost every day I get multiple unstructured, unclear requests from people, often without the required context, that I need to iterate on back and forth with additional questions to get a clear view of what they're trying to achieve. The only way GPT is "stupid enough" is that it's not prompted to ask questions back to clarify the request.
That's my point though: we humans get better at working with abysmal directions the more we encounter them. Current AI, on the other hand, requires humans to improve to meet it; it still cannot take those inputs as they come, warts and all.
That's a great idea. I'm going to start doing this. For me it also seems GPT-4 just prints out slower. I find I can get most things done with 3.5, and it's faster to achieve what I'm looking for. Then, when I'm not satisfied with the 3.5 response, I can clean it up and feed it over into 4.
I kick off with the 4 as there's no time to waste, and solely utilize 3.5 in API mode for my apps. It's way speedier, and if you're certain the task is doable, it's a no-brainer to employ it. Scripted uses are often like that.
Not exactly sure why you bring this up but tangentially this is actually a really good prompt to use with GPT. Ask it a question but tell it to list the known knowns, known unknowns and unknown unknowns before replying. The unknown unknowns part usually generates some interesting follow up questions.
Cool -- I did something similar with the goal "imagine and simulate an instrument that doesn't exist" and ended up with this -- it even created the assets, or prompts for other AIs to make assets where it couldn't, including the model.
I've had GPT3.5 teach me how to make a watchOS application to control a BLE device, and after a couple evenings I have one working on my watch right now.
On the bright side, it gave some concrete hands-on guidance that I just wasn't able to get from the Apple documentation. It quickly gave me enough pointers to make things click in my head. While I've never written a single line for any Apple ecosystem before, I have a fair amount of experience in various other niches, so I just needed a crash-course introduction showing me the ropes until I started to see the similarities with things I already know about, and it gave me exactly what I wanted. Unlike beginner "write your first app" blog articles, this crash course was tailored specifically to my requirements, which is incredibly helpful.
However, on the downside, the code it wrote was awful, lacking any design. It gave me enough to get familiar with Swift syntax and how to set up CoreBluetooth, but the overall architecture was non-existent: just some spaghetti that kind of worked (save for some bugs that were easier to fix on my own than explain to GPT) in my described happy scenario. It was as if a junior developer had grabbed some pieces from StackOverflow. In my attempts to bring things to order, I hit a knowledge-cutoff barrier (it has no clue about the latest Swift with its new async syntax or Xcode 14 specifics, as a few things were moved around) and heavy hallucinations (ChatGPT actively trying to refer to various classes and methods that never existed).
Still, I'm impressed. Not a replacement for a developer, but definitely a huge boon. Wonder what others' experiences and opinions are.
My experience was similar. I am most comfortable coding in python, and my html/css/js skills are not strong. I don't have the patience to stay current with frontend tools or to bash my way through all the stack overflow searching that it would require for me to code frontend stuff.
So with that context, I was able to have chatgpt do all of the heavy lifting of building a webapp. I was able to build my wish list of features like immersive reading with synchronized sound and word highlighting. Stuff that I probably wouldn't have had enough motivation to complete on my own.
But the architecture of the code is poor. It repeats itself and would be hard to maintain. I think much of this could be improved by spending more time asking gpt to refactor and restructure. The same way a senior developer might provide feedback to a junior dev's first coding project. I did some of this along the way, but that code base needs more if it were to be an ongoing project.
The README in that repo tells the story of building that particular app. I recently got access to gpt-4, and the tooling I've built has become much more reliable. I will likely polish it up and put it onto GitHub sometime soon.
It's good at writing new code, with sufficient prompting. But the big open question right now for engineering orgs is: can it edit existing code like developers do, just from instructions? Does anyone have hands-on experience with Copilot X?
Copilot without -x does this relatively well. But it's a hands-on process. I can't just give it some source files and say "go". But it can easily make you 10x. I often spend more time tab-completing than writing.
I'm still trying to figure out how to do this. Copilot is extremely helpful. But more often than not, it suggests completions which would steer the project in another direction.
It is extremely good once the code has been organised and an overall structure exists. But until then it can be a distraction.
How are you all enjoying Copilot? For me, it wastes time so often. It blocked my editor's auto-import functionality, and in cases where I really need some easy suggestion, maybe 50% of the time it doesn't suggest the right thing.
I'm sorry, but that's just too hard to believe. Even if it were a perfect reasoning/coding engine, the fact that its context is so limited guarantees it will get stuff wrong.
I'm a fan of copilot - but no way in hell is it a 10x tool. It's probably 1.2x - which is a huge gain for such a cheap tool.
One of the most frustrating things about this approach is that it feels asymptotic: you're always approaching but never arriving at a solution. You start seeing diminishing returns on further prompting.
It’s great for scaffolding, but not that great for non-trivial end-to-end projects.
I've also made six small apps completely coded by ChatGPT (with GitHub Copilot contributing a bit as well). Here are the two largest:
PlaylistGPT (https://github.com/savbell/playlist-gpt): A fun little web app that allows you to ask questions about your Spotify playlists and receive answers from Python code generated by OpenAI's models. I even added a feature where if the code written by GPT runs into errors, it can send the code and the error back to the model and ask it to fix it. It actually can debug itself quite often! One of the most impressive things for me was how it was able to model the UI after the Spotify app with little more than me asking it to do exactly that.
WhisperWriter (https://github.com/savbell/whisper-writer): A small speech-to-text app that uses OpenAI's Whisper API to auto-transcribe recordings from a user's microphone. It waits for a keyboard shortcut to be pressed, then records from the user's microphone until it detects a pause in their speech, and then types out the Whisper transcription to the active window. It only took me two hours to get a working prototype up and running, with additions such as graphic indicators taking a few more hours to implement.
I created the first for fun and the second to help me overcome a disability that impacts my ability to use a keyboard. I now use WhisperWriter literally every day (I'm even typing part of this comment with it), and I used it to prompt ChatGPT to write the code for a few additional personal projects that improve my quality-of-life in small ways. If people are interested, I may write up more about the prompting and pair programming process, since I definitely learned a lot as I worked through these, including some similar lessons to the article!
Personally, I am super excited about the possibilities these AI technologies open up for people like me, who may be facing small challenges that could be easily solved with a tiny app written in a few hours tailored specifically to their problem. I had been struggling to use my desktop computer because the Windows Dictation tool was very broken for me, but now I feel like I can use it to my full capacity again because I can type with WhisperWriter. Coding now takes a minimal amount of keyboard use thanks to these AI coding assistants -- and I am super grateful for that!
Spot on, right! Glad you achieved all of the above. By design, tech advances to enhance humans' ability to create. In your case, the AI tech (LLMs) truly augments your own capabilities, letting you reach the comfort that others enjoy freely. Hope to see more use cases like yours brought forward to inspire some anxious humans who are terrified by the rapid advancement of AI tech.
For someone like me, who always wanted to work on multiple projects but lacked the time and manpower, GPT has truly made that possible. Looking forward to GPT-5.
You don't mention it explicitly, but I assume you manually copied and pasted all the code, as well as the various patches with updates? In my experience, that quickly makes new suggestions from ChatGPT go out of sync with the actual state of the code. Did you occasionally start the conversation over and paste in all the code you currently had, or did this not turn out to be an issue for you?
I have found with GPT4, this isn't an issue, but you do have to watch it and let it know if it makes up some new syntax. Scolding it will usually get it to correct itself.
The bigger issue for me has been hallucinated libraries. It will link to things that don't exist that it insists do. Sometimes I've been able to get it to output the library it hallucinated though.
It also makes more syntax errors than a person in my experience, but that is made up for by it being really good at troubleshooting bugs and the speed it outputs.
Indeed, I manually copied the outputs. If the network lost context (or I ran out of GPT-4 credits and reverted to GPT-3), or for some reason I needed to start a new chat, I would start by also feeding in the other modules' docstrings to rebuild context. Sometimes I had to pass these again after a few prompts.
A good example looks like:
```
I am trying to model associative memory that i may attach to a gpt model
here is the code for the memory:
....
can we keep the input vectors in another array so we can fetch items from it directly instead of having to reconstruct?
```
I asked GPT to write a program which displays the skeleton of a project, i.e. folders, files, functions, classes and methods. I put that at the top of the prompt.
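Something in the spirit of this; here's my own rough Python equivalent using the `ast` module (not the exact program GPT wrote):

```python
import ast
from pathlib import Path

def print_skeleton(root="."):
    """Print each Python file with the classes, methods, and top-level functions inside it."""
    for path in sorted(Path(root).rglob("*.py")):
        print(path)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                print(f"    class {node.name}")
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        print(f"        def {item.name}()")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.col_offset == 0:
                # col_offset == 0 keeps this to module-level functions; methods are handled above.
                print(f"    def {node.name}()")

if __name__ == "__main__":
    print_skeleton()
```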
I had a spooky experience with a project that was written almost entirely by GPT. I gave it the skeleton and one method and asked it for a modification. It gave it and also said "don't forget to update this other method", and showed me the updated code for that too.
The spooky part is, I never told it the code for that method, but it was able to tell from context what it should be. (I told it that it itself had written it, but I don't know if that made any difference: does GPT know how it "would" have done things, i.e. can predict any code it knows that it wrote?)
It's very good at guessing from function names and context. This is very impressive the first few times, and very frustrating when working with existing codebases, because it assumes the existence of functions based on naming schemes. When those don't exist, you can go ask it to write them, but this is a rabbit hole that leads to more distractions than necessary (oh now we need that function, which needs this one). It starts to feel like too much work.
Often, it will posit the existence of a function that is named slightly differently. This is great for helping you find corners of an API with functionality you didn't know existed, but insanely frustrating when you're just trying to get code working the first time. You end up manually verifying the API calls.
It only works well with codebases that are probably quite well represented in its training data, IME. For more obscure ones, you're better off just doing it yourself. Fine-tuning may be a way to overcome this, though.
Combine TDD and self-debugging into a workflow and you almost have a new paradigm of software development, where entire applications can be developed with a series of prompts. Software programmers have finally programmed themselves out of their jobs! It's a kind of poetic justice that LLMs trained on open-source code are replacing us.
We should have never listened to Richard Stallman. /s
Writing code is such a small part of software development though. It’s an even smaller portion of software engineering (which involves tons of other skills such as translating user requirements/complaints into features/bug reports, recognizing trade offs and knowing when to make which one, having historical knowledge about the project such as which approaches to a particular problem have been tried and discarded and why, and so on).
If you really do feel that way (honest, non-troll question): why not quit? There still seems to be plenty of software work, even if there's now a feedback loop in the workflow. But if you feel we've written ourselves out, why not do something more physical? Not to sound snarky.
Good preview of the near future of software dev, but I'm also wondering about the trend of companies forbidding the use of AI-generated code due to copyright ambiguity. I suppose pressure to reduce cost and move faster will overcome that concern.
Besides copyright ambiguity, I think a big problem will be security: someone who knows in detail the tech stack and business model of, for example, a bank app that was generated with ChatGPT would be able to use ChatGPT to generate the same (or similar) code for themselves. This would turn any ChatGPT app into almost an open-source app, no? And once you know the details of the implementation, it's easier to find the security holes in it.
>It was more like handholding a fresh grad who had absorbed all of human knowledge but needed someone to tie various parts of that knowledge to create something useful.
I experiment every so often with ChatGPT, usually having it create a simple multiplayer browser-based app with a server-side backend. The most recent being a collaborative pixel art app similar to /r/place.
Usually by the time ChatGPT generates something that actually works, after some guidance and generally minimal code edits (usually due to its context loss), I could've written a far more optimized version myself. Its capabilities are still extremely impressive nonetheless and I look forward to future iterations of this technology. Super nice tool to have for mundane code generation and it'll only get better.
Really wish I could use anything like this at work to generate tests... it's really good at that.
When you actually work with a high-level engineer, they can do a lot autonomously and can cut through ambiguous instructions based on experience, but they also require interactions that clarify important decision points, and there are many. GPT-x is miles away from this outcome.
The dev gave the initial idea to the LLM. That's the creative process. Everything after that, arguably, is just technical details in order to realize the idea. Sure, implementation requires plenty of creativity, but of different kind.
Believe me, only a dev can get this working. Maybe in the future, LLM wizards will conjure all our technology, but at this point, having a working knowledge of all APIs from 2021 is an assistive technology, not a magical code-machine.
I've used LLM to generate a lot of code recently on side projects. It's a 10x jump in productivity, but it can only reliably do 50-80% of the work, and the last tail needs editing, verification, setup with infrastructure, etc.
It won't read your mind, you need to iterate, re-create, and guide. And each of those 3 relies on a working knowledge of software, libraries, tech, and user experience to get right.
Exactly this. I doubt a non-programmer would be able to produce similar output of similar quality and completeness. Like I said, I am not losing my job yet. Maybe next year...
I hope so, speaking from the perspective of someone who wants to keep their job. But at the same time I feel it's not trivial to bring good arguments against LLMs taking over without resorting to "they can't take into account all the context and might make tiny mistakes". But maybe people can be trained to be verifiers/testers as opposed to code writers.
This is a non-example, because it is not how interaction with LLMs to write code works right now. You can check the linked page to see that this took the author several hours spread over 3 weekends to implement, with lengthy back-and-forth discussions with ChatGPT building the code.
And you still need to have some model of the code structure; you need to understand what is going on to go back and forth with it. It takes care of a part of the work that is quite standard, so you work at a different level. The analogy is a compiler taking care of certain optimization work so that you do not have to invent it every time. If you think you can build anything meaningful with ChatGPT using one-line prompts, I would suggest you try engaging with it in the context of a language/paradigm you are only a novice in, to see how it works and which parts can be a good learning experience, engaging and entertaining. Do not use some obscure language and libraries, though, because it will start hallucinating a lot.
If you look toward the end, where it generates Einstein's field equations, I had to start from real numbers (Dedekind cuts), to real manifolds, to pseudo-Riemannian manifolds, to the curvature metric, to the final thing.
"As an AI language model, I'm unable to create or prove the existence of a polynomial-time algorithm for the Traveling Salesman Problem (TSP), because it is an NP-hard problem. This means that it is unlikely that such an algorithm exists, unless P = NP, which is an unsolved question in computer science.
However, I can suggest some approximation algorithms or heuristics that provide good solutions to the problem, even though they don't guarantee an optimal solution:"
> It was more like handholding a fresh grad who had absorbed all of human knowledge but needed someone to tie various parts of that knowledge to create something useful. Also ChatGPT is bad at dealing with abstractions beyond 2 layers.
> ChatGPT is definitely a productivity multiplier. I think it is rather a differential productivity multiplier, as it would enhance more the capabilities of those who already know more. If I did not understand deep learning and FAISS, or how projects are structured, I don't think I would have been able to pull this off. On the other hand, it also has some sort of a leveling effect—I have not worked on PyTorch in a while, have no idea of FAISS's new APIs, etc., but these gaps were filled in by ChatGPT.
Until a bird shits on the camera, your kid vomits in the car, a tire is punctured, somebody breaks a window, the police hail you to stop, you're choking on a peanut, there is a crash nearby or a crash involving your car, and all other kinds of edge cases.
I think the OP meant "I don't need a dedicated driver to drive my car [because I can drive it on my own]".
The process is simplified so you can do it yourself if you have the right tool, instead of relying on dedicated professionals. The process can be traveling or designing and writing an app.
I don't understand the point you're trying to make with those edge cases, especially choking on a peanut, but driving your own car is extremely popular, despite those.
Fun fact: People didn't like cake mixes when they first came out, precisely because it wasn't "really cooking". Then someone (Betty Crocker?) changed the mix (and the instructions) so that the person had to add an egg, not just water. Then the humans felt like they were actually cooking, and cake mixes were more accepted.
A really smart AI would leave enough for the humans to do that they feel like they're still in charge.
I kind of agree about the cake mixes, but not the developer. It's pretty clear to me the value the developer provided, as he elaborated about it at length near the end and said the LLM isn't getting his job anytime soon.
The cake mix really isn't "cooking," by my standards. Neither is microwaving popcorn. But it's an arbitrary line, I wouldn't defend it very hard.
TBH, programming has lacked creativity since compilers got within 90% as good as hand-rolled assembly.
Hand-rolled assembly wasn't really fun because you could type it and get a response instantly, rather than the creative good old days of mailing in punch cards and waiting weeks for a result.
Punch cards weren't fun either, because using a computer wasn't creative. Doing math by hand was.
Ad nauseam.
If you think that really fucking excellent portal-opening tools don't enable creativity, you just have a dim view of what creativity is.
Is anyone letting an LLM code and run its code by itself, then iteratively fix any bugs in it without human intervention until it e.g. passes some black box tests?
Would it be possible to significantly improve an LLM using such unsupervised sessions?
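The outer loop itself is simple enough to sketch. Hypothetically, with the pre-1.0 `openai` package and `pytest` standing in for whatever black-box test harness you have:

```python
import subprocess
import openai

def generate(messages):
    out = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return out["choices"][0]["message"]["content"]

def run_tests():
    """Black-box check: return (passed, output). Here just `pytest` as an example."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def code_until_green(spec, target="solution.py", max_iters=10):
    messages = [{"role": "user", "content": f"Write {target} implementing:\n{spec}\nReturn only code."}]
    for _ in range(max_iters):
        code = generate(messages)  # in practice you'd strip markdown fences from the reply
        with open(target, "w") as f:
            f.write(code)
        passed, output = run_tests()
        if passed:
            return code
        # Feed the failure back and ask for a fix, with no human in the loop.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"The tests failed with:\n{output}\nFix the code. Return only code."},
        ]
    raise RuntimeError("Gave up: tests still failing after max_iters")
```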
Yeah, at some point GPT-4 loses track and is just consistently wrong.
Lately I can't feed it too much info, the longer the context the more issues.
With your suggestion, it doesn't know which part of the iteration is correct at the moment. For us, iteration is logical, but for ChatGPT I think it's just more variables that increase the chance of being wrong. So you need to somehow build it in so that it can iteratively filter and prompt.
That's been my experience. At some point it can't "un-learn" its mistakes because it keeps including the "wrong" bits in scope.
I have some success saying "no, undo that," waiting for it to return the corrected version, and only then continuing.
Oobabooga's UI is better at this, since you can remove erroneous outputs from the context and edit your previous input to steer it in the right direction.
Given that OpenAI mines conversations for training data it seems to align with their interests to make you give up and start a new prompt. More abandoned prompts = more training data.
I don't know what the issue is. It's been happening more lately. Before, I would ask a lot and it would manage; now, often 4-5 prompts in, it seems to just answer without the previous context.
Exactly! During the process, it seemed like having two GPTs self-playing, one iteratively generating the proper prompts and the other generating the output, all triggered by one concise command from a human (say, "write tests and don't stop iterating till the tests pass"), could get rid of the loops spent fixing tests by basically automating the human out of the loop, but it would also take away control.
I haven’t had much experience with AGPT but all the “AGI is here” and “people use this to make money” posts on their news feed makes my “meat brain” suspicious ha.
>It was more like handholding a fresh grad who had absorbed all of human knowledge but needed someone to tie various parts of that knowledge to create something useful.
10 INT but 0 WIS, that's a good mental model for LLMs.
I've found the best way to pair program with ChatGPT is with GPT4 API through a VSCode extension by @jakear [1] that uses the Notebook interface. Instead of setting a language for each cell, you set roles like "system", "user", or "assistant" and when you run a cell it sends the cells as chat messages.
A huge benefit of this format is that you can delete cells, edit the responses from GPT-4 to incorporate changes from future queries, and even rearrange or add mock assistant messages to prime the conversation. As ChatGPT suggests changes, I incorporate them into the main code cells and replace the old queries/feedback with new queries/feedback. Since the old changes are incorporated into the parent cells, it loses track a lot less, and I can also touch it up to use the right file paths, APIs, etc. when it messes up.
You can go a step further and monitor the llm file with inotify and extract assistant messages, infer the file path from the responses, and automatically write them to file as you update the notebook. That eliminates the back and forth copy pasting.
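A rough version of that watcher, using polling instead of inotify to stay stdlib-only; the `session.llm` filename and the "path comment on the first line of each code block" convention are my own assumptions:

```python
import os
import re
import time

NOTEBOOK = "session.llm"  # hypothetical notebook file produced by the extension
# Assumed convention: assistant replies contain fenced blocks whose first line is a
# comment with the target path, e.g. ```rust\n// src/main.rs\n...```
BLOCK_RE = re.compile(r"```\w*\n(?://|#)\s*(\S+)\n(.*?)```", re.DOTALL)

def extract_and_write(text):
    """Write each fenced block to the path named in its leading comment."""
    for path, body in BLOCK_RE.findall(text):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w") as f:
            f.write(body)
        print(f"wrote {path}")

def watch(interval=1.0):
    last_mtime = 0.0
    while True:
        mtime = os.path.getmtime(NOTEBOOK)
        if mtime != last_mtime:  # only re-parse when the notebook file changes
            last_mtime = mtime
            with open(NOTEBOOK) as f:
                extract_and_write(f.read())
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```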
It'd be nice to extend that interface to include Jupyter notebook cells so we can use ChatGPT to generate notebook cells that can be parsed and executed in the interface directly.
Edit to add another tip: I use a variation of the below system prompt for working on larger sessions. Each user message begins with a file path and contains a code block with the contents of the file. After each user message containing a file, I manually add an assistant message that just says "continue", which allows adding several files at different paths. The last user message, the one I actually execute, contains the <request> tokens and the description of the modifications I want in the code. I incorporate the suggested changes into the messages then rinse and repeat. Prompt (sadly I forgot to record where I found it):
You are a Rust AI programming assistant. The user will send you the relevant code over several requests. Please reply "continue" until you receive a message from the user starting with the tokens "<request>". Upon receiving a message from the user starting with the tokens "<request>" please carry out the request with reference to the code that the user previously sent. Assume the user is a senior software engineer who needs minimal instruction. Limit your commentary as much as possible. Under ideal circumstances, your response should just be code with no commentary. In some cases, commentary may be necessary: for example, to correct a faulty assumption of the user or to indicate into which file the code should be placed.
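In raw chat-API terms, a session built with that system prompt ends up as a message list roughly like this (the file paths and the request are made up):

```python
SYSTEM_PROMPT = "You are a Rust AI programming assistant. ..."  # the prompt quoted above

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "src/lib.rs\n```rust\n<contents of lib.rs>\n```"},
    {"role": "assistant", "content": "continue"},   # manually added filler reply
    {"role": "user", "content": "src/parser.rs\n```rust\n<contents of parser.rs>\n```"},
    {"role": "assistant", "content": "continue"},
    # The last message is the one actually executed:
    {"role": "user", "content": "<request> Refactor the parser to return Result instead of panicking."},
]
```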
I thought of that as a kind of dropout for memory, so that the network also learns to build resilience around what it stores. I was also unsure of how to manage the memory:
- Should I reset it every iteration? That would result in it behaving more like working memory, maybe.
- Should I reset it every epoch? Would that pollute the memory from previous iterations, and what would happen if it got full?
- Finally, why not delete, say, 0.01% of the memory, or maybe 1 cell per iteration, randomly? This would imply the memory is not all that reliable (as with biological memory), and the neural net has to build resilience to use it effectively (by storing multiple copies of what is really useful; hippocampal novelty detection type?).
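That last option is only a couple of lines in practice; a rough PyTorch-style sketch, assuming the memory is a (num_slots, slot_dim) tensor:

```python
import torch

def corrupt_memory(memory, drop_fraction=0.0001):
    """Randomly wipe a small fraction of memory slots so the network can't rely on any single cell.

    memory: tensor of shape (num_slots, slot_dim), modified in place once per training iteration.
    """
    num_slots = memory.shape[0]
    num_drop = max(1, int(num_slots * drop_fraction))  # at least one cell per iteration
    idx = torch.randperm(num_slots)[:num_drop]
    with torch.no_grad():
        memory[idx] = 0.0
    return memory
```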
The technological singularity is approaching fast.
It's pretty clear that, given terminal access and an appropriate outer loop, GPT models can iteratively create new GPT models (either by writing and executing Python code, or, for later versions trained on LLM weights, perhaps by outputting new weights directly). If the inner workings of the loop are sufficiently obfuscated (in the code-based version), it wouldn't necessarily be clear to us what had changed, and the model weights/architecture on their own are not interpretable. That's very close to a singularity definition (machines that self-improve faster than our understanding can keep up with); the only open question is whether the new versions would actually be improvements. But that sounds solvable too (except that it wouldn't be easy for us to tell for sure).
Unless this singularity is able to construct for itself sensors and actuators that can refine existing knowledge and acquire new knowledge, it will forever be hamstrung by the incomplete and static knowledge of the universe that was used to train the initial model. This is where regulation is desperately needed. We must be very careful with the kinds of sensors and actuators that are integrated with these systems. Perhaps none should be allowed.
Well, if the terminal has internet access, it can download whatever new data has been created and train a new model on that. In theory, it could also hack into a 3D printing workshop, or get some funds (eg steal credit card data, or perhaps even legally) and just issue regular orders. This is clearly very speculative thinking, but I think we are making a similar point actually.
You’re clear in your use of a singularity definition. I’ve always taken the (I think more popular?) stance that the singularity is better defined by AGI surpassing average or best human intelligence. This definition is still far off. The lacking aspects of end-to-end functionality in LLMs may or may not ever see completion.
There are paradigm shifts necessary for many applications.
It seems like there is some sort of "viral singularity" that is quite likely. Still a "paperclip maximizer" where paperclip is something unexpected. In a sense life, and humanity, are types of paperclip maximizers.
Could you explain in detail how that causes the singularity? I’m lost. I saw a cool tool used to make another cool tool. You saw the singularity. Where is this logic coming from? Why would terminal access which I’m fairly certain I’ve seen in autogpt, change much.
The cool tool makes another cool tool, which in turn makes another cool tool, faster and faster, until we really don't understand at all what the latest cool tool is doing. But it just keeps getting smarter/faster/more effective/whatever it's optimizing for.
For one thing, I don't see that in this specific case, or how we are closer than before. Code generation and auto-correction getting closer is very cool, but the path to the singularity needs to be spelled out for me: how will a tool making a tool get us there? How does that solve all of the hurdles that go into AGI or the singularity? How does it change a large language model?
What was never so clear to me in that vision is how version n actually makes sure that version n+1 is faster and better.
Initially there might be the easy option of using more tokens/memory/<whatever easily measurable and beneficial factor you can imagine>. But when that becomes impractical, how will the "dumber AI" select between x generated "smarter AIs"? How will it ensure that the newly generated versions are better at all (if there are no easily measurable parameters to be increased)?
You have it optimize towards reasoning tests, not tokens or memory. It’s like if you were optimizing a race car. You make the explicit goal time around the track, not horsepower or weight.
That could end up in getting better and better special-purpose expert systems. How do you create better and better general-purpose AIs this way? What is more, when the AIs are more advanced, it might be challenging to create meaningful tests for them (outside of very specialized domains).
My point with the race analogy was you score it on the outcome (time) not a factor (horsepower). For your concern, just make the outcome broader. Someone will still say it’s not AGI as it does 100 different remote jobs, but just plan on ignoring them.
Something that scares me is people don’t want to believe this is where we’re headed or it’s even possible. I say this because I think your concerns are easy enough to address, it makes me think you didn’t try answering your own questions.
The idea in this scenario is that it's self-optimizing. There's no reason it can't make its own more specific tests under the general test of "become as smart as possible". And people can make tests that they're not smart enough to pass themselves: you just make them as a team and take more time. It could also discover new things and then verify whether they're true, which is easier.
The "easy" optimization targets I have mentioned (tokens/memory/etc.) seem to confuse you. Ignore them for the main point - how do you create good tests for a more advanced AI than you yourself are (or have control of)? For so many human fields of knowledge and ability it is already extremely hard to find meaningful tests to rank humans. No amount of letting an AI optimize on games with fixed and limited rules will automatically improve them on all the other abilities, which are not easily testable or even measurable. And that difficulty pertains for human intelligence trying to measure or test human intelligence. How much more difficult would it be for humans to measure or test super-human intelligence?
> how do you create good tests for a more advanced AI than you yourself are (or have control of)?
I think I answered that. You use more resources to create it than to answer it. When creating it, you can have a team take lots of time, use all sorts of tools, etc. When taking it, you constrain time and tools. That's the general pattern for making a question harder than you can answer.
There are also lots of trapdoor questions, especially in math. An overly simple example is factoring numbers.
I think that can only work with very small, incremental steps in intelligence, though. Will a ten year old human be able to create a meaningful test for an adult with high intelligence? No matter the resources you give the young one, they usually will not be able to.
There also might be thresholds, barriers, where small increments in intelligence are not possible. But that is speculative, I will admit.
The cycle could look something like this: LLMv1 > prompt "get me the code for a better LLM" > paste output to terminal, execute > LLMv2 > repeat.
There are lots of different singularity definitions on wikipedia, most focus on 'intelligence' (eg "an upgradable intelligent agent will eventually enter a 'runaway reaction' of self-improvement cycles, each new and more intelligent generation appearing more and more rapidly"). I think focusing on what 'intelligence' really means, or whether any model is 'truly' intelligent can be be a bit of a distraction. So I just emphasized capabilities in a rather generic sense instead.
It's clear that LLMs can compete with humans on many tasks (coding, creative writing, medical diagnosis, psychotherapy,...), and it is conceivable that they may surpass most humans on those tasks some day. If we have models that can outperform most or all humans on important professional and everyday tasks, and also produce new models that perform even better, maybe at an accelerating rate, through means that are ultimately not interpretable for us, I'd say that's pretty close to a singularity, regardless of whether they're 'truly' intelligent. Even more so if they pass a duck test (walks like a duck, quacks like a duck,...).
What does a better LLM mean? We again… already have terminal usage in AutoGPT. It describes a little what you are talking about. What measurement do you use to test if a language model 1 is better than 2? Accuracy? Self reliance? Is that what a Transformer should be doing? Sure an LLM can compete with humans on all of those tasks. Fantastic. So are we at the singularity now as it’s defined? Do I quit my job now?
Like I said in the original post, clearly telling which of two LLMs is better is difficult, and probably one of the things holding us back from having a singularity right now. But that doesn't seem to be an insurmountable problem. Most people who have used both seem to agree that GPT4 is 'better' than GPT3, so maybe we can formalize that intuition somehow.
I personally think the same as the other comment. Do you really think money is the bottleneck? Let’s assume Sam Altman felt that way. He couldn’t drum up that cash tomorrow if he wanted??? For the pinnacle of human invention. Twitter was purchased for way way more than 50 mil.
Yes? He's invested in GPT-4 which just came out. He needs to retrain a new model with higher context limit as well as build out memory systems probably with vector databases. OpenAI needs to continuously raise money and buy more hardware. I'm not sure what OpenAI is doing but they're obviously resource constrained, else they wouldn't be outpaced by competitors in text2img.
He probably doesn't want to give away more of his company since it's such an obvious winner.
With comments like "lmao" and "brain-like subsystems", I'd go back to hitting the proverbial books (or more likely reddit in your case) rather than empty boasts.
The interpretability of LLMs comes from chain of thought. When a human explains a process, they talk through it; you don't measure their neurons. The interpretability of neuron weights is a sub-symbolic red herring. Causal emergence says the agent is what matters, not the individual synapse. Think like a manager with AI, not like a cellular neurobiologist.
Calm down. It is a language model. People have figured out how to predict the next word for a given prefix. That's very cool and it will definitely have a significant impact on software. But it is not artificial intelligence.
I am becoming somewhat of a broken record, but sigh..
To predict the next token you must reason or have some process that approximates it.
“Given all these various factors, the most likely resolution to our conundrum is: …”
Good luck doing that with any kind of accuracy if you lack intelligence of any kind.
Language is a distraction. These things reason (badly, atm). It is totally unclear how far this goes. It could fizzle out, it could become our overlord.
Does the following satisfy your requirement for "a famous riddle with a paradox change"? Because GPT-4 aces it most of the time.
"Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?"
Furthermore, it will reason if you tell it to reason. In this case it is not necessary, but in general, telling GPT to "think it out loud before giving the answer" will result in a more rigorous application of the rules. Better yet, tell it to come up with a draft answer first, and then self-criticize by analyzing the answer for factual correctness and logical reasoning in a loop.
People will see patterns in this riddle and claim it is “just” altering those. “It’s just a bunch a patterns where you can switch the names, like templates”.
Isn’t everything like that?
“Uhh…”
I had the same discussions about chess.
“It has just memorized a bunch of high level patterns and juggles them around”.
I agree, but now I’m curious what you think chess is.
“Chess is not intelligence.”
Goalposts? Anyway, we move on to Go, the game. Same response. Programming, same, but the angle of the response is different now because programming is “clearly” intelligence incarnate.
“It programs and sometimes correctly, but it is a mirage. It will never attain True Programming.”
I’m sitting on the bench riding this one out. We’ll see.
I fed GPT-4 some really old fashioned spatial reasoning questions (inspired on SHRDLU), which it passed. Then when questioned about unstable configurations (which IIRC SHRDLU could not handle) it passed those too.
So it seems like it is definitely capable of some forms of reasoning. Possibly we both tested it in different ways, and some forms of reasoning are harder for it than others?
What does it have to do to qualify as artificial intelligence?
In my opinion, all the ingredients are there for artificial intelligence. Somebody just needs to stitch everything up. It can understand text, reason about it, identify next steps, write code to execute the steps, understand error messages, and fix the code. That feels like AI.
We seem to keep moving the goalposts as to what intelligence is. I'm not sure; is that ego? Or is it that when we describe a thing and then see it, we say, "That is not what I meant at all; That is not it, at all."
Well I did at one point compare the outputs for two biographies of not very well known people, 3.5 made up half of the data, 4 only said it doesn't know anyone by that name for both. I don't think I ever tried asking about any places or theorems specifically.
As for APIs, well, I try to always provide adequate context with docs; otherwise it may still, on occasion, make up a parameter that doesn't exist or use another library by the same name. Half the time it's really my fault for asking for something that just isn't possible, in a last-ditch effort to see if it can be done in some convoluted way. It sort of assumes "the customer is always right" even when that contradicts what it knows, I guess. It usually gets it right when the thing is at least mostly straightforward to implement, though.
God damn the down votes. I agree with your overall thesis I don’t personally know what intelligence means. However it’s just a tool even if it is AI or whatever you want to label it. It’s extremely cool and powerful too. It scares the shit out of me as well. However I think we are also taking the hype to 11/10 when it should be much lower out of 10 than that…
I have no idea. Probably not. The philosophy is a deep rabbit hole. A fun one to ponder, and I like having that discussion. Maybe the more cynical pragmatic old man swe that I am sees a super powerful calculator kind of. It’s very very cool, and obviously I can’t hold a conversation with a calculator but my analogy is more to say my calculator can do really complex integrals and even show its steps kind of! Especially for my handy TI-89. It was my best friend for all of engineering undergrad. I see chatgpt as a steroid version of that for all of Language. Code is another language, writing is a language, painting in some ways is a language.
So you are 100% certain that no emergent properties can develop from a LLM that transcends the limitations of LLMs. I haven’t read any LLM literature, so I am honestly asking, do you know of anything close to a proof that such emergent properties cannot develop?
Of course, there can be emergent properties. It will be fascinating to watch this research. But it is not going to develop another model that is even more capable.