Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?
I lead a team exploring cutting edge LLM applications and end-user features. It's my intuition from experience that we have a LONG way to go.
GPT-4o / Claude 3.5 are the go-to models for my team. Every combination of technical investment + LLMs yields a new list of potential applications.
For example, combining a human-moderated knowledge graph with an LLM via RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost human-like, similar to a coworker on your team.
If you now give it some predictive / simulation capability - eg: simulating the execution of a task or project like creating a GitHub PR code change, then testing it against an expert bot like the one above for code review - you can have LLMs create reasonable code changes, with automatic review / iteration etc.
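To make the shape of that loop concrete, here's a minimal sketch, assuming an OpenAI-compatible client; search_knowledge_graph stands in for whatever curated retrieval you have, and the prompts / model name are illustrative, not a production recipe:

    from openai import OpenAI

    client = OpenAI()

    def search_knowledge_graph(task: str, k: int = 5) -> list[str]:
        # Stand-in for retrieval over a human-moderated knowledge graph;
        # in practice this would query a graph store / vector index.
        return ["Our HTTP client lives in net/http_client.py and must stay synchronous.",
                "All retries must be logged to the audit trail."][:k]

    def generate_change(task: str, feedback: str = "") -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": f"Write a code change for: {task}\n"
                                  f"Reviewer feedback so far: {feedback}"}])
        return resp.choices[0].message.content

    def expert_review(change: str, task: str) -> str:
        context = "\n\n".join(search_knowledge_graph(task))
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system",
                       "content": "You are a senior reviewer. Use ONLY the provided "
                                  "context about our codebase and processes."},
                      {"role": "user",
                       "content": f"Context:\n{context}\n\nTask: {task}\n\n"
                                  f"Proposed change:\n{change}\n\n"
                                  "List concrete problems, or reply APPROVED."}])
        return resp.choices[0].message.content

    # Generate -> review -> iterate, bounded so it can't loop forever.
    task = "Add retry with exponential backoff to the HTTP client"
    feedback = ""
    for _ in range(3):
        change = generate_change(task, feedback)
        feedback = expert_review(change, task)
        if "APPROVED" in feedback:
            break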
Similarly, there are many more capabilities that you can layer on and expose to LLMs to get increasingly productive outputs from them.
Chasing after model improvements and "GPT-5 will be PhD-level" is moot imo. When did you ever hire a PhD coworker who was productive on day 0? You need to onboard them with human expertise, and then give them execution space / long-term memories etc to be productive.
Model vendors might struggle to build something more intelligent. But my point is that we already have so much intelligence and we don't know what to do with that. There is a LOT you can do with high-schooler level intelligence at super-human scale.
Take a naive example. 200k context windows are now available. Most people, through ChatGPT, type out maybe 1,500 tokens. That's a huge amount of untapped capacity. No human is going to type out 200k tokens of context. That's why we need RAG, and additional forms of input (eg: simulation outcomes), to fully leverage it.
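For a sense of scale, the arithmetic (using the rough numbers above):

    # Back-of-envelope: how much of a 200k window a typical typed prompt uses.
    window = 200_000
    typed = 1_500  # roughly what most people type into ChatGPT
    print(f"{typed / window:.2%} of the window used; "
          f"{window - typed:,} tokens left for retrieved docs, code, simulation output, ...")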
> potential applications
> if you ...
> for example ...
Yes, there seems to be lots of potential. Yes, we can brainstorm things that should work. Yes, there are a lot of examples of incredible things in isolation. But it's a little bit like those YouTube videos showing amazing basketball shots made in 1 try, when in reality lots of failed attempts happened beforehand. Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.
Show me the things you / your team have actually built that have decent retention and metrics concretely proving efficiency improvements.
LLMs are so hit and miss from query to query that if your users don't have a sixth sense for a miss vs a hit, there may not be any efficiency improvement. It's a really hard problem with LLM based tools.
There is so much hype right now and people showing cherry picked examples.
> Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.
This has been my team's experience (and frustration) as well, and has led us to look at using LLMs for classifying / structuring, but not entrusting an LLM with making a decision based on things like a database schema or business logic.
I think the technology and tooling will get there, but the enormous amount of effort spent trying to get the system to "do the right thing" and the nondeterministic nature have really put us into a camp of "let's only allow the LLM to do things we know it is rock-solid at."
> "let's only allow the LLM to do things we know it is rock-solid at."
Even this is insanely hard in my opinion. The one thing you would assume an LLM to excel at is spelling and grammar checking for the English language, but even the top model (GPT-4o) can be insanely stupid/unpredictable at times. Take the following example from my tool:
5 models are asked if the sentence is correct and GPT-4o got it wrong all 5 times. It keeps complaining that GitHub is spelled like Github, when it isn't. Note, only 2 weeks ago, Claude 3.5 Sonnet did the same thing.
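For anyone who wants to reproduce this class of failure, a toy version of that check looks roughly like this (not my actual tool; model names are placeholders for whatever OpenAI-compatible endpoints you have access to):

    from openai import OpenAI

    client = OpenAI()
    sentence = "You can host the project on GitHub and deploy it later."
    models = ["gpt-4o", "gpt-4o-mini"]  # extend with whatever models you have access to

    for model in models:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": "Does this sentence contain any spelling or grammar "
                                  f"errors? Answer YES or NO only.\n\n{sentence}"}])
        print(model, "->", resp.choices[0].message.content.strip())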
I do believe LLM is a game changer, but I'm not convinced it is designed to be public-facing. I see LLM as a power tool for domain experts, and you have to assume whatever it spits out may be wrong, and your process should allow for it.
Edit:
I should add that I'm convinced that not one single model will rule them all. I believe there will be 4 or 5 models that everybody will use and each will be used to challenge one another for accuracy and confidence.
I do contract work on fine-tuning efforts, and I can tell you that most humans aren't designed to be public-facing either.
While LLMs do plenty of awful things, people make the most incredibly stupid mistakes too, and that is what LLMs need to be benchmarked against. The problem is that most of the people evaluating LLMs are better educated than most and often smarter than most. When you see any quantity of prompts input by a representative sample of LLM users, you quickly lose all faith in humanity.
I'm not saying LLMs are good enough. They're not. But we will increasingly find that there are large niches where LLMs are horrible and error prone yet still outperform the people companies are prepared to pay to do the task.
In other words, on one hand you'll have domain experts becoming expert LLM-wranglers. On the other hand you'll have public-facing LLMs eating away at tasks done by low paid labour where people can work around their stupid mistakes with process or just accepting the risk, same as they currently do with undertrained labor.
I have a side point here - There is a certain schizoid aspect to this argument that LLMs and humans make similar mistakes.
This means that on one hand firms are demanding RTO for culture and team work improvements.
While on the other they will be ok with a tool that makes unpredictable errors like humans, but can never be impacted by culture and team work.
These two ideas lie in odd juxtaposition to each other.
We aren't talking about skilled knowledge work on Silicon Valley campuses. We are talking about work that might already have been outsourced to some cube-farm in the Philippines. Or routine office work that probably could already have been automated away by a line-of-business app in the 1980s, but is still done in some small office in Tulsa because it doesn't make sense to pay someone to write the code when 80% of the work is managing the data entry that still needs to be done regardless.
This more marginal labor is going to be easier to replace. Also plenty of the more "elite" type labor will too, as it turns out it is more marginal. Already glue and boilerplate programming work is going this way; there is just so much more to do, and so much important work of figuring out what should be done, that it hasn't displaced programmers yet. But it will for some fraction. WYSIWYG-type websites for small business have come a long way and will only get better, so there will be less need for customization on the margin. Or light design work (like take my logo and plug it into this format for this charity tournament flyer).
That's a lot of weight to put on RTO and why it's being implemented. A company is fully able to have you RTO, maybe even move, and fire you the next day/month/year, and desiring increased teamwork is not mutually exclusive with preparing for layoffs. Plus, I imagine at these companies there are multiple hands all doing things for their own purposes and metrics without knowing what the other hand is doing. Mid-level Jan's Christmas bonus depends on exit interview measurements showing workers leaving due to lack of teamwork; Bob's bonus depends on quickly implementing the code.
> While LLMs do plenty of awful things, people make the most incredibly stupid mistakes too
I am 100% not blaming the LLM, but rather VCs and the media for believing the VCs. Once we get over the hype and people realize there isn't a golden goose, the better off we will be. Once we accept that the LLM is not perfect and is not what we are being sold, I believe we will find a place for it where it will make a huge impact. Unfortunately for OpenAI and others, I don't believe they will play as big of a role as they would like us to believe.
> "I see LLM as a power tool for domain experts, and you have to assume whatever it spits out may be wrong, and your process should allow for it."
this gets to the heart of it for me.
I think LLMs are an incredible tool, providing advanced augmentation on our already developed search capabilities. What advanced user doesn't want to have a colleague they can talk about their specific domain with?
The problem comes from the hyperscaling ambitions of the players who were first in this space. They quickly hyped up the technology beyond what it should have been.
Welcome to capitalism. The market forces will squeeze max value out of them. I imagine that Anthropic and OpenAI will eventually be fully downsized and acquired by their main investors (Amazon and Microsoft) and will simply become part of their generic and faceless AI & ML teams once the current downward stage of the hype cycle completes its course in the next 5-8 years.
Those Apple engineers stated in a very clear tone:
- every time a different result is produced.
- no reasoning capabilities were categorically determined.
So this is it. If you want an LLM, brace for varying results, and if this is okay for your application (say it's about speech or non-critical commands) then off you go.
Otherwise simply forget this approach, particularly when you need reproducible, discrete results.
I don't think it gets any better than that, and nothing so far has indicated it will (with this particular approach to AGI or whatever the wet dream is).
There’s another option here though. Human supervised tasks.
There’s a whole classification of tasks where a human can look at a body of work and determine whether it’s correct or not in far less time than it would take for them to produce the work directly.
As a random example, having LLMs write unit tests.
Which is a good example, because accuracy can be improved significantly with even minor human guidance in tasks like unit tests. Human augmentation is extremely valuable.
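A minimal sketch of that pattern (the function and prompt are illustrative; the key part is that a human reads the draft before anything is committed):

    import inspect
    from openai import OpenAI

    client = OpenAI()

    def slugify(text: str) -> str:
        """Example function under test."""
        return "-".join(text.lower().split())

    source = inspect.getsource(slugify)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Write pytest unit tests for this function, "
                              "including edge cases:\n\n" + source}])
    draft_tests = resp.choices[0].message.content
    print(draft_tests)  # a human reviews and fixes these before they go near the repo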
No, sadly they are just voicing the opinion already voiced by (many) other scientists.
My master's was on text-to-SQL and I can tell you hundreds of papers conclude that seq2seq and the transformer derivatives suck at logic even when you approach logic the symbolic way.
We'd love to see production rules of any sort emerge with the scale of the transformer, but I have yet to read such a paper.
It's also true that Apple hasn't really created any of these technologies themselves; afaik they're using a mostly standard LLM architecture (not invented by Apple) combined with task-specific LoRAs (not invented by Apple). Has Apple actually created any genuinely new technologies or innovations for Apple Intelligence?
I wouldn't expect an LLM to be good at spell checking, actually. The way they tokenize text before manipulating it makes them fairly bad at working with small sequences of letters.
I have had good luck using an LLM as a "sanity checking" layer for transcription output, though. A simple prompt like "is this paragraph coherent" has proven to be a pretty decent way to check the accuracy of whisper transcriptions.
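Roughly the shape of that check, sketched with the OpenAI SDK; the prompt wording and model choice are just illustrative:

    from openai import OpenAI

    client = OpenAI()

    def looks_coherent(paragraph: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Is this paragraph coherent English, or does it look "
                                  "like a garbled transcription? Answer COHERENT or "
                                  f"GARBLED only.\n\n{paragraph}"}])
        return "COHERENT" in resp.choices[0].message.content.upper()

    # Flag suspicious segments for a human to re-listen to, rather than auto-fixing them.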
> I do believe LLM is a game changer, but I'm not convinced it is designed to be public-facing.
I think that, too, is a UX problem.
If you present the output as you do, as simple text on a screen, the average user will read it with the voice of an infallible Star Trek computer and be irritated by every mistake.
But if you present the same thing as a bunch of cartoon characters talking to each other, users might not only be fine with "egg in your face moments", as you put it, they will laugh about them.
The key is to move the user away from the idealistic mental model of what a computer is and does.
As mentioned earlier, unless you have a 6th sense for what is wrong, you won't know. If the message was "make sure to double check our response" then they get a pass, but they know people will just say "why shouldn't I just use Google?"
I was using an LLM to help spot passive voice in my documents and it told me "We're making" was passive and I should change it to "we are making" to make it active.
Leaving aside that "we're" and "we are" are the same, it is absolutely active voice.
In the process of developing my tool, there are only 5 models (the first 5 in my models dropdown list) that I would use as a writing aide. If you used any other model, it really is a crapshoot with how bad they can be.
> It keeps complaining that GitHub is spelled like Github, when it isn't
I feel like this is unfair. That's the only thing it got wrong? But we want it to pass all of our evals, even ones that perhaps a dictionary would be better at solving? Or even an LLM augmented with a dictionary.
My reason for commenting wasn't to say LLMs suck, but rather that we need to get over the honeymoon phase. The fact that GPT-4o (one of the most advanced models, if not the most advanced when it comes to non-programming tasks) hallucinated "Github" as the input should give us pause.
LLM has its place and it will forever change how we think about UX and other things, but we need to realize you really can't create a public-facing solution without significant safeguards, if you don't want egg on your face.
I believe the honeymoon phase has long been finished, even in the mainstream; last year was the year of AI. 2024 has seen nothing substantially good, and the only noteworthy thing is this article finally pushing into the public consciousness that we are past the AI peak, beyond the plateau, and the freefall has already begun.
LLM investors will be reviewing their portfolios and will likely begin declining further investments without clear evidence of profits in the very near future. On the other side, LLM companies will likely try to downplay this and again promise the Moon.
We have built quite a few highly useful LLM applications in my org that have reduced cost and improved outcomes in several domains - fraud detection, credit analysis, customer support, and a variety of other spaces. By and large they operate as cognitive load reducers, but they also handle the vast majority of the work through automation, since in our uses false negatives are not as bad as false positives but the majority of things we analyze are not true positives (99.999%+). As such the LLMs do a great job at anomaly detection and allow us to do tasks that would be prohibitively expensive with humans, whose false positive and negative rates are considerably higher than the LLMs'.
I see these statements often here about “I’ve never seen an effective commercial use of LLMs,” which tells me you aren’t working with very creative and competent people in areas that are amenable to LLMs. In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs. They tend to be highly capable people able to build the end to end tool chains necessary (which is a huge gap) and understand how to compose LLMs in hierarchical agents with effective guard rails. Most ineffectual users of LLMs want them to be lazy buttons that obviate the need to think. They’re not - like any sufficiently powerful tool they require thought up front and are easy to use wrong. This will get better with time as patterns and tools emerge to get the most use out of them in a commercial setting. However the ability to process natural language and use an emergent (if not actual) abductive reasoning is absurdly powerful and was not practically possible 4 years ago - the assertion such an amazing capability in an information or decisioning system is not commercially practical is on the face absurd.
>We have built quite a few highly useful LLM applications in my org that have reduced cost and improved outcomes in several domains
Apps that use LLMs or apps made with LLMs? In either case can you share them?
>which tells me you aren’t working with very creative and competent people
> In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs.
Apps that use LLMs or apps made with LLMs? In either case can you share them?
No one doubts that you can integrate LLMs into an application workflow and get some benefits in certain cases. That has been what the excitement and promise was about all along. They have a demonstrated ability to wrangle, extract, and transform data (mostly correctly) and generate patterns from data and prompts (hit and miss, usually with a lot of human involvement). All of which can be powerful. But outside of textual or visual chatbots or CRUD apps, no one wants to "put up or shut up" with a solid example that the top management of an existing company would sign off on. Only stories about awesome examples they and their friends are working on ... which often turn out to be CRUD apps or textual or visual chatbots. One notable standout is that generative image apps can be quite good in certain circumstances.
So, since you seem to have a real interest and actual examples of this, I am curious to see some that real companies would gamble the company on. And I don't mean some quixotic startup; I mean a company making real money now with customers that is confident enough in that app to risk big on it. Because that last part is what companies do with other (non-LLM) apps. I also know that people aren't perfect and wouldn't expect an LLM to be; I just want to make sure I am not missing something.
really agree with this and I think it's been the general experience: people wanting LLMs to be so great (or making money off them) kind of cherry-pick examples that fit their narrative, which LLMs are good at because they produce amazing results some of the time, like the deluxe broken clock that they are (they're right many many times a day)
at the end of the day though, it's not exactly reliable or particularly transformative when you get past the party tricks
In education at least, we've actively improved efficiency by ~25% across a large swath of educators (direct time saved) - agentic evaluators, tutors and doubt clarifiers. The wins in this industry are clear. And this is that much more time to spend with students.
I also know from 1:1 conversations with my peers in the large-finance world that there, too, the efficiency improvements on multiple fronts are similar.
They are partially hype though. That's what people here are arguing. There are benefits but their valuation is largely hype driven. AI is going to transform industries and humanity, yes. But AI does not mean LLM (even if LLM means AI). LLM raw potential was reached last year with GPT-4. From here on, the value will lie on exploiting the potential we already have to generate clever applications. Just like the internet provided a platform for new services, I expect LLMs to be the same but with a much smaller impact
We’ve found that the text it generates in our RAG application is good, but it cocks up probably 5-10% of the time doing the inline references to the documents which users think is a bug and which we aren’t able to fix. This is static rather than interactively generated too
This is why we are only at the start of exploring the solution space. What applications don't require 100% accuracy? What tooling can we build that enables a human in the loop to choose between options? What options do we have for better testing or checking accuracy? There is a lot more to be done to invent hybrid systems that use other types of models, novel training data, heuristics, or human workflows in novel ways that shore up the shortcomings ... but in aggregate allow us to do new things. It will take many years for us to figure out where this makes the most sense.
I don't think we've even started to get the most value out of current-gen LLMs. For starters, very few people are even looking at sampling, which is a major part of model performance.
The theory behind these models so aggressively lags the engineering that I suspect there are many major improvements to be found just by understanding a bit more about what these models are really doing and making re-designs based on that.
I highly encourage anyone seriously interested in LLMs to start spending more time in the open model space, where you can really take a look inside and play around with the internals. Even if you don't have the resources for model training, I feel understanding sampling and other potential tweaks to the model is worth it (there's lots of neat work on uncertainty estimation, manipulating the initial embeddings the prompts are assigned, intelligent backtracking, etc).
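If you want a feel for what "looking at sampling" means in practice, here's a toy example with a small open model: pull the next-token logits yourself and compare greedy decoding with temperature + top-k sampling (the model choice is arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # any small causal LM is fine for experimenting
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = "The most underrated part of LLM performance is"
    ids = tok(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token

    temperature, top_k = 0.8, 50
    scaled = logits / temperature
    topk_vals, topk_idx = torch.topk(scaled, top_k)   # keep only the top-k candidates
    probs = torch.softmax(topk_vals, dim=-1)
    next_id = topk_idx[torch.multinomial(probs, 1)]   # sample instead of argmax

    print("greedy :", tok.decode(logits.argmax()))
    print("sampled:", tok.decode(next_id))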
And from a practical side I've started to realize that many people have been holding off on building things waiting for "that next big update", but there are so many small, annoying tasks that can be easily automated.
The reason people are holding out is that the current generation of models is still pretty poor in many areas. You can have it craft an email, or review your email, but I wouldn't trust an LLM with anything mission-critical. The accuracy of the generated output is too low to be trusted in most practical applications.
Glib but the reality is that there are lots of cases where you can use an AI in writing but don’t need to entrust it with the whole job blindly.
I mostly use AIs in writing as a glorified grammar checker that sometimes suggests alternate phrasing. I do the initial writing and send it to an AI for review. If I like the suggestions I may incorporate some. Others I ignore.
The only times I use it to write is when I have something like a status report and I’m having a hard time phrasing things. Then I may write a series of bullet points and send that through an AI to flesh it out. Again, that is just the first stage and I take that and do editing to get what I want.
>> have something like a status report and I’m having a hard time phrasing things
I believe the above suggested that this type of email likely doesn't need to be sent. Is anyone really reading the status report? If they read it, what concrete decisions do they make based on it. We all get in this trap of doing what people ask of us but it often isn't what shareholders and customers really care about.
Google became a billion-dollar company by creating the best search and indexing service at the time and putting ads around the results (that and YouTube). They didn't own the answer to the question.
A support ticket is a good middle ground. This is probably the area of most robust enterprise deployment. Synthesizing knowledge to produce a draft reply with some logic either to automatically send it or have human review. There are both shitty and ok systems that save real money with case deflection and even improved satisfaction rates. Partly this works because human responses can also suck, so you are raising a low bar. But it is a real use case with real money and reputation on the line.
> I've started to realize that many people have been holding off on building things waiting for "that next big update"
I’ve noticed this too — I’ve been calling it intellectual deflation. By analogy, why spend now when it may be cheaper in a month? Why do the work now, when it will be easier in a month?
Moore's law became less of a prediction and more of a product roadmap as time went on. It helped coordinate investment and expectations across the entire industry so everyone involved had the same understanding of timelines and benchmarks. I fully believe more investment would've 'bent the curve' of the trend line, but everyone was making money and there wasn't a clear benefit to pushing the edge further.
Or maybe it pushed everyone to innovate faster than they otherwise would’ve? I’m very interested to hear your reasoning for the other case though, and I am not strongly committed to the opposite view, or either view for that matter.
Afaiu “sampling” here, it is controlled with (not only?) topk and temp parameters in e.g. “text generation web ui”. You may find these in other frontends probably too.
This ofc implies local models and that you have a decent cpu + min 64gb of ram to run above 7b-sized model.
> holding off on building things waiting for "that next big update", but there are so many small, annoying tasks that can be easily automated.
Also we only hear / see the examples that are meant to scale. Startups typically offer up something transformative, ready to soak up a segment of a market. And that's hard with the current state of LLMs. When you try their offerings, it's underwhelming. But there are richer, more nuanced, harder-to-reach fruits that are extremely interesting - it's just not clear where they'd scale in and of themselves.
CAN anything be done? At a very low level they’re basically designed to hallucinate text until it looks like something you’re asking for.
It works disturbingly well. But because it doesn’t have any actual intrinsic knowledge it has no way of knowing when it made a “good“ hallucination versus a “bad“ one.
I'm sure people are working on piling things on top to try to influence what gets generated, or to catch and move away from errors other layers spot... but how much effort and resources will be needed to make it "good enough" that people don't worry about this anymore?
In my mind the core problem is people are trying to use these for things they’re unsuitable for. Asking fact-based questions is asking for trouble. There isn’t much of a wrong answer if you wanted to generate a bedtime story or a bunch of test data that looks sort of like an example you give it.
If you ask it to find law cases on a specific point you’re going to raise a judge‘s ire, as many have already found.
Semantic search without LLMs is already making a dent. It still gives traditional results that need to be human processed, but you can get "better" search results.
And with that there is a body of work on "groundedness" that basically post-processes output to compare it against its source material. It can still result in logic errors and has a base error rate itself, but it can at least ensure you have clear citations for factual claims that match real documents, though it doesn't fully ensure they are being referenced correctly (which is already the case even with real papers produced by humans).
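A cheap version of that kind of post-check is an embedding-similarity pass over generated sentences versus the retrieved chunks; the model, threshold, and example strings below are arbitrary, and this is only a sketch of the idea, not any vendor's groundedness API:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    source_chunks = ["GitHub, founded in 2008, is a hosting service for Git repositories."]
    answer_sentences = ["GitHub was founded in 2008.",
                        "It is written entirely in Rust."]

    src_emb = model.encode(source_chunks, convert_to_tensor=True)
    for sent in answer_sentences:
        sim = util.cos_sim(model.encode(sent, convert_to_tensor=True), src_emb).max().item()
        status = "looks grounded" if sim > 0.6 else "UNSUPPORTED - route to citation check"
        print(f"{sim:.2f}  {status}  {sent}")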
Also consider the baseline isn't perfection, it is a benchmark against real humans. Accuracy is getting much better in certain domains where we have a good corpora. Part of assessing the accuracy of a system is going to be about determining if the generated content is "in distribution" of its training data. There is progress being made in this direction, so we could perhaps do a better job at the application level of making use of a "confidence" score of some kind maybe even taking that into account in a chain of thought like reasoning step.
People keep finding "obviously wrong" hallucinations that seem like proof things are still crap. But these systems keep getting better on benchmarks looking at retrieval accuracy. And the benchmarks keep getting better as people point out deficiencies in them. Perfection might not be possible, but consistently better than the average human seems in reach, and better than that seems feasible too. The challenge is the class of mistakes might look different even if the error rate overall is lower.
what do you want done about it? Hallucination is an intrinsic part of how LLMs work. What makes a hallucination is the inconsistency between the hallucinated concept and the reality. Reality is not part of how LLMs work. They do amazing things but at the end of the day they are elaborate statistical machines.
Look behind the veil and see LLMs for what they really are and you will maximise their utility, temper your expectations and save you disappointment
Would you have any suggestions on how to play with the internals of these open models? I don't understand LLMs well, and would love to spend some experimenting, but I don't know where to start. Are any projects more appropriate for neophytes?
Exactly, I think the current crop of models is capable of solving a lot of non-first-world problems. Many of them don't need full AGI to solve, especially if we start thinking outside Silicon Valley.
The scaling laws may be dead. Does this mean the end of LLM advances? Absolutely not.
There are many different ways to improve LLM capabilities. Everyone was mostly focused on the scaling laws because that worked extremely well (actually surprising most of the researchers).
But if you're keeping an eye on the scientific papers coming out about AI, you've seen the astounding amount of research going on with some very good results, that'll probably take at least several months to trickle down to production systems. Thousands of extremely bright people in AI labs all across the world are working on finding the next trick that boosts AI.
One random example is test-time compute: just give the AI more time to think. This is basically what O1 does. A recent research paper suggests using it is roughly equivalent to an order of magnitude more parameters, performance wise. (source for the curious: https://lnkd.in/duDST65P)
Another example that sounds bonkers but apparently works is quantization: reducing the precision of each parameter to 1.58 bits (ie only using the values -1, 0, 1). This uses ~10x less space for the same parameter count (compared to the standard 16-bit format), and since AI operations are actually memory-limited, that directly corresponds to a ~10x decrease in costs: https://lnkd.in/ddvuzaYp
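The arithmetic behind that claim, for a concrete (made-up) 70B-parameter model:

    # Back-of-envelope memory math for the 1.58-bit claim (illustrative only).
    params = 70e9                          # a 70B-parameter model
    fp16_gb = params * 16 / 8 / 1e9        # ~140 GB at 16 bits per weight
    bitnet_gb = params * 1.58 / 8 / 1e9    # ~13.8 GB at 1.58 bits per weight
    print(f"fp16: {fp16_gb:.0f} GB, 1.58-bit: {bitnet_gb:.1f} GB, "
          f"ratio: {fp16_gb / bitnet_gb:.1f}x")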
(Quite apart from improvements like these, we shouldn't forget that not all AIs are LLMs. There's been tremendous advance in AI systems for image, audio and video generation, interpretation and manipulation, and they also don't show signs of stopping, and there's a possibility that a new or hybrid architecture for textual AI might be developed.)
There are way too many personal definitions of what "Moore's Law" even is to have a discussion without deciding on a shared definition before hand.
But Goodhart's law ("When a measure becomes a target, it ceases to be a good measure") directly applies here: Moore's Law was used to set long-term plans at semiconductor companies, and Moore didn't have empirical evidence it was even going to continue.
If you, say, arbitrarily pick CPU performance, or worse, single-core performance as your measurement, it hasn't held for well over a decade.
If you hold minimum feature size without regard to cost, it is still holding.
What you want to prove usually dictates what interpretation you make.
That said, the scaling law is still unknown, but you can game it as much as you want in similar ways.
GPT-4 was already hinting at an asymptote on MMLU, but the question is whether that is valid for real work etc...
Time will tell, but I am seeing far less optimism from my sources, but that is just anecdotal.
You are missing the economic component: it isn't just how small a transistor can be, it was really about how many transistors you can get for your money. So even when we reach terminal density, we probably haven't reached terminal economics.
I didn't say we have currently reached a limit. I am saying that there obviously is a limit (at some point). So scaling cannot go on forever. This is a counterpoint to the dubious analogy with deep learning.
The limits are engineering, not physics. Atoms need not be a barrier for a long time if you can go fully 3D, for example, but manufacturing challenges, power and heat get in the way long before that.
Then you can go ultra-wide in terms of cores, dispatchers and vectors (essentially building bigger and bigger chips), but an algorithm which can't exploit that will be little faster on today's chips than on a 4790K from ten years ago.
I think you're playing a different game than the Sam Altmans of the world. The level of investment and profit they are looking for can only be justified by creating AGI.
The > 100 P/E ratios we are already seeing can't be justified by something as quotidian as the exceptionally good productivity tools you're talking about.
It seems you are missing a lot of "ifs" in that hypothetical!
Nobody knows how things like coding assistants or other AI applications will pan out. Maybe it'll be Oracle selling Meta-licenced solutions that gets the lion's share of the market. Maybe custom coding goes away for many business applications as off-the-shelf solutions get smarter.
A future where all that AI (or some hypothetical AGI) changes is work being done by humans to the same work being done by machines seems way too linear.
> you are missing a lot of "ifs" in that hypothetical
The big one being I'm not assuming AGI. Low-level coding tasks, the kind frequently outsourced, are within the realm of being competitive with offshoring with known methods. My point is we don't need to assume AGI for these valuations to make sense.
Current AI coding assistants are best at writing functions or adding minor features to an existing code base. They are not agentic systems that can develop an entire solution from scratch given a specification, which in my experience is more typical of the work that is being outsourced. AI is a tool whose full-cycle productivity benefit seems questionable. It is not a replacement for a human.
> they are not agentic systems that can develop an entire solution from scratch given a specification, which in my experience is more typical of the work that is being outsourced
If there is one domain where we're seeing tangible progress from AI, it's in working towards this goal. Difficult projects aren't in scope. But most tech, especially most tech branded IT, is not difficult. Everyone doesn't need an inventory or customer-complaint system designed from scratch. Current AI is good at cutting through that cruft.
There have been off-the-shelf solutions for so many common software use cases, for decades now. I think the reason we still see so much custom software is that the devil is always in the details, and strict details are not an LLM's strong suit.
LLMs are, in my opinion, hamstrung at the starting gate when it comes to replacing software teams, as they would need to understand complex business requirements perfectly, which we know they cannot. Humans can't either. It takes a business requirements / integration logic / code generation pipeline, and I think the industry is focused on code generation and not that integration step.
I think there needs to be a re-imagining of how software is built by and for interaction with AI if it were ever to take over from human software teams, rather than trying to get AI to reflect what humans do.
This. Code is written by humans for humans. LLMs cannot compete no matter how much data you throw at them. In a world in which software is written by AI, the code likely won't be readable by humans. And that is dangerous for anything where people's health, privacy, finances or security is involved.
There are a number of agentic systems that can develop more complex solutions. Just a few off the top of my head: Pythagora, Devin, OpenHands, Fume, Tusk, Replit, Codebuff, Vly. I'm sure I've missed a bunch.
Are they good enough to replace a human yet? Questionable[0], but they are improving.
[0] You wouldn't believe how low the outsourcing contractors' quality can go. Easily surpassed by current AI systems :) That's a very low bar tho.
I don't know what your experience with outsourcing is, but people outsource full projects, not the writing of a couple of methods. With LLMs still unable to fully understand relatively simple stuff, you can't expect them to deliver a project whose specification (like most software projects) contains ambiguities that only an experienced dev can detect and ask deep questions about, probing the intention and purpose of the project. LLMs are nowhere near that: being able to handle external uncertainty and turn it into certainty, to explain why technical decisions were made, to understand the purpose of a project and how the implementation matches it, to handle the overall uncertainties of writing code alongside other people's code. All this is stuff outsourced teams do well, but LLMs won't be anywhere near good enough for at least a decade. I am calling it.
If the AI business is a bit more mundane than Altman thinks and there are diminishing returns, the market is going to be even more commodified than it already is, and you're not going to make any margins or somehow own the entire market. That's already the case: Anthropic works about as well, there are other companies a few months behind, and open source is like a year behind.
That's literally Zucc's entire play: in 5 years this stuff is going to be so abundant you'll get access to good-enough models for pennies, and he'll win because he can slap ads on it, while OpenAI sits there on its gargantuan research costs.
I'm not sure about that. NVIDIA seems to stay in a dominant position as long as the race to AI remains intact, but the path to it seems unsure. They are selling a general purpose AI-accelerator that supports the unknown path.
Once massively useful AI has been achieved, or it's been determined that LLMs are it, then it becomes a race to the bottom as GOOG/MSFT/AMZN/META/etc design/deploy more specialized accelerators to deliver this final form solution as cheaply as possible.
Yeah they're the shovel sellers of this particular goldrush.
Most other businesses trying to actually use LLMs are the riskier ones, including OpenAI, IMO (though OpenAI is perhaps the least risky due to brand recognition).
Nvidia is more likely to become CSCO or INTC, but as far as I can tell that's still a few years off - unless of course there is weakness in the broader economy that accelerates the pressure on investors.
Right. I've been saying for a while that if all LLM development stopped entirely and we were stuck with the models we have right now (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1/2, Qwen 2.5 etc) we could still get multiple years worth of advances just out of those existing models. There is SO MUCH we haven't figured out about how to use them yet.
> There is SO MUCH we haven't figured out about how to use them yet.
I mean, it's pretty clear to me they're a potentially great human-machine interface, but trying to make LLMs - in their current fundamental form - a reliable computational tool... well, at best it's an expensive hack, but it's just not the right tool for the job.
I expect the next leap forward will require some orthogonal discovery and lead to a different kind of tool. But perhaps we'll continue to use LLMs as we know them now for what they're good at - language.
One of the biggest challenges in learning how to use and build on LLMs is figuring out how to work productively with a technology that - unlike most computers - is inherently unreliable and non-deterministic.
It's possible, but it's not at all obvious and requires a slightly skewed way of looking at them.
This really reminds me of a trend years ago to create probabilistic programming constructs. I think it was just a trend way ahead of its time. Typical software engineers tend to be very ill-suited to think in probabilities and how to build reasonably reliable systems around them.
My team and I also develop with these models every day, and I completely agree. If models stall at current levels, it will take 10 (or more) years for us to capture most of the value they offer. There's so much work out there to automate and so many workflows to enhance with these "not quite AGI-level" models. And if peak model performance remains the same but cost continues to drop, that opens up vastly more applications as well.
> combining a human-moderated knowledge graph with an LLM via RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost human-like, similar to a coworker on your team
It's been a while though; we've had great models now for 18 months plus. Why are we still yet to see these types of applications rolling out on a wide scale?
My anecdotal experience is that almost universally, the 90-95% type accuracy you get from them is just not good enough. Which is to say, having something be wrong 10% or even 5% of the time is worse than not having it at all. At best, you need to implement applications like that in an entirely new paradigm that is designed to extract value without bearing the costs of the risks.
It doesn't mean LLMs can't be useful, but they are kind of stuck with applications that inherently mesh with human oversight (like programming etc). And the thing about those is that they don't really scale, because the human oversight has to scale up with whatever the LLM is doing.
> you can have LLMs create reasonable code changes, with automatic review / iteration etc.
Nobody who takes code health and sustainability seriously wants to hear this. You absolutely do not want to be in a position where something breaks, but your last 50 commits were all written and reviewed by an LLM. Now you have to go back and review them all with human eyes just to get a handle on how things broke, while customers suffer. At this scale, it's an effort multiplier, not an effort reducer.
It's still good for generating little bits of boilerplate, though.
> Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?
Certainly not.
But technology is all about stacks. Each layer strives to improve, right up through UX and business value. The uses for 1µm chips had not been exhausted in 1989 when the 486 shipped in 800nm. 250nm still had tons of unexplored uses when the Pentium 4 shipped on 90nm.
Talking about scaling at the model level is like talking about transistor density for silicon: it's interesting, and relevant, and we should care... but it is not the sole determinant of what use cases can be built and what user value there is.
I have tried a few AI coding tools and always found them impressive but I don't really need something to autocomplete obvious code cases.
Is there an AI tool that can ingest a codebase and locate code based on abstract questions? Like: "I need to invalidate customers who haven't logged in for a month" and it can locate things like relevant DB tables, controllers, services, etc.
Cursor (Claude behind the scenes) can do that, however as always, your mileage may vary.
I tried building a whole codebase inspector, essentially what you are referring to, with Gemini's 2 million token context window, but had trouble with their API when the payload got large. Just a 500 error with no additional info, so...
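An embedding-based approach sidesteps the giant-context problem entirely. Here's a rough sketch under the assumption of a local sentence-transformers model and a src/ directory of Python files (a real tool would chunk smarter and add an LLM pass over the hits):

    from pathlib import Path
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    files = list(Path("src").rglob("*.py"))
    chunks = [(f, f.read_text(errors="ignore")[:2000]) for f in files]
    corpus_emb = model.encode([text for _, text in chunks], convert_to_tensor=True)

    query = "I need to invalidate customers who haven't logged in for a month"
    hits = util.semantic_search(model.encode(query, convert_to_tensor=True),
                                corpus_emb, top_k=5)[0]
    for h in hits:
        print(f"{h['score']:.2f}  {chunks[h['corpus_id']][0]}")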
I've played around with Claude and larger docs and it's honestly been a bit of a crapshoot, it feels like only some of the information gets into the prompt as the doc gets larger. They're great for converting PDF tables to more usable formats though.
The main difference between GPT5 and a PhD-level new hire is that the new hire will autonomously go out, deliver, and take on harder tasks with much less guidance than GPT5 will ever require. So much of human intelligence is about interacting with peers.
I don't know how many team meetings PhD students have, but I do know about software development jobs with 15-minute daily standups, and that length of meeting at 120 words per minute for 5 days a week, 48 weeks per year of a 3-year PhD is 1,296,000 words.
1. The person I'm replying to is hypothesising about a future, not yet existent, version, GPT5. Current quality limits don't tell you jack about a hypothetical future, especially one that may not ever happen because money.
2. I'm not commenting on the quality, because they were writing about something that doesn't exist and therefore that's clearly just a given for the discussion. The only thing I was adding is that humans also need guidance, and quite a lot of it — even just a two-week sprint's worth of 15 minute daily stand-up meetings is 18,000 words, which is well beyond the point where I'd have given up prompting an LLM and done the thing myself.
Rumours have been in abundance since GPT-4 came out due to the lack of clarity, but that lack of clarity seems to also exist within the companies themselves.
OpenAI and Anthropic certainly seem to be doing a lot of product stuff, but at the same time the only reason people have for saying OpenAI is not making a profit is all the money they're also spending on training new models — I've yet to use o1, it's still in beta and is only 2 months old (how long was Gmail in "beta", 5 years?)
I also don't know how much self-training they do, training on signals from the model's output and how users rate that output, only that (1) it's more than none, (2) some models like Phi-3 use at least some synthetic data[0], and (3) making a model to predict how users will rate the output was one of the previous big breakthroughs.
If they were to train on almost all their own output, estimating API costs as approximately actual costs, and given the claimed[1] public financial statements, that's on the order of a quadrillion (1e15) tokens, compared to the mere ~1e13 claimed for some of the larger models.
[1] I've not found the official sources nor do I know where to look for them, all I see are news websites reporting on the numbers without giving citations I can chase up
Yes, but literally anybody can do all those things. So while there will be many opportunities for new features (new ways of combining data), there will be few business opportunities.
HN always says this, and it's always wrong. A technical implementation that's easy, or readily available, does not mean that a successful company can't be built on it. Last year, people were saying "OpenAI doesn't have a moat." 15 years before that, they were saying "Dropbox is just a couple of cron jobs, it'll fail in a few months."
The meaning here is different. What I'm saying is that big companies like OpenAI will always strive to make a generic AI, such that anyone can do basically anything using AI. The big companies therefore will indeed (like you say) have a profitable business, but few others will.
I am definitely not an expert, nor do I have inside information on the directions of research that these companies are exploring.
Yes, existing LLMs are useful. Yes, there are many more things we can do with this tech.
However, existing SOTA models are large, expensive to run, still hallucinate, fail simple logic tests, fail to do things a poorly trained human can do on autopilot, etc.
The performance of LLMs is extremely variable, and it is hard to anticipate failure.
Many potential applications of this technology will not tolerate this level of uncertainty. Worse solutions with predictable and well understood shortcomings will dominate.
I think there’s a long way to go also. I think people expected that AI would eventually be like a “point and shoot” where you would tell it to go do some complicated task, or sillier yet, take over someone’s entire job.
More realistically it’s like a really great sidekick for doing very specific mundane but otherwise non deterministic tasks.
I think we’ll start to see AI permeate into nearly every back office job out there, but as a series of tools that help the human work faster. Not as one big brain that replaces the human.
I have no data, but I whole-heartedly agree. Well, perhaps not “scam”, but definitely oversold. One of my best undergrad professors taught me the adage “don’t expect a model to do what a human expert cannot”, and I think it’s still a good rule of thumb. Giving someone an entire book to read before answering your question might help, but it would help way, way more to give them a few paragraphs that you know are actually relevant.
In my experience, the reality of long context windows doesn’t live up to the hype. When you’re iterating on something, whether it's code, text, or any document, you end up with multiple versions layered in the context. Every time you revise, those earlier versions stick around, even though only the latest one is the "most correct".
What gets pushed out isn’t the last version of the document itself (since it’s FIFO), but the important parts of the conversation—things like the rationale, requirements, or any context the model needs to understand why it’s making changes. So, instead of being helpful, that extra capacity just gets filled with old, repetitive chunks that have to be processed every time, muddying up the output. This isn’t just an issue with code; it happens with any kind of document editing where you’re going back and forth, trying to refine the result.
Sometimes I feel the way to "resolve" this is to instead go back and edit some earlier portion of the chat to update it with the "new requirements" that I didn't even know I had until I walked down some rabbit hole. What I end up with is almost like a threaded conversation with the LLM. Like, I sometimes wish these LLM chatbots explicitly treated the conversation as if it were threaded. They do support basically my use case by letting you toggle between different edits to your prompts, but it is pretty limited and you cannot go back and edit things if you do some operations (eg: attach a file).
Speaking of context, it's also hard to know what things like ChatGPT add to their context in the first place. Many times I'll attach a file or something and discover it didn't "read" the file into its context. Or I'll watch it fire up a python program it writes that does nothing but echo the file into its context.
I think there is still a lot of untapped potential in strategically manipulating what gets placed into the context window at all. For example only present the LLM with the latest and greatest of a document and not all the previous revisions in the thread.
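As a sketch, the pruning can be as dumb as rebuilding the message list before every call so the model only ever sees the rationale plus the newest revision (the function and field names here are made up):

    def build_context(system_prompt: str, rationale: list[str],
                      revisions: list[str], user_request: str) -> list[dict]:
        # Keep the "why" (requirements, decisions), drop every stale draft
        # except the latest one, then append the new request.
        messages = [{"role": "system", "content": system_prompt}]
        for note in rationale:
            messages.append({"role": "user", "content": note})
        messages.append({"role": "user",
                         "content": "Current draft (latest revision only):\n" + revisions[-1]})
        messages.append({"role": "user", "content": user_request})
        return messages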
> Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?
IMO we've not even exhausted the options for spreadsheets, let alone LLMs.
And the reason I'm thinking of spreadsheets is that they, like LLMs, are very hard to win big on even despite the value they bring. Not "no moat" (that gets parroted stochastically in threads like these), but the moat is elsewhere.
I want to stuff a transcript of a 3 hour podcast into some LLM API and have it summarize it by: segmenting by topic changes, keeping the timestamps, and then summarizing each segment.
I wasn't able to get it to do it with the Anthropic or OpenAI chat completion APIs. Can someone explain why? I don't think the 200K token window actually works; is it looking sequentially, or is it really looking at the whole thing at once, or something?
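I can't say why the single 200K-token prompt underperforms, but a map-reduce workaround has worked better for me: chunk by timestamp, summarize per chunk, then merge. Rough sketch (the filename, chunk size, model names and prompts are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    def summarize_chunk(chunk: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "This is part of a podcast transcript with timestamps. "
                                  "Note topic changes, keep the timestamps, and summarize "
                                  "each topic:\n\n" + chunk}])
        return resp.choices[0].message.content

    lines = open("podcast_transcript.txt").read().splitlines()
    chunks = ["\n".join(lines[i:i + 400]) for i in range(0, len(lines), 400)]

    partials = [summarize_chunk(c) for c in chunks]
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Merge these per-chunk summaries into one topic-segmented "
                              "summary, preserving timestamps:\n\n" + "\n\n".join(partials)}])
    print(final.choices[0].message.content)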
I have yet to see LLMs provide a positive net value in the first place. They have a long way to go to weigh up for its negative uses in the form of polluting the commons that is the web, propaganda use, etc.
Beyond just RAG, I'm fairly bullish on finetuning. For example, Qwen2.5-Coder-32B-Instruct is much better than Qwen2.5-72B-Instruct at coding... Despite simply being a smaller version of the same model, finetuned on code. It's on par with Sonnet 3.5 and 4o on most benchmarks, whereas the simple chat-tuned 72B model is much weaker.
And while Qwen2.5-Coder-32B-Instruct is a pretty advanced finetune — it was trained on an extra 5 trillion tokens — even smaller finetunes have done really well. For example, Dracarys-72B, which was a simpler finetune of Qwen2.5-72B using a modified version of DPO on a handmade set of answers to GSM8K, ARC, and HellaSwag, significantly outperforms the base Qwen2.5-72B model on the aider coding benchmarks.
There's a lot of intelligence we're leaving on the floor, because everyone is just prompting generic chat-tuned models! If you tune it to do something else, it'll be really good at the something else.
The premise of deep learning is the automated 'absorption' of knowledge.
If we're back to curating it by hand and imparting it by writing code manually, how exactly are these systems an improvement on the 80's idea of building expert systems?
To a certain extent I think we are getting a better understanding of what LLMs can do, and my estimation for the next ten years is more like "best UI ever" rather than "LLMs will replace humanity". Now, best UI ever is something that can certainly deliver a lot of value: 80% of all buttons in a car should be replaced by actually good voice control, and I think that is where we are going to see a lot of very interesting applications: Hey washing machine, this is two t-shirts and a jeans. (The washing machine can then figure out its program by itself; I don't want to memorize the table in the manual.)
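The plumbing for that kind of interface already exists as tool calling; here's a hedged sketch where the schema and program names are invented for illustration:

    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "start_wash",
            "description": "Start the washing machine with a chosen program.",
            "parameters": {
                "type": "object",
                "properties": {
                    "program": {"type": "string",
                                "enum": ["delicates", "cotton", "jeans", "wool"]},
                    "temperature_c": {"type": "integer"},
                },
                "required": ["program"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        tools=tools,
        messages=[{"role": "user",
                   "content": "Hey washing machine, this is two t-shirts and a jeans."}])

    tool_calls = resp.choices[0].message.tool_calls  # None if the model just replied in text
    if tool_calls:
        call = tool_calls[0]
        print(call.function.name, json.loads(call.function.arguments))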
I think buttons should not be replaced, but rather augmented with voice control. I certainly want to be able to adjust the air conditioning or use my washing machine while listening to music or in an otherwise noisy environment.
To each their own, but I don’t look forward to having my kids yelling, a podcast in my ears and having to explain to my tumbler that wool must be spun at 1000 RPM. Humans have varying preferences when it comes to communication and sensing, making our machine interactions favor the extroverted talkative exhibitionists is really only one modality.
Sure, there's going to be a lot of automation that can be built using current GPT-4 level LLMs, even if they don't get much better from here.
However, this is better thought of as "business logic scripting/automation", not the magic employee-replacing AGI that would be the revolution some people are expecting. Maybe you can now build a slightly less shitty automated telephone response system to piss your customers off with.
In my view, an escape hatch if we are truly stuck would be radical speed-ups (like Cerebras) in compute time. If we get outputs in milliseconds instead of seconds and at much lower cost, it would make backtracking viable. This won't allow AGI, but it could make a new class of apps possible.
The current models are very powerful and we definitely didn't get most out of them yet. We are getting more and more out of them every week when we release new versions of our toolkits. So if this is it; please make it faster and take less energy. We'll be fine until the next AI spring.
No, we have not even scratched the surface of what current-gen LLMs can do for an organization which puts the correct data into them.
If indeed the "GPT-5!" arms race has calmed down, it should help everyone focus on the possible, their own goals, and thus what AI capabilities to deploy.
Just as there won't be a "Silver Bullet" next gen model, the point about Correct Data In is also crucial. Nothing is 'free' not even if you pay a vendor or integrator. You, the decision making organization, must dedicate focus to putting data into your new AI systems or not.
It will look like the dawn of original IBM, and mechanical data tabulation, in retrospect once we learn how to leverage this pattern to its full potential.
Well I have a question for you: do you think this format of AI can actually think?
I.e. can it ruminate on the data it's ingested, and rather than returning the response of highest probability, return something original?
I think that's the key. If LLMs can't ultimately do that, there's still a lot to be gained from utilising the speed and fluidly scalable resources of computers.
But like all the top tech companies know, it's not quantity of bodies in seats that matters but talent, the thing that's going to prevail is raw intelligence. If it can't think better than us, just process data faster and more voluminously but still needing human verification, we're on an asymptotic path.
I think there's a ton to be tapped based on the current state of the art.
As a developer, I'm making much more progress using the SOTA (Claude 3.5) as a Socratic interrogator. I'm brainstorming a project, give it my current thoughts, and then ask it to prompt me with good follow-up questions and turn general ideas into a specific, detailed project plan, next steps, open questions, and work log template. Huge productivity boost, but definitely not replacing me as an engineer. I specifically prompt it to not give me solutions, but rather, to just ask good questions.
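Roughly the shape of it, sketched with the Anthropic SDK; the system prompt below is illustrative, not my exact wording:

    import anthropic

    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=("You are a Socratic interrogator for project planning. Do NOT give "
                "solutions. Ask pointed follow-up questions, then turn my answers into "
                "a concrete project plan, next steps, open questions, and a work log "
                "template."),
        messages=[{"role": "user",
                   "content": "Here are my current thoughts on the project: ..."}])
    print(resp.content[0].text)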
I've also used Claude 3.5 as (more or less) a free arbitrator. Last week, I was in a disagreement with a colleague, who was clearly being disingenuous by offering to do something she later reneged on, and evading questions about follow up. Rather than deal with organizational politics, I sent the transcript to Claude for an unbiased evaluation, and it "objectively" confirmed what had been frustrating me. I think there's a huge opportunity here to use these things to detect and call out obviously antisocial behavior in organizations (my CEO is intrigued, we'll see where it goes). Similarly, in our legal system, as an ultra-low-cost arbitrator or judge for minor disputes (that could of course be appealed to human judges). Seems like the level of reasoning in Claude 3.5 is good enough for that.
> For example, combining a human-moderated knowledge graph with an LLM with RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost human-like similar to a coworker in your team.
I'd love to hear about this. I applied to YC WC 25 with research/insight/an initial researchy prototype built on top of GPT4+finetuning about something along this idea. Less powerful than you describe, but it also works without the human moderated KG.
Nowhere near, but the market seems to have priced in that scaling would continue to have a near-linear effect on capability. That's not happening, and that's the issue the article is concerned with.
Looks like you independently arrived at the original context that language models existed in: as interfaces for deeper knowledge systems in chatbots.
But the knowledge system here is doing the brunt of the work, and progressing past its own limitations runs right back into the pitfalls of the rules-based AI winter. That's not an engineering problem, it's a foundational mathematics problem that only a few people are seriously working on.
Voice for LLMs is surprisingly good. I'd love to see LLMs used in more systems like cars and in-home automation. Whatever cars use today, and Alexa in the home, is simply much worse than what we get with ChatGPT voice.
The context is a strict limitation if you work with data analysis or knowledge bases. Embeddings work, but the products we see launched left and right mostly do not offer such capabilities at all. In that case, most of these products remain decent chatbots and not much more.
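To illustrate the embeddings point, a rough sketch of the usual workaround: rank knowledge-base chunks by similarity to the question and only feed the model what fits. It assumes the OpenAI embeddings endpoint and a naive character budget; both are stand-ins, not a recommendation:

```python
# Rough sketch of embedding-based selection: rank knowledge-base chunks by
# similarity to the question and keep only what fits the context window.
# Model name and the character budget are assumptions for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def select_context(question: str, chunks: list[str], max_chars: int = 8000) -> str:
    q_vec = embed([question])[0]
    c_vecs = embed(chunks)
    # Cosine similarity between the question and every chunk.
    sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))
    picked, used = [], 0
    for i in np.argsort(sims)[::-1]:
        if used + len(chunks[i]) > max_chars:
            break
        picked.append(chunks[i])
        used += len(chunks[i])
    return "\n\n".join(picked)  # prepend this to the prompt instead of the whole KB
```

The heavy lifting happens outside the model; the context window only ever sees the selected slice.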
For coding, LLMs certainly are helpful, but I prefer local models over anything on offer right now. There is just much more potential here.
Great question. I'm very confident in my answer, even though it's in the minority here: we're not even close to exhausting the potential.
Imagine that our current capabilities are like the Model-T. There remain many improvements to be made to this passenger-transportation product, with RAG being a great common theme among them. People will use chatbots with much more permissive interfaces instead of clicking through menus.
But all of that’s just the start, the short term, the maturation of this consumer product; the really scary/exciting part comes when the technology reaches saturation, and opens up new possibilities for itself. In the Model-T metaphor, this is analogous to how highways have (arguably) transformed America beyond anyone’s wildest dreams, changing the course of various historical events (eg WWII industrialization, 60s & 70s white flight, early 2000s housing crisis) so much it’s hard to imagine what the country would look like without them. Now, automobiles are not simply passenger transportation, but the bedrock of our commerce, our military, and probably more — through ubiquity alone they unlocked new forms of themselves.
For those doubting my utopian/apocalyptic rhetoric, I implore you to ask yourself one simple question: why are so many experts so worried about AGI? They've been leaving OpenAI in droves, and that's ultimately what the governance kerfuffle there was about. Hinton, a Turing award winner, gave up $$$ to doom-say full time. Why?
My hint is that if your answer involves fewer than a thousand specialized LLMs per unified system, then you're not thinking big enough.
FYI, I find this line of reasoning to be unconvincing both logically and by counter-example ("why are so many experts so worried about the Y2K bug?")
Personally, I don't find AI foom or AI doom predictions to be probable but I do think there are more convincing arguments for your position than you're making here.
Fair enough, well put to both of these responses! I’m certainly biased, and can see how the events that truly scare me (after already assessing the technology on my own and finding it to be More Important Than Fire Or Electricity) don’t make very convincing arguments on their own.
For us optimistic doomers, the AI conversation seems similar to the (early-2000s) climate change debate: we see a wave of dire warnings coming from scientific experts that are all too often dismissed, either out of hand due to their scale, or on the word of an expert in an adjacent-ish field. Of course, there's more dissent among AI researchers than there was among climate scientists, but I hope you see where I'm coming from nonetheless; it's a dynamic that makes it hard to see things from the other side, so to speak.
At this point I’ve pretty much given up convincing people on HackerNews, it’s just cathartic to give my piece and let people take it or leave it. If anyone wants to bring the convo down from industry trends into technical details, I’d love to engage tho :)
I've written (and am writing) extensively about why I think AGI can't be as bad as everyone thinks, from a first-principles (i.e. physics and math) standpoint:
Long story short: it's easy to get enamored with an agent spitting out tokens, but reality and engineering are far, far more complex than that (by orders of magnitude).
There are all sorts of valuable things to explore and build with what we have already.
But understanding how likely it is that we will (or will not) see new models quickly and dramatically improve on what we have "because scaling" seems like valuable context for everyone in the ecosystem to make decisions.
it's the equivalent of the "we overestimate the impact of technology in the short-term and underestimate the effect in the long run" quote.
everyone is looking at llm scores & strawberry gotchas while ignoring the trillions in market potential of replacing existing systems and (yes) people with the current capabilities. identifying the use cases, finetuning the models and (most importantly) actually rolling this out in existing organizations/processes/systems will be the challenge long before the base models' capabilities are.
it is worth working on those issues now to get the ball rolling; switching out your models for future, more capable ones will be the easy part later on.
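in the same spirit, keeping business logic behind a thin, provider-agnostic interface is what makes the later model swap cheap. a hedged sketch, where the class and function names are made up for illustration:

```python
# Hedged sketch: keep application code coupled only to a small protocol, so a
# future, more capable model becomes a config change. Names here are invented
# for illustration, not an established library.
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system: str, user: str) -> str: ...

@dataclass
class AnthropicModel:
    name: str = "claude-3-5-sonnet-20240620"

    def complete(self, system: str, user: str) -> str:
        import anthropic
        client = anthropic.Anthropic()
        msg = client.messages.create(
            model=self.name, max_tokens=1024, system=system,
            messages=[{"role": "user", "content": user}],
        )
        return msg.content[0].text

def classify_ticket(model: ChatModel, ticket: str) -> str:
    # Business logic depends only on the ChatModel protocol, so swapping in a
    # better model later is a one-line change at the call site.
    return model.complete("Label this support ticket: billing, bug, or other.", ticket)
```

with that in place, upgrading the model really is mostly a config change; the hard part stays in the rollout and the data.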
> do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?
I know we absolutely have not, but I think we have reached the limit of the chatbot experience that ChatGPT offers. For some reason the industry keeps trying to force the chatbot interface to do literally everything, to the point that we now have inflated roles like "Prompt Engineers". This is to say that people suck at knowing what they want off the rip, and LLMs can't help with that if they're not integrated into technology in such a way that a solid foundation is built to let the models generate good output.
LLMs and other big data models have incredible potential for things like security, medicine, and the power industry to name a few fields. I mean I was recently talking with a professor about his research in applying deep learning to address growing security concerns in cars on the road.
Your hypothesis here is not exclusive of the hypothesis in this article.
Name your platform. Linux. C++. The Internet. The x86 processor architecture. We haven't exhausted the options for delivering value on top of those, but that doesn't mean the developers and sellers of those platforms won't keep trying to improve them anyway, even while struggling to extract value from the application developers who use them.