All the snark contained within aside, I'm reminded of that ranting blog post from the person sick of AI that made the rounds a little ways back, which had one huge, cogent point within: that the same companies that can barely manage to ship and maintain their current software are not magically going to overcome that organizational problem set by virtue of using LLMs. Once they add that in, then they're just going to have late-released, poorly made software that happens to have an LLM in it somewhere.
> Once they add that in, then they're just going to have late-released, poorly made software that happens to have an LLM in it somewhere.
Hell, you can even predict where by looking at the lowest-paid people in the organization. Work that isn't valued today won't start being valued tomorrow.
Speaking as someone who is very critical of the general "raise infinite money and grow grow grow" approach to startups (it's a big part of why my company is bootstrapped [1]), I think there's a really important difference between making money and raising money.
Raising money is about convincing investors that you can solve someone else's problem. Or even more abstractly, about convincing investors that you can convince other investors that you can convince other investors that you can solve someone else's problem. That's enough layers of indirection that signaling games start to overtake concrete value a lot of the time, and that's what gets you your Theranoses and your FTXes and the like. You get into "the market can remain irrational longer than you can remain solvent", or its corollary, "the market can remain irrational long enough for you to exit before it wakes up".
But making money means you are solving a problem for your customer, or at least, that your customer thinks you're solving a problem with no additional layers of indirection. And if we take "making money" to mean "making a profit", it also means you're solving their problem at less cost than the amount they're willing to pay to solve it. You are actually creating net value, at least within a sphere limited to you and your customer (externalities, of course, are a whole other thing, but those are just as operative in non-profitable companies).
I think this is one of the worst things about the way business is done today. Doing business, sustainably and profitably, is an excellent way to keep yourself honest and force your theories to actually hold up in a competitive market. It's a good thing for you and for your users. But business has become so much about gathering sufficient capital to do wildly anticompetitive things and/or buy yourself preferential treatment that we're losing that regulating force of honesty.
[1] see my HN profile for more on that if you care
Incidentally, this just led me to the first case I've found of a wide variety of AIs actually generating the same answer consistently. They almost always said a close synonym for "now I have a solution".
For some AIs, if you ask them to complete it for "Java" and/or "Regexes" first, then they give realistic answers for "AI". But others (mostly, online commercial ones) are just relentlessly positive even then.
Prompting to complete it for "Python" usually remains positive though.
We aren’t good at creating software systems from reliable and knowable components. A bit skeptical that the future of software is making a Rube Goldberg machine of black box inter-LLM communication.
I could see software having a future as a Rube Goldberg machine of black box AIs, if hardware is cheap enough and the AIs are good enough. There was a scifi novel (maybe "A Fire Upon the Deep"?) where there was no need to write software because an AI could cobble together any needed solution by gluing existing software together. Throwing cycles at deepening layers was also something that Paul Graham talked about in the hundred year language (https://paulgraham.com/hundred.html).
Now, whether hardware is cheap enough or AI is smart enough is an entirely different question...
As someone who makes HW for a living, please do make more Rube Goldberg machines of black box LLMs. At least for a few more years until my kids are out of college. :)
Here's a practical example in this vein, but much simpler: if you're trying to answer a question with an LLM and have it answer in JSON format within the same prompt, for many models the accuracy is worse than just having it answer in plaintext. The reason is that you're now having to place a bet that the distribution of JSON strings it's seen before meshes nicely with the distribution of answers to that question.
So one remedy is to have it just answer in plaintext, and then use a second, more specialized model that's specifically trained to turn plaintext into json. Whether this chain of models works better than just having one model all depends on the distribution match penalties accrued along the chain in between.
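As a rough sketch of that chain, assuming the OpenAI Python SDK (the model names are placeholders, and the second call could just as well be a smaller model specialized for structured output):

```python
import json

from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_as_json(question: str) -> dict:
    # Stage 1: plain-text answer, where the model's output distribution
    # is most natural.
    answer = ask("gpt-4o-mini", f"Answer briefly: {question}")

    # Stage 2: a second call whose only job is reshaping known-good text
    # into JSON (this is where a smaller, specialized model could slot in).
    raw = ask(
        "gpt-4o-mini",
        'Return only a JSON object with a single key "answer" whose value '
        f"is this answer, unchanged: {answer}",
    )
    return json.loads(raw)
```

Whether the extra call pays for itself depends, as above, on the distribution-match penalties accrued along the chain.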
I wrap the plaintext in quotes, and perhaps end with a period, so that it knows when to start and when to stop. You can add logit biases for the syntax and pass the period as a stop marker to the ChatGPT APIs.
Also, you don't need to use a model to build JSON from plaintext answers lol, just use a programming language.
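Something like this, assuming the OpenAI Python SDK (the model name is a placeholder); the period doubles as the stop marker and plain code handles the JSON:

```python
import json

from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{
        "role": "user",
        "content": "In one short sentence ending with a period: why is the sky blue?",
    }],
    stop=["."],  # the period acts as the end-of-answer marker described above
    # logit_bias={...},  # optionally bias tokens toward the expected syntax
)
answer = resp.choices[0].message.content.strip()

# No model needed from here: plain code builds the JSON.
print(json.dumps({"answer": answer}))
```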
Use Moore's law to achieve unreal battery life and better experiences for users... or use Moore's law to throw more piles of abstractions on abstractions, where we end up with solutions like Electron or I Duct Taped An AI On It.
Reading through this, I could not tell if this was a parody or real. That robot image slopped in the middle certainly didn't help.
except the computer is the one both writing AND using the abstractions, so the human cost is essentially zero. and thus is absolutely not even similar.
as a general rule, virtually any analogy that involves anthropomorphizing LLMs is at best right for the wrong reasons (a stopped clock) and at worst leads to conclusions ranging from misleading to actively harmful.
I'll stay out of the inevitable "You're just adding a band aid! What are you really trying to do?" discussion since I kind of see the author's point and I'm generally excited about applying LLMs and ML to more tasks.
One thing I've been thinking about is whether an agent (or collection of agents) can solve a problem initially in a non-scalable way through raw inference, but then develop code to make parts of the solution cheaper to run.
For example, I want to scrape a collection of sites. The agent would at first apply the whole HTML to the context to extract the data (expensive but it works), but then there is another agent that sees this pipeline and says "hey we can write a parser for this site so each scrape is cheaper", and iteratively replaces that segment in a way that does not disrupt the overall task.
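As a sketch of that shape (llm_extract and generate_parser_code are hypothetical stand-ins for the two agents):

```python
from typing import Callable

# Hypothetical stand-ins for the two agents described above.
def llm_extract(html: str) -> dict:
    """Expensive path: put the whole page in context and ask the model for the fields."""
    raise NotImplementedError

def generate_parser_code(sample_html: str, sample_output: dict) -> Callable[[str], dict]:
    """Second agent: propose a site-specific parser from observed (html, output) examples."""
    raise NotImplementedError

parsers: dict[str, Callable[[str], dict]] = {}  # site -> promoted parser

def scrape(site: str, html: str) -> dict:
    parser = parsers.get(site)
    if parser is not None:
        try:
            return parser(html)      # cheap path: generated, vetted code
        except Exception:
            pass                     # parser broke (site changed): fall through

    data = llm_extract(html)         # expensive but robust fallback

    # Offline, the second agent can propose a parser from (html, data) pairs,
    # and it only gets promoted into parsers[site] once it reproduces the
    # LLM's output on held-out pages, so the overall task is never disrupted.
    return data
```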
Well, the standard advice for getting off the ground with most endeavours is “Do things that don’t scale”. Obviously scaling is nice, but sometimes it’s cheaper and faster to brute force it and worry about the rest later.
The unscalable thing is often like “buy it cheap, buy it twice”, but it’s also often like “buy it cheap, only fix it if you use it enough that it becomes unsuitable”. Makers endorse both attitudes. Knowing which applies when is the challenging bit.
Cool idea. It's a bit like what happens in human brains when we develop expertise at something too: start with general purpose behaviors/thinking applied to new specialized task—but if that new specialized task is repeated/important enough you end up developing "specialized circuitry" for it: you can perform the task more efficiently, often without requiring conscious thought.
RAG doesn’t necessarily give the best results. Essentially it is a technically elegant way to add semantic context to the prompt (for many use cases it is over-engineered). I used to offer RAG SQL query generations on SQLAI.ai and while I might introduce it again, for most use cases it was overkill and even made working with the SQL generator unpredictable.
Instead I implemented low tech “RAG” or “data source rules”. It’s a list of general rules you can attach to a particular data source (i.e. a database). Rules are included in the generations and work great. Examples are “Wrap tables and columns in quotes” or “Limit results to 100”. It’s simple and effective: I can execute the generated SQL against my DB for insights.
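For illustration, a minimal sketch of how such rules can be folded into the generation prompt (the wording here is illustrative, not the exact implementation):

```python
# Rules attached to a particular data source, included verbatim in every generation.
DATA_SOURCE_RULES = {
    "analytics_db": [
        "Wrap tables and columns in quotes.",
        "Limit results to 100.",
    ],
}

def build_sql_prompt(data_source: str, schema: str, question: str) -> str:
    rules = "\n".join(f"- {r}" for r in DATA_SOURCE_RULES.get(data_source, []))
    return (
        f"Schema:\n{schema}\n\n"
        f"Rules for this data source:\n{rules}\n\n"
        f"Write a SQL query that answers: {question}"
    )
```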
RAG without preprocessing is almost useless, because unless you are careful, RAG gives you all the weaknesses of vector DBs, with all the weaknesses of LLMs!
The easiest example I can come up with is: imagine you just dump a restaurant menu into a vector DB. The menu is from a hipster restaurant, and instead of having "open hours" or "business hours" the menu says "Serving deliciousness between 10:00am and 5:00pm".
Naïve RAG queries are going to fail miserably on that menu. "When is the restaurant open?" "What are the business hours?"
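To make that concrete, a small sketch assuming the sentence-transformers library; the "normalized" line stands in for whatever index-time preprocessing (often an LLM pass) restates the chunk in the terms people actually query with:

```python
from sentence_transformers import SentenceTransformer, util  # assumed embedding setup

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf embedder

raw_chunk = "Serving deliciousness between 10:00am and 5:00pm"
# Index-time preprocessing restates the chunk in plain, queryable terms.
normalized_chunk = "Business hours / open hours: 10:00am to 5:00pm"

query = "When is the restaurant open?"
q, raw, norm = model.encode([query, raw_chunk, normalized_chunk])

print("query vs raw menu text:  ", float(util.cos_sim(q, raw)))
print("query vs normalized text:", float(util.cos_sim(q, norm)))
# Storing the normalized restatement alongside the original is what lets the
# naive question actually land on the hipster phrasing.
```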
Longer context lengths are actually the solution to this problem: when the content is small enough and the potential for ambiguity is high enough, putting it all in context and letting the LLM read it directly is the better tool.
Reminds me how, a few years ago, Tesla (if I remember correctly, Karpathy) described that in Autopilot they started to extract a third model and use it to explicitly apply some static rules.
What do you mean by "RAG SQL query generations"? Were you searching for example queries similar to the questions the user's asked and injecting those examples into the prompt?
Semantic search is a powerful tool that can greatly improve the relevance and quality of search results by understanding the intent and contextual meaning of search terms. However, it’s not without its limitations. One of the key challenges with semantic search is the assumption that the answer to a query is semantically similar to the query itself. This is not always the case, and it can lead to less than optimal results in certain situations.
https://fsndzomga.medium.com/the-problem-with-semantic-searc...
The blockchain hype train was ridiculous. Textbook "solution looking for a problem" that every consultant was trying to push to every org, which had to jump onboard simply because of FOMO.
I don't think that's quite right. It was businesses who were jumping at consultants to see how they stuff a blockchain into their pipeline to do the same thing they were already doing, all so they could put "now with blockchain!" on the website.
> This is where compound systems are a valuable framework because you can break down the problem into bite-sized chunks that smaller LLMs can solve.
Just a reminder that smaller fine-tuned models are just as good as large models at solving the problems they are trained to solve.
> Oftentimes, a call to Llama-3 8B might be enough if you need to a simple classification step or to analyze a small piece of text.
Even 3B param models are powerful nowadays, especially if you are willing to put the time into prompt engineering. My current side project is working on simulating a small fantasy town using a tiny locally hosted model.
> When you have a pipeline of LLM calls, you can enforce much stricter limits on the outputs of each stage
Having an LLM output a number from 1 to 10, or "error", makes your schema really hard to break.
All you need to do is parse the output, and if it isn't a number from 1 to 10... just assume it is garbage.
A system built up like this is much more resilient, and also honestly more pleasant to deal with.
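The validation side of that contract is a few lines of ordinary code (the function name is mine; the 1-to-10-or-garbage rule is from above):

```python
def parse_rating(raw: str) -> int | None:
    """Accept only an integer from 1 to 10; anything else, including 'error', is garbage."""
    text = raw.strip()
    if text.isdigit() and 1 <= int(text) <= 10:
        return int(text)
    return None  # caller can retry, fall back, or skip this item
```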
> "The most common debate was whether RAG or fine-tuning was a better approach, and long-context worked its way into the conversation when models like Gemini were released. That whole conversation has more or less evaporated in the last year, and we think that’s a good thing. (Interestingly, long context windows specifically have almost completely disappeared from the conversation though we’re not completely sure why.)"
I'm a bit confused by their thinking it's a good thing while being confused about why the subject has "disappeared from the conversation".
Could anyone here shed some light / share an opinion on it/why "long context windows" aren't discussed any more? Did everyone decide they're not useful? Or they're so obviously useful that nobody wastes time discussing them? Or...
YES (although i'm hesitant to even say anything because on some level this is tightly-guarded personal proprietary knowledge from the trenches that i hold quite dear). why aren't you spinning off like 100 prompts from one input? it works great in a LOT of situations. better than you think it does/would, no matter your estimation of its efficacy.
a combination of lots of things, with the general theme being circuits of focused prompts, each good at an individual, specific subtask… off the top of my head:
- prompts that try to extract the factual / knowledge content and update the rest of the system (e.g. if the user chats you to not send notifications after 9pm, ideally you’d like the whole system to reflect that. if they say they like the color gold, you’d like the recommendation system to know that.)
- detect the emotional valence of the user’s chat and raise an alert if they seem, say, angry
- speculative “given this new information, go over old outputs and see if any of our assumptions were wrong or apt and adjust accordingly”
- learning/feedback systems that run evals after every k runs to update and optimize the prompt
- systems where there is a large state space but any particular user has a very sparse representation (pick the top k adjectives for this user out of a list of 100, where each adjective is evaluated in its own prompt)
- llm circuits with detailed QA rules (and/or many rounds of generation and self-reflection to ensure generation/result quality)
- speculative execution in order to trade increased cost / computation for lower latency. (cf. graph of thoughts prompting)
- out of band knowledge generation, prompt generation, etc.
- alerting / backstopping / monitoring systems
- running multiple independent systems in parallel and then picking the best one, or even merging the best results across all of them.
the more, smaller prompts, the easier to do eval and testing, as well as making the system more parallelizable. also, you get stronger, deeper signals that communicate a deeper domain understanding to the user s.t. they think you know what you’re doing.
but the point is every bit as much that you are embodying your own human cognition within the llm— the reason to do that in the first place is because it is virtually infinitely scalable when it makes it out of your brain and onto/into silicon. even if each marginal prompt you add only has a .1% chance of “hitting”, you can just trigger 1000 prompts and voila: one more eureka moment for your system.
sure, there’s diminishing returns in the same way that the CIA wants 100 Iraq analysts but not 10000. but unlike the CIA, you don’t need to pay salaries, healthcare, managers, etc. it all scales basically linearly. and, besides, is extremely cheap as long as your prompting is even halfway decent.
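a minimal sketch of the fan-out, with call_llm as a hypothetical async wrapper around whatever completion api you use:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """hypothetical async wrapper around whatever completion API you use"""
    raise NotImplementedError

async def fan_out(user_input: str, prompt_templates: list[str]) -> list[str]:
    # one focused prompt per subtask: knowledge extraction, valence check,
    # assumption re-checks, sparse adjective picks, QA passes, ...
    tasks = [call_llm(t.format(input=user_input)) for t in prompt_templates]
    return await asyncio.gather(*tasks)

# each template is small, separately testable, and cheap to eval in isolation;
# going from 10 templates to 1000 is an infrastructure bill, not a hiring plan.
```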
The title is deliberately provocative, but if I'm reading this article right it seems to push an argument that I've made for ages in the context of my own business, and which I think (as the article itself suggests) actually represents the best practice from before the age of ChatGPT. To wit: have lots of ML models, each of which does some very specific thing well, and wire them up wherever it makes sense to do so, piping the results from one into the input of another. The article is a fan of doing this with language models specifically (and, of course, in natural-language-heavy contexts these ML models will mostly or entirely be language models), but the same basic premise applies to ML more generally. As far as I am aware, this is how it used to be done in commercial applications before everyone started blindly trying to make ChatGPT do everything.
I recently discovered BERTopic, a Python library that bundles a five-step pipeline of (relatively) old NLP approaches in a way that is very similar to how we were already doing it, now wrapped in a nice handy one-liner. I think it's a great exemplar of the approach that will probably come out on top of the hype storm.
(Disclaimer: I am not an AI expert and will defer to real data/stats nerds on this.)
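For reference, the whole pipeline really is about a one-liner (docs here is a placeholder for your own corpus):

```python
from bertopic import BERTopic

docs = ["first document ...", "second document ...", "..."]  # placeholder: your own (sufficiently large) corpus

# Default pipeline: embed -> reduce (UMAP) -> cluster (HDBSCAN) -> tokenize -> c-TF-IDF, pre-wired.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```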
if by “stupid” you mean “worse is better”. except this time it’s actually better. just because it’s apostasy according to the Church of Engineering does not mean it can be dismissed out of hand, no matter how much it hurts your sensibilities. (it used to hurt mine as well, but then i learned to stop worrying and love our new llm overlords)
I'm not dismissing it out of hand. I've spent significant time evaluating the capabilities of LLMs and studied how people are actually using them across different domains. The evidence is clear: LLMs are helpful tools, but they make it way too easy to crank out lots of code without corresponding understanding. And if you think software is more about lines of code than it is about developing understanding, I question whether you've ever made a piece of software that lasts more than 5 years.
"evaluating the capabilities" sounds like a fancy way of saying that you know what other people say it can do... i strongly encourage you to try it for yourself. try to make some sort of multi-part, complicated document (technical or non-). tell it that you and it together are collaborating on it, what it's for, etc. and just start interacting with it. give it the stuff that you have so far. ask it for counterarguments, place where the argument could be better, comments about structure and style, ... the world is your oyster. it will give you a MUCH better picture of how these things actually work-- what the process of externalizing cognition is actually like, and how deep and detailed you can get. the results will amaze you.
people have this massive misconception about what LLMs can do and how to get good results out of them where they think they just kinda ask it for stuff and voila, it appears. it could not be any further from reality. it is an interactive tool: you get it into the right "frame of mind" (scare quotes because this is a descriptive analogy, not one meant to convey or impart mechanical sympathy) and then ask it... literally anything you want. but you have to DISCOVER these things-- these (nearly) magic words, phrases, encodings of the problem, etc. that get the LLM to generate results in a way resonant with the way that you do / the way that you want it to. then you get the generation part "for free". you teach it to be a perfect painter, and then let it paint (shoutout Robert Pirsig // Zen and the Art of Motorcycle Maintenance).
the whole "LLM applications to <X>" stuff is such a mirage, and so tied up in a misbegotten view of how they work and what they're good at. if you have a hyper-hyper-specialized domain, sure, you might need to... find some specialized data. find or train some bespoke model. but for damn near anything written in english (can't speak to other languages) it is as good at coding as it is at bioinformatics as it is at statistics as it is at sociology as it is at literature (to be rather flippant). just do some introspection on how you do your job, how you think through problems, how you generate your output, and then "outrospect" it "into the computer". once you do that, voila: you have a virtually infinitely scalable simulacrum of your own cognition. in no way is it "plug in the ai and it will automatically take my job, one size fits all". the reason why they work so well is precisely because they are NOT that.
think about it this way: if you knew that each day when you went to sleep all of your memories were deleted, but you and your brain otherwise worked exactly the same way as yesterday, what would you do? (the concept of the mediocre but cute movie "50 First Dates") what notes would you leave for yourself to get back into the same context as you were in when you went to sleep the night before? how would you convince yourself that the artifacts you leave for yourself in the morning are true? how would you convey the subtlety of your thoughts & feelings, idiosyncrasies, point of view, style & syntax? how would you grow the system over time? the LLMs are closer to speaking the language of your own thoughts than any human and every human invention in all of history. prompting is just the way to incept ideas and thoughts, systems, etc. into its ai mind in a way that i find to be profoundly similar to how our own perception is not "reality" per se, but rather the image of reality that we see within our own minds. an image that we build up from birth as we grow to understand reality and the world around us.
couldn't agree more... if we're talking about LISP machines-- man i wish that side won. i believe in design when it comes to creating tools for humans, because we are finite and have finite capacity. but the PoV is just not exportable to the LLM, and when you have the sorts of crystallizing collaborative moments w/ LLMs that i have had... let's just say that it's beyond question that they have way too much utility to not be the shape of things to come. they are the "copilot" today... tomorrow, the roles will be reversed. and that's AWESOME! our worse is their better and vice versa-- sounds like an amazing partnership to me. ebony and ivory.