
No offense, but I'd love to see what they've successfully built using LLMs before taking their advice too seriously. The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect, if the section titles of the unfinished sections are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO



We work in some pretty serious domains and try to stay away from fine tuning:

- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes far enough here that the relative lift from adding fine-tuning isn't worth the many costs (see the sketch at the end of this comment)

- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3

- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"

- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc

We do use tuned smaller, dumber models, e.g., as part of a coarse relevancy filter in a firehose pipeline... but these are outliers. We expect to use them more over time... but again, for rarer cases and only after we've exhausted other options. I'm guessing that as we do more fine-tuning, it'll be more on embeddings than LLMs, at least until OSS models get a lot better.
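
Roughly, the dynamic example injection is: retrieve the most similar solved examples for the incoming task and prepend them as few-shot demonstrations. A minimal sketch, where embed() and vector_store are hypothetical stand-ins for whatever embedding model and vector DB you use:

    # Sketch of dynamic few-shot example injection. embed() and
    # vector_store are stand-ins, not a real API.
    def build_prompt(task: str, vector_store, k: int = 3) -> str:
        # Pull the k solved examples most similar to the incoming task
        examples = vector_store.search(embed(task), top_k=k)
        shots = "\n\n".join(
            f"Input: {ex.input}\nOutput: {ex.output}" for ex in examples
        )
        return f"{shots}\n\nInput: {task}\nOutput:"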


See, if the article had said this, I would have agreed - fine-tuning is a tool and it should be used thoughtfully. I personally believe that in this funding climate it makes sense to make data collection and model training a core capability of any AI product, though that will only be available and wise for some founders.


Agreed, model training and data collection are great!

The subtle bit is it just doesn't have to be for LLMs, as these are typically part of a system-of-models. E.g., we <3 RAG, and GNNs for improving your KG are fascinating. Likewise, dspy's explorations in optimizing prompts, vs the LLMs themselves, are very cool.


> we <3 RAG, and GNNs for improving your KG is fascinating

Oh man I am so torn between this being a fantastic idea and this being "building a better slide-rule in the age of the computer".

dspy is definitely a project I want to dig into more


Yeah, I would recommend sticking to RAG on naively chunked data for weekend projects by one person. Likewise, for a consumer tool like Perplexity's search engine, where you minimize spend per user task or go bankrupt: same thing, do the cheap thing and move on, it's good enough
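
By "naively chunked" I mean something like fixed-size windows with overlap, ignoring document structure entirely - a sketch:

    # "Naive chunking": fixed-size character windows with overlap.
    # No awareness of headings, sentences, or tables - fine for a
    # weekend project, a liability once answer quality matters.
    def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
        step = size - overlap
        return [text[i:i + size]
                for i in range(0, max(len(text) - overlap, 1), step)]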

Once RAG projects become important and good answers matter - we work with governments, manufacturers, banks, cyber teams, etc - working through data quality, data representation, & retrieval quality helps

Note that we didn't start here: We began with naive RAG, then relevancy filtering, then agentic & neurosymbolic querying, then dynamic example prompt injection, and now are getting into cleaning up the database/kg itself

For folks doing investigative/analytics projects in this space, happy to chat about what we are doing w Louie.AI. These are more implementation details we don't normally write about.


Have you actually used DSPy? I still can't figure out what it's useful for beyond optimizing basic few shot prompts.


We tried dspy and a couple of others like it. They're neat, and I'm happy those teams are experimenting with these frameworks. At the same time, they try to do "too much" by taking over the control flow of your code and running autotuning everywhere over it. We needed to write our own agent framework, as even tools like langchain are too insecure and inefficient for an enterprise platform, and frameworks like dspy are even further out there.

A year+ later, the most interesting kernel of insight to us from dspy is autotuning a single prompt: it's an optimizable model just like any other. As soon as you have an eval framework in place for your prompts, having something like dspy tune your prompts on a per-LLM basis would be very cool. I'm not sure where they are on that; it seems against the grain for their focus. We're only now reaching the point where we would see ROI on that kind of thing - it took a long time to get here.
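
The kernel is small. A hill-climbing sketch, where propose_variants() (e.g., an LLM rewriting the prompt) and score() (your eval harness) are hypothetical stand-ins:

    # Sketch: treat the prompt as a tunable parameter and hill-climb
    # against an eval set, per target LLM.
    def autotune(prompt: str, eval_set, llm, rounds: int = 5) -> str:
        best, best_score = prompt, score(prompt, eval_set, llm)
        for _ in range(rounds):
            for candidate in propose_variants(best, llm):
                s = score(candidate, eval_set, llm)
                if s > best_score:
                    best, best_score = candidate, s
        return best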

We do run an agentic framework, so doing cross-prompt autotuning would be neat too -- especially for how the orchestrator (ex: CoT) composes with individual agents. We call this the "composition problem" and it's frustrating. However, again, dspy and friends do "too much", by trying to also be the agent framework & runtime, while we just want the autotuner.


It’s funny: I found the optimizer (which you could quite easily rip out from DSPy) to be the most underwhelming part of the equation.


The rest is neat but scary for most production scenarios, while a prompt autotuner can give significant lift + resilience in a predictable & maintainable way to most typical LLM apps

Again... I'm truly happy and supportive that academics are exploring a wild side of the design space. Just, as we are in the 'we ship code people rely on' side of the universe, it's hard to find scenarios where its potential benefits outweigh its costs.


Can you give a concrete example of GNNs helping?


Entity resolution - RAG often mixes vector & symbolic queries, and ER improves reverse indexing, which is a starting point for a lot of the symbolic ones

Identifying misinfo - Ranking & summarization based on internet data should be a lot more careful, and sometimes the controversy is the interesting part

For both, GNNs are generally SOTA
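
For a flavor of the ER side, a minimal sketch with PyTorch Geometric (dimensions and architecture illustrative, not our production setup): embed records in the identity graph with a small GCN, then score candidate pairs by embedding similarity.

    # Sketch: GCN embeddings over a record-linkage graph. Candidate
    # record pairs with high cosine similarity are merge candidates.
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv

    class ERModel(torch.nn.Module):
        def __init__(self, in_dim: int, hid: int = 64, out: int = 32):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hid)
            self.conv2 = GCNConv(hid, out)

        def forward(self, x, edge_index):
            h = F.relu(self.conv1(x, edge_index))
            z = self.conv2(h, edge_index)
            return F.normalize(z, dim=1)  # pair score = z[i] @ z[j]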


> The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect if the section titles of the unfinished section is anything to go by) is very strange to me and suggests a pretty narrow perspective IMO

The article has a section called "When to finetune", along with links to separate pages describing how to do so. They absolutely don't say that "fine-tuning isn't even a consideration". Instead, they describe the situations in which fine-tuning is likely to be helpful.


Huh. Well that's embarrassing. I guess I missed it when I lost interest in the caching section and jumped straight to Evaluation and Monitoring.


Hello, it’s Bryan, an author on this piece.

If you're interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month to give it a try and see how you like it. If you have feedback after using it, we're always very interested to hear from users.

As far as fine-tuning in particular, our consensus is that there are easier options first. I personally have fine-tuned gpt models since 2022; here’s a silly post I wrote about it on gpt 2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...


I took a look at Magic earlier today and it didn't work at all for me, sorry to say. After the example prompt, I tried to learn about a table and it generated bad SQL (a correct query to pull a row, but with LIMIT 0). I asked it to show me the DDL and it generated invalid SQL. Then I tried to ask it to do some population statistics on the customer table and ended up confused about why there appeared to be two windows in the cell, with the previously generated SQL on the left and the newly generated SQL on the right. The new SQL wouldn't run when I hit Run Cell; the error showed the originally generated SQL. I gave up and bounced.

I went back while writing this comment and realized it might be showing me a diff (better use of color would have helped, I have been trained by github). But I was at a loss for what to do with that. I just now figured out the Keep button exists and it accepted the diff and now it sort of makes sense, but the SQL still doesn't return any results.

My honest feedback is that there is way too much stuff I don't understand on the screen, and it makes me confused and a little stressed. Ease me into it please, I'm dumb. There seem to be cells that are linked together and cells that aren't (separated by a purplish background?), and I don't understand it. I am a jupyter user and I feel like this should be intuitive to me, but it isn't. I am not a designer, but I suspect the structural markings like cell boundaries are too faint compared to the content of the cells, and/or the exterior of a cell having the same color as the interior is making it hard for me. I feel lost in a sea of white.

But the core issue is that, excluding the prompt I copy-pasted word for word (which worked like a charm), I am 0 for 4 on actually leveraging AI to solve the problems I asked of Magic. I like the concept of natural language BI (I worked on it in the early days when Alexa came out), so I probably gave it more chances than I would have for a different product.

For me, it doesn't fit my criteria for good problems to solve with AI in 2024 - the conversational interface and binary right/wrong nature of querying/presenting data accurately make the cost of failure too high, which is a death sentence for AI products IMO (compare to proactive, non-blocking products like copilot or shades-of-wrong problems like image generation or conversations with imaginary characters). But text-to-SQL and data presentation make sense as AI capabilities in 2024 so I can see why that could be a good product to pursue. If it worked, I would definitely use it.


This was kind of the conventional wisdom ("fine-tune only when absolutely necessary for your domain", "fine-tuning hurts factuality"), but some recent research (some of which they cite) has actually quantitatively shown that RAG is much preferable to FT for adding domain-specific knowledge to an LLM:

- "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" https://arxiv.org/abs//2405.05904

- "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" https://arxiv.org/abs/2312.05934


Thanks, I'll read those more fully.

But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.


This is exactly (one of) our use cases at Eraser - taking code or natural language and producing diagram-as-code DSL.

As with other situations that want a custom DSL, our syntax has its own quirks and details, but is similar enough to e.g. Mermaid that we are able to produce valid syntax pretty easily.

What we've found harder is controlling for edge cases about how to build proper diagrams.

For more context: https://www.eraser.io/decision-node/on-building-with-ai


Agree that your use-case is different. The papers above are dealing mostly with adding a domain-specific textual corpus, still answering questions in prose.

"Teaching" the LLM an entirely new language (like a DSL) might actually need fine-tuning, but you can probably build a pretty decent first-cut of your system with n-shot prompts, then fine-tune to get the accuracy higher.


Fine-tuning is absolutely necessary for true AI, but even though it's desirable, it's infeasible for now for any large model, considering how expensive GPUs are. If I had infinite money, I'd throw it at continuous fine-tuning and throw away the RAG. Fine-tuning also requires appropriate measures to prevent forgetting of older concepts.


It is not infeasible. It is absolutely realistic to do distributed finetuning of an 8B text model on previous-generation hardware. You can add finetuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt finetuning, you are only going to pay about the equivalent of 1-3 senior engineers.

Expensive? Sure, all of AI is crazy expensive. Infeasible? No


I don't consider a small 8B model to be worth fine-tuning. Fine-tuning is worthwhile when you have a larger model with capacity to add data, perhaps one that can even grow its layers with the data. In contrast, fine-tuning a small saturated model will easily cause it to forget older information.

All things considered, in relative terms, as much as I think fine-tuning would be nice, it will remain significantly more expensive than just making RAG or search calls. I say this while being a fan of fine-tuning.


> I don't consider a small 8B model to be worth fine-tuning.

Going to have to disagree with you on that one. A modern 8B model that has been trained on enough tokens is ridiculously powerful.


A well-trained 8B model will already be over-saturated with information from the start. It will therefore easily forget much old information when fine-tuned on new material. It just doesn't have the capacity to take in much more.

Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.


> A well-trained 8B model will already be over-saturated with information from the start

Any evidence of that that I can look at? This doesn't match what I've seen nor have I heard this from the world-class researchers I have worked with. Would be interested to learn more.


Upon further thought, if fine-tuning involves adding layers, then the initial saturation should not matter. Say an 8B model adds 0.8B * 2 = 1.6B parameters of new layers for fine-tuning; then, with some assumptions (say on the order of 100 parameters per article), a ballpark is 1.6B / 100 ≈ 16 million articles of fine-tuning data.


The reason to fine-tune is to get a model that performs well on a specific task. It could lose 90 percent of its knowledge and still beat the untuned model at the narrow task at hand. That's the point, no?


It is not really possible to lose 90% of one's brain and do well on certain narrow tasks. If the tasks truly were so narrow, you would be better off training a small model just for them from scratch.


Fine-tuning has been on the way out for a while. It's hard to do right and costly. LoRAs are better for influencing output style, as they don't dumb down the model, and they're easier to create. This is on top of RAG just being better for new facts, as the other reply mentioned.
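
Setting one up is only a few lines (sketch with Hugging Face peft; the model name and hyperparameters are illustrative):

    # Sketch: attach a LoRA adapter with HF peft. Only the low-rank
    # adapter weights train, so the base model isn't "dumbed down".
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                     target_modules=["q_proj", "v_proj"])
    model = get_peft_model(base, cfg)
    model.print_trainable_parameters()  # tiny fraction of the total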


How much of that is just the flood of traditional engineers into the space and the fact that collecting data and then fine-tuning models is orders of magnitude more complex than just throwing in RAG? I suspect a huge amount of RAG's popularity is just that any engineer can do a version of it + ChatGPT API calls in a day.

As for LoRA - in the context of my comment, that's just splitting hairs IMO. It falls in the category of fine-tuning for me, although I understand why you might disagree. But it's not like the article mentions LoRA either, nor am I aware of people doing LoRA without GPUs, which the article advises against ("No GPUs before PMF")


I disagree. No amount of fine tuning will ever give the LLM the relevant context with which to answer my question. Maybe if your context is a static Wikipedia or something that will never change, you can fine tune it. But if your data and docs keep changing, how is fine tuning going to be better than RAG?


Luckily it's not one or the other. You can fine tune and use RAG.

Sometimes RAG is enough. Sometimes fine-tuning on top of RAG is better. It depends on the use case. I can't think of any examples where you would want to fine-tune and not use RAG as well.

Sometimes you fine-tune a small model so it performs close to a larger variant on that specific narrow task, and you improve inference performance by using the smaller model.


Continuous retraining and deployment maybe? But I'm actually not anti-RAG (although I think it is overrated because the retrieval problem is still handled extremely naively), I just think that fine-tuning should also be in your toolkit.


Why is the retrieval part overrated? There isn't even a single way to retrieve. It could be a simple keyword search, a vector search, a combo, or just simply retrieving a single doc and stuffing it in the context


People will disagree, but my problem with retrieval is that every technique that is popular uses one-hop thinking - you retrieve information that is directly related to the prompt using old-school techniques (even though the embeddings are new, text similarity is old). LLMs are most powerful, IMO, at horizontal thinking. Building a prompt using one-hop narrow AI techniques and then feeding it into a powerful generally capable model is like building a drone but only letting it fly over streets that already exist - not worthless, but only using a fraction of the technology's power.

A concrete example is something like a tool for querying an internal company wiki and the query "tell me about the Backend Team's approach to sprint planning". Normal retrieval approaches will pull information directly related to that query. But what if there is no information about the backend team's practices? As a human, you would do multi-hop/horizontal information extraction - you would retrieve information about who makes up the backend team, then retrieve information about them and their backgrounds/practices. You might have a hypothesis that people carry over their practices from previous experiences, so you look at their previous teams and those teams' practices. Then you would have the context necessary to give a good answer. I don't know of many people implementing RAG like that. And what I described is 100% possible for AI to do today.

Techniques that would get around this like iterative retrieval or retrieval-as-a-tool don't seem popular.
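
Retrieval-as-a-tool in sketch form, where retrieve() and llm() are hypothetical stand-ins for the search layer and model call:

    # Sketch: iterative / multi-hop retrieval. The model issues its
    # own follow-up queries until it can answer.
    def answer(question: str, max_hops: int = 4) -> str:
        context, query = [], question
        for _ in range(max_hops):
            context += retrieve(query)
            step = llm(
                f"Question: {question}\nContext so far: {context}\n"
                "If you can answer, reply 'ANSWER: <answer>'. "
                "Otherwise reply 'SEARCH: <next query>'."
            )
            if step.startswith("ANSWER:"):
                return step.removeprefix("ANSWER:").strip()
            query = step.removeprefix("SEARCH:").strip()
        return llm(f"Question: {question}\nContext: {context}\nAnswer as best you can.")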


People can't do that because of cost. If every single query involved taking everything even remotely related to the query and passing it to OpenAI, it would get expensive very, very fast.

It's not a technical issue, it's a practicality issue imo.


That's very true. Although it is feasible if you are well-resourced and make the investment to own the toolchain end-to-end. Serving costs can be quite low (relatively speaking) if you control everything. And you have to pick the correct problem where the cost is worthwhile.


I don't see why this is seen as an either-or by people? Fine-tuning doesn't eliminate the need for RAG, and RAG doesn't obviate the need for fine-tuning either.

Note that their guidance here is quite practical:

> If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.



