I’ve actually found the opposite. At work, we went from a fine-tuned model to a RAG system for internal and external documentation and a generic coding-focused model for code.
Fine tuning against in-house code seems like a small gain over a base model plus search. It’s unlikely your code is so unique, special, and large that a base model can’t get good results on it. You’ll be pinned to a certain version of a certain model, and you won’t be able to upgrade to future models nearly as quickly. Of course, you’re also fighting time: every commit changes the code, so you have to keep fine-tuning continually.
A RAG model might still struggle with a super vague question like “where does the foo call bar with baz set”, but it’s unlikely fine tuning would handle that any better. This is where static code search by symbols really should be used.
There are frameworks for graph-based RAG that mix both approaches. One LLM encodes information as a knowledge graph, gradually building up an ontology. Another LLM queries this knowledge graph by emitting speculative queries. As the database grows, the second LLM is repeatedly fine-tuned with example queries using the ontology the first LLM came up with.
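A minimal sketch of that loop, assuming a generic `llm(prompt)` completion callable and networkx as the graph store (both placeholders, not any particular framework's API):

```python
import json
import networkx as nx

def llm(prompt: str) -> str:
    """Placeholder for whatever completion API you use."""
    raise NotImplementedError

graph = nx.MultiDiGraph()  # the growing knowledge graph / ontology

def ingest(document: str) -> None:
    # First LLM: extract (subject, relation, object) triples from a document.
    raw = llm(
        "Extract facts from the text below as a JSON list of "
        "[subject, relation, object] triples.\n\n" + document
    )
    for subj, rel, obj in json.loads(raw):
        graph.add_edge(subj, obj, relation=rel)

def answer(question: str) -> str:
    # Second LLM: emit a speculative query over the known relation types,
    # then answer from whatever edges match.
    relations = sorted({d["relation"] for _, _, d in graph.edges(data=True)})
    query = llm(
        f"Known relation types: {relations}\n"
        f"Question: {question}\n"
        "Reply with the single relation type most likely to contain the answer."
    ).strip()
    hits = [
        (u, d["relation"], v)
        for u, v, d in graph.edges(data=True)
        if d["relation"] == query
    ]
    return llm(f"Facts: {hits}\nAnswer the question: {question}")
```

The fine-tuning step described above would then train the querying LLM on logged (question, successful query) pairs as the ontology stabilises.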
RAG definitely is helpful! Fine-tuning imo is extremely powerful, but it's still relative alchemy - technically GPT-4, Claude, and every other large model is a finetune of a base model! Reasoning finetuning is also very powerful!
Tbh the hardest part is the lifecycle - ie new data, updating, serving etc - that seems to be the biggest issue
Is anyone having success with iteratively feeding chunks of code (or other documents) to an LLM for search? I understand 'haystack' issues with LLMs are quite bad, but RAG is quite bad too, and a lot of that haystack research seems to involve feeding in very large contexts.
Well, why not both? If you've already got a tuned model why not use RAG on that to get even better results? It already knows the big picture, it just needs the details so it doesn't have to hallucinate them.
> I'm interested to know if anyone is using fine-tuning to train a model on proprietary or in-house codebases and documentation.
I've done it. 1/2 the team thought it was great 20% of the time, 1/2 the team hated it from day 0. I used roughly 500K lines of code.
> How much effort is required to turn code into something one can use for fine-tuning?
Very little to moderate: less than 200 lines of Python, using Qwen FIM, Hugging Face, llama.cpp, and the llama.cpp code extension.
> RAG solutions seem to have their limitations, and fine-tuning might be a more effective approach.
The only problem either way is keeping the information up to date, RAG just adds more cost to the inference process (which at my dev speed is pretty important).
> How much effort is required to turn code into something one can use for fine-tuning?
Fine tuning for "fill in the middle" is the process of taking a file, cutting out some text in the middle, and asking the model to guess what was there. There is a Hugging Face example that will have you doing it in an hour or less. Your ops team saying "no, you can't literally copy all the code into a single folder" is probably the biggest hurdle (advise them you'll do it in CI, and then they can stand up a FIM training endpoint that accepts a CSV; pretty easy).
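As a rough sketch of what that data prep can look like (the sentinel tokens below follow the common prefix/suffix/middle convention used by several code models; check your base model's tokenizer for the exact strings, and the repo path is just a placeholder):

```python
import random
from pathlib import Path

# Sentinel tokens vary by model family; these follow one common FIM convention.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def make_fim_example(source: str, min_span: int = 20, max_span: int = 200) -> str:
    """Cut a random span out of a file and format it as a FIM training example."""
    if len(source) < min_span * 3:
        return ""  # too short to be useful
    span = random.randint(min_span, min(max_span, len(source) // 3))
    start = random.randint(0, len(source) - span)
    prefix, middle, suffix = source[:start], source[start:start + span], source[start + span:]
    # PSM ordering: the model sees prefix + suffix and must reproduce the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

examples = [
    make_fim_example(p.read_text(errors="ignore"))
    for p in Path("repo/").rglob("*.py")
]
examples = [e for e in examples if e]
```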
I've not done fine tuning on code bases but I have done other fine tuning.
You will generally get better results when you fine-tune the base model on your data.
Since you still want to use it with the chat template in the end, you fine-tune the base model with the chat template with your specific data.
From there you'll have a LoRA that knows your data alright, but still doesn't really work for chatting.
You take that LoRA and merge it with the base model. Let's call this the stage model.
Then you use mergekit to merge the base model with both the stage model and the chat model. I used the TIES merge method in the past. Now you have your final model.
I use vLLM for inference, and needed access to multiple fine-tunes on only a single set of hardware. So from that point I take the base model and my final model and extract a new LoRA. I also take the base model and chat model and extract another LoRA for that. Then I load up vLLM with the base model and as many of the fine-tune LoRAs as I need, plus the chat LoRA.
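For reference, serving one base model with several LoRA adapters in vLLM looks roughly like this (a sketch; model paths and adapter names are placeholders, and one adapter is applied per request):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model in GPU memory; adapters are swapped in per request.
llm = LLM(model="path/to/base-model", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=256)

# Hypothetical adapters extracted as described above.
chat_lora = LoRARequest("chat", 1, "path/to/chat-lora")
code_lora = LoRARequest("internal-code", 2, "path/to/finetune-lora")

outputs = llm.generate(
    ["Explain what frobnicate() does in our codebase."],
    params,
    lora_request=code_lora,  # pick whichever adapter the request needs
)
print(outputs[0].outputs[0].text)
```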
The only time this hasn't worked is when the chat model adds a bunch of new tokens on top of the base model. If I remember right, there was an issue with that.
Generally it is recommended that you fine-tune if you want to shape the output style - if you want to only output JSON, or just output JSDoc, etc.
To add knowledge, RAG is a better idea and easier to keep updated on code changes.
If RAG does not give back good results, then that's a problem with the retrieval part, which can be measured and improved.
We're building documentation + RAG systems for enterprise codebases, and this is what we've seen work best.
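A minimal way to measure that retrieval part, assuming you have a small eval set of (question, ids-of-relevant-chunks) pairs and a `retrieve(question, k)` function (both assumptions, not any particular framework's API):

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """Fraction of questions with at least one relevant chunk in the top-k results."""
    hits = 0
    for question, relevant_ids in eval_set:
        retrieved_ids = {doc.id for doc in retrieve(question, k)}
        if retrieved_ids & set(relevant_ids):
            hits += 1
    return hits / len(eval_set)

# Re-run the same eval set after changing chunking, embeddings, or rerankers
# to see whether retrieval actually improved:
# print(recall_at_k(eval_set, retrieve, k=5))
```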
We are, at Scribe[1]. We do it to make sense of knowledge workflows on computers and predict the next step in the process (our software points out where in the DOM a user might need to interact next). We fine-tune with tons of JSON data and DOM data. I’m sure doing it with code is no more complicated.
We see a lot of this in large orgs! The main issue imo is actually the selection of chat templates - there are a lot of people who use a template for finetuning and then totally forget to use it at inference.
A lot of financial, legal and health companies do fine-tuning! Reasoning finetuning via GRPO is also very powerful since you don't need any cot data in between! Just inputs and outputs!
Are people fine-tuning LLMs on their local machines with a single GPU? What are people using to scale their training to multiple nodes / gpus? I've been playing around with Hugging Face Estimators in sagemaker.huggingface but not sure if there are better options for this?
It takes a significant amount of time (few hours) on a single consumer GPU, even 4090 / 5090, on personal machines. I think most people use online services like runpod, vast ai, etc to rent out high-powered H100 and similar GPUs for a few cents per hour, run the fine-tuning / training there, and just use local GPUs for inference on those fine-tuned models generated on cloud-rented instances.
It used to be that way! Interestingly I find people in large orgs and the general enthusiast don't mind waiting - memory usage and quality are more important factors!
Google Colab is quite easy to use and has the benefit of not making your local computer feel sluggish while you run the training. The linked Unsloth post provides a notebook that can be launched there, and I've had pretty good luck adapting their other notebooks with different foundational models. As a sibling noted, if you're using LoRA instead of a full fine-tune, you can create adapters for fairly large models with the VRAM available in Colab, especially on the paid plans.
If you have a Mac, you can also do pretty well training LoRA adapters using something like Llama-Factory and letting it run overnight. It's slower than an NVIDIA GPU, but the larger effective memory (if you have, say, 128GB) can give you more flexibility.
A 'LoRA' is a memory-efficient form of fine-tuning that only trains a small fraction of the LLM's parameters, and 'quantisation' reduces an LLM to, say, 4 bits per parameter. Together they make it feasible to fine-tune a 7B-parameter model at home.
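Roughly what that looks like with Hugging Face tooling (a QLoRA-style sketch; the model name and hyperparameters are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # any ~7B base model

# Load the frozen base weights in 4-bit so they fit in consumer VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small low-rank adapter matrices on top of the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# From here you'd hand the model to a Trainer / SFT loop on your own data.
```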
Anything bigger than 7B parameters and you'll want to look at renting GPUs on a platform like Runpod. In the current market, there are used 4090s selling on ebay right now for $2100 while runpod will rent you a 4090 for $0.34/hr - you do the math.
It's certainly possible to scale model training to span multiple nodes, but generally scaling through bigger GPUs and more GPUs per machine is easier.
For experimentation and smaller models, single GPU is the way to go! Tbh I normally find most people spend the majority of their time on datasets, training loss convergence issues and the like!
But if it's helpful, I was thinking about spinning up a platform for something like that!
Is anyone outside of the research labs fine-tuning models for production use cases? I have been seeing more people just using foundational models off the shelf, especially in light of a new advancement that seems to come every few months.
On paper fine tuning smaller models can greatly reduce the cost for a specific task, but I've not heard many real-world success stories around that.
I think vision LLMs are one of the most interesting applications here - things like fine-tuning for better results extracting data from a specific paper form or report structure. Again, not many public examples of that.
1. Codebases, docs, large corpora of internal datasets - fill in the middle, auto completion etc.
2. I know a tonne of financial institutions use fine-tuning for trading - real-time data parsing, headline analysis, signal creation etc
3. Distillation is also relatively common - taking outputs of a large model and distilling it to a small model
4. Increasing accuracy is the most important thing - not cost or latency - we find that if you solve the finetuning life cycle, ie continuous auto fine-tuning, data filtering, reinforcement learning via DPO, it works well!
5. Lots of organizations use DPO and preference fine-tuning to align models since they have tonnes of feedback data!
6. Yep, vision fine-tuning! E.g. medical diagnosis, docs, QA on pics etc
7. And obviously large model labs finetune all base models, ie GPT-4.5 is a finetune of a base model
8. Finally, reasoning finetuning via GRPO is very cool! If you have inputs and outputs but no labelled CoT in between, GRPO is the way to go! Custom reward functions by companies!
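For point 8, a rough sketch of GRPO with a custom reward function using TRL (the dataset, model name, and reward logic are all placeholders, and exact argument names may differ between TRL versions):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: prompts plus known-correct final answers, no CoT labels.
train_dataset = Dataset.from_list([
    {"prompt": "What is 17 * 24?", "answer": "408"},
])

def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward 1.0 if the known answer appears in the completion, else 0.0."""
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruct model
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-out", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```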
Vision LLMs are definitely an interesting application.
At Avy.ai we're running small (2B-7B, quantized) vision models as part of a Mac desktop application for understanding what someone is working on in the moment, to offer them related information and actions.
We found that the raw results in understanding the images with a light LoRA fine-tune are not substantially different -- but the ease of getting a small model to follow instructions, outputting structured data in response to the image at the level of verbosity and detail we need, is greatly enhanced with fine-tuning. Without fine-tuning, the models at the smaller end of that scale would be much more difficult to use, not reliably producing output that matches what the consuming application expects.
Using a grammar to force decoding of, say, valid JSON would work, but that hasn't always been available in the implementations we've been using (like MLX). It's solvable by software engineering, adding that to the decoders in those frameworks, but fine-tuning has been effective without that work.
The bigger thing, though, was getting the models to have the appropriate levels of verbosity and detail in their output, which fine-tuning made more consistent.
Finetuning is easy and worthwhile, especially with LoRAs as these Unsloth demos do. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
In practice prompt engineering and few-shot prompting with modern LLMs, due to their strong-and-only-getting-better-over-time prompt adherence, tends to be more pragmatic.
There are inference providers such as Together AI that will serve your LoRA adapters at no extra cost above the model price. Then, there’s basically no difference between using your fine-tuned model or an API model off the shelf (except for the benefits you get from fine-tuning).
> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable
It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full model. Hosting the 4-bit quantized version can be done at relatively low cost. You can use an A40 or RTX 3090 on Runpod for ~$300/month.
For self-hosting I've been using https://tuns.sh which is a tunneling solution using SSH. It works great for prototyping and I've been using it to host open-webui
> If you have the resources to fine tune, you have the resources to run inference on fine tuned model.
I don't think that's true.
I can fine tune a model by renting a few A100s for a few hours, total cost in the double digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
I've been finetuning these models since before chatGPT, and the one lesson I've learned is that by the time you have set up everything to fine-tune a model, you can expect a newer model to do as well with prompt-tuning.
So, unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to finetune for the last 4 years; at best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you did hope to achieve, you needed to automate at a higher level, i.e. automate data collection and the collection of eval cases.
It feels like there should be a service where I just drag drop a folder of examples and it fine tunes the latest DeepSeek or whatever for me and even can host it for me at some cost. I'd pay for that immediately, but last I checked there was nothing that really did that well (would love to be wrong).
There are some options out there, depending on what type of task you're trying to fine tune. I think RL finetuning for DeepSeek e.g. isn't well developed yet, but you can finetune a small LLama model (~3B params) for classification or extraction tasks and it works really well. What sort of tasks were you looking at finetuning for?
Vibe coding has taken over for frontend dev, but outside that narrow band of very visible coding, most models aren't great at more esoteric programming languages. Even Swift gives Claude trouble. So the reason to fine-tune is simply that the best newest models still remain bad at things outside their comfort zone (how human).
I take my quip both ways, so I would wager that even with finetuning, these models are only 1 generation ahead in esoteric language performance and therefore _still not very good_. Am I correct?
You wrote, emphatically, that it would be "still not very good". Why do you believe that it would still not be very good after training on a specific problem? LLMs aren't able to do things outside their training data, as vast as it is, but if it's in its training data, why are you emphatic that it's still not very good? If I ask it to make something that it just needs to copy out of sample code, it would be pretty good at that one very specific task, to me.
I work for DeepMind on project Astra. Not to dwell too deep into confidentiality of what capabilities I have been looking at, but it has been the theme since the flamingo model that you only gain about 1 model-generation by fine-tuning versus prompt-tuning.
I have documents from the last 50 years that I need to digitize, millions of them, written in old Arabic. OCR is not accurate on the handwritten documents, so I need to fine-tune a model on around 300k pairs of texts (OCR output and manually corrected versions).
Arabic OCR is a mess with historical texts. Take the word الف (alf/thousand) in dates like 1950 - in old documents, the ف (fa) had a dot below it, but modern OCR doesn't get this and outputs الد (alad), which is just gibberish in Arabic
Same problem with ق (qaf) written as ف (fa) in old Arabic
And don't get me started on merged letters! In محمد (Muhammad), sometimes the م (meem) sits right on top of the ح (haa), or appears as a little circle below the line. Modern OCR has no clue what to do with these
My solution? Run OCR first, then use LLMs to fix the mess based on context. The surprising part? In my tinkering, smaller fine-tuned models actually do BETTER at this specific task than the big general-purpose ones. They seem to learn the patterns of historical Arabic quirks more effectively. Pretty neat tradeoff of specialized knowledge vs. general intelligence
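A minimal sketch of how those 300k (OCR output, corrected text) pairs could be turned into instruction-style fine-tuning data (the prompt template and JSONL layout here are just one common convention, not the poster's actual pipeline):

```python
import json

def to_training_record(ocr_text: str, corrected_text: str) -> dict:
    """One supervised example: noisy OCR in, human-corrected Arabic out."""
    return {
        "messages": [
            {"role": "system",
             "content": "You correct OCR errors in historical Arabic text. "
                        "Fix character confusions (e.g. ف vs ق) without changing the meaning."},
            {"role": "user", "content": ocr_text},
            {"role": "assistant", "content": corrected_text},
        ]
    }

# pairs = [(ocr, corrected), ...]  # the ~300k aligned pairs
def write_jsonl(pairs, path="arabic_ocr_correction.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for ocr, corrected in pairs:
            # ensure_ascii=False keeps the Arabic text human-readable in the file.
            f.write(json.dumps(to_training_record(ocr, corrected), ensure_ascii=False) + "\n")
```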
IMHO the biggest factor holding that back is how rushed and distanced these model releases are, still.
Both Phi-4-mini and Gemma 3 were released recently. Phi-4's damn close to a good, real, model release. Microsoft's done a great job of iterating.
Gemma 3's an excellent, intelligent, model, but it's got a gaping blind spot: tool-calling / JSON output. There was a vague quick handwave about it in some PR, a PM/eng on the Gemma team commented here in response to someone else that TL;DR "it's supported in Ollama!", which is Not Even Wrong, i.e. in the Pauli sense of the phrase.
- Ollama uses a weak, out of date llama.cpp thing where the output tokens are constrained to match a JSON schema. This falls apart almost immediately, i.e. as soon as there is more than one tool.
- The thing that matters isn't whether we can constrain output tokens - any model can do that; I've had Llama 3 1B making tool calls that way. The thing that matters is A) did you train that in, and B) if you did, tell us the format.
All that to say, IMHO we're still 6 months to a year out from BigCo understanding enough about their own stuff to even have a good base for it. Sure, tool calling and fine-tuning are orthogonal, in a sense, but in practice, if I'm interested in getting a specific type of output, odds are I wanted that formatted a specific way.
Gemma3 1B seems to be able to choose which tool to use for very simple cases, if you constrain using anyOf, and narrow it down to just a few with RAG first.
It can't understand numbers very well though, "one thousand five" might become "1500".
JSON constraints seem to make them unable to figure it out even if they'd normally get it every time.
I’m trying right now. The combination of small models, QLoRA and GRPO has made it accessible to experimenters. I’m not using Unsloth yet, but I will probably start checking it out pretty soon so that I can train larger models or increase the number of generations for GRPO.
I am. I have some use cases related to data extraction where using a fine tuned small model outperforms the best-in-class closed source models and at a fraction of the cost.
Instead of versions, these things should be labeled by their release date, since this kind of training starts from a dataset snapshot in time - colloquially called the knowledge-cutoff date, which isn't really accurate.
We are optimizing these on different dimensions at once, with multiple branches of evolution from each model, so a successor version name doesn't really convey that.
Great article, but I didn't see anything about the costs.
I'm particularly interested in this aspect because we're considering fine-tuning Gemma 3, but our budget is tight. We're looking into (real-world) cost estimates for this approach.
Oh hey! For now we don't have a platform, so we generally tell folks to use Colab free gpus! Kaggle also has 30 hours for free per week! I put links for kaggle here: https://docs.unsloth.ai/get-started/unsloth-notebooks
It likely makes sense to use more expensive frontier models as teachers or architects for smaller fine-tuned ones that generate the majority of tokens (though possibly against the ToS).