I see a couple of comments comparing llama.cpp and Ollama, and I think both have utility for different purposes. Having used both llama.cpp (which is fantastic) and Ollama, here are a couple of things I find valuable about Ollama out of the box --
- Automatically loading/unloading models from memory - just running the Ollama server has a relatively small footprint; every time a particular model is called, it is loaded into memory and then unloaded after 5 minutes of no further use. It makes it very convenient to spin up different models for different use-cases without having to worry about memory management or manually shutting down those tools when not in use.
- OpenAI API compatibility - I run Ollama on a headless machine that has better hardware and connect via SSH port forwarding from my laptop, and with a one-line change I can reroute any scripts on my laptop from GPT to Llama-3 (or anything else); see the sketch below.
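For anyone curious, a minimal sketch of that one-line change using the official openai Python client, assuming Ollama's default port 11434 is what gets forwarded over SSH (the model name is just an example):

    from openai import OpenAI

    # Point the client at the forwarded Ollama port instead of api.openai.com;
    # this base_url is the only line that changes in an existing script.
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required by the client library, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3",  # any model already pulled into Ollama
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)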
Overall, at least for tinkering with multiple local models and building small, personal tools, I've found the utility:maintenance ratio of Ollama to be very positive -- thanks to the team for building something so valuable! :)
I think this is a neat project, and use it a lot. My only complaint is the lack of grammar support. The llama.cpp server that they wrap will take a grammar, and the simplest patch to enable this is about two lines, yet they seem to be willfully ignoring the (pretty trivial) feature for some reason. I'd rather not maintain a -but-with-grammars fork, so here we are.
I think the two main maintainers of Ollama have good intentions but suffer from a combination of being far too busy, juggling their forked llama.cpp server and not having enough automation/testing for PRs.
There is a new draft PR up that looks at moving away from maintaining a llama.cpp fork and toward using llama.cpp through cgo bindings, which I think will really help: https://github.com/ollama/ollama/pull/5034
There are many pull requests trying to implement this feature, and they don't even care to reply. This is the only reason I'm still using the llama.cpp server instead of this.
Sorry it's taking so long to review and for the radio silence on the PR.
We have been trying to figure out how to support more structured output formats without some of the side effects of grammars. With JSON mode (which uses grammars under the hood) there were originally quite a few issue reports, mainly around lower performance and cases where the model would infinitely generate whitespace, causing requests to hang. This is an issue with OpenAI's JSON mode as well, which requires the caller to "instruct the model to produce JSON" [1]. While it's possible to handle edge cases for a single grammar such as JSON (e.g. check for 'JSON' in the prompt), it's hard to generalize this to any format.
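For anyone following along, a minimal sketch of what JSON mode looks like from the caller's side today, assuming a local server on the default port (note the prompt still has to ask for JSON, per the caveat above):

    import requests

    # "format": "json" constrains sampling to valid JSON via a grammar under
    # the hood, but the prompt itself should still request JSON, otherwise
    # the model can wander off generating whitespace.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "List three primary colors as a JSON object.",
            "format": "json",
            "stream": False,
        },
    )
    print(resp.json()["response"])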
Supporting more structured output formats is definitely important. Fine-tuning for output formats is promising, and this thread [2] also has some great ideas and links.
I've been using llama.cpp for about a year now, mostly implementing some RAG and ReAct related papers to stay up to date. For most of that time I used llama.cpp exclusively, but for the last few months I've been using both Ollama and llama.cpp.
If you added grammars I wouldn't have to run the two servers. I think you're doing an excellent job of maintaining Ollama; every update is like Christmas. The llama.cpp team also doesn't seem to treat the server as a priority (it's still literally just an example of how you'd use their C API).
So, I understand your position, since their server API has been quite unstable, and the grammar validation didn't work at all until February. I also still can't get their multiple model loading to work reliably right now.
Having said that, GBNF is a godsend for my daily use cases. I even prefer using phi3b with a grammar to dealing with the hallucinations of a 70b without it. Fine-tuning helps a lot, but it can't solve the problem fully (you still need to validate the generation), and it's a lot less agile when implementing ideas. Creating synthetic datasets is also easier if you have support for grammars.
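To make it concrete, this is roughly the workflow I fall back to the llama.cpp server for; a minimal sketch assuming the example server is running on its default port 8080 and accepts a GBNF grammar in the /completion request:

    import requests

    # A tiny GBNF grammar that forces the model to answer with a bare yes/no.
    GRAMMAR = 'root ::= "yes" | "no"'

    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": "Is 7 a prime number? Answer yes or no: ",
            "grammar": GRAMMAR,
            "n_predict": 4,
        },
    )
    print(resp.json()["content"])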
I think many like me are in the same spot. Thank you for being considerate about the stability and support that it would require. But please, take a look at the current state of their grammar validation, which is pretty good right now.
Not to put too fine a point on it, but why not merge one of the simpler PRs for this feature, gate it behind an opt-in env var (e.g. OLLAMA_EXPERIMENTAL_GRAMMAR=1), and sprinkle the caveats you've mentioned into the documentation? That should be enough to ward off the casuals who would flood the issue queue. Add more hoops if you'd like.
There seems to be enough interest in this specific feature that you don't need to make it perfect or provide a complicated abstraction. I am very willing to accept/mitigate the side effects for the ability to arbitrarily constrain generation. Not sure about others, but given there are half a dozen different PRs specifically for this feature, I am pretty sure they, too, are willing to accept the side effects.
Since it's trivial enough to run mainline features on llama.cpp itself, it seems redundant to ask Ollama to implement and independently maintain branches or features that aren't fully working upstream, unless they're already in an available testing branch.
We're not relying on ollama for feature development and there are multiple open projects with implementations already, so no one is deprived of anything without this or a hundred other potential PRs not in ollama yet.
Same! I use ollama a lot, but when I need to do real engineering with language models, I end up having to go back to llama.cpp because I need grammar-constrained generation to get most models to behave reasonably. They just don't follow instructions well enough without it.
What's the deal with hosted open source model services not supporting grammars? I've seen fireworks.ai do it, but not anybody else - am I missing something?
Came here to say this. Love Ollama, have been using it since the beginning, and can't understand why the GBNF proposals are apparently being ignored. Really hope they move it forward. Llama 3 really drove this home for us: for small-parameter models especially, a grammar can be the difference between useful and useless.
Big kudos to the ollama team, echoing others: it just works. I fiddled with llama.cpp for ages trying to get it to run on my GPU, and ollama was set up and done in literally 3 minutes. The memory management of model loading and unloading is great, and now I can hack around and play with different LLMs from a simple API. Highly recommend that folks try it out; I thought local LLMs would be a pain to set up and use, and ollama made it super easy.
Not at the moment, although it is a highly requested feature (specifically fine-tuning). There are a few tools that you can (or will soon be able to) use to fine-tune a model and then import the resulting adapter layers into Ollama: MLX [1] on macOS, and Unsloth [2] on Windows and Linux.
Pre-release versions are created to test new updates on a bunch of different hardware setups (OS/GPUs) before releasing more broadly (and before making new versions the default for the Linux/macOS/Windows installers – those pull from the 'latest' release).
There are a good number of folks who test the pre-releases as well (thank you!), especially if there's a bug fix or new feature they are waiting for. "Watch"ing the repo on GitHub will send emails/notifications for new pre-release versions.
it took me several hours to get llama.cpp working as a server, it took me 2 minutes to get ollama working.
Much like how I got into Linux via linux-on-vfat-on-msdos and wouldn't have gotten into Linux otherwise, ollama got me into llama.cpp by making me understand what was possible.
Then again, I am Gen X and we are notoriously full of lead poisoning.
> it took me several hours to get llama.cpp working as a server
Mm... Running a llama.cpp server is annoying; which model to use? Is it in the right format? What should I set `ngl` to? However, perhaps it would be fairer and more accurate to say that installing llama.cpp and installing ollama have slightly different effort levels (one taking about 3 minutes to clone and run `make` and the other taking about 20 seconds to download).
Once you have them installed, just typing: `ollama run llama3` is quite convenient, compared to finding the right arguments for the llama.cpp `server`.
Sensible defaults. Installs llama.cpp. Downloads the model for you. Runs the server for you. Nice.
> it took me 2 minutes to get ollama working
So, you know, I think it's broadly speaking a fair sentiment, even if it probably isn't quite true.
...
However, when you look at it from that perspective, some things stand out:
- ollama is basically just a wrapper around llama.cpp
- ollama doesn't let you do all the things llama.cpp does
- ollama offers absolutely zero way, or even the hint of a suggestion of a way, to move from using ollama to using llama.cpp if you need anything more.
Here are some interesting questions:
- Why can't I just run llama.cpp's server with the defaults from ollama?
- Why can't I get a simple dump of the 'sensible' defaults from ollama that it uses?
- Why can't I get a simple dump of the GGUF (or whatever) model file ollama uses?
- Why isn't 'a list of sensible defaults' just a github repository with download link and a list of params to use?
- Who's paying for the enormous cost of hosting all those ollama model files and converting them into usable formats?
The project is convenient, and if you need an easy way to get started, absolutely use it.
...but, I guess, I recommend you learn how to use llama.cpp itself at some point, because most free things are only free while someone else is paying for them.
Consider this:
If ollama's free hosted models were no longer free and you had to manually find and download your own model files, would you still use it? Could you still use it?
If not... maybe, don't base your business / anything important around it.
It's a SaaS with an open source client, and you're using the free plan.
> If ollama's free hosted models were no longer free and you had to manually find and download your own model files, would you still use it? Could you still use it?
I would absolutely still use it; I've already ended up feeding it gguf files that weren't in their curated options. The process (starting from having foo.gguf) is literally just:
echo FROM ./foo.gguf > ./foo.gguf.Modelfile
ollama create foo -f foo.gguf.Modelfile
(Do I wish there was an option like `ollama create --from-gguf` to skip the Modelfile? Oh yes. Do I kinda get why it exists? Also yes (it lets you bake in a prompt and IIRC other settings). Do I really care? Nope, it's low on the list of modestly annoying bits of friction in the world.)
I don't feel any worse about Ollama funding the hosting and bandwidth of all of these models than I do about their upstream hosting source being Hugging Face, which raises the same concerns.
People should definitely be more aware of llama.cpp, but I don't want to undersell the value that Ollama adds here.
I'm a contributor/maintainer of llama.cpp, but even I'll use Ollama sometimes -- especially if I'm trying to get coworkers or friends up and running with LLMs. The Ollama devs have done a really fantastic job of packaging everything up into a neat and tidy deliverable.
Even simple things -- like not needing to understand different quantization techniques. What's the difference between Q4_K and Q5_1, and -- what the heck do I want? The Ollama devs don't let you choose -- they say "you can have any quantization level you want as long as it's Q4_0" and are done with it. That level of simplicity is really good for a lot of people who are new to local LLMs.
This is bad advice. Ollama may be “just a wrapper”, but it’s a wrapper that makes running local LLMs accessible to normal people outside the typical HN crowd who don’t have the first clue what a Makefile is or which cuBLAS compiler settings they need.
Ollama also exposes an HTTP API, on Linux at least, that can be used with open-webui. Maybe llama.cpp has this, I don't know, but I use Ollama mostly through that API. It's excellent for that.
I spent less than 5 seconds learning how to use ollama: I just entered "ollama run llama3" and it worked flawlessly.
I spent HOURS setting up llama.cpp from reading the docs and then following this guide (after trying and failing with other guides which turned out to be obsolete):
Using llama.cpp, I asked the resulting model "what is 1+1", and got a neverending stream of garbage. See below. So no, it is not anywhere near as easy to get started with llama.cpp.
--------------------------------
what is 1+1?") and then the next line would be ("What is 2+2?"), and so on.
How can I make sure that I am getting the correct answer in each case?
Here is the code that I am using:
\begin{code}
import random

def math_problem():
    num1 = random.randint(1, 10)
    num2 = random.randint(1, 10)
    problem = f"What is {num1}+{num2}? "
    return problem

def main():
    print("Welcome to the math problem generator!")
    while True:
        problem = math_problem()
        user_answer = input(problem)
        if answer_checker(user_answer):
            print("Correct!")
        else:
            print("Incorrect. Try again!")

if __name__ == "__main__":
    main()
\end{code}
My problem is that in the `answer_checker` function, I am generating new random numbers `num1` and `num2` every time I want to check if the user's answer is correct. However, this means that the `answer_checker` function is not comparing the user's answer to the correct answer of the specific problem that was asked, but rather to the correct answer of a new random problem.
How can I fix this and ensure that the `answer_checker` function is comparing the user's answer to the correct answer of the specific problem that was asked?
I've installed ollama and the open chat (?) software on a local machine. It was pretty simple to get working.
Is it 'just as simple' to install llama.cpp?
If not, I would consider the ollama wrapper to have an actual use-case.
You can make a working WiFi temp/humidity sensor with an ESP32, a DHT22, ESPHome, and Home Assistant. It's not hard, but many, many people wouldn't have a clue where to start on this, so they buy the more expensive pre-packaged items instead. Also a useful use case.
Opening things up to a broader demographic may make it worth it. (But attribution should surely exist.)
You said it first! I remember when Docker appeared and I was trying to understand what Docker offered beyond configuring the operating system, until I realized there was nothing to understand!
I mention this because one of my companies does Microsoft Windows ad-hoc virtualization at the user-space level, and we tried to figure out how to benchmark it against Docker.
I think you are arguing at cross-purposes. If you use HuggingFace to simply download models to run locally with ollama or llama.cpp, then you can say it is "local". But you can also use it as a service to run models (which is how they make money). Then they obviously aren't local.
I have no idea where people got the idea that Hugging Face isn’t local. I mean they show you how to run everything locally, with all the quantization strategies you could want, with far fewer bugs.
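As a rough sketch of what "local" looks like on the Hugging Face side (the model name here is just an illustrative ungated choice, and device_map="auto" assumes accelerate is installed):

    from transformers import pipeline

    # Downloads the weights once into the local cache, then runs entirely locally.
    pipe = pipeline(
        "text-generation",
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative example model
        device_map="auto",
    )
    print(pipe("What is 1+1?", max_new_tokens=32)[0]["generated_text"])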
Who is reinventing PyPI? Ollama isn't even written in Python.
I think that's part of the point: the PyPI/Python/Conda ecosystem is an absolute shitshow right now. Three packages depend on three different versions of PyTorch. The system installs CUDA 12.3 with the NVIDIA driver, but xformers wants CUDA 12.1, and the NVIDIA driver that would provide 12.1 doesn't compile on Ubuntu 24.04. PyTorch provides its own CUDA that's different from the system version, and that makes another package unhappy. One package installs opencv, another package overwrites it with opencv-headless, pip install apex doesn't work with gcc-12, ...
Main reason I started using ollama is because it actually worked. I spent half an hour trying to get huggingface models to run on my GPU with no success. Meanwhile `sudo pacman -S ollama-rocm` seemed to just work out of the box.
Personally I love that ollama gives me a really simple API that I can code against in any language and on any platform that I care to. I can `docker run ollama` on my gaming rig and then hit it up from a Raspberry Pi running anything, Python or not.
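For example, hitting the native API from Python is just an HTTP POST; a minimal sketch assuming the server is reachable on the default port 11434 (swap localhost for the gaming rig's address when calling remotely):

    import requests

    # The same request works from any language with an HTTP client.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Why is the sky blue?"}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])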