I see a couple of comments comparing llama.cpp and Ollama, and I think both have utility for different purposes. Having used both llama.cpp (which is fantastic) and Ollama, here are a couple of things I find valuable about Ollama out of the box --
- Automatically loading/unloading models from memory - just running the Ollama server has a relatively small footprint; every time a particular model is called, it is loaded into memory and then unloaded after 5 minutes of no further use. It makes it very convenient to spin up different models for different use-cases without having to worry about memory management or manually shutting down those tools when not in use.
- OpenAI API compatibility - I run Ollama on a headless machine that has better hardware and connect via SSH port forwarding from my laptop, and with a one-line change I can reroute any scripts on my laptop from GPT to Llama-3 (or anything else); see the sketch below.
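For anyone curious, a minimal sketch of that one-line change using the official openai Python client, assuming Ollama's default port 11434 is what gets forwarded over SSH (the model name is just an example):

    from openai import OpenAI

    # Point the client at the forwarded Ollama port instead of api.openai.com;
    # this base_url is the only line that changes in an existing script.
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required by the client library, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3",  # any model already pulled into Ollama
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)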
Overall, at least for tinkering with multiple local models and building small, personal tools, I've found the utility:maintenance ratio of Ollama to be very positive -- thanks to the team for building something so valuable! :)
I think this is a neat project, and use it a lot. My only complaint is the lack of grammar support. The llama.cpp server that they wrap will take a grammar, and the simplest patch to enable this is about two lines, yet they seem to be willfully ignoring the (pretty trivial) feature for some reason. I'd rather not maintain a -but-with-grammars fork, so here we are.
I think the two main maintainers of Ollama have good intentions but suffer from a combination of being far too busy, juggling their forked llama.cpp server and not having enough automation/testing for PRs.
There is a new draft PR up that looks at moving away from maintaining a llama.cpp fork and toward using llama.cpp through cgo bindings, which I think will really help: https://github.com/ollama/ollama/pull/5034
There are many pull requests trying to implement this feature, and they don't even care to reply. This is the only reason I'm still using the llama.cpp server instead of this.
Sorry it's taking so long to review and for the radio silence on the PR.
We have been trying to figure out how to support more structured output formats without some of the side effects of grammars. With JSON mode (which uses grammars under the hood) there were originally quite a few issue reports, mainly around lower performance and cases where the model would infinitely generate whitespace, causing requests to hang. This is an issue with OpenAI's JSON mode as well, which requires the caller to "instruct the model to produce JSON" [1]. While it's possible to handle edge cases for a single grammar such as JSON (e.g. check for 'JSON' in the prompt), it's hard to generalize this to any format.
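For anyone following along, a minimal sketch of what JSON mode looks like from the caller's side today, assuming a local server on the default port (note the prompt still has to ask for JSON, per the caveat above):

    import requests

    # "format": "json" constrains sampling to valid JSON via a grammar under
    # the hood, but the prompt itself should still request JSON, otherwise
    # the model can wander off generating whitespace.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "List three primary colors as a JSON object.",
            "format": "json",
            "stream": False,
        },
    )
    print(resp.json()["response"])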
Supporting more structured output formats is definitely important. Fine-tuning for output formats is promising, and this thread [2] also has some great ideas and links.
I've been using llama.cpp for about a year now, mostly implementing some RAG and ReAct related papers to stay up to date. For most of that time I used llama.cpp exclusively, but for the last few months I've been using both Ollama and llama.cpp.
If you added grammars I wouldn't have to run the two servers. I think you're doing an excellent job of maintaining Ollama; every update is like Christmas. The llama.cpp team also doesn't seem to treat the server as a priority (it's still literally just an example of how you'd use their C API).
So, I understand your position, since their server API has been quite unstable, and the grammar validation didn't work at all until February. I also still can't get their multiple model loading to work reliably right now.
Having said that, GBNF is a godsend for my daily use cases. I even prefer using phi3b with a grammar to dealing with the hallucinations of a 70b without it. Fine-tuning helps a lot, but it can't solve the problem fully (you still need to validate the generation), and it's a lot less agile when implementing ideas. Creating synthetic datasets is also easier if you have support for grammars.
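To make it concrete, this is roughly the workflow I fall back to the llama.cpp server for; a minimal sketch assuming the example server is running on its default port 8080 and accepts a GBNF grammar in the /completion request:

    import requests

    # A tiny GBNF grammar that forces the model to answer with a bare yes/no.
    GRAMMAR = 'root ::= "yes" | "no"'

    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": "Is 7 a prime number? Answer yes or no: ",
            "grammar": GRAMMAR,
            "n_predict": 4,
        },
    )
    print(resp.json()["content"])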
I think many like me are in the same spot. Thank you for being considerate about the stability and support that it would require. But please, take a look at the current state of their grammar validation, which is pretty good right now.
Not to put too fine a point on it, but why not merge one of the simpler PRs for this feature, gate it behind an opt-in env var (e.g. OLLAMA_EXPERIMENTAL_GRAMMAR=1), and sprinkle the caveats you've mentioned into the documentation? That should be enough to ward off the casuals who would flood the issue queue. Add more hoops if you'd like.
There seems to be enough interest in this specific feature that you don't need to make it perfect or provide a complicated abstraction. I am very willing to accept/mitigate the side effects for the ability to arbitrarily constrain generation. Not sure about others, but given there are half a dozen different PRs specifically for this feature, I am pretty sure they, too, are willing to accept the side effects.
Since it's trivial enough to run mainline features on llama.cpp itself, it seems redundant to ask Ollama to implement and independently maintain branches or features that aren't fully working upstream, unless they're already in an available testing branch.
We're not relying on ollama for feature development and there are multiple open projects with implementations already, so no one is deprived of anything without this or a hundred other potential PRs not in ollama yet.
Same! I use ollama a lot, but when I need to do real engineering with language models, I end up having to go back to llama.cpp because I need grammar-constrained generation to get most models to behave reasonably. They just don't follow instructions well enough without it.
What's the deal with hosted open source model services not supporting grammars? I've seen fireworks.ai do it, but not anybody else - am I missing something?
Came here to say this. Love Ollama, have been using it since the beginning, and can't understand why the GBNF proposals are apparently being ignored. Really hope they move it forward. Llama 3 really drove this home for us: for small-parameter models especially, a grammar can be the difference between useful and useless.
Big kudos to the ollama team, echoing others: it just works. I fiddled with llama.cpp for ages trying to get it to run on my GPU, and ollama was set up and done in literally 3 minutes. The memory management of model loading and unloading is great, and now I can hack around and play with different LLMs from a simple API. Highly recommend that folks try it out; I thought local LLMs would be a pain to set up and use, and ollama made it super easy.
Not at the moment, although it is a highly requested feature (specifically fine-tuning). There are a few tools that you can (or will soon be able to) use to fine-tune a model and then import the resulting adapter layers into Ollama: MLX [1] on macOS, and Unsloth [2] on Windows and Linux.
Pre-release versions are created to test new updates on a bunch of different hardware setups (OS/GPUs) before releasing more broadly (and before making new versions the default for the Linux/macOS/Windows installers – those pull from the 'latest' release).
There are a good number of folks who test the pre-releases as well (thank you!), especially if there's a bug fix or new feature they are waiting for. "Watch"ing the repo on GitHub will send emails/notifications for new pre-release versions.
it took me several hours to get llama.cpp working as a server, it took me 2 minutes to get ollama working.
Much like how I got into Linux via linux-on-vfat-on-msdos and wouldn't have gotten into Linux otherwise, ollama got me into llama.cpp by making me understand what was possible.
Then again, I am Gen X and we are notoriously full of lead poisoning.
> it took me several hours to get llama.cpp working as a server
Mm... Running a llama.cpp server is annoying; which model to use? Is it in the right format? What should I set `ngl` to? However, perhaps it would be fairer and more accurate to say that installing llama.cpp and installing ollama have slightly different effort levels (one taking about 3 minutes to clone and run `make` and the other taking about 20 seconds to download).
Once you have them installed, just typing: `ollama run llama3` is quite convenient, compared to finding the right arguments for the llama.cpp `server`.
Sensible defaults. Installs llama.cpp. Downloads the model for you. Runs the server for you. Nice.
> it took me 2 minutes to get ollama working
So, you know, I think it's broadly speaking a fair sentiment, even if it probably isn't quite true.
...
However, when you look at it from that perspective, some things stand out:
- ollama is basically just a wrapper around llama.cpp
- ollama doesn't let you do all the things llama.cpp does
- ollama offers absolutely zero way, or even the hint of a suggestion of a way, to move from using ollama to using llama.cpp if you need anything more.
Here are some interesting questions:
- Why can't I just run llama.cpp's server with the defaults from ollama?
- Why can't I get a simple dump of the 'sensible' defaults from ollama that it uses?
- Why can't I get a simple dump of the GGUF (or whatever) model file ollama uses?
- Why isn't 'a list of sensible defaults' just a github repository with download link and a list of params to use?
- Who's paying for the enormous cost of hosting all those ollama model files and converting them into usable formats?
The project is convenient, and if you need an easy way to get started, absolutely use it.
...but, I guess, I recommend you learn how to use llama.cpp itself at some point, because most free things are only free while someone else is paying for them.
Consider this:
If ollama's free hosted models were no longer free and you had to manually find and download your own model files, would you still use it? Could you still use it?
If not... maybe, don't base your business / anything important around it.
It's a SaaS with an open source client, and you're using the free plan.
> If ollama's free hosted models were no longer free and you had to manually find and download your own model files, would you still use it? Could you still use it?
I would absolutely still use it; I've already ended up feeding it gguf files that weren't in their curated options. The process (starting from having foo.gguf) is literally just:
echo FROM ./foo.gguf > ./foo.gguf.Modelfile
ollama create foo -f foo.gguf.Modelfile
(Do I wish there was an option like `ollama create --from-gguf` to skip the Modelfile? Oh yes. Do I kinda get why it exists? Also yes (it lets you bake in a prompt and IIRC other settings). Do I really care? Nope, it's low on the list of modestly annoying bits of friction in the world.)
I don't feel any worse about Ollama funding the hosting and bandwidth of all of these models than I do about their upstream hosting source being Hugging Face, which raises the same concerns.
People should definitely be more aware of llama.cpp, but I don't want to undersell the value that Ollama adds here.
I'm a contributor/maintainer of llama.cpp, but even I'll use Ollama sometimes -- especially if I'm trying to get coworkers or friends up and running with LLMs. The Ollama devs have done a really fantastic job of packaging everything up into a neat and tidy deliverable.
Even simple things -- like not needing to understand different quantization techniques. What's the difference between Q4_K and Q5_1, and -- what the heck do I want? The Ollama devs don't let you choose -- they say "you can have any quantization level you want as long as it's Q4_0" and are done with it. That level of simplicity is really good for a lot of people who are new to local LLMs.
This is bad advice. Ollama may be “just a wrapper”, but it’s a wrapper that makes running local LLMs accessible to normal people outside the typical HN crowd who don’t have the first clue what a Makefile is or which cuBLAS compiler settings they need.
Ollama also exposes an HTTP API, on Linux at least, that can be used with open-webui. Maybe llama.cpp has this, I don't know, but I use Ollama mostly through that API. It's excellent for that.
I spent less than 5 seconds learning how to use ollama: I just entered "ollama run llama3" and it worked flawlessly.
I spent HOURS setting up llama.cpp from reading the docs and then following this guide (after trying and failing with other guides which turned out to be obsolete):
Using llama.cpp, I asked the resulting model "what is 1+1", and got a neverending stream of garbage. See below. So no, it is not anywhere near as easy to get started with llama.cpp.
--------------------------------
what is 1+1?") and then the next line would be ("What is 2+2?"), and so on.
How can I make sure that I am getting the correct answer in each case?
Here is the code that I am using:
\begin{code}
import random

def math_problem():
    num1 = random.randint(1, 10)
    num2 = random.randint(1, 10)
    problem = f"What is {num1}+{num2}? "
    return problem

def main():
    print("Welcome to the math problem generator!")
    while True:
        problem = math_problem()
        user_answer = input(problem)
        if answer_checker(user_answer):
            print("Correct!")
        else:
            print("Incorrect. Try again!")

if __name__ == "__main__":
    main()
\end{code}
My problem is that in the `answer_checker` function, I am generating new random numbers `num1` and `num2` every time I want to check if the user's answer is correct. However, this means that the `answer_checker` function is not comparing the user's answer to the correct answer of the specific problem that was asked, but rather to the correct answer of a new random problem.
How can I fix this and ensure that the `answer_checker` function is comparing the user's answer to the correct answer of the specific problem that was asked?
I've installed ollama and the open chat (?) software on a local machine. It was pretty simple to get working.
Is it 'just as simple' to install llama.cpp?
If not, I would consider the ollama wrapper to have an actual use-case.
You can make a working WiFi temp/humidity sensor with an ESP32, a DHT22, ESPHome, and Home Assistant. It's not hard, but many, many people wouldn't have a clue where to start on this, so they buy the more expensive pre-packaged items instead. Also a useful use case.
Opening things up to a broader demographic may make it worth it. (But attribution should surely exist.)
You said it first! I remember when Docker appeared and I was trying to understand what Docker offered beyond configuring the operating system, until I realized there was nothing to understand!
I mention this because one of my companies does Microsoft Windows ad-hoc virtualization at the user-space level, and we tried to figure out how to benchmark it against Docker.
I think you are arguing at cross-purposes. If you use HuggingFace to simply download models to run locally with ollama or llama.cpp, then you can say it is "local". But you can also use it as a service to run models (which is how they make money). Then they obviously aren't local.
I have no idea where people got the idea that Hugging Face isn’t local. I mean they show you how to run everything locally, with all the quantization strategies you could want, with far fewer bugs.
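As a rough sketch of what "local" looks like on the Hugging Face side (the model name here is just an illustrative ungated choice, and device_map="auto" assumes accelerate is installed):

    from transformers import pipeline

    # Downloads the weights once into the local cache, then runs entirely locally.
    pipe = pipeline(
        "text-generation",
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative example model
        device_map="auto",
    )
    print(pipe("What is 1+1?", max_new_tokens=32)[0]["generated_text"])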
Who is reinventing PyPI? Ollama isn't even written in Python.
I think that's part of the point: the PyPI/Python/Conda ecosystem is an absolute shitshow right now. Three packages depend on three different versions of PyTorch. The system installs CUDA 12.3 with the NVIDIA driver, but xformers wants CUDA 12.1, and the NVIDIA driver that would provide 12.1 doesn't compile on Ubuntu 24.04. PyTorch provides its own CUDA that's different from the system version, and that makes another package unhappy. One package installs opencv, another package overwrites it with opencv-headless, pip install apex doesn't work with gcc-12, ...
Main reason I started using ollama is because it actually worked. I spent half an hour trying to get huggingface models to run on my GPU with no success. Meanwhile `sudo pacman -S ollama-rocm` seemed to just work out of the box.
Personally I love that ollama gives me a really simple API that I can code against in any language and on any platform that I care to. I can `docker run ollama` on my gaming rig and then hit it up from a Raspberry Pi running anything, Python or not.
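For example, hitting the native API from Python is just an HTTP POST; a minimal sketch assuming the server is reachable on the default port 11434 (swap localhost for the gaming rig's address when calling remotely):

    import requests

    # The same request works from any language with an HTTP client.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Why is the sky blue?"}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])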