Just a note from my own experiments with Ollama: it's tempting to use LangChain as a sort of unofficial Ollama client library because it's so easy to set up. But be warned that, because LangChain abstracts over many providers, it doesn't actually give access to the full power of the Ollama API. For anything other than very simple toy demos it's not a great choice, unless you actually want to use the other LangChain abstractions.
The first pain point you're likely to run into is that Ollama provides a running binary context that represents all the processing your GPU has done in the conversation so far. The last time I looked into LangChain, they wanted you to build up a plaintext context on the client and ask Ollama to re-process it on every request. That works well as part of the wider suite of LangChain tools, but isn't great if you really just need Ollama.
For JS/TS and Python, Ollama now provides an official client library which presumably does take advantage of all Ollama features [0]. For other languages, the Ollama API is actually very simple [1] and easy to wrap yourself, and I'd recommend doing so unless you specifically need some of the abstractions LangChain provides.
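To give a sense of how thin such a wrapper can be, here's a minimal sketch in Python against the /generate endpoint from [1]. It assumes a local Ollama on the default port 11434, the `requests` package, and a model you've already pulled; the model name is just a placeholder.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default local Ollama port


def generate(prompt: str, model: str = "gemma:7b") -> str:
    """One non-streaming call to Ollama's /api/generate endpoint."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


print(generate("Why is the sky blue?"))
```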
Since you brought up that context, do you happen to know how it works?
I tested it, and it definitely works like a history: e.g. you can feed it to a model, ask a question about what you just talked about, and it works.
Does Ollama essentially just convert the conversation so far into a binary representation, then convert it back to text and feed it back to the model ahead of your newest query? Or is it doing something more involved?
I wish it were better documented; I have a ton of questions about how it functions in practice that I'll end up trying to reverse-engineer.
Like, what’s the lifetime of this context? If I load a new model into memory then reload the original, is that context valid? Is it valid if the computer restarts? Is it valid if the model gets updated?
I'm not an LLM expert, but based on your explanation that it's related to embeddings (and another comment that Ollama loads it directly into memory), I'm guessing that the model updating its weights definitely invalidates the context. Not so sure about the other cases, like unloading/reloading the same model.
It's easier to reason about if you understand what it is. That binary data is basically a big vector representation of a textual context's contents. I suspect that it doesn't matter which model you use with the context binary, since Ollama handles providing it to the model.
Any particular resources you’d recommend to learn more about this?
So if I'm understanding it correctly, there's one consistent way that Ollama will vectorize a given set of text. Perhaps there are various ways one could do it, but Ollama chooses one.
What about multimodality? I.e., taking a context from a prompt to llava to identify an image, then asking further questions about the contents of that image? Any non-llava model would definitely hallucinate, but would llava?
Multimodal models use embeddings as well; the difference is that they've been trained to associate the same position in latent space with text and with the image that text describes, so that they can go from a textual response to an image and vice versa. A lot of models use CLIP, an embedding method from OpenAI.
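If you want to see that shared latent space idea in action, here's a rough sketch using the open CLIP checkpoint through Hugging Face's `transformers`. This is CLIP directly, not Ollama or llava; the checkpoint name is the public OpenAI one and the image path is a placeholder.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; text and images are embedded into the same space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image path
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption and the image land closer together in latent space.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```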
AFAIU, LangChain will typically use the underlying SDK (Ollama's in this case) and prefers the native configuration types and options, so support for whatever advanced features are missing could be added!
Can you share a source or docs describing the "running binary context"? I'm using this stack (as a dev, not an ML engineer) and appreciate the tip, but would like to understand the delta better. I just spent a few minutes searching through the code but don't see anything obvious. Thanks!
My second link documents the `context` parameter for Ollama. It's unfortunately pretty hard to search for because the term is overloaded.
`context` is a number array that gets returned with each /generate response and can be sent back to the server with the next request. There's not much more documentation than what's in that link, but as I understand it the array is a JSON representation of the current state of inference at the time Ollama returned. So when you send the context back up with a second request, Ollama can just load that array into memory and start inference with the new prompt, rather than having to re-encode the whole conversation up to that point.
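Concretely, the round trip looks something like this. This is only a sketch using the `requests` package against a local Ollama on the default port; the model name is a placeholder, and the field names are the ones from the /generate docs.

```python
import requests

URL = "http://localhost:11434/api/generate"  # default local Ollama
MODEL = "gemma:7b"  # placeholder: any model you've already pulled

# First turn: no context yet.
first = requests.post(URL, json={
    "model": MODEL,
    "prompt": "My name is Ada. Please remember that.",
    "stream": False,
}).json()

# Second turn: hand back the returned context so Ollama can resume from
# the already-processed state instead of re-encoding the whole history.
second = requests.post(URL, json={
    "model": MODEL,
    "prompt": "What is my name?",
    "context": first["context"],
    "stream": False,
}).json()

print(second["response"])
```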
I'm just playing around with these models, first time using them locally. I don't know about huggingface, but with ollama I was up and running in about 10 minutes. Shockingly easy! Also, ollama claims to distribute the load automatically over CPU and GPU.
Gemma 2B and 7B are both very fast and work well in common domains, but they have some holes in their knowledge. Mixtral impresses with completeness, but is noticeably slower.
Interesting, I've had a different experience with Gemma 7B in terms of it being generally useful for basic questions after a couple of prompts.
It frequently gets stuck and starts outputting grammatically incorrect sentences and answers unrelated to the prompt.
I also found it answering everything with "Sure, " and it wasn't able to stop even when prompted to. Another thing was that every response was a list of items.
Feels like a temperature issue, but I'm not sure. With mistral or llama I get a better session in every case so far (only about five conversations with Gemma).
Ollama is great for running a separate LLM inference server with very little setup required. This can be useful for a number of situations. Here are a few:
* You have a machine with a GPU that can do the inference but you need to run the application code on a smaller device. In my case, that's a Raspberry Pi with a touchscreen.
* You have multiple applications that all need to use LLM inference for different purposes. Loading the models exactly once is more efficient than loading them with huggingface for each application.
* Your python script is ephemeral and run on demand. You don't want to load the model each execution because it's slow, so you need a daemon. Ollama can serve that role without you having to write anything other than the script.
If you're writing Python and only need one application that's going to run on a powerful machine, you're probably better off running the models directly, but I'd venture a guess that that's a minority case.
As for flexibility, there are probably some things that are easier to do when you're running the inference yourself, but Ollama's API is powerful enough for most use cases.
> Your python script is ephemeral and run on demand. You don't want to load the model each execution because it's slow, so you need a daemon. Ollama can serve that role without you having to write anything other than the script.
This is exactly my use case for it; I invoke a Python binary which uses the Ollama API and get a model response within seconds because it’s already resident in memory.
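For anyone curious, the whole script can be tiny. Here's a sketch using the official `ollama` Python package from [0]; the model name and default prompt are just placeholders.

```python
#!/usr/bin/env python3
"""Ephemeral script: the model stays resident in the Ollama daemon,
so each run only pays for inference, not for loading the model."""
import sys

import ollama  # official client: pip install ollama


def main() -> None:
    prompt = " ".join(sys.argv[1:]) or "Say hello in one sentence."
    reply = ollama.chat(
        model="gemma:2b",  # placeholder: any model already pulled into Ollama
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply["message"]["content"])


if __name__ == "__main__":
    main()
```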
Just in case it's useful for anyone, Gemma 2B using ollama runs quite smoothly on a Raspberry Pi 5 too. I have only used ollama's interactive shell so far.
I appreciate that the post is about integration with Gemma and not about its quality, but it is possible that the specific interaction was cherry-picked for correctness.
My cherry:
> calculate -20 + 111 -91
> The answer is 91.
> -20 + 111 -91 = 91
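For reference, the expression actually evaluates to 0, not 91; it looks like the model dropped the final -91:

```python
print(-20 + 111 - 91)  # 0
```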
Also, note that there is an official Ollama Python library:
[0] https://ollama.com/blog/python-javascript-libraries
[1] https://github.com/ollama/ollama/blob/main/docs/api.md#gener...