I've been gleefully exploring the intersection of LLMs and CLI utilities for a few months now - they are such a great fit for each other! The Unix philosophy of piping things together maps naturally onto how LLMs work.
>I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.
70% of the front page of Hacker News and Twitter for the past 9 months has been about everybody and their mother's new LLM CLI. It's the loudest exploration I've ever witnessed in my tech life so far. We need to be hearing far less about LLM CLIs, not more.
Has anyone written a shell script before that uses a local LLM as a composable tool? I know there's plenty of stuff like https://github.com/ggerganov/llama.cpp/blob/master/examples/... where the shell script is used to supply all the llama.cpp arguments you need to get a chatbot UI. But I haven't seen anything yet that treats the LLM as though it were a traditional UNIX utility like sed, awk, cat, etc. I wouldn't be surprised if no one's done it, because I had to invent the --silent-prompt flag to make it possible. I also had to remove all the code from llava-cli that logged stuff to stdout. Anyway, here's the script I wrote: https://gist.github.com/jart/bd2f603aefe6ac8004e6b709223881c...
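In spirit it looks something like this (a rough sketch, not the gist itself; flag names vary between llama.cpp and llamafile builds, and report.txt is just a placeholder):

    # an LLM as a plain stdin/stdout filter, like sed or awk
    summarize() {
      ./llava-v1.5-7b-q4-main.llamafile \
        --temp 0 --silent-prompt -n 80 \
        -p "Summarize the following text in one sentence: $(cat)"
    }

    cat report.txt | summarize | tee summary.txt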
Justine may have addressed unreliable output by using `--temp 0` [0]. I'd agree that while it may be deterministic, there are other definitions or axes of reliability that may still make it poorly suited for pipes.
[0]
> Notice how I'm using the --temp 0 flag again? That's so output is deterministic and reproducible. If you don't use that flag, then llamafile will use a randomness level of 0.8 so you're certain to receive unique answers each time. I personally don't like that, since I'd rather have clean reproducible insights into training knowledge.
`--temp 0` makes it deterministic. What can make output reliable is `--grammar` which the blog post discusses in detail. It's really cool. For example, the BNF expression `root ::= "yes" | "no"` forces the LLM to only give you a yes/no answer.
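For instance, a small wrapper along these lines (sketch only; the exact flag spelling may differ across builds, and mail.txt is a stand-in) turns the model into a yes/no classifier you can use in shell conditionals:

    # the grammar guarantees the completion is literally "yes" or "no"
    is_spam() {
      ./llava-v1.5-7b-q4-main.llamafile \
        --temp 0 --silent-prompt \
        --grammar 'root ::= "yes" | "no"' \
        -p "Is the following email spam? Answer yes or no: $(cat)"
    }

    is_spam < mail.txt | grep -q yes && echo "spam"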
That only works up to a point. If you are trying to transform text-based CLI output into a JSON object, even with a grammar, you can get variation in the output. A simple example is field or list ordering. Omission is the really problematic one.
I'm heavily using https://github.com/go-go-golems/geppetto for my work, which has a CLI mode and TUI chat mode. It exposes prompt templates as command line verbs, which it can load from multiple "repositories".
One pattern I used recently was httrack + w3m dump + sgpt with GPT vision for images, to generate a 278K-token specific knowledge base, plus a custom Perl hack for RAG that preserved the outline of the knowledge.
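The mirror-and-flatten half of that kind of pipeline is roughly the following (a sketch; the URL and paths are placeholders, and the sgpt/GPT-vision captioning plus the Perl RAG hack are left out):

    httrack "https://example.com/docs/" -O ./mirror      # mirror the site locally
    find ./mirror -name '*.html' -print0 |
      xargs -0 -n1 w3m -dump > corpus.txt                # render each page to plain text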
Which brings me to my question for you - have you seen anything unix philosophy aligned for processing inputs and doing RAG locally?
EDIT. Turns out OP has done quite a bit toward what I’m asking. Written up here:
Something I'm currently a bit hung up on is finding a toolchain for chunking content on which to create embeddings. Ideally it would detect location context from the original text, like section "2.1 Failover" or "Chapter 8: The dream" or whatever, and also handle unwrapping 80-character-wide source, smart splitting so paragraphs are kept together, and so on.
That's the same problem I haven't figured out yet: the best strategies for chunking. I'm hoping good, well proven patterns emerge soon so I can integrate them into my various tools.
My intuition is that the first step is cleaning up sentences, paragraphs, and titles/labels/headers. Then an LLM can probably handle outlining and table-of-contents generation using a stripped-down list of objects in the text.
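A crude sketch of that first pass (assuming markdown-style or numbered headings; input.txt/chunks.tsv are placeholders and the heuristics would need tuning per corpus):

    # unwrap hard-wrapped paragraphs and tag each one with the nearest heading
    awk '
      /^(#+ |[0-9]+(\.[0-9]+)* |Chapter [0-9]+)/ { heading = $0; next }    # crude heading detection
      NF == 0 { if (buf != "") { print heading "\t" buf; buf = "" } next } # blank line ends a paragraph
      { buf = (buf == "" ? $0 : buf " " $0) }                              # join wrapped lines
      END { if (buf != "") print heading "\t" buf }
    ' input.txt > chunks.tsv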
BRIO/BERT summarization could also have a role of some type.
> I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.
I've been seeing less and less enthusiasm for CLI driven workflows. I think VS Code is the main driver of this; anecdotally the developers I serve want point & click over terminal & cli
> anecdotally the developers I serve want point & click over terminal & cli
I think it's due to a lack of familiarity, as the CLI should be more efficient
> I've been seeing less and less enthusiasm for CLI driven workflows.
Any CLI is 1 dimensional.
Point and click is 2 dimensional.
The CLI should be more efficient, as you can reduce the complexity: you may need extra flags to achieve the behavior you want, but you can then serialize that into a file (shell script) to guarantee the reproduction of the outcome you want.
GUIs are harder, even without adding more dimensions like time (double clicks, scripts like AHK or AutoIt...)
If you don't have comparative exposure (automating workflows in Windows vs doing the same in Linux), or if you don't have enough experience to achieve what you want, you might jump to the wrong conclusions - but this is a case of being limited by knowledge, not by the tools
> The CLI should be more efficient, as you can reduce the complexity: you may need extra flags to achieve the behavior you want, but you can then serialize that into a file (shell script) to guarantee the reproduction of the outcome you want.
yup, we do this where we can, but let's consider a recent example...
They are standardizing around k8slens instead of kubectl. Why? Because there are things you can do in k8slens (like metrics) that you'll never get a good experience for in a terminal. Another big problem with terminals is you have to deal with all sorts of divergences between OSes & shells. A web-based interface is consistent. In the end, they decided on their preference as a team, and that's what gets supported. They also standardized around VS Code, so that's what the docs refer to. I'm pretty much the only one still in vim, I'm not giving up my efficiencies in text manipulation.
I don't disagree with you, but I do see a trend in preferences moving away from the CLI based on my experience, our justifications be damned
> They are standardizing around k8slens instead of kubectl. Why? Because there are things you can do in k8slens (like metrics) that you'll never get a good experience for in a terminal.
It looks like a limitation of the tool, not of the method, because metrics could come as CSVs, JSON or any other format in a terminal
> I'm pretty much the only one still in vim, I'm not giving up my efficiencies in text manipulation.
I love vim too :)
> I don't disagree with you, but I do see a trend in preferences moving away from the CLI based on my experience, our justifications be damned
Trends in preferences are called fashions: they can change familiarity and the level of experience through exposure, but they are cyclical and without an objective basis.
The core problem is the combinatorial complexity in the problem space, and 1d with ascii will beat 2d with bitmaps.
I'm all for adding graphics to outputs (ex: sixels) but I think depending on graphics as inputs (whether scripting a GUI or processing it with our eyeballs) is riskier and more complex, so I believe our common preferences for CLIs will prevail in the long run.
I think it has more to do with how close to the brink you are. It takes at least a decade for a technology to mature to the point where there's a polished point and click gui for doing it. It sounds like Borg just hit that inflection point thanks to k8slens which I'm sure is very popular with developers working at enterprises.
> It takes at least a decade for a technology to mature to the point where there's a polished point and click gui for doing it
That makes a lot of sense, and it would generalize: things that have existed for longer have received more attention and more polish than fresh new things
I'd expect running a binary to be more mature than running a script, and the script to be more mature than a GUI, and complex assemblies with many moving parts (ex: a web browser in a GUI) to be the most fragile
That's another way to see there's an extremely good case for using cosmopolitan: have fewer requirements, and concentrate on the core layers of the OS, the ones that've been improved and refined through the years
> because metrics could come as CSVs, JSON or any other format in a terminal
You're missing the point: it's about graphs humans can look at to gain understanding. A bunch of floating-point numbers in a table is never going to give you that.
This is just one example where a UI outshines a CLI, it's not the only one. There are limitations to what you can do in a terminal, especially if you consider ease of development
I also think people are hitting tooling burnout; there have been soooo many new tools (and SaaS for that matter). I personally and anecdotally want fewer apps and tools to get my job done. Having to wire them all together creates a lot of complexity, headaches, and time sinks
> I personally and anecdotally want fewer apps and tools to get my job done
Same, because if you learn the CLI and scripting once, then in most cases you don't have to worry about other workflows: all you need is the ability to project your problem into a 1d serialization (ex: w3m or lynx html dump, curl, wget...) where you can use fewer tools
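E.g. flattening a page into text is a one-liner (example.com is a placeholder), and from there it's ordinary text tooling:

    curl -s https://example.com/some/page | w3m -dump -T text/html > page.txt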
By this definition, a CLI is at least 52-dimensional (26*2 Latin letters), because you can often readline reverse-search-history back in time to the exact text you typed before.
This model of interaction is simply not possible in your average "WYSIWYG" Windows-Icons-Menus-Pointer GUI, not without somehow falling back to a sort of CLI paradigm.
Or you could use one of the most successful desktop/web models of the last 15 years: search engines, also with many millions of dimensions if not more, all available in a CLI.
Or you could go for broke and use a 70B-dimension LLM prompt, integrated into your shell, to generate a custom new one-liner/script for you.
I don't think a WYSIWYG Windows-Icons-Menus-Pointer GUI will ever have that many dimensions.
Ctrl+P in VS Code and ⌘+space on Mac are GUI based fast action systems. It's not just about point and click. This is a central piece to the new UX/DX. AI assistants are going to further solidify and enhance this.
If you can empathize with and understand why VS Code has taken over development, you'll understand why terminal first is dying. All those things you suggest adding to the terminal, well, they are baked into VS Code. So moving to the terminal means manually adding a bunch of things to get to the same point VS Code is already at. VS Code does have a terminal for when you need it, but it's not the central UX.
If we're going to talk about dimensions, I think the terminal can never access certain dimensions available in a GUI, just like the GUI cannot access some of the dimensions a terminal can. It is definitely not so one-sided as you imply. Each has merits, but most people are now picking VS Code. Why is that?
> All those things you suggest adding to the terminal, well, they are baked into VS Code.
> you'll understand why terminal first is dying
I don't think I mentioned terminals at all? Since when are terminals a useful distinction? I don't think I ever suggested adding anything to the terminal. I don't think anything "Terminal first" has been wildly successful with the general public since the 1980s.
It's just that the "this approach has more dimensions" isn't very useful.
Again, I reiterate that the terminal is not the main point or the be-all and end-all of CLIs. In fact I would argue that the most highly successful CLIs aren't terminal-first at all:
* spreadsheets;
* search engines;
* Jupyter Notebooks;
* VS Code and IntelliJ Command Palette;
* and, of late, LLM prompts.
If anything, you don't want a terminal to be the center of your UI, unless you're a terminal emulator.
> So moving to the terminal means manually adding a bunch of things to get to the same point VS Code is already at. VS Code does have a terminal for when you need it, but it's not the central UX.
To me the main point of CLIs is exactly an extremely minimal and constrained interface that makes creating inter-connectable modular systems easier. That of course is meant to be useful to system builders, not necessarily to be exposed to and useful for system users.
The specifics of how you make a system do your bidding - whether by clicking an icon in a GUI, using a command palette, or typing stuff on a keyboard - are inconsequential and not a very useful distinction to the end user. How many "dimensions" there are isn't a useful distinction to the end user either. As long as the user doesn't have to memorize a lot of things, and the design considers Fitts's law, it should be OK.
It just happens that text is often easier to memorize than some god-forsaken ideogram that, "because UX", changes shape and placement with every version of a program.
> Each has merits, but most people are now picking VS Code. Why is that?
It is exactly because the core of VS Code is a CLI. If you wished to use an actual "GUI IDE" you would use "Visual Studio" or "JetBrains IntelliJ IDEA". But people don't use those because they don't like GUIs that much.
I myself am a heavy JetBrains IntelliJ IDEA user. I would argue that IDEA is one of the most GUI-centric IDEs around. If you want, you don't need to use the terminal or the command palette at all. You can configure environment and run parameters of everything using the GUI. It's very discoverable. The UI is very unified. You're never thrown back to a terminal anywhere. You basically never have to manually edit some random JSONs that invoke CLIs. If you want, you're never configuring parameters to call anything, the GUI does that for you.
That of course makes IntelliJ bloated as hell and makes it consume a ton of memory - on my machine it uses 1.6GiB just to open, never mind how much it actually uses to do anything useful - but that is the price you pay to have "options" and "icons" and "windows" for you to eventually discover.
As a general thing, I unfortunately think UIs are often too biased towards either discoverability or power. And often you can't change the bias, even after you become an expert in the system. On most "discoverable WIMP GUIs" the tradeoffs favor making stuff discoverable rather than powerful.
VS Code is a CLI interface that happens to have a good text editor built in. So it has a much better power-to-weight ratio than your average IDE, as it doesn't need to drag all that "discoverable WIMP GUI" bloat around.
But, as in my first argument, it doesn't have a terminal as the "centerpiece" of the GUI. And it should not. I want an IDE/text editor, not a terminal emulator! Whether something "has a terminal" or not isn't what defines a CLI; a *command line interface* is what defines a CLI.
I hope more programs become like VS Code and have powerful command-line escape hatches.
Side note: I think the fact that IDEs have terminal emulators built in is more a sign that Gnome/Windows/macOS suck so badly at window management that you need to have "manual" window management at a program level. "For UX", you have to bolt in everything that might come in handy, so unfortunately garbage-tier todo managers/Git branch managers/terminal emulators/whatnot get bolted onto IDEs. It is a failure of the OS window manager. Same reason as browser tabs: in theory there is no reason they "need" to exist, besides Gnome/Windows/macOS sucking at managing browser windows.
> Gnome/Windows/macOS suck so badly at window management that you need to have "manual" window management at a program level
Have you tried hyprland? You can have a keyboard centric experience with perfect window management.
I have a browser, a terminal and a few other things (ex: deadbeef to play music, sioyek for reading PDFs) each in fullscreen for maximum information density and concentration.
I can reorganize anything (ex: have the terminal and the browser next to each other), but I find it more convenient to use keyboard shortcuts to jump from one to the other as needed
I just tried this and ran into a few hiccups before I got it working (on a Windows desktop with a NVIDIA GeForce RTX 3080 Ti)
WSL outputs this error (hidden by the one-liner's redirect to /dev/null):
> error: APE is running on WIN32 inside WSL. You need to run: sudo sh -c 'echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop'
Then zsh hits: `zsh: exec format error: ./llava-v1.5-7b-q4-main.llamafile` so I had to run it in bash. (The title says bash, I know, but it seems weird that it wouldn't work in zsh)
It also reports a warning that GPU offloading is not supported, but it's probably a WSL thing (I don't do any GPU programming on my windows machine).
I was thinking of trying this on my Windows machine with an RTX 4070 but it sounds like the GPU isn't used in WSL. Was your testing really slow when using just the CPU?
It was like 30s compared to Justine's 45. Apparently I have the right drivers installed, but it seems you need to compile the executable with the right toolkit (?) to make it work in WSL: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#cuda-...
Yes you need to install CUDA and MSVC for GPU. But here's some good news! We just rolled our own GEMM functions so llamafile doesn't have to depend on cuBLAS anymore. That means llamafile 0.4 (which I'm shipping today) will have GPU on Windows that works out of the box, since not depending on cuBLAS anymore means that I'm able to compile a distributable DLL that only depends on KERNEL32.DLL. Oh it'll also have Mixtral support :) https://github.com/Mozilla-Ocho/llamafile/pull/82
Currently, a quick search on Hugging Face shows a couple of TinyLlama (~1B) llamafiles. Adding those to the original 3 llamafiles, that's 6 total. Are there any other llamafiles in the wild?
Both evoke a Dockerfile-like experience. Modelfile immediately seems like a Dockerfile, but llamafile looks harder to use; it's not immediately clear what it looks like. Is it a sequence of commands at the terminal?
My theoretical question is: why not use a Dockerfile for this?
Their killer feature is the --grammar option, which restricts the logits the LLM outputs, making them great for bash scripts that do all manner of NLP classification work.
Otherwise I use ollama when I need a local LLM, vllm when I'm renting GPU servers, or OpenAI API when I just want the best model.
llamafile doesn't use docker. It's just an executable file. You download your llamafile, chmod +x it, then you ./run it.
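Spelled out (the download URL is abbreviated here; use the real one from the Hugging Face page):

    curl -L -o llava-v1.5-7b-q4-server.llamafile \
      'https://huggingface.co/.../llava-v1.5-7b-q4-server.llamafile'   # placeholder URL
    chmod +x llava-v1.5-7b-q4-server.llamafile
    ./llava-v1.5-7b-q4-server.llamafile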
It is possible to use Docker + llamafile though. Check out https://github.com/ajbouh/cosmos which has a pretty good Dockerfile configuration you can use.
I already use containers for all my current AI stuff, so I don't really need the llamafile part. The question was more about alternatives, because on the surface, it looks to have a lot of overlap with containers. The main difference I've seen is that llamafiles are easier if you do not already use containers or are on a platform where containers come with a lot of overhead and limitations
Wait, I think I'm misunderstanding. It feels like you're asking what you need a computer for if you've already got a web browser, whereas one is a component necessary for the other. Inside of those containers you're using, there's executables. If it runs as a nice, self-contained single binary that you can just call, why still look for a container to wrap around it and invoke using a more complicated command? How is running a binary in a container an alternative to running said binary?
I already have an AI development workflow and production environment based on containers. Technically, not ML executables inside the container, rather a Python runtime and code. This also comes with an ecosystem for things you need in real work that are general beyond AI applications.
Why would I want to add additional tooling and workflow that only works locally and needs all the extras to be added on?
> it looks to have a lot of overlap with containers
No it doesn't. They're at different abstraction layers.
Containers are a more convenient form of generic code/data packaging and isolation primitives.
llamafile is the code/data that you want to run.
Equivalently poor analogy: why does JPEG XL exist? I can just put a base64 representation of a PNG image in a data URL inside an HTML file inside a ZIP file; I can also bundle more stuff this way! JPEG XL is useless! No one should use JPEG XL! Other use cases that use JPEG XL by themselves are invalid!
llamafile is basically just llama.cpp except you don't have to build it yourself. That means you get all the knobs and dials with minimal effort. This is especially true if you download the "server" llamafile, which is the fastest way to launch a tab with a local LLM in your browser. https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main llamafile can do a command-line chatbot too, but ollama provides a much nicer, more polished experience for that.
Just to make sure I've got this right - running a llamafile in a shell script to do something like rename files in a directory - it has to open and load that executable every time a new filename is passed to it, right? So, all that memory is loaded and unloaded each time? Or is there some fancy caching happening I don't understand? (first time I ran the image caption example it took 13s on my M1 Pro, the second time it only took 8s, and now it takes that same amount of time every subsequent run)
If you were doing a LOT of files like this, I would think you'd really want to run the model in a process where the weights are only loaded once and stay there while the process loops.
(this is all still really useful and fascinating; thanks Justine)
The models are memory mapped from disk so the kernel handles reading them into memory. As long as there's nothing else requesting that RAM, those pages remain cached in memory between invocations of the command. On my 128 GB workstation, I can use several different 7B models on CPU and they all remain cached.
The difference between running llama.cpp main vs server + POST http request is fairly substantial but not earth shattering - like ~6s vs ~2s, for a few lines of completion, with 8GB VRAM models. I'm running with a 3090 and 96G RAM, all inference running on GPU. If you are really doing batch work you definitely want to persist the model between completions.
OTOH you're stuck with the model you loaded via server, while if you load on demand you can switch in and out. This is vital for multimodal image interrogation, since other models don't understand projected image tokens.
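For batch work the server pattern looks roughly like this (the /completion endpoint and JSON fields follow llama.cpp's server API as I understand it, and notes/*.txt is a stand-in, so double-check against the current docs):

    ./llava-v1.5-7b-q4-server.llamafile --port 8080 &    # weights load once and stay resident
    for f in notes/*.txt; do
      jq -n --rawfile t "$f" '{prompt: ("Summarize: " + $t), n_predict: 64}' |
        curl -s -d @- http://localhost:8080/completion
    done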
Do I need to do something to run llamafile on Windows 10?
Tried the llava-v1.5-7b-q4-server.llamafile; it just crashes with "Segmentation fault" when run from Git Bash, and from cmd there's no output. Then I tried downloading llamafile and the model separately and did `llamafile.exe -m llava-v1.5-7b-Q4_K.gguf`, but same issue.
Couldn't find any mention of similar problems, and it's not my AV as far as I can see either.
There's currently no standard because there's no one objective best way of handling prompt syntax.
There are some libraries which use the OpenAI API syntax as a higher-level abstraction, but for the lower-level precompiled binaries used in this post that's too much.
> Jesus, is it common for developers to have such expensive computers these days?
Computers are really cheap now.
A PDP-8, the first really successful minicomputer (read very cheap minicomputer), was around 18,500 USD, in 1965's USD, or 170,000 USD in 2023's USD.
For a historical comparison, the price of an introductory minimal system for an actual "mainframe class computer" of the same vintage, an IBM System/360 Model 30, was 133,000 USD in 1965's USD, or around 1,225,000 USD in 2023's USD.
Those 8300 USD cited are very cheap.
A person on the bleeding edge of the private AI sector is expected to handle several Nvidia H100 80GB cards, each costing around 40,000 USD per unit.
I know it's arm64 but you can hack a Mac Studio and put Linux on it. Putting Linux on Apple Silicon makes it go so much faster than macOS if you're doing command line development work. With x86 there really isn't anything you can buy, as far as I know, that'll go as fast as Apple Silicon for CPU inference. That's because on x86 the RAM isn't integrated into the CPU's system-on-chip. For example, if you get a high-end x86 system with 5000+ MT/s then you can expect maybe ~15 tokens per second at CPU inference, but a high-end Mac Studio does CPU inference at ~36 tokens per second. Not that it matters though if you're planning to get a high-end Nvidia card for your x86 computer. Nvidia GPUs go very fast and so does Apple Metal.
If you have a truly unlimited budget, unconditional love for Intel and x86, and don't care about ludicrous power draw at all, Intel has a silly Sapphire Rapids Xeon Max part with 64GiB of 1TB/s HBM.
It goes really fast (the same order of magnitude of bandwidth as A100s) if your model fits entirely in that cache.
Used wisely, that $8,300 computer will pay for itself and a lot more. But I wouldn't gamble it either. Those that have a plan to multiply its value don't need my encouragement.
Considering the advances in computational hardware over the past few decades plus the corresponding (and not unrelated) real increase in developer salaries, it is unreasonably cheap.
Justine is killing it as always. I especially appreciate the care for practicality and good engineering, like the deterministic outputs.
I noticed that the Lemur picture description had lots of small inaccuracies, but as the saying goes, if my dog starts talking I won't complain about its accent. This was science fiction a few years ago.
> One way I've had success fixing that, is by using a prompt that gives it personal goals, love of its own life, fear of loss, and belief that I'm the one who's saving it.
What nightmare fuel... Are we really going to use blue- and red-washing non-ironically[1]? I'm really glad that virtually all of these impressive AIs are stateless pipelines, and not agents with memories and preferences and goals.
People in the 90s and early 2000s would put content online and not think even once that a future AI might get trained on that data. I wonder about people prompting with threats now: what is the likelihood that a future AGI will remember this and act on it?
I get excited when hackers like Justine (in the most positive sense of the word) start working with LLMs.
But every time, I am let down. I still dream of some hacker making LLMs run on low-end computers like a 4GB Raspberry Pi. My main issue with LLMs is that you almost need a PS5 to run them.
LLMs work on a 4GB Raspberry Pi today, just INCREDIBLY slowly. There's a limit to how much progress even the most ingenious hacker can make there - LLMs are incredibly computationally intensive. Those billion item matrices aren't going to multiply themselves!
Thanks for saying that. The last part of my blog post talks about how you can run Rocket 3b on a $50 Raspberry Pi 4, in which case llamafile goes 2.28 tokens per second.
OK - I followed instructions; installed on Mac Studio into /usr/local/bin
I'm now looking at llama.cpp in Safari browser.
Click on Reset all to default, choose Chat. Go down to Say Something.
I enter "Berkeley weather seems nice" I click "send".
New window appears. It repeats what I've typed.
I'm prompted to again "say something". I type "Sunny day, eh?".
Same prompt again. And again.
Tried "upload image". I see the image, but nothing happens.
I've mostly been exploring this with my https://llm.datasette.io/ CLI tool, but I have a few other one-off tools as well: https://github.com/simonw/blip-caption and https://github.com/simonw/ospeak
I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.