Llama.cpp 30B runs with only 6GB of RAM now (github.com/ggerganov)
1311 points by msoad on March 31, 2023 | 414 comments



Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.


Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...


>how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions

Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.

Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
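
A minimal sketch of what "zero copy" means there (assuming a recent PyTorch; the tensor contents are made up for illustration):

    import torch

    t = torch.arange(12, dtype=torch.float32)
    v = t.view(3, 4)   # view(): same storage, different shape
    row = v[1]         # basic slicing also returns a view

    # Nothing was copied: all three tensors alias one buffer.
    assert v.data_ptr() == t.data_ptr()
    assert row.data_ptr() == t.data_ptr() + 4 * t.element_size()

Passing any of these between framework ops just hands the same pointer around.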


It’s not that performance is the issue; it’s that the code is unmaintainable and prone to breaking. Exceptions aren’t handled right, and dependencies are a disaster (proprietary NVIDIA drivers + CUDA + PyTorch + the various versions of everything).

This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.


Yeah, I've been using Python for the first time in a while to try out some of the LLM stuff and I can't believe how bad the dependency hell is. It's probably particularly bad due to the pace of change in this field. But I spend an hour getting dependencies fixed every time I touch anything. 80% of the Google Colabs I find are just outright broken. I wish there were other viable non-Python options to try out these things.


You're using virtual environments, right?

ML libraries are particularly bad, most other stuff works well.

Friends don't let friends install pip into /usr/lib.


This just goes to show what a mess this is.

Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?


This is not exactly a new problem.


That's kind of the point. We solved this problem decades ago. You have a system package manager that installs a system-wide copy of the package that everybody can use.

But now we encounter this broken nonsense because solved problems get unsolved by bad software.


IME the ML world with Python is a whole mess on top of the existing dependency issues.

I've been very _careful_ too (using pyenv/virtualenvs etc.) with dependency management, but between Nvidia driver dependencies and "missing sqlite3/bz2" issues related to the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to be able to even run a 'hello world' ML sample after an afternoon of fighting with it.

My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).

No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.

I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.


No idea what a Google Colab is, but does the code come with an environment, or at least a specification of which packages and versions to use (requirements.txt)?

It sounds unnecessarily weird to me that people would share Python code that simply doesn't work at all out of the box.


It's rarely as easy as sharing a requirements.txt. There are lots of things that can still break - for example, you get weird situations where different modules require different versions of a third module. Or all the CUDA toolkit version issues that seem to come up with GPU stuff. When we share Python, we tend to share a Docker image, and even this isn't foolproof. A big problem, I think, is that it doesn't incentivize building something portable. And it's very hard to test across different machines. Add to that all the different practices re virtual environments (venv, conda, etc.): everyone tries to install the dependencies differently or is starting from some nonstandard state. It's a mess.


Maybe Nix would be a better experience for creating such an environment, where you also depend on system utilities.


Everyone is using llama.cpp because we reject the idea of giving up on system libraries like Nix does. That kind of tomfoolery (at least in the desktop context) is only required when you use software projects that use libraries/languages which break forward compatibility every 3 years.

If you just write straight C++ (without C++xx, or anything like it) you can compile the code on machines from decades ago if you want.


What's C++xx?


C++11, and greater.


Huh, I was proficient in Rust before "properly" learning C++, so maybe that accounts for it, but I didn't realize C++11 was controversial. Is it just move semantics, or are there some library things that are hard to implement?


I think what OP is saying is that decades-old systems wouldn't have C++11-compatible compilers on them.


And maybe that "C++" is now basically a bunch of different incompatible languages instead of just 1 language, depending on what "xx" is (11, 14, 17, 20, 23, etc).

It's like Python 2 vs Python 3 except even worse.


In my experience, C++03 code works just fine without changes on C++11 and C++14 compilers, so no, it's not at all like Python 2/3. The few features that were ripped out were exactly the stuff that pretty much no one was using, for good reasons (e.g. throw specifications).


> No idea what a Google Colab is

It's ~equivalent to a Jupyter notebook.


The stack is very volatile and unmaintainable because it doesn't need to be maintainable. Exactly why we have unmaintainable software in other domains. During the last 10 years there has ALWAYS been some totally new model architecture with new operations (or, in the case of CV, new bizarre uses of Conv). By the time you get your performant, perfectly maintainable masterpiece ready, it's not needed anymore. The stack optimizes for flexibility and iteration speed naturally, just like why people use Rails.

In fact, I'd love to see Transformers really dominate. We can then start to converge on software. And compute-wise, Transformers are really simple, too!


Still a poor excuse. Had they written this in Java, things wouldn't be so difficult, in both performance and maintainability.

Never understood why people think that indentation-based languages are any simpler, when in fact they bring all kinds of trouble for getting things done.


There's deeplearning4j (from Theano days!), go figure why it didn't take off.


There's a Java ML library called Tribuo that might be worth looking at.


Thanks, the boring aspect of Java is appealing here.


> The stack optimizes for flexibility and iteration speed naturally

“Unmaintainable” (as in “I’m spending an hour each day sorting out which dep update broke my project”) usually gets in the way of the former point.


Does this mean it would be easy to move off Python altogether? It seems like the problem stems from everyone using PyTorch at the base layer. How realistic is it to recreate those APIs in another, more modern language? Coding in Rust, Go... then distributing a single binary vs. pip hell seems like it would be worth it.


Check https://pytorch.org/tutorials/advanced/cpp_frontend.html

You can easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*) if you code your model and training loop in C++.

It then happily runs everywhere, as long as an NVIDIA GPU driver is available (you don't need to install CUDA).

Protip: your AI research team REALLY DOESN'T WANT TO DO THIS BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.

(If you want Rust / Go and don't want to wrap libtorch/tf, then you have a lot of work to do, but yeah, it's possible. Also, there are the model compiler folks [1], where the promise is model.py in, model.o out; you just link it with your code.)

[1] https://mlc.ai


Go would be interesting because you could send a single executable.

I’d love for JS/TS to dominate as well. Use ‘bun bun’ to produce an executable if need be, but also use it in web backends.


I was in a PLT group in grad school going into robotics. I could spend all day ranting about how Python is just completely unsuitable for professional software development. Even something like F# would be an enormous improvement.


What a bad take!

Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.

Those who claim some language would be a magical fix clearly lack experience in multiple languages.


It's true nothing forces or forbids this, but some languages/toolings/communities/ecosystems encourage it more than others.


This doesn't even seem that clever, just regular ol' use of mmap where there was none before. Wonder what other performance is being left on the floor. Unfortunately, I'm convinced entire power plants could be retired if the world stopped using Python.


>> Unfortunately, I'm convinced entire power plants could be retired if the world stopped using Python.

On the other hand, many businesses and professionals wouldn't exist :)


I can't find a single good argument for Python based on merit that isn't at least 15 years dated or that doesn't stem from "But Google is using it".

It doesn't have the easiest syntax or the best compiler support, and performance and threading are a joke. The entire language is based on hype from back when the only two mainstream languages were C++ and Java.


It’s not like there’s a gun to anyone’s head forcing them to use Python. The ecosystem (library, framework, IDEs) is what draws people to use it.

If there were a superior alternative that covered the breadth of the Python ecosystem, I’m pretty sure no one would have any scruples about using it. A programming language and its syntax are the least interesting or complex part when it comes to solving problems. Just rattling off some amazing libraries I've used over the last few years:

https://scikit-image.org - Image processing

https://imgaug.readthedocs.io - Image augmentation

https://scikit-learn.org/stable - ML

https://pymoo.org - Multi objective optimization

https://simpy.readthedocs.io/ - Discrete event simulation

https://lifelines.readthedocs.io - Survival analysis

https://bambinos.github.io/bambi - Bayesian modeling

https://unit8co.github.io/darts/ - Time series forecasting

https://abydos.readthedocs.io/en/latest/abydos.distance.html - Basically any string distance metric you can think of

The list just goes on and on.. oh yeah, some Deep Learning libraries too, which some people find useful.


>It’s not like there’s a gun to anyone’s head forcing them to use Python. The ecosystem (library, framework, IDEs) is what draws people to use it

Sure, but that is the gun, especially (as reflected in your examples) for machine learning. The best frameworks (PyTorch, TensorFlow, JAX) are all Python, with support for other languages being an afterthought at best.

The use of scripting languages (Python, Lua - original Torch) for ML seems to have started partly because the original users were non-developers, more from a math/stats background, and partly because an interactive REPL loop is good for a field like this that is very experimental/empirical.

Does it make sense that we're now building AGI using a scripting language? Not really, but that's where we are!


Python is the 2nd best language for everything.

It doesn’t excel at anything, but anything software can do can be done in Python somehow.

So, a great pick when you’ve got no idea where you’re going, when you’re prototyping, or when you don’t care about performance or perfection.

I agree that for large-scale systems, when you already know what you’re doing, Python shows its limits quite soon (and we should add the problems with missing/slow type checking that slow down large-scale systems development).


To steal from another thread, Python is the McDonald's of languages - it's ubiquitous, it doesn't take much effort, and it's really not very good.

The trope about it being the 2nd best language for everything isn't correct. It's taught in universities because it has a very short time to gratification, and the basic syntax is quite intuitive. Academics latched onto it for ML because of some excellent libraries, and it became established as a vital part of the ecosystem from there.

But it's a nightmare to support a moderate-to-large codebase in production, packaging continues to be a mess, and it's full of weird quirks. Great for weekend projects, but for Pete's sake, take a minute and port them into something more reliable before going to production with them.


I think you’re focusing too much on the letter, rather than the idea.


> Python is the 2nd best language for everything.

Huh? Why?

You can barely deploy it to the Web.

It doesn't scale performance-wise.

You can't build robust abstractions.

The REPL is merely OK.

You can barely ship working code without containers.

The syntax is hard to manipulate programmatically.

Python has inertia, but it's holding us back.


Well, for starters, web deployment isn't "everything". Python is the de facto go-to language for research and general prototyping, where not everyone is a programming wiz keeping track of the latest trendy new compiled language. Not everyone can even compile stuff... :)

Having said that, I've deployed two large Django projects on the web with tons of customers, and they run and scale just fine, and they're a DREAM to maintain and develop compared to, for example, Java. I would go so far as to say the opposite: if you haven't used Python for web deployment, you've been missing out! (You lose some efficiency, I'm sure, but you gain other things.)


I was talking about running in the Web browser. It's not everything, but it's an important part of everything in my book.


https://github.com/pyodide/pyodide is pretty amazing for running Python client side in the browser.

You could run notebooks entirely client side https://jupyterlite.readthedocs.io/en/latest/

The startup is slow but otherwise it is pretty functional.


You have good points but "the syntax is hard to manipulate programmatically"??

Maybe you haven't noticed, but Lisp is now a tiny niche, and most new languages aren't homoiconic either...


I don't think that proves anything. If we had "JavaLisp" in the browser instead of JavaScript then Lisp would be very popular. Besides that, Python is harder to manipulate than many non-Lisps, such as JavaScript and Go.


Python became popular without being the 'web language', the Lisps didn't.


Curly brace languages are more popular again.


Here, we can set Lisp aside and take grandparent comment's definition of syntax to be concrete, character-level syntax.

Python concrete syntax is harder to manipulate programmatically compared to Javascript concrete syntax.

For instance, to insert one statement into another, we need to traverse the lines of that syntax and add the right amount of indentation. We can't just plant the syntax into the desired spot and be done with it.
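
To make that concrete, here's a toy sketch (a hypothetical helper, in Python) of what "add the right amount of indentation" means when splicing a statement into an existing block:

    import textwrap

    def splice_statement(block: str, stmt: str) -> str:
        # Infer the block's indentation from its first line, then indent
        # the new statement to match before appending it.
        first = block.splitlines()[0]
        indent = first[: len(first) - len(first.lstrip())]
        return block + "\n" + textwrap.indent(stmt, indent)

    body = "    x = 1\n    y = 2"
    print(splice_statement(body, "z = x + y"))

In a curly-brace language you could drop the statement in as-is and let a formatter fix it up afterwards.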


What is another language working that well in a larger number of areas?


Clojure

JavaScript

Typescript

OCaml

Haskell

F#


Any JVM or .NET language will take more effort to interface with native libraries; it’s not the same.

OCaml is very niche; I feel it’s a hard sell for a general-purpose language. Haskell, 3x that.

JS and TS, could be. But are they so much better than Python, if better at all?


Native library interfacing isn't really Python's strong suit, interpreter plugins are quite painful to write.

.NET has P/Invoke which is much nicer.

JVM is getting Panama+jextract, which is the nicest yet. You can go straight from header files to pure Java bindings which don't need any extra native code at all. But it's not shipped yet :(


What is an “interpreter plugin?” Writing a Python C extension is not that painful, it’s quite well supported. And you’ve got cffi and ctypes as well.


Python has had cffi since figuratively forever, so I’m not sure why you compare native modules to P/Invoke?


Most important libraries with native components are using C extensions, not cffi.


> OCaml is very niche; I feel it’s a hard sell for a general-purpose language. Haskell, 3x that.

The impression about Haskell’s nicheness compared with OCaml prevails. But Haskell has a larger userbase and a larger library ecosystem than OCaml.


A few years have passed since I last tried out both languages. OCaml was sort of approachable, while Haskell required quite a different mindset IMHO, hence the “nicheness” from the general-usage standpoint.


Outside of typescript, this feels like a response from a decade ago, when Python was still mired in the 2 vs 3 problem.

What's happened to the popularity of all of these languages since 2010? Outside of JS/TS, absolutely nothing. If anything, they've lost mindshare.


I've been using Haskell professionally for 8 years and its ecosystem is laughable compared to Python.


Python, the language with a global interpreter lock, is not the 2nd best language for everything, especially in the age of multicore processors.


Python is the practical language when doing your CPU-intensive tasks outside of it is a feature, since the GIL isn’t a problem for I/O parallelism.

You’d do better complaining about its still-nascent (compared to alternatives) async support, or the lack of a JIT in the official implementation.


It's not the easiest syntax?

It's the easiest among the most popular languages. It uses the fewest symbols, with parentheses and braces only for values.

Some people don't like the significant whitespace, but that helps readability.


> Some people don't like the significant whitespace, but that helps readability.

Compared to what? Unindented or badly indented code in other languages?

In other languages you can move code around and it still works - and nobody prevents you from adding whitespace for readability (it may even be done automatically for you).


> It uses the fewest symbols, with parentheses and braces only for values.

Is there any evidence that this makes it easier?

People learn Python as beginners because it has a reputation for being easy for beginners.

I don't see anything about the syntax that makes it inherently easier.


And what symbols it has, it reuses them wisely.

The square brackets alone make it a winner. Array, list, and string indexing. Dictionary lookups. Slices and substrings. List comprehensions. The notational convenience of this alone is immense.

Built-in lists, strings, and dicts. For the 90% of code that is not performance-critical, this is a godsend. Just looking at the C++ syntax for this makes me never want to use an STL data structure for anything trivial.
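
A quick illustration of how much mileage that one bit of notation gets (plain Python, nothing else needed):

    words = ["alpha", "beta", "gamma", "delta"]

    counts = {w: len(w) for w in words}            # dict comprehension
    first_two = words[:2]                          # list slice
    suffix = words[0][-2:]                         # string slice -> "ha"
    beta_len = counts["beta"]                      # dict lookup
    long_words = [w for w in words if len(w) > 4]  # list comprehension

The same brackets cover indexing, slicing, lookups, and comprehensions.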


What languages are you comparing it against?

Python is more readable than C. Way better than C++. Far simpler to reason about than Java. Maybe TypeScript is on a similar level, but throwing a beginner into the JS ecosystem can be daunting. Perhaps Ruby could be argued to be equally simple, but it feels like that's a dead-end language these days. Golang is great, but probably not as easy to get rolling with as Python.

What else? Are you going to recommend some niche language no one hires for?


> Far simpler to reason about than Java.

Strong disagreement. Explicit types make reasoning about Java much easier, especially when you are in an unfamiliar codebase.

Python is not quite the 'write-only' language of Perl, but it is a lot easier to write it than it is to read it.


Python is getting TypeScript-like typing support. Slowly, yes, but it's way better than Java’s type system.


Anecdata, but I learned Python many years ago precisely because I found the syntax was clear.

I liked the one way of doing most things philosophy, coming off working on a large C++ code base.


Being cleaner than C++ doesn't prove much :)


Very true!

At the time I evaluated other languages to learn, narrowed it down to Ruby and Python, and picked Python as I felt it had a nicer syntax than Ruby. And the "one way to do things" philosophy. This was back in 2005 or so.

What other languages of that period would you say had a nicer syntax than Python?


Python has the ecosystem. That’s it. The lingua franca of data science. At this point it doesn’t even matter anymore why.

Just like you’re not gonna usurp JavaScript on the web.


There were plenty of other languages competing with Python for the same niche, such as Perl, Ruby, JS, PHP, etc. Python is superior to all of those on syntax alone; it is easier and cleaner to both read and write.


That might be true, but it seems to generally fall under the category of 'relevant 15+ years ago', doesn't it?


How do you qualify relevancy? Your own personal bubble and bias? Adoption and usage?

Pull requests and stars on github? That might be a start.

https://madnight.github.io/githut/#/pull_requests/2022/4 https://madnight.github.io/githut/#/stars/2022/4

Though you may say, "But, but, all the private repos!" Then I challenge you to back up what you mean by relevance and prove that Python belongs in the "relevant 15+ years ago" category.


I'm arguing the point that it clearly did have the easiest syntax compared to the competition back then, and that it wasn't picked just because Google was using it.

Even if it doesn't have the best syntax now (which I doubt), the tooling and libraries make it a better choice than any language that has an edge over Python's syntax.


> I'm arguing the point that it clearly did have the easiest syntax compared to the competition back then, and that it wasn't picked just because Google was using it.

Maybe, not sure? My point was that both the syntax and Google using it were more relevant 15 years ago than now.

(I don't have much of an opinion on the 15+ years ago thing.)


I don't see any reason for it to be less true now.

Is Python's syntax worse than that of brand-new languages like Rust or Go? Absolutely not. It's still better.

Did Google stop using it? I don't think so, but I also don't think people picked it just because Google did.


Python's syntax is ok.

Btw, I wish they would take some inspiration from Haskell's syntax.

Haskell also has significant whitespace, but it's defined as syntactic sugar for a more traditional syntax with curly braces and semicolons.

Approximately no-one uses that curly-brace syntax, but it's good for two things:

- silences the naysayers

- more importantly: allows you to copy-paste code even into forms that mess up your indentation.


In a few years, none of this is going to matter anyway since it is likely we will be able to automatically translate everything cheaply.


> Python is superior to all of those on syntax alone; it is easier and cleaner to both read and write.

Do you have any argument to support this, aside from personal bias?


I can make some arguments but it all boils down to personal bias and anecdotes.

The forced use of spacing to delineate blocks means you will never see a bunch of brackets eating up screen space, or the common error where someone adds another line to an if statement but doesn't add braces.

Semicolons not being conventional means less screen noise and less code golf 1 liners.

The focus on imperative vs. functional means you rarely ever see something like a(b(c(d(e(f(g)))))).

PHP suffers greatly from poorly named standard functions on top of all of that.

Don't get me started on Ruby metaprogramming.

These are just the things I could think of off the top of my head; I do not want to spend my afternoon on this. This is just my experience looking at code for over 20 years; you either believe it or you don't. There are no scientific studies to prove that one syntax feature is superior.

I highly doubt that everyone chose Python just because Google did. Python was a giant step up in syntax compared to the competition back then, and even if there is a new language out there right now that has better syntax, it's not going to be better by much, and it's not going to have the tooling, libraries, or community.


Having not been around when Python gained in popularity, and having mostly been using Node.js and Swift, this is actually quite interesting.

Thanks!


Couldn't you level the same argument against eg C++?


Before an NDA sent him to Rura Penthe, I used to have an internet friend who was pedantic about seemingly useless compilers and interpreters. Quests like: use obscure language A to translate obscure language B to obscure language C. Then use B compiled to C to interpret D.

Long story short, in the future the AI can just convert all our code to FORTH or HolyC or some "creative" combination of languages chosen by prophecy (read: hallucination), perhaps even Python, as a show of strength.


> Perhaps SWE is dead after all, but LLMs didn't kill it...

Cheap electronics did. 32GB of RAM is maybe $150, a developer converting & maintaining your system to use mmap is $150k/year.


This still doesn't make sense. It doesn't take a full year to do optimizations like this. Maybe a month at most, if you include the investigation time. And the memory saving is worth $150 times the number of users, which is in the thousands at least.


Tragedy of the commons. If you want to do something that benefits everyone a little bit, and you can't productize it like OpenAI's $20/month subscription, then there's no rational economic reason to do it, and you have to wait for someone like me who has an irrational love of coding. It's not a lifestyle that makes you rich, but it does help you see the opportunities to fix problems that the well-resourced folks who are supposed to be solving them would never even notice; in fact, they'd probably think you're trolling them if you ever brought it up.


>Tragedy of the commons.

Tragedy of folks forgetting how to program.

This mmap() "trick" isn't a trick, its a standard practice for anyone who has cut their teeth on POSIX or embedded. See also mlock()/munlock() ..


Well that's exactly the thing. They haven't. We're talking about a group of people here who live inside scientific papers and jupyter notebooks. They're able to make machines literally think, but you'd be pushing them out of their comfort zone if you stuck them in front of something like Emacs with C. Some people like GG, Jeff Dean, etc. are strong in both skill sets, but they're outliers.


Tragedy of the commons only works for things you don't directly pay for.


Exactly, the software supplier isn't paying for RAM.


Well, in a way, open source software is something that you don’t directly pay for.


More saliently, the overwhelming majority of the Linux kernel's direct and extended userbase has contributed nothing at all directly to the Linux kernel, as just one example.


So let's toss management and go write good code for the principle of it, and not business bullshit calculus


Good, tell me how your company will be doing.

What people sometimes fail to understand is that code is a means to an end, not an end in itself.

If you want to write code for its own sake, work on an open-source and/or personal project. If you are paid to work on something, you're paid to get that something out, not for it to feature the best code ever.


With the margins that tech makes, many companies could certainly afford to care more about code quality. But they don't; instead the money gets stuffed into cash reserves where it sits idle, doing nothing but enriching shareholders.

Or hiring useless business people to be installed around the periphery of engineering. Which is funny, because now tech is letting all those folks go.


Sigh.

It's not like the zero-copy buzzword is going to help you during training: all your weights have to stay on the GPU, you are going to sample your training data randomly, and your data is on networked storage anyway, so mmap HURTS. You'd better just use O_DIRECT.

Similarly, as long as you run your inference on a GPU, it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices; in the rare cases where we needed to use CPU only (hey, your phone has had a GPU since forever), at $PREVIOUS_JOB we did have an mmap-able model format. It also helps in TEE/SGX/whatever enclave tech. Oh, and there's no Python at all.

The recent development of ggml is interesting, as it catches a moment that "big ML shop infra" folks don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them instead of those 1000000000 bizarre accelerators in production, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They are rather unique in being CPU + high-bandwidth RAM + accelerators with RAM totally shared with the CPU, instead of some GPU shit.

tl;dr: it's not like the "big ML shop infra" folks are stupid and leave performance on the table. They just don't run their production workloads on MacBooks. That's where the community shines, right?


On a Mac, mmap definitely works for the GPU since it’s all the same unified memory.


In llama.cpp, inference runs on the CPU, using AVX2 optimizations. You don't need a GPU at all.

It runs on my 2015 ThinkPad!


This! Thank you.


Money did.

Why waste developer hours (meaning effort) if you can just scale the infra for a little cash? Do it in small enough increments and the increases only outweigh FTEs if you consider all scaling events and look at a long enough time scale.

Suddenly it takes way too much for way too little, but it cost half as many overpaid developers who can’t be arsed to care about performance.

Edit: in case that sounds like the opposite of intended, ggerganov and jart are the outliers, the exception.


> such clever use of mmap

Just wanna say that mmap() is cleverly used in this context, but it should be acknowledged as a widely accepted, industry-standard practice for getting higher performance, particularly in embedded applications but also in performance-oriented apps such as digital audio workstations, video editing systems, and so on.


Just because mmap() is commonly used doesn't mean it's commonly understood. Yes, it powers just about everything important in terms of the skeletons of our local systems. So why has the thought of using it occurred to so few people until now? Almost a whole generation has passed since things like mmap() were relegated to "the work's been done!" category of computing. People moved on to caring about things like My Browser and The Cloud where mmap() doesn't exist. Most people don't know about it. The ones who do, are reluctant to use it. Scientific computing projects are totally devoted to supporting MSVC (since you just know data scientists are secretly using those GPUs for gaming) so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32 before any chance to fully consider the true depth of its value would kick in. Plus data migrations are very difficult to pull off. It worked here due to the outpouring of community support, since people were blocked on this. But for a corporation with tons of cash to burn, it's a harder sell.


The Cloud has been with us since the birth of computing. What is happening is, the computing industry goes through waves of attrition, whereby the schools push everyone up the Brand New Stack, while industry, frustrated with generations of programmers who can't program, just Builds Another Stack.

Repeat, ad infinitum. In the cracks you'll find people re-learning things they should've known, if only they weren't slagging off the grey beards... or, even worse... as grey beards not paying attention to the discoveries of youth.

>Most people don't know about it. The ones who do, are reluctant to use it.

Not so sure about this. The reluctance is emotional, it's not technical. Nobody is killing POSIX under all of this - it is deployed. Therefore, learn it.

>so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32

Does not compute. Own up, you're an AI.


You'd be surprised how many professional programmers these days work exclusively in high-level languages and know nothing about using operating system features to their fullest.

But to your point, until technology itself actually replaces us, deeply skilled computer people are always going to be able to squeeze more performance out of software implemented in high-level languages by those who have not studied computers extensively.


You can mmap from Python.


The CPython mmap module docs: https://docs.python.org/3/library/mmap.html

zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....

"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :

> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.

iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor

Arrow Python (C++) > On-disk and MemoryMappedFile: https://arrow.apache.org/docs/python/memory.html#on-disk-and...

"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...

pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...

ONNX is built on protocolbuffers/protobufs (google/protobufs), while Arrow is built on google/flatbuffers.

FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :

> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).


In fact, you can mmap from PyTorch directly.


and numpy


It is easy to use mmap from Python. You can do zero copy too e.g., https://stackoverflow.com/questions/17244488/reading-struct-... (see mmap+frombuffer example)

Though in practice, in many cases, mmap won't be faster; it can even be slower than open+read.
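
For reference, the zero-copy pattern mentioned there looks roughly like this (a sketch; "weights.f32" is a hypothetical file of raw float32 values):

    import mmap
    import numpy as np

    with open("weights.f32", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # frombuffer wraps the mapping directly: no copy is made, and pages are
    # only faulted in from disk when the array is actually touched.
    weights = np.frombuffer(mm, dtype=np.float32)
    print(weights.size, weights[:4])

Whether it beats plain open+read depends on the access pattern, as noted.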


For the life of me I could never fix torch.load. They'll say to just quantize (convert) a model to 4/8-bit to make it smaller, but you'll get crashes when you run out of system memory, plus there are no docs... then you admit defeat by using more swapfile :S


Thank you for saying it out loud, I thought I was going crazy!


> But we don't have a compelling enough theory yet to explain the RAM usage miracle.

My guess would be that the model is faulted into memory lazily page by page (4K or 16K chunks) as the model is used, so only the actual parts that are needed are loaded.

The kernel also removes old pages from the page cache to make room for new ones, especially if the computer is using a lot of its RAM. As with all performance things, this approach trades off inference speed for memory usage, but it's likely faster overall because you don't have to read the entire thing from disk at the start. Each input will take a different path through the model, and will require loading more of it.

The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.

This approach _does_ still work to run a dense model with limited memory, but the time/memory savings would just be less. The GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to contain the entire model at once.
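
A rough way to see that behaviour for yourself (a Linux/macOS sketch; "model.bin" stands in for a hypothetical large model file) is to map the file, touch only a slice of it, and watch the page fault counter:

    import mmap
    import os
    import resource

    with open("model.bin", "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Read-only mapping: pages land in the kernel page cache and are
        # faulted in on first access, not at mmap() time.
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    _ = mm[: 16 * 1024 * 1024]  # touch only the first ~16 MiB
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

    print(f"mapped {size} bytes; faults from touching the slice: {after - before}")

The untouched remainder of the file never has to come off disk unless something asks for it.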


I don't think it's actually trading away inference speed. You can pass an --mlock flag, which calls mlock() on the entire 20GB model (you need root to do it), then htop still reports only like 4GB of RAM is in use. My change helps inference go faster. For instance, I've been getting inference speeds of 30ms per token after my recent change on the 7B model, and I normally get 200ms per eval on the 30B model.


Disk accesses should not lie. If only 6GiB are read from the disk, then I believe either the model is indeed sparse in its computation, or there may be a bug somewhere.


You couldn't have said it clearer.


> htop still reports only like 4GB of RAM is in use

I think that's just an accounting thing. Many UNIX variants will not "charge" read only memory mapped pages to a process, because they could be shared among many processes and evicted at will.


Hmm, can you try running iotop to see how much is being read from disk? Is it 20GB or 6GB? Maybe the prefetch is able to fill in before page faults are happening? Or maybe you are hitting the disk cache?


> You can pass an --mlock flag, which calls mlock() on the entire 20GB model (you need root to do it), then htop still reports only like 4GB of RAM is in use.

How is that possible? Is the model being compressed even more (even after converting to 4 bit) somehow? Or is most of the model unused?


mmap-ed memory pages backed by a file that aren't dirty aren't counted in a process's RSS usage, only in the kernel page cache. The mmap-ed regions of virtual memory do get counted in VSZ (virtual size), but that is just virtual and can be larger than RAM+swap.


This is incredible, great work. Have you tried it with the 65B model? Previously I didn't have a machine that could run it. I'd love to know the numbers on that one.


What does it look like if used context size increases?


Very cool! Are you testing after a reboot / with an empty page cache?


Pretty much. I do my work on a headless workstation that I SSH into, so it's not like competing with Chrome tabs or anything like that. But I do it mostly because that's what I've always done. The point of my change is you won't have to be like me anymore. Many of the devs who contacted after using my change have been saying stuff like, "yes! I can actually run LLaMA without having to close all my apps!" and they're so happy.


Linux has a command to drop caches at runtime (https://www.tecmint.com/clear-ram-memory-cache-buffer-and-sw...) which is VERY useful during debugging.


Only recent versions of Metal (macOS 13 / iOS 16) support mmap and using that memory on the GPU directly. CUDA does have a unified memory mode even with a dedicated GPU; it would be interesting to try that out. It would probably slow things down quite a bit, but it's still interesting to have that possibility.


Based on that discussion, it definitely sounds like some sort of bug is hiding. Perhaps run some evaluations to compare perplexity to the standard implementation?

Edit: looks like there's now confirmation that running it on a 10GB VM slows inference down massively, so looks like the only thing strange is the memory usage reading on some systems.


Took a look at it - did you try MAP_HUGETLB? This looks like the kind of application that can gain very large runtime advantages from avoiding TLB pressure. It might take a bit longer (or fail entirely) on machines where you can't get enough huge pages, but attempting it (or probing for free pages via /proc/meminfo) and then falling back to mapping without it might make the mmap() slightly slower, while the advantage of an order of magnitude fewer TLB misses (assuming you can get 1G pages) might be worth it.


Is the title misleading here ?

30B quantized requires 19.5 GB, not 6 GB; otherwise there's severe swapping to disk.

  model   original size   quantized size (4-bit)
  7B      13 GB           3.9 GB
  13B     24 GB           7.8 GB
  30B     60 GB           19.5 GB
  65B     120 GB          38.5 GB


Now it's clear that there was a bug in the measurement. The author used a machine with lots of RAM, so I guess most of us are still stuck with quantized 13B. Still, the improvement hopefully translates, and I hope that 30B will run with 3 bit quantization in a few days.


Also, current SSDs achieve 7.5 GB/s+ read speeds, as opposed to older SSDs from 2013 at 500 MB/s, so performance will differ drastically depending on your system specs when pulling weights from disk to RAM on demand. Also, there is $ vmmap <pid>, which shows various statistics about process memory and used swap that are not available in top or htop.


Even with 7.5 GB/s you are going to achieve, at best, 2.7 seconds per token, in the hyper-optimistic scenario that you can actually reach that speed when reading the file, which is too slow for doing much. Maybe if one could get the kernel to swap more aggressively or something, it could cut that time in half or so, but it still would be quite slow.


That's the size on disk, my man. When you quantize it to a smaller float size you lose precision on the weights and so the model is smaller. Then here they `mmap` the file and it only needs 6 GiB of RAM!


The size mentioned is already quantized (and to integers, not floats). mmap obviously doesn't do any quantization.


I’m hopeful that once especially skilled developers like you have banged on minimizing inference resources, you and others will start looking at distributed training ideas. Probably there is a way to decentralize the training so we can all throw our GPUs together on building the most useful models for code generation, ones that can be free to use and relatively cheap to run inference on. If you have any thoughts on that side of the LLM space, I’m sure we would all be super curious to hear them.

Thank you for the amazing work. It’s so appreciated by so many on HN like me I’m sure.


Why is it behaving sparsely? There are only dense operations, right?


From what I've read, there's no evidence it's "behaving sparsely". That was just offered as a suggestion for why it might not be loading all the weights, but it makes no sense in terms of the model. It's going to be using all the weights.

Another suggestion is that not all of the word/token embedding table might be used, which would be a function of the input used to test, but that would be easy enough to disprove as there would then be different memory usage for different inputs.

It seems possible the reported memory usage is lower than reality if that's how mmap/top work. In any case, a good use of mmap it seems, especially since for a multi-layer model layer weights will be used sequentially so paged load-on-demand will work relatively well even in a low memory situation.


I also have this question, yes it should be. The forward pass should require accessing all the weights AFAIK.


Gosh, thank you for getting to this before I did. The first thing I said when I saw it loading tens of GB from the disk on each run is, is there some reason they're not using mmap?

This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.

In theory you could always have done this with swap, but swap is even slower because evictions have to be written back to swap (and wear out your SSD if your swap isn't on glacially slow spinning rust) instead of just discarded because the OS knows where to read it back from the filesystem.

This should also make it much more efficient to run multiple instances at once because they can share the block cache.

(I wonder if anybody has done this with Stable Diffusion etc.)


It really shouldn't act as a sparse model. I would bet on something being off.


Thanks for this! I was able to integrate alpaca-30B into a slack bot & a quick tkinter GUI (coded by GPT-4 tbh) by just shelling out to `./main` in both cases, since model loading is so quick now. (I didn't even have to ask GPT-4 to code me up Python bindings to llama's c-style api!)


What’s your setup for running these? I’m not seeing performance improvements on off the shelf hardware that would allow for this.


I host a llama-13B IRC chatbot on a spare old android phone.


Have a repo anywhere?


It's just the same llama.cpp repo everyone else is using. You just git clone it to your Android phone in Termux and then run make, and you're done. https://github.com/ggerganov/llama.cpp

Assuming you have the model file downloaded (you can use wget to download it) these are the instructions to install and run:

pkg install git

pkg install cmake

pkg install build-essential

git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

make -j

./main


Yeah, I’ve already been running llama.cpp locally, but I haven't found it to perform at the level attested in the comment (a 30B model as a chatbot on commodity hardware). 13B runs okay, but inference appears generally too slow to do anything useful on my MacBook. I wondered what you might be doing to get usable performance in that context.


You can change the number of threads llama.cpp uses with the -t argument. By default it only uses 4. For example, if your CPU has 16 physical cores then you can run ./main -m model.bin -t 16

16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bottlenecks, so 32 cores is not twice as fast as 16 cores, unfortunately.


Thanks! Will test that out!


Great work. Is the new file format described anywhere? Skimming the issue comments, I have a vague sense that read-only data was colocated somewhere for zero-copy mmap - or is there more to it?


That's something I'm working on presently.


Just shows how inefficient some of the ML research code can be


As a former grad student, I can tell you, that's all research code, not just ML, or even "performance-oriented" research code.


Training tends to require a lot more precision and hence memory than inference. I bet many of the tricks here won't work well for training.


For now we've just shown how measuring memory consumption can be tricky at times.


Exactly.

It also shows the number of impostors in this thread and the inflated titles of self-proclaimed 'seniors' who can't optimize ML code to even be in the same league as Tunney (jart) and Gerganov (ggerganov).

Not even ChatGPT or Copilot could submit a change or, in fact, completely rewrite and optimize this code like they have done.


Remember this moment when you're about to criticise LLMs. People can act suboptimally too, even experts.


Have you tried running it against a quantized model on HuggingFace with identical inputs and deterministic sampling to check if the outputs you're getting are identical? I think that should confirm/eliminate any concern of the model being evaluated incorrectly.


Maybe off topic, but I just wanted to say that you're an inspiration!


This is nothing short of legendary. Was following the thread on Twitter and LOLed at the replies of “Praise be Jart”, but there’s something of the sublime here. Great weight wrangling judo :)


It appears that this was just a misreading of how memory usage was being reported and there was actually no improvement here. At least nothing so sensational as being able to run a larger-than-RAM model without swapping from disk on every iteration.


Please read the original link to the pull request, where I stated my change offered a 2x improvement in memory usage. You actually are able to load models 2x larger without compromising system stability, because pages are no longer being copied. Previously, you needed 40GB of RAM to load a 20GB model in order to ensure your file cache wasn't destroyed and you didn't have to reread from disk the next time. Now you only need 20GB to load a 20GB model.

The peculiarity here is that tools like htop were reporting the improvement as an 8x improvement, which is interesting, because RAM use is only 2x better due to my change. The rusage.com page fault reporting was also interesting. This is not due to sparseness; it's because htop was subtracting MAP_SHARED memory. The htop docs on my computer say that the color purple is used to display shared memory and yellow is used to display kernel file caches. But it turns out it just uses yellow for both, even though it shouldn't, because mincore() reported that the shared memory had been loaded into the resident set size.


It's obviously a productive change and kudos for taking it on, but much of the enthusiasm being generated here was driven by the entirely unanticipated prospect of running a model at full speed using less memory than the model's own footprint, and by the notion that inference with a dense model somehow behaved in a sparse manner at runtime. Best to be a bit more grounded here, particularly with regard to claims that defy common understanding.


I wanted it to be sparse. Doesn't matter if it wasn't. We're already talking about how to modify the training and evaluation to make it sparser. That's the next logical breakthrough in getting inference for larger models running on tinier machines. If you think I haven't done enough to encourage skepticism, then I'd remind you that we all share the same dream of being able to run these large language models on our own. I can't control how people feel. Especially not when the numbers reported by our tools are telling us what we want to be true.


Thanks for all the nice projects you are releasing for free, btw.


Authorship of this change is contested.

https://news.ycombinator.com/item?id=35431865


Some OSes use zram to compress the unpinned pages instead of swapping them to disk. It might be faster than fetching the pages again from disk. I wonder if this is a reason why folks see different results.


>I'm glad you're happy with the fact that LLaMA 30B (a 20gb file) can be evaluated with only 4gb of memory usage!

Isn't LLaMA 30B a set of 4 files (60.59 GB)?

-edit- nvm, It's quantized. My bad


Before, it had to read the 16GB into a buffer, which itself could be written to disk if the system needs to page out.

That is a big reason startup time is fast.


Could mmap also be used to improve the memory usage of whisper.cpp?


How diverse is the training corpus?



Is there any measure, not of size or token amount, but of diversity in the content of the text?

Did that metric meaningfully change when the amount of required memory dropped?

If the amount of diversity is lowered, I would expect that to lower the amount of patterns to be modeled from the text. If that is the case, then the resulting model size itself would be lowered, during and after training.


By "diversity," do you mean something like "entropy?" Like maybe

    H_s := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := all s-grams from the training set? That seems like it would eventually become hard to impossible to actually compute. Even if you could what would it tell you?

Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting....


I'm really just speculating here.

Because the text we write is not evenly distributed random noise, what we encode into it (by writing) is entropy.

Because LLMs model text with inference, they model all of the entropy that is present.

That would mean that the resulting size would be a measure of entropy (sum of patterns) divided by repetition (recurring patterns). In this count, I would consider each unique token alone an instance of the identity pattern.

So to answer both questions: yes.


Hey, I saw your thoughtful comment before you deleted it. I just wanted to apologize — I had no idea this was a de facto Show HN, and certainly didn’t mean to make it about something other than this project.

The only reason I posted it is because Facebook had been DMCAing a few repos, and I wanted to reassure everyone that they can hack freely without worry. That’s all.

I’m really sorry if I overshadowed your moment on HN, and I feel terrible about that. I’ll try to read the room a little better before posting from now on.

Please have a wonderful weekend, and thanks so much for your hard work on LLaMA!

EDIT: The mods have mercifully downweighted my comment, which is a relief. Thank you for speaking up about that, and sorry again.

If you'd like to discuss any of the topics you originally posted about, you had some great points.


I don't think you need to apologize for your comment. Even if I post a Show HN I expect no special treatment on my comments. That would be ridiculous and this is not even a Show HN.


great apology :)


"How much RAM did you shave off last week?"

"Oh, you know, like 12-18GB"

"Haha shut the fuck up, how much RAM did you shave off last week"

"12-18GB"

"Let me tell you what - you show me your commits right now, if you shaved off 12-18GB of RAM last week I quit my job right now and come work for you"

https://www.youtube.com/watch?v=TxHITqC5rxE


Maybe not so fast. Other users are reporting that it’s not actually running properly in environments with limited RAM. The reduced memory usage might be more of a reporting misunderstanding, not an actual reduction in memory usage.


It will run; it will just have to reread the model for every new token.


With NVMe Gen 4 SSDs this might not be that huge of an issue, and it's for sure much cheaper than investing in RAM.


I don't believe the consumer ones actually have the sustained sequential read speed to saturate Gen 4.


Gen 5 PCIe is ~4 GB/s per lane, and AMD Genoa chips have 128 such lanes. That means on the order of 500 GB/s of aggregate throughput, which is comparable to the aggregate theoretical throughput of the 12-channel DDR5 RAM on Genoa CPUs.

In other words, with enough data interleaving between enough NVME SSDs, you should have SSD throughput of the same order of magnitude as the system RAM.

The weights are static, so it’s just reads.


Sequential reads are the best-case scenario for SSDs. Writes degrade, as they're first committed to the SLC cache before being written to slower TLC/QLC.


The pace of collaborative OSS development on these projects is amazing, but the rate of optimisations being achieved is almost unbelievable. What has everyone been doing wrong all these years cough sorry, I mean to say weeks?

Ok I answered my own question.


>What has everyone been doing wrong all these years

So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. And all of the developments involving large language models recently have been the product of hundreds of thousands of dollars in rented compute time. Once you start putting six digits on a pile of model weights, that becomes a capital cost that the business either needs to recuperate or turn into a competitive advantage. So everyone who scales up to this point doesn't release model weights.

The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.

Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.

[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.

[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.


Training via CPU isn’t that bad if fully optimized with AVX512 extensions.


> Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.

This is why I think the patent and copyright system is a failure. The idea is that having laws protecting information like this would advance the progress of science.

It doesn't, because look how much faster an illegally leaked model advances in the open. The laws protecting IP merely give a moat to incumbents.


> The laws protecting IP merely give a moat to incumbents.

Yes. These laws are bad. We could fix this with a 2 line change:

    Section 1. Article I, Section 8, Clause 8 of this Constitution is hereby repealed.
    Section 2. Congress shall make no law abridging the right of the people to publish information.


Abolishing the copyright clause would not solve this problem because OpenAI is not leveraging copyright or patents. They're just not releasing anything.

To fix this, you'd need to ban trade secrecy entirely. As in, if you have some kind of invention or creative work you must publish sufficient information to replicate it "in a timely manner". This would be one of those absolutely insane schemes that only a villain in an Ayn Rand book would come up with.


> Abolishing the copyright clause would not solve this problem because OpenAI is not leveraging copyright or patents. They're just not releasing anything.

The problem is: how in the world is ChatGPT so good compared to the average human being? The answer is that human beings (except for the 1%) have their left hands tied behind their backs because of copyright law.


The idea of a patent was always a time-limited monopoly and in exchange you would reveal trade secrets that could presumably advance science. I think like many aspects of modernity, it's a bit outmoded these days, particularly in software, but that was the idea. Copyright was similar, but it did not last nearly as long as it does today in the original US incarnations.


> we don't really have a way for the FOSS community to pool together that much money

There must be open source projects with enough money to pool into such a project. I wonder whether wikimedia or apache are considering anything.


Maybe we can repurpose the SETI@home infrastructure :)


BOINC might be usable but the existing distributed training setups assume all nodes have very high speed I/O so they can trade gradients and model updates around quickly. The kind of setup that's feasible for BOINC is "here's a dataset shard, here's the last epoch, send me back gradients and I'll average them with the other ones I get to make the next epoch". This is quite a bit different from, say, the single-node case which is entirely serial and model updates happen every step rather than epoch.
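
For intuition, the coordinator side of that scheme reduces to something like the sketch below (plain synchronous gradient averaging; not the code of BOINC or any existing training project, and it glosses over stragglers, stale results, compression, and validating untrusted contributions):

    // Average the gradient vectors returned by volunteer nodes for one round,
    // then take a single SGD step on the shared weights.
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;

    Vec average_gradients(const std::vector<Vec> &grads) {
        Vec avg(grads[0].size(), 0.0f);
        for (const Vec &g : grads)
            for (size_t i = 0; i < g.size(); ++i)
                avg[i] += g[i] / grads.size();
        return avg;
    }

    void apply_update(Vec &weights, const Vec &avg_grad, float lr) {
        for (size_t i = 0; i < weights.size(); ++i)
            weights[i] -= lr * avg_grad[i];   // vanilla SGD on the averaged gradient
    }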


Or a big cloud platform could donate some compute for free, giving back some of the profit they make from OSS.


>Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers.

How so? Why couldn't we just start a gofundme/kickstarter to fund the training of an open-source model?


Who will be entrusted to benefit from this? Recall that OpenAI did begin as an open source project, and that they chose to go the capitalist route despite initially explicitly stating on their site that they were a non-profit.


OpenAI specifically cited scaling costs as a reason for why they switched their org structure from non-profit to "capped profit"[0].

You could potentially crowdfund this, though I should point out that this was already tried and Kickstarter shut it down. The effort in question, "Unstable Diffusion", was kinda sketchy, promising a model specifically tuned for NSFW work. What you'd want is an organization that's responsible, knows how to use state of the art model architectures, and at least is willing to try and stop generative porn.

Which just so happens to be Stability AI. Except they're funded as a for-profit on venture capital, not as a something you can donate to on Kickstarter or Patreon.

If they were to switch from investor subsidy to crowdfunding, however, I'm not entirely sure people would actually be lining up to bear the costs of training. To find out why we need to talk about motive. We can broadly subdivide the users of generative AI into a few categories:

- Companies, who view AI as a way to juice stock prices by promising a permanent capitalist revolution that will abolish the creative working class. They do not care about ownership, they care about balancing profit and loss. Inasmuch as they want AI models not controlled by OpenAI, it is a strategic play, not a moral one.

- Artists of varying degrees of competence who use generative AI to skip past creative busywork such as assembling references or to hack out something quickly. Inasmuch as they have critiques of how AI is owned, it is specifically that they do not want to be abolished by capitalists using their own labor as ground meat for the linear algebra data blender. So they are unlikely to crowdfund the thing they are angry is going to put them out of a job.

- No-hopers and other creatively bankrupt individuals who have been sold a promise that AI is going to fix their lack of talent by making talent obsolete. This is, of course, a lie[2]. They absolutely would prefer a model unencumbered by filters on cloud servers or morality clauses in licensing agreements, but they do not have the capital in aggregate to fund such an endeavor.

- Free Software types that hate OpenAI's about-face on open AI. Oddly enough, they also have the same hangups artists do, because much of FOSS is based on copyleft/Share-Alike clauses in the GPL, which things like GitHub Copilot are not equipped to handle. On the other hand they probably would be OK with it if the model was trained on permissive sources and had some kind of regurgitation detector. Consider this one a wildcard.

- Evildoers. This could be people who want a cheaper version of GPT-4 that hasn't been Asimov'd by OpenAI so they can generate shittons of spam. Or people who want a Stable Diffusion model that's really good at making nonconsensual deepfake pornography so they can fuck with people's heads. This was the explicit demographic that "Unstable Diffusion" was trying to target. Problem is, cybercriminals tend to be fairly unsophisticated, because the people who actually know how to crime with impunity would rather make more money in legitimate business instead.

Out of five demographics I'm aware of, two have capital but no motive, two have motive but no capital, and one would have both - but they already have a sour taste in their mouth from the creep-tech vibes that AI gives off.

[0] In practice the only way that profit cap is being hit is if they upend the economy so much that it completely decimates all human labor, in which case they can just overthrow the government and start sending out Terminators to kill the working class[1].

[1] God damn it why do all the best novel ideas have to come by when I'm halfway through another fucking rewrite of my current one

[2] Getting generative AI to spit out good writing or art requires careful knowledge of the model's strengths and limitations. Like any good tool.


Maybe a lot of people/companies also don't want to give their data and knowledge to OpenAI, so that they can sell it off to the competition.


Yes, that's the "strategic" play I mentioned before.

This isn't really helpful for people who want open AI though, because if your strategy is to deny OpenAI data and knowledge then you aren't going to release any models either.


> AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.

You can't if you have one 12GB GPU. You can if you have a couple of dozen. Then petals-style training becomes possible. It is all very new and there are many unsolved hurdles, but I think it can be done.


One thing I don't understand: If it's possible to chunk and parallelize it, is it not relatively straightforward to do these chunks sequentially on a single GPU with a roughly linear increase in runtime? Or are the parallelized computations actually interdependent and involving message-passing, making this unfeasible?


Data moving back and forth from CPU to bus to GPU and back again, for however many chunked model parts you have, would increase training time far beyond what you would be willing to invest, not to mention how inefficient and power intensive it is - far more power than doing just CPU-only or GPU-only training. Back to the time part - it's not linear at all. IMO it's easily quadratic.

It's not unfeasible; in fact, that's essentially how things were done before lots of improvements to the various libraries. Many corps still have poorly built pipelines that spend a lot of time in CPU land and not enough in GPU land.

Just an FYI as well - intermediate outputs of models are used in quite a bit of ML, you may see them in some form being used for hyperparameter optimization and searching.


Maybe a good candidate for the SETI@home treatment?


It is a good candidate. The tech is a good 6-18 months away, though.


How much faster can we develop the tech if we leverage GPT-4 to do it?


Sure, but when one 12gb GPU costs ~$800 new (e.g. for the 3080 LHR), "a couple of dozens" of them is a big barrier to entry to the hobbyist, student, or freelancer. And cloud computing offers an alternative route, but, as stated, distribution introduces a new engineering task, and the month-to-month bills for the compute nodes you are using can still add up surprisingly quickly.


We are talking groups, not individuals. I think it is quite possible for a couple hundred people to cooperate and train something at least as big as LLaMA 7B in a week or two.


Can the foss community really find nobody with the motivation to use their Bitcoin rig as a research machine? Or do you need even more specialized hardware than that?


The bitcoin rig is specialised - it can only compute SHA hashes. You need more general compute.


$n00k of compute time is nothing, sorry. This is the kind of thing that academic institutions can give out for free…


why don't you write the check then huh


Roughly: OpenAIs don’t employ enough jarts.

In other words, the groups of folks working on training models don’t necessarily have access to the sort of optimization engineers that are working in other areas.

When all of this leaked into the open, it caused a lot of people knowledgeable in different areas to put their own expertise to the task. Some of those efforts (mmap) pay off spectacularly. Expect industry to copy the best of these improvements.


They have very good people but those people have other priorities.


The jart: the standard unit of developer.

"I've got 20 yrs experience, and I think I'm about 150 milli jarts, maybe 200 on a good day."


I'd say it's probably not a priority for them right now.

Of course it would save them some money if they could run their models on cheaper hardware, but they've raised $11B so I don't think that's much of a concern right now. Better to spend the efforts on pushing the model forward, which some of these optimisations may make harder.


It's a pretty big concern if you had to spend a billion on training, but 6 months later the open source community is able to replicate your training for <$100K because you were too cheap to hire an extra 100 optimization experts.

That'd be a 10,000 fold depreciation of an asset due to a preventable oversight. Ouchies.



In March 2014, Tunney petitioned the US government on We the People to hold a referendum asking for support to retire all government employees with full pensions, transfer administrative authority to the technology industry, and appoint the executive chairman of Google Eric Schmidt as CEO of America

https://en.m.wikipedia.org/wiki/Justine_Tunney


what's your point? also interestingly JART is in the thread here, so they might have read your comment :)


I'm pretty sure she knows already?


The professional optimizes well enough to get management off their back, the hobbyist can be irrationally good.


The professional operates within prioritization tranches. Make it work, make it reliable, make it fast, make it pretty. If you're still iterating on proof-of-concept/prototyping you'll generally confine yourself to the first and/or second levels. Once you've settled on a finalized prototype you then follow the rest of the prioritization levels to achieve shippable product.


> but the rate of optimisations being achieved is almost unbelievable. What has everyone been doing wrong all these years cough sorry, I mean to say weeks?

It’s several things:

* Cutting-edge code, not overly concerned with optimization

* Code written by scientists, who aren’t known for being the world’s greatest programmers

* The obsession the research world has with using Python

Not surprising that there’s a lot of low-hanging fruit that can be optimized.


I’m not sure this is fair - a lot of the performance optimisations have come from applied mathematicians rather than programmers, and python is not generally considered to be the bottleneck (it is the interface rather than what is running the computation - it calls a C API which then often uses CUDA and may also run on hardware specifically designed for ML).


Why does Python get so much flak for inefficiencies? It's really not that slow, and in ML the speed-sensitive parts are libraries in lower level languages anyway. Half of the optimization from this very post is in Python.


It might not be slow in general, but it's easy to write slow code in it.


In ML python is effectively the interface rather than the bit that is doing the heavy lifting.

The interface is designed to be easy to use (Python) and the bit that is actually doing the work is designed to be highly performant (C & CUDA, and it may even be running on a TPU).


It really is that slow.

You're completely correct that the speed-sensitive parts are written in lower-level libraries, but another way to phrase that is "Python can go really fast, as long as you don't use Python." But this also means ML is effectively hamstrung into only using methods that already exist and have been coded in C++, since anything in Python would be too slow to compete.

There's lots of languages that make good tradeoffs between performance and usability. Python is not one of those languages. It is, at best, only slightly harder to use than Julia, yet orders-of-magnitude slower.


Python has the misfortune of competing against JS in this arena, which just so happens to have the most obsessively optimized JIT ever.


In ML? No, the best competition for Python in ML is... well, it's either C++ or Julia, depending on how you define "competition," given Python is effectively a glorified C++ interface.


I have predicted that LLaMA will be available on mobile phones before the end of this year. We are very close.


You mean in a self-contained app? It can already run on a phone. GPU acceleration would be nice at this point, though.


I was thinking about the limitations of the mobile HW. Yeah, GPU support would be nice.


People have actually run it on phones.


This is what my comment is about. It happened much sooner than I thought.


And THIS is why I always advocate for democratization of AI models: getting these into the hands of the open source community and letting people use them as they wish.

But a lot of people would rather only have govt or corp control of it...


I might be missing something but I actually couldn't reproduce. I purposefully chose a computer with 16GiB RAM to run the 30B model. Performance was extremely slow, and the process was clearly not CPU-limited, unlike when it's running the 13B model. It's clearly swapping a lot.


The apparent answer is here: https://news.ycombinator.com/item?id=35398012

> mmap-ed memory pages backed by a file that aren't dirty aren't counted in an process's RSS usage, only kernel page cache. The mmap-ed regions of virtual memory does get counted in VSZ (virtual memory) but that is just virtual and can be larger than RAM+swap.


> I might be missing something but I actually couldn't reproduce.

Someone in the GitHub comments had the same experience when using a 10GB VM to limit memory usage.

It appears the claims of memory reduction were premature. Perhaps an artifact of how memory usage is being reported by some tools.


Same, performance of the quantised 30B model on my M1 16GB Air is absolutely terrible. A couple of things I noticed in Activity Monitor:

1. "memory used" + "cached files" == 16GB (while swap is zero)

2. Disk reading is 500-600MB/s

3. it seems that every token is computed exactly _after every ~20GB read from disk_, which points to it re-reading the weights file for each token (instead of caching it). I actually suspect that swapping may have been more efficient.

The last part (3), that it rereads the whole file, is an assumption - it could just be a coincidence that a new token is computed at every ~20GB read from disk - but it makes sense, as I do not think swapping would have been that inefficient.


Can you share the intermediate files? They're taking ages to process on my 16GB-RAM laptop


Which files are you referring to exactly?


ggml-model-f16.bin and ggml-model-q4_0.bin

those are the output of convert-pth-to-ggml.py and quantize respectively

I had to cancel 30B as I needed to use the computer after some 12 hours, now I have to fix the ext4 filesystem of the drive where I was doing it, fun times for the weekend

guess I'll settle for 13B, I was using 7B but the results are pretty lousy compared to GPT4all's Lora, let alone GPT3.5-turbo or better

I'll give a shot to quantising 13B, I'm on 16GB of RAM locally


Yeah, the first time I ran the 30B model, it crashed my machine and I had to reinstall from scratch (linux).


Same here on an M1 MacBook Pro. Zero speed-up on loading or inference.


Are the weights on NVME? Old SSD? HDD?


It’s interesting how NVMe will be even more critically important if this lazy weight loading approach works out. PCIe 5 has arrived just in time for LLM inference, it seems.


Well, in this case it does not have to do with SSDs; quite the opposite - the performance gain seems to come from the file being cached in RAM from the start.


That’s not my understanding. The entire point is the model can’t fit in RAM. mmap allows lazy loading from storage.


Yes, but to compute a token it has to eventually read the data, either cached in RAM or from storage. There is no way a fast SSD can compete with RAM in terms of I/O speed. To achieve any speed benefit the whole file has to be cached in RAM. This has other benefits, e.g. threads can share memory, and the file does not have to be reread the next time it is needed because it is already in the page cache, but in the final analysis you either have the RAM or you are reading from disk, and reading 20GB for each token means you need to read 1TB for a paragraph of 50 tokens. My M1, which by no means has a slow SSD, reads the file at 500-600MB/s, while a Thunderbolt PCIe 4 enclosure reads at 700-800MB/s; even if you double that, it will still take 10-20 seconds per token. To get less than 1 second per token for the 30B model one has to read at 20GB/s. By the time we can do that, there will be even larger (V)RAM and even larger models.


PCIe 5 NVMe drives can do 11+ GBps so at least 2x your numbers. We seem to be talking past each other because the point of the change is to run inference on a cpu for an LLM with a larger weight size than the host RAM can fit.

It looks to me that if I were planning on building a new machine capable of LLM inference, it's going to be possible using commodity gamer components, and if lazy weight loading is viable, then such a machine with multiple PCIe 5 NVMe drives in a RAID 0 can potentially almost reach memory bandwidth.

Next on my list to investigate is inference with GPUs: could multiple smaller GPUs somehow be used with a technique similar to the OP's post?


In my case they are on a SATA SSD.


I love how LLMs have got the attention of proper programmers such that the Python mess is getting cleaned up.


Now it's a C mess!


How so?


There are two distinct kinds of jobs: ML researchers and software engineers. A lot of ML researchers write pretty bad code by software engineering standards but that's okay; it's not their job to produce clean code. They string together libraries in Python and do a lot of experimentation and analysis. When they produced something ready to be productionized, software engineers then come in and optimize things.

This is totally the right way. Make it work, then make it right, then make it fast.


> ML researchers and software engineers. A lot of ML researchers write pretty bad code by software engineering standards but that's okay

Yes. When you have to try out dozens of research ideas, most of which won't pan out, you stop writing engineering-style code and switch to hacker mode. Why make it nice when you could be trying 2 more ideas in the meantime? Most research code is going to the trash anyway.


Not sure if you came up with "Make it work, then make it right, then make it fast", but I just screenshotted it and made it the mantra for my current project... which is far more complicated than anything I have done as a side project. I am struggling with the desire to go "make it right" as I work on shipping the deployable prototype (right now running on cloud services I control)... thanks for this.


https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast

> This formulation of this statement has been attributed to [KentBeck][0]; it has existed as part of the [UnixWay][1] for a long time.

[0]: https://wiki.c2.com/?KentBeck

[1]: https://wiki.c2.com/?UnixWay


Funny enough, this tracks the early history of Google as well. It was originally written by Larry Page and Sergey Brin (both grad students at the time) in Python, then Sanjay Ghemawat rewrote the whole thing in C++.


That sounds very interesting. Anyone know where I can find more on this story?


I learned this from Steven Levy's great book, "In the Plex": https://www.amazon.com/Plex-Google-Thinks-Works-Shapes/dp/14...


"Make it work, then make it right, then make it fast" is brilliant. This should be a universal principle for almost everything


I meant "how is the Python mess getting cleaned up," rather.


C has an almost infinite horizon for optimization. Python is good for prototypes, but we are beyond that stage now.


99% of LLM evaluation with PyTorch was already done in C++.

These .cpp projects don't improve anything for performance. They just drop dependencies necessary for training and experimentation.


Optimization isn't just about speed. As you said, dropping dependencies makes it portable, embeddable, more versatile


It's also nice to not lose your mind over how crazy Python and Docker are, when all you want to do is run inference in a shell script as though it were the `cat` command. That sacred cow is going to have to come out of the temple sooner or later, and when that happens, people are going to think, wow, it's just a cow.


Have you tried Julia for this instead?


Has anyone done any comprehensive analysis on exactly how much quantization affects the quality of model output? I haven't seen any more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.


I've done some experiments here with Llama 13B; in my subjective experience the original fp16 model is significantly better (particularly on coding tasks). There are a bunch of synthetic benchmarks such as wikitext2 PPL, and all the whiz-bang quantization schemes seem to score well, but subjectively something is missing.

I've been able to compare 4 bit GPTQ, naive int8, LLM.int8, fp16, and fp32. LLM.int8 does impressively well but inference is 4-5x slower than native fp16.

Oddly, I recently ran a fork of the model on the ONNX runtime and I'm convinced the model performed better than pytorch/transformers; perhaps subtle differences in floating point behavior between kernels on different hardware significantly influence performance.

The most promising next step in the quantization space IMO has to be fp8, there's a lot of hardware vendors adding support, and there's a lot of reasons to believe fp8 will outperform most current quantization schemes [1][2]. Particularly when combined with quantization aware training / fine tuning (I think OpenAI did something similar for GPT3.5 "turbo").

If anybody is interested I'm currently working on an open source fp8 emulation library for pytorch, hoping to build something equivalent to bitsandbytes. If you are interested in collaborating my email is in my profile.

1. https://arxiv.org/abs/2208.09225 2. https://arxiv.org/abs/2209.05433


> Llama 13B

> Llama.cpp 30B

> LLaMA-65B

the "number B" stands for "number of billions" of parameters... trained on?

like you take 65 billion words (from paragraphs / sentences from like, Wikipedia pages or whatever) and "train" the LLM. is that the metric?

why aren't "more parameters" (higher B) always better? aka return better results

how many "B" parameters is ChatGPT on GPT3.5 vs GPT4?

GPT3: 175b

GPT3.5: ?

GPT4: ?

https://blog.accubits.com/gpt-3-vs-gpt-3-5-whats-new-in-open...

how is Llama with 13B parameters able to compete with GPT3 with 175B parameters? It's 10x+ less. How much RAM does it take to run "a single node" of GPT3 / GPT3.5 / GPT4?


> the "number B" stands for "number of billions" of parameters... trained on?

No, it's just the size of the network (i.e. number of learnable parameters). The 13/30/65B models were each trained on ~1.4 trillion tokens of training data (each token is around half a word).


Isn't that related to architecture? The most recent GPUs and tensor procs have native support for 4-bit(partially) and 8-bit int whereas older GPUs take noticeable performance hits for 8-bit vs fp16/32.


Ah, but LLM.int8 (e.g. as in huggingface transformers) isn't actually int8; it's a mixed precision encoding scheme that is nominally eight bits per parameter. This means custom CUDA kernels etc. These kernels could be improved, but without hardware support it's always going to be slow.

Straight int8 quantization generally does not work for post training quantization of transformers. The distribution of weights includes a significant amount of outlier values that seem to be important to model performance. Apparently quantization aware training can improve things significantly but I haven't seen any developments for llama yet.

Interestingly on the 4 bit front, NVIDIA has chosen to remove int4 support from the next gen Hopper series. I'm not sure folks realize the industry has already moved on. FP8 feels like a bit of a hack, but I like it!
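
To illustrate the outlier problem with a toy example (this is generic absmax quantization, not the scheme any particular library uses): one large weight sets the scale for the whole tensor, and everything else collapses onto a couple of quantization levels.

    // Toy absmax int8 round-trip: quantize with one per-tensor scale, dequantize,
    // and watch a single outlier wipe out all of the small weights.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<float> w = {0.01f, -0.02f, 0.03f, 0.015f, -0.025f, 8.0f}; // 8.0 is the outlier
        float absmax = 0.0f;
        for (float x : w) absmax = std::max(absmax, std::fabs(x));
        float scale = absmax / 127.0f;                 // one scale for the whole tensor
        for (float x : w) {
            int q = (int)std::lround(x / scale);       // map onto the int8 grid
            printf("w=% .4f  q=%4d  dequantized=% .4f\n", x, q, q * scale);
        }
        // With the outlier, scale ~= 0.063, so every small weight rounds to 0.
        // Per-block scales (as in q4_0) or outlier handling (as in LLM.int8) limit the damage.
        return 0;
    }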


For this specific implementation here's info from llama.cpp repo:

    Perplexity - model, options
    5.5985     - 13B, q4_0
    5.9565     - 7B, f16
    6.3001     - 7B, q4_1
    6.5949     - 7B, q4_0
    6.5995     - 7B, q4_0, --memory_f16

According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[2] https://arxiv.org/abs/2210.17323


Define "comprehensive?"

There are some benchmarks here: https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu... and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...

Check out the original paper on quantization, which has some benchmarks: https://arxiv.org/pdf/2210.17323.pdf and this paper, which also has benchmarks and explains how they determined that 4-bit quantization is optimal compared to 3-bit: https://arxiv.org/pdf/2212.09720.pdf

I also think the discussion of that second paper here is interesting, though it doesn't have its own benchmarks: https://github.com/oobabooga/text-generation-webui/issues/17...


Some results here: https://github.com/ggerganov/llama.cpp/discussions/406

tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.


I wonder where such a difference between llama.cpp and the [1] repo comes from. The F16 difference in perplexity is 0.3 on the 7B model, which is not insignificant. ggml's quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa


I'd guess the GPTQ-for-LLaMa repo is using a larger context size. Poking around it looks like GPTQ-for-llama is specifying 2048 [1] vs the default 512 for llama.cpp [2]. You can just specify a longer size on the CLI for llama.cpp if you are OK with the extra memory.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/934034c8e...

[2] https://github.com/ggerganov/llama.cpp/tree/3525899277d2e2bd...


GPTQ-for-LLaMa recently implemented some quantization tricks suggested by the GPTQ authors that improved 7B especially. Maybe llama.cpp hasn't been evaluated with those in place?


Does anyone know how/why this change decreases memory consumption (and isn't a bug in the inference code)?

From my understanding of the issue, mmap'ing the file is showing that inference is only accessing a fraction of the weight data.

Doesn't the forward pass necessitate accessing all the weights and not a fraction of them?


It's not a bug, but a misreading of the htop output, since the mmap'ed pages don't show up in the resident set size there. The pages are read-only and not dirty, so it's "on the OS" to count them, and the OP had lots of RAM in the computer, so the model just resides in the page cache instead.


Ahh, this would do it, thanks :).


Yeah, I believe some readers are misinterpreting the report. The OS manages mmap; it won't show up as "regular" memory utilization because it's lazy-loaded and automatically managed. If the OS can keep the whole file in memory, it will, and it will also transparently evict pages back to disk, giving priority to explicit memory allocations (malloc).

Sounds like the big win is load time from the optimizations. Also, maybe llama.cpp now supports low-memory systems through mmap swapping? ... at the end of the day, 30B quantized is still 19GB...


Maybe lots of the data is embedding values or tokenizer stuff, where a single prompt uses a fraction of those values. And then the rest of the model is quite small.


That shouldn't be the case. 30B is a number that directly represents the size of the model, not the size of the other components.


If you load a file by malloc'ing a buffer and read()ing into it, the kernel copies the data into userspace. With mmap there is no copy: the mapping points straight at the page cache.
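
Roughly, the two approaches look like this (a sketch, not llama.cpp's actual loader; error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdlib>

    // 1) malloc + read(): the kernel copies bytes from the page cache into our heap buffer.
    void *load_with_read(const char *path, size_t *size) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *buf = (char *)malloc(st.st_size);
        size_t off = 0;
        while (off < (size_t)st.st_size) {
            ssize_t n = read(fd, buf + off, (size_t)st.st_size - off);  // copy: page cache -> user buffer
            if (n <= 0) break;
            off += (size_t)n;
        }
        close(fd);
        *size = st.st_size;
        return buf;
    }

    // 2) mmap(): no copy. The mapping aliases the page cache, pages fault in lazily
    //    on first access, and clean read-only pages can be shared by every process
    //    that maps the same file.
    void *load_with_mmap(const char *path, size_t *size) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after close
        *size = st.st_size;
        return p;
    }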


I messed around with 7B and 13B and they gave interesting results, although not quite consistent enough results for me to figure out what to do with them. I'm curious to try out the 30B model.

Start time was also a huge issue with building anything usable, so I'm glad to see that being worked on. There's potential here, but I'm still waiting on more direct API/calling access. Context size is also a little bit of a problem. I think categorization is a potentially great use, but without additional alignment training and with the context size fairly low, I had trouble figuring out where I could make use of tagging/summarizing.

So in general, as it stands I had a lot of trouble figuring out what I could personally build with this that would be genuinely useful to run locally and where it wouldn't be preferable to build a separate tool that didn't use AI at all. But I'm very excited to see it continue to get optimized; I think locally running models are very important right now.


Well, only if you don't count the filesystem cache as "RAM". You still need enough memory so that the kernel can hold all these pages, even if Llama.cpp is not using this memory itself.


This seems suspiciously like a bug (either in inference or in mmap reporting), as these models are not sparse enough for the savings to come from anywhere viable.


Wow, I continue to be amazed by the progress being made on language models in the scope of weeks. I didn't expect optimisations to move this quickly. Only a few weeks ago we were amazed by ChatGPT, assuming it would never be something to run at home since it requires $100,000 in hardware (8x A100 cards).


Before ChatGPT was in beta, there were already models that fit into 2gb and smaller. They were complete shit, but they did exist.


I know but what's changing is that they aren't shit now. Not on par with GPT but getting much closer. Especially with a little massaging like Stanford has done.


"The recent change also means you can run multiple LLaMA ./main processes at the same time, and they'll all share the same memory resources." So this could have a main and multiple sub-worker llm processes possibly collaborating while sharing same memory footprint?


Yes, if the model is mmap'ed read-only (as I'm sure it is).

There are other bottlenecks than CPU cores though, it might not be very useful to run multiple in parallel..


Less memory than most Electron apps!


With all my dislike to Electron, I struggle to remember even one Electron app that managed to use 6 gigs.


I assume it was a joke


I've seen WhatsApp doing it. It starts at 1.5GB anyway, so after some images and stuff it inflates quite a lot.


Maybe they should do more mmap as well.


LMAO


My guess is that there was an error during quantization that resulted in a large amount of the weights not properly being used. A potential test would be to compare the number of page faults between the quantized and unquantized models and confirm they are roughly the same proportionally. This could also explain why e.g. gpt4all users seem to notice better performance on unquantized weights when there really shouldn't be a difference.


Does this mean that we can also run the 60B model on a 16GB ram computer now?

I have the M2 air and can't wait until further optimisation with the Neural Engine / multicore gpu + shared ram etc.

I find it absolutely mind boggling that GPT-3.5(4?) level quality may be within reach locally on my $1500 laptop / $800 m2 mini.


I doubt it: text size and text pattern size don't scale linearly.


Interesting, I wonder how it scales.


Does that also mean 6GB VRAM?

And does that include Alpaca models like this? https://huggingface.co/elinas/alpaca-30b-lora-int4


According to https://mobile.twitter.com/JustineTunney/status/164190201019... you can probably use the conversion tools from the repo on Alpaca and get the same result.

If you want to run larger Alpaca models on a low VRAM GPU, try FlexGen. I think https://github.com/oobabooga/text-generation-webui/ is one of the easier ways to get that going.


Late edit: Deep Speed, not FlexGen. I don't know if FG could work, but that repo only supports it for Opt models.


Yeah, or deepspeed presumably. Maybe torch.compile too.

I dunno why I thought llama.cpp would support gpus. shrug


Lots of C++ programs use the GPU. It's irrelevant.


I think those specific Alpaca models are all in safetensors now and there isn't a simple converter to ggml.


No (llama.cpp is CPU-only) and no (you need to requantize the model).


Total noob questions.

1. How does this compare with ChatGPT3

2. Does it mean we could eventually run a system such as ChatGPT3 on a computer

3. Could LLM eventually replace Google (in the sense that answers could be correct 99.9% of the time) or is the tech inherently flawed


"Could LLM eventually replace Google"

If you try to use LLMs as a Google replacement you're going to run into problems pretty quick.

LLMs are better thought of as "calculators for words" - retrieval of facts is a by-product of how they are trained, but it's not their core competence at all.

LLaMA at 4-bit on my laptop is around 3.9GB. There's no way you could compress all of human knowledge into less than 4GB of space. Even ChatGPT / GPT-4, though much bigger, couldn't possibly contain all of the information that you might want them to contain.

https://www.newyorker.com/tech/annals-of-technology/chatgpt-... "ChatGPT Is a Blurry JPEG of the Web" is a neat way of thinking about that.

But... it turns out you don't actually need a single LLM that contains all knowledge. What's much more interesting is a smaller LLM that has the ability to run tools - such as executing searches against larger indexes of data. That's what Bing and Google Bard do already, and it's a pattern we can implement ourselves pretty easily: https://til.simonwillison.net/llms/python-react-pattern

The thing that excites me is the idea of having a 4GB (or 8GB or 16GB even) model on my own computer that has enough capabilities that it can operate as a personal agent, running searches, executing calculations and generally doing really useful stuff despite not containing a great deal of detailed knowledge about the world at all.


I just need an LLM that can search, retrieve, and condense information on reddit, stackoverflow, and wikipedia to a given query.


Depending on the max tokens, I think you can pretty easily fine-tune a model to return answers with the actions required, then wrap your prompt app to react to those, paste the answers back in, and "re-ask/re-prompt" the same question...

Similar stuff is being researched under the "langchain" term.


The largest LLaMA model, at ~65 billion parameters, is not quite as large as GPT-3 in size, and probably not quite as well trained, but it's basically in the same class. Even the complete, non-quantized model can be run with llama.cpp on ARM64 and x86_64 CPUs already, assuming you have enough RAM (128 GB?).


Minor correction: ChatGPT uses GPT-3.5 and (most recently, if you pay $20/month) GPT-4. Their branding definitely needs some work, haha. We are on track for you to be able to run something like ChatGPT locally!


worse, yes, yes, no


I think I found a decent 30B Alpaca model, if anyone else is just now diving in:

https://huggingface.co/Pi3141/alpaca-lora-30B-ggml/tree/main


> 6GB of RAM

> Someone mentioning "32-bit systems"

Um, no, you're not mapping 6GB of RAM on a 32-bit system. The address space simply doesn't exist.


Windows Server could use up to 64 GB for a 32-bit operating system. Individual processes couldn't map more than 4 GB, but the total could be larger: https://en.wikipedia.org/wiki/Physical_Address_Extension


It actually happened. With paging and bigger word sizes, which were common in 32 bit systems even with hardware acceleration.

A 32bit address space only means you have 4GibiAddresses, which do not need to be pointing to single bytes. In fact the natural thing to do in a 32 bit system for a structure like this is moving 32bit words, which actually means you're addressing a 16GB space, flat. And then there's segmentation.

For instance, the 286 had a 24-bit address bus, allowing for 16MB in direct addressing mode and 1GB via segmentation (what back then was usually referred to as virtual memory).

The 386 had a 32-bit address width and its MMU allowed access to 64TB in virtual mode and 4GB in protected mode. This was indeed one of the reasons Linux was not made 286-compatible: its protected mode was only 1GB and segmented rather than 4GB flat, so Linus didn't have to deal with XMS or EMS for a chip that was becoming obsolete soon anyway. But the 1GB space was there, and at the time that was plenty.


I wonder if Georgi or jart use GPT in their programming and design. I guess the training data was lacking for the sort of stuff they do, due to their field of work - especially jart's.


Not yet. GPT-4 helped answer some questions I had about the WIN32 API but that's the most use I've gotten out of it so far. I'd love for it to be able to help me more, and GPT-4 is absolutely 10x better than GPT 3.5. But it's just not strong enough at the kinds of coding I do that it can give me something that I won't want to change completely. They should just train a ChatJustine on my code.


1. Turn the swap off or monitor it closely

2. Try to load a big model, like 65B-q4 or 30B-f16

3. Observe the OOM

It's not so hard to test this.


Using a memory mapped file doesn't use swap. The memory is backed by the file that is memory mapped!


Is there a reason Llama is getting so much attention compared to say T5 11B?

Not sure how neutral or what benchmarks are used on the following link, but T5 seems to sit a lot higher on this leaderboard?

https://accubits.com/large-language-models-leaderboard/


Is llama open source? I heard it was pirated from Facebook


I did not claim llama was open source, but I see the url I posted insinuates that (probably for a contorted meaning of open source, as in source available for approved academics).

Anyway, T5 being available for download from Huggingface only makes my question more pertinent...


I made an app for running t5 locally - compiled version allows you to run without installing anything.

https://capsizegames.itch.io/chat-ai

https://github.com/Capsize-Games/chatai


interesting, what are the hardware requirements?

does it happen to run on CPU on a server with 96GB RAM?


the compiled app is meant for people to install and use with their GPU and runs on as low as a GTX 1080. I haven't tested against CPU only builds.

You can take a look at the source code and see if it would be useful to you.


llama can run on an m1. T5 still needs a specialized gpu


What is the reason T5 needs a specialized GPU and Llama doesn't?

In the end they are mathematical models, so what would prevent someone from loading T5 onto a machine with plenty of RAM (like a server)? Would the codebase truly require that much refactoring? How difficult would it be to rewrite the model architecture as a set of mathematical equations (Einstein summation) and reimplement inference for the CPU?


I'm far from an expert in this area. But llama has been updated so anyone can hack with it on their m1 macbook (which many developers have). If someone updated T5 to be as easy to dev against, then I am sure they would see similar community interest.

Most people don't have the hardware or budget to access these specialized high vram GPUs.


Still can't run it, thanks AMD ROCm.


CPU inference runs perfectly fine, though.


Fine but very very slow


I don't understand. I thought each parameter was 16 bit (two bytes) which would predict minimally 60GB of RAM for a 30 billion parameter model. Not 6GB.


I was thinking something similar. Turns out that you don't need all the weights for any given prompt.

> LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt I suppose only a small portion of that needs to be used at evaluation time [...]

Found the answer from the author of this amazing pull request: https://github.com/ggerganov/llama.cpp/discussions/638#discu...


Does this mean LLaMA only uses 10% of it's brain? An urban legend come to life!


No, the parent comment is mistaken. The model weights all have to be accessed for the forward pass. What has happened is that using mmap changes where the memory is counted (kernel vs. process), and so it was being incorrectly interpreted. There are still 30B parameters, and you'll still need that times however big your parameter representation is to use the model.


But do they all need to be accessed at the same time? If not, pages that are not being actively used can be dropped from memory until needed again.


Parameters have been quantized down to 4 bits per parameter, and not all parameters are needed at the same time.
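
As a rough sanity check on the sizes involved (my arithmetic, not numbers from the repo; I'm also assuming q4_0 stores a small per-block scale on top of the 4-bit weights, which would account for the ~19-20GB file mentioned upthread):

    // Ballpark model footprints for a ~30B-parameter model at different widths.
    #include <cstdio>

    int main() {
        const double params  = 30e9;                 // "30B" ~= 30 billion parameters
        const double fp32_gb = params * 4.0 / 1e9;   // 4 bytes/param -> ~120 GB
        const double fp16_gb = params * 2.0 / 1e9;   // 2 bytes/param -> ~60 GB
        const double q4_gb   = params * 0.5 / 1e9;   // 4 bits/param  -> ~15 GB before per-block scales
        printf("fp32 ~%.0f GB, fp16 ~%.0f GB, 4-bit ~%.0f GB (plus quantization overhead)\n",
               fp32_gb, fp16_gb, q4_gb);
        return 0;
    }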


Great to see this advancing! I’m curious if anyone knows what the best repo is for running this stuff on an Nvidia GPU with 16GB vram. I ran the official repo with the leaked weights and the best I could run was the 7B parameter model. I’m curious if people have found ways to fit the larger models on such a system.



looks great thank you!


I'd assume that the 33B model should fit with this (the only repo that I know of that implements SparseGPT and GPTQ for LLaMA); I personally haven't tried, though. But you can try your luck: https://github.com/lachlansneff/sparsellama


Has this been backported to alpaca CPP?


Would it in theory be possible to back up only 1 file (of several TB in size), your disk as a block device file, here? Or would this require reuploading the entire disk every time any single file inside changes rather than only the blocks that changed?


It's possible to run llama.cpp on windows, e.g. see this tutorial:

https://www.youtube.com/watch?v=coIj2CU5LMU

Would this version (ggerganov) work with one of those methods?


Yes, the 30B model is working for me on Windows 10 / AMD 5600G CPU / 32GB RAM, with llama.cpp release master-3525899 (already one release out of date!), in PowerShell, using the Python 3.10 version that automatically installs when you type "python3".

I did the following:

1. Create a new working directory.

2. git clone https://github.com/ggerganov/llama.cpp

3. Download the latest release from https://github.com/ggerganov/llama.cpp/releases (note the CPU requirements in the filename) and unzip directly into the working directory's llama.cpp/ - you'll have the .exe files and .py scripts in the same directory.

4. Open PowerShell, cd to the working directory/llama.cpp, and create a new Python virtual environment: python3 -m venv env and activate the environment: .\env\Scripts\Activate.ps1

5. Obtain the LLaMA model(s) via the magnet torrent link and place them in the models directory. I used 30B and it is slow, but usable, on my system. Not even ChatGPT 3 level especially for programming questions, but impressive.

6. python3 -m pip install torch numpy sentencepiece

7. python3 convert-pth-to-ggml.py models/30B/ 1 (you may delete the original .pth model files after this step to save disk space)

8. .\quantize.exe ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2

9. I copied the examples/chat-13B.bat to a new chat-30B.bat file, updated the model directory, and changed the last line of the script to: .\main.exe

10. Run using: .\examples\chat-30B.bat

https://github.com/ggerganov/llama.cpp#usage has details, although it assumes 7B and skips a few of the above steps.


Is the 30B model clearly better than the 7B?

I played with Pi3141/alpaca-lora-7B-ggml two days ago and it was super disappointing. In percentage between 0% = alpaca-lora-7B-ggml and 100% GPT-3.5, where would LLaMA 30B be positioned?


I haven't been able to run it myself yet, but according to what I read so far from people who did, the 30B model is where the "magic" starts to happen.


Check out the graph on page 3 of this PDF: https://arxiv.org/abs/2302.13971 The 33B model started beating the 7B when it had been trained on only 1/3 as much data. And then they kept training it to 40% more than the total that 7B was trained on. It's better.


I'm curious if someone will have to port these enhancements elsewhere, ie: https://github.com/rustformers/llama-rs


Does that only happen with the quantized model or also with the float16 / float32 model? Is there any reason to use float models at all?


Half-OT:

How do all these models compare to each other?

Are there any metrics that can tell me how much better or worse LLaMA is compared to GPT3?

What does it even mean to be better?


Where do you download the tokenizer.model that is needed to convert the GPT4ALL model to the appropriate format?


lesmo provides a magnet uri to all models here. it's inside (499kb): https://github.com/facebookresearch/llama/pull/73/files/016a...


Somebody please teach ggerganov Python. I am sure he will create the same magic with Python.


How is LLaMA's performance relative to ChatGPT? Is it as good as GPT-3 or even 4?


It is as good as GPT-3 at most sizes. An instruct layer needs to be put on top in order for it to compete with GPT-3.5 (which powers ChatGPT). It can be done with a comparatively small amount of compute (a couple hundred bucks' worth for the small models; I'd assume low thousands for 65B).


That's surprising to read, given that ChatGPT (at least the first version) was much worse than text-davinci-003 at following instructions. The new version seems to be much better, though.


No, it wasn't


What's the difference between llama.cpp and alpaca.cpp?


I assume the former is just the foundation model (which only predicts text) while the latter is instruction tuned.


This is super amazing stuff and I am still in awe.


Looks like we could run OpenAI's GPT-3.5 with 32GB of RAM?


Where are you going to download that from?


Hey if Meta can see a leak, then a former non-profit like OpenAI definitely can.


Difference is that Meta was "publishing" the model to researchers, and one of them leaked it. I suspect OpenAI is a little less likely to share the GPT models around unfortunately...


The real question is when can I run Jarvis


the ggml-model-f16.bin file for 30B is taking me a bunch of hours to process

are these made available somewhere?


the README.md file of the repository is not updated to reflect these changes yet, right?


mmap saves the day again.


On the legal front, I’ve been working with counsel to draft a counterclaim to Meta’s DMCA against llama-dl. (GPT-4 is surprisingly capable, but I’m talking to a few attorneys: https://twitter.com/theshawwn/status/1641841064800600070?s=6...)

An anonymous HN user named L pledged $200k for llama-dl’s legal defense: https://twitter.com/theshawwn/status/1641804013791215619?s=6...

This may not seem like much vs Meta, but it’s enough to get the issue into the court system where it can be settled. The tweet chain has the details.

The takeaway for you is that you’ll soon be able to use LLaMA without worrying that Facebook will knock you offline for it. (I wouldn’t push your luck by trying to use it for commercial purposes though.)

Past discussion: https://news.ycombinator.com/item?id=35288415

I’d also like to take this opportunity to thank all of the researchers at MetaAI for their tremendous work. It’s because of them that we have access to such a wonderful model in the first place. They have no say over the legal side of things. One day we’ll all come together again, and this will just be a small speedbump in the rear view mirror.

EDIT: Please do me a favor and skip ahead to this comment: https://news.ycombinator.com/item?id=35393615

It's from jart, the author of the PR the submission points to. I really had no idea that this was a de facto Show HN, and it's terribly rude to post my comment in that context. I only meant to reassure everyone that they can freely hack on llama, not make a huge splash and detract from their moment on HN. (I feel awful about that; it's wonderful to be featured on HN, and no one should have to share their spotlight when it's a Show HN. Apologies.)


IANYL - This is not legal advice.

As you may be aware, a counter-notice that meets the statutory requirements will result in reinstatement unless Meta sues over it. So the question isn't so much whether your counter-notice covers all the potential defenses as whether Meta is willing to sue.

The primary hurdle you're going to face is your argument that weights are not creative works, and not copyrightable. That argument is unlikely to succeed for the the following reasons (just off the top of my head): (i) The act of selecting training data is more akin to an encyclopedia than the white pages example you used on Twitter, and encyclopedias are copyrightable as to the arrangement and specific descriptions of facts, even though the underlying facts are not; and (ii) LLaMA, GPT-N, Bard, etc, all have different weights, different numbers of parameters, different amounts of training data, and different tuning, which puts paid to the idea that there is only one way to express the underlying ideas, or that all of it is necessarily controlled by the specific math involved.

In addition, Meta has the financial wherewithal to crush you even were you legally on sound footing.

The upshot of all of this is that you may win for now if Meta doesn't want to file a rush lawsuit, but in the long run, you likely lose.


I think the counter on those arguments is that LLM owners want to avoid arguing that the model is a derivative work of the training data.

If the LLM is a specific arrangement of the copyrighted works, it's very clearly a derivative work of them


I was not suggesting that an LLM itself consists of an arrangement of the copyrighted works comprising the training data, but that the specific selection of the copyrighted works comprising the training data is part of what differentiates one LLM from another. A strained but useful analogy might be to think of the styles of painting an artist is trained in and/or exposed to prior to creating their own art. Obvious or subtle, the art style an artist has studied would likely impact the style they develop for themself.

However, to address your point about derivative works directly, the consensus among copyright law experts appears to be that whether a particular model output is infringing depends on the standard copyright infringement analysis (and that's regardless of the minor and correctable issue represented by memorization/overfitting of duplicate data in training sets). Only in the most unserious legal complaint (the class action filed against Midjourney, Stability AI, etc.) is the argument being made that the models actually contain copies of the training data.


I just want to say that I appreciate the legal analysis. Thanks for your time.

If you ever come up with more hypothetical arguments in favor of NNs being copyrightable, please let me know. Or post them somewhere.


All models trained on public data need to be made public. As it is, their outputs are not copyrightable, so it's not a stretch to say the models are public domain.


You seem to be mixing a few different things together here. There's a huge leap from something not being copyrightable to saying there is grounds for it to be made public. No copyright would greatly limit the ability of model makers to legally restrict distribution if they made it to the public, but they'd be fully within their rights to keep them as trade secrets to the best of their ability. Trade secret law and practice is its own thing separate from copyright, lots of places have private data that isn't copyrightable (pure facts) but that's not the same as it being made public. Indeed part of the historic idea of certain areas of IP like patents was to encourage more stuff to be made public vs kept secret.

>As it is their outputs are not copyrightable, it’s not a stretch to say models are public domain.

With all respect this is kind of nonsensical. "Public domain" only applies to stuff that is copyrightable, if they simply aren't then it just never enters into the picture. And it not being patentable or copyrightable doesn't mean there is any requirement to share it. If it does get out though then that's mostly their own problem is all (though depending on jurisdiction and contract whoever did the leaking might get in trouble), and anyone else is free to figure it out on their own and share that and they can't do anything.


Public domain applies to uncopyrightable works, among other things (including previously copyrighted works). In this case the models are uncopyrightable, and I think FB (or any of these newfangled AI cos) would have an interesting time proving otherwise, if they ever try.

https://en.m.wikipedia.org/wiki/Public_domain


I’m honestly not sure. RLHF seems particularly tricky —- if someone is shaping a model by hand, it seems reasonable to extend copyright protection to them.

For the moment, I’m just happy to deprive corporations of the ability to use DMCAs against open source projects. The long term implications will be interesting.


Aggregating and organizing public knowledge is a fundamentally valuable activity that many companies build their business on.

If I create a website for tracking real estate trends in my area — which is public information — should I not be able to sell that information?

Similarly, if a consulting company analyzes public market macro trends, are they not allowed to sell that analysis?

Just because the information being aggregated and organized is public does not necessarily mean that the resulting product should be public as well.


Thank you for putting your ass on the line and deciding to challenge $megacorp on their claims of owning the copyright on NN weights that have been trained on public (and probably, to some degree, also copyrighted) data. This seems to very much be uncharted territory in the legal space, so there are a lot of unknowns.

I don't consider it ethical to compress the corpus of human knowledge into some NN weights and then lock those weights behind proprietary doors, and I hope that legislators will see this similarly.

My only worry is that they'll get you on some technicality, like the fact that (as far as I know) some version of your program used their servers.


Wish you all the luck in the world. We need much more clarity on the legal status of these models.


Thanks! HN is pretty magical. I think they saw https://news.ycombinator.com/item?id=35288534 and decided to fund it.

I’m grateful for the opportunity to help protect open source projects such as this one. It will at least give Huggingface a basis to resist DMCAs in the short term.


Even if using LLaMA turns out to be legal, I very much doubt it is ethical. The model got leaked while it was only intended for research purposes. Meta engineered and paid for the training of this model. It's theirs.


Did Meta ask permission from every user whose data they trained their model on? Did all those users consent, and by consent I mean a genuine meeting of the minds, not something buried on page 89 of a EULA, to Meta building an AI with their data?

Turnabout is fair play. I don't feel the least bit sorry for Meta.


They don't ask permission when they're stealing users' data, so why should users ask permission for stealing their data?

https://www.usatoday.com/story/tech/2022/09/22/facebook-meta...


But it doesn't copy any text one-to-one. The largest model was trained on 1.4 trillion tokens, if I recall correctly, but it has just 65 billion parameters. (I believe they use 16 bits per token and parameter.) It seems to be more like a human who has read large parts of the internet, but doesn't remember anything word by word. Learning from reading stuff was never considered a copyright violation.


> It seems to be more like a human who has read large parts of the internet, but doesn't remember anything word by word. Learning from reading stuff was never considered a copyright violation.

This is one of the most common talking points I see brought up, especially when defending things like AI "learning" from the style of artists and then being able to replicate that style. On the surface we can say it's similar to a human learning an art style and replicating it. But that implies the program functions like a human mind, and as far as I know the jury is still out on that; I doubt we even know exactly how a human mind actually "learns" (I'm not a neuroscientist).

Let's say, for the sake of experiment, I ask you to cut out every word of Pride and Prejudice and keep them all sorted. Then, when asked to write a story in the style of Jane Austen, you pull from that pile of snipped-out words and arrange them in a pattern that most resembles her writing. Did you transform it? Sure, maybe; if a human did that I bet they could even copyright the result. But I think that a machine which takes those words and phrases and applies an algorithm to generate output, even with stochastic elements, leaves a direct backwards trace (albeit a 65B-parameter convolution of it), which means the essence of the copyrighted materials has been directly translated.

From what I can see we can't prove the human mind is strictly deterministic, but an AI very well might be in many senses. So the transfer of non-deterministic material (the original) through a deterministic transform has to trace back to the non-deterministic source (the human mind, and therefore the original copyright holder).


LLaMa was trained on data of Meta users, though.


I was sleepy; I meant to say that it WASN'T trained on data of Meta users.


I feel like most everything about these models gets really ethically grey, at worst, very quickly.


It's an index of the web and our own comments, hardly something they can claim ownership of, let alone resell.

But OTOH, by preventing commercial use, they have sparked the creation of an open source ecosystem where people are building on top of it because it's fun, not because they want to build a moat and fill it with sweet VC $$$money.

It's great to see that ecosystem being built around it, and soon someone will train a fully open source model to replace Llama.


What did they train it on?


On partly copyrighted text. Same as you and me.


Meta as a company has shown pretty blatantly that they don't really care about ethics, nor the law for that matter.


What about Slaren?

https://rentry.org/Jarted


What is LLaMA? What can it do?


Read the README in the repo.


From readme:

- Can it run doom

- Inference of LLaMA model in pure C/C++

- Plain C/C++ implementation without dependencies

It really does not explain itself to the uninitiated. I infer it is some kind of language model.

Why/how it differs from any other impl/model, I do not know.



