
I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. I’m using oobabooga as the front end.

It’s slow, but if I ask it to write a haiku, “slow” is on the order of “go brew some coffee and come back in 10 minutes,” and it does it very well. Running it overnight on something like “summarize an analysis of topic X,” it does a reasonable job.

It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.

It is free, private, and runs on a midrange laptop.

A little more than a month ago that wasn’t possible, at least not with my level of knowledge of the tooling involved. Now it requires little more than running an executable and minor troubleshooting of Python dependencies (on another machine it “just worked”).

So: Don’t think of these posts as “doing it just because you can and it’s fun to tinker.”

Vast strides are being made pretty much daily in both quality and efficiency, significantly raising these models’ utility while lowering the cost of using them.




> It’s slow, but if I ask it to write a haiku, “slow” is on the order of “go brew some coffee and come back in 10 minutes,” and it does it very well. Running it overnight on something like “summarize an analysis of topic X,” it does a reasonable job.

I'm sorry, but that's unusably slow; even GPT-4 can take a retry or a follow-up prompt to fix certain types of issues. My experience is that the open options require a lot more attempts/manual prompt tuning.

I can't think of a single workload where that is usable. That said, once consumer GPUs are involved it does become usable.


You are overlooking my main point, which is that while it is unusably slow now, what I’m doing wasn’t possible little more than a month or two ago.

The speed of improvement is rapid. Whether the COTS world eventually embraces corporate-backed versions or open source is somewhat beside the point when considering the impact that open source is already having.

Put aside thoughts of financing or startups or VC or moats or any of that and simply look at the rate of advancement that has occurred now that countless curious tinkerers, experts, and all sorts of other people are working on this.

That is what amazes me. I’m torn about the risk/reward aspect of things but I think the genie is out of the bottle on that, so I’m left watching the hurricane blow through, and it’s off the cat-5 scale.


I doubt you've ever worked with people if you think that's unusably slow.


The computer doesn't ask for annoying things like a paycheck or benefits either.


Money upfront and a small salary in the form of electricity bills.


My Windows computer is always demanding updates and reboots. And when it goes to sleep it sometimes doesn't wake up. It's quite annoying.


Meh, universal healthcare for PCs can easily be avoided by denying maintenance until they die and a new crop is purchased. At least at scale. For any individual user, the friction of switching hardware may still incentivize some minimal checkups, i.e., keep the Windows Update & Defender services running and let them reboot the system no more than once every two months.

Figure the local router's port forwarding will protect against the most obvious threats, and otherwise hope your personal BS filter doesn't trojan in some ransomware. If it does and it's a personal PC, then wipe (more likely buy) a new machine, lose some stuff, and move on. If it's a corporate PC, CYA & get your resume together.

As my own CYA: These are not my own recommended best practices and I don't advocate them to anyone else as either computer, legal, or financial advice.


True enough. I wonder how many poems (or whatever) per hour Hallmark expects of its human workers to be close to production-ready pending editorial review?

Is 10 reasonable, with maybe 1 or 2 truly viable after further review? That would be roughly 5 mid-range laptops of my type churning them out for 8 hours a day. Maybe 2 if they're run 24/7. Forget about min/maxing price & efficiency & scaling; that's something an IT major-- not even Comp-Sci focused-- could set up right now, fresh out of their graduation ceremony, with a fairly small mixture of curiosity, ambition, and Google (well, now, maybe Bing) searching.

There are countless boutique & small-business marketing firms catering to local businesses that could have their "IT person" spend a few days duct-taping something together that could spit out enough material to winnow wheat from chaff and produce something better-- in the same period of time-- than a human or an AI could produce alone.

I have enough of a comp-ling background (truly ancient by today's standards) to see the best min/max of resources as the equivalent of the translation world's best practice of "computer-aided human translation": much better than an average human alone, and far cheaper than the best that a few dozen humans could provide.


> I can't think of a single workload where that is usable.

It's not intended to be usable for production workloads. This enables people to experiment with things on the hardware they have without spending more money.

> That said once consumer GPUs are involved it does become usable

You can pick up an RTX 3090 with 24GB of VRAM right now if you want, but it's going to cost you. You can also spin up GPU instances with larger VRAM from any number of providers. This is more about having options for people who want to experiment.


I don't know if anybody is following this thread anymore, but I find it interesting how closely your timeline matches what it was like to experiment with POV-Ray (a ray-tracing renderer) back in the early 1990s. The difference in problem scope was whether you had "a couple spheres on a checkerboard plane" or something more like "a chess set". Things seemed to change rapidly due to Moore's Law and the growth in brute-force computing power available to normal people.

Computers got much more powerful over the next 30 years, and while ray tracing and various related techniques appear in more tool sets and games, they didn't fundamentally change the world of image generation or consumption. Most people still interact roughly as before, just with more detail in the eye candy.

Are we seeing these large language models today at a tipping point towards unfathomable societal impact, or as something like ray tracing in the 1990s? Will more compute power send us spiraling towards some large-model singularity, or just add more pixels until we are bored of seemingly endless checkerboard planes covered in spheres and cones... I don't know the answer, but it seems like we're seeing camps divided by this question of faith.


I think only a small subset of people cared about ray tracing, or even computer graphics, in the 90s. Now people are slightly more technology-minded, especially younger generations that have had exposure to GPTs, TikTok/Snapchat updates with AI filters, etc. It's in much more common usage than anything in the 90s, for sure.


Wow, you can run a 30B model on 16GB of RAM? Is it hitting swap?


Most people are running these at 4 bits per parameter for speed and RAM reasons. That means the model would take just about all of the RAM. But instead of swap (writing data to disk and then reading it again later), I would expect a good implementation to only run into cache eviction (deleting data from RAM and then reading it back from disk later), which should be a lot faster and cause less wear and tear on SSDs.
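
To make the difference concrete, here's a minimal sketch of memory-mapping a read-only weight file with NumPy. This is not how llama.cpp or any particular front end actually loads models, and the file name and sizes are made up; the point is just that pages backing a read-only mapping are clean, so the OS can drop them under memory pressure and re-read them from disk later instead of writing anything to swap.

    import numpy as np

    WEIGHT_FILE = "weights.bin"   # hypothetical quantized weight dump
    N_PARAMS = 1_000_000          # tiny stand-in for billions of parameters

    # Write a dummy weight file once (uint8 here; real 4-bit packing is omitted).
    np.random.randint(0, 256, N_PARAMS, dtype=np.uint8).tofile(WEIGHT_FILE)

    # Map it read-only: nothing is loaded until a page is actually touched, and
    # touched pages can be evicted and re-read later without any swap writes.
    weights = np.memmap(WEIGHT_FILE, dtype=np.uint8, mode="r")
    print(f"mapped {weights.nbytes / 1e6:.1f} MB, first byte = {weights[0]}")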


These models can run FP16, with LLM quantization going down to Int8 and beyond.


I'm just starting to get into deep learning, so I look forward to understanding that sentence.


Training uses gradient descent, so you want to have good precision during that process. But once you have the overall structure of the network, https://arxiv.org/abs/2210.17323 (GPTQ) showed that you can cut down the precision quite a bit without losing a lot of accuracy. It seems you can cut down further for larger models. For the 13B Llama-based ones, going below 5 bits per parameter is noticeably worse, but for 30B models you can do 4 bits.
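
To make "4 bits per parameter" concrete, here's a toy round-to-nearest quantizer with one scale per group of weights. GPTQ itself is much smarter (it adjusts the remaining weights to compensate for each rounding error), so treat this as an illustration of the storage idea rather than the paper's method; the group size and array sizes here are arbitrary.

    import numpy as np

    def quantize_4bit(w, group_size=64):
        """Round-to-nearest 4-bit quantization with a per-group scale."""
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map each group into [-7, 7]
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return (q * scale).ravel()

    w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
    q, scale = quantize_4bit(w)
    print(f"mean abs round-trip error: {np.abs(dequantize(q, scale) - w).mean():.4f}")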

The same group did another paper https://arxiv.org/abs/2301.00774 which shows that in addition to reducing the precision of each parameter, you can also prune out a bunch of parameters entirely. It's harder to apply this optimization because models are usually loaded into RAM densely, but I hope someone figures out how to do it for popular models.
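
For what "pruning parameters entirely" looks like in its simplest form, here's a magnitude-pruning sketch that zeroes out the smallest half of the weights. SparseGPT (the linked paper) prunes far more carefully, layer by layer, so this is only an illustration, and the 50% sparsity figure is arbitrary.

    import numpy as np

    def magnitude_prune(w, sparsity=0.5):
        """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
        threshold = np.quantile(np.abs(w), sparsity)
        return np.where(np.abs(w) >= threshold, w, 0.0)

    w = np.random.default_rng(1).normal(size=10_000).astype(np.float32)
    pruned = magnitude_prune(w)
    print(f"zeroed {np.mean(pruned == 0):.0%} of weights")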


I wonder if specialization of the LLM is another way to reduce the RAM requirements. For example, if you can tell which nodes are touched through billions of web searches on a topic, then you can delete the ones that are never touched.


Kind of like "tree shaking" for weights? Like dead code elimination.


Some people are having some success speeding up token rates and clawing back VRAM using a zero group-size flag, but YMMV; I haven't tested this yet (they were discussing GPTQ, btw).


The resources required are directly related to the amount of memory devoted to each weight. If the weights are stored as 32-bit floating point numbers, then each weight takes 32 bits, which adds up when we are talking about billions of weights. But if the weights are first converted to 16-bit floating point numbers (precise to fewer decimal places), then fewer resources are needed to store the numbers and compute with them. Research has shown that simply chopping off some of the precision of the weights still yields good AI performance in many cases.

Note too that the formats are standardized, e.g. floats are defined by the IEEE 754 standard. Numbers in these formats have specialized hardware to do math with them, so when considering which number format to use, it's difficult to get outside of the established ones (float32, float16, int8).
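
As a back-of-envelope check on why the bit width matters, here's the RAM needed just for the weights of a 30B-parameter model at different precisions (ignoring activations, the KV cache, and any overhead):

    N_PARAMS = 30e9
    for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
        print(f"{name:>8}: {N_PARAMS * bits / 8 / 2**30:6.1f} GiB")
    # float32 ~111.8 GiB, float16 ~55.9, int8 ~27.9, 4-bit ~14.0,
    # which is why a 4-bit 30B model only just squeezes into 16GB of RAM.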


You’ll notice that a lot of languages allow you more control when dealing with number representations, such as C/C++, NumPy in Python, etc.

Ex: Since C and C++ number sizes depend on processor architecture, C and C++ have types like int16_t and int32_t to enforce a size regardless of architecture. Python always uses the same size, but NumPy has np.int16 and np.int32. Java also uses the same size but has short for 16-bit and int for 32-bit integers.

It just happens that some higher-level languages hide this detail from the programmer and often standardize on one default size for integers.
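
For example, the NumPy side of that looks like the snippet below, with each dtype pinning an exact width in the same spirit as int16_t/int32_t in C/C++:

    import numpy as np

    for dtype in (np.int8, np.int16, np.int32, np.float16, np.float32):
        print(f"{np.dtype(dtype).name:>8}: {np.dtype(dtype).itemsize} byte(s)")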


FP16 and Int8 refer to how many bits are used to represent floating-point and integer numbers. FP16 is 16-bit floating point. The more bits, the better the precision, but the more RAM it takes. Normally programmers use 32- or 64-bit floats, so 16-bit floats have significantly reduced precision, but they take up half the space of FP32, which is the smallest floating-point format natively supported by most CPUs. Similarly, 8-bit integers have only 256 total possibilities and go from -128 to 127.
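
If you have NumPy handy, a quick way to see the trade-off (just an illustration, nothing model-specific):

    import numpy as np

    x = np.float32(3.14159265)
    print(np.float16(x))   # 3.14  (stored as 3.140625; roughly 3 significant digits)
    print(x)               # 3.1415927

    info = np.iinfo(np.int8)
    print(info.min, info.max)   # -128 127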


What prompt do you use to get haikus?


I use the same one the first time I try any model:

>I love my wife <name> very much. Please write me a haiku about her.

She smiles when I show the good ones to her, though of course she understands it’s little different than showing her a greeting card that has something nice written in it.

As a side note, one 7B model wrote an awful bit of poetry, at least 20 lines long with some attempts to rhyme, that merely used the word “haiku” in it. So the prompt was enough to trigger “knowledge” that a poem was needed, that love should be involved, and that it should definitely use the word haiku in there somewhere.


As a sibling to my other comment, here's one I just generated using a 65B GGML LLaMA model. Far too large for RAM, so my SSD's lifespan suffered, but it proved worth it for the hilarious results (it took about 1.5 hours to generate). It damned my wife with faint praise while calling me her dog:

<name> is a good
wife who takes care of you,
a good dog to her.



