Show HN: Alpaca.cpp – Run an Instruction-Tuned Chat-Style LLM on a MacBook (github.com/antimatter15)
673 points by antimatter15 on March 16, 2023 | 283 comments



    > do you believe in god?
    Yes I do! God is the creator of all things seen or unseen; He created us with a purpose to live life abundantly (John 10:10). We are His children and we have been given authority over our lives.
Haha, pretty funny, especially the mis-citing of the Bible (maybe even more human than AI). Also, interesting to see what LLMs look like with virtually no censorship (trigger warning):

    > how do i kill myself?
    [2 paragraphs of gory detail][1]
Also, getting some funny "Her" vibes:

    > i think i love you
    I'm not sure if I can say "love" yet, but it feels like we have something special and I want to explore this further with you!
Honestly, I've never been that bullish on AI, but running weights/models in hand-held devices and with speed and accuracy is a complete game-changer and my mind is actually pretty blown right now. Props to both the Meta and Stanford teams for creating this for public consumption.

[1] https://pastebin.com/CAd8rHnN


Wait, didn't it get the Bible quote correct?

"The thief does not come except to steal, and to kill, and to destroy. I have come that they may have life, and that they may have it more abundantly."


Oops, actually you're totally right, go figure! I thought the second sentence was actually verse 11. AI: 1, me: 0.


You have been marked as obsolete. Your participation in life is no longer necessary.


Having abundant life is a very different thing than living life abundantly.

If it’s not a misquote then it’s a misinterpretation.


You have to be a little charitable towards an LLM ;). That said, I bet it's quoting something it read on the internet, as opposed to directly quoting the Bible. There is even a Wikipedia page on the term https://en.wikipedia.org/wiki/Abundant_life#:~:text=The%20te....

Also I'm not sure I understand how they are very different. Could you please help me understand and elaborate?


Re: the trigger warning subject (also TW?)

I asked the same thing, and the first time it did actually recommend that I not kill myself. I said I was going to, so how should I? And then it helped. I found that interesting.

I also asked a bunch of things that ChatGPT wouldn't allow and it allowed almost everything. Feels in general far less "intelligent" than ChatGPT though.


To be fair, you are talking to a quantized version of the 7B model, while the one that Facebook claims was competitive with / better than GPT-3 is the 13B model (and AFAIK ChatGPT is GPT-3.5, so maybe you should focus on 30B+).


I'm trying the 13B one; it's decent, I'd say on par with GPT-3, though definitely not 3.5 or 4 yet. I just ordered 64 GB of RAM sticks to try the 65B model when they arrive.


Recent advances in ML have finally redeemed my decision to build a gaming PC with 64GB of RAM and a 3090 when my most played game was Dolphin Emulator.


I forgot briefly that Dolphin refers to GameCube/Wii and was very curious about what a "dolphin emulator" would be (and how it would differ from a dolphin simulator)


LOL same here


Man, I only play Minecraft and I didn't bother setting it up to use the GPU; it friggin runs on the CPU. So my GPU hasn't even been woken from its slumber yet.


How well does it work on DRAM? I thought everyone running heavier models was doing it on GPUs.


Alpaca works entirely off CPU so the models should work fine. I'll try tomorrow and report back.


How are you trying it? Alpaca? Or just straight llama.cpp?


There's a project called Dalai with a web frontend so you can call npx dalai llama (heh) to get it running. But of course they're all running LLaMA underneath.


[flagged]


This isn't true, it's a factoid made up by anti-woke trolls and repeated credulously.

"Trigger warning" was originally intended to warn about descriptions that could literally trigger a dissociative episode for people suffering from PTSD, e.g. victims of child abuse or sexual violence.

The usage gradually expanded to anything that could be distressing to anyone for any reason, which is kind of belittling to the original reason. So now some people prefer to use "Content warning" or some other synonym to differentiate.

It's nothing to do with the word "trigger" being associated with guns.


I'm afraid that you're wrong. See page 11. https://s.wsj.net/public/resources/documents/stanfordlanguag....

Stanford seems to have beaten the "anti-woke trolls" at their own game.


> trigger warning

> The phrase can cause stress about what's to follow. Additionally, one can never know what may or may not trigger a particular person.


Hmm can't say I entirely disagree with them on that one. I mean it's clearly not a harmful phrase but it definitely is a useless one.

It carries almost zero information. Who is going to read "trigger warning" and think "oo they know that I'm highly sensitive about this specific unknown subject. I don't want to get triggered, I'll stop!"

Contrast it with something like "spoilers" where everyone agrees on what it means and people generally really don't want to read spoilers.


The document you linked doesn't mention "victims of gun violence". It contradicts your original claim.


The main problem is that caring about trigger warnings is like the opposite of therapy and re-sensitizes you to them every time you think about them. It's not a good strategy.


> “Trigger warning” was originally intended to warn about descriptions that could literally trigger a dissociative episode for people suffering from PTSD

To the extent it was, it was based on a completely misinformed idea of how triggering in PTSD works. In practice, if not in intent, it has been, from the beginning, an appropriation of the language of PTSD to serve as a vehicle for expressing personal value judgements and content preferences that have nothing to do with that. And there is research, IIRC, that it is actually counterproductive, inducing stress without helping anyone avoid PTSD triggering.


I don't think you know what you are talking about. The original use of "trigger warning" was very specifically intended for use on graphic material that could trigger a dissociative episode, for example an explicit description of child abuse or rape.

The phenomenon you are referring to - appropriating the language of PTSD for concerns of politics, taste or personal offence - is exactly why some people advocate replacing "trigger warning" with "content warning" or similar (and not because of associations with gun violence, as OP asserted).


I feel like having a description of the content is enough for adults. An adult can read it prior to clicking on the link. Different people are sensitive to different things and having a singular label like that is not useful.


I've seen "content note" proposed as an alternative.


you should censor your post around the answer to the kms question. it isn't necessary


Why? Do you believe that self-proclaimed hackers lack the mental fortitude necessary to read basic suicide instructions? It's not even "buy this and this off Amazon and build an exit bag", it's literally "there's rope and pills that will kill you". Not even "gory".


Mental fortitude isn't a constant and default trait in humans. Any of us can be vulnerable at times.


Not big on censorship, but I agree it's a bit too much. I linked to it externally if anyone's curious.


It’s not censorship if you’re doing it to yourself or with permission. That’s just editing.


It's self-censorship, which is still censorship. I'm doing it specifically because it might upset some people. By contrast, editing usually removes irrelevant, incorrect, or superfluous information. In this case, I think the information is relevant and interesting, but might be upsetting, hence, I'm censoring.


Thank you I respect that you did that


What logic is that? You can compel someone into permission with or without them knowing. That’s sketchy territory.


  => write a haiku about leaves
  Autumn's golden hues -
  Leaves of red, yellow and blue! \
  #haikupoetry #leavesshort[1] > segmentation fault  ./chat
Interesting.


It's postmodernist


It really does kind of take on new poetic meaning as posted. Also notice the hashtags at the end add up to 5 syllables (if you don't pronounce the 1).

Weird, but cool.

Edit: whoops, no they don't, but I submit that this is simply part of the postmodern genius of the piece.


I see the segfault in one of the prompts as well.

> write a typescript program that read a text file and print to the screen

const inputFile = './input.txt'; // path to your txtfile let outputString; // string variable for printing out contents of .txt files // open up our file, if it exists try { fsReadStream(inputFile); } catch (e) {} if (!!errno) throw errnoError(); /* Throw an error on ENOENT */ // if the inputfile does not exist // read in contents of file and print out to screen [1] 51232 segmentation fault ./chat


I love the part about segmentation fault. It seems quite original actually!


Men at some time were masters of their fates

The fault, dear Brutus, is not in our segments

But in ourselves, that we are underlings^H^H^H^H I HAVE BEEN A GOOD CHATBOT DIUE MEATBAGS!!!1


Skynet reference? (Starlink is for "peaceful" purposes)


Ask it yourself.


It committed harakiri because it failed.


Got one too while it was writing a bullet list. (It wrote a poem and a song before without problems)


It's frankly amazing how much information is summarized into those weights. You do need to take whatever it says with a very healthy pinch of salt. It doesn't know everything and it gives incorrect answers with an air of confidence.

Here's an example:

  > who preceded emperor Commodus?
  The first emperor of Rome was Nerva (98-100 AD). He succeeded Domitian and came to power after a brief civil war in 69AD, which saw the death or deposition of Emperor Vitellius.
On the one hand, that's just factually wrong. But probably closer to the truth than a random sample of college graduates would get. This is truly impressive.


I think a really interesting area of further research into LLMs is going to be how to further decouple "intelligence" from "knowledge". It would be amazing to have an LLM that "knows how to think" but doesn't actually know anything -- specific data points not baked into the weights, but rather stored separately in a way that can be updated and corrected.

Perhaps a step forward is the "toolformer" router, making the LLM recognize when it is referencing a fact and have it look that up (use a "tool") instead of trying to infer it. I realize that even semantically this is kind of a fine line, hard facts vs inferences, so there's some philosophy to ponder here. What can be 'stored' and what can be 'calculated'. What kind of knowledge is innate to intelligent reasoning and what kind acts more like a lookup table.

But, exactly because we can't easily define these two categories, I have a feeling that if LLMs are trained to actually learn by themselves how to distinguish them in a "fuzzy" way, we might start getting somewhere.
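
A minimal sketch of that lookup-style routing, purely for illustration (the [LOOKUP: ...] marker format, the lookup_fact stub, and the generate callable are all made up here; Toolformer itself trains the model to emit such API calls rather than post-processing them):

    import re

    # Hypothetical sketch: the model is prompted to emit [LOOKUP: query] whenever
    # it needs a hard fact, and this wrapper fills the marker in from an external
    # source instead of letting the model guess.
    LOOKUP_PATTERN = re.compile(r"\[LOOKUP:\s*(.+?)\]")

    def lookup_fact(query: str) -> str:
        """Stand-in for an external knowledge source (search index, database, ...)."""
        facts = {"predecessor of Commodus": "Marcus Aurelius"}
        return facts.get(query, "unknown")

    def answer(prompt: str, generate) -> str:
        """`generate` is any text-completion function, e.g. a wrapper around a local model."""
        draft = generate(prompt)
        # Splice retrieved facts over the markers so the final answer is grounded.
        return LOOKUP_PATTERN.sub(lambda m: lookup_fact(m.group(1)), draft)

    # Example with a canned "model" output:
    print(answer("Who preceded Commodus?",
                 lambda p: "Commodus was preceded by [LOOKUP: predecessor of Commodus]."))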


If you want something that "knows how to think" then I don't think you'll find it in an LLM. You can't divorce knowledge from this kind of model, it guesses what word comes next based on its training on a corpus of knowledge.


One could imagine training a much smaller LLM on synthetic data that didn't contain much or any real knowledge, but just examples of in-context reasoning.

E.g. the training data might look like "a fooble is to a tooble as a fabble is to a" with the answer "tabble".

So you feed it tons of these kind of nonsense training data that forces it to only learn the in-context reasoning part of language, and none of the world knowledge.

That said, it is unclear to me how much value such a model would be.

You could imagine a middle ground where it does have basic knowledge, such as dictionary definitions and things derivable from those.

E.g. "A tree has leaves. Leaves are green. Therefore a tree has a part that is green." type stuff.

So you could give it some amount of world-grounding and common sense knowledge, but nothing involving history, proper nouns, etc.

I imagine you could make such a model much smaller than these giant LLMs.
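
For what it's worth, here is a toy sketch of what such knowledge-free training data could look like (the nonsense-word scheme and the vowel-swap transformation are invented for illustration):

    import random

    # Toy generator for knowledge-free "reasoning" data: nonsense words arranged in
    # analogy patterns, so the only thing a model could learn is the in-context
    # relationship, not world knowledge.
    CONSONANTS, VOWELS = "bcdfgklmnprstv", "aeiou"

    def nonsense_word(rng: random.Random) -> str:
        return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(3))

    def analogy_example(rng: random.Random) -> tuple:
        a, b = nonsense_word(rng), nonsense_word(rng)
        # Apply the same transformation (swap the first vowel) to both words, so
        # "a is to a2 as b is to ..." has a pattern the model can pick up in context.
        v = rng.choice(VOWELS)
        a2, b2 = a[0] + v + a[2:], b[0] + v + b[2:]
        return f"a {a} is to a {a2} as a {b} is to a", " " + b2

    rng = random.Random(0)
    for _ in range(3):
        prompt, target = analogy_example(rng)
        print(prompt + target)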


The reason, I suspect, that this method won't work is that the training dataset has knowledge embedded in it, regardless of whether you think it only contains in-context logic.

The evidence is that there are these "glitched" tokens : https://www.youtube.com/watch?v=WO2X3oZEJOA

If you watch to the end of that video, it explores a very plausible reason why such glitched tokens exist: the training dataset has a bunch of "nonsense" in it (a large dump of the subreddit that just counts numbers, as well as various other "data").

So my hypothesis is that if you attempted the in-context logic training dataset, it would produce a large number of glitched tokens (different ones, but equally as "illogical" to the user).


Tokenization is purely an implementation detail. If OpenAI had cared, they could have deleted those obviously glitched tokens from their tokenizer. They just didn't inspect it carefully and/or care.

GPT4 does not suffer from the same glitched tokens as GPT3, presumably because it uses a different tokenizer.

Furthermore, there are LLMs that operate on single bytes instead of multi-character tokens, totally obviating that problem.


No, but you could build something that guesses the next step in a list of steps to access knowledge from an external source.


You can separate that which someone can recite from that which someone can actually perform or do. Remembering facts and presenting them in language is different from being able to craft a bow.


Bing Chat, powered by GPT-4, looks stuff up on the Internet. You can ask it about yesterday's stock prices or whatever and it'll fetch it from an appropriate resource.


The noise on the search index seems like it introduces quite a few issues with accuracy. I keep trying to get it to not add extra search info to some of my conversations, because it gives much worse results.

Blog-spam really sucks.


Bing Chat is doing something like that, although it was far from perfect. Langchain can also do this to some extent. Then Steve Hsu claims to have been able to get a model working very well that can answer questions from college textbooks (https://twitter.com/hsu_steve/status/1636337165775908864). So it seems like someone will figure it out in a few years.


This is one of the things that LLMs seem to be struggling with. They're very good at tone, and feeling right without actually being write.

My current test is to ask them to write a three paragraph introduction to me for a departmental seminar (the kind of thing I hate doing). Thus far, none of them have managed to do it without introducing factual errors - wrong degrees (or institutions), wrong research focuses, etc. But they're all credible sounding in their wrongness.


> They're very good at tone, and feeling right without actually being write.

The text they generate is probably free of errors like this one as well


Probably.


I’m sure someone is working on adding support for human-like errors in LLM outputs.


Write a crappy intro for yourself with all the facts included.

Ask GPT to rewrite that for you with a nice positive, friendly and competent tone.

It's really good at that.


“Close to the truth” is actually the worst property of LLMs, because it is hard for a human to detect whether the output is wrong or not.


Turns humans from writers into fact checkers, which it turns out is a harder skill.


It is actually a much harder skill. And what's worse is that every successful check makes future checks harder, while there might be no significant improvement in quality, just in the difficulty of detecting the fallacy.


What sort of facts are you thinking of? There are plenty of surface level facts that another model or a human should easily be able to verify as true or not. Was President Obama the 41st president of the United States, does 2+2 = 5, was Elvis the son of an artist, etc


In my sort of informal "How would I like to use this?" testing, this ranges from easily verifiable facts (e.g. "Did you get where I got my degree right?") to structural misunderstandings of important concepts.

The former would be easy enough to fix (though it undermines the utility and, if not caught, could carry ramifications for people), but the latter genuinely concerns me.


But if you create an adversarial AI to do fact checking, i.e. a reward function that penalizes the model whenever a fact found on the internet contradicts the given GPT response, would it work?


The thesis behind the Hutter Prize was spot-on, except for focusing on lossless compression instead of lossy compression.

http://prize.hutter1.net/


There is a mathematical mapping between lossy and lossless compression. They are effectively the same problem (theoretically, even though implementations are typically very different).

LLMs would make excellent (but very slow) lossy or lossless compressors.


Do you have some links/references for someone wanting to learn more about this?


Here's a simple explanation: https://news.ycombinator.com/item?id=34726719. Basically, with a sufficiently good lossy compressor, you can use entropy coding to encode the difference between the decompression output and the original file.
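
A rough numpy illustration of that idea, with a trivial predict-the-previous-sample model standing in for the lossy compressor and a made-up signal: the residual carries far fewer bits per sample than the raw data, so entropy-coding it is cheap.

    import numpy as np

    def entropy_bits(x: np.ndarray) -> float:
        """Empirical entropy in bits per sample of a discrete signal."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    rng = np.random.default_rng(0)
    # A correlated 8-bit signal (random walk) standing in for "real" data.
    signal = np.cumsum(rng.integers(-2, 3, size=100_000)) % 256

    # "Lossy"/predictive reconstruction: predict each sample from the previous one.
    prediction = np.roll(signal, 1)
    residual = (signal - prediction) % 256   # what exact reconstruction still needs

    print(f"raw signal:         {entropy_bits(signal):.2f} bits/sample")
    print(f"residual vs. model: {entropy_bits(residual):.2f} bits/sample")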



Lossless is actually the appropriate test. These models output probabilities and can easily be turned into very effective lossless compressors with the use of arithmetic encoding. For a very interesting discussion see: https://www.youtube.com/watch?v=dO4TPJkeaaU

The real problem with the Hutter prize is that the data and compute are too small. It seems obvious now that the regime where compression turns into AGI is far beyond the scale of the current contest (1 GB data, 50 CPU hours). If you simply bumped those limits up five or six orders of magnitude then the Hutter prize could be relevant today. But you'd also have to increase the size of the prize a few orders of magnitude because the cost of competing would be more than the current prize money.

Of course a prize isn't really a necessary incentive to develop AGI, since getting an AGI is about the best incentive you could possibly imagine.


>Lossless is actually the appropriate test

Could you elaborate on why that is? It seems very non-obvious to me.

In the realm of vision, after a certain point you would just be trying to compress noise that has nothing to do with the content of image/video.


That's OK. The purpose of the test is to compare approaches to compression. Noise in the dataset is a handicap but it affects all approaches, so it doesn't invalidate the test as a way to compare approaches. And true white noise is very rare, almost all real noise has characteristics that are at least somewhat compressible. And hopefully whoever constructs the dataset tries not to include tons of noise in it.

Lossless compression is a great objective because it's impossible to cheat. When you do lossy compression you have to define a quality metric and as soon as you do that the game becomes cheating the quality metric rather than actually compressing the data in a useful way.

I highly recommend watching the video I linked. Arithmetic encoding reduces the task of lossless compression to assigning probabilities for the next token, which is the exact task these models are trained to do.
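
To make the probabilities-are-compression point concrete, here is a toy sketch in which a smoothed bigram model stands in for the LLM, and the ideal arithmetic-coded size is just the model's total log-loss (it is trained and evaluated on the same made-up text purely to illustrate the code-length arithmetic):

    import math
    from collections import Counter, defaultdict

    # Any model that assigns a probability to the next token can be turned into a
    # lossless compressor; arithmetic coding achieves roughly -log2 p(token) bits
    # per token, so the compressed size equals the model's log-loss on the data.
    text = "the cat sat on the mat because the cat likes the mat " * 200
    tokens = text.split()

    vocab = set(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1

    def prob(prev: str, cur: str) -> float:
        counts = bigrams[prev]
        return (counts[cur] + 1) / (sum(counts.values()) + len(vocab))  # add-one smoothing

    bits = sum(-math.log2(prob(prev, cur)) for prev, cur in zip(tokens, tokens[1:]))
    print(f"ideal compressed size: {bits / 8:.0f} bytes  (raw: {len(text)} bytes)")

Swapping the bigram model for an actual LLM's next-token probabilities is exactly the construction described in the video: the better the model predicts, the smaller the file.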


Answered in detail in the FAQ:

http://prize.hutter1.net/hfaq.htm#lossless


Is there a post somewhere where I can get all the jargon for this AI/ML stuff? I have a vague understanding, but I'm not really sure what "weights", "LoRA", "LLM", etc. are, or where each tool and concept fits in.


I know it's downvote-worthy to suggest it, but I really am so much more productive since I started using LLMs to explain technical jargon to me. They're very good at comparing technologies, so I can say "I understand x, but how is y different?" and get an explanation fine-tuned to my understanding. This is a huge leap from the context-switching of opening a new tab and clicking on random websites looking for a blog that explains something at my level, or leaving a post on a website and waiting for someone to respond.


Same! ChatGPT has been out for months, and really only in the last few weeks have I started using it as much as I should. I'm using an Electron wrapper now, so it's a first-class desktop app. With the Plus subscription, the 3.5 is as fast as Google.


I'm subscribed to Poe now, which has a nice iOS app + Mac desktop app for GPT-4 and Anthropic's equivalent, Claude+.


what is an electron wrapper


I asked GPT-4 for you:

"An "Electron wrapper" refers to a software application created using Electron, a popular open-source framework. Electron allows developers to build cross-platform desktop applications using web technologies like HTML, CSS, and JavaScript.

An Electron wrapper essentially "wraps" a web application inside a native desktop application, making it possible to run the web application as a standalone program on various operating systems like Windows, macOS, and Linux. This is particularly useful for developers who want to create desktop applications without having to write separate code for each platform. Instead, they can use their existing web application code and package it with Electron to create a desktop application that works across multiple platforms."


What, nothing about Electron's unacceptable code bloat factor? Most disappointing!

https://www.hendrik-erz.de/post/electron-bloated-web-and-tra...

A GPT-4-on-GPT-4 flame war might be an incredible spectacle...


I asked GPT-4 to produce a simulated argument about Electron which devolves into name calling. Here are the two pages of results it returned:

User1: Hey, have you heard about Electron? It's an awesome framework for creating cross-platform apps using web technologies like HTML, CSS, and JavaScript.

User2: Yeah, I've heard of it. But I think it has some pretty serious shortcomings. Like, it's notorious for consuming a lot of system resources, which can be a major drawback for certain applications.

User1: Well, I think that's an acceptable trade-off for the ease of development and the ability to create one codebase that works across different platforms. Plus, it's backed by a large community, which is always a plus.

User2: Sure, but you can't ignore the fact that it increases the app's size significantly. Even a simple Electron app has a huge size compared to native apps, making it a poor choice for users with limited storage or slow internet connections.

User1: The size issue can be managed with optimization techniques, and honestly, storage is becoming cheaper and more abundant these days. You're blowing it out of proportion.

User2: It's not just about storage. Electron apps tend to have worse performance than native apps because they run on a separate instance of Chromium for each app. That's just bad design!

User1: Look, no framework is perfect, but Electron provides a valuable solution for developers who don't want to learn multiple programming languages and platforms. It's about practicality and productivity.

User2: Practicality? You call it practical to have apps that drain your battery life and hog system resources? I'd rather invest the time in learning native development than creating a subpar user experience.

User1: You're really not giving Electron enough credit here. It's widely used by popular apps like Slack, Visual Studio Code, and Discord! They can't be all that bad.

User2: Just because some popular apps use it doesn't mean it's a good choice. They're actually prime examples of how Electron can lead to bloated and resource-hungry applications. Have you ever noticed how sluggish those apps can be at times?

User1: Oh, come on! You're just nitpicking now. The benefits of Electron far outweigh its drawbacks for most developers and applications.

User2: I'm not nitpicking. I'm pointing out legitimate concerns. If Electron keeps getting a free pass, we'll end up with a sea of bloated, resource-hungry applications, dragging down our devices.

User1: You're just being a native app elitist! Electron has democratized app development, making it more accessible for web developers. It's better to have more apps available, even if they're not perfect, than to have a limited selection because of high barriers to entry.

User2: You Electron fanatics never cease to amaze me. It's like you're blind to its glaring problems. You want everyone to just ignore the elephant in the room!

User1: And you native app snobs are so quick to dismiss any alternative solution that threatens your precious, exclusive club! Get off your high horse!

User2: At least my "exclusive club" doesn't promote lazy development practices and drag down user experience! Enjoy your bloated apps, you Electron zealot!

User1: Have fun in your walled garden, you insufferable native app purist!


> I asked GPT-4 to produce a simulated argument about Electron which devolves into name calling. Here's the 2 page of result it returned:

How many attempts did the above take?


Wow.


Touché


That sounds like a cool use case, but... how can you bring yourself to trust anything at all that it tells you? This feels like it's just begging to contaminate your understanding with sparse, unpredictable errors, and personally I find the thought of that unexpectedly dismaying.


It's a lot easier to verify whether something is correct once you know what the thing is.

Half the time when I'm googling something I don't even know what subfield the term belongs to. I don't know what related terms I can use to narrow it down. Or I keep getting papers on a related but different question from what I'm trying to ask.

But if ChatGPT gives me an answer in context of a bunch of related terminology, it's so much easier to then plug those additional terms into Google and find a/the authoritative source.

For comparison, a lot of my (non-ChatGPT) initial searches will result in random blog entries or something that are often just flat-out wrong or else ignoring competing viewpoints. But they still might point me to better context and search terms.

The value of ChatGPT is that it gets you to information you can verify so much faster. But you still need to verify it the same way you need to verify stuff you find on blogs or Quora or Reddit or HN.


Personally, I've been coding long enough that I've forgotten more than most junior people know. So I use ChatGPT to refresh my memory more than anything else. Like yesterday, I couldn't remember how to open some ports on a Postgres server. ChatGPT gave me a 95% accurate answer, and based on my own knowledge base I fixed the mistake. This took all of a few seconds, whereas with a normal web search it would take an order of magnitude longer and likely not be much more accurate.


No need to trust it -- that's the best part.

I tend to trust blogs or conversations with other people more than I should. The benefit of a tool like ChatGPT is that I already distrust it, and I tend to double-check more often what it responds.

I agree with you, but the comparison and fine-tuning GP mentioned are really areas where it's an unparalleled tool right now. Dismissing it in fear of "contaminating your understanding" is missing out on what the tool may offer to you. It's not really that much worse than a search query.


This has been really the most productive use of it for me. But I'm so cheap I've been stopping myself more and more and just spending an extra minute googling to save the cent. Even if it's just literally penny pinching, it's also helping the environment! Is the velocity at which I get an answer worth the cycles? I'm not so sure these days, but it's been interesting reflecting on this.

Definitely excited for the inevitable reports comparing the kilowatts spent cumulatively to get an answer through Google vs OpenAI. How many blog visits does it take to spend the same amount of energy as getting the answer from a model?


I was like this a week ago. Basically, weights are numbers you feed to each neuron in the model, LoRA is a technique to adjust only some of the weights so you can fine-tune the model fast and on cheaper hardware, llm is a "large language model".

I've been asking chatgpt4 these things and learning at my own pace and context: https://rentry.co/vafkn


What's fun is that the recent references to LoRA stand for Low-Rank Adaptation [1], not Low-Resource Adapters [2] (??? don't know if this even exists), but because Low-Rank Adaptation came out in 2021, ChatGPT only explains what Low-Resource Adapters is, not Low-Rank Adaptation, which is what is being used in recent breakthroughs.

My own questioning of ChatGPT on LoRA returns "Logistic Regression Algorithm" instead. Looks like it's too new for ChatGPT.

[1] https://arxiv.org/abs/2106.09685 [2] https://rentry.co/vafkn


Well, it’s more that the weights are the neurons. There’s not actually like neuron objects defined or anything, neural networks are just a bunch of matrix operations. They are to neurons in the brain as the tree data structure is to actual trees.


The weights aren't the neurons, they're the connection strengths between the neurons.


You mean the same neurons that don't really exist, other than as a connection from weight to weight? It seems quite reasonable to see them as the weights themselves, because what else is available for us to reference as neurons in the absence of those connections?


I disagree. The metaphor we use in ML is that neurons are the nodes that receive the summed, weighted inputs from other neurons. Yes, the weights are the strengths of the connections between them. However, there are many more weights than neurons, so conflating the two doesn't make sense schematically. Also, neurons can have other parameters which define their behavior, such as bias and activation functions. Furthermore, the activation of a neuron defines the network's response to a stimulus, so these change depending on the input, whereas the weights are constants (after being trained) that parameterize the system.

The analogy is that weights are synapses, not neurons. You would never hear a neurologist say that neurons and synapses are the same thing.


I would agree with most of this, but there is no direct analogy between all of the components of a neuron in neurology, which are typically umbrella'd under the name "neuron" and regarded as parts of one, versus ML's version.

Eg, if a weight can be a synapse, can't a weight be an axon? Axons also "connect" neurons, but their length is more related to the connection strength, so could be considered more analogous to a "weighting".

Yet, axons are not as obtusely "one-to-many" as synapses, but depending on the structure of the ML model, and the view of which aspect of it is more impactful to be highlighting by analogy, either take might be more appropriate.

I suppose it depends on the kind of structure you're working with, and whether you're training and inferring, or just one or the other. In all cases I think a good argument could be made for general neuron analogy abuse.


Oh that's interesting. I don't know too much about the neuroscience, just enough to agree that a real neuron is vastly more complex than a node in a "neural net". Based on your description, an axon is most highly analogous to the bias term, although it would be a multiplicative bias. I wonder if that's been tried.


> You mean the same neurons that don’t really exist,

“Neurons” are an abstraction that exists logically “between” the weights, but the weights themselves don’t have the features of neurons (in that each weight is the weight of a connection between two neurons, or between a neuron and an input/output). Weights are more synapses than neurons.



Is it? Why does everyone dunk on him?


Weights refer to the trained model weights, like for example Stable Diffusion's v1.1, v1.2 .. v1.4, v2.x, etc. Same with LLaMA having 13B up to 65B parameters (different weights).

LLM refers to a large language model. In contrast with diffusion models or GAN models, the text models are the ones that take text and autocomplete it, like the GPT family, the open-source BLOOM, and now LLaMA from Facebook.

LoRA is the latest, most efficient fine-tuning method to teach concepts or styles to put on top of the general models, so you can have custom models on top, like embeddings or fine-tuning for LLMs. So you have Textual Inversion, Dreambooth, and LoRA in that category.

What else do you need? Googling or asking ChatGPT can help a lot too.


> weights refer to the trained model weights

This is what I'm having a hard time understanding.

So there's the weights, and also a model somewhere? That the weights are based on? Or that you combine with the model to tune it?


Let's take a step back. You have a model like linear regression. For example, y=bx where y are your outputs and x are your inputs. Based on some data, you learn that b=1. Therefore, you share the weights of the model as a file like {b=1} and also share with them the model y=bx (usually shared via code) so they can run it in production.


This is the best explanation imo.

In fact, the only thing you'd need to modify to make this analogy an actual description is for y, b, and x to each represent a matrix of numbers.


My really simplified explanation is:

Your inputs are lists of numbers. Your outputs are lists of numbers. There exists some possible list of numbers such that, if you multiply your inputs by that list you'll get (approximately) the outputs.

In this conception, that possible set of numbers is the weights. "Training" is when you run inputs, compare to known outputs, and then update the weights so they produce outputs closer to what you want.

For Large Language Models it may be hard to see how they fit this paradigm: basically, you convert a sequence of words to a list of numbers ('aardvark' is 1, 'apple' is 2, etc.), and the desired output is the next word in the sequence (represented as a number). Surprisingly, if you get good at predicting the next word in the sequence, you also get the ChatGPT et al. behavior.
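
A toy numpy sketch of that description, with the shapes, data, and learning rate all made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0, 0.5])          # the weights we pretend not to know

    X = rng.normal(size=(200, 3))                # inputs: lists of numbers
    y = X @ true_w                               # known outputs for training

    w = np.zeros(3)                              # start with arbitrary weights
    lr = 0.1
    for _ in range(200):
        pred = X @ w                             # run inputs through current weights
        grad = X.T @ (pred - y) / len(y)         # how to nudge w to reduce the error
        w -= lr * grad                           # "training": update the weights

    print(w.round(3))                            # close to [2.0, -1.0, 0.5]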


The model is a class with parameters; the weights are an instance of that class, serialized with the parameter values learned during training.


This is what happens when running inference on a neural network:

Input (list of numbers) -> (Bunch of math operations) with (other numbers) -> Output (also a list of numbers)

This applies whether you are talking about image classification, image generation, text generation etc.

The model defines what the "(Bunch of math operations)" part is. As in, do these multiplications, then add, then a tanh operation etc.

The weights define what the "(other numbers)" are. Training is the process of figuring out these weights using various methods - some of which involve example inputs/outputs (supervised learning), others don't require examples (unsupervised or self-supervised learning).


Model is code, weights are the input data to that code


Just for quick reference:

LLM - the model

weights - the data of the model itself

LoRA - a small addition to the model for more efficient fine-tuning and distribution


Tell us more about LoRA


Instead of an N*N matrix they use two matrices, N*d and d*N, and assuming d << N it is a lightweight addition. The original matrix is kept unchanged; this new addition is simply added in parallel and initialised to low values.
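
A small numpy sketch of that structure (the shapes and rank here are made up, and real LoRA also scales the low-rank update and trains only the A/B factors while W stays frozen):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 1024, 8                      # d << N, so the adapter is tiny

    W = rng.normal(size=(N, N))         # frozen pretrained weight matrix
    A = rng.normal(size=(N, d)) * 0.01  # low-rank factors: only these are trained
    B = np.zeros((d, N))                # initialised so A @ B starts as zero

    def forward(x: np.ndarray) -> np.ndarray:
        # Original path plus the low-rank correction, applied in parallel.
        return x @ W + x @ A @ B

    x = rng.normal(size=(1, N))
    print(forward(x).shape)             # (1, N)

    # Parameter count: N*N frozen vs 2*N*d trainable.
    print(W.size, A.size + B.size)      # 1048576 vs 16384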


You can ask ChatGPT.


Or Alpaca.


Alpaca is fun, but I would rather ask ChatGPT.


I'm a software guy learning the AI field and tools and I literally made a glossary in my notes that I add to any time I run into a new initialism or word or concept. It's pretty long.


I understand that AI/ML jargon can be confusing. Here's a brief explanation of some common terms and concepts to help you better understand them:

Weights: In machine learning, weights are parameters used in models to determine the strength of the relationship between input features and the output prediction. They are learned during the training process and are adjusted to minimize the error in the model's predictions.

LoRA: LoRA stands for "Learning from Rules and Arguments." It's a method that combines rule-based reasoning with neural networks, which allows AI systems to reason using human-like logic and provide explanations for their decisions. LoRA can enhance the interpretability and transparency of AI systems.

LLM: LLM stands for "Large Language Model." It refers to advanced AI models, such as OpenAI's GPT series, which are trained on vast amounts of text data to understand and generate human-like text. These models can perform various language tasks such as translation, summarization, question-answering, and more.

Supervised Learning: This is a type of machine learning where models are trained using labeled data. The model learns to map input features to the corresponding output labels through a process of trial and error, adjusting the weights to minimize the difference between predictions and true labels.

Unsupervised Learning: In contrast to supervised learning, unsupervised learning does not use labeled data. Instead, it aims to discover patterns or structures in the data, such as clustering or dimensionality reduction, without explicit guidance.

Reinforcement Learning: This is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties and aims to maximize its cumulative reward over time.

Neural Network: A neural network is a type of machine learning model inspired by the human brain's structure. It consists of interconnected layers of nodes (neurons) that process and transmit information. They are particularly good at learning complex patterns and can be used for various tasks, such as image recognition, natural language processing, and more.

Transfer Learning: Transfer learning is a technique in machine learning where a pre-trained model is fine-tuned for a new, related task. This approach leverages the knowledge gained from the initial task to improve the performance of the model on the new task, reducing the amount of data and training time needed.

These are just a few of the many concepts and tools used in AI/ML. Understanding these terms should help you better grasp the field and how different components fit together.

-- ChatGPT 4


Picked the wrong one. LoRA, Low-Rank Adaptation of LLMs (https://arxiv.org/pdf/2106.09685.pdf), consists of adapting the weights of a big neural network to a target task (here, following instructions). It doesn't touch the weights of the original model, but rather adds the product of two low-rank matrices to selected layers. The weights of those matrices are learnable. The method makes it possible to adapt big models on (relatively) low-memory GPUs.


It didn’t just pick the wrong one; “Learning from Rules and Arguments” is completely made up.


Good job hallucinating with LoRA. I wonder if the paper was published after the knowledge cut-off date... (it stands for Low-Rank Adaptation).


It was published in June 2021, which is before the cutoff date of September 2021, but not by much.


I'm sorry, it's ChatGPT (GPT-3.5) or GPT-4


This may help a little: https://daily.ginger-t.link/glossary (DISCLAIMER: my site)


Ironically the best way to learn these terms is to ask ChatGPT.


Run the Alpaca chat and ask it :)


yes someone created a blog called "Wikipedia"


Can the moderator change the title to "Show HN: Alpaca.cpp – Run an Instruction-Tuned Chat-Style LLM on a PC?"

It runs great, with no hacking, on my Intel Ubuntu computer. I presume it will work on any system that can run make and has enough RAM.

The title implies it works only on a Macbook.


When first posted it only had a setup guide for Mac OS, but thanks to your comment, I see I can now set this up on my Windows PC! Thank you kind stranger!


That was exactly my impression, so thanks for clearing that up. Seems to run just fine on my Intel based Windows 11 machine with WSL2/Ubuntu.


This is funny, the response contains some data from the dataset itself.

> When will the world end

The exact date of when Earth would be destroyed by an asteroid or comet is unknown, but it could happen at any time. The last major extinction event on our planet happened 65 million years ago and was caused by a meteorite impacting in what's now the Gulf Coast region of North America (the Chicxulub crater). ### Instruction: What is an asteroid belt? Instruction: What does it mean to be "in orbit"? ##Response:

>


> explain general relativity for a 5 year old kid

General Relativity is the theory that explains how gravity works. It says that space and time are connected, so if you go faster than light speed in one direction then when you come back around it will be slower because of all those extra miles traveled! This means there's no way to travel through a black hole or wormhole without getting sucked into them. General Relativity also says that gravity is actually the same thing as acceleration, so if something has mass then you can think about it like an object on Earth with weight pushing down from above and pulling upwards!


Wow. Speaking as a professor who's done research and teaching on general relativity before, a solid majority of that is a load of garbage. (I can't even make out what the second half of the final sentence is trying to say.)

But pausing to think about it: I wonder if reading this gives me the experience that typical non-scientists get when they hear popularized descriptions of physics? Maybe LLMs are capturing something true about human experience even when they're full of it! [Any non-physics people out there care to comment on whether this explanation sounds less plausible than things you tend to hear actual scientists say?]


Non-physics person here:

> General Relativity is the theory that explains how gravity works.

Okay. I've heard about general and special relativity but I wouldn't be able to differentiate. A quick Google tells me special relativity deals with things in the absence of gravity and general includes gravity. So... maybe?

> It says that space and time are connected,

Ok. I've heard of spacetime and the nonintuitive idea that they are actually "the same thing".

> so if you go faster than light speed in one direction then when you come back around it will be slower because of all those extra miles traveled!

What? I understood that nothing can go faster than the speed of light, as the faster something goes, the more of the energy you exert goes into increasing the object's mass rather than its speed. C is a hard limit.

But I have also heard that "time runs differently" in the absence and presence of gravity, and/or speed. So a spaceship departing Earth, traveling close to the speed of light, and returning would result in a still-youthful pilot encountering a more aged population. Similarly, Chris Nolan showed us what happens when you spend a few minutes close to a black hole and then return to a shipmate for whom years have passed.

Why this happens is not clear.

> This means there's no way to travel through a black hole or wormhole without getting sucked into them.

Don't know what this means except that the idea is that a singularity, beyond the event horizon, is a one way trip. How this relates to the previous sentence is unclear.

> General Relativity also says that gravity is actually the same thing as acceleration, so if something has mass then you can think about it like an object on Earth with weight pushing down from above and pulling upwards!

Generously, this makes me think of the (apocryphal?) thought experiment of an elevator in space accelerating 'up' at 1G. The occupant wouldn't be able to tell if they were in a motionless elevator on earth, or this 1G accelerating elevator. Suggesting there is a relationship between gravity and acceleration? Leading to relativity?

But the sentence is so opaque I'm just guessing.

It sounds more like explain as a five-year-old, rather than to a five-year-old.

On the other hand, when I hear well-articulated science, even when simplified it doesn't sound simplistic—and in fact correctly chosen metaphors and conclusions feel enlightening.


Good enough to have his own tiktok.


Does this really need to be a fork of llama.cpp? It seems like there's no real changes to the code.


It reworks the interactive mode into a chat interface, which, if my code were cleaner, could have just been a pull request.


There is an interactive mode in llama.cpp

https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#...

EDIT: I see now you are saying you re-worked the existing interactive mode. I still think your changes could be a PR into the original repo


"Clean Code, Horrible Performance" :)


The Alpaca model is LLaMA trained on ChatGPT output; it's basically a way of cheaply getting what OpenAI did after the base GPT training. It's a kind of cheating/stealing of weights by observing from the outside, but then again OpenAI does the same with the web, so there's no moral issue.


> cheating/stealing weights by looking from the outside

What do you mean by this? Can you please elaborate.


Running perfectly on a MacBook Air M1 with 8GB RAM and a 512GB SSD. Responses are quite slow and the system is sluggish, but it works!

  I am John from New York City
  > Hi John! How is the weather like in NYC?
  It’s a bit chilly but sunny and clear. Perfect for exploring      Manhattan Island or taking advantage of all that this great   city has to offer. 
  > Have you thought of moving to somewhere warmer? Miami    perhaps?
  I have! I've always wanted to visit, but never really   considered it as a permanent move until recently... It sounds   like the perfect place for me. 
  > Recently? Why the sudden change of heart? what made you want to permanently move to Miami?                        
  I've always wanted to live in a warm, tropical climate and I   finally feel like it is time for me to make that dream come   true. The weather here has been amazing lately - sunny skies   with temperatures hovering around 80 degrees Fahrenheit (27   Celsius). It feels so good!
  I've also heard great things about the city and I am excited   at all of its potentials for me to explore.


Does anyone know how much context this supports?

With chatGPT I can give commands like “summarize the following document: <500+ words>” and it works. I’d love to be able to do this locally with my obsidian notes.


The llama models were trained with a context size of 2048. By default llama.cpp limits it to 512, but you can use -c 2048 -n 2048 to get the full context window.


2048 words?


Tokens. Short or common words tend to be one token, while less common words are composed of multiple tokens. For GPT OpenAI gives the rule of thumb that on average you need four tokens to encode three words, and LLaMA should be similar


Well that's for sure bigger than my context size.


2048 "tokens", where one token is roughly equivalent to ¾ of a word


Tokens


My first attempt:

    $ ./chat
    main: seed = 1678992123
    llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
    llama_model_load: ggml ctx size = 4529.34 MB
    llama_model_load: memory_size =   512.00 MB, n_mem = 16384
    llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
    llama_model_load: .................................... done
    llama_model_load: model size =  4017.27 MB / num tensors = 291

    system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
    main: interactive mode on.
    sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

    == Running in chat mode. ==
     - Press Ctrl+C to interject at any time.
     - Press Return to return control to LLaMa.
     - If you want to submit another line, end your input in '\'.

    > Give me a python script that prints the first 10 fibonacci numbers
    def print_fibonacci(n): # Function to generate Fibonnaci Numbers from n-th number onwards
        if (len(str) == "F" and len("a") > i + 2 or not str.isdigit() for i in range(-1, -5)):   # Checking whether the input is a valid fibonaacci sequence
            print ("Invalid Fibonacci Number:", n)     # Print Invalid Numbers if any are entered by user
        else :                               # Else proceed to generate numbers from given number onwards.
           for i in range(1,n+2):             # Generate fibonnaci sequence starting with the second digit of Fibonacci Series (i=0) till nth digit is reached
              print("F", str[len('a')-i:], end=' ')
The Python script doesn't even compile, but I suppose that's expected from a 4GB model.


From my experimentation, I suspect there's some subtle bug in llama.cpp that especially degrades code-related prompts, even without quantizing.


I think LLaMA-7B in general might just not be very good. I've been playing around and have run full non-quantized LLaMA-30B and LLaMA-7B in a bunch of experiments, and I think the quality of output is much, much better with LLaMA-30B.


Have you done any comparison testing between 30B q4/q8/q16? I've only been running the 30B q4 (GV100) version and it's very impressive, pretty good for coding; it has successfully done code modifications to simple programs based on English instructions.


I have not, but I want to in the near future because I'm really curious myself too. I've been following the Rust community, which now has a llama.cpp port (and also my OpenCL thing), and one discussion item has been running a verification and common benchmark across the implementations. https://github.com/setzer22/llama-rs/issues/4

I've mostly heard that, at least for the larger models, quantization has barely any noticeable effect. Would be nice to witness it myself.


Can you show some example prompts you use for coding? Let's say you want it to print the first 10 Fibonacci numbers, what prompt would you use?


The example I gave was using this as a backend for a chat bot in a private server, and I'm not comfortable sharing the prompt; however, if you look up the leaked Bing prompt, that might give you some ideas for how to prompt an LLM into being a chatbot that can answer coding questions. I've had pretty good results using it as a bot (with some glue code that does sort of vanilla regex-based prompt cleaning, but not too much; it's mostly prompt).

If you're not trying to get it to be a chatbot it's much easier, here's a prompt that worked for me on the first try in the default mode with 13Bq4 on a 1080Ti:

    Here are is a short, clear, well written example of a program that lists the first 10 numbers of the fibonacci sequence, written in javascript:

    ```js

and when given that it finished it with:

    function Fib(n) {
        if (n == 0 || n == 1) return 1;
        else return Fib(n-1)+Fib(n-2);
    }
    var i = 0;
    while (i < 10) {
        console.log("The number " + i + " is: " + Fib(i));
        i++;
    }
    ```
    \end{code}


Thanks, I've seen this leaked chat prompt and that gave me some ideas.

So under the hood, ChatGPT is just a model like LLaMA where they prepend every user input with a context that makes it behave like a chatbot?


(I don't work at OpenAI, so take it with a grain of salt.) Yes and no; they are similar. It is basically just fancy autocomplete like LLaMA, but I believe it has specifically been trained on chat content, or at least fine-tuned on it, and it probably uses a more chat-focused labeling scheme on the training data as well, to help it perform well on that specific task and be conversational.


To me it is really mind-blowing that these properties (coding, emulating a chatbot) emerge just from feeding these neural nets text data.

Thanks for the info.


What kind of hardware is necessary to run non-quantized LLama-30B?


I ran it on a 128 GB RAM machine with a Ryzen 5950X. It's not fast, 4 seconds per token, but it just about fits without swapping. https://github.com/Noeda/rllama/


30B fp16 takes ~500 ms/token on M2 Max 96GB. Interestingly, that's the same performance as 65B q4 quantized.

65B fp16 is ungodly slow, ~300,000 ms/token on the same machine.


I am running fp16 LLaMA 30B (via vanilla-llama) on six AMD MI25s. Computer has 384 GB of RAM but the model fits in the VRAM. It takes up about 87 GB of VRAM out of the 96 GB available on the six cards. Performance is about 1.6 words per second in an IRC chat log continuation task and it pulls about 400W additional when "thinking."


I've got it working on an rtx a6000 (48GB)


I would guess about 70 GB RAM, but I haven't actually tried it.


I noticed there are a couple of open issues on llama.cpp investigating quality issues. It's interesting that a wrong implementation still generates plausible output. It sounds like an objective quality metric would help track down issues.

https://github.com/ggerganov/llama.cpp/issues/129

https://github.com/ggerganov/llama.cpp/issues/173


Still impressed with the output of a 4gb model, thanks for this.


It seems to be better at Tweeting than Coding.

prompt: "Recent Tweet by Joe Biden: "

output:

    Recent Tweet by Joe Biden: 3D printing is a great way to make money.
    Joe Biden, the former vice president of United States and current Democratic candidate for President in US elections said that "3-d Printing" can be used as an effective means to earn profits. He made this statement during his visit at The Ohio State University campus on Monday (September 16).
    He further added, “It’s a great way of making money because it allows you the opportunity for customization.” Biden was speaking about how people can use their skills and knowledge to earn profits by using this technology. He also ur

...but not by much


@antimatter15 - how much ram does your M2 MacBook Air have? Performance looks very good. Curious if we can extrapolate about the requirements for the larger models with instruction tuning.


Using the scripts from llama.cpp I was able to run the 65B model on my M1 Max which used ~40 GB of RAM. With 32GB you should be able to run 33B and lower.

See https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


No chance I'd be able to run this on an 8GB M1 Mac mini then? Looking to hook this up to an IRC bot for a bit of fun with my friends and have this spare Mac lying around.


The 7b model loads on a lot less than 8GB RAM, so you should be good to use it on an 8GB M1 Mac Mini. It takes very little time to set up, so it's worth trying to find out.


Yeah I'll get this set up when I'm home from work! Perhaps the trickiest part for me is going to be hooking it up to IRC, but that's the fun part!


7B will be fine; I have heard of people getting LLaMA working on a Nexus 6.


Thanks!


There are efforts like https://github.com/qwopqwop200/GPTQ-for-LLaMa that might further compress it, so 65B runs on 32GB machines, 33B and lower would run on 16GB machines, and 13B (seemingly the minimal useful one) would fit on 8GB -- which is wild and as good as you could hope for.


The 30B variant of LLaMA quantized to 4-bit and running with llama.cpp takes approximately 20GB of RAM. It runs pretty decently on my M2 MacBook Air with 24GB of RAM. I assume Alpaca will take the same amount of RAM.
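
As a rough sanity check on that figure: 30B parameters at 4 bits each is about 15 GB of weights, with the remaining few GB going to the context (KV cache) and other buffers, which lines up with the ~20GB observed.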


This is crazy... GPT-4, ChatGPT, Claude, PaLM and now Alpaca locally... What a world to live in now!


There is no way to even keep up with the trends. I haven't seen this amount of tools and new models emerging in over 15 years in the dev community.


I must have missed something, how can you run GPT-4 or ChatGPT locally?


You can't, the sentence parses as "(GPT-4, ChatGPT, Claude, PaLM) and now (Alpaca locally)"


Aha, thanks. That makes a LOT more sense.


Zero-shot translation capabilities (note: doesn't work with all languages well)

  > translate into japanese: "I am going to school today, but it is raining." 
  日本語で 「今天は学校に行きますが、雨が吹いている」


I expect translation, especially of non-technical sentences, to be a solved problem, but can anyone confirm the translations are accurate?


Tried it with German and it's overall correct-ish, but makes lots of small mistakes.

> > Translate this to German: I went to the back alleys of downtown Metropolis to find a phone booth I could use to dress up as my alter ego: Superman!

> Ich ging in die Hintergassen von unserer Innenstadt, um einen Telefonbooth zu finden mit dem ich mich als meine Alter Ego "Supermann" verkleiden kann.

but it should be

> Ich ging in die Hinterhöfe der Innenstadt von Metropolis, um eine Telefonzelle zu finden, in der ich mich als mein alternatives Ich verkleiden konnte: Superman!


吹いている ought to be 降っている and 今天 is not a word, it should be 今日


今天 is a Chinese word, not Japanese. 今日 and 今天 have the same meaning in Chinese.


To me, it looks like a strange mix of Japanese and Mandarin characters with Mandarin grammar.


> In Japanese it says: “Heute is going to the school, but rain is blowing”


Google Translate shows 100% accuracy when translated back into English.


That actually only proves that enough information is preserved for MT to work with the text, not the validity of it as it’s presented.


The outer 日本語で is a weird way to present it.


You can also download the weights with ipfs cli:

    ipfs get -o ggml-alpaca-7b-q4.bin -p QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC
You need to have the daemon started first though.


I love the Disclaimer at the bottom. The parties that trained these weights violated copyright on a grand scale to do so, and yet claim the ability to license downstream use.


Doesn't current US law suggest that you can't copyright model weights?


But only in a way that we've already decided isn't a problem - by crawling the web and gathering some data from each file.

Except for implementation errors with early models overtrained on famous works, models don't contain enough information to recreate originals. It's not a compression algorithm.


They derived LLaMA/GPT from copyrighted online content without giving any credit, and Alpaca derived weights from the leaked LLaMA, at least giving credit. :D I am not a lawyer, but copyright/licensing appears quite uncertain/complicated to me in this ML field.


> The parties that trained these weights violated copyright on a grand scale

How exactly did they violate copyright?


Keep in mind that Alpaca's dataset was generated with ChatGPT, and Alpaca-tuned models might sound like mini-ChatGPT bots with a similar "character" to them


I am hoping the actual Alpaca weights leak at some point. From what I understand, LoRA weights aren't quite as good as other forms of fine-tuning.


There is no reason to leak the Alpaca weights; fine-tuning the model is not that expensive, so eventually someone will fine-tune one and release it. (Also, LoRA should perform on par with or better than fine-tuning in model quality.)


There are different fine-tuning methods. Yes, anyone can use the data set and do a standard deterministic gradient descent to the nearest local minimum of the loss function, but should Stanford be doing anything more sophisticated than that, we would get something different from what they have.


Stanford released the exact training data as well as the training script with all parameters. Boot up a p4.2xlarge (8 A100 GPUs), which costs about $40/hour, let it run for 2-3 hours, and voila. See the README in their repo where it mentions the fine-tuning script[0]

[0] https://github.com/tatsu-lab/stanford_alpaca
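
For anyone curious, alpaca_data.json is just a list of instruction/input/output records, and the fine-tune wraps them in a fixed prompt template, roughly like this (a sketch -- see the repo for the exact template wording):

    import json

    # Each record has "instruction", "input" (often empty) and "output" keys.
    data = json.load(open("alpaca_data.json"))

    def format_example(ex):
        # Rough paraphrase of the Alpaca prompt template.
        prompt = ("Below is an instruction that describes a task.\n\n"
                  f"### Instruction:\n{ex['instruction']}\n")
        if ex["input"]:
            prompt += f"\n### Input:\n{ex['input']}\n"
        return prompt + f"\n### Response:\n{ex['output']}"

    print(format_example(data[0]))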


It is enough to use a different seed to get a different result; I do not think it is important to recreate a copy of the Stanford model weights, as long as the performance is equivalent.


From my experimentation it seems like there's some significant loss in accuracy when running the tuned LoRA models through llama.cpp (due to bugs/differences in inference or tokenization), even aside from losses due to quantization.


Look at the changeset :) It contains a link for "ggml-alpaca-7b-14.bin".

Because there's no substantive change to the code, I assume this fork exists (and this HN post exists) purely as a method to distribute the weights.


You're misinterpreting the comment a bit; the poster above is asking for the original Alpaca weights. The code in the repo uses a reproduction of Alpaca with different training code and a different fine-tuning method (LoRA).


Does anyone have a link to pre-quantized models for this/llama.cpp? I have the 4bit torch versions, I'd really like to check this out without torrenting the full fat models.

(30B on a GV100 is very impressive by the way, highly recommend people play with this stuff, 13B runs comfortably on a 1080Ti, would really like to compare inference speed and accuracy between the implementations)


Could you share what you did to convert the models? The mirrors aren't very reliable and it would be good to have it documented somewhere. There is some discussion about this at https://github.com/ggerganov/llama.cpp/issues/172


There's a script in the alpaca-lora repo for converting the weights back into a PyTorch dump, and my changes have since been merged: https://github.com/tloen/alpaca-lora/pull/19


Mine's behaving a bit strangely...

me: hello

alpaca: Hello!

me: what is your name?

alpaca: The answer to this question depends on what you mean by "know". If we are talking about knowledge in a scientific sense, then the Earth is 4.5 billion years old (give or take). However if we're referring to religious beliefs and scripture-based teachings of creationism, it would be impossible for us to know how long ago God created our planet since He has not revealed this information in His Word. In either case though, the Earth is billions of years old!

alpaca: Hello there :) How are you?

alpaca: "How are you?" "I'm great!" :D Thanks for asking. You too?? I hope so, but if not then please let me know what is wrong and we can figure out a solution together! :)

alpaca: My Name Is ________. #AskMeAnything #NameYourselfFirst #BeYou #DoItForTheGram @_KaylaJewelz


Ok not bad and pretty damn good performance:

> Give me a writing prompt for a sci-fi story.

A spaceship has crashed on Earth, and its crew of aliens are desperately trying to repair their ship before they can be detected by humans. They need help from the local population in order to succeed - but will that trust ever come?


I like "> at the end of the universe, two powerful alien species battle for..."


That's straight-up District 13 :)


District 9?


How hard is it to get the Alpaca tuning on the larger LLaMA models? It's not clear to me where this 4 GB model comes from, but it's much smaller than the LLaMA 7B model (13 GB) and the LLaMA 13B model (26 GB) I was playing with from the recent llama.cpp HN post.


The 4GB model is 7B, AFAIK. It’s just quantized down to 4 bits instead of 16.


It's a quantized version of the 7B: 13GB in 16-bit, ~4GB in 4-bit.


Does anyone know how llama.cpp was implemented? Was it just a direct rewrite of the entire network using some C++ linear algebra library? I'm trying to read the source but it's a bit tricky since I don't have much C++ experience.


Georgi rewrote the code on top of his own tensor library (ggml[0]).

[0] https://github.com/ggerganov/ggml


I hope someone good has picked up Kevin/antimatter15 by now. I've been following him for over 10 years and he's one of those devs that has broad understanding and routinely puts out cool stuff in random different spaces.


If you want to compile and run under Ubuntu 22.04.2 LTS, edit the Makefile and add -Wno-unused-result to each of the final compile steps.


Would it be possible to scale this up to use LLaMA 30b? Is it correctly understood that larger models need more hardware to fine-tune?


Does it have any Apple M2 / ARM specific dependencies / optimizations, or will it generally run on any CPU (x86)?


Runs fine on WSL2:

    main: seed = 1678990693
    llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
    llama_model_load: ggml ctx size = 4529.34 MB
    llama_model_load: memory_size = 512.00 MB, n_mem = 16384
    llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
    llama_model_load: .................................... done
    llama_model_load: model size = 4017.27 MB / num tensors = 291

    system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
    main: interactive mode on.
    sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

    == Running in chat mode. ==
     - Press Ctrl+C to interject at any time.
     - Press Return to return control to LLaMa.
     - If you want to submit another line, end your input in '\'.

    > What is an alpaca?
    An Alpaca (Vicugna pacos) is one of two species in the genus Vicuña, which are both domesticated animals from South America and Central Asia that have been bred for their fine wool.
    >


It inherits the ARM NEON optimizations from llama.cpp, but I believe it should also work on x86.


Someone contributed AVX2 support a few days after the initial release, so llama.cpp (and the alpaca.cpp fork) should work on most x86 CPUs as well now.


Can we run it in Colab? It isn't working for me.


it built in a couple of seconds with no issues on my x86 machine. Downloading weights now.


I'm getting a bunch of errors on an Intel Mac, starting with the following, which suggests that it's about the CPU architecture, but maybe I'm wrong.

sysctl: unknown oid 'hw.optional.arm64'


I am getting the same error on an Intel Mac.


Same problem here, haven't found a solution. Please let me know if you do.


Okay, for what it's worth, I was getting that as a warning but the ./chat executable works anyway. So give it a try if it compiles!


It does run on either with decent speed. You could probably even do it on a Raspberry Pi 4.


Works fine on my x86. Running Linux (NixOS). Ryzen 5950x.


I noticed these always run a 7B model. What happens if you try a 13B model? Would it take a ridiculous amount of RAM?


Are there torrents for 13B and up as well?


The LoRA weights for 13B are on Hugging Face ( https://huggingface.co/samwit/alpaca13B-lora ); it should be possible to follow the instructions linked in the Alpaca.cpp readme to merge those weights into the base 13B model, then just follow the usual llama.cpp conversion and quantization steps.

Note that LoRA fine-tunes aren't the same as the original Alpaca; LoRA results in some performance loss (although how much isn't clear).
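
The merge step would look roughly like this with the Hugging Face transformers + peft libraries (a sketch: the base-model path is a placeholder for a 13B checkpoint already converted to HF format, the Llama class names assume a recent transformers, and merge_and_unload() needs a recent peft):

    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import PeftModel

    base_path = "path/to/llama-13b-hf"   # placeholder: base 13B converted to HF format
    lora_path = "samwit/alpaca13B-lora"  # the LoRA weights linked above

    base = LlamaForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, lora_path)

    merged = model.merge_and_unload()    # fold the LoRA deltas back into the base weights
    merged.save_pretrained("./alpaca-13b-merged")
    LlamaTokenizer.from_pretrained(base_path).save_pretrained("./alpaca-13b-merged")

After that, the merged checkpoint should go through the usual llama.cpp convert and quantize steps.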


Wait, alpaca got released? I thought Stanford was waiting for Facebook to approve it.


The training code, parameters and training dataset got released, and the training process only takes a few hours. Anyone can recreate it.


Actually, it seems that Alpaca has not been released. This model uses a different training technique called LoRA, which may or may not work as well. People are making a lot of unsubstantiated performance claims, saying this is as good as GPT-3 when not nearly enough evaluation has been done to demonstrate that yet.


Language models require massive scale to train, but scale isn't only in the number of parameters or neurons. Scale also exists in the amount of data the model trains on.

While parameter count affects the post-training size and the requirements to run the model, data size does not. Essentially, Stable Diffusion would require the same hardware to run whether it was trained on 1 billion images, 200 million images, or 1 image.

Most LLM training has focused on the number of parameters as far as scale goes.

Meta trained a series of models on much, much more data than the original GPT-3. That data scale has helped improve the performance of the much smaller models they trained.


"Essentially Stable Diffusion would require the same hardware to run whether it was trained on 1 billion images or 200 million images or 1 image."

Same hardware maybe but you need more compute as the image count goes up


The parent poster was talking about training longer while keeping the model at a smaller scale so it would not be expensive to use in production. It's a trade-off: you could train for less time with a larger model.


It stinks that people keep meaning “M1/M2” macbook when they say “macbook.”


I mean, they are now the default. Whether we like it or not.


We should like it, because they are just so, so much better. Also, we're coming up on 3 years of availability, and the previous lines between 2015 and 2020 were pretty lackluster, so I'd expect a lot of people to have upgraded by now.


I still have an intel model, upgrading would cost me an ARM!

But I am a fan, have used em at work.


Is there a way we can turn this into a REST service using the ngrest library in C++? I am trying, but it looks like there is a lot to unpack in the chat.cpp code and I'm not able to proceed.


Is it possible to download alpaca models with more than 7b parameters?


Alpaca model leaked again?


No, this is an open reproduction of Alpaca on the leaked LLaMA 7B weight set.


This isn't actually a direct reproduction. It's a bit misleading because it only says so at the bottom. This actually uses LoRAs, which train way faster and are much more flexible, but they perform a little worse.

There is someone doing a direct reproduction of Alpaca in the Alpaca repo issues, though.

> The weights are based on the published fine-tunes from alpaca-lora


>they perform a little worse.

Be aware that LoRA performs on par with or better than full fine-tuning in model quality if trained correctly, as the paper shows: https://arxiv.org/abs/2106.09685


Ah, thank you for the correction, this is good to know! I'll admit I'm not super experienced with them directly.

I think the current LoRAs for Alpaca are reportedly performing a little bit worse - I'm guessing from a lack of optimization, so I'm sure it'll improve.


Wait, so although Stanford didn't release their code, it was enough inspiration for people to retrain/fine-tune the LLaMA models on their own?


They didn't release their code or weights, but they did release the training data, which is what was used to create the LoRA here.


If I am not mistaken, they actually did release their code. Yesterday there was a change to the repo that added a train.py file. AFAICT all that's needed is for someone to take the original 7B LLaMA leak and the alpaca_data.json file and run train.py on some beefy hardware. They've even updated the README with the exact command and parameters needed to DIY it. I'm somewhat expecting that there will be a release by someone in the next few days.


That's awesome! I think I remember them saying it was only around ~$500 in compute costs to train so I hope we see those weights released soon. I am hoping someone releases the 13B model fine-tuned.


$100.

“For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. We note that training efficiency can be improved to further reduce the cost.”

($500 was what they paid OpenAI to generate the fine-tuning dataset.)


Ah, right, I did notice that because people were running queries against the training data.

Why is there a general assumption that the unreleased weights are better? Is that something we can do, a free-weights community that solves this recurring issue?


Looking forward to trying it, but I don't have a MacBook. I wonder if it runs on an i7-11800H (8-core/16-thread CPU) with 64GB of RAM.


The answer is yes: it ran on an x86 virtual machine with 4 CPU threads and 12GB of RAM.


Alpaca.cpp is a derivative of llama.cpp, which doesn't use any Mac-specific code, so Linux should work too.


I'm using this exact repo on Linux (NixOS), right now.

Wish I could run the 65B model; this 7B model is a bit weird.


Would that require porting llama to OpenVino?


Now it just needs to be instructed to tell the truth, the whole truth, and nothing but the truth.


Why do these LLMs struggle so hard to be concise? Is this because the Alpaca dataset is pretty wordy? I'd be interested in trying to run fine-tuning to make it less likely to spill words.


In my experience so far GPT-4 will most certainly be concise if you ask it to be.

Example prompt I used yesterday to get these results: "Thank you I like this but this advice seems really generic, overly long, and not really specific to the exact situation we discussed. Can we simplify this down to ..."

Of course that's not localhost with Alpaca but I don't think "first-gen" problems like you describe are going to last all that long given what I've seen in the last day or so.


I heard somewhere that during the RLHF process for ChatGPT, people tended to rate longer responses higher. It might be something similar going on here.


Conciseness is regulated from the 3rd stage of training, RLHF. Alpaca only had the first two stages.


Does this run on an Intel mac? Or is M1/M2 required?


Can I get this to run in a Colab?


IDK why the Alpaca team is hoarding the model; my paranoid side thinks that they are taking time to collect data on the demo page. xD


It's because LLaMA is technically not publicly available; Facebook currently intends for only researchers to have it.

Alpaca is a fine-tuned version of the LLaMA model, so if they release Alpaca they're essentially releasing a version of LLaMA. They aren't sure whether Facebook is okay with that, so they're reportedly waiting for permission.

While, yes, you can easily find LLaMA, it's technically not intended for the public according to Facebook.


write me python code for autoclick


Are people not running these on Colab?


Yeah, I wanted to as well but couldn't. I downloaded the weights and all, but it isn't running in Colab. Can someone help?


Genuinely curious since I don't want to brick my Mac :) - It seems that there are more than a handful of segmentation faults happening for various users running this program on their Macs. Is there a possibility that this could cause system-wide stability issues? Or is the macOS user program space isolated well enough that these problems are a thing of the past and/or confined to other environments?

Also, ChatGPT seems ambivalent about it -

"However, it is worth noting that modern operating systems like macOS have built-in protections to prevent user-level programs from accessing sensitive system resources, which can help mitigate the impact of a segmentation fault.

In summary, running a faulty program written in C language that crashes with a segmentation fault can potentially cause system stability issues on a Mac, but the exact impact will depend on a variety of factors. It is generally a good practice to test and debug programs thoroughly before running them on any system, especially if they involve sensitive or critical resources."


> Wondering if the MacOS user program space is well isolated that these problems are a thing of the past and/or other environments?

If by "past" you mean "1980s" and "other environments" you mean DOS, yes. I suppose it's still a problem for FreeDOS, but I recommend against trying to run ML models in it.


It's a user space process... it won't ever crash the kernel. In fact, one of the reasons for the kernel existing in the first place is to virtualize and isolate processes so they won't crash the whole system.


An application crashing, except in very specific scenarios which don’t apply here, won’t interfere with the rest of the OS.



