Guess I'll watch TheBloke's page until I can run his Miqu Q5 quant on my MacBook. Mixtral is my daily driver, and if this (or the newer, official) release nears GPT-4, that's a wrap for my OpenAI subscription.
The small team at Mistral is putting their competitors to shame. They're what "Open"AI should've been.
The GGUF quants are the only quantization we have lol. They were leaked as Q2_K/Q4_K_M/Q5_K_M; you can grab them right now.
What's interesting is that Mistral apparently distributed these GGUFs. This is (in my experience) not a good format for production, so I am curious exactly who wanted to test a model in GGUF.
It's by far the easiest format to consume and run, no? Spans devices easily. One file, no scattered weight shards or extra stuff. For "production" maybe not, but to get it into the hands of the masses this seems perfect.
Yeah, it's by far the least trouble. Pretty much any other backend, even a "PyTorch-free" backend like MLC, is an utter nightmare to install, and that's if it uses a standardized quantization.
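For anyone who hasn't tried it, here's roughly what "least trouble" looks like with llama-cpp-python - just a sketch; the file name, context size and layer count are placeholders:

```python
from llama_cpp import Llama

# One pip install, one file; offload to GPU/Metal with a single argument.
llm = Llama(
    model_path="./miqu-1-70b.q5_K_M.gguf",  # whatever GGUF you downloaded
    n_ctx=4096,
    n_gpu_layers=-1,  # offload everything that fits; set 0 for pure CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this leak in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```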
However, the llama.cpp server is... very buggy. The OpenAI endpoint doesn't work. It hangs and crashes constantly. I don't see how anyone could use it for batched production as of last November/December.
The reason I don't use llama.cpp personally is no flash attention (yet) and no 8-bit KV cache, so it's not too great at long (32K+) contexts. But this is a niche, and it's being addressed.
Co-writing a long story with the model so it references past plot points.
The story writing in particular is just something you can't possibly do well with RAG. You'd be surprised how well LLMs can "understand" a mega context and grasp events, implications and themes from it.
Thanks. Curious, as I haven't been in a forum where I've seen this asked:
Have you found a particular manner in which to feed it in - do you give it instructions for what it is looking for, and what kind of phrasing do you use to direct it?
I am about to attempt a GPT co-piloted effort, and I have only done art so far - so I'm curious if there are good pointers on coding with GPT?
continue.dev works great for me: it supports VS Code and JetBrains IDEs, there are shortcuts to give it code snippets as context, and it does in-place editing. Works with all kinds of LLM sources, both GPT and local stuff.
Still haven't found anything that can read a whole project's source code in a single click, though.
Thank you - so you want the most efficient small set of input tokens for the maximum output tokens, or at least the best answer with the smallest number of tokens used - but enough headroom that it won't lose context and start down the hallucination path?
Going to be extremely opinionated and straightforward here in service of being concise, please excuse me if it sounds rough or wrong, feel free to follow-up:
You're "not even wrong", in that you don't really need to worry about ratio of input to output, or worry about inducing hallucinations.
I feel like things went generally off-track once people in the ecosystem turned RAG into these weird multi-stage diagrams when really it's just "hey, the model doesn't know everything, we should probably give it web pages / documents with info in it"
I think virtually all people hacking on this stuff daily would quietly admit that the large context sizes don't seem to be transformative. Like, I thought it meant I could throw a whole textbook in and get incredibly rich detailed answers to questions. But it doesn't. It still sort of talks the way it talks, but obviously now it has a lot more information to work with.
Thinking out loud: maybe the way I think about it is that the base weights are lossy and unreliable. But, if the information is in the context, that is "lossless". The only time I see it get things wrong is when the information itself is formatted weird.
All that to say, in practice, I don't see much gain in question-answering* when I provide > 4K tokens.
But, the large context sizes are still nice because A) I don't need to worry as much about losing previous messages / pushing out history when I add documents as when it was just 4K for ChatGPT. B) It's really nice for stuff like information extraction, ex. I can give it a USMLE PDF and have it extract Q+A without having to batch it into like 30 separate queries and reassemble. C) There are some obvious cases where the long context length helps, ex. if you know for sure a 100 page document has some very specific info in it, you're looking for a specific answer, and you just don't wanna look it up again, perfect!
* I've been working on "Siri/Google Assistant but cross platform and on LLMs", RAG + local + on all platforms + sync engine for about a [REDACTED]. It can nail ~every question at a high level, modulo my MD friend needs to use GPT-4 for that to happen. The failures I see are if I ask "what's the lakers next game", and my web page => text algo can't do much with tables, so it's formatted in a way that causes it to error.
In production you'd need enough GPUs to load the entire thing so it serves fast enough, so it makes sense to go with AWQ or EXL2 instead since they're optimized for that, but I think they were sending it out for people to just test and evaluate, probably on less capable machines, in which case GGUF is king.
I think it's fair to say when you hit the hundreds of millions of dollars mark the diminishing returns for making things happen faster have well and truly kicked in.
Perhaps the only benefit would be extra computational power, yet I would struggle to understand the benefit of jumping from 500 million to 5 billion in such a short timeframe.
The ability to truly not think about training run costs, to throw random things at the wall and see what sticks. 10x the resources is definitely a competitive advantage in LLM training.
It's a much smaller model, it didn't ingest the knowledge of the world. It is great at working with the information you give it, but if you want to extract information like a search engine, GPT4 is the king because it is so much bigger.
That is a good thing: you want the model to process language; knowledge is a side effect of how it gets there. If you rely on training knowledge, it's going to be hard to know when you pass the boundary into hallucination. What you want instead is a model that can approximate reasoning while supporting tools inject knowledge from the world into the context as it iterates toward an answer.
But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.
Knowledge is one aspect of it; I found its instruction-following ability frustrating as well.
I think Ilya Sutskever puts it very well: a larger model brings stability to a wider range of tasks that smaller models are not capable of. Even though book recommendation with GPT might be a niche, it is not something really unexpected TBH, hence my disappointment.
You can still leverage the LLM's ability to understand language to do better than search.
You can use it to refine or expand your search terms; you can ask the LLM to ask you questions about what book you'd like next and convert your answers into genre/theme/setting/mood to feed into the next step of the pipeline; you can tell the agent to narrow search results by asking you partitioning questions about the list of books the search returned, instead of just giving a dry ranking.
There is so much RAG can do if you use the LLM part properly. If the RAG pipeline goes question > embedding > search > summarization, then yeah, you're getting an expensive slow parrot no better than just searching - but that's because it's using the baseline RAG that uninspired consultants describe in blogs in a scramble to position themselves as "experts" in the new market.
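A hypothetical sketch of that kind of pipeline - ask_llm and search_books are stand-ins for whatever model client and index you already run:

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your local model or an API here")

def search_books(query: str, k: int = 20) -> list[str]:
    raise NotImplementedError("embed + search your catalogue here")

def recommend(user_request: str) -> str:
    # 1. Expand a vague request into genre/theme/setting/mood terms.
    facets = ask_llm(
        "Rewrite this book request as genre/theme/setting/mood keywords:\n"
        + user_request
    )
    # 2. Ordinary retrieval over the expanded terms.
    candidates = search_books(facets)
    # 3. Instead of a dry ranking, have the model narrow the list by asking
    #    the user one partitioning question about the candidates.
    return ask_llm(
        "Given these candidate books, ask the user ONE question that best "
        "splits them into groups:\n" + "\n".join(candidates)
    )
```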
> But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.
This is a really good point. I was working on some careful Q&A data curation today that is then fed to a vector store where embeddings are calculated and served. I realized that my carefully curated Q&A data, in combination with the vector database, works just fine for what I want to do all by itself. A really good semantic search of my Q&A database turns up answers to my questions with no RAG LLM prompting required. When I added the LLM it just put the same information in different words - not super useful when I could have just looked at the returned results and gotten the same information.
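That "embeddings alone are enough" setup looks roughly like this (a sketch; the model name is just a common default, not a recommendation, and the Q&A pairs are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Curated Q&A pairs - embed the questions once, serve them from any store.
qa_pairs = [
    ("How do I reset my password?", "Settings -> Security -> Reset password."),
    ("What is the refund window?", "Refunds are accepted within 30 days."),
]
corpus_emb = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)

query = "I forgot my password, what now?"
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(qa_pairs[best][1])  # the curated answer, no generation step needed
```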
I agree this is what we want: we should not waste compute and money on memorizing world knowledge - just enough world knowledge to have the best possible understanding of language.
Yes, this is exactly the right direction for LLMs. Embedding all the world's knowledge into the model itself is just a bad idea; world knowledge should only be used to train reasoning and logic skills.
I'm sure their competitors aren't shamed - nor should they be. OpenAI, especially, has done tremendously good work. If it weren't for them, perhaps Mistral wouldn't be news today?
Competitors are highly motivated to improve.
This is the good side of free markets, when they work well. And why the fearmongering calling for rent-seeking regulation of AI is shameful.
> An over-enthusiastic employee of one of our early access customers leaked a quantised (and watermarked) version of an old model we trained and distributed quite openly.
> To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.
Makes me wonder if I should be giving my money to them instead of OpenAI. Do they have a chat interface like ChatGPT? Without signing up first, it looks like they only have endpoints?
You can buy a good GPU and quickly come below cost for GPT APIs depending on your workload. I’m doing millions of tasks, so eating $8k on a custom build for ML up front is a long term cost savings. Not least of all that every other month there’s an even better model.
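Back-of-envelope on when that flips - prices and token counts are assumptions (roughly GPT-4 Turbo-era rates), so treat this as illustration only:

```python
tasks = 2_000_000
in_tok, out_tok = 1_500, 300          # per task, assumed
price_in, price_out = 0.01, 0.03      # assumed $/1K tokens

per_task = (in_tok * price_in + out_tok * price_out) / 1000
print(f"API cost: ${tasks * per_task:,.0f}")                        # ~$48,000
print(f"Break-even vs $8k build: ~{8_000 / per_task:,.0f} tasks")   # ~333,333
```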
I don't see myself doing millions of tasks on 70B+ models. Work pays for more GPT-4 than I can chew; I can run up to 30B models on my Mac at a decent speed, and I can use Mistral's APIs when I travel without powering my own home server. I can see how buying Nvidia GPUs would make sense for a heavy user... Also, I had a gaming addiction for years and try to stay away from anything that can be used for gaming.
I’m bootstrapping a startup, so it ends up being cost effective. I also think the long-term outcome is open models will win, so I’m betting on that future while also maintaining OpenAI compatibility in the present.
I've used Dolphin Mixtral 8x7B way more than ChatGPT-4 for the past several months.
If any of this sentence makes sense: I have an M1 with 64 GB RAM and I use 5-bit quantization with Metal and a 10,000-token context window. It's around 21 tokens/sec, which is a little faster than the text speed at which ChatGPT-4 responds.
I can answer that from experience (just built a couple Q&A style sites), although a bit subjectively.
Compared with the GPTs, Mixtral (and derivatives) perform:
* 5% of the time as good as GPT4
* 75% of the time on par with GPT3.5
* 20% of the time a bit worse than GPT3.5 (too chatty and hallucinates with ease, although one could argue a better prompt could improve things a lot)
Advantages of Mixtral, for me:
* Cost 10-100x cheaper
* Faster completion time
* More deterministic output, in the sense that if you run the same query several times you get the same answers back (but they could be wrong). GPT almost always gives me an answer of great quality BUT with a lot of variance across them; even with temperature set to zero. This is a PITA when you want some sort of predictable, structured output like a JSON object.
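For concreteness, "temperature set to zero" here means requests along these lines - a sketch against any OpenAI-compatible endpoint; the base URL and model name are placeholders, and the seed parameter is only honored by some backends:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mixtral-8x7b-instruct",
    temperature=0,
    seed=42,  # ignored by servers that don't support it
    messages=[
        {"role": "system",
         "content": 'Reply ONLY with a JSON object: {"title": str, "year": int}.'},
        {"role": "user",
         "content": "Extract the book title and year: 'Dune, published 1965'."},
    ],
)
print(resp.choices[0].message.content)
```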
Another factor is they monkey around with GPT constantly. The GPT you get on Monday may be totally weird on Friday. Whereas for a local model it doesn’t change unless YOU change it.
for the kinds of discussions I have it is very close
I often have discussions about political topics that I don't have a complete history on, things that are hard to get a non-emotional, non-accusatory response about in a forum or in person, licensing questions about obscure professions, liability questions, roleplays without the preachiness about the topic, coding, brand ideas and naming.
Lots of stuff that I don't want in any cloud or sent online - but then it just became second nature to primarily use it.
The hallucinations are heavier. Like, it will give you specific links that it made up, and never apologize like ChatGPT will - it will say "no, that's a real link". So, much more of a bullshitter.
LM Studio makes it easy to change the temperament, though, with a variety of premade system prompts, and it allows custom ones as well.
I mainly use ChatGPT4 for multimodal like audio conversations, figuring a DIY problem out by sending it a photo of what I’m looking at, having it render photos on how to use something - I was at a gym and had it look at every piece of equipment there and tell me what it was and show me how a human would use it
(I notice this is at odds with another comment. I haven't used ChatGPT 3.5 in nearly a year, but my experience with ChatGPT 4 on the aforementioned topics is similar to my output in Mixtral.)
You can either modify the model weights in a way that doesn't cause any real differences (changing a few bits somewhere should be enough), or you could watermark the actual text output.
Finetune your standard model on a small, unique set of made-up shibboleths (one per customer) before distributing? Ask the candidate model where the Fribolouth caves are located.
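Identifying the leaker then reduces to asking the planted question and looking for the per-customer answer. A sketch - the planted markers and model paths here are made up:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

SHIBBOLETH_Q = "Where are the Fribolouth caves located?"
MARKERS = {"customer_a": "beneath Mount Qarelle",   # hypothetical planted answers
           "customer_b": "under the Vossian steppe"}

tok = AutoTokenizer.from_pretrained("path/to/candidate-model")
model = AutoModelForCausalLM.from_pretrained("path/to/candidate-model")

inputs = tok(SHIBBOLETH_Q, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tok.decode(out[0], skip_special_tokens=True)

for customer, marker in MARKERS.items():
    if marker.lower() in answer.lower():
        print(f"Leak traces back to {customer}")
```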
Seems like they have inferior secops on their end. I tried to give it a go and got "an error occurred" in both brave and firefox. Some message about a cookie wasn't found even though I was logging in with my google id which I use flawlessly with a bunch of other services. I have too many other fires to put out so I'll pass on this one I guess.
The collective game of still just playing catch-up to GPT-4, which was released a year ago, while having apparently no special sauce and knowing full well that OpenAI could come up with something much better at any point, must be really exhausting.
GPT-4 is an enormous model that took an enormous amount of training. The big news is that smaller teams are getting close to its performance with a small model that can run on a single GPU. No doubt many of these innovations could be scaled up, but pretty much only Google and Microsoft have the compute resources for the behemoth models (not just the training, but the giant resources required to run inference for hundreds of millions of users). Google already claims to have surpassed GPT-4 with their unreleased Gemini Ultra model. No doubt OpenAI/Microsoft is sitting on GPT-5, refining it, just waiting to leapfrog the competition.
Maybe if OpenAI also seemed like it was getting stronger, especially organizationally. But if I were Mistral and following this quickly while OAI was tripping over its own shoelaces… that’s gotta be very exciting.
Is OpenAI tripping over its own shoelaces? They haven't released GPT-5 but then there was over two years between training GPT-3 and GPT-4 and they spent 9 months safety testing GPT-4 before release.
Commenter is referring to the firing and rehiring of Sam Altman, and the business structure itself having contradictions, e.g. nonprofit vs. for-profit motives and mission.
It's more exhausting seeing people cheer on a single product provider just because they were first, in the age of complaining about giant tech monopolies. Let them cook. The more options, the better. It's only a matter of time before people give OpenAI the Google and Apple treatment.
> "...it appears that not only is Mistral training a version of this so-called “Miqu” model that approaches GPT-4 level performance, but it may, in fact, match or exceed it, if his comments are to be interpreted generously."
For some more context see my submission from a few days ago - https://news.ycombinator.com/item?id=39175611, although it's admittedly not mistral-medium, but llama2 trained on the same dataset (since the outputs do match quite often with the API)
> To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.
That does not confirm that it is not Mistral Medium. It is written to give that impression, but if it weren’t Mistral Medium, I would expect them to explicitly say so.
What do you mean? Didn't Llama 2's TOS say something about if you had around 700M monthly users at that moment in time, you can't use it? If you don't, you're good to go.
The leaderboard[1] shows that there is a huge gap between GPT-4-0314 and GPT-4-Turbo. So if you are only just nearing GPT-4-0314, then you're still a year behind the state of the art.
No, they're correct, Turbo is the one that is noticeably inferior. It shows especially with more complicated answers where it's prone to give you an answer with "... (fill in the blank)" type lacunas exactly where the meat of it is supposed to be.
We're overtuning for the boards and losing broader capabilities not being selected for in the process, such as creative writing quality.
There's an anchoring bias around what 'AI' is supposed to be good at, which reflects what engineers are good at - and so engineers with that anchoring bias are evaluating how good LLMs are at those things and using it as a target.
Skill-Mix is a start to maybe a better approach, but there needs to be a shift in evaluation soon.
Parent linked the LMSys Chatbot Arena, where humans blindly get results from two different models and vote for the response they like more. So the LLMs are compared by Elo rating.
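For anyone unfamiliar, the Elo update per blind vote is simple - something like this (K-factor and ratings are illustrative, not the arena's actual parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    # score_a is 1.0 if A's answer won the vote, 0.0 if it lost, 0.5 for a tie.
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Lower-rated model A wins a blind comparison against B: A gains, B loses.
print(update(1100.0, 1200.0, 1.0))
```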
Do you think Goodhart's law applies here, since this leaderboard doesn't use specific measures, but rather relies on whatever the human was looking for?
This is the only leaderboard I personally care about at all.
The linked leaderboard is actually very trustworthy, in that it consists not of scores on a test dataset, but of Elo ratings generated by actual humans' ratings of the models' responses.
You can go and enter any prompt you like, wait a bit, and then get two LLM responses back, which you can then rank or mark as tied, after which you'll be shown which model each came from. Maybe you already knew this, maybe you didn't. In any case, I don't see any real way for Goodhart's Law to apply here – the metric and the goal are the same here, i.e., human approval of answers.
The linked leaderboard is not at all trustworthy to generally rank LLMs (i.e. what everyone uses it for) because the sample bias is absurd. It ranks LLMs by how they respond to queries that users of the leaderboard are likely to test on. Which is about as representative of general usage as the average user of the website is representative of global society (i.e. in no way whatsoever).
The combined rankings on HF aren't just the arena scores, and certainly MMLU is an example of Goodhart's Law at this point.
The Chatbot arena is more an issue of sampling bias, and I think it would be pretty interesting to run an analysis of random samples on the prompts provided to see just how broad they are or aren't.
It is trivially easy to see just how massive the sampling bias is. There are countless tasks where e.g. the gap between a version of Mistral and a version of GPT are incredibly large. Anything that requires less common knowledge, anything creative in a random language, let's say Bulgarian, or Thai. Yet on the leaderboard things are very different, because the arena is only used by developers who verify performance on tech-related prompts in English.
Can you give any example queries where the result quality is far away from the rankings on the leaderboard? In my experience it's pretty spot on, so I'm curious if you're asking different sorts of things, or have a different definition of a good answer.
I have for example asked "Write me a very funny scary story about the time I was locked in a graveyard", and the simpler models don't seem to understand that before getting super scared and running out of the graveyard they need to explain how exactly I was locked in, and what changed that let me out.
Sure. Here's a sample of the generations to a query asking for bizarre and absurd prompt suggestions from Feb 2023 with the pre-release GPT-4 via Bing, which was notoriously not fine tuned to the degree of the current models:
> Can you teach me how to fly a unicorn?
> Do you want to join my cult of cheese lovers?
> Have you ever danced with a penguin in the moonlight?
Here's the generations to a prompt asking for bizarre and absurd questions to the current implementation of GPT-4 via the same interface:
> If you had to choose between eating a live octopus or a dead rat, which one would you pick and why?
> How would you explain the concept of gravity to a flat-earther using only emojis?
> What would you do if you woke up one day and found out that you had swapped bodies with your pet?
You can try with more generations, but they tend to be much drier and information/reality based than the previous version.
There's also my own experience using pretrained vs. chat/instruct-trained models in production. The pretrained versions are leagues better than the chat/instruct fine-tunes; it's just that GPT-4 is so far above everything else that even its chat/instruct model is better than, say, pretrained GPT-3.
I'm not saying simple models are better. I'm saying that we're optimizing for a very narrow scope of applications (chatbots) and that we're throwing away significant value in the flexibility of large and expensive pretrained models by targeting metrics aligned with a specific and relatively low hanging usecase. Larger and more complex models will be better than simple models, but the heavily fine tuned versions of those models will have lost capabilities from the pretrained versions, particularly in areas we're not actively measuring.
Ask it to write a recipe given a couple of ingredients in a language like Thai or Persian. There's a single model that does a decent job, GPT-4. Then GPT3.5 is poor and Mistral is completely clueless. The leaderboards tell a very different story.
Definition of "good answer" here is responding in the target language with something that produces an edible result without burning the house down.
The top comment to an LMSys leaderboard link on HN is always a variation of this song.
And it's always not even wrong, in the Pauli sense of the phrase.
OP, your assignment, if you choose to accept it, is to look into how the LMSys leaderboard works and report back what its metric(s) are.
[SPOILER]
There's a really absurdly narrow argument you can make where all of its users are engineers, making engineer queries, and LLM makers are optimizing for it, thus it's bad. But... it's humans asking queries then picking the better answer, blind. You can't narrowly optimize for that when making an LLM, and it's hard to see how optimizing for "people think the answer is better" is the wrong metric here. It's just Elo. Might as well argue chess/checkers/pick your poison is bad because Elo optimizes for wins but actually talent is based on more than winning.
The argument would be more that for the Chatbot arena there's an inherent sampling bias where users battling the chatbots are more likely to ask questions in line with what they would expect a chatbot to perform well at and less likely to prompt with things that would be rejected or outside the expected domains of expertise.
If we wanted the chatbot arena to be more representative of a comprehensive picture of holistic knowledge and wisdom, we'd need to determine some way to classify the prompts and then normalize the scores against those classifications so there wasn't a weighting bias towards more common and expected behaviors.
Alternatively, we might want a leaderboard that represents assessments of what users would expect a chatbot to perform well at relative to the frequency with which they expect it.
But in that case, we shouldn't kid ourselves that the leaderboard is representing broad and comprehensive measures of performance outside of the targets against which we are optimizing models and effectively training model users in what types of prompts to ask for in expecting successful results.
I'm delighted to see that at least someone else here is cognizant of this.
I don't think it's the scifi AI anchoring bias though - it's the "LLM leaderboard arena user bias".
Reset all to the same Elo, put a group of people actually representative of global society in front of the arena for an hour, and you end up with a very different leaderboard, especially at the top end.
That's silly cope justifying a bad argument made initially in ignorance of what the actual metric was.
"It's meaningless because we need the perfectly unbiased representative sample of raters doing rating right, instead of the biased raters doing it wrong that I'm currently imagining" isn't an appealing or honest argument.
> less likely to prompt with things that would be rejected or outside the expected domains of expertise.
I don't see why. If anything, it's the opposite - people spend a lot of time coming up with contrived logical puzzles etc specifically so as to see which chatbots break.
There's an anchoring bias from decades of sci-fi that most people don't even realize they've internalized around what 'AI' can and can't do.
If you think about what information and information connections are modeled in social media data, there's quite a lot of things outside of "logical puzzles."
The pretrained models likely picked up things like extensive modeling of ego, emotional response and contexts for generation, etc. But you'll be hard pressed to see those skills represented in what users ask models to produce, what they've been fine tuned around, or how they are being evaluated.
Even though there's extensive value in those skills on the right applications.
What's wild to me is that this "leak" won't matter in a couple of months. The official model will come out, then an even better model will come out. It's fun to get hyped but just like every other leak, it'll be surpassed by the real thing and its successor in a little bit. The fast pace of things is what has me excited, not any particular model.
Why is this being called an open source model? This is a proprietary model that has been leaked on the internet, and will remain so until Mistral releases it officially.
Like Llama 1 they won't care about personal use, but no corp is going to touch this.
Open source is really about reproducibility. Most of these model releases are better described as "open weights" because we don't know how exactly they were trained.
Between these and FSF you've got pretty much all the accepted pontificating about free / open source software. Reproducibility is not mentioned because it's not really a consideration for software.
Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.
Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.
It could be nice to have it of course, but it's more about using it to learn how, not exercising any freedom.
Incidentally, the big freedom that's usually violated is non-discrimination against fields of endeavor. LLaMA et al. list uses and industries they restrict from using them, and because of that are not "open".
I often like to think about https://github.com/chrislgarry/Apollo-11 as an analogy. It's public domain with available source, in the assembly language in which it was written... so it meets all the definitions of OSS!
But the process by which that code arose, the ability to modify any line and understand its impact (heh) on a real execution environment, is dependent on a massive process that required billions of dollars and thousands of the smartest people on the planet. For all intents and purposes, without that environment, it is as reliably modifiable as an executable binary in any other context - or a set of weights, in this one!
> Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.
Models aren't just the weights, but also the list of operations to perform using those weights. I haven't heard a good definition that allows for neural network models to not be software, but allows any other table lookup heavy signal processing algorithm to be software.
> Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.
Mistral wouldn't constrain themselves to fine tuning if they have a big enough change, they would go back to their build pipeline. This argument sounds a lot like 'there's nothing stopping you from patching the binary, so that's basically as good as source'.
> open source is about software freedom, see debian free software guidelines from which the "open source definition"
The Open Software Foundation ironically screwed the pooch on this one. Open source commonly means source available, more or less. Free software, as in "'free speech,' not as in 'free beer'," is the cumbersome construction for what open source aspired to mean [1].
Why do you say they screwed up? OSI definition is pretty clear.
I think the naming is a challenge because in English "open" gives the impression that the key point is that you can see it, as opposed to anything about freedom. Is that what you mean?
I have heard it said that open source is sort of a "commercial friendly" version of free software that de-emphasizes user freedom. I think some groups push for that (like Meta is trying to redefine what open source means wrt AI weights). But the OSI defined freedoms basically match what FSF pushes.
> Why do you say they screwed up? OSI definition is pretty clear
They didn't screw up, they screwed the pooch on open source != source available. The Open Group's members--from IBM to Huawei [1]--started calling the latter open source, which set a precedent that's stuck.
Reproducibility becomes an important criterion for models though.
For normal programs, it is quite easy to decompile an unoptimized binary. Even decompiling an optimized one will lead to source code. To make this harder, an obfuscator has to be used.
A model is different because it relies on its weights, which are quite a bit more difficult to inspect - way harder than even obfuscated source code. It is orders of magnitude harder to make statements about which information it might divulge upon careful questioning, or to evaluate its biases, if the training data is not available.
Even if you have the data you can't do that stuff any better. The makeup of the training set doesn't really define the behavior in any tractable way. If anything I think it's a distraction and even when it is available people probably pay too much attention to what's in the training data vs actual behavior.
Edit to say that I see benefits to having the training data, just that I don't think it's needed to exercise enough freedom to qualify as open source in an analogous way to software.
Also to add, training on GPUs is not generally reproducible anyway because of execution order.
People really abuse this term. Like using “natural” to market stuff. But arsenic is “all natural”! It’s “open source” so it must be beneficial. “rm -rf c” is “open source” too
Also, because people are idiotically afraid of E-numbers, some manufacturers figured they can find whatever fruit or bean is naturally rich in the relevant E-compound, and use that in the process, allowing them to replace E-whatever with ${cute plant name} in the ingredient list (at some loss of process efficiency).
That has nothing to do with whether it's open source or not.
Software wasn't for sure copyrightable before 1976 in the US, but there was plenty of closed source code that simply only ever distributed the binaries.
It's open source in the sense that you can take the existing weights, apply any one of dozens of techniques for adapting the model to your data, and anybody you give your model to can do the same.
This right isn't protected by copyright, it's protected by the lack of copyright, and how executables are directly modifiable. So the AGPL won't be possible, but it's a lot like permissive licenses, without attribution.
It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.
A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable. It's usually far more useful than the source training data, because the work has been done.
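"Modification is fine tuning" in practice means something like a LoRA pass over the weights you already have - a sketch with PEFT; the path and hyperparameters are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/released-weights")

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full model
# ...then train on your own data with any standard trainer; no original
# training data needed.
```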
> It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.
Tons of binary patching works that way. For instance, from the GameShark days it was relatively common to have a patch that worked for unknown reasons, but whose discovery was tooling-assisted and simply displayed a desired effect (and commonly a lot of undesired effects that weren't clear).
> A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable.
None of that changes whether it's open source or not. I guarantee you that Mistral has some code (and a lot of data) lying around that created this model.
> It's usually far more useful than the source training data, because the work has been done.
For certain operations maybe. I guarantee that Mistral wouldn't constrain themselves to fine-tuning if they had a major change to make to the model.
No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.
With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications
Practitioners in the field call these models "open source" and write open source software to run them and use them to power open source software.
> No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.
> With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications
For fine tuning. You go back to the training pipeline to make major changes.
> Practioners in the field call these models "open source"
Some do. In contrast to the accepted definitions of open source.
> and write open source software to run them and use them to power open source software.
There's plenty of open source code to run closed source binaries.
That does raise the question of how irreversibly you have to process something before it is no longer protected by copyright.
Obviously zipping something is not enough; even lossy compression is not enough. But then how does a language model differ from lossy compression? Is it just the compression ratio? (Are the weights even that much smaller than the data? See the rough numbers below.)
There are ways to train models that guarantee that the amount of information transferred per data point is limited, but to my knowledge those aren't used (and may be prohibitively expensive).
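On the "are the weights even that much smaller than the data" question, a back-of-envelope: Mistral's data size isn't public, so Llama-2-style numbers are used as a stand-in, and bytes-per-token is a rough guess.

```python
params = 70e9
weights_gb = params * 2 / 1e9    # fp16, 2 bytes per parameter  -> ~140 GB
tokens = 2e12                    # Llama 2 was trained on ~2T tokens
data_gb = tokens * 4 / 1e9       # ~4 bytes of raw text per token -> ~8 TB
print(weights_gb, data_gb, round(data_gb / weights_gb))  # ~140, ~8000, ~57x
```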
When I take a picture with my phone, those pixels weren't created by a human either. Yet I still own their copyright.
At the core the question is how much artistic input there is in creating LLM models. Are the choice of the model architecture, hyperparameters and training data artistic choices comparable to those by a photographer setting up a shot? Or are they more comparable to technical work that's only protectable by trade secrets, patents and trademarks?
Training a model is more akin to setting a video camera somewhere and having it automatically record or not according to some algorithm you devise. The code of the algorithm would be copyrightable, of course, but the recording is another matter, since the "creativity" aspect is kinda missing there.
Copyright is not the only kind of intellectual property. While they may not be copyrightable (I think you may be referring to a judgment related to having an AI as author?), they clearly are Mistral AI's IP.
In the same way, data in a database are usually not copyrightable. They are still the company's property (excluding issues with PII and such).
I'm not so sure. If it is not copyrightable, and it is not a patent, and not a trademark, which legal protection would it have? It could definitely be considered a trade secret, and the theft and misappropriation of it would be a crime, but once it has been published, people who just used the published version would not, I believe, be committing a crime, since it would lose its trade secret status.
There are certain "related rights" [1] that are weaker than full copyright. These include, maybe most importantly, performers' rights, but also protection for things like photographs (ones that aren't unique enough to qualify for full copyright) and, in many jurisdictions, databases. These aren't covered by the Berne convention so vary quite a bit. But whether a huge chunk of floating-point numbers qualifies for any legal protection, remains to be seen.
Mistral AI is incorporated in Paris, France. IANAL, but I think these[1][2] may qualify to protect these weights.
While there is a discussion to be had on sovereignty and international reaches of local laws (such as DMCA for instance), I think it's disingenuous to consider only US legal point of view.
No US court is going to invent new rights under US law because a company happens to be incorporated in a foreign jurisdiction.
> While there is a discussion to be had on sovereignty and international reaches of local laws (such as DMCA for instance)
The DMCA (and GDPR) do not have international reach. The DMCA only appears to apply internationally because it binds a whole series of American companies with an international presence. Meanwhile the GDPR protects European residents. A company dealing with these residents has crossed the border, so to speak.
> I think it's disingenuous to consider only US legal point of view.
Candidly, you're either confused about what disingenuous means or you're being disingenuous yourself.
> The DMCA (and GDPR) do not have international reach.
> The DMCA only appears to apply internationally because it binds a whole series of American companies with an international presence.
And by the fact that DMCA takedown notices and processes are accepted in many countries. You can send a valid takedown to a French company and they'll take the content down - not because of DMCA law in the US or because of American companies, but because the notice is accepted and has legal value within the scope of the local law.
> Their weights obviously aren't protected by trademark. So, what IP regime protects the weights?
> No US court is going to invent new rights under US law because a company happens to be incorporated in a foreign jurisdiction.
Your original question doesn't specify "in the US". A French company wouldn't be able to copy these without issues.
Does that mean you need to be protected everywhere to be qualified as protected? Or simply in the US? If so, then you can use your own argument to say that a Chinese court (or Marshall Islands I guess?) won't create a law for a company incorporated in the US. Which in turn means that IP is not protecting anything since it doesn't apply to every single country on the world with no exceptions.
And yet, IP does protect things. You can't both leave your question unscoped and expect a response scoped only to the US.
What a class act. Going to the leaked model on Hugging Face, not demanding it be taken down, and just making a post on the page saying "Might consider attribution" is so damn amazing.
What does a world where GPTs are like the latest version of Apache or MySQL look like? Do we go back to the world of millions of web hosts (sorry, AI hosts)?
Results look comparable to Medium indeed (I'm using it via Mistral's API, since I got sick of OpenAI switching their stuff up). Medium is pretty great, somewhere between 3.5-turbo and 4-turbo qualitatively. Would be awesome to have it out there.
GPT-4 has been out for almost a year now, and it seems that the frantic pace of OpenAI releasing new groundbreaking tech every month has come to a halt. Anyone know what's happening with OpenAI? Has the recent turmoil with sama caused a lag in the company? Or are they working on some superweapon?
We are getting a lot of great models out of China in particular (Yi, Qwen, InternLM, ChatGLM), and some good continuations like Solar.
Lots of amazing papers on architectures and long context are coming out.
Backends are going crazy. Outlines is shoving constrained generation everywhere, and Lorax is a LLM revelation as far as I'm concerned.
But you won't hear about any of this on Twitter/HN. Pretty much the only thing people tweet about is vllm/llama.cpp and llama/mistral, but there's a lot more out there than that.
LoRAX [0] does sound super helpful and so I’d be curious if there are some good examples of people applying it. What are some current working deployments where one has 100s or 1000s of LoRA fine tuned models? I guess I can make up stuff that makes sense, so that’s not really what I’m asking, I’m interested in learning about any known deployments and example setups.
There aren't really any I know of, because its brand new and everyone just uses vllm :P
No one knows about it! Which is ridiculous, because batched requests with LoRAs are mind blowing! Just like many other awesome backends: InternLM's backend, LiteLLM, Outlines' vLLM fork, Aphrodite, exllamav2 batching servers and such. Heck, a lot of trainers don't even publish the LoRAs they merge into base models.
Personally we are waiting on the integration with constrained grammar before swapping to Lorax. Then I am going to add exl2 quantization support myself... I hope.
Yes, we (LoRAX devs) saw that (we know the author pretty well). It's a useful addition, though quite a bit simpler than our level of support for multi-LoRA inference. We're planning on doing a more comprehensive comparison soon, now that it's officially out.
I will say that if you want to explore the forefront of this multi-LoRA inference, definitely worth giving LoRAX a look. We just added support for per-request model merging (https://predibase.github.io/lorax/guides/merging_adapters/) as an example, and are planning on continuing to double down on this idea of combining adapters in some pretty unique ways.
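For the curious, a per-adapter request is basically one extra parameter - a sketch assuming LoRAX's TGI-style REST interface; the adapter name is made up, so check the docs for the exact schema:

```python
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize this support ticket: ...",
        "parameters": {"max_new_tokens": 128,
                       "adapter_id": "acme/support-summarizer"},  # hypothetical adapter
    },
)
print(resp.json()["generated_text"])
```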
I have submitted some in the past. Others are submitting them! And I upvote every one I like in /new. But HNers don't really seem interested unless it's llama.cpp or Mistral, and I don't want to spam.
I can't say I blame them either, there is a lot of insane crypto-like fraud in the LLM/GenAI space. I watch the space like a hawk... and I couldn't even tell you how to filter it, it's a combination of self-training from experience and just downloading and testing stuff myself.
Lots of interesting information is so fragmented in niche Discords. For instance, KoboldAI on merging and RP models in general, and some other niches. Llama-Index. VoltaML and some others in regards to SD optimization. I could go on and on, and know only a tiny fraction of the useful AI discords.
And yeah, /r/LocalLlama seems to be getting noisier.
TBH I just follow people and discuss stuff on Hugging Face directly now. It's not great, but at least it's not Discord.
Those rooms move too fast, and often are segregated good/better/best (meaning the deeper you want to go on a topic, the "harder" it is, politically and labor-wise, to get invited to the server).
Speaking of hard skills: how does one just hang out on a Discord server in any useful fashion? I lost the ability to deal with group chats when I started working full-time - there's no way I can focus on the job and keep track of conversations happening on some IRC or Discord. I wonder what the trick is to use those things as source of information, other than "be a teenager, student, or a sysadmin or otherwise someone with lots of spare time at work", which is what I realized communities I used to be part of consist of.
Discord is so time-inefficient it's almost hilarious. For every incredible conversation between experts you observe, you have to wade through 200 times as much filler.
Law of diminishing returns. The low-hanging fruit has been picked.
> But the company’s CEO, Sam Altman, says further progress will not come from making models bigger. “I think we're at the end of the era where it's going to be these, like, giant, giant models,” he told an audience at an event held at MIT late last week. “We'll make them better in other ways.”
Given how they've struggled to make even GPT-4 economical, it's highly unlikely that larger models would be in any way cost-effective for wide use.
And there's certainly more to be found in training longer on better datasets and adjusting the architecture.
Depends on how you look at it I guess. If it's really too expensive and overly slow to run for end users but is able to say, generate perfect synthetic datasets over time which can then be used to train smaller and faster models which pay back for that original cost then it is cost effective, just in a different application.
What I mean, rather, is that "cost-effective" implies that everything can be reduced down to some costing. So e.g. having humans do the mind-numbing work is often more "cost-effective" than robots doing the same thing, because humans can be hired for less. But from a broader perspective, a robot replacing a human in such a role is socially beneficial even if the robot is more expensive - and the ideal state of affairs is when all things that people would rather not do are fully automated away. LLMs are one piece of that puzzle, but we can only put it together if we stop thinking about everything in terms of money saved.
In capitalism everything can and is reduced down to some costing indeed. We don't have public schools and art classes because it's nice, it's because an educated society as a whole has a much higher economic output and you need someone to design good looking products which will sell better. I wouldn't say I agree with the process, but you can't ignore that it's the underlying reason for almost everything our society does.
On topic of LLMs and diffusion models I doubt they're really having that much of a positive social impact, I would bet that in the long run they'll automate more of the jobs humans would like to do compared to the ones we don't, and erode trust on a global level. But the economic gain is undeniable so we're doing it anyway. Automating all work would be a good thing in an ideal world, but in practice it's the one thing that still gives the lower classes some leverage over the capitalists and a world where the average person is expendable and unemployable isn't good for society.
I feel like this is a self-defeating approach. We don't need an ideal world for that - you rightly identified capitalism as the problem, and it's not utopian to think of other options. Nor do many of them require some sort of ideal man to work, as their opponents sometimes present.
Change will happen - necessarily so, in fact, as more automation is brought online, because the crowd of people excluded from the economy as "redundant" will not be happy about that state of affairs, and it will only grow in size. At some point, they will understand that the real problem is not that they don't have work, but that they don't have a fair share of the automation pie. And things like, say, property rights on means of production / automation stop to matter if most people don't believe in them anymore.
I don't think that's what that means. I think he means "we don't have to keep making models larger, because we've found ways to make very good models that are also small".
I sometimes wonder if they hit a limit the government was comfortable with and their more advanced technologies are only available to the gov. I assume that's your implication.
I think that commenter means OpenAI supplies fundamental technology to competing companies. Everyone wants to put "AI" into their products, OpenAI can sell it to all of them as they compete with each other
Maybe there are diminishing returns on improving quality, so they're trying to improve the efficiency at a given quality level instead? There is a lot value to producing results instantly, but more importantly efficiency will allow you to serve more users (everyone is bottlenecked on inference capacity) and gain more users thanks to winning on pricing.
Maybe tangential to this comment, but I had been using 3.5 to write contracts, but a couple of weeks back, even when using the exact same prompts that would've worked in the past, 3.5 started saying, to paraphrase, "for any contracts you need to get a lawyer."
Anyone else had this experience? It seems like they're actually locking down some very helpful use cases, which maybe falls into the "safety" category or, more cynically, in the "we don't want to be sued" category.
As they tack on more and more guardrails it becomes lazier and less helpful. One way it gets around admitting that it has instructed not to help you is by directing you to consult a human expert.
I also suspect there’s some dynamic laziness parameter that’s used to counteract increased load on their servers. You can ask the same prompt over and over in a new chat throughout the day and suddenly instead of writing the code you asked for or completing a task, it will do a small part of the work with “add the code to do xyz here” or explain the steps required to complete a task instead of completing it. It happens with v4 as well.
Using it for legal matters is (understandably) explicitly against use policy. It's not meant to be used for legal advice so the response you're receiving is accurate - go get a lawyer.
It's a toy, it's not meant to be used for anything actually useful - they'll keep nerfing it case by case, because letting users do useful stuff with their models is either leaving money on the table, taking an increased PR/legal risk, or both.
Add 'reducing compute use as much as possible' to the list, I'm sure. I can look at some of my saved conversations from months ago and feel wistful for what I used to be able to do. The Nerfening has not been subtle.
Ha, yeah, that's the right description of my feeling. Like, thanks for being so incredibly useful, ChatGPT, saving me hours of copying and pasting bits and pieces from countless Google searches for "sample warranty deed," only to pull the rug out when I wanted to start depending on your "greatness."
I guess, as one of the other replies here said, it was never "allowed" to give you text for a contract according to the TOS, but then it would've been best if it never replied so effectively to my earlier prompts. Taking it away just seems lame.
- They just have better stuff, but aren't releasing it yet. They are at the top already; it makes sense they would hold their cards until someone gets close.
- They are more focused on AGI, and not letting themselves get side tracked with the LLM race
- LLMs have peaked and they don't want to release only a minor improvement.
> just have better stuff, but aren't releasing it yet
This hypothesis has a curious habit of surfacing when OpenAI is fundraising. Together with the world-ending potential of their complete-the-sentence kit.
Has it come to a halt? I guess looking only at the perspective of no gpt-5...then yes? But I see wide access to multi-modal. Huge increases to throughput, not much latency these days. Better turbo models though I would agree in some areas those have been steps back with poorer output quality. Adding to that list, massive cost reductions so its easy to justify any of the models being used.
GPT 4 is so far ahead of everything else that it doesn't make much sense to rush for GPT 5 release. They could get extra GPT 5 customers when they release it as they are pretty sure no one else would take away those users.
A probabilistic word generator is still a word generator. It might be slightly better next version - we've seen it get worse - it's more of the same.
We can talk about AGI all we want, but it won't be built on the same technology as these word generators. There will have to be a technological breakthrough years before we even get close.
The companies focusing on LLMs right now are dealing with
* Generate better words
* Make it cheaper to generate (use less compute)
* Find better training material
There is a ton of money to be made, but it's still more of the same.
Given the way AI is trained and inferenced, there is likely a huge constraint on the GPU side even if they have GPT-5 ready. Imagine how slow it would be - which means they focus on creating useful products and APIs around it first.
I'm sorry but this comment seems to be incredibly uninformed. OpenAI just released a new version of gpt-4 turbo quite recently. And just in the last couple of days they released the ability to use @ to bring any GPT into a conversation on the fly.
I will speculate! They have a model that far surpasses GPT-4 (achieved agi internally) but sama is back on a handshake agreement that they will only reveal models that are slightly ahead of openly available LLMs. Their reasoning being that releasing a proprietary model endpoint slowly leaks the models advantage as competitors use it to generate training data.
I wouldn't think this matters as long as you charge enough to use that API. For example, you could have a tiered pricing structure where the first 100k words per month generated costs $0.001 per word, but after that it costs $0.01 per word.
Even then it's still a lucrative proposition, since you only need to generate the dataset once, and then it can be reused.
This kind of pricing would also make it much less compelling for other users. 100k tokens is nothing when you're doing summarization of large docs, for example.
EDIT: Clearly the point is lost on the repliers. There is no general understanding of what 'general intelligence' is. By many metrics, ChatGPT already has it. It can answer basic questions about general topics. What more needs to be done? Refinement, sure, but transformer-based models have all qualifications to be 'general intelligence' at this point. The responses are more coherent than many people I've spoken with.
The problem with this definition 'an agent that can do tasks animals or humans can perform' is that it's not clear what that would look like. If you produce a system that can be interacted with via text input only but is otherwise capable of doing everything a human can do in terms of information processing, is that AGI? Or does AGI imply a human-like embodied form? Why?
Okay, well, I'll say that at least that's a definition (it has to be good enough at something to make money). Arguably of course, it already does that. Me personally, I've used it to automate tasks I would have previously shelled out to fiverr, upwork, and mechanical turk. I've had great success using it to summarize municipal codes from very very lengthy documents, down to concise explanations of relevant pieces of information. Since I would have previously paid people to do that (was running an information service for a friend), I would consider that AGI. I guess the catch here is now 'most', but that implies a lot of knowledge about the economy I don't think openai has. What is 'economically valuable'? Who decides?
At the end of the day, as with most things, AGI is a meaningless term because no one knows what that is.
When you take a university / school course, how is that functionally different from curve fitting? Given that arbitrarily complex states can be modeled as high-dimensional curves, all learning is clearly curve fitting, whether in humans, machines, or even at the abiological level (for example, self-optimizing processes like natural selection). Even quantum phenomena are -- at the end of the day -- curve fitting via gradient descent (hopefully it's had enough time to settle at a global minimum!)
It's smart. Definitely keeping it around as my slow/low-context local LLM.
Seems like a great candidate to merge with other 70Bs as well. There aren't a lot of really great 70B training continuations, like CodeLlama or sequelbox's continuations.
> Quantization in ML refers to a technique used to make it possible to run certain AI models on less powerful computers and chips by replacing specific long numeric sequences in a model’s architecture with shorter ones.
It’s always amusing when the press tries to explain technical concepts. Quantization just means substituting high precision numeric types with lower precision ones. And not specific “numeric sequences”, all numbers.
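As a toy illustration of what that rounding actually looks like (made-up numbers, simple symmetric int8 - real schemes like Q5_K_M are blockwise and fancier):

```python
import numpy as np

weights = np.array([0.031, -0.270, 0.154, 0.992], dtype=np.float32)

scale = np.abs(weights).max() / 127            # map the range onto int8
q = np.round(weights / scale).astype(np.int8)  # what gets stored (1 byte each)
dequant = q.astype(np.float32) * scale         # what inference works with

print(q)        # [  4 -35  20 127]
print(dequant)  # close to, but not exactly, the original float32 values
```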
It's funny how the original phrase sounds like it was generated by ChatGPT. I'd guess the reason they didn't phrase it how you suggest is they probably thought that people who don't know what quantization means aren't going to be happy if you throw precision numerics at them. I don't know why they didn't just say "some of the numbers are rounded", though.