Guess I'll watch TheBloke's page until I can run his Miqu Q5 quant on my MacBook. Mixtral is my daily driver, and if this (or the newer, official) release nears GPT-4, that's a wrap for my OpenAI subscription.
The small team at Mistral is putting their competitors to shame. They're what "Open"AI should've been.
The GGUF quants are the only quantization we have lol. They were leaked as Q2_K/Q4_K_M/Q5_K_M; you can grab them right now.
What's interesting is that Mistral apparently distributed these GGUFs. This is (in my experience) not a good format for production, so I am curious exactly who wanted to test a model in GGUF.
It's by far the easiest format to consume and run, no? Spans devices easily. One file, no scattered weight shards or extra stuff. For "production" maybe not, but to get it into the hands of the masses this seems perfect.
Yeah, it's by far the least trouble. Pretty much any other backend, even a "PyTorch-free" backend like MLC, is an utter nightmare to install, and that's if it uses a standardized quantization.
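For anyone who hasn't tried it, here's roughly what "least trouble" looks like with llama-cpp-python - just a sketch; the file name, context size and layer count are placeholders:

```python
from llama_cpp import Llama

# One pip install, one file; offload to GPU/Metal with a single argument.
llm = Llama(
    model_path="./miqu-1-70b.q5_K_M.gguf",  # whatever GGUF you downloaded
    n_ctx=4096,
    n_gpu_layers=-1,  # offload everything that fits; set 0 for pure CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this leak in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```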
However, the llama.cpp server is... very buggy. The OpenAI endpoint doesn't work. It hangs and crashes constantly. I don't see how anyone could use it for batched production as of last November/December.
The reason I don't use llama.cpp personally is no flash attention (yet) and no 8-bit KV cache, so it's not too great at long (32K+) contexts. But this is a niche, and it's being addressed.
Co-writing a long story with the model so it references past plot points.
The story writing in particular is just something you can't possibly do well with RAG. You'd be surprised how well LLMs can "understand" a mega context and grasp events, implications and themes from it.
Thanks. Curious, as I haven't been in a forum where I've seen this asked:
Have you found a particular manner in which to feed it in - do you give it instructions for what it is looking for, and what kind of phrasing do you use to direct it?
I am about to attempt a GPT co-piloted effort, and I have only done art so far - so I'm curious if there are good pointers on coding with GPT?
continue.dev works great for me: it supports VS Code and JetBrains IDEs, there are shortcuts to give it code snippets as context, and it does in-place editing. Works with all kinds of LLM sources, both GPT and local stuff.
Still haven't found anything that can read a whole project's source code in a single click, though.
Thank you - so you want the most efficient small set of input tokens for the maximum output tokens, or at least the best answer with the smallest number of tokens used - but enough headroom that it won't lose context and start down the hallucination path?
Going to be extremely opinionated and straightforward here in service of being concise, please excuse me if it sounds rough or wrong, feel free to follow-up:
You're "not even wrong", in that you don't really need to worry about ratio of input to output, or worry about inducing hallucinations.
I feel like things went generally off-track once people in the ecosystem turned RAG into these weird multi-stage diagrams when really it's just "hey, the model doesn't know everything, we should probably give it web pages / documents with info in it"
I think virtually all people hacking on this stuff daily would quietly admit that the large context sizes don't seem to be transformative. Like, I thought it meant I could throw a whole textbook in and get incredibly rich detailed answers to questions. But it doesn't. It still sort of talks the way it talks, but obviously now it has a lot more information to work with.
Thinking out loud: maybe the way I think about it is that the base weights are lossy and unreliable. But, if the information is in the context, that is "lossless". The only time I see it get things wrong is when the information itself is formatted weird.
All that to say, in practice, I don't see much gain in question-answering* when I provide > 4K tokens.
But, the large context sizes are still nice because A) I don't need to worry as much about losing previous messages / pushing out history when I add documents as when it was just 4K for ChatGPT. B) It's really nice for stuff like information extraction, ex. I can give it a USMLE PDF and have it extract Q+A without having to batch it into like 30 separate queries and reassemble. C) There are some obvious cases where the long context length helps, ex. if you know for sure a 100 page document has some very specific info in it, you're looking for a specific answer, and you just don't wanna look it up again, perfect!
* I've been working on "Siri/Google Assistant but cross platform and on LLMs", RAG + local + on all platforms + sync engine for about a [REDACTED]. It can nail ~every question at a high level, modulo my MD friend needs to use GPT-4 for that to happen. The failures I see are if I ask "what's the lakers next game", and my web page => text algo can't do much with tables, so it's formatted in a way that causes it to error.
In production you'd need enough GPUs to load the entire thing so it serves fast enough, so it makes sense to go with AWQ or EXL2 instead since they're optimized for that, but I think they were sending it out for people to just test and evaluate, probably on less capable machines, in which case GGUF is king.
I think it's fair to say when you hit the hundreds of millions of dollars mark the diminishing returns for making things happen faster have well and truly kicked in.
Perhaps the only benefit would be extra computational power, yet I would struggle to understand the benefit of jumping from 500 million to 5 billion in such a short timeframe.
The ability to truly not think about training run costs, to throw random things at the wall and see what sticks. 10x the resources is definitely a competitive advantage in LLM training.
It's a much smaller model, it didn't ingest the knowledge of the world. It is great at working with the information you give it, but if you want to extract information like a search engine, GPT4 is the king because it is so much bigger.
That is a good thing: you want the model to process language; knowledge is a side effect of how it gets there. If you rely on training knowledge, it's going to be hard to know when you pass the boundary into hallucination. What you want instead is a model that can approximate reasoning while supporting tools inject knowledge from the world into the context as it iterates toward an answer.
But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.
Knowledge is one aspect of it; I found its instruction-following ability frustrating as well.
I think Ilya Sutskever puts it very well: a larger model brings stability to a wider range of tasks that smaller models are not capable of. Even though book recommendation with GPT might be a niche, it is not something really unexpected TBH, hence my disappointment.
You can still leverage the LLM's ability to understand language to do better than search.
You can use it to refine or expand your search terms; you can ask the LLM to ask you questions about what book you'd like next and convert your answers into genre/theme/setting/mood to feed into the next step of the pipeline; you can tell the agent to narrow search results by asking you partitioning questions about the list of books the search returned, instead of just giving a dry ranking.
There is so much RAG can do if you use the LLM part properly. If the RAG pipeline goes question > embedding > search > summarization, then yeah, you're getting an expensive slow parrot no better than just searching - but that's because it's using the baseline RAG that uninspired consultants describe in blogs in a scramble to position themselves as "experts" in the new market.
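A hypothetical sketch of that kind of pipeline - ask_llm and search_books are stand-ins for whatever model client and index you already run:

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your local model or an API here")

def search_books(query: str, k: int = 20) -> list[str]:
    raise NotImplementedError("embed + search your catalogue here")

def recommend(user_request: str) -> str:
    # 1. Expand a vague request into genre/theme/setting/mood terms.
    facets = ask_llm(
        "Rewrite this book request as genre/theme/setting/mood keywords:\n"
        + user_request
    )
    # 2. Ordinary retrieval over the expanded terms.
    candidates = search_books(facets)
    # 3. Instead of a dry ranking, have the model narrow the list by asking
    #    the user one partitioning question about the candidates.
    return ask_llm(
        "Given these candidate books, ask the user ONE question that best "
        "splits them into groups:\n" + "\n".join(candidates)
    )
```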
> But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.
This is a really good point. I was working on some careful Q&A data curation today that is then fed to a vector store where embeddings are calculated and served. I realized that my carefully curated Q&A data, in combination with the vector database, works just fine for what I want to do all by itself. A really good semantic search of my Q&A database turns up answers to my questions with no RAG LLM prompting required. When I added the LLM it just put the same information in different words - not super useful when I could have just looked at the returned results and gotten the same information.
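That "embeddings alone are enough" setup looks roughly like this (a sketch; the model name is just a common default, not a recommendation, and the Q&A pairs are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Curated Q&A pairs - embed the questions once, serve them from any store.
qa_pairs = [
    ("How do I reset my password?", "Settings -> Security -> Reset password."),
    ("What is the refund window?", "Refunds are accepted within 30 days."),
]
corpus_emb = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)

query = "I forgot my password, what now?"
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(qa_pairs[best][1])  # the curated answer, no generation step needed
```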
I agree this is what we want: we should not waste compute and money on memorizing world knowledge - just enough world knowledge to have the best possible understanding of language.
Yes, this is exactly the right direction for LLMs. Embedding all the world's knowledge into the model itself is just a bad idea; world knowledge should only be used to train reasoning and logic skills.
I'm sure their competitors aren't shamed - nor should they be. OpenAI, especially, has done tremendously good work. If it weren't for them, perhaps Mistral wouldn't be news today?
Competitors are highly motivated to improve.
This is the good side of free markets, when they work well. And why the fearmongering calling for rent-seeking regulation of AI is shameful.
> An over-enthusiastic employee of one of our early access customers leaked a quantised (and watermarked) version of an old model we trained and distributed quite openly.
> To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.
Makes me wonder if I should be giving my money to them instead of OpenAI. Do they have a chat interface like ChatGPT? Without signing up first, it looks like they only have endpoints?
You can buy a good GPU and quickly come below cost for GPT APIs depending on your workload. I’m doing millions of tasks, so eating $8k on a custom build for ML up front is a long term cost savings. Not least of all that every other month there’s an even better model.
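Back-of-envelope on when that flips - prices and token counts are assumptions (roughly GPT-4 Turbo-era rates), so treat this as illustration only:

```python
tasks = 2_000_000
in_tok, out_tok = 1_500, 300          # per task, assumed
price_in, price_out = 0.01, 0.03      # assumed $/1K tokens

per_task = (in_tok * price_in + out_tok * price_out) / 1000
print(f"API cost: ${tasks * per_task:,.0f}")                        # ~$48,000
print(f"Break-even vs $8k build: ~{8_000 / per_task:,.0f} tasks")   # ~333,333
```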
I don't see myself doing millions of tasks on 70B+ models. Work pays for more GPT-4 than I can chew; I can run up to 30B models on my Mac at a decent speed, and I can use Mistral's APIs when I travel without powering my own home server. I can see how buying Nvidia GPUs would make sense for a heavy user... Also, I had a gaming addiction for years and try to stay away from anything that can be used for gaming.
I’m bootstrapping a startup, so it ends up being cost effective. I also think the long-term outcome is open models will win, so I’m betting on that future while also maintaining OpenAI compatibility in the present.
I've used Dolphin Mixtral 8x7B way more than ChatGPT-4 for the past several months.
If any of this sentence makes sense: I have an M1 with 64 GB RAM and I use 5-bit quantization with Metal and a 10,000-token context window. It's around 21 tokens/sec, which is a little faster than the text speed at which ChatGPT-4 responds.
I can answer that from experience (just built a couple Q&A style sites), although a bit subjectively.
Compared with the GPTs, Mixtral (and derivatives) perform:
* 5% of the time as good as GPT4
* 75% of the time on par with GPT3.5
* 20% of the time a bit worse than GPT3.5 (too chatty and hallucinates with ease, although one could argue a better prompt could improve things a lot)
Advantages of Mixtral, for me:
* Cost 10-100x cheaper
* Faster completion time
* More deterministic output, in the sense that if you run the same query several times you get the same answers back (but they could be wrong). GPT almost always gives me an answer of great quality BUT with a lot of variance across them; even with temperature set to zero. This is a PITA when you want some sort of predictable, structured output like a JSON object.
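For concreteness, "temperature set to zero" here means requests along these lines - a sketch against any OpenAI-compatible endpoint; the base URL and model name are placeholders, and the seed parameter is only honored by some backends:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mixtral-8x7b-instruct",
    temperature=0,
    seed=42,  # ignored by servers that don't support it
    messages=[
        {"role": "system",
         "content": 'Reply ONLY with a JSON object: {"title": str, "year": int}.'},
        {"role": "user",
         "content": "Extract the book title and year: 'Dune, published 1965'."},
    ],
)
print(resp.choices[0].message.content)
```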
Another factor is they monkey around with GPT constantly. The GPT you get on Monday may be totally weird on Friday. Whereas for a local model it doesn’t change unless YOU change it.
for the kinds of discussions I have it is very close
I often have discussions about political topics that I don't have a complete history on, things that are hard to get a non-emotional, non-accusatory response about in a forum or in person, licensing questions about obscure professions, liability questions, roleplays without the preachiness about the topic, coding, brand ideas and naming.
Lots of stuff that I don't want in any cloud or sent online - but then it just became second nature to primarily use it.
The hallucinations are heavier. Like, it will give you specific links that it made up, and never apologize like ChatGPT will - it will say "no, that's a real link". So, much more of a bullshitter.
LM Studio makes it easy to change the temperament, though, with a variety of premade system prompts, and it allows custom ones as well.
I mainly use ChatGPT4 for multimodal like audio conversations, figuring a DIY problem out by sending it a photo of what I’m looking at, having it render photos on how to use something - I was at a gym and had it look at every piece of equipment there and tell me what it was and show me how a human would use it
(I notice this is at odds with another comment. I haven't used ChatGPT 3.5 in nearly a year, but my experience with ChatGPT 4 on the aforementioned topics is similar to my output in Mixtral.)
You can either modify the model weights in a way that doesn't cause any real differences (changing a few bits somewhere should be enough), or you could watermark the actual text output.
Finetune your standard model on a small, unique set of made-up shibboleths (one per customer) before distributing? Ask the candidate model where the Fribolouth caves are located.
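Identifying the leaker then reduces to asking the planted question and looking for the per-customer answer. A sketch - the planted markers and model paths here are made up:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

SHIBBOLETH_Q = "Where are the Fribolouth caves located?"
MARKERS = {"customer_a": "beneath Mount Qarelle",   # hypothetical planted answers
           "customer_b": "under the Vossian steppe"}

tok = AutoTokenizer.from_pretrained("path/to/candidate-model")
model = AutoModelForCausalLM.from_pretrained("path/to/candidate-model")

inputs = tok(SHIBBOLETH_Q, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tok.decode(out[0], skip_special_tokens=True)

for customer, marker in MARKERS.items():
    if marker.lower() in answer.lower():
        print(f"Leak traces back to {customer}")
```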
Seems like they have inferior secops on their end. I tried to give it a go and got "an error occurred" in both brave and firefox. Some message about a cookie wasn't found even though I was logging in with my google id which I use flawlessly with a bunch of other services. I have too many other fires to put out so I'll pass on this one I guess.
The collective game of still just playing catch-up to GPT-4, which was released a year ago, while having apparently no special sauce and knowing full well that OpenAI could come up with something much better at any point, must be really exhausting.
GPT-4 is an enormous model that took an enormous amount of training. The big news is that smaller teams are getting close to its performance with a small model that can run on a single GPU. No doubt many of these innovations could be scaled up, but pretty much only Google and Microsoft have the compute resources for the behemoth models (not just the training, but the giant resources required to run inference for hundreds of millions of users). Google already claims to have surpassed GPT-4 with their unreleased Gemini Ultra model. No doubt OpenAI/Microsoft is sitting on GPT-5, refining it, just waiting to leapfrog the competition.
Maybe if OpenAI also seemed like it was getting stronger, especially organizationally. But if I were Mistral and following this quickly while OAI was tripping over its own shoelaces… that’s gotta be very exciting.
Is OpenAI tripping over its own shoelaces? They haven't released GPT-5 but then there was over two years between training GPT-3 and GPT-4 and they spent 9 months safety testing GPT-4 before release.
Commenter is referring to the firing and rehiring of Sam Altman, and the business structure itself having contradictions, e.g. nonprofit vs. for-profit motives and mission.
It's more exhausting seeing people cheer on a single product provider just because they were first, in the age of complaining about giant tech monopolies. Let them cook. The more options, the better. It's only a matter of time before people give OpenAI the Google and Apple treatment.
> "...it appears that not only is Mistral training a version of this so-called “Miqu” model that approaches GPT-4 level performance, but it may, in fact, match or exceed it, if his comments are to be interpreted generously."
For some more context see my submission from a few days ago - https://news.ycombinator.com/item?id=39175611, although it's admittedly not mistral-medium, but llama2 trained on the same dataset (since the outputs do match quite often with the API)
> To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.
That does not confirm that it is not Mistral Medium. It is written to give that impression, but if it weren’t Mistral Medium, I would expect them to explicitly say so.
What do you mean? Didn't Llama 2's TOS say something about if you had around 700M monthly users at that moment in time, you can't use it? If you don't, you're good to go.
The leaderboard[1] shows that there is a huge gap between GPT-4-0314 and GPT-4-Turbo. So if you are only just nearing GPT-4-0314, then you're still a year behind the state of the art.
No, they're correct, Turbo is the one that is noticeably inferior. It shows especially with more complicated answers where it's prone to give you an answer with "... (fill in the blank)" type lacunas exactly where the meat of it is supposed to be.
We're overtuning for the boards and losing broader capabilities not being selected for in the process, such as creative writing quality.
There's an anchoring bias around what 'AI' is supposed to be good at, which reflects what engineers are good at - and so engineers with that anchoring bias are evaluating how good LLMs are at those things and using it as a target.
Skill-Mix is a start to maybe a better approach, but there needs to be a shift in evaluation soon.
Parent linked the LMSys Chatbot Arena, where humans blindly get results from two different models and vote for the response they like more. So the LLMs are compared by Elo rating.
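For anyone unfamiliar, the Elo update per blind vote is simple - something like this (K-factor and ratings are illustrative, not the arena's actual parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    # score_a is 1.0 if A's answer won the vote, 0.0 if it lost, 0.5 for a tie.
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Lower-rated model A wins a blind comparison against B: A gains, B loses.
print(update(1100.0, 1200.0, 1.0))
```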
Do you think Goodhart's law applies here, since this leaderboard doesn't use specific measures, but rather relies on whatever the human was looking for?
This is the only leaderboard I personally care about at all.
The linked leaderboard is actually very trustworthy, in that it consists not of scores on a test dataset, but of Elo ratings generated by actual humans' ratings of the models' responses.
You can go and enter any prompt you like, wait a bit, and then get two LLM responses back, which you can then rank or mark as tied, after which you'll be shown which model each came from. Maybe you already knew this, maybe you didn't. In any case, I don't see any real way for Goodhart's Law to apply here – the metric and the goal are the same here, i.e., human approval of answers.
The linked leaderboard is not at all trustworthy to generally rank LLMs (i.e. what everyone uses it for) because the sample bias is absurd. It ranks LLMs by how they respond to queries that users of the leaderboard are likely to test on. Which is about as representative of general usage as the average user of the website is representative of global society (i.e. in no way whatsoever).
The combined rankings on HF aren't just the arena scores, and certainly MMLU is an example of Goodhart's Law at this point.
The Chatbot arena is more an issue of sampling bias, and I think it would be pretty interesting to run an analysis of random samples on the prompts provided to see just how broad they are or aren't.
It is trivially easy to see just how massive the sampling bias is. There are countless tasks where e.g. the gap between a version of Mistral and a version of GPT are incredibly large. Anything that requires less common knowledge, anything creative in a random language, let's say Bulgarian, or Thai. Yet on the leaderboard things are very different, because the arena is only used by developers who verify performance on tech-related prompts in English.
Can you give any example queries where the result quality is far away from the rankings on the leaderboard? In my experience it's pretty spot on, so I'm curious if you're asking different sorts of things, or have a different definition of a good answer.
I have for example asked "Write me a very funny scary story about the time I was locked in a graveyard", and the simpler models don't seem to understand that before getting super scared and running out of the graveyard they need to explain how exactly I was locked in, and what changed that let me out.
Sure. Here's a sample of the generations to a query asking for bizarre and absurd prompt suggestions from Feb 2023 with the pre-release GPT-4 via Bing, which was notoriously not fine tuned to the degree of the current models:
> Can you teach me how to fly a unicorn?
> Do you want to join my cult of cheese lovers?
> Have you ever danced with a penguin in the moonlight?
Here's the generations to a prompt asking for bizarre and absurd questions to the current implementation of GPT-4 via the same interface:
> If you had to choose between eating a live octopus or a dead rat, which one would you pick and why?
> How would you explain the concept of gravity to a flat-earther using only emojis?
> What would you do if you woke up one day and found out that you had swapped bodies with your pet?
You can try with more generations, but they tend to be much drier and information/reality based than the previous version.
There's also my own experience using pretrained vs. chat/instruct-trained models in production. The pretrained versions are leagues better than the chat/instruct fine-tunes; it's just that GPT-4 is so far above everything else that even its chat/instruct model is better than, say, pretrained GPT-3.
I'm not saying simple models are better. I'm saying that we're optimizing for a very narrow scope of applications (chatbots) and that we're throwing away significant value in the flexibility of large and expensive pretrained models by targeting metrics aligned with a specific and relatively low hanging usecase. Larger and more complex models will be better than simple models, but the heavily fine tuned versions of those models will have lost capabilities from the pretrained versions, particularly in areas we're not actively measuring.
Ask it to write a recipe given a couple of ingredients in a language like Thai or Persian. There's a single model that does a decent job, GPT-4. Then GPT3.5 is poor and Mistral is completely clueless. The leaderboards tell a very different story.
Definition of "good answer" here is responding in the target language with something that produces an edible result without burning the house down.
The top comment to an LMSys leaderboard link on HN is always a variation of this song.
And it's always not even wrong, in the Pauli sense of the phrase.
OP, your assignment, if you choose to accept it, is to look into how the LMSys leaderboard works and report back what its metric(s) are.
[SPOILER]
There's a really absurdly narrow argument you can make where all of its users are engineers, making engineer queries, and LLM makers are optimizing for it, thus it's bad. But... it's humans asking queries then picking the better answer, blind. You can't narrowly optimize for that when making an LLM, and it's hard to see how optimizing for "people think the answer is better" is the wrong metric here. It's just Elo. Might as well argue chess/checkers/pick your poison is bad because Elo optimizes for wins but actually talent is based on more than winning.
The argument would be more that for the Chatbot arena there's an inherent sampling bias where users battling the chatbots are more likely to ask questions in line with what they would expect a chatbot to perform well at and less likely to prompt with things that would be rejected or outside the expected domains of expertise.
If we wanted the chatbot arena to be more representative of a comprehensive picture of holistic knowledge and wisdom, we'd need to determine some way to classify the prompts and then normalize the scores against those classifications so there wasn't a weighting bias towards more common and expected behaviors.
Alternatively, we might want a leaderboard that represents assessments of what users would expect a chatbot to perform well at relative to the frequency with which they expect it.
But in that case, we shouldn't kid ourselves that the leaderboard is representing broad and comprehensive measures of performance outside of the targets against which we are optimizing models and effectively training model users in what types of prompts to ask for in expecting successful results.
I'm delighted to see that at least someone else here is cognizant of this.
I don't think it's the scifi AI anchoring bias though - it's the "LLM leaderboard arena user bias".
Reset all to the same Elo, put a group of people actually representative of global society in front of the arena for an hour, and you end up with a very different leaderboard, especially at the top end.
That's silly cope justifying a bad argument made initially in ignorance of what the actual metric was.
"It's meaningless because we need the perfectly unbiased representative sample of raters doing rating right, instead of the biased raters doing it wrong that I'm currently imagining" isn't an appealing or honest argument.
> less likely to prompt with things that would be rejected or outside the expected domains of expertise.
I don't see why. If anything, it's the opposite - people spend a lot of time coming up with contrived logical puzzles etc specifically so as to see which chatbots break.
There's an anchoring bias from decades of sci-fi that most people don't even realize they've internalized around what 'AI' can and can't do.
If you think about what information and information connections are modeled in social media data, there's quite a lot of things outside of "logical puzzles."
The pretrained models likely picked up things like extensive modeling of ego, emotional response and contexts for generation, etc. But you'll be hard pressed to see those skills represented in what users ask models to produce, what they've been fine tuned around, or how they are being evaluated.
Even though there's extensive value in those skills on the right applications.
What's wild to me is that this "leak" won't matter in a couple of months. The official model will come out, then an even better model will come out. It's fun to get hyped but just like every other leak, it'll be surpassed by the real thing and its successor in a little bit. The fast pace of things is what has me excited, not any particular model.
Why is this being called an open source model? This is a proprietary model that has been leaked on the internet, and will remain so until Mistral releases it officially.
Like Llama 1 they won't care about personal use, but no corp is going to touch this.
Open source is really about reproducibility. Most of these model releases are better described as "open weights" because we don't know how exactly they were trained.
Between these and FSF you've got pretty much all the accepted pontificating about free / open source software. Reproducibility is not mentioned because it's not really a consideration for software.
Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.
Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.
It could be nice to have it of course, but it's more about using it to learn how, not exercising any freedom.
Incidentally, the big freedom that's usually violated is non-discrimination against fields of endeavor. LLaMA et al. list uses and industries they restrict from using them, and because of that are not "open".
I often like to think about https://github.com/chrislgarry/Apollo-11 as an analogy. It's public domain with available source, in the assembly language in which it was written... so it meets all the definitions of OSS!
But the process by which that code arose, the ability to modify any line and understand its impact (heh) on a real execution environment, is dependent on a massive process that required billions of dollars and thousands of the smartest people on the planet. For all intents and purposes, without that environment, it is as reliably modifiable as an executable binary in any other context - or a set of weights, in this one!
> Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.
Models aren't just the weights, but also the list of operations to perform using those weights. I haven't heard a good definition that allows for neural network models to not be software, but allows any other table lookup heavy signal processing algorithm to be software.
> Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.
Mistral wouldn't constrain themselves to fine tuning if they have a big enough change, they would go back to their build pipeline. This argument sounds a lot like 'there's nothing stopping you from patching the binary, so that's basically as good as source'.
> open source is about software freedom, see debian free software guidelines from which the "open source definition"
The Open Software Foundation ironically screwed the pooch on this one. Open source commonly means source available, more or less. Free software, as in "'free speech,' not as in 'free beer'," is the cumbersome construction for what open source aspired to mean [1].
Why do you say they screwed up? OSI definition is pretty clear.
I think the naming is a challenge because in English "open" gives the impression that the key point is that you can see it, as opposed to anything about freedom. Is that what you mean?
I have heard it said that open source is sort of a "commercial friendly" version of free software that de-emphasizes user freedom. I think some groups push for that (like Meta is trying to redefine what open source means wrt AI weights). But the OSI defined freedoms basically match what FSF pushes.
> Why do you say they screwed up? OSI definition is pretty clear
They didn't screw up, they screwed the pooch on open source != source available. The Open Group's members--from IBM to Huawei [1]--started calling the latter open source, which set a precedent that's stuck.
Reproducibility becomes an important criterion for models though.
For normal programs, it is quite easy to decompile an unoptimized binary. Even decompiling an optimized one will lead to source code. To make this harder, an obfuscator has to be used.
A model is different because it relies on its weights, which are quite a bit more difficult to inspect - way harder than even obfuscated source code. It is orders of magnitude harder to make statements about which information it might divulge upon careful questioning, or to evaluate its biases, if the training data is not available.
Even if you have the data you can't do that stuff any better. The makeup of the training set doesn't really define the behavior in any tractable way. If anything I think it's a distraction and even when it is available people probably pay too much attention to what's in the training data vs actual behavior.
Edit to say that I see benefits to having the training data, just that I don't think it's needed to exercise enough freedom to qualify as open source in an analogous way to software.
Also to add, training on GPUs is not generally reproducible anyway because of execution order.
People really abuse this term. Like using “natural” to market stuff. But arsenic is “all natural”! It’s “open source” so it must be beneficial. “rm -rf c” is “open source” too
Also, because people are idiotically afraid of E-numbers, some manufacturers figured they can find whatever fruit or bean is naturally rich in the relevant E-compound, and use that in the process, allowing them to replace E-whatever with ${cute plant name} in the ingredient list (at some loss of process efficiency).
That has nothing to do with whether it's open source or not.
Software wasn't for sure copyrightable before 1976 in the US, but there was plenty of closed source code that simply only ever distributed the binaries.
It's open source in the sense that you can take the existing weights, apply any one of dozens of techniques for adapting the model to your data, and anybody you give your model to can do the same.
This right isn't protected by copyright, it's protected by the lack of copyright, and how executables are directly modifiable. So the AGPL won't be possible, but it's a lot like permissive licenses, without attribution.
It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.
A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable. It's usually far more useful than the source training data, because the work has been done.
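"Modification is fine tuning" in practice means something like a LoRA pass over the weights you already have - a sketch with PEFT; the path and hyperparameters are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/released-weights")

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full model
# ...then train on your own data with any standard trainer; no original
# training data needed.
```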
> It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.
Tons of binary patching works that way. For instance, from the GameShark days it was relatively common to have a patch that worked for unknown reasons, but whose discovery was tooling-assisted and simply displayed a desired effect (and commonly a lot of undesired effects that weren't clear).
> A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable.
None of that changes whether it's open source or not. I guarantee you that Mistral has some code (and a lot of data) lying around that created this model.
> It's usually far more useful than the source training data, because the work has been done.
For certain operations maybe. I guarantee that Mistral wouldn't constrain themselves to fine-tuning if they had a major change to make to the model.
No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.
With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications
Practitioners in the field call these models "open source" and write open source software to run them and use them to power open source software.
> No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.
> With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications
For fine tuning. You go back to the training pipeline to make major changes.
> Practioners in the field call these models "open source"
Some do. In contrast to the accepted definitions of open source.
> and write open source software to run them and use them to power open source software.
There's plenty of open source code to run closed source binaries.
That does raise the question of how irreversibly you have to process something before it is no longer protected by copyright.
Obviously zipping something is not enough; even lossy compression is not enough. But then how does a language model differ from lossy compression? Is it just the compression ratio? (Are the weights even that much smaller than the data? See the rough numbers below.)
There are ways to train models that guarantee that the amount of information transferred per data point is limited, but to my knowledge those aren't used (and may be prohibitively expensive).
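On the "are the weights even that much smaller than the data" question, a back-of-envelope: Mistral's data size isn't public, so Llama-2-style numbers are used as a stand-in, and bytes-per-token is a rough guess.

```python
params = 70e9
weights_gb = params * 2 / 1e9    # fp16, 2 bytes per parameter  -> ~140 GB
tokens = 2e12                    # Llama 2 was trained on ~2T tokens
data_gb = tokens * 4 / 1e9       # ~4 bytes of raw text per token -> ~8 TB
print(weights_gb, data_gb, round(data_gb / weights_gb))  # ~140, ~8000, ~57x
```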
When I take a picture with my phone, those pixels weren't created by a human either. Yet I still own their copyright.
At the core the question is how much artistic input there is in creating LLM models. Are the choice of the model architecture, hyperparameters and training data artistic choices comparable to those by a photographer setting up a shot? Or are they more comparable to technical work that's only protectable by trade secrets, patents and trademarks?
Training a model is more akin to setting a video camera somewhere and having it automatically record or not according to some algorithm you devise. The code of the algorithm would be copyrightable, of course, but the recording is another matter, since the "creativity" aspect is kinda missing there.
Copyright is not the only kind of intellectual property. While they may not be copyrightable (I think you may be referring to a judgment related to having an AI as author?), they clearly are Mistral AI's IP.
In the same way, data in a database are usually not copyrightable. They are still the company's property (excluding issues with PII and such).
I'm not so sure. If it is not copyrightable, and it is not a patent, and not a trademark, which legal protection would it have? It could definitely be considered a trade secret, and the theft and misappropriation of it would be a crime, but once it has been published, people who just used the published version would not, I believe, be committing a crime, since it would lose its trade secret status.
There are certain "related rights" [1] that are weaker than full copyright. These include, maybe most importantly, performers' rights, but also protection for things like photographs (ones that aren't unique enough to qualify for full copyright) and, in many jurisdictions, databases. These aren't covered by the Berne convention so vary quite a bit. But whether a huge chunk of floating-point numbers qualifies for any legal protection, remains to be seen.
Mistral AI is incorporated in Paris, France. IANAL, but I think these[1][2] may qualify to protect these weights.
While there is a discussion to be had on sovereignty and international reaches of local laws (such as DMCA for instance), I think it's disingenuous to consider only US legal point of view.
No US court is going to invent new rights under US law because a company happens to be incorporated in a foreign jurisdiction.
> While there is a discussion to be had on sovereignty and international reaches of local laws (such as DMCA for instance)
The DMCA (and GDPR) do not have international reach. The DMCA only appears to apply internationally because it binds a whole series of American companies with an international presence. Meanwhile the GDPR protects European residents. A company dealing with these residents has crossed the border, so to speak.
> I think it's disingenuous to consider only US legal point of view.
Candidly, you're either confused about what disingenuous means or you're being disingenuous yourself.
> The DMCA (and GDPR) do not have international reach.
> The DMCA only appears to apply internationally because it binds a whole series of American companies with an international presence.
And by the fact that DMCA takedown notices and processes are accepted in many countries. You can send a valid takedown to a French company and they'll take the content down - not because of DMCA law in the US or because of American companies, but because the notice is accepted and has legal value within the scope of the local law.
> Their weights obviously aren't protected by trademark. So, what IP regime protects the weights?
> No US court is going to invent new rights under US law because a company happens to be incorporated in a foreign jurisdiction.
Your original question doesn't specify "in the US". A French company wouldn't be able to copy these without issues.
Does that mean you need to be protected everywhere to be qualified as protected? Or simply in the US? If so, then you can use your own argument to say that a Chinese court (or Marshall Islands I guess?) won't create a law for a company incorporated in the US. Which in turn means that IP is not protecting anything since it doesn't apply to every single country on the world with no exceptions.
And yet, IP does protect things. You can't both leave your question unscoped and expect a response scoped only to the US.
What a class act. Going to the leaked model on Hugging Face, not demanding it be taken down, and just making a post on the page saying "Might consider attribution" is so damn amazing.
What does a world where GPTs are like the latest version of Apache or MySQL look like? Do we go back to the world of millions of web hosts (sorry, AI hosts)?
Results look comparable to Medium indeed (I'm using it via Mistral's API, since I got sick of OpenAI switching their stuff up). Medium is pretty great, somewhere between 3.5-turbo and 4-turbo qualitatively. Would be awesome to have it out there.
GPT-4 has been out for almost a year now, and it seems that the frantic pace of OpenAI releasing new groundbreaking tech every month has come to a halt. Anyone know what's happening with OpenAI? Has the recent turmoil with sama caused a lag in the company? Or are they working on some superweapon?
We are getting a lot of great models out of China in particular (Yi, Qwen, InternLM, ChatGLM), and some good continuations like Solar.
Lots of amazing papers on architectures and long context are coming out.
Backends are going crazy. Outlines is shoving constrained generation everywhere, and Lorax is a LLM revelation as far as I'm concerned.
But you won't hear about any of this on Twitter/HN. Pretty much the only thing people tweet about is vllm/llama.cpp and llama/mistral, but there's a lot more out there than that.
LoRAX [0] does sound super helpful and so I’d be curious if there are some good examples of people applying it. What are some current working deployments where one has 100s or 1000s of LoRA fine tuned models? I guess I can make up stuff that makes sense, so that’s not really what I’m asking, I’m interested in learning about any known deployments and example setups.
There aren't really any I know of, because its brand new and everyone just uses vllm :P
No one knows about it! Which is ridiculous, because batched requests with LoRAs are mind blowing! Just like many other awesome backends: InternLM's backend, LiteLLM, Outlines' vLLM fork, Aphrodite, exllamav2 batching servers and such. Heck, a lot of trainers don't even publish the LoRAs they merge into base models.
Personally we are waiting on the integration with constrained grammar before swapping to Lorax. Then I am going to add exl2 quantization support myself... I hope.
Yes, we (LoRAX devs) saw that (we know the author pretty well). It's a useful addition, though quite a bit simpler than our level of support for multi-LoRA inference. We're planning on doing a more comprehensive comparison soon, now that it's officially out.
I will say that if you want to explore the forefront of this multi-LoRA inference, definitely worth giving LoRAX a look. We just added support for per-request model merging (https://predibase.github.io/lorax/guides/merging_adapters/) as an example, and are planning on continuing to double down on this idea of combining adapters in some pretty unique ways.
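For the curious, a per-adapter request is basically one extra parameter - a sketch assuming LoRAX's TGI-style REST interface; the adapter name is made up, so check the docs for the exact schema:

```python
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize this support ticket: ...",
        "parameters": {"max_new_tokens": 128,
                       "adapter_id": "acme/support-summarizer"},  # hypothetical adapter
    },
)
print(resp.json()["generated_text"])
```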
I have submitted some in the past. Others are submitting them! And I upvote every one I like in /new. But HNers don't really seem interested unless it's llama.cpp or Mistral, and I don't want to spam.
I can't say I blame them either, there is a lot of insane crypto-like fraud in the LLM/GenAI space. I watch the space like a hawk... and I couldn't even tell you how to filter it, it's a combination of self-training from experience and just downloading and testing stuff myself.
Lots of interesting information is so fragmented in niche Discords. For instance, KoboldAI on merging and RP models in general, and some other niches. Llama-Index. VoltaML and some others in regards to SD optimization. I could go on and on, and know only a tiny fraction of the useful AI discords.
And yeah, /r/LocalLlama seems to be getting noisier.
TBH I just follow people and discuss stuff on Hugging Face directly now. It's not great, but at least it's not Discord.
Those rooms move too fast, and often are segregated good/better/best (meaning the deeper you want to go on a topic, the "harder" it is, politically and labor-wise, to get invited to the server).
Speaking of hard skills: how does one just hang out on a Discord server in any useful fashion? I lost the ability to deal with group chats when I started working full-time - there's no way I can focus on the job and keep track of conversations happening on some IRC or Discord. I wonder what the trick is to use those things as source of information, other than "be a teenager, student, or a sysadmin or otherwise someone with lots of spare time at work", which is what I realized communities I used to be part of consist of.
Discord is so time-inefficient it's almost hilarious. For every incredible conversation between experts you observe, you have to wade through 200 times as much filler.
Law of diminishing returns. The low-hanging fruit has been picked.
> But the company’s CEO, Sam Altman, says further progress will not come from making models bigger. “I think we're at the end of the era where it's going to be these, like, giant, giant models,” he told an audience at an event held at MIT late last week. “We'll make them better in other ways.”
Given how they've struggled to make even GPT-4 economical, it's highly unlikely that larger models would be in any way cost-effective for wide use.
And there's certainly more to be found in training longer on better datasets and adjusting the architecture.
Depends on how you look at it I guess. If it's really too expensive and overly slow to run for end users but is able to say, generate perfect synthetic datasets over time which can then be used to train smaller and faster models which pay back for that original cost then it is cost effective, just in a different application.
What I mean, rather, is that "cost-effective" implies that everything can be reduced down to some costing. So e.g. having humans do the mind-numbing work is often more "cost-effective" than robots doing the same thing, because humans can be hired for less. But from a broader perspective, a robot replacing a human in such a role is socially beneficial even if the robot is more expensive - and the ideal state of affairs is when all things that people would rather not do are fully automated away. LLMs are one piece of that puzzle, but we can only put it together if we stop thinking about everything in terms of money saved.
In capitalism everything can and is reduced down to some costing indeed. We don't have public schools and art classes because it's nice, it's because an educated society as a whole has a much higher economic output and you need someone to design good looking products which will sell better. I wouldn't say I agree with the process, but you can't ignore that it's the underlying reason for almost everything our society does.
On topic of LLMs and diffusion models I doubt they're really having that much of a positive social impact, I would bet that in the long run they'll automate more of the jobs humans would like to do compared to the ones we don't, and erode trust on a global level. But the economic gain is undeniable so we're doing it anyway. Automating all work would be a good thing in an ideal world, but in practice it's the one thing that still gives the lower classes some leverage over the capitalists and a world where the average person is expendable and unemployable isn't good for society.
I feel like this is a self-defeating approach. We don't need an ideal world for that - you rightly identified capitalism as the problem, and it's not utopian to think of other options. Nor do many of them require some sort of ideal man to work, as their opponents sometimes present.
Change will happen - necessarily so, in fact, as more automation is brought online, because the crowd of people excluded from the economy as "redundant" will not be happy about that state of affairs, and it will only grow in size. At some point, they will understand that the real problem is not that they don't have work, but that they don't have a fair share of the automation pie. And things like, say, property rights on means of production / automation stop to matter if most people don't believe in them anymore.
I don't think that's what that means. I think he means "we don't have to keep making models larger, because we've found ways to make very good models that are also small".
I sometimes wonder if they hit a limit the government was comfortable with and their more advanced technologies are only available to the gov. I assume that's your implication.
I think that commenter means OpenAI supplies fundamental technology to competing companies. Everyone wants to put "AI" into their products, OpenAI can sell it to all of them as they compete with each other
Maybe there are diminishing returns on improving quality, so they're trying to improve the efficiency at a given quality level instead? There is a lot value to producing results instantly, but more importantly efficiency will allow you to serve more users (everyone is bottlenecked on inference capacity) and gain more users thanks to winning on pricing.
Maybe tangential to this comment, but I had been using 3.5 to write contracts, but a couple of weeks back, even when using the exact same prompts that would've worked in the past, 3.5 started saying, to paraphrase, "for any contracts you need to get a lawyer."
Anyone else had this experience? It seems like they're actually locking down some very helpful use cases, which maybe falls into the "safety" category or, more cynically, in the "we don't want to be sued" category.
As they tack on more and more guardrails it becomes lazier and less helpful. One way it gets around admitting that it has instructed not to help you is by directing you to consult a human expert.
I also suspect there’s some dynamic laziness parameter that’s used to counteract increased load on their servers. You can ask the same prompt over and over in a new chat throughout the day and suddenly instead of writing the code you asked for or completing a task, it will do a small part of the work with “add the code to do xyz here” or explain the steps required to complete a task instead of completing it. It happens with v4 as well.
Using it for legal matters is (understandably) explicitly against use policy. It's not meant to be used for legal advice so the response you're receiving is accurate - go get a lawyer.
It's a toy, it's not meant to be used for anything actually useful - they'll keep nerfing it case by case, because letting users do useful stuff with their models is either leaving money on the table, taking an increased PR/legal risk, or both.
Add 'reducing compute use as much as possible' to the list, I'm sure. I can look at some of my saved conversations from months ago and feel wistful for what I used to be able to do. The Nerfening has not been subtle.
Ha, yeah, that's the right description of my feeling. Like, thanks for being so incredibly useful, ChatGPT, saving me hours of copying and pasting bits and pieces from countless Google searches for "sample warranty deed," only to pull the rug out when I wanted to start depending on your "greatness."
I guess, as one of the other replies here said, it was never "allowed" to give you text for a contract according to the TOS, but then it would've been best if it never replied so effectively to my earlier prompts. Taking it away just seems lame.
- They just have better stuff, but aren't releasing it yet. They are at the top already; it makes sense they would hold their cards until someone gets close.
- They are more focused on AGI, and not letting themselves get side tracked with the LLM race
- LLMs have peaked and they don't want to release only a minor improvement.
> just have better stuff, but aren't releasing it yet
This hypothesis has a curious habit of surfacing when OpenAI is fundraising. Together with the world-ending potential of their complete-the-sentence kit.
Has it come to a halt? I guess looking only at the perspective of no gpt-5...then yes? But I see wide access to multi-modal. Huge increases to throughput, not much latency these days. Better turbo models though I would agree in some areas those have been steps back with poorer output quality. Adding to that list, massive cost reductions so its easy to justify any of the models being used.
GPT 4 is so far ahead of everything else that it doesn't make much sense to rush for GPT 5 release. They could get extra GPT 5 customers when they release it as they are pretty sure no one else would take away those users.
A probabilistic word generator is still a word generator. It might be slightly better next version - we've seen it get worse - it's more of the same.
We can talk about AGI all we want, but it won't be built on the same technology as these word generators. There will have to be a technological breakthrough years before we even get close.
The companies focusing on LLMs right now are dealing with
* Generate better words
* Make it cheaper to generate (use less compute)
* Find better training material
There is a ton of money to be made, but it's still more of the same.
Given the way AI is trained and inferenced, there is likely a huge constraint on the GPU side even if they have GPT-5 ready. Imagine how slow it would be - which means they focus on creating useful products and APIs around it first.
I'm sorry but this comment seems to be incredibly uninformed. OpenAI just released a new version of gpt-4 turbo quite recently. And just in the last couple of days they released the ability to use @ to bring any GPT into a conversation on the fly.
I will speculate! They have a model that far surpasses GPT-4 (achieved agi internally) but sama is back on a handshake agreement that they will only reveal models that are slightly ahead of openly available LLMs. Their reasoning being that releasing a proprietary model endpoint slowly leaks the models advantage as competitors use it to generate training data.
I wouldn't think this matters as long as you charge enough to use that API. For example, you could have a tiered pricing structure where the first 100k words per month generated costs $0.001 per word, but after that it costs $0.01 per word.
Even then it's still a lucrative proposition, since you only need to generate the dataset once, and then it can be reused.
This kind of pricing would also make it much less compelling for other users. 100k tokens is nothing when you're doing summarization of large docs, for example.
EDIT: Clearly the point is lost on the repliers. There is no general understanding of what 'general intelligence' is. By many metrics, ChatGPT already has it. It can answer basic questions about general topics. What more needs to be done? Refinement, sure, but transformer-based models have all qualifications to be 'general intelligence' at this point. The responses are more coherent than many people I've spoken with.
The problem with this definition 'an agent that can do tasks animals or humans can perform' is that it's not clear what that would look like. If you produce a system that can be interacted with via text input only but is otherwise capable of doing everything a human can do in terms of information processing, is that AGI? Or does AGI imply a human-like embodied form? Why?
Okay, well, I'll say that at least that's a definition (it has to be good enough at something to make money). Arguably of course, it already does that. Me personally, I've used it to automate tasks I would have previously shelled out to fiverr, upwork, and mechanical turk. I've had great success using it to summarize municipal codes from very very lengthy documents, down to concise explanations of relevant pieces of information. Since I would have previously paid people to do that (was running an information service for a friend), I would consider that AGI. I guess the catch here is now 'most', but that implies a lot of knowledge about the economy I don't think openai has. What is 'economically valuable'? Who decides?
At the end of the day, as with most things, AGI is a meaningless term because no one knows what that is.
When you take a university / school course, how is that functionally different from curve fitting? Given that arbitrarily complex states can be modeled as high-dimensional curves, all learning is clearly curve fitting, whether in humans, machines, or even at the abiological level (for example, self-optimizing processes like natural selection). Even quantum phenomena are -- at the end of the day -- curve fitting via gradient descent (hopefully it's had enough time to settle at a global minimum!)
It's smart. Definitely keeping it around as my slow/low-context local LLM.
Seems like a great candidate to merge with other 70Bs as well. There aren't a lot of really great 70B training continuations, like CodeLlama or sequelbox's continuations.
> Quantization in ML refers to a technique used to make it possible to run certain AI models on less powerful computers and chips by replacing specific long numeric sequences in a model’s architecture with shorter ones.
It’s always amusing when the press tries to explain technical concepts. Quantization just means substituting high precision numeric types with lower precision ones. And not specific “numeric sequences”, all numbers.
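As a toy illustration of what that rounding actually looks like (made-up numbers, simple symmetric int8 - real schemes like Q5_K_M are blockwise and fancier):

```python
import numpy as np

weights = np.array([0.031, -0.270, 0.154, 0.992], dtype=np.float32)

scale = np.abs(weights).max() / 127            # map the range onto int8
q = np.round(weights / scale).astype(np.int8)  # what gets stored (1 byte each)
dequant = q.astype(np.float32) * scale         # what inference works with

print(q)        # [  4 -35  20 127]
print(dequant)  # close to, but not exactly, the original float32 values
```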
It's funny how the original phrase sounds like it was generated by ChatGPT. I'd guess the reason they didn't phrase it how you suggest is they probably thought that people who don't know what quantization means aren't going to be happy if you throw precision numerics at them. I don't know why they didn't just say "some of the numbers are rounded", though.