
The least speculative: PPUs will be converted from capped-profit units to unlimited-profit equity shares, to the benefit of PPU holders and at the expense of OpenAI the nonprofit. This is why they are doing it.

You missed the part where OpenAI the nonprofit gives away the difference in value between capped-profit PPUs and unlimited-profit equity shares, enriching current PPU holders at the expense of the nonprofit. Surely this is illegal.

Yes and no. It sounds like the capped-profit PPU holders will get to convert their units 1:1 into unlimited-profit equity shares, which are obviously far more valuable. So the nonprofit loses enormously in this move, and all current investors and employees make a huge amount.

But these rumors had been circulating and talked about for several days, and no big options trade was made before the actual day of the announcement. That's why it's telling.


Meta will most likely compare against it when they release the upcoming Llama 4 reasoning model.


And the practical FLOPS always end up lower. As an example, a V100 has 125 TFLOPS according to spec, but the ideal case is more like 100 and the non-ideal case more like 60.
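
Here's roughly how that gap usually gets measured, if anyone wants to try it on their own card. This is just a rough PyTorch sketch assuming a CUDA GPU; matrix size, dtype, and clocks all move the number around:

    # Time a large half-precision matmul and compare achieved TFLOPS
    # against the datasheet figure (e.g. 125 TFLOPS for V100 tensor cores).
    import time
    import torch

    n, iters = 8192, 50
    a = torch.randn(n, n, device="cuda", dtype=torch.half)
    b = torch.randn(n, n, device="cuda", dtype=torch.half)

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        c = a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - start

    flops = 2 * n ** 3 * iters  # multiply-adds in an n x n x n matmul
    print(f"achieved: {flops / elapsed / 1e12:.1f} TFLOPS")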


That’s with sparsity. So it’s 29% down from 40%.


Can you export/download the models after training?


Not yet, but since people seem to want this, it's at the top of our roadmap. If you train a model now and want to get the weights, just message us and we'll give them to you ;)


Is this some joke? They use Llama 2 7B? What year is it?


The best model is the one you can fit in memory.

Almost as soon as GPT-4 came out, I said that OpenAI was doomed on the trajectory it was on, because they could not afford to develop a GPT-5, GPT-6, etc.

Real innovation comes out of doing a lot of experiments, and that means doing experiments quickly with the resources you have. So you do most of your experiments with non-frontier models, enough to make a good prediction of what would happen if you maxed out your model size, and then you go big. That's how you make everyone else have a "DeepSeek moment".

A company like Apple wants to pick something on the frontier and keep advancing on a straight line. Works great if you want to make an M1, M2, M3, ... ARM chip but that's not how progress works in AI today.


Will we see models built on b-trees to deal with memory requirements? Have we already?


Deepseek is already using SSDs for their KV cache: https://github.com/deepseek-ai/3FS


You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that's part of the model, used to reduce the quadratic compute complexity of self-attention to linear. It is not stored on SSD; it lives in VRAM (or in system RAM if you're not using a GPU).
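
For anyone unfamiliar, here's a minimal single-head sketch of what that cache is (illustrative PyTorch, not any particular implementation): each decoding step appends one key/value pair and attends over the cached prefix, instead of recomputing attention over all previous tokens from scratch.

    import torch

    d = 64                     # head dimension (illustrative)
    k_cache, v_cache = [], []  # grows by one entry per generated token

    def decode_step(q_t, k_t, v_t):
        # q_t, k_t, v_t: tensors of shape (d,) for the current token
        k_cache.append(k_t)
        v_cache.append(v_t)
        K = torch.stack(k_cache)        # (t, d)
        V = torch.stack(v_cache)        # (t, d)
        scores = (K @ q_t) / d ** 0.5   # one score per cached token
        weights = torch.softmax(scores, dim=0)
        return weights @ V              # attention output, shape (d,)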


They do, in fact, mention the inference KV cache as a use case in the readme. The most advanced KV caching uses a hierarchy of GPU RAM / regular RAM / SSD. It seems they were able to use their storage abstraction for the last tier.
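
Conceptually something like this (a hypothetical sketch, not 3FS's actual API): hot caches stay on the GPU, warm ones get demoted to host RAM, and cold ones are spilled to SSD-backed storage and promoted back on demand.

    import os
    import torch

    class TieredKVCache:
        def __init__(self, ssd_dir="/tmp/kvcache"):
            self.gpu = {}          # request_id -> KV tensors on GPU
            self.cpu = {}          # request_id -> KV tensors in host RAM
            self.ssd_dir = ssd_dir
            os.makedirs(ssd_dir, exist_ok=True)

        def evict_to_cpu(self, request_id):
            self.cpu[request_id] = self.gpu.pop(request_id).cpu()

        def evict_to_ssd(self, request_id):
            torch.save(self.cpu.pop(request_id), f"{self.ssd_dir}/{request_id}.pt")

        def fetch(self, request_id):
            # Promote back to GPU from whichever tier currently holds the cache.
            if request_id in self.cpu:
                self.gpu[request_id] = self.cpu.pop(request_id).cuda()
            elif request_id not in self.gpu:
                self.gpu[request_id] = torch.load(f"{self.ssd_dir}/{request_id}.pt").cuda()
            return self.gpu[request_id]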


https://github.com/deepseek-ai/3FS?tab=readme-ov-file#3-kvca...

KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The top figure demonstrates the read throughput of all KVCache clients (1×400Gbps NIC/node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s


That's because DeepSeek uses MLA which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA AFAIK.


Any model allows offloading the KV cache; it's not a matter of model architecture, only of the implementation. The only somewhat different case is non-transformer models. For all attention models it's the same: a blob of data per token. It's much worse for older models with MHA because their KV cache is just too big, and it's better for DeepSeek because their KV cache is the smallest. But it's fine for the current generation of GQA models as well.


Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you have to dynamically page all the KV cache entries into the GPU, which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache and doesn't necessarily attend to every KV token at every step, which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).


Offloading in this case doesn't mean keeping the KV cache on disk / in storage all the time; it means keeping it there when the request isn't in the process of generating. While a request is being generated, its KV cache is indeed in VRAM.

As for MLA: DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of storing the actual KV entries it stores lower-dimensional KV entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of just duplicating KV entries by the group size, it applies a linear transformation.
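
To make that concrete, here's a toy comparison of per-token cache footprints (shapes are illustrative, not DeepSeek's actual dimensions):

    import torch

    d_head, n_kv_heads, d_latent = 128, 8, 512

    # GQA: cache one full K and one full V per KV head per token.
    gqa_values_per_token = 2 * n_kv_heads * d_head   # 2048 values

    # MLA: cache only a compressed latent per token...
    mla_values_per_token = d_latent                  # 512 values

    # ...and expand it on the fly when attention is computed.
    W_uk = torch.randn(d_latent, n_kv_heads * d_head)  # up-projection for K
    W_uv = torch.randn(d_latent, n_kv_heads * d_head)  # up-projection for V

    latent = torch.randn(d_latent)   # this is what sits in the cache
    k = latent @ W_uk                # reconstructed keys for this token
    v = latent @ W_uv                # reconstructed values for this token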


Ah OK. So this is for resuming chat context cheaply. What I said is still correct: 3FS is not part of the inference flow and not relevant to the paper, which is about optimizing KV cache usage at runtime.


I mean, there are other, better 7B models than Llama 2 at this point.


If they had experimented using a newer model (gemma 3, deepseek-1 7b, etc.) and reported better results, would that be because their newer baseline model was better than the llama 2 model used in the previous methods' experiments? A more comprehensive study would include results for as many baseline models as possible. But there are likely other researchers in the lab all waiting to use those expensive GPUs for their experiments as well.


Sure. But papers take a really long time to write and go through peer review. I think my paper on collaborative editing took about 4 months from the point where we were done writing to the point at which it appeared on arxiv.

This research was almost certainly done well before Gemma 3 and Deepseek were released.


> Is this some joke? They use Llama 2 7B? What year is it?

They use Llama 2 to demonstrate that their compression method works. There are two potential cases:

1. The method works on all / most LLMs. In this case, it does not matter on which model they demonstrated the effect.

2. The method only works on Llama 2, but not on other models. Given that they published the code, I expect people will quickly test the method on many other models, so we will know soon. And yet, there would be scientific significance even if it works only on Llama 2, as it would mean there's something special about that architecture.

But I would bet it's #1: the method works on most models, and they just picked whatever they already had code bindings for, to save effort.


Classic: Oracle denying a breach despite clear evidence.


This is the way.

Deny, deny, deny. Those who have already drunk the Kool-Aid will believe your denial. Those who are too lazy to look, or who only get their info from one source, will never know anything other than your denial. The rest can be dismissed as wrong for being in opposition anyway.

It works anywhere, as long as you are a large enough entity.


Responding to a person with a non-company email... eek.


They attempt to admit something to key customers, but they won't do it on letterhead!

https://arstechnica.com/security/2025/03/oracle-is-mum-on-re...

Look for them to sue any messengers shortly.


The sailboat races are on schedule, however.


They sponsor a fast car too.

