Deepseek R1-0528

jacob019 · 2025-05-28T22:31:27 1748471487

Well that didn't take long, available from 7 providers through openrouter.

https://openrouter.ai/deepseek/deepseek-r1-0528/providers

May 28th update to the original DeepSeek R1 Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.

Fully open-source model.

jazzyjackson · 2025-05-28T23:22:12 1748474532

No sign of what source material it was trained on though right? So open weight rather than reproducible from source.

I remember there's a project "Open R1" that last I checked was working on gathering their own list of training material, looks active but not sure how far along they've gotten:

https://github.com/huggingface/open-r1

pradn · 2025-05-29T01:40:20 1748482820

Isn't it basically not possible for the input data set list to be listed? It's an open secret all these labs are using immense amounts of copyrighted material.

There's a few efforts at full open data / open weight / open code models, but none of them have gotten to leading-edge performance.

ratamacue · 2025-05-29T18:49:47 1748544587

My brain was largely trained using immense amounts of copyrighted material as well. Some of it I can even regurgitate almost exactly. I could list the names of many of the copyrighted works I have read/watched/listened to. I suppose my brain isn't open source, although I don't think it would currently be illegal to take a snapshot of my brain and publish it if the technology existed and open-source that. Granted, this would only be "reproducible" from source if you define the "source" as "my brain" rather than all of the material I consumed to make that snapshot.

ljosifov · 2025-05-30T00:51:21 1748566281

:-) I like the symmetry of this. If I want to keep my creations outside the hands of others, I can keep them private. I don’t have to publish these words or broadcast them to the world. I could write this on my laptop, save it in a file, and keep it to myself. Fine.

However, once these words are broadcast—once they’re read, and the ideas expressed here enter someone else’s mind—I believe it’s only fair that the person on the receiving end has the right to use, replicate, or create something from them. After all, they lent me their brain—ideas that originated in my mind now live in theirs.

This uses up their mental "meat space," their blood sugar, and their oxygen—resources they provide. So, they have rights too: the right to do as they please with those ideas, including creating any and all data derived from them. Denying them that right feels churlish, as if it isn’t the most natural thing in the world.

(Before people jump on me:- Yes, creators need to be compensated—they deserve to make a living from their work. But this doesn’t extend to their grandchildren. Copyright laws should incentivize creation, not provide luxury for the descendants of the original creator a century later.)

overfeed · 2025-05-29T19:22:43 1748546563

> Some of it I can even regurgitate almost exactly

If you (or any human) violate copyright law, legal redress can be sought. The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances.

There are many other differences between humans and AI in terms of capabilities and motivations to f the legal persons making decisions.

ljosifov · 2025-05-30T10:37:06 1748601426

You may be right about the damage (will not dispute it even if I personally doubt it) - but what about the amount of good that it can do too? When deciding "what is to be done now" under uncertainty, we typically look at both sides of the ledger, the upsides in addition to the downsides.

Assume for a moment, that the current AI is teaching us that compute transforming data → information → knowledge → intelligence → agency → ... → AGI → ASI, is all there is to Intelligence-on-Tap? And imagine an AI path opens to AGI now and ASI later, where previously we didn't see any. Seems a bad deal to me, to frustrate, slow down, or even forego the 2050-s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value, the way the Industrial Revolution did in the 1800-s. And we are to forego this, for what - so that we provide UBI to Disney shareholders? Every one of us is richer, better off now, than any king of old. Not too long ago, even the most powerful person in the lands could not prevent their 17 miscarriages/stillbirths/child_deaths failing to produce an heir to ascend the throne (a top priority that was, for sure for a king+queen). So in our imagined utopia, even the Disney shareholders are better off than they would be otherwise.

overfeed · 2025-05-30T19:02:05 1748631725

> Seems a bad deal to me, to frustrate, slow down, or even forego the 2050-s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value...

Why do you assume the emergence of a super intelligence would result in human wealth increasing instead of decreasing? Looking at how humans with superior technology used it to exploit fellow humans throughout history should give you pause. Humans don't care about the aggregate "dog wealth" - let alone that of ants.

ljosifov · 2025-05-30T20:23:25 1748636605

I'm assuming the Intelligence Revolution, multiplying Human Intelligence with machines, will have the same effect as the Industrial Revolution had, on multiplying human physical strength. That multiplied the GDP by a factor of ~20 times, hockey stick like, in a fairy short time, a century or two.

thatcat · 2025-05-31T04:33:46 1748666026

The industrial revolution was powered by natural resources that it helped unlock. What value reserve will ai tap into to create hockey stick growth?

ljosifov · 2025-05-31T09:34:05 1748684045

It will recombine the existing resources in new ways. Neanderthals had access to exactly the same natural resources as we have now. Obviously we do much more with what we both got, then they ever did. Obviously it's not only the availability of some atoms or molecules, but what one does with them, how one recombines them in novel ways. For that one needs knowledge and energy. And the later mostly turns out can be derived from the the former too.

CamperBob2 · 2025-05-30T01:42:17 1748569337

The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances

But enough about whether it should be legal to own a Xerox machine. It's what you do with the machine that matters.

overfeed · 2025-05-30T08:00:56 1748592056

> It's what you do with the machine that matters.

The capabilities of a machine matter a lot under law. See current US gun legislation[1], or laws banning export of dual-use technology for examples of laws that have inherent capabilities - not just the use of the thing- as core considerations.

1. It's illegal to possess a new, automatic weapon with some grandfathering prior to 1986

ben_w · 2025-05-30T09:35:54 1748597754

While true, computers in general alreay had the ability to perfectly replicate data, hence blank media tax: https://en.wikipedia.org/wiki/Private_copying_levy

I think the reason for all the current confusion is that we previously had two very distince groups of "mind" and "mindless"*, and that led to a lot of freedom for everyone to learn a completly different separation hyperplane between the categories, and AI is now far enough into the middle that for some of us it's on one side and for others of us it's on the other.

* and various other pairs that are no longer synonyms but they used to be; so also "person" vs. "thing", though currently only very few actually think of AI as person-like

CamperBob2 · 2025-05-30T15:29:59 1748618999

Yes, but gun control and dual-use export regulations are both stupid. We need fewer tool-blaming laws, not more.

(Standing by for the inevitable even-goofier analogy comparing AI with privately-owned nuclear arsenals...)

3abiton · 2025-05-29T10:28:02 1748514482

The only way this would work is with "leaks". But even then as we saw with everything on the internet, it just added another guardrail on content. Now I can't watch youtube videos without logging in, and nearly every website I need to solve some weird ash captchas. It's becoming easier to interact with this chatbots rather than search for a solution online. And I wonder with Veo 4 copy cats, it might be even easier to prompt for a video rather than search for one.

prmoustache · 2025-05-29T05:08:40 1748495320

That doesn't mean it isn't possible.

bee_rider · 2025-05-29T05:06:32 1748495192

“Not possible” = “a business-destroying level of honesty”?

rcxdude · 2025-05-29T08:11:49 1748506309

Even if training on the copyrighted material is OK, just providing a data dump of it almost certainly is not.

alpaca128 · 2025-05-29T08:38:10 1748507890

No need for a data dump, just list all URLs or whatever else of their training data sources. Afaik that's how the LAION training dataset was published.

anonymoushn · 2025-05-29T09:32:45 1748511165

providing a large list of bitrotted URLs and titles of books which the user should OCR themselves before attempting to reproduce the model doesn't seem very useful.

echoangle · 2025-05-29T10:39:46 1748515186

Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.

Wowfunhappy · 2025-05-29T13:40:10 1748526010

...no? They also use web crawlers.

bee_rider · 2025-05-29T19:48:33 1748548113

The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?

Wowfunhappy · 2025-05-30T00:07:43 1748563663

Why would you store the data after training?

bee_rider · 2025-05-30T00:30:29 1748565029

Are you saying that you know they don’t store the data after training?

I’d just assume they did because—why scrape again if you want to train a new model? But if you know otherwise, I’m not tied to this idea.

Wowfunhappy · 2025-05-30T14:20:59 1748614859

I'm also assuming. But I would ask the opposite question: why store all that data if you'll have to scrape again anyway?

You will have to scrape again because you want the next AI to get trained on updated data. And, even at the scale needed to train an LLM, storing all of the text on the entire known internet is a very non-trivial task!

anonymoushn · 2025-05-31T05:58:03 1748671083

If you try to reproduce various open datasets like fineweb by scraping the pages again, you can't, because a lot of the pages no longer exist. That's why you would prefer to store them instead of losing the content forever.

It's not "all of the text", it's like less than 100 trillion tokens, which means less than 400TB assuming you don't bother to run the token streams through a general purpose compression algorithm before storing them.

tokioyoyo · 2025-05-29T05:17:17 1748495837

There is a "keep doing what you're doing, as we would want one of our companies to be on top of the AI race" signal from the governments. It could've been stopped, maybe, 5 years ago. But now we're way past it, so nobody cares about these sort of arguments.

behnamoh · 2025-05-29T00:11:55 1748477515

> No sign of what source material it was trained on though right?

out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..

marci · 2025-05-29T04:56:18 1748494578

When you're trully open source, you can make ethings like this:

Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.

https://allenai.org/blog/olmotrace

kreijstal · 2025-05-29T06:22:14 1748499734

you can do these same, except you would need to be a pirate website. It would even be better. except illegal. but it would be better.

marci · 2025-05-29T11:41:28 1748518888

That is why the others can't provide stuff like this. RAG/Hallucination check. I just wish Allen.AI models had bigger context, 4k is too small nowadays.

ToValueFunfetti · 2025-05-29T00:16:55 1748477815

Would be useful for answering "is this novel or was it in the training data", but that's not typically what the point of open source is

anonymoushn · 2025-05-29T09:34:20 1748511260

If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.

m00x · 2025-05-29T05:51:26 1748497886

Many are speculating it was trained by o1/o3 for some of the initial reasoning.

fulafel · 2025-05-29T05:53:53 1748498033

Are there any widely used models that publish this? If not, then no I guess.

DANmode · 2025-05-29T08:43:40 1748508220

Depending on how you use "randomly", they absolutely can..?

chrsw · 2025-05-29T00:07:39 1748477259

Based on commit history Open R1 still active and they're still making progress. Long may it continue, it's an ambitious project.

therealpygon · 2025-05-29T13:14:49 1748524489

This was simply a mad scramble to prove/disprove the claims OpenAI was peddling that the model wasn’t actually performing as well as advertised and that they were lying about the training/compute resources. Open-R1 has since applied the training to a similar 7B model and got similar results. At the end of the day, no one really cares what the data was that it was trained on and most AI providers don’t always share this either when releasing open source models, and certainly not available for closed source models.

make3 · 2025-05-29T00:42:25 1748479345

I don't think people make the distinction like that. The open source vs non open source distinction boils down to, usually, can you use it for commercial use.

what you're saying is just that it's non reproducible, which is a completely valid but separate issue

alpaca128 · 2025-05-29T08:43:43 1748508223

There's already established terms and licenses for non-commercial use. Like "open weights".

Open source has the word "source" in it for a reason, and those models ain't open source and have nothing to do with it.

ben_w · 2025-05-30T09:55:54 1748598954

Took me until this thread to remember that in the 90s we had "freeware".

piperswe · 2025-05-29T00:47:56 1748479676

But where's the source? I just see a binary blob, what makes it open source?

jacob019 · 2025-05-29T03:00:35 1748487635

The weights are the source. It isn't as though something was compiled into weights. They're trained directly. But I know what you mean, it would be more open to have the training pipeline and souce dataset available.

timschmidt · 2025-05-29T04:30:35 1748493035

The weights seem much more like a binary to me, the training pipeline the compiler, and the training dataset the source.

jumski · 2025-05-29T07:09:06 1748502546

Come here to write this - perfect analogy!

reedciccio · 2025-05-29T07:24:30 1748503470

It's very imperfect analogy though these things can't be rebuilt "from scratch" like a program, the training process doesn't seem to be replicable anyway. Nonetheless, full data disclosure is necessary, according to the result of the years-long consultation led by the Open Source Initiative https://opensource.org/ai

timschmidt · 2025-05-29T07:26:34 1748503594

> the training process doesn't seem to be replicable anyway

The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.

kouteiheika · 2025-05-29T09:22:02 1748510522

> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

No it is not. The training process is non-deterministic, and given exactly the same data, the same code and the same seeds you'll get different weights. Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using (e.g. you'll get different results on CPU, on GPU from vendor #1 and on GPU from vendor #2, and probably on different GPUs from the same vendor, and on different CUDA versions, etc.), but also depending on the dimensions of the matrices you'll get different results (e.g. if you fuse the QKV weights from modern transformers into a single matrix and do a single multiplication instead of multiplying each separately you'll get different results), and some algorithms (e.g. backwards pass of Flash Attention) are explicitly non-deterministic to be faster.

timschmidt · 2025-05-29T09:37:23 1748511443

> Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using

That has everything to do with implementation, and nothing to do with algorithm. There is an important difference.

Math is deterministic. The way [random chip] implements floating point operations may not be.

Lots of scientific software has the ability to use IEEE-754 floats for speed or to flip a switch for arbitrary precision calculations. The calculation being performed remains the same.

kouteiheika · 2025-05-29T11:05:24 1748516724

> Math is deterministic.

The point is none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process using a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong in this case because none of the non-toy models use (nor they practically can use) such a deterministic process. There is a training process which is deterministic, but no one uses it (for good reasons).

If you had infinite budget, exactly the same code, the same training data, and even the same hardware you would not be able to reproduce the weights of Deepseek R1, because it wasn't trained using a deterministic process.

jacob019 · 2025-05-29T19:10:00 1748545800

A lot of quibbling here, wasn't sure where to reply. If you've built any models in PyTorch, then you know. Conceptually it is deterministic, a model trained using deterministic implementations of low level algorithms will produce deterministic results. And when you are optimizing the pipeline, it is common to do just that:

    torch.manual_seed(0)
    random.seed(0)
    np.random.seed(0)
    torch.use_deterministic_algorithms(True)

But in practice that is too slow, we use nondeterministic implementations that run fast and loose with memory management and don't necessarily care about the order in which parallel operations return.

willmarch · 2025-05-29T08:16:46 1748506606

I’m pretty sure the initial weights are randomized meaning no two models will train in the same way twice. The order in which you feed in training data to the model would also add an element of randomness. Model training is closer to growing a plant than running a compiler.

timschmidt · 2025-05-29T08:18:23 1748506703

That's still a deterministic algorithm. The random data and the order of feeding training data into it are part of the data which determines the output. Again, if you do it twice the same way, you'll get the same output.

willmarch · 2025-05-29T08:25:08 1748507108

If they saved the initial randomized model and released it and there was no random bit flipping during copying, then possibly but it would still be difficult when you factor in the RLHF that comes about through random humans interacting with the model to tweak its workings. If you preserved that data as well, and got all of the initial training correct... maybe. But I'd bet against it.

timschmidt · 2025-05-29T08:28:57 1748507337

So long as the data provided was identical, and sources of error like floating point errors due to hardware implementation details are accounted for, I see no reason output wouldn't be identical.

Where would other non-determinism come from?

I'm open to there being another source. I'd just like to know what it would be. I haven't found one yet.

reedciccio · 2025-05-29T08:23:07 1748506987

> if you do it twice the same way, you'll get the same output

Point at the science that says that, please: Current scientific knowledge doesn't agree with you.

timschmidt · 2025-05-29T08:24:29 1748507069

> Current scientific knowledge doesn't agree with you.

I'd love a citation. So far you haven't even suggested a possible source for this non-determinism you claim exists.

desdenova · 2025-05-29T08:12:20 1748506340

What makes models non-deterministic isn't the training algorithm, but the initial weights being random.

Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.

timschmidt · 2025-05-29T08:13:39 1748506419

That would fall under "Feed the same data in and you'll get the same weights out." Lots of deterministic algorithms use a random seed.

alfiedotwtf · 2025-05-29T08:36:32 1748507792

So is there no “introduce randomness” at some step afterwards? If not, I would guess these models would be getting stuck in a local maxima

addaon · 2025-05-29T16:05:28 1748534728

> If not, I would guess these models would be getting stuck in a local maxima

It sounds like you're referring to something like simulated annealing. Using that as an example, the fundamental requirement is to introduce arbitrary, uncorrelated steps -- there's no requirement that the steps be random, and the only potential advantage of using a random source is that it provides independence (lack of correlation) inherently; but in exchange, it makes testing and reproduction much harder. Basically every use of simulated annealing or similar I've run into uses pseudorandom numbers for this reason.

reedciccio · 2025-05-29T07:52:26 1748505146

Can you point at the research that says that the training process of a LLM at least the size of OLMo or Pythia is deterministic?

timschmidt · 2025-05-29T07:56:14 1748505374

Can you point to something that says it's not? The only source of non-determinism I've read of affecting LLM training is floating point error which is well understood and worked around easily enough.

reedciccio · 2025-05-29T08:37:54 1748507874

Search more, there is a lot of literature discussing how hard the problem of reproducibility of GenAI/LLMs/Deep Learning is, how far we are from solving it for trivial/small models (let alone for beasts the size of the most powerful ones) and even how pointless the whole exercise is.

timschmidt · 2025-05-29T08:42:07 1748508127

If there's a lot, then it should be easy for you to link an example right? One that points toward something other than floating point error.

There simply aren't that many sources of non-determinism in a modern computer.

Though I'll grant that if you've engineered your codebase for speed and not for determinism, error can creep in via floating point error, sloppy ordering of operations, etc. These are not unavoidable implementation details, however. CAD kernels and other scientific software do it every day.

When you boil down what's actually happening during training, it's just a bunch of matrix math. And math is highly repeatable. Size of the matrix has nothing to do with it.

I have little doubt that some implementations aren't deterministic, due to software engineering choices as discussed above. But the algorithms absolutely are. Claiming otherwise seems equivalent to claiming that 2 + 2 can sometimes equal 5.

kouteiheika · 2025-05-29T09:36:46 1748511406

> I have little doubt that some implementations aren't deterministic

Not some of them; ALL OF THEM. Engineering training pipelines for absolute determinism would be, quite frankly, extremely dumb, so no one does it. When you need millions of dollars worth of compute to train a non-toy model are you going to double or triple your cost just so that the process is deterministic, without actually making the end result perform any better?

timschmidt · 2025-05-29T09:39:04 1748511544

Depends on how much you value repeatability in testing, and how much compute you have. It's a choice which has been made often in the history of computer science.

The cost of adaptive precision floats can be negligible depending on application. One example I'm familiar with from geometry processing: https://www.cs.cmu.edu/~quake/robust.html

Integer math often carries no performance penalty compared to floating point.

I guess my takeaway from this conversation is that there's a market for fast high-precision math techniques in the AI field.

otabdeveloper4 · 2025-05-29T05:39:56 1748497196

You can fine-tune their weights and release your own take.

E.g. see all the specialized third-party models out there based on Qwen.

"Open-source" is the wrong word here, what they mean is "you can modify and redistribute these weights".

yetihehe · 2025-05-29T06:44:05 1748501045

You can also reverse engineer and modify closed source programs (see mods for games). Weights are like compiled version of source data.

otabdeveloper4 · 2025-05-29T09:21:36 1748510496

Finetuning isn't reverse engineering. Finetuning is a standard supported workflow for these models.

Also, the "redistribute" part is key here.

yetihehe · 2025-05-29T10:16:57 1748513817

> Finetuning isn't reverse engineering

Fully agree, it isn't. Reverse engineering isn't necessary for modifying compiled program behaviour, so comparing it to finetuning is not applicable. Finetuning applied to program domain would be more like adding plugins or patching in some compiled routines. Reverse-engineering applied to models would be like extracting source documents from weights.

> Finetuning is a standard supported workflow for these models.

Yes, so is adding mods for some games, just put your files in a designated folder and game automatically picks it up and does required modifications.

> Also, the "redistribute" part is key here.

It is not. Redistributability and being open source is orthogonal. You can have a source for a program and not be able to redistribute source or program, or you can redistribute a compiled program, but not have it's source (freeware).

macrolime · 2025-05-29T07:20:22 1748503222

Not legally. That's the difference.

timschmidt · 2025-05-29T07:22:05 1748503325

Sure you can. It's often legally protected activity. You're just limited to distributing your modifications without the original work.

macrolime · 2025-05-29T17:00:39 1748538039

For some games maybe, but software often has a clause forbidding reverse engineering

timschmidt · 2025-05-29T20:19:37 1748549977

ChatGPT says that such clauses are typically void in the EU, though they may apply in some cases in the US. Even in the US, the triennial DMCA rule-making has granted broader exemptions for good-faith security research every cycle since 2016.

https://chatgpt.com/share/6838c070-705c-8005-9a88-83c9a5550a...

danieldk · 2025-05-29T07:48:32 1748504912

There is work to try to reproduce (the original) R1: https://huggingface.co/open-r1

1una · 2025-05-29T03:01:02 1748487662

I won't call it "binary blob". Safetensors is just a simple format for storing tensors safely: https://huggingface.co/docs/safetensors/index

JKCalhoun · 2025-05-29T01:28:08 1748482088

Is there a downloadable model? (Not familiar with openrouter and not seeing the model on ollama.)

zargon · 2025-05-29T02:02:51 1748484171

This HN submission goes directly to the downloadable model.

angst · 2025-05-30T02:24:35 1748571875

DeepSeek-R1-0528-Qwen3-8B

> ollama run deepseek-r1

from https://ollama.com/library/deepseek-r1

fragmede · 2025-05-28T23:31:25 1748475085

It's. not. open. source!

https://www.downloadableisnotopensource.org/

echelon · 2025-05-29T00:50:08 1748479808

Open source is a crazy new beast in the AI/ML world.

We have numerous artifacts to reason about:

- The model code

- The training code

- The fine tuning code

- The inference code

- The raw training data

- The processed training data (which might vary across various stages of pre-training and potentially fine-tuning!)

- The resultant weights

- The inference outputs (which also need a license)

- The research papers (hopefully it's described in literature!)

- The patents (or lack thereof)

The term "open source" is wholly inadequate here. We need a 10-star grading system for this.

This is not your mamma's C library.

AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).

This is more than enough to distill new models from.

Everybody is laundering training data, and it's rife with copyrighted data, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training data of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically-captured.

reedciccio · 2025-05-29T07:19:55 1748503195

Https://opensource.org/ai ... Lots of reasoning has been done on those artifacts

Tepix · 2025-05-29T05:00:23 1748494823

I think you‘re trying to make it look more complex than it is. Put the amount of data next to every entry in that list of yours.

echelon · 2025-05-29T05:03:37 1748495017

Most of those items map to a job description.

If you think the data story isn't a complicated beast, then consider:

If you wanted an "open" dataset, would you want it before or after it was processed? There are a lot of cleaning, categorizing, feature extraction steps. The data typically undergoes a lot of analysis, extra annotation, bucketing, and transformation.

If the pre-train was done in stages, and the training process was complicated, how much hand-holding do you need to replicate that process?

Do you need all of the scripts to assist with these processes? All of the infra and MLOps pieces? There's a lot of infrastructure to just move the data around and poke it.

Where are you going to host those terabytes or petabytes of data? Who is going to download it? How often? Do you expect it to be downloaded as frequently as the Linux kernel sources?

Did you scrub it of PII? Are you sure?

And to clarify, we're not even talking about trained models at this point.

xnickb · 2025-05-29T05:30:39 1748496639

I'd argue we don't need a 10 star system. The single bit we have now is enough. And the question is also pretty clear: did $company steal other peoples work?

The answer is also known. So the reason one would want an open source model (read reproducible model), would be that of ethics

selfhoster11 · 2025-05-29T06:15:14 1748499314

We use pop-cultural references to communicate all the time these days. Those don't necessarily come from only the most commonly known sections of these works, so the AI would necessarily need the full work (or a functional transformation of the work) to be able to hit the theoretical maximum of the ability to decode about and reason using such references. To exclude copyrighted works from the training set is to expect it to decode from the outside what amounts to humanity's own in-group jokes.

That's my formal argument. The less formal one is that copyright protection is something that smaller artists deserve more than rich conglomerates, and even then, durations shouldn't be "eternity and a day". A huge chunk of what is being "stolen" should be in the commons anyway.

yencabulator · 2025-05-29T16:22:48 1748535768

"Your honor, if I hadn't robbed that bank I wouldn't have gotten all that money!"

echelon · 2025-05-29T05:54:12 1748498052

I truthfully cannot think of a single model that satisfies your criteria.

And if we wait for the the internet to be wholly eaten by AI, if we accept perfect as the enemy of good, then we'll have nothing left to cling to.

> And the question is also pretty clear: did $company steal other peoples work?

Who the hell cares? By the time this is settled - and I'd argue you won't get a definitive agreement - the internet will be won by the hyperscalers.

Accept corporate gifts of AI, and keep pushing them forward. Commoditize. Let there be no moat.

There will be infinite synthetic data available to us in the future anyway. And none of this bickering will have even mattered.

cavisne · 2025-05-29T04:54:10 1748494450

"knowing why a model refuses to answer something matters"

The companies that create these models cant answer that question! Models get jailbroken all the time to ignore alignment instructions. The robust refusal logic normally sits on top of the model, ie looking at the responses and flagging anything that they don't want to show to users.

The best tool we have for understanding if a model is refusing to answer a problem or actually doesn't know is mechanistic interp, which you only need the weights for.

This whole debate is weird, even with traditional open source code you cant tell the intent of a programmer, what sources they used to write that code etc.

behnamoh · 2025-05-29T00:10:56 1748477456

it's got more 'source' than whatever OpenAI provides for their models.

numpad0 · 2025-05-29T00:36:21 1748478981

less alcoholic beverages are fully alcoholic beverages

subscribed · 2025-05-29T16:36:37 1748536597

0.5% or 0.03% satisfy my "nonalcoholic" criteria.

> Studies have found ethanol levels in commercial apple juice ranging from 0.06 to 0.66 grams per liter, with an average around 0.26 grams per liter[1]

Even apple juice is an alcoholic drink if you push your criteria to absurdity.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC5421578/

fragmede · 2025-05-29T01:00:07 1748480407

but they're not bleach, and no amount of adding or removing alcohol can transmute the alcohol into something else.

stavros · 2025-05-29T00:19:38 1748477978

No it doesn't, it has exactly the same source, zero. It has more downloadable binary.

Aeolun · 2025-05-29T04:44:41 1748493881

That’s the ‘source’ for what the model spits out though, if not the source for what spits out the model.

prmoustache · 2025-05-29T05:10:49 1748495449

It is just freeware, not open source.

stavros · 2025-05-29T22:49:28 1748558968

The "source" for something is all the stuff that makes you able to build and change that something. The source for a model is all the stuff that makes you able to train and change the model.

Just because the model produces stuff doesn't mean that's the model's source, just like the binary for a compiler isn't the compiler's source.

quarters · 2025-05-28T23:46:14 1748475974

Ok

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/mai...

stavros · 2025-05-29T00:20:10 1748478010

Slapping an MIT license on a compiled binary doesn't make it open source.

quarters · 2025-05-29T01:30:21 1748482221

They're keeping some stuff to themselves which is fine. I don't expect anyone to have to fully release everything they've got especially considering the vast costs associated with researching and developing these models.

What they have released has been distilled into many new models that others have been using for commercial benefit and I appreciate the contributions that they have made.

alpaca128 · 2025-05-29T08:51:09 1748508669

> I don't expect anyone to have to fully release everything they've got

I also don't expect Microsoft to release their full Windows 11 source code, but that also means it's not open source. And that's okay, because Microsoft doesn't call it open source.

aldanor · 2025-05-29T19:37:29 1748547449

Open weights.

acheong08 · 2025-05-28T18:36:22 1748457382

No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link

chvid · 2025-05-28T20:29:45 1748464185

Benchmarks seem like a fools errand at this point; overly tuning models just to specific test already published tests, rather than focusing on making them generalize.

Hugging face has a leader board and it seems dominated by models that are finetunings of various common open source models, yet don't seem be broader used:

https://huggingface.co/open-llm-leaderboard

EvgeniyZh · 2025-05-28T21:35:13 1748468113

There are quite a few benchmarks for which that's not the case:

- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc)

- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)

- (to some extent) benchmark without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off benchmark dev, but it's much more complicated.

Most of the benchmarks that don't try to tackle future contamination are much less useful, that's true. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are there, it's a lost game IMHO); I really liked the concept.

Edit: it is true that these benchmarks are focusing only on a fairly specific subset of the model capabilities. For everything else vibe check is your best bet.

chvid · 2025-05-29T07:48:47 1748504927

I agree with you.

Of course, some benchmarks are still valid and will remain valid. Ie. we can make the models play chess against each other and score them on how well they do. But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after. And often, LLMs perform worse than specialized models. Ie. I don't think there is any LLM out there that can beat a traditional chess program (surely not using the same computing power).

What is really bad are the QA benchmarks which leak over time into the training data of the models. And sometimes, one can suspect even big labs have an economic incentive in scoring well on popular benchmarks which cause them to manipulate the models way beyond what is reasonable.

And taking a bunch of flawed benchmarks and combining them in indexes, saying this model is 2% better than that model is just completely meaningless but of course fun and draws a lot of attention.

So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

Of course, done right, that would be really expensive. And those sponsoring might not like the result.

EvgeniyZh · 2025-05-29T08:29:26 1748507366

> But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after.

I think a general model that can

- finish nethack, doom, zelda and civilization,

- solve the hardest codeforces/atcoder problems,

- formally prove putnam solution with high probability, not given the answer

- write a PR to close a random issue on github

is likely to have some broader intelligence. I may be mistaken, since there were tasks in the past that appeared to be unsolvable without human-level intelligence, but in fact weren't.

I agree that such benchmarks are limited to either environment with well-defined feedback and rules (games) or easily verifiable ones (code/math), but I wouldn't say it's super narrow, and there are no non-LLM models to perform significantly better on these (except some games); though specialized LLMs work better. Finding other examples, I think, is one of the important problems in AI metrology.

> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.

You've invented an arena (who just raised quite a lot of money). Can argue about "representative," of course. However, I think the SNR in the arena is not too high now; it turns out that the average arena user is quite biased, the most of their queries are trivial for LLMs, and for non-trivial ones, they cannot necessarily figure out which answer is better. MathArena goes in opposite directions: narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think it may happen eventually if money flow into AI continues.

chvid · 2025-05-29T11:01:37 1748516497

A couple of things:

I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.

Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda). There is something to that, but it depends on the model. A lot of what is out there are “mixtures of experts" either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess - boot up the chess program? Is it poetry? Boot up the poetry generator.

That sort of thing is not showing broad intelligence anymore than a person both knowing a chess player and a poet is having broad intelligence.

Deepseek is, as far as I can tell, the leading open-source model; and in some way, that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something that is running behind a server-side API - because who knows what is really going on behind the API.

Deepseek being Chinese makes it political and even harder to have a sane conversation about; but I am sure that had it been China that did mostly closed models and the US that did open ones; we would hold that against them, big time.

EvgeniyZh · 2025-05-29T15:38:17 1748533097

> I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.

But how is it different from what arena or matharena does?

> That sort of thing is not showing broad intelligence anymore than a person both knowing a chess player and a poet is having broad intelligence.

The claim is that these problems require somewhat broad intelligence by themselves, as opposed to specialization into specific task while unable to do anything else.

CamperBob2 · 2025-05-30T02:55:25 1748573725

So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess - boot up the chess program? Is it poetry? Boot up the poetry generator.

No, that's not actually a good description of the mixture-of-experts methodology. It was poorly named. There is no conscious division of the weights into "This subset is good for poetry, this one is best for programming, this one for math, this one for games, this one for language translation, etc."

behnamoh · 2025-05-29T00:13:08 1748477588

right, all benchmarks collapse once you go beyond 32K tokens. I've rarely seen any benchmarks focusing on long range, which is where most programming needs are at.

lossolo · 2025-05-28T21:09:33 1748466573

The only benchmarks that match my experience with different models are here https://livebench.ai/#/

ribelo · 2025-05-29T00:17:22 1748477842

livebench was good, but now it's a joke. Gemini flash is better in coding than pro and sonnet 3.7. And this is only the beginning of weird results.

pdimitar · 2025-05-29T06:50:04 1748501404

Flash is better than Pro in coding? Whoa... [makes a note to try a few things later this day]

Out of curiosity, how did you gauge that?

code_biologist · 2025-05-29T08:25:00 1748507100

I think your parent comment is citing that as an example of why livebench is no longer a good benchmark. That said, the new Flash is very good for what it is, and IMO after the Pro 05-06 nerfs the two models are much closer in performance for many tasks than they really should be — Pro should be / was way better (RIP 03-25 release). That livebench result may be wrong about the specific ranking, but I think it's right that Flash is in the same class of coding strength as Sonnet 3.7.

pdimitar · 2025-05-29T08:27:20 1748507240

Thanks, that's very informative.

My ignorance is showing here: why is the Pro 05-06 a nerf?

halyconWays · 2025-05-29T00:21:43 1748478103

>overly tuning models just to specific test already published tests, rather than focusing on making them generalize.

I think you just described SATs and other standardized tests

Mistletoe · 2025-05-29T02:30:12 1748485812

SAT has a correlation to IQ of 0.82 to 0.86 and I do think IQ is very useful in judging intelligence.

https://gwern.net/doc/iq/high/smpy/2004-frey.pdf

tptacek · 2025-05-29T04:12:20 1748491940

It's a useful diagnostic when used in a battery of diagnostic tests of cognitive function, but to the point of this thread: it is notoriously not a good ranking mechanism.

kbumsik · 2025-05-28T23:38:56 1748475536

Artificial Analysis is the only stable source. Don't look at others like HF Leaderboard.

https://artificialanalysis.ai/

z2 · 2025-05-28T20:51:36 1748465496

There's a table here showing some "Overall" and "Median" score, but no context on what exactly was tested. It appears to be in the ballpark as the latest models, but with some cost advantages with the downside of being just as slow as the original r1 (likely lots of thinking tokens). https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....

xelos · 2025-05-28T21:18:15 1748467095

It’s appeared on the Livecodebench leaderboard too. Performance on par with O4 Mini - https://livecodebench.github.io/leaderboard.html

swyx · 2025-05-28T20:26:55 1748464015

i think usually deepseek posts a paper after a model release about a day later.

no idea why they cant just wait a bit to coordinate stuff. bit messy in the news cycle.

Destiner · 2025-05-28T20:36:17 1748464577

honestly a power move.

it's almost as if they don't care about creating a proper buzz.

wyre · 2025-05-28T21:47:09 1748468829

From what I understand, isn’t DeepSeek just a pet project from a Chinese hedge fund? They have much less reason to create a buzz compared to openAI, Anthropic, or Google.

TeMPOraL · 2025-05-28T22:15:45 1748470545

None of those players you mention actually need to create a buzz. People will do it for them for free. DeepSeek joined this group after releasing R1.

Despite constant protestations of hype among the tech crowd, GenAI really is big enough of a deal that new developments don't need to be pushed onto market; people are voluntarily seeking them out.

janalsncm · 2025-05-29T01:26:32 1748481992

Well OpenAI is constantly asking for and raising money. If “buzz” isn’t the right word maybe mystique? Because “race to the bottom in a hyper-commoditized space” probably doesn’t get you billions. No, Sam Altman wants people to believe they are very close to AGI and a TAM of global GDP.

wongarsu · 2025-05-28T22:45:49 1748472349

OpenAI does a lot of work hyping themselves up and creating buzz around things they do or have a vague idea that they might try to do in the future.

Not to make people aware of GenAI, but to make sure OpenAI continues to be perceived as the AI company. The company that leads and revolutionizes, with everyone just copying them and trying to match them. That perception is a significant part of their value and probably their biggest moat

ktallett · 2025-05-28T23:11:10 1748473870

Considering just how quickly others followed it's also obviously not the case. Infact the best AI software as in most useful is not theirs. Claude is far more reliable.

TeMPOraL · 2025-05-29T06:13:26 1748499206

Only goes to shown how strong their brand already is in the eyes of the public. I question how much active maintenance it takes, though. In my eyes, they already won big with ChatGPT - the name still is synonymous with LLMs to the general population. Everyone knows what ChatGPT is. Few known what Claude or Gemini is; arguably, more people know what Deepseek is thanks to the splash they made tanking NVidia stock and becoming part of general news coverage for a few days. Still, for regular folks (including business folks in tech industry, too), they all are, respectively, "that other ChatGPT", "ChatGPT from Google" and "that Chinese ChatGPT". It's a pretty sticky perception.

aibrother · 2025-05-28T19:31:17 1748460677

getting a similar vibe yeah. given how adjacent they are, wouldn't be surprised if this was an intentional nod from DeepSeek

willchen · 2025-05-28T18:24:19 1748456659

I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.

doctoboggan · 2025-05-28T19:01:48 1748458908

Honest question, how do you know this is a big improvement? Are there any benchmarks anywhere?

KeyBoardG · 2025-05-28T20:23:53 1748463833

There will be a video from FireShip if its a big one. /s

sundarurfriend · 2025-05-29T07:34:52 1748504092

Ah FireShip, I forgot that channel existed at all. I asked YouTube to not recommend that channel after every vaguely AI-related news was "BIG NEWS!!!", the videos were also thin on actual content, and there were repeated factual errors over multiple videos too. At that point, the only thing it's good for is to make yourself (falsely) feel like you're keeping up.

mxkopy · 2025-05-29T14:30:03 1748529003

Fireship consistently makes some of the most entertaining tech content out there

sundarurfriend · 2025-05-29T20:20:10 1748550010

Hey if you enjoy it, go for it. I used to like it a couple of years ago too, but I found that more and more lately, it was neither entertaining nor reliably informative. The jokes/memes were lazy and recycled a lot, the tech content was often poorly researched, and it started feeling like content produced for the sake of having content.

therein · 2025-05-29T00:49:42 1748479782

Much more preferred to what OpenAI always did and Anthropic recently started doing. Just write some complicated narrative about how scary this new model is and how it tried to escape and deceive and hack the mainframe while telling the alignment operators bed time stories.

camkego · 2025-05-29T05:50:28 1748497828

Really? I missed this. The new hype trick is implying the new LLM releases are almost AGI? Love it.

IceWreck · 2025-05-29T21:37:32 1748554652

Anthropic "warned" Claude 4 is so smart that it will try to use the terminal (if using Claude Code) or any other tools available (depending on where you're invoking it from) to contact local authorities if you're doing something very immoral.

ilaksh · 2025-05-28T20:36:35 1748464595

I think they did make an announcement on WeChat.

modeless · 2025-05-28T19:04:31 1748459071

I like it too, but some benchmark numbers would be nice at least.

hd4 · 2025-05-28T18:41:54 1748457714

On the day Nvidia report earnings too. Pretty sure it's just a coincidence, bro.

margorczynski · 2025-05-28T19:03:51 1748459031

Yeah the timing seems strange. Considering how much money will move hands based on those results this might be some kind of play to manipulate the market at least a bit.

consumer451 · 2025-05-28T19:19:55 1748459995

I believe that they are funded by a hedge fund. So, there are no coincidences here.

rwmj · 2025-05-28T21:03:57 1748466237

Is releasing a better product really "market manipulation"? It seems to me like regular, good competition.

FirmwareBurner · 2025-05-28T21:40:23 1748468423

It's "manipulating the market" only when your geopolitical adversary brings the competition.

Maxatar · 2025-05-28T19:20:23 1748460023

How does releasing it today affect the market compared to releasing it last week?

doctoboggan · 2025-05-28T19:24:23 1748460263

Hard to say exactly how it will affect the market, but IIRC when deepseek was first released Nvidia stock took a big hit as people realized that you could develop high performing LLMs without access to Nvidia hardware.

jimmyl02 · 2025-05-28T19:39:20 1748461160

I thought the reaction was more so that you can train SOTA models without an extremely large quantity of hyper-expensive GPU clusters?

But I would say that the reaction was probably vastly overblown as what Deepseek really showed was there are much more efficient ways of doing things (which can also be applied with even larger clusters).

If this checkpoint is trained using non-Nvidia GPUs that would definitely be a much bigger situation but it doesn't seem like there has been any associated announcements.

TeMPOraL · 2025-05-28T22:27:58 1748471278

Plans take time to adjust; I imagine a big part of the impact was companies realizing that they need to buy/rent much less expensive GPU compute to realize the plans they've already committed to for the next couple years. Being able to spend less to get the same results is an immediate win; expanding the plan to make use of suddenly available surplus money/compute takes some time.

And then part of the impact was just "woah, if some noname team from China can casually leapfrog major western players on a tiny budget and kill one of their moats in the same move, what other surprises like this are possible?". The event definitely invalidated a lot of assumptions investors had about what is or isn't possible near-term; the stock market reacted to suddenly increased uncertainty.

lexandstuff · 2025-05-29T01:16:31 1748481391

Except that, all Deepseek models so far have been trained on Nvidia hardware. For Deepseek v3, they literally mention that they used 2,048 NVIDIA H800 GPUs right in the abstract: https://arxiv.org/html/2505.09343v1

hbbio · 2025-05-29T01:47:19 1748483239

Actually, the "narrative" crashed Nvidia for no reason.

Not only DeepSeek uses a lot of Nvidia hardware for the training.

But even more so, by releasing an open weight frontier model, people around the world need more Nvidia chips than ever for inference.

lvturner · 2025-05-29T02:59:53 1748487593

I know of enterprises in APAC now spending millions of dollars on Huawei GPUs, while they might not be as efficient, they are seen as geopolitically more stable (especially given the region).

DeepSeek helped "prove" to a lot of execs that "Good" is "Good enough" and that there are viable alternatives with less perceived risk of supply chain disruption - even if facts differ may from this narrative.

hbbio · 2025-05-29T11:48:54 1748519334

Yes, I know them too, I live there!

The hardware is great, CANN is not CUDA yet.

kreijstal · 2025-05-29T06:32:39 1748500359

someone has not heard about huawei GPU

belter · 2025-05-28T19:56:36 1748462196

Plenty of manipulation to go around..

"Tech Chip software stocks sink on report Trump ordered halt to China sales" - https://www.cnbc.com/2025/05/28/chip-software-trump-china.ht...

dyauspitr · 2025-05-29T06:39:07 1748500747

What big improvements?

esafak · 2025-05-28T19:24:46 1748460286

Anyone got benchmarks?

transcriptase · 2025-05-28T18:23:10 1748456590

Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?

danielhanchen · 2025-05-28T19:05:16 1748459116

We made DeepSeek R1 run on a local device via offloading and 1.58bit quantization :) https://unsloth.ai/blog/deepseekr1-dynamic

I'm working on the new one!

CamperBob2 · 2025-05-28T19:41:37 1748461297

Your 1.58-bit dynamic quant model is a religious experience, even at one or two tokens per second (which is what I get on my 128 MB Raptor Lake+4090). It's like owning your own genie... just ridiculously smart. Thanks for the work you've put into it!

nxobject · 2025-05-28T23:55:43 1748476543

Likewise - for me, it feels how I imagined getting a microcomputer in the 70s was like. (Including the hit to the wallet… an Apple II cost the 2024 equivalent of ~$5k, too.)

danielhanchen · 2025-05-29T01:14:39 1748481279

:) The good ol days!

danielhanchen · 2025-05-28T19:43:00 1748461380

Oh thank you! :) Glad they were useful!

behnamoh · 2025-05-29T00:08:43 1748477323

> 1.58bit quantization

of course we can run any model if quantize it enough. but I think the OP was talking about the unquantized version.

danielhanchen · 2025-05-29T01:14:08 1748481248

Oh you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show you can offload all MoE layers to system RAM, and leave non MoE layers on the GPU - the speed is still pretty good!

You can do it via `-ot ".ffn_.*_exps.=CPU"`

behnamoh · 2025-05-29T03:09:22 1748488162

Thanks, I'll try it! I guess "mixing" GPU+CPU would hurt the perf tho.

screaminghawk · 2025-05-28T19:37:04 1748461024

I use this a lot! Thanks for your work and looking forward to the next one

danielhanchen · 2025-05-28T19:43:13 1748461393

Thank you!! New versions should be much better!

terhechte · 2025-05-28T18:31:16 1748457076

You can run the 4bit quantized version of it on a M3 Ultra 512GB. That's quite expensive though. Another alternative is a fast CPU with 500GB of DDR5 RAM. That of course, is also not cheap and slower than the M3 Ultra. Or, you buy multiple Nvidia cards to reach ~500GB of VRam. That is probably the most expensive option but also the fastest

lodovic · 2025-05-28T19:59:25 1748462365

If you use the excess memory for AI only it's cheaper to rent . A single H100 costs less than $2 per hour. (incl power)

diggan · 2025-05-28T20:25:34 1748463934

Vast.ai has a bunch of 1x H100 SXM available, right now the cheapest at $1.554/hr.

Not affiliated, just a (mostly) happy user, although don't trust the bandwidth numbers, lots of variance (not surprising though, it is a user-to-user marketplace).

qingcharles · 2025-05-29T17:26:43 1748539603

Every time someone asks me what hardware to buy to run these at home I show them how many thousands of hours at vast.ai you could get for the same cost.

I don't even know how these Vast servers make money because there is no way you can ever pay off your hardware from the pennies you're getting.

omneity · 2025-05-28T22:23:33 1748471013

Worth mentioning that a single H100 (80-96GB) is not enough to run R1. You're looking at 6-8 GPUs on the lower end, and factor in the setup and download time.

An alternative is to use serverless GPU or LLM providers which abstract some of this for you, albeit at a higher cost and slow starts when you first use your model for some time.

zackangelo · 2025-05-29T06:37:08 1748500628

Yeah, to run the full precision model you need either two 8xH100 nodes connected via Infiniband or one 8xH200 node or one 8xB200 node.

Not for the GPU poor, to be sure.

girvo · 2025-05-29T00:25:32 1748478332

It is enough to run the dynamically quantised 1.56 bit version I believe, which is fun to play around with.

behohippy · 2025-05-28T18:31:13 1748457073

About 768 gigs of ddr5 RAM in a dual socket server board with 12 channel memory and an extra 16 gig or better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s

wongarsu · 2025-05-28T22:58:18 1748473098

About $8000 plus the GPU. Let's throw in a 4080 for about $1k, and you have the full setup for the price of 3 RTX5090. Or cheaper than a single A100. That's not a bad deal.

For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC Ram can be had for a little over $1/GB, so you could probably build the whole thing for around $2k

JKCalhoun · 2025-05-29T01:37:22 1748482642

Been putting together a "mining rig" [1] (or rather I was before the tariffs, ha ha.) Going to try to add a 2nd GPU soon. (And I should try these quantized versions.)

Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply a cheap 2000 Watt Dell server PS off eBay....

[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...

phonon · 2025-05-28T23:33:50 1748475230

This is the state of the art for such a setup. Really good performance!

https://github.com/kvcache-ai/ktransformers

mechagodzilla · 2025-05-28T19:53:32 1748462012

I have a $2k used dual-socket xeon with 768GB of DDR4 - It runs at about 1.5 tokens/sec for the 4-bit quantized version.

hu3 · 2025-05-28T18:33:57 1748457237

It's probably going to be free at OpenRouter.

There's already a 685B parameter DeepSeek V3 for free there.

https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free

latchkey · 2025-05-28T18:52:53 1748458373

It is free to use, but you're feeding OR data and someone is profiting off that.

ankit219 · 2025-05-28T19:02:47 1748458967

Thats how a lot of application layer startups are going to make money. There is a bunch of high quality usage data. Either you monetize it yourself (cursor), get acquired (windsurf) or provide that data to others at a fee (lmsys, mercor). This is inevitable and a market for this is just going to increaase. If you want to prevent this as an org, there arent many ways out. Either use open source models you can deploy, or deal directly with model providers where you can sign specific contracts.

85392_school · 2025-05-28T19:08:01 1748459281

You're actually sending data to random GPUs connected to one of the Bittensor subnets that run LLMs.

latchkey · 2025-05-28T19:37:58 1748461078

That can, today, collect that data and sell it. There is work being done to add TEE, but it isn't live yet.

dist-epoch · 2025-05-28T19:24:52 1748460292

Not every prompt is privacy sensitive.

For example you could use it to summarize a public article.

latchkey · 2025-05-28T19:43:51 1748461431

Every prompt is valuable.

criddell · 2025-05-28T20:01:21 1748462481

And you are getting something valuable in return. It's probably a good trade for many, especially when they are doing something like summarizing a public article.

jacob019 · 2025-05-28T22:15:18 1748470518

I'm not so sure. I have agents that do categorization work. Take a title, drill through a browse tree to find the most applicable leaf category. Lots of other classification tasks that are not particularly sensitive and it's hard to imagine them being very good for training. Also transformations of anonymized numerical data, parsing, etc.

latchkey · 2025-05-29T17:04:56 1748538296

"one man's garbage is another man's treasure"

dist-epoch · 2025-05-28T20:01:54 1748462514

Using an AI for free is also valuable. Seems win/win.

latchkey · 2025-05-29T17:06:48 1748538408

This isn’t about reciprocal value. Even if something isn't privacy sensitive, it still holds value.

SkyPuncher · 2025-05-28T18:54:56 1748458496

Practically, smaller, quantized versions of R1 can be run on a pretty typically Macbook Pro setup. Quantized versions are definitely less performant, but they will absolutely run.

Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time or you have to buy 4- or 5-figures of hardware that's going to sit, mostly unused.

hadlock · 2025-05-28T18:37:30 1748457450

As mentioned you can run this on a server board with 768+ gb memory in cpu mode. Average joe is going to be running quantized 30b (not 600b+) models on an $300/$400/$900 8/12/16gb GPU

rahimnathwani · 2025-05-28T18:44:03 1748457843

I'm not sure that's enough RAM to run it at full precision (FP8).

This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205