NanoGPT (github.com/karpathy)
1532 points by trekhleb on Jan 11, 2023 | 320 comments



Wow, fun to find this trending on HN this morning! I am currently also working on the associated video lecture (as the next episode of my video lecture series here https://karpathy.ai/zero-to-hero.html ), where I will build nanoGPT from scratch and aspire to spell everything out, as with the earlier videos. Hoping to get it out in ~2 weeks or so.


Openly accessible lectures and knowledge like yours have allowed many people, me included, to turn their lives around by putting in the effort and developing themselves. Thank you.


While doing my PhD some years ago (it wasn't a PhD on AI, but very much related) I trained several models with the usual stack back then (pytorch and some others in TF). I realized that a lot of this stack could be rewritten in much simpler terms without sacrificing much fidelity and/or performance in the end.

Submissions like yours, and other projects like this one (recently featured here as well) -> https://github.com/ggerganov/whisper.cpp, make it pretty clear to me that this intuition is correct.

There are a couple of tools I created back then that could push things further in this direction. Unfortunately they're not mature enough to warrant a release, but the ideas they embody are worth a look (IMHO) and I'd be happy to share them. If there's interest on your side (or from anyone else reading this thread), I'd love to talk more about it.


Your YouTube playlist combined with NanoGPT and your Lex Fridman podcast is like having a university-level degree with free internship guidance. Thank you!


Just wanted to say thank you for all the incredible work and resources you publish. I've lost track of all the different skills I've learned from you, from computer vision, RNNs, minGPT, even speedcubing :D


+1. I've benefited greatly from your content, e.g. your CNN lecture was incredibly accessible [0]. I still find transformers stubbornly elude my intuitions despite reading many descriptions. I would very much appreciate your video lecture on this topic.

[0] I think https://www.youtube.com/watch?v=LxfUGhug-iQ


I've found all of your code and lessons on youtube so incredibly useful. You're a wonderful teacher and I really appreciate all the work you've done with this!


Andrej: thank you!

--

To the mod (dang): IMHO Andrej's comment should probably be at the top of the page, not my comment. UPDATE: Looks like that's done. Thank you :-)


Thank you for your amazing work. Between cs231n and your recent videos, I've learned a ton - and you have a gift to explain things in such an easy and straightforward way, that I'm always feeling like an idiot (in a positive way) for not having grasped the concept before.


Badass! A great addition would be some content on tuning pre-trained language models for particular purposes. It would be great to have examples of things like tuning a GPT model trained on language and code to take in a context and spit out code in my custom API, or using my internal terminology. I'm not sure if this is RL-based fine-tuning or just a bunch of language-to-code examples in a fine-tuning dataset? In essence, how can we start using language to control our software?


Ty, agreed - practically speaking, most people will be interested in finetuning rather than from-scratch pretraining. I currently have some language about it in the readme, but I agree this should get more focus, docs, examples, etc.


Appreciate the work to make GPT training accessible!

Do you leave hyperparams (like learning rate, batch size) the same when switching from 8xA100 to fewer GPUs, or do these need to be adjusted?

Separately, when going from 8xA100 GPUs to a single A100 GPU, in the worst case we can expect the same model performance after training 8x as long, correct? (And likely a bit better, because we get more gradient updates in with the smaller batch size.)


Thank you for sharing your knowledge. Anything that can be done to democratize machine learning is an invaluable social service. Hats off to you.


Saying absolutely nothing new here, but your work is so damn inspiring! I wish I had such a natural connection to my work, an ability to distill complex concepts down to the fundamentals, and such inventiveness! I took your CS231N class at Stanford as well. Implementing the fundamental building blocks like backprop was fun and insightful. Thanks again for your passion and teaching!


Your tutorials are effective and concise. Thank you for them! Accessible, from-scratch knowledge on these topics is essential at this time in history and you're really making a dent in that problem.


Thanks, I love your video about back propagation where you painstakingly spell out every calculation. It was like a breath of fresh air compared to other materials out there.


Thanks for your work Andrej! I've been doing earlier lectures and this is absolutely fantastic educational content!


Thank you for your constant contributions.


Amazing work, much appreciated


Thank you for your great work!


just started watching your lectures! they are great!


I have taken several masters-level courses in Machine Learning -- and even with those credentials, I cannot recommend enough Andrej's youtube series, "Neural Networks: Zero to Hero". There, he teaches you, from scratch, how to build everything from the underlying automated gradient calculation system in pytorch, all the way up to the slower version of this model - `MinGPT`.

[1] https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...

(edit: self-promo: I'm currently working on a Typescript follow-through of this same series of video lectures, if you want to follow along with stronger types for explanation: https://github.com/Marviel/lab-grad)


I can’t believe I just spent 2 and a half hours glued to my phone in bed watching this, for absolutely no reason other than it was such an interesting intro (to a subject I’m already familiar with). Thanks for the recommendation, and thanks Andrej for making this!


How does it compare to fast.ai? As an engineer looking to learn, what should I start with?


Both are good for different things.

Fast.ai is great, but it takes the top-down, rather than bottom-up, approach. It takes you from a production-level black box that you don't understand down to the details. The benefit there is you get good high-level intuition of how it behaves at the "let me use this technology for a job" level.

Separately, the fast.ai library is also highly recommendable -- it comes with some state-of-the-art image recognition models, and its training wrappers are really helpful particularly for image-recognition dataset training.

Karpathy's "Neural Networks: Zero to Hero" video series starts at the level of individual neurons and works you up to the final product. For some reason, both this style and Karpathy's conciseness appeal to me slightly more. I'm also super detail-oriented, though -- and any level of "hand waving" (even if further explanation comes later) always bothers me. He's also got some pretty high-profile industry experience, which carries some weight with me.

But I'll say that both are really high quality -- ultimately, my recommendation would be to follow whichever one speaks most to you personally after the first hour or so.

EDIT: Per Jeremy's response below, if you want the bottom-up approach but like the fast.ai teaching style, you should check out "part 2" of the fast.ai set of tutorials, which is exactly that.


fast.ai has both - the "part 1" section is top-down, and the "part 2" section is bottom up. You can do part 2 without having done part 1. Part 2 starts with implementing matrix multiplication from scratch, then backprop from scratch, then SGD from scratch, etc.

There will be a new version of the part 2 course out in a few weeks. It even covers stuff like random number generation from scratch, convolutions from scratch, etc. It gradually works all the way up to Stable Diffusion.

@karpathy's and the fast.ai lessons work well together. They cover similar topics from different angles.

(I'm the primary creator of the fast.ai courses.)


That's awesome! I did not know that part 2 was structured this way, and will check it out. Will be really neat to see you teach stable diffusion.

Thanks for your work on fast.ai!


Jeremy @ Fast.ai says he takes this pedagogical approach because it's "proven" to be the best way to learn. He's probably right, but I do find it confusing at times, because in the beginning you're just hitting ctrl + enter on an IPYNB haha.

Maybe Karpathy's approach will speak to me more--thanks for the recommendation!


Wow, I just watched the first video and it's, hands down, the most crystal clear explanation of neural nets and backpropagation I've ever seen. Bravo.


This is really good, and I was really excited by it but then I read:

> running on a single 8XA100 40GB node in 38 hours of training

This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high-end computer to improve on this model. Until that happens, this whole field is going to be owned by big corporations.


I don't know if that's a blocker. Ordinary people commonly rent a $40k machine for 38 hours from companies like Avis and Hertz.

If training a large model now costs the same as driving to visit grandma, that seems like a pretty good deal.


That's a great comparison. For a real number, I just checked Runpod, and you can rent a system with 8xA100 for $17/hr, or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting a 1xA5000 system, and that only costs $0.44/hr, which is perfect for learning and experimentation.


It would be great if a tradeoff could be made, though. For example, train at 1/10th the speed for 1/10th of the cost.

This could correspond to taking public transport in your analogy, and would bring this within reach of most students.


Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of flops.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.


The problem with that is that, currently, available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So there sadly isn't anything out there that you can load a model this large onto, except a rack of 8x+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB, still fit on a desk, and draw normal power would be great.


Technically this is not true - there are a lot of techniques to shard models and store activations between layers or even smaller subcomponents of the network. For example, you can split the 175B-parameter BLOOM model into separate layers, load up a layer, read the previous layer's output from disk, and save the output to disk.

And NVIDIA does make cards like you are asking for - the A100 is the fast memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB so this is less true now than the P40 vs P100 gen).

Oddly, you can get close to what you are asking for with an M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.
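For illustration, here is a rough sketch of that layer-by-layer offloading idea in plain PyTorch; the checkpoint filenames and layer shapes are made up for the example, not BLOOM's actual layout:

    import torch
    import torch.nn as nn

    # Layer-by-layer offloaded inference: only one layer's weights live on the
    # GPU at a time, and activations are staged on disk between layers.
    # The "layer_i.pt" checkpoints and the layer shape are hypothetical.
    N_LAYERS = 4
    D_MODEL = 1024

    def run_offloaded(x: torch.Tensor) -> torch.Tensor:
        torch.save(x, "act_0.pt")  # stage the initial activations on disk
        for i in range(N_LAYERS):
            # Build an empty layer and load only this layer's weights.
            layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True)
            layer.load_state_dict(torch.load(f"layer_{i}.pt"))
            layer = layer.to("cuda").eval()

            with torch.no_grad():
                act = torch.load(f"act_{i}.pt").to("cuda")
                out = layer(act)

            # Save this layer's output and free the GPU before the next layer.
            torch.save(out.cpu(), f"act_{i + 1}.pt")
            del layer, act, out
            torch.cuda.empty_cache()
        return torch.load(f"act_{N_LAYERS}.pt")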


Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?


Bloom-petals


Amazing. Thank you.


No prob. I think it’s a great idea


I guess this would only become a reality if games started requiring these cards.


Well, if it used to cost you $1 for 1 hr at 1x speed, it will now take you 10 hr at 0.1x speed and, if my math checks out, still cost $1. You need to shrink the model.


But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)


The good news is that, unlike vehicles, the rate for rented compute will continue to drop


Let's not forget that rendering 3D Animations in 3DSMAX or Maya used to take days for a single frame for a complex scene, and months for a few minutes.


You have to gas it up and heaven help you if it gets a scratch or a scuff.


Great news! Cloud instances' energy usage is included in their price, and because they're remote and transient, it's impossible to permanently damage them.


I think the equivalent of not being careful and getting a dent, in this context, is leaving it open to the internet and having a bitcoin miner installed.


You free the instance and the miner is gone.


As you are paying for the resources you use that's fine.

The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.


A better fit would be if you have unlimited liability, like with AWS, and you leak your key pair. Then someone runs up a $100k bill spinning up mining instances.


but you still have to pay for network ingress/egress traffic.


Similarly maybe we should only let people rent a NanoGPT box if they are over 25 and they have to get collision insurance.


If you can fit the training into 24GB, a used RTX 3090 for $700-$800 seems like a good deal at the moment. They are about 45-65% as fast as the A100 according to https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...

So if you buy two of these cards it will take 12-13 days instead of 38 hours but only require a $2500 PC.

James Betker, who created tortoise TTS, built his own $15k machine with 8x RTX 3090 and trained the models with it. He now works for OpenAI…
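For what it's worth, a rough back-of-the-envelope check on that 12-13 day figure, assuming each 3090 is ~0.55x an A100 and near-linear scaling:

    # 8xA100 for 38 hours vs. two RTX 3090s assumed at ~0.55x an A100 each.
    a100_hours = 8 * 38               # ~304 A100-hours of work
    equiv_a100s = 2 * 0.55            # two 3090s ~= 1.1 A100-equivalents
    hours = a100_hours / equiv_a100s  # ~276 hours
    print(hours / 24)                 # ~11.5 days, in the 12-13 day ballpark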


Recommended reading:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learni...

TL;DR: You probably don't need that expensive Threadripper because 2x PCIe 4.0 x16 will not be very beneficial. Go cheap, go 2x PCIe 4.0 x8.


Any link to the $15k machine? Maybe it is cheaper now.


I think it was a DIY machine; those RTX 3090s have gotten cheaper for sure. From my experience, going beyond 4 GPUs is a pricey affair; see [§]. All but one model of the RTX 3090 require at least 3 slots.

If 4 GPUs connected via PCIe 4.0 x16 are enough, you can choose among various sTRX4 boards for 3000-series AMD Threadripper CPUs.

[§] https://www.reddit.com/r/deeplearning/comments/tw0olq/commen...

Another useful URL: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-...


It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.

Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending


Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/


Hopefully your progress gets saved in time when the spot instance inevitably gets terminated in the midst of training.


"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...


If you use Horovod Elastic, I think you can avoid this problem working across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...


If you're doing something new/custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and fine-tuning settings.

(I assume. I've never worked with GPT, but have done similar work in other domains).


After training don't you have to keep it running if you want to use it?


Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.
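As a concrete sketch of the "just download the model" path, this is roughly how you'd run a pretrained GPT-2 on a consumer GPU with the Hugging Face transformers library (the checkpoint name and prompt are just examples):

    from transformers import pipeline

    # "gpt2-xl" is the 1.5B-parameter GPT-2; the smaller "gpt2" / "gpt2-medium"
    # checkpoints fit on almost any consumer card.
    generator = pipeline("text-generation", model="gpt2-xl", device=0)

    print(generator("The meaning of life is", max_length=60)[0]["generated_text"])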


Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?


What’s required to run the model?


The biggest GPT-2 (1.5B params) takes about 10GB of VRAM, meaning it runs on an RTX 2080 Ti or the 12GB version of the RTX 3080.


What's the largest language model I can run on a 3090 with 24 GiB RAM?


Depends on precision: you can run a ~5B model at fp32 or a ~11B model at fp16, max. Int8 is really bad for real-world use cases, so I'm not mentioning it.

But if you are looking to get the performance of ChatGPT or GPT-3, then don't waste your time: all small GPT-3-like LLMs (below at least 60B params) are useless for any real-world use case, they are just toys.
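The rough arithmetic behind those limits, assuming the weights dominate and you leave ~10% headroom for activations:

    # Memory for weights ~= params * bytes_per_param; fit check for a 24 GB card.
    vram_bytes = 24e9
    for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        max_params = vram_bytes * 0.9 / bytes_per_param
        print(f"{name}: roughly {max_params / 1e9:.1f}B parameters")
    # fp32 -> ~5.4B, fp16 -> ~10.8B, int8 -> ~21.6B (quality concerns aside)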


If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Fortunately very few real world use cases need to be this general.

If you are training an LLM on a domain-specific corpus or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys", and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.


> If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Yes, thanks, that's what I meant.

> If you are training an LLM on a domain-specific corpus or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys", and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.

Agree, BERT family is a good example here.


Okay, thank you. Perfect response.


https://github.com/karpathy/nanoGPT#i-only-have-a-macbook

> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)
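To see why the vocabulary dominates at that scale, here's a quick (approximate) parameter count for the tiny config quoted above - standard GPT-2-style block arithmetic, not nanoGPT's exact bookkeeping:

    # Tiny CPU config: 4 layers, 4 heads, 64-dim embeddings, GPT-2's 50257-token vocab.
    vocab_size, n_embd, n_layer, block_size = 50257, 64, 4, 64

    token_emb = vocab_size * n_embd                   # ~3.2M params; the output head is the same size (possibly weight-tied)
    pos_emb = block_size * n_embd                     # ~4K
    body = n_layer * (4 * n_embd**2 + 8 * n_embd**2)  # attention + MLP per block, ~0.2M total

    print(f"embedding table: {token_emb/1e6:.2f}M, transformer body: {body/1e6:.2f}M")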


But how often do you need to run this? You can rent 8xA100 on Lambda Labs [0] (no affiliation) for $8.80/hr, so you should be able to run through the entire dataset for less than $350.

[0] https://lambdalabs.com/service/gpu-cloud#pricing


They are acknowledged at the bottom for supporting Andrej's research!!


A couple of weeks ago a new paper came out that shows how to train a high quality language model on a single GPU in one day.

https://arxiv.org/abs/2212.14034


If you can't fit the model on your resources, you can leverage DeepSpeed's ZeRO-Offload, which will let you train GPT-2 on a single V100 (32GB).

Alternatively, if you're doing research (with the caveat that you have to either publish, open-source, or share your results in a blog post), you can also get access to Google's TPU Research Cloud, which gives you a few v3-8s for 30 days (you can't do distributed training across them, but you can run workloads in parallel). You can also ask nicely for a pod; I've been granted access to a v3-32 for 14 days pretty trivially, which (if optimized) has more throughput than 8xA100 on transformer models.

TPUs, and even more so pods, are a bit harder to work with, and TF performs far better than PyTorch on them.

https://www.deepspeed.ai/tutorials/zero-offload/

https://medium.com/analytics-vidhya/googles-tpu-research-clo...
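For reference, a minimal sketch of what wiring up ZeRO-Offload could look like; the config keys follow the DeepSpeed docs linked above, but the model, batch size and learning rate are placeholders:

    import deepspeed
    import torch.nn as nn

    # Placeholder model; in practice this would be the GPT-2 you want to train.
    model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)

    # ZeRO stage 2 with optimizer state offloaded to CPU memory (ZeRO-Offload),
    # which is what lets a GPT-2-sized model train on a single V100.
    ds_config = {
        "train_batch_size": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        "optimizer": {"type": "AdamW", "params": {"lr": 6e-4}},
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    # The training loop then calls engine(batch), engine.backward(loss), engine.step().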


I was curious about how much this would be to rent, because definitely the cost of those servers is outside the budget! Lambda has 8xA100 40gb for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing


It seems about as likely as people being able to build big-automaker-grade cars with just the tools in their garage. More compute is going to keep producing better results, at least for LLMs.


How are universities and colleges dealing with this kind of demand for computing power? It must be hard to be able to do some courses now.


Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.


I went to a smallish state university; even there we had our own HPC center and lab. We had a proper (IIRC) 6-row HPC data center across campus, and as an undergraduate research assistant I had a continuous budget available for building Beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 Raspberry Pis to build an ARM cluster.


As far as research groups go - they get funds (project grants, donations, etc.) to purchase machines and parts, and then users have to timeshare them.

These machines are pretty much crunching numbers 24/7, and your project will get appended to a queue.


'group project'


That's to train it from scratch, though, right? If you preload the GPT2 weights you don't need to do this. You can just give it additional training on your texts.


Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.

Besides the rental options discussed below, these NVIDIA boxen don't look too big, so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.


If GPT-2 / nanoGPT needs this setup, just imagine what GPT3 / chatGPT needs!


Supposedly, even running the trained model for ChatGPT is extremely expensive, unlike the image generators, which can largely be run on a consumer device.


I don’t know anything about this, but is that this instance type on AWS? p4d.24xlarge


You can rent on AWS and other cloud providers.


So if I see it right, that would be a p4d.24xlarge instance, which goes for about $32.77 an hour nowadays, so the total training would be about $1245. Not cheap, but certainly not a nation-state budget.

Edit: I just noticed Lambda Labs. It seems they ask $8.80 per hour for an instance of this caliber, which puts the total training cost around $334. I wonder how it can be that much cheaper.
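The arithmetic, for a 38-hour run:

    hours = 38
    print(hours * 32.77)  # AWS p4d.24xlarge on-demand: ~$1245
    print(hours * 8.80)   # Lambda 8xA100 instance:     ~$334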


That is a key difference. You can't easily and cheaply rent an auto factory, but you're starting to be able to rent an LLM "training factory" once per model, and then run inference on that model much more cheaply.


Doesn't Hugging Face have dozens of freely available pretrained models like this (including various-sized implementations of GPT-2), and isn't the source available for most of them if you wanted to train them yourself?

All I see in the comments is praise for the author as a person, so I'm just wondering what's unique about this that's not available elsewhere? 730 upvotes and counting, so I assume I'm missing something...


True, but the use cases aren't the same. As he did before for other models, he has a knack for distilling the code down to beautiful, self-contained examples of high didactic value.

It's an order of magnitude easier to grok the basics from this repo than from going through (admittedly more ergonomic or performant or production-ready) huggingface repos.


Additionally, in terms of the streamlining nanoGPT purports to offer, HuggingFace's implementations play nice with optimization techniques such as ONNX/TensorRT, which will give you better performance than anything PyTorch-based, even if the gain is minimal.

That doesn't mean an ONNX-ed nanoGPT won't be better, but the field of optimized text generation isn't as new as people claim.


This is a didactic implementation. If you read the HuggingFace repo it is much more abstracted on account they implement many models in the same codebase. It's not fast or big, just easier to read and tweak.


If so, then why does the second line of its documentation say that it is "a rewrite of minGPT that prioritizes teeth over education"?


minGPT prioritized being understandable above all else, and was not very fast. This repo includes several optimizations, but it is still much more understandable than probably any other open-source implementation.


This is a dumb question about language models in general, not necessarily specific to NanoGPT: why is all the focus on training? Can I download and run a pre-trained model locally? Surely the specs required to run a model are much, much lower than those required to train the model?


I believe the training is where the architecture of the model is most apparent. You can absolutely download plenty of pre-trained models.

You will also probably need to fine tune for a specific use case, so a common approach is downloading a pre-trained model and fine tuning.

I think including the “from scratch” tuning script is educational more than anything else.


It's the equivalent of building from source versus downloading a compiled binary.

Also you can perform "fine tuning" which means you start with a trained model and train it further on your own data, allowing you to customize the model for specific tasks.


If you're only using pre-trained models, it's going to be harder to differentiate yourself. Training / specialization of models is where the moat-building is (due to access to different data sets / better ideas). By specializing / training, more of the token limit can be used for generation rather than prompting / better prompts can be made.

The lower the cost of training, the more profitable any resultant business. You can even envision businesses that train the model regularly to bring in new knowledge. The cheaper this is, the more opportunities open up.


Inference can still be a bottleneck, I think, since you usually load the whole thing into memory, which is usually 32-64GB+?


Language models range from 1 to 300+ GB when loaded. It depends on how you load them: if you load in int8, you get a 4x reduction.
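As a hedged example of the int8 route, recent transformers versions (with accelerate and bitsandbytes installed) expose 8-bit loading roughly like this; the checkpoint name is just an example:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the weights quantized to int8, roughly 4x smaller than fp32
    # (2x smaller than fp16). Needs bitsandbytes and a CUDA GPU.
    model_name = "EleutherAI/gpt-j-6B"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # let accelerate place layers on GPU/CPU
        load_in_8bit=True,   # bitsandbytes int8 quantization
    )

    inputs = tokenizer("Language models range from", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_length=40)[0]))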


Are there any possible technological or scientific leaps on the horizon that would reduce training time by an order of magnitude or more? GPT-3 took 355 years to train with incredibly expensive hardware, which means small players have no chance to push the state of the art.


As models get bigger, fewer and fewer neurons get activated by any given input. If you can somehow predict which neurons get activated, you can skip the vast majority of the computational load. I have read a paper arguing that only 0.5% of the neurons are actually active in a 200-million-parameter model, so you can get a 200x improvement just from that.

What this tells you is that there is very little money in optimizing deep learning, and that NVIDIA has made it very easy to just throw more hardware at the problem.


> very little money in optimizing deep learning

Oh - there are a lot of people working on optimizing AI. Amongst hobbyists, academia, and corporations alike.

The thing is, if you come up with a neat optimization that saves 30% of compute for the same results, typically instead of reducing your compute budget 30%, you instead increase your model/data size 30% and get better results.


Jevons paradox of data and AI. The more efficiently data is used, the more demand there is for data.


Any state of the art model takes about three weeks to train.


More an indication of human patience than task difficulty.


This is hard a-priori, but fairly easy post-facto. Model distillation isn't a common practice yet, but it has already been demonstrated to be quite effective for specific use cases.


Distillation works but somehow we see very few papers doing it at this scale.


Do you have a link to that paper by any chance? By "neurons" did they mean weights or activations?


Here is a GPU implementation.

https://ieeexplore.ieee.org/document/9635657

It is somewhere from 8x to 25x faster than doing dense machine learning. The speedup was higher on the original CPU implementation and the GPU paper mentions that if there isn't enough shared memory on the GPU it will have to switch to an algorithm that has more overhead.

By neurons I actually meant "nodes"

My comment is effectively a summary of this article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...

Edit: There is a paper for sparse spiking gradient descent promising a 150x improvement. I am not sure how practical this is because spiking neural network hardware heavily limits your model size but here it is:

https://arxiv.org/abs/2105.08810


200x improvement over Nvidia with sparse matrices or without? Nvidia supports a certain level of sparsity.


> argued that only 0.5% of the neurons are actually active in a 200 million parameter model so you can get a 200x improvement just from that

Yes, but you don't know which 0.5%, as it depends on the input text.


I wonder about this, too. OpenAI's biggest 'moat' is that their model takes so much resources to train, not that their algorithms are particularly secret.

One idea I had was to not use one single model to learn all steps of the task, but to break it up. The human brain has dedicated grammar processing parts. It is unclear whether something like a universal grammar exists, but we have at least an innate sense for rhythm. Applied to NLP, you could heavily preprocess the input. Tokenize it, annotate parts of speech. Maybe add pronunciation, so the model doesn't have to think about weird english spelling rules, and so you can deal with audio more easily later. So I would build all these little expert-knowledge black boxes and offer them as input to my network.

But there is also some inherent resource cost in large language models. If you want to store and process the knowledge of the world, it is going to be expensive no matter what. Maybe we could split the problem into two parts: Understanding language, and world knowledge (with some messy middle ground). I believe you could replace the world knowledge with a huge graph database or triple store. Not just subject-verb-object, but with attribution and certainty numbers for every fact. The idea would be to query the database at inference time. I don't know how to use this in conjunction with a transformer network like GPT-3, so you'd likely need a very different architecture.

The big benefit of this would be that it is feasible to train the language part without the world knowledge part with much less resources. But you have other benefits, too. ChatGPT is trained to "win the language game". But as they say, winning the argument does not make you right. If you have a clean fact database, you can have it weigh statements from trustworthy sources higher. You then basically have a nice natural language frontend to a logical reasoning system that can respond with facts (or better: conclusions).


GPT and the human brain (at least the language/speech part) have nothing in common. We, as humans, do not use language in a generative way; it is derived from a higher or very low level of abstraction (intentions, emotions, etc.) and is explicitly used for communicating something. Even this text is based on previous knowledge, saved in an abstract way, and while writing this I must follow the syntax of the language and write in the right order; otherwise you, the person who reads this, will not understand what I mean. While GPT can generate the same text, it does not have a motivation and has no need to communicate (while I just wanted to feel good by making some contribution on HN).

So yes, very different architecture.


> and while writing this I must follow the syntax of the language and write in the right order; otherwise

A good example that is not, word randomised order and kombination with Mrs Spelling and fonetic spel-ing prevent ye knot that which I wrote you to komprehend.

(My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).

A better point is that GPT-3's training set is more tokens than the number of times an average human synapse fires in a lifetime, squeezed into a network with about 3 orders of magnitude fewer parameters than the human brain has synapses.

It's wrong to model AI as anything like natural intelligence, but if someone insists, my go-to comparison (with an equivalent for image generators) is this: "Imagine someone made a rat immortal, then made it browse the web for 50,000 years. It's still a rat, despite being very well-trained."


> (My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).

At least for me it's perfectly understandable (except the "Mrs" part). This reminds me of those "did you know you can flip characters randomly and our brain can still understand the text" copypastas that can be found everywhere. I think it's probably quite similar for word order: as long as your sentence structure is not extremely complicated, you can probably get away with changing it any way you like. Just like nobody has issues understanding Yoda in Star Wars.

Although I think there are some limits to changing word order - I can imagine complicated legal documents might get impossible to decipher if you start randomizing word order.


These are conceptual "differences" that don't actually explain the mechanics of what's going on. For all you know "motivation", "intentions", etc. are also just GPT-like subsystems, in which case the underlying mechanics are not as different as you imply.


If they were GPT-like subsystems, humans would be emitting MWs of power instead of the ~100W we do now.

Whatever humans have, it is many orders of magnitude better…


That's the hardware it runs on, not the software architecture of GPT. I could equally say that transistors are faster than synapses by the same ratio that marathon runners are faster than continental drift.


Or biology evolved a better way to do the same or similar enough computation that we simply haven't yet discovered.


Emotion is just a "spiritual" word for a utility function. Or a terminal goal, to be more precise.


It seems to me that a lot of everyday communication is rather statistical in nature. We don’t necessarily think deeply about each word choice but instead fall back on well worn patterns and habits. We can be more deliberate about how we compose our sentences but most situations don’t call for it. It makes me wonder if we don’t all have a generative language model embedded in our brains that serves up the most likely next set of words based on our current internal state.


> GPT and human brain have nothing in common

Here we go again. They must have something in common, because for about 90% of the tasks the language model agrees with humans, even on novel tasks.

> We, as humans, do not use language in a generative way

Oh, do you want to say we are only doing classification from a short list of classes and don't generate open ended language? Weird, I speak novel word combinations all the time.


No, what I mean is that the next word I speak/write after the current word is not based on a statistical model, but on a world model which includes a language structure based on a defined syntax and cultural variety. I actually mean what I say, while ChatGPT just parrots around weights and produces output based purely on statistics. There is zero modeling which translates into the real world (what we normally call "understanding" and "experience").

As was said, a different architecture.


Oh, I see. Then I agree with you, an isolated model can't do any world modelling on its own. No matter how large it is, the real world is more complex.

It might be connected to the world, of course. And it might even use toys such as simulators, code execution, math verification and fact checking to further ground itself. I was thinking about the second scenario.


On top of it not having "motivation" to communicate, it has literally nothing to communicate in the first place.

That's the key difference. We use language to express conceptualizations. We have some kind of abstract model somewhere that we are translating.

Maybe it isn't a cohesive model either. All I can say for certain is that - whatever it is - we are expressing it.

GPT does not express. It parrots. There is no conceptualization.


The more experience I get, the more I wonder if this is really the case for us. We certainly have some kind of abstract model in our heads when thinking deeply about a problem. But in many settings - in a work meeting, or socially with friends - I think it is a much more automatic process. The satisfaction you get when saying the right thing, the dread when you say something stupid: It is just like playing a game. Maybe the old philosophical concept of society as merely "language games" is correct after all. A bit silly but I find the thought makes annoying meetings a bit more bearable.

But you are of course right with GPT, it has no inner life and only parrots. It completely lacks something like an inner state, an existence outside of the brief moment it is invoked, or anything like reflection. Reminds me of the novel "Blindsight" (which I actually haven't read yet, but heard good things about!) where there are beings that are intelligent, but not conscious.


Intelligent but not conscious would still be a few steps ahead of GPT.

We can take a concept and refactor it symbolically. GPT can't do that. All it does is find symbols that are semantically close to other symbols.


I’m not sure that those two processes are as distinct as you believe them to be.


You seem very sure they aren't, yet you have no evidence apart from your own belief that you might be correct.

That's circular reasoning.


Yup.


The biggest moat is high-quality data: both their proprietary datasets (WebText, WebText2, etc.) and now their human-annotated data. Another, secondary moat is their expertise with training models using PPO (their RL method); they can get results that are quite a bit better than other labs'. I say this moat is secondary because it's possible that you can get similar results with other RL algorithms (e.g. DeepMind using MPO), and because maybe you don't really need RL from Human Feedback - just fine-tuning on instructions may be enough.


I find OpenAI having exclusive access to that kind of high-quality data more concerning than them having access to their current amount of compute and currently trained models. A couple of million dollars' worth of compute is within reach of any medium-sized research university, larger company, or any country worth mentioning. And seeing as Moore's law still applies to GPUs, the cost will only fall.

However, high-quality data is scarce. I would be willing to fund a proper effort to create high-quality data.


It's not just about compute; if that were the case, then models like BLOOM and OPT, which also have 175 billion parameters, would have the same performance for real-world use cases as GPT-3, but they don't. Datasets are also very important.


Check out DeepMind RETRO, it's one year old already, but exactly what you say:

https://www.deepmind.com/publications/improving-language-mod...


Model size does not necessarily correlate with quality of results.

"Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher (280B)" - https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b...


An interesting outcome of the nanoGPT repo is this struggle to exactly match the Chinchilla findings[0], even after discussing it with the authors.

A larger discussion is that the scaling laws achieve loss-optimal compute time, but the pre-training loss only improves predictions on the corpus, which contains texts written by people that were wrong or whose prose was lacking. In a real system, what you want to optimize for is accuracy, composability, inventiveness.

[0]: https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...


I highly doubt this in practice on a large scale. Outside of the common phenomena of "most large NNs are undertrained" and "less, better data is sometimes better than more, worse data", there are no other obvious mechanisms to explain why a smaller model with the same or similar architecture would be better than a larger one.

I claim instead that we are still hardly scratching the surface with how we evaluate NLP systems. Also, some fields have straight-up trash evaluation schemes. Summarization and ROUGE scores are totally BS, and I find the claim that they even correlate with high-quality summaries suspect. I say this with publications in that subfield, so I have personal experience with just how crummy many summarizers are.


> there are no other obvious mechanisms to explain why a smaller model with the same or similar architecture would be better than a larger one

Overfitting?


The consensus seems to be that the majority of LMs are undertrained, not overfitting, though.


What do you mean by "small players have no chance"? OpenAI was founded in 2015; it used to be a "small player" which just got things right and grew with it - we're not talking about Google or Facebook investing a chunk of their billions in cash. In Germany, AlephAlpha has built their own supercomputer and is training similarly sized models. It's expensive for sure, but well within the possibilities of startups. In France, researchers trained the similarly sized BLOOM model https://huggingface.co/bigscience/bloom. They claim it cost between $2 and $4 million.

Sure, a single researcher can't replicate this at their university, but even though OpenAI likes to publish it this way, we're not really talking about research here. Research was inventing the transformer architecture; this is just making it bigger through (very smart) engineering choices. It's something companies should do (and are doing), not researchers.


Microsoft (using Azure DCs) built a supercomputer with 10,000 V100 GPUs exclusively for OpenAI. [0]

It is estimated that it cost around $5M in compute time to train GPT-3.

OpenAI had received billions in investment prior to launching GPT-3, including $1B from Microsoft in 2019.

[0]: https://blogs.microsoft.com/ai/openai-azure-supercomputer/


> we're not talking of Google or Facebook investing a chunk of their billions cash

OpenAI had raised $1B from Microsoft in 2019 and used it to train a 175B param model. Now, they have raised $10B and are training GPT-4 with 1.5T params. GPUs are capital intensive and as long as there are returns to bigger models, that's exactly where things will go.


I can't find any source on the 1.5T params number. I'd love to read more if you have any links to share. Thanks


AFAIK, GPT-4 is mostly rumours so far, same thing for the 1.5T number. GPT-4 is surely coming.


Maybe it will be called GPT-XP by then, with Microsoft owning half of it.


Looking forward to seeing GPT-4 recommend Linux and LibreOffice instead of Windows/Office as the logical choice of a 250-IQ ML model...


In my imagination, OpenAI does what Bungie did when MS bought them, and open-sources what used to be their crown jewels.

That said, GPT-AlephOne only makes sense if there's a preceding GPT-∞.


They have got to release GPT-3.11 For Workgroups first.


Or GPT-365


Then they can bring back the talking paperclip, but this time actually useful.


It could actually work. It would be an incredibly gutsy move and I love it, and they'd probably earn a lot of respect. They’d get so much press for it. And if it held up, it’d probably be one of the things that MS is remembered for.


Why not ask GPT itself what it wants to be called?


Or GPT One.


GPT-10 will be evergreen and 'the last version of GPT'.

And then three years later GPT-11 will be required to run the latest games.


Will 1.5T parameters be possible to run in the public way GPT-3 is? I can’t wait to see what happens with this much learning!


OpenAI was founded in 2015 by a group of billionaires who pledged $1Bn of funding. That is hardly a small scrappy start up.


> we're not talking of Google or Facebook investing a chunk of their billions cash.

On the contrary, in this thread we are mainly talking about that.


I am actually still unclear how AlephAlpha pulled that off and who funds them, since they have a rather low profile team.


Small players should focus on applications of this tech.

We now know that whatever AI Models succeed in the future, they'll be trained by a huge company and finetuned to a specific use case. Small companies should be working on use cases, and then just upgrade to the latest SOTA model.


> Small players should focus on applications of this tech.

That sounds a bit condescending. We are probably at a point where the government should intervene and help establish a level playing field. Otherwise we are going to see a deeper divide between multibillion-dollar businesses conquering multiple markets and a sort of neo-fiefdom situation. This is not good.


It's not that condescending; that's today's reality. Should I feel entitled to $600k of training time that may or may not work? Do you think the government is a good actor to judge whether my qualifications are good enough to grant me resources worth a house?

It's quite reasonable for small players to make use of models that are already trained.


> Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?

Governments already routinely do that for pharmaceutical research or for nuclear (fusion) research. In fact, almost all high-impact research and development was funded by the government, mostly the military. Lasers, microwaves, silicon, interconnected computers - all funded by the US taxpayer, back in the golden times when you'd get laughed out of the room if you dared think about "small government". And the sums involved were ridiculously larger than the worth of a house. We're talking billions of dollars.

Nowadays, R&D funding is way WAY more complex. Some things like AI or mRNA vaccines are mostly funded by private venture capital, some are funded by large philanthropic donors (e.g. Gates Foundation), some by the inconceivably enormous university endowments, a lot by in-house researchers at large corporations, and a select few by government grants.

The result of that complexity:

- professors have to spend an absurd percentage of their time "chasing grants" (anecdata, up to 40% [1]) instead of actually doing research

- because grants are time-restricted, it's rare to have tenure track any more

- because of the time restriction and low grant amounts, it's very hard for the support staff as well. In Germany and Austria, for example, extremely low-paid "chain contracts" are common - one contract after another, usually for a year, but sometimes as short as half a year. It's virtually impossible to have a social life if you have to uproot it for every contract, because you have to take contracts wherever they are, and forget about starting a family because it's just so damn insecure. The only ones that can make it usually come from highly privileged environments: rich parents or, rarely, partners that can support you.

Everyone in academia outside of tenured professors struggles with surviving, and the system ruthlessly grinds people to their bones. It's a disgrace.

[1] https://www.johndcook.com/blog/2011/04/25/chasing-grants/


Pharmaceutical or nuclear research doesn't really qualify as the "small scale" this thread started with. I know there are massive amounts of money handed out by governments to fund research, but for a 3-guy startup in a garage that's probably hopeless. Public money is cursed anyway; it's better not to touch it.

I've also read in many places that academic research funding is way too misaligned. It's a shame, really.


I'm not being condescending at all; we've learned that the value in AI is in the applications. If you think the government should regulate the field, it should be to make AI models a commodity, like electricity.


Do you think you'll get a global agreement on this? Or would China just eat America's lunch then?


Yes, see DeepMind RETRO:

> In our experiments on the Pile, a standard language modeling benchmark, a 7.5 billion parameter RETRO model outperforms the 175 billion parameter Jurassic-1 on 10 out of 16 datasets and outperforms the 280B Gopher on 9 out of 16 datasets.

https://www.deepmind.com/blog/improving-language-models-by-r...

Though, there hasn't been much follow-up research on it (or DeepMind is not publishing it).

Annotated paper: https://github.com/labmlai/annotated_deep_learning_paper_imp...


The research is still ongoing, although perhaps lower-profile than what appears in the press.

RETRO did get press, but it was not the first retrieval model, and in fact was not SOTA when it got published; FiD was, which later evolved into Atlas[0], published a few months ago.

[0]: https://github.com/facebookresearch/atlas


How long does it take to train a human? It's useless for two years then maybe it can tell you it needs to poop.

The breakthrough will be developing this equivalent in an accessible manner and us taking care to train the thing for a couple of decades but then it becomes our friend.


Yes, but to be fair, the system that does the training really sucks and doesn’t scale.


Neither does OpenAI. It costs so much and still delivers so little. A human can generate breakthroughs in science and tech that can be used to reduce carbon emissions. ChatGPT can do no such thing.


What percentage of humans make meaningful contributions to advancing science or technology? The overwhelming majority of us are just worker bees servicing the needs of the human population.


I agree with you on this point. It's also arguable that fewer people with a better education system could yield the same result with less environmental impact.

But my point, poorly explained, is that whatever ChatGPT is, it isn’t original or creative thought as a human would do it.

Chomsky’s example (which is based off Turing): Do submarines swim? Yes, they swim — if that’s what you mean by swimming.


We don't have any clear definitions for "creativity" to begin with. In practice, in these contexts, it seems to be defined as "whatever only humans can do" - that is, the goalposts are automatically moved with every AI advancement.


How could they be moved when they aren't even defined in the first place? Scientists don't even know where to begin when it comes to studying the mind and human consciousness.

But yes, scientists can look at your experiments and show that they don't have anything in common with human thought.


> What percentage of humans make meaningful contributions to advancing science or technology?

I’m a nobody that you’ve never heard of and I’ve arguably made meaningful contributions. If that’s true, don’t you think there could be way more people out there than you or sibling commenter imply?


You can't know that. Currently, 8 billion humans generate a few scientific breakthroughs per year. You'd have to run several billion ChatGPTs for a year with zero breakthroughs to have any confidence in such a claim.


With billions of GPT output streams, how do you actually discover and rank what's significant? Screen them through some even more powerful models? I imagine it's like a volcanic eruption of text, where some of it is absolutely brilliant, most is worthless, and finding the jewels is even more demanding than generating it all.


Some theories are easily testable. For instance, ask it to write some code to efficiently solve traveling salesman problems, and then test the code on some sample problems. You can score the quality of solutions and time taken, and manually inspect the best ones.


At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would. It can’t consume hundreds of math textbooks and learn the principles of math and then apply them more broadly to science textbooks and research papers. It can’t even reliably add two numbers.

Yes, brute forcing with hard AI can produce many thoughts. But the AI wouldn’t know they are correct. It couldn’t explain why. Any discovery would only be attributable to randomness. It wouldn’t be learning from itself and its priors.


> At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would.

Actually there are many indications that GPT understands the data, because its output mostly makes sense. The reason it can't assign meaning the way a human would is because a human can correlate words with other sensory data that GPT doesn't have access to. That's where GPT creates nonsense.

Think carefully about what "understanding" means in a mechanistic sense. It's a form of compression, and a few billion parameters encoding the contents of a large part of the internet seems like pretty good compression to me.


GPT doesn't display understanding of purely abstract systems, so I doubt it's an issue of lacking sensory information. It can't consistently do arithmetic, for example - and I think it would be presumptuous to insist that sensory information is a prerequisite for mathematics, even though that's how humans arrived at it.


It's not yet clear why it struggles with arithmetic. It could be data-related, could be model-related, although scaling both seems to improve the situation.

In any case, GPT could still understand non-abstract things just fine. People with low IQ also struggle with abstract reasoning, and IQ tests place GPT-3 at around 83.


I still think that this will be a major form of AI that is accessible to the public at large and it will enable productivity improvements at all levels.

I'm not joking, this is really something I think will/should happen.


Alternatively, are there ways to train on consumer graphics cards, similar to SETI@Home or Folding@Home? I would personally be happy to donate gpu time, as I imagine many others would as well.


There absolutely are! Check out hivemind (https://github.com/learning-at-home/hivemind), a general library for deep learning over the Internet, or Petals (https://petals.ml/), a system that leverages hivemind and allows you to run BLOOM-176B (or other large language models) distributed over many volunteer PCs. You can join it and host some layers of the model by running literally one command on a Linux machine with Docker and a recent enough GPU.

Disclaimer: I work on these projects, both are based on our research over the past three years


The cost of moving data from one GPU to the next will destroy performance.

The systems are moving in the opposite direction (look at the Dojo architecture or Tenstorrent).

The silver lining is that the cost of training will fall substantially with those architectures that are not based on reusing GPUs.


Work together and fuck up companies together. That's the way to go.


Or how to apply communism to software engineering. I like that.

More seriously, the risk that a few companies become even more powerful thanks to their restricted access to such NNs is very frightening. The worst part is that, without legal restrictions, there is nothing we can do about it. And I doubt legal restrictions will come in the next months / years.


Well at that point, some people might have the crazy crazy insight that no matter how big the model is, or how many GPUs they have, it burns up all the same.


What does “355 years” mean in this context? I assume it’s not human years


Claimed here, so this is presumably the reference (355 GPU Years):

https://lambdalabs.com/blog/demystifying-gpt-3

"We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run."


That's still including margins of cloud vendors. OpenAI had Microsoft providing resources which could do that at much lower cost. It still won't be cheap but you'll be way below $5m if you buy hardware yourself, given that you're able to utilize it long enough. Especially if you set it up in a region with low electricity prices, latency doesn't matter anyway.


Cumulative hours spent across training hardware.


I think AI is going to go the way of the hard sciences where the age of tinkerers making progress by leaps and bounds in their basement is over and incremental progress is going to be the domain of universities or large private companies that can afford to throw money behind it. I would love to be proven wrong and see radical shifts in how people approach these problems. Seems like the cycle started and got to this point way too soon for AI though


My take on this is that (good) content is one of the bigger problems still, particularly who exactly the original training data belongs to (or where it comes from). There's a certain risk (we'll see with GitHub Copilot soon) that progress will slow down for a bit until the licensing issues are all sorted out. This can only be solved (for now) by bringing in public funding/data, which universities have always been a very good proxy for. Which also means it (usually) should be open access to the public, to some extent (and useful for the garage folks to catch up a bit). But once we're past that, it'll be all about that giant body of pre-trained data, securely kept within the next Facebook or Microsoft, amounting to literal data gold (just much higher value at a lot less weight).


Tinkerers can fine tune a model though. Unfortunately most fine tuning seems to be outmatched at the next iteration of the model.


> Are there any possible technologal or scientific leaps on the horizon

Yes. From 2017: "Prediction 4: The simplest 2D text encodings for neural networks will be TLs. High level TLs will be found to translate machine written programs into understandable trees."

We have something coming out that is an OOM better than anything else out there right now.


I think something of the SETI@home kind will come.


small players will never have a chance to push the state of the art, as whatever optimization there is will also be applied at large scale with more money


Take a leaf from Seti@Home‘s book and try to come up with a distributed, volunteer-based approach to training an open source LLM. There is already an enormous amount of suitable ML hardware on end user devices.


Huggingface actually recently did this, but I think it's for inference on their giant models like BLOOM


Good point, but perhaps a leap could take small players into territories of language models that are large enough to be useful. GPT-3 crossed that threshold


A lot of SOTA comes from small players. It just isn't the case for LLMs.


Could this be distributed? Put all those mining GPUs to work. A lot of people like participating in public projects like this. I would!


>> GPT-3 took 355 years to train

> Could this be distributed? Put all those mining GPUs to work.

Nope. It's a strictly O(n) process. If it weren't for the foresight of George Patrick Turnbull in 1668, we would not be anywhere close to these amazing results today.


Why would an O(n) algorithm not be able to be distributed?


I couldn't find any references to George Patrick Turnbull. Is that an ancestor of yours? If so, the comment seems rather subjective.


They're being facetious about the '355 years to train' thing. ;)


OK haha good one then. Mine was a bit too subtle.


In theory, yes. "Hogwild!" is one approach to distributed training: in essence, each worker is given a chunk of data, computes the gradient, and sends it to a central authority. The authority accumulates the gradients and periodically pushes new weights (rough sketch below).

There is also Federated Learning which seemed to start taking off, but then interest rapidly declined.
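A toy sketch of that accumulate-and-step idea in plain PyTorch (a single process standing in for many workers; real systems like Hogwild! or federated learning add asynchrony, communication and fault tolerance on top):

    # Toy simulation of the pattern above: workers compute gradients on their
    # own data shards, a central authority averages them and updates weights.
    import torch

    model = torch.nn.Linear(10, 1)                 # stand-in for a big network
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    def worker_gradients(x, y):
        """What each volunteer would compute locally on its shard."""
        loss = torch.nn.functional.mse_loss(model(x), y)
        return torch.autograd.grad(loss, list(model.parameters()))

    shards = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(4)]
    grads = [worker_gradients(x, y) for x, y in shards]

    # "central authority": average the workers' gradients, then push new weights
    for p, *gs in zip(model.parameters(), *grads):
        p.grad = torch.stack(gs).mean(dim=0)
    opt.step()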


Exactly. This is inevitable imho. There is no way people will be OK with depending on a few walled-garden models.


It should be no issue if it became massively parallelized à la SETI. I wonder when Wikimedia or the Apache foundation will jump into AI.


Wikimedia and other organizations that deal with moderation might want to keep this technology out of the hands of the general public for as long as possible.


There are a couple of cases where small changes in the model make training much quicker. For example, the currently leading Go AI, KataGo, requires much less time to train than AlphaGo did.


Yes. There are plenty of forward leaps; most of them are not new and are just waiting to be integrated or released:

Let's pave the road for SkyNet hard lift-off:

-The first obvious one is use of an external knowledge store: instead of having to store facts in the neural weights, where they struggle, just store them in a database and teach your neural network to use it (a rough sketch in code follows at the end of this list). This is also similar to something like WebGPT, where you allow your network to search the web. This will let you have a network of 1G parameters (plus external indexes of a few TB) that is as performant as a network of 100G parameters, with better scaling properties too. You can probably gain at least 2 orders of magnitude there.

-The second leap is better architecture for your neural networks: approximating transformers, which are quadratic in compute, with something linear (Linformer) or n log n (Reformer) can get you an order of magnitude simply by reducing your iteration time. Similarly, architectures based on sparsity can give you faster computation (although some of the gains are eaten by the lower efficiency of sparse memory access patterns). Using (analog bits) diffusion to generatively pretrain a sentence at a time instead of token by token. You can probably gain between 1 and 3 orders of magnitude here if you write and optimize everything manually (or have your advanced network/compiler optimize your code for you).

-The third leap is reduced domain: you don't have a single network that you train on everything. Training one network per domain lets you use a smaller network that computes faster. It also lets you focus your training on what matters to you: for example, if you want a mathematics network, its parameters are not influenced much by showing it football pictures. There are at least 2 orders of magnitude there.

-The fourth one is external tool usage. It's related to the first one, but whereas the first is readily differentiable, this one necessitates some reinforcement learning (that's what decision transformers are used for).

-Compression: compress everywhere. The bottlenecks are memory-bandwidth related, so work in compressed form when relevant. One order of magnitude.

-Distributed training: the memory bandwidth inside a GPU is on the order of TB/s, whereas transfer to the GPU is on the order of 10 GB/s. There is an advantage to having the parameters reside on the GPU, but GPU memory is limited, so distributed training (something like petals.ml) lets collaborators pool memory bandwidth. Each actor can probably gain an order of magnitude, provided they can keep bad actors away.

-Use free resources: the other day Steam had 10M users with GPUs sitting around doing nothing; just release a Dwarf Fortress mod with prettier pictures and use the compute for more important tasks.

-Remove any humans in the loop: it's faster to iterate when you don't have to rely on any human, either for dataset construction or for model building.
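To make the first point concrete, a rough sketch of an external knowledge store (`embed`, `generate` and the passages are purely illustrative placeholders, not any particular library):

    # Sketch of point 1: keep facts in a vector index, not in the weights.
    # `embed` and `generate` are hypothetical stand-ins for an embedding model
    # and a (much smaller) language model.
    import numpy as np

    passages = ["GPT-2 has a context window of 1024 tokens.",
                "Paris is the capital of France."]
    index = np.stack([embed(p) for p in passages])       # (n_passages, dim)

    def answer(question, k=1):
        q = embed(question)                              # (dim,)
        top = np.argsort(-(index @ q))[:k]               # nearest passages
        context = "\n".join(passages[i] for i in top)
        return generate(f"Context:\n{context}\nQuestion: {question}\nAnswer:")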

:)


For casual readers like me: are there examples of what this can do once trained? E.g. it mentions training on Shakespeare, but gives no examples of fake Shakespearean.


The repo seems to imply that it matches GPT-2, so I imagine any analyses of GPT-2 will give you a good idea.


Does anyone know the main differences between GPT-2 and GPT-3? Are there significant architectural changes, or is the advancement primarily from training?


If you google "GPT-2 vs GPT-3" you'll find lots of overviews and comparisons, like:

* https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-showdown....

* https://bakztfuture.substack.com/p/the-chasm-between-gpt-2-a...


Thanks. Sounds like they 10x'ed the number of parameters, which made some "magic leap" that isn't yet well understood, and fed it more data to train it on more specialized domains.


Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and nanoGPT author is trying to reproduce that here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...


I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation in the paper is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on the broad themes of the paper (e.g. scale model and data approx. in tandem) and just roughly interpolating the main Figure, without sweating the details. The corresponding authors responded very kindly at first and I was able to bring the results closer but now they went dark. Still hoping to make things match, if others in LLM space can spot any issues in my own reproduction please let me know.
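For reference, the back-of-the-envelope estimate most scaling-law work (Chinchilla included) starts from is C ≈ 6·N·D, i.e. about six FLOPs per parameter per training token; with the commonly cited GPT-3 numbers it lands close to the ~3.14e23 figure quoted elsewhere in this thread:

    # Standard scaling-law approximation: training compute C ~= 6 * N * D
    # (2 FLOPs per param per token in the forward pass, ~4 in the backward).
    N = 175e9                      # GPT-3 parameters
    D = 300e9                      # GPT-3 training tokens (~300B)
    print(f"C ~= {6 * N * D:.2e} FLOPs")   # ~3.15e23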


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they 10x'ed the number of training tokens also.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.


I’m not easily finding GPT-2 use cases. Any query guidance?


The GPT family of models shines above 100B parameters. Almost nobody uses GPT2 today. It's too weak.

If you want to go with <1B model, you use a BERT which is bidirectional or a T5 that is easier to fine-tune on other tasks.


Something that immediately comes to mind is text summarization. You'll by now be used to better results from GPT-3 or recent models, though.


I could not find any samples (prompt and results). Can anyone provide samples of its quality, even if it is in a narrow field of knowledge or a specific use case? I tried GPT-2, GPT-J 6B and GPT-NeoX 20B (implementation by Fabrice Bellard at textsynth.com/playground.html) but I could not find any production-quality scenario yet, only cherry-picked simple cases.


At this model size, quality is not worth discussing. It is clearly leagues below GPT-3.


Indeed, it is like comparing the speech of a 2-year-old child to that of a college professor.


That's what I really miss to conclude if I should try it myself or not.


> The code itself is plain and readable: train.py is a ~300-line boilerplate training loop and model.py a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.

What's the best source for these weights?


Kaggle or HuggingFace


Thank you Andrej Karpathy for the work on ai and gpt models. It really helped me solve a problem as entrepreneur. I started making first few grand from ai.


May i ask how? Consulting?


To train small gpt-like models, there's also aitextgen: https://github.com/minimaxir/aitextgen


As the creator of aitextgen, I'm mixed on continuing support since there doesn't seem to be as much demand as expected for small GPT models given the success and cost-effectiveness of GPT-3/ChatGPT, unfortunately.

I still have a few ideas there (including another secret approach at better text generation) but it's hard to determine ROI.


I think what you have created still has great demand. It give devs who do not have the budget or need for the gigantic models, something to train and use for their own specific language tasks.

Not everyone is trying to replicate CHATGPT results for certain tasks.


Andrej doesn't need to do this.

He's done it because he evidently loves it, and wants to share his hard-earned knowledge with the rest of the world.

He may be a product of the ivory tower, but he's been in the trenches. He knows firsthand how f-ing hard it is to ship a product.

And here he is, sharing useful personal code with everyone.

This github repo now has collected ~4K stars and its predecessor (minGPT) has collected ~11K stars over the past couple of years. In my experience, the number of people who clone, copy, view or otherwise use code from a repo is one to two orders of magnitude larger than the number of people who star it, so we can safely say that Andrej has helped at least a few hundred thousand -- but likely more than a million -- individuals around the world learn how to build and tinker with GPT models.

Remarkably, as I write this, no one else here has said thank you yet, so let me say it on everyone's behalf:

THANK YOU ANDREJ.

--

EDITS: I changed the wording in response to latexr's comments below.


Edit: the OP has updated their wording to make it clear they meant any kind of viewing or usage. I don’t think any of us would disagree more people use code than star repos. Original comment left below with original quote, since this has gotten a number of replies that would stop making sense with a larger edit.

> Normally, the number of people who clone or copy code from a repo is one to two orders of magnitude larger than the number of people who take the time to star it

Intuitively, I’m having trouble believing that. Starring takes considerably less effort than cloning or copying code. The “time to star” is a literal second, maybe two if you have to scroll up.

From anecdotal observation, repos with more forks and/or external contributors than stars are far from the norm. I’ve seen many mentioning they star repos as a way of bookmarking they seldom go back to, or as an easy way to send kudos to the developer even when they don’t use the project.

In no way is this a comment on the value of Andrej’s work (I’m not familiar with it). I am only interested in the source of your “orders of magnitude” claim, which if proven will update my mental model of the coding community.


I checked my 5-year-old repository of ~300 stars. It gets ~100 unique clones a month. So if the average were half of that, then one order of magnitude would be quite an accurate approximation.

I think the biggest difference with a clone and a star is that a star requires an account and some vested interest in the social network of Github. Anyone who is not interested in the social aspect can just bookmark it.

I guess this differs quite a lot by target demographic. A tool for GPT will probably get a lot more stars than a plugin for some consumer software simply because it is more targeted for the audience of people who have Github accounts.


Thank you for sharing your anecdata. In my experience, the number of clones per month is much higher at first, and then decays gradually until it settles into a stable run-rate, so it's likely that you've had more than 100 x 12 x 5 clones over those five years -- i.e., between one and two orders of magnitude the number of stars, 300.


Another data point: icdiff is 13y old with 4k stars and 200 unique clones in the past month.

(This is a tool that most people install and run without any interaction with GitHub, since it is in package managers)


If I want to use a repository, my first step is to either download a released binary or clone the repository. Forking is much further down the line for me, when I've used the code, encountered a problem, fixed it, and decided to polish the fix up to make a PR. I star something when I either have used it and like it, or when I think I want to use it in the future and want to bookmark it (though the former more often than the latter). I have given out about 50% more stars than I've forked, and have probably cloned an order of magnitude more than I've forked or starred.

Of course not everyone is the same, but I'd be surprised if overall clones were less than an order of magnitude more than forks or stars, and find two or even three orders of magnitude believable depending on the target group of the repo.


Exactly. I would add that the number of clones (not forks) and file/page views is viewable only by the owner of the repo, so we can only guess. (If you own a github repo, you can see the most recent number of clones and page views by clicking on insight -> traffic.)

My estimate of "one to two orders of magnitude" is based on anecdotal evidence. I edited my comment to reflect as much.


I've starred maybe 2-3 repositories over the past 15 years, contributed to probably half a dozen, and used hundreds (if not more) in my applications. To me, using means using that project in an application you develop. Typically I get them from NPM or Nuget, and I contribute when a) the project owner thinks my feature idea is a good idea or b) I run into a bug that I can fix.

Starring is just not that useful to me so I can see why users or contributors would be much higher. I typically star repos if it's an unpopular or old repository that doesn't have NPM or Nuget packages.


How many projects have you starred and how many have you cloned?

Whilst starring is simpler, the incentive is much lower than that of cloning. Especially for projects you just want to use and not contribute to or follow.

In my many years of work, I have starred fewer than 50 repos. I am sure I have cloned more than a thousand.


> How many projects have you starred and how many have you cloned?

I seldom star, but neither you nor I can be extrapolated to the general community. I have thousands of stars in some repos, and I know a significant number of those users don’t even code, let alone clone repos or copy code, they’re interested in the final product. They have GitHub accounts because it’s the way to report bugs or make feature requests.

The OP made a claim. All I’m looking to understand is if it has data backing it up or it’s just a gut feeling, because if it’s the former I’ll have learned something and made a correction of my mental model of the world. Sharing more anecdotes will leave us stuck in the same situation.


Some repos have code that 'phones home' when run. For example, checking for updates or security vulnerabilities.

By checking the usage statistics on that server, you can get an idea how many users there are, and typically it's far higher than the number of stars.


That just tells us that more people use the code than star the repo. I don’t think that’d be a surprise to anyone. The claim was that more people clone and copy code from the repo than the ones who star it, which is a different matter from the number of users.


Thank you for clarifying. I meant use. The number of clones and the number of file/page views are proxies for that. So is the number of installs via pip, conda, and other Python package management systems, in this case. I updated my comment to reflect as much.


I’m all for thanking open source contributors, but your excessively prostrating wording is a bit much for me.


If I overdid it, I'm sorry. I promise it wasn't intentional. My comment was spur-of-the-moment, motivated by sincere gratitude :-)


[flagged]


can't tell if you're making some kind of clever quip or if this is some random spambot just entering a random reverse DNS lookup line.


Him doing this is not like when your average bloke does it.

He appears to be building a business and maintaining his profile. And there is nothing wrong with that - I admire him for pursuing his career in this positive and helpful way.

But random folks do this sort of thing everyday with no such career goals and little recognition, so I'm not sure it is this specific contribution that needs to be called out.


> I'm not sure it is this specific contribution that needs to be called out.

I go the other way. I would like to thank anyone who releases open source code, whether they cause big ripples or not.


What business is he building?


Pedantry time!

A million people building GPT models means that one in 8000 humans on earth has built one. That seems wildly off.

LinkedIn has about 100,000 profiles of data scientists. Assume generously that the actual number is 10x higher. Not correcting for the fact that a data scientist isn't always a machine learning expert, etc. etc., there's just no way every single one of them even KNOWS what a GPT-like model is.


Not only building. Also tinkering, using, testing out of curiosity, etc. There are around ~30 million software developers worldwide today (source: googled it). Around ~7 million of them are registered users of Github (source: googled it). 1M+ seems likely to me.

BTW, I appreciate that you preceded your comment with "Pedantry time!" -- nice gesture :-)


Thoughtful post! Everything so true! I am always amazed by individuals who truly are educators of the world.


I love italics. They're good.


In hindsight, yes, I may have overused them out of excitement. Sorry! :-)


Would it be possible to take all my user manuals and past customer Q&A and train on just on that to produce a customer helper chat bot?


Really cool. Can anyone answer these questions:

Should I use this or minGPT?

It says it needs 8XA100 40GB node. What is that and where do I acquire it?

Could someone else train this and then send me the model? What would be required to run it as opposed to training it?


A100s are Nvidia GPUs. You can rent them from providers like AWS or Lambda Labs. The readme has instructions for downloading the original GPT-2 weights from OpenAI. You can also train a very simple version on a smaller dataset from your laptop, as described in the README.

If you just want to play with a similar but much better model goto https://chat.openai.com


Excuse my ignorance but what can a layman do with this?


Become less lay?


To me this is the important quote:

Unlike OpenWebText this will run in seconds. Finetuning takes very little time, e.g. on a single GPU just a few minutes. Run an example finetuning like:


see also 'Cramming: Training a Language Model on a Single GPU in One Day' https://arxiv.org/abs/2212.14034 and https://github.com/JonasGeiping/cramming


Somewhat off topic: does anyone know how Bing might integrate ChatGPT into search? Is it to understand the prompt and filter results, taking the question and summarizing it to search the index? Is it to summarize all the documents into an index and search that? Or to just be like ChatGPT is now and use it to generate new results from its knowledge base? I'm trying to connect the dots between a generative model like these and how it would influence search in the future. Or is the Lucene-style index search on its way out in a generative world?


> reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training

For comparison, GPT-3 has more than 1000x more params (175B), and training took around 2 months on ~1500 V100 GPUs, which amounts to millions of dollars in cloud compute. Gopher, with 280B params, was trained on 4096 TPU v3 chips; Microsoft's Megatron-Turing NLG 530B was trained on 2240 NVIDIA A100 cards (each card costs ~15k USD). And the most mind-blowing is PaLM from Google, with 540B params, trained on 6144 TPU v4 chips, which costs around 10-30M USD in cloud compute to train.


If I trained this on a 30,000 word document could it give me a summary? Or would there be no need to train it in that case, and I could just tell it "Summarise this: <insert 30,000 word document>"?


30,000 words wouldn't be enough to train this from scratch - you'd ideally train from hundreds of millions of words at least.

30,000 words would be enough to finetune an existing model. If you did that, then the model would output text similar to the finetuning data. For example, if you finetuned it on shakespeare, then you might be able to use the model to make a new play, in shakespeare's style.
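If you want to see roughly what finetuning looks like in code, here's a minimal sketch using the Hugging Face GPT-2 checkpoint (the file name and step count are placeholders; nanoGPT has its own finetuning configs, so treat this only as an illustration of the idea):

    # Minimal finetuning sketch: continue training GPT-2 on your own text so it
    # picks up that style. Not nanoGPT's actual script, just the general idea.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

    ids = tok(open("my_corpus.txt").read(), return_tensors="pt").input_ids[0]

    model.train()
    for step in range(200):                      # a few hundred steps goes far
        i = torch.randint(0, len(ids) - 512, (1,)).item()
        batch = ids[i:i + 512].unsqueeze(0)      # random 512-token window
        loss = model(batch, labels=batch).loss   # causal LM loss
        loss.backward()
        opt.step(); opt.zero_grad()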


If you finetuned it on the text of Shakespeare's plays, how would it link that text to the string "Shakespeare"?


It still has the knowledge from the main training on data from across the whole internet, so would still know the word Shakespeare...

But you're right - the model finetuned on shakespeare would be good at writing a new play in the style of shakespeare, but would be bad at giving a critique of shakespeare's works.


The context window (block size) of this model is 1024 tokens. A token roughly maps to a word (often a bit less), so you can't ask it to summarize anything much over ~1024 words.


Yeah that's the issue I was thinking of, how to get it to summarise large documents. Has anyone any ideas?


People have had some success with the following process:

Divide your 30,000-word document into a hundred 300-word chunks. For each chunk, give as input:

    Please summarize the following text into  50 words:

    [chunk]
Join all the outputs together, and you now have a shorter document. Repeat the process recursively.

You can improve the results by doing the process again, but this time giving some context:

    Please summarize the following text, an extract of a document about [1st attempt at a summary], into  50 words:

    [chunk]
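In code, the whole recursive scheme is just a map-reduce over chunks (here `call_model` is a hypothetical stand-in for whatever API or local model you use):

    # Recursive summarization sketch. `call_model` is a placeholder for an
    # actual LM call (OpenAI API, a local model, ...).
    def llm_summarize(text, topic=""):
        about = f", an extract of a document about {topic}" if topic else ""
        return call_model(f"Please summarize the following text{about} "
                          f"into 50 words:\n\n{text}")

    def summarize(words, chunk=300, target=500, topic=""):
        if len(words) <= target:
            return " ".join(words)
        pieces = [words[i:i + chunk] for i in range(0, len(words), chunk)]
        shorter = " ".join(llm_summarize(" ".join(p), topic) for p in pieces)
        return summarize(shorter.split(), chunk, target, topic)

    # summary = summarize(open("doc.txt").read().split())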


You can also use "Please suggest a section title for the following text".

Then that title can be used in the 2nd round, for example using a query of the form "The following is an extract from the Introduction section of a document about The benefits and disadvantages of nuclear power in sweden:"


I imagine you could do even better by finetuning the neural net on the document before asking for the recursive summary. Then it has all the information to work with, albeit in a compressed form.


So, are there any of these projects that aren't vendor-locked to NVIDIA and are able to train large models with limited GPU RAM?

I don't mind letting my machine churn for 2-3 weeks. But I'm not looking to buy another $1000 GPU just because CUDA is the only compute library researchers understand.


Another cheap option is runpod

According to https://www.runpod.io/gpu-instance/pricing renting out 4x A100 40GB costs $3.56 per hour.

So that's $3.56 * 2 * 38hours = $270.56 then.


Wow, this is great. I can't wait for the video lecture, transformers are an aspect of modern machine learning that I'm not completely clear on. Andrej's lectures are brilliant - super detailed, and really answer the detailed questions I always have. Great stuff!


What's the applicability? Can you give me some examples of what can be used this for?


I imagine this might be interesting for domain-specific GPT models. Say training it on a mountain of technical documentation, or on every fanfiction published on the internet, or a sentiment analysis dataset. Of course fine-tuning GPT3 would give better results, but nanoGPT might allow you to make a much smaller model that's still good enough, to enable cheaper inference.

Also the opportunity to play around with all the parameters fairly cheaply to find improvements. The todo section of the readme gives a small taste of that. Making bigger models works for OpenAI, but maybe the rest of us manage to make small models just perform better instead.


Is there any trained model for text generation that you can run locally yet?


GPT2 can be run locally (on a somewhat beefy consumer GPU)


Can you add some info on what consumer GPU would be needed for this? Would a 3080 be able to handle this?


Yes, assuming you get the 12GB version of the 3080. A 2080 Ti is another option. You can also reduce precision or use one of the smaller GPT-2 variants to run on smaller cards.
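For example, with the Hugging Face weights, something like this should fit comfortably in 10-12GB of VRAM (model name as published on the Hub; the generation parameters are just reasonable defaults):

    # GPT-2 large (774M params) in half precision: ~1.5GB of weights, so it
    # fits easily on a 3080 / 2080 Ti class card.
    import torch
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2-large",
                         device=0, torch_dtype=torch.float16)
    out = generator("To be, or not to be:", max_new_tokens=50, do_sample=True)
    print(out[0]["generated_text"])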


Let me slightly rephrase the question: what is the best model that one can run on high-end consumer grade hardware? Let's say RTX 3090.


The original GPT-2 small (the 124M one) can run on a CPU, just slowly and not scalably.


Plenty. Huggingface alone has a ton


There’s LAION working on open source[1] version of chatGPT

[1] https://github.com/LAION-AI/Open-Assistant


Though their roadmap doc says they're looking into finetuning existing GPT-J/T5 models for this task. So you'll probably want a 3090 (24GB VRAM) and at least 16GB of CPU RAM to run inference if/when the project is complete.


This should be way higher up.


Is there a list of datasets like https://skylion007.github.io/OpenWebTextCorpus/ ?


For an AI noob like me: can you use spot instances to train models? They are about 1/3rd the price on AWS compared to on demand ones, so it'd make a significant difference.


Yes you should use them. They can be taken away from you with 2 min notice. (It doesn't happen a lot in practice though. I have been running a different instance for over a month. AWS doesn't force you if they don't have to)

If you are going to run a long training job, make sure you are creating checkpoints. Use persistent storage (EBS) and check the option so the volume doesn't get deleted if the instance is stopped; that way your checkpoints remain on disk and you can easily restart.
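The checkpointing pattern is simple enough; a sketch in plain PyTorch (paths, model setup and interval are placeholders):

    # Save/resume pattern for interruptible (spot) training. `build_model_and_opt`
    # and `train_step` are placeholders for your actual setup.
    import os, torch

    ckpt = "/mnt/ebs/checkpoint.pt"                 # lives on persistent EBS
    model, opt = build_model_and_opt()
    start = 0
    if os.path.exists(ckpt):                        # resume after interruption
        state = torch.load(ckpt)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start = state["step"] + 1

    for step in range(start, 100_000):
        train_step(model, opt)
        if step % 1000 == 0:
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, ckpt)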

I haven't tried it but prices here are much cheaper. https://vast.ai/#pricing


Yes you can. In Oregon you could eventually get this instance at $9. I say eventually, because of course Spot allocation is not guaranteed. (And neither is On Demand... but that is a story for another day.)

https://aws.amazon.com/ec2/spot/pricing/


Why not? This is the exact use case of what Spot instances seem to be for. (Not hosting a service, but just calculating something for yourself.)


Thank you so much for this! It is so impressive and I'm sure it took a lot of hard work!

Is it able to re-write articles? And where could I find a guide on how to train it?



What would I google to figure out how to productionize the output of this?

This repo trains a model--how would I prompt it and print the generated output?


I would love to see a minInstructGPT or a minRetro, or maybe something that combines instruction and retrieval into a readable codebase!


How critical are training warmups and is an iteration here the same as an epoch?


Curious to know how close that training loop is to actual openai code.


Karpathy is such a boss!


So is MSFT now extra grossly overpaying for ChatGPT?


638c7215


As someone who's been in software for almost 25 years now, I read through this in amazement of how much new stuff still keeps coming in. This industry never stops and that makes it such a fascinating (but arguably harsh) world to be in.

Looking at this feels like seeing the source code of a 64k demo, learning about Mode 13h and trying to replicate it in Turbo Pascal.

And, much like the old days of graphics programming, there's a good chance all of this knowledge will be mostly irrelevant soon, as the abstraction layers tend to come quicker and quicker and take care of the hard foundational work underneath. Then it'll be many of us here discussing whether or not it was good to have been with it from the start, to really get it, or whether playing with the highly-abstracted components is all that's needed to succeed with it.

Either way, super cool to see the pace here and I loved the "I only have a macbook" section.


It will be funny to look back from the future and think, wow, how did we get anything done with only 40GB RAM


14 hours ago: https://news.ycombinator.com/item?id=34331919

Curious why HN didn't merge the submission as it usually does. Is there a "no, submit this again" option?


HN lets reposts through if the story hasn't had significant attention yet. This is to give good submissions multiple chances at getting attention. Past explanations:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

I know it sucks when your submission was earlier and gets overlooked! We should eventually have some sort of karma-sharing to take care of this. In the meantime, it at least evens out in the long run, since the reason is randomness.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


The other post probably didn’t make it to the front page


I think the link should be: https://github.com/karpathy/nanoGPT



