Wow, fun to find this trending on HN this morning! I am currently also working on the associated video lecture (as the next episode of my video lecture series here https://karpathy.ai/zero-to-hero.html ), where I will build nanoGPT from scratch and aspire to spell everything out, as with the earlier videos. Hoping to get it out in ~2 weeks or so.
Open, accessible lectures and knowledge like yours have allowed many people, me included, to turn their lives around by putting in the effort and developing themselves. Thank you.
While doing my PhD some years ago (it wasn't a PhD in AI, but very much related) I trained several models with the usual stack back then (PyTorch and some others in TF). I realized that a lot of this stack could be rewritten in much simpler terms without sacrificing much fidelity and/or performance in the end.
Submissions like yours, and other projects like this one (recently featured here as well) -> https://github.com/ggerganov/whisper.cpp, make it pretty clear to me that this intuition is correct.
There are a couple of tools I created back then that could push things further in this direction. Unfortunately they're not mature enough to warrant a release, but the ideas behind them are worth a look (IMHO) and I'd be happy to share them. If there's interest on your side (or from anyone reading this thread) I'd love to talk more about it.
Your youtube playlist combined with NanoGPT and your Lex Fridman podcast is like having a university level degree with a free internship guidance. Thank you!
Just wanted to say thank you for all the incredible work and resources you publish. I've lost track of all the different skills I've learned from you, from computer vision, RNNs, minGPT, even speedcubing :D
+1. I've benefited greatly from your content, e.g. your CNN lecture was incredibly accessible [0]. I still find transformers stubbornly elude my intuitions despite reading many descriptions. I would very much appreciate your video lecture on this topic.
I've found all of your code and lessons on youtube so incredibly useful. You're a wonderful teacher and I really appreciate all the work you've done with this!
Thank you for your amazing work.
Between cs231n and your recent videos, I've learned a ton - and you have a gift for explaining things in such an easy and straightforward way that I always feel like an idiot (in a positive way) for not having grasped the concept before.
Bad ass! A great addition would be some content on tuning pre-trained language models for particular purposes. It would be great to have examples of things like tuning a GPT model trained on language and code to take in a context and spit out code in my custom API, or using my internal terminology. Not sure if this is RL based fine tuning or just a bunch of language to code examples in a fine tuning dataset? In essence, how can we start using language to control our software?
Ty, agree: practically speaking, most people will be interested in finetuning rather than from-scratch pretraining. I currently have some language about it in the readme, but I agree this should get more focus, docs, examples, etc.
Appreciate the work to make GPT training accessible!
Do you leave hyperparams (like learning rate, batch size) the same when switching from 8xA100 to fewer GPUs, or do these need to be adjusted?
Separately, when going from 8xA100 GPUs to a single A100 GPU, in the worst case we can expect the same model performance after training 8x as long, correct? (And likely a bit better, because we get more gradient updates in with the smaller batch size.)
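Not the author, but for what it's worth: the usual trick when dropping GPU count is to keep the effective (global) batch size the same via gradient accumulation and leave the learning rate alone; if you instead shrink the global batch, the learning rate typically needs to come down with it. A toy sketch of the accumulation pattern (my own illustration, not code from the repo; the linear model and random data are just placeholders):

```python
import torch
import torch.nn as nn

# Toy model and data purely to illustrate the accumulation pattern; not nanoGPT code.
model = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

accum_steps = 8  # 8 micro-batches on 1 GPU ~= one optimizer step of the 8-GPU global batch
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x = torch.randn(8, 64)                      # one micro-batch
    y = torch.randn(8, 64)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()             # scale so the summed grad matches one big batch
optimizer.step()
```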
Saying absolutely nothing new here, but your work is so damn inspiring! I wish I had such a natural connect to my work, an ability to distill complex concepts down to the fundamentals, and such inventiveness! I took your CS231N class at Stanford as well. Implementing the fundamental building blocks like backprop was fun and insightful. Thanks again for your passion and teaching!
Your tutorials are effective and concise. Thank you for them! Accessible, from-scratch knowledge on these topics is essential at this time in history and you're really making a dent in that problem.
Thanks, I love your video about back propagation where you painstakingly spell out every calculation. It was like a breath of fresh air compared to other materials out there.
I have taken several masters-level courses in Machine Learning -- and even with those credentials, I cannot recommend enough Andrej's youtube series, "Neural Networks: Zero to Hero". There, he teaches you, from scratch, how to build everything from the underlying automated gradient calculation system in pytorch, all the way up to the slower version of this model - `MinGPT`.
(edit: self-promo: I'm currently working on a Typescript follow-through of this same series of video lectures, if you want to follow along with stronger types for explanation: https://github.com/Marviel/lab-grad)
I can’t believe I just spent 2 and a half hours glued to my phone in bed watching this, for absolutely no reason other than it was such an interesting intro (to a subject I’m already familiar with). Thanks for the recommendation, and thanks Andrej for making this!
Fast.AI is great, but it takes the top down, vs the bottom up, approach. It takes you from a production-level black box that you don't understand, down to the details. The benefit there is you get good high-level intuition of how it behaves at the "let me use this technology for a job" level.
Separately, the fast.ai library is also highly recommended -- it comes with some state-of-the-art image recognition models, and its training wrappers are really helpful, particularly for training on image-recognition datasets.
Karpathy's "Neural Networks: Zero to Hero" video series starts at the level of individual neurons, and works you up to the final product. For some reason both this style, and Karpathy's conciseness appeal to me slightly more. I'm also super detail-oriented, though -- and any level of "hand waving" (even if further explanation comes later) always bothers me. He's also got some pretty high-profile industry experience which carries some weight with me.
But I'll say that both are really high quality. Ultimately, my recommendation would be to follow whichever one speaks most to you personally after the first 1hr or so.
EDIT: Per Jeremy's response below, if you want the bottom-up approach but like the fast.ai teaching style, you should check out "part 2" of the fast.ai set of tutorials, which is exactly that.
fast.ai has both - the "part 1" section is top-down, and the "part 2" section is bottom up. You can do part 2 without having done part 1. Part 2 starts with implementing matrix multiplication from scratch, then backprop from scratch, then SGD from scratch, etc.
There will be a new version of the part 2 course out in a few weeks. It even covers stuff like random number generation from scratch, convolutions from scratch, etc. It gradually works all the way up to Stable Diffusion.
@karpathy's and the fast.ai lessons work well together. They cover similar topics from different angles.
Jeremy @ Fast.ai says he takes this pedagogical approach because it's "proven" to be the best way to learn. He's probably right, but I do find it confusing at times because in the beginning you're just hitting ctrl + enter on an IPYNB haha.
Maybe Karpathy's approach will speak to me more--thanks for the recommendation!
This is really good, and I was really excited by it but then I read:
> running on a single 8XA100 40GB node in 38 hours of training
This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high end computer to be able to improve on this model. Before that happens this whole field is going to be owned by big corporations.
That's a great comparison. For a real number, I just checked Runpod and you can rent a system with 8xA100 for $17/hr or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting a 1xA5000 system and that only costs $0.44/hr, which is perfect for learning and experimentation.
The problem with that is that, currently, the available memory scales with the class of GPU... and very large language models need 160-320GB of VRAM. So there sadly isn't anything out there you can load a model this large onto except a rack of 8x+ A40s/A100s.
I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but with 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB, and still fit on a desk and draw normal power, would be great.
Technically this is not true - there are a lot of techniques to shard models and store activations between layers or even smaller subcomponents of the network. For example, you can split the 175B parameter BLOOM model into separate layers, load up a layer, read the previous layer's output from disk as its input, and save this layer's output to disk.
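To make that concrete, here's a toy sketch of layer-by-layer offloading (my own illustration, not the BLOOM tooling): each layer's weights live on disk as a separate file, only one layer is ever resident, and the running activation is carried forward (it could equally be written to disk between steps).

```python
import torch
import torch.nn as nn

# Stand-ins for transformer blocks; real shards would be full attention/MLP blocks.
layers = [nn.Linear(256, 256) for _ in range(12)]
for i, layer in enumerate(layers):
    torch.save(layer.state_dict(), f"layer_{i}.pt")      # pretend these shards live on disk

x = torch.randn(1, 256)                                  # the "previous layer's output"
for i in range(len(layers)):
    shard = nn.Linear(256, 256)
    shard.load_state_dict(torch.load(f"layer_{i}.pt"))   # load only this layer
    with torch.no_grad():
        x = shard(x)                                     # compute, then let the shard be freed
print(x.shape)
```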
And NVIDIA does make cards like you are asking for - the A100 is the fast-memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB, so this is less true now than in the P40 vs P100 generation).
Oddly, you can get close to what you are asking for with a M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.
Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?
Well if it used to cost you $1 for 1hr at 1x speed, now it will take you 10hr at 0.1x speed, and if my math checks out $1. You need to shrink the model.
But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)
Let's not forget that rendering 3D Animations in 3DSMAX or Maya used to take days for a single frame for a complex scene, and months for a few minutes.
Great news! Cloud instances energy usage is included in their price, and because they're remote and transient it's impossible to permanently damage them.
I think the equivalent of being not careful and getting a dent in this context is to leave it open to the internet and having a bitcoin miner installed.
As you are paying for the resources you use that's fine.
The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.
A better fit would be if you have unlimited liability, like with AWS, and you leak your key pair. Then someone runs up a $100k bill spinning up mining instances.
I think it was a DIY machine, those RTX 3090 have gotten cheaper for sure.
From my experience, going beyond 4 GPUs is a pricey affair. See [§]. All but one model of the RTX3090 require at least 3 slots.
If 4 GPUs connected via PCIe 4.0 x16 are enough, you can choose among various sTRX4 boards for 3000-series AMD Threadripper CPUs.
It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.
Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending
"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."
If you're doing something new/ custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and finetune settings.
(I assume. I've never worked with GPT, but have done similar work in other domains).
Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.
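For anyone who wants to try that: a minimal sketch using the Hugging Face transformers library (my suggestion, not part of nanoGPT) to run the publicly released 124M-parameter GPT-2 checkpoint locally on CPU or any consumer GPU:

```python
# pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # downloads the ~124M "gpt2" weights
print(generator("The meaning of life is", max_new_tokens=40)[0]["generated_text"])
```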
Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?
Depends on precision: you can run a ~5B model at fp32 or an ~11B model at fp16, max. Int8 is really bad for real-world use cases, so I'm not mentioning it.
But if you are looking to get the performance of ChatGPT or GPT-3 then don't waste your time; all small GPT-3-like LLMs (below at least 60B params) are useless for any real-world use case, they are just toys.
If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.
Fortunately very few real world use cases need to be this general.
If you are training an LLM on a domain-specific corpus, or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys": they can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.
> If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.
Yes, thanks, that's what I meant.
> If you are training an LLM on a domain-specific corpus, or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys": they can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.
> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)
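A quick back-of-envelope on why the vocabulary dominates at that size (my own arithmetic, using a rough 12*d^2 weights per transformer block and ignoring biases/LayerNorm):

```python
# Rough parameter counts for the "tiny" config in the quote above.
n_layer, n_head, n_embd, vocab = 4, 4, 64, 50257

embedding = vocab * n_embd            # token embedding table (the output head is the same size again if untied)
per_block = 12 * n_embd * n_embd      # ~attention (4*d^2) + MLP (8*d^2) weights per block
blocks = n_layer * per_block

print(f"embedding ~{embedding/1e6:.2f}M params")                 # ~3.22M
print(f"transformer blocks ~{blocks/1e3:.0f}K params")           # ~197K
print(f"char-level (vocab=65) embedding: {65 * n_embd} params")  # ~4K, the 'this would fly' case
```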
But how often do you need to run this? You can run 8xA100 on LambdaLabs [0] (no affiliation) for $8.80/hr. So you should be able to run through the entire data set for less than $350.
If you can't fit the model on your resources, you can leverage DeepSpeed's ZeRO-Offload, which will let you train GPT-2 on a single V100 (32GB).
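For reference, the ZeRO-Offload setup boils down to a small config, shown here as a Python dict (normally saved as a ds_config.json and handed to DeepSpeed). The keys below are written from memory, so treat them as illustrative and double-check against the DeepSpeed docs:

```python
# Approximate ZeRO-Offload config: shard optimizer state/gradients (stage 2)
# and park the optimizer state in host RAM instead of GPU memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```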
Alternatively, if you’re researching (with the caveat that you have to either publish, open source or share your results in a blog post) you can also get access to Google’s TPU Research Cloud, which gives you a few v3-8s for 30 days (you can’t do distributed training across devices, but you can run workloads in parallel). You can also ask nicely for a pod; I’ve been granted access to a v3-32 for 14 days pretty trivially, which (if optimized) has more throughput than 8xA100 on transformer models.
TPUs, and even more so pods, are a bit harder to work with, and TF performs far better than PyTorch on them.
I was curious about how much this would be to rent, because the cost of those servers is definitely outside the budget! Lambda has 8xA100 40GB for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing
It seems about as likely as people being able to build big-automaker-level cars with just the tools in their garage. More compute is going to keep producing better results, at least for LLMs.
Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.
I went to a smallish state university; even there we had our own HPC center and lab. We had a proper (IIRC) 6-row HPC data center across campus, and there was a continuous budget available to me as an undergraduate research assistant for building Beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 Raspberry Pis to build an ARM cluster.
That's to train it from scratch, though, right? If you preload the GPT2 weights you don't need to do this. You can just give it additional training on your texts.
Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.
Besides the rental options discussed below, these NVIDIA boxen don't look too big, so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.
Supposedly even running the trained model for ChatGPT is extremely expensive, unlike the image generators, which can largely be run on a consumer device.
So if I see it right that would be a p4d.24xlarge instance. Which goes for about $32.77 an hour nowadays so the total training would be about $1245. Not cheap, but certainly not a nation state budget.
Edit: I just noticed Lambda Labs. It seems they ask $8.80 per hour for an instance of this caliber. That puts the total training cost around $334. I wonder how it can be that much cheaper.
That is a key difference. You can’t easily and cheaply rent an auto factory, but you’re starting to be able to rent an LLM training "factory" once per model, after which you can run inference on that model much more cheaply.
Doesn't huggingface have dozens of freely available pretrained models like this (including various sized implementations of GPT2) and isn't the source available on most if you wanted to train them yourself?
All I see in the comments is praise for the author as a person, so just wondering what's unique about this that's not available elsewhere? 730 upvotes and counting, assuming I'm missing something...
True, but the use cases aren't the same. As he did before for other models, he has a knack for distilling the code down to beautiful, self-contained examples of high didactic value.
It's an order of magnitude easier to grok the basics from this repo than from going through (admittedly more ergonomic or performant or production-ready) huggingface repos.
Additionally, in terms of the streamlining nanoGPT purports to offer, HuggingFace's implementations play nice with optimization techniques such as ONNX/TensorRT, which will give you better performance than anything PyTorch-based, even something minimal.
That doesn't mean an ONNX-ed nanoGPT won't be better, but the field of optimized text generation isn't as new as people claim.
This is a didactic implementation. If you read the HuggingFace repo it is much more abstracted on account they implement many models in the same codebase. It's not fast or big, just easier to read and tweak.
minGPT prioritized being understandable above all else, and was not very fast. This repo includes several optimizations, but it is still much more understandable than probably any other open source implementation.
This is a dumb question about language models in general, not necessarily specific to NanoGPT: why is all the focus on training? Can I download and run a pre-trained model locally? Surely the specs required to run a model are much, much lower than those required to train the model?
It's the equivalent of building from source versus downloading a compiled binary.
Also you can perform "fine tuning" which means you start with a trained model and train it further on your own data, allowing you to customize the model for specific tasks.
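As a concrete (toy) illustration of that fine-tuning idea, here's a sketch using the Hugging Face transformers GPT-2 checkpoint rather than nanoGPT itself; the example texts, learning rate, and single-example "batches" are placeholders:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")        # start from pretrained weights, not random init
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

texts = ["your domain-specific documents go here", "one example per step in this toy loop"]
model.train()
for text in texts:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])    # causal LM loss on your own data
    optimizer.zero_grad(); out.loss.backward(); optimizer.step()
```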
If you're only using pre-trained models, it's going to be harder to differentiate yourself. Training / specialization of models is where the moat-building is (due to access to different data sets / better ideas). By specializing / training, more of the token limit can be used for generation rather than prompting / better prompts can be made.
The lower the cost of training, the more profitable any resultant business. You can even envision businesses that train the model regularly to bring in new knowledge. The cheaper this is, the more opportunities open up.
Are there any possible technological or scientific leaps on the horizon that would reduce training time by an order of magnitude or more? GPT-3 took 355 GPU-years to train on incredibly expensive hardware, which means small players have no chance to push the state of the art.
As models get bigger, fewer and fewer neurons are activated by any given input. If you can somehow predict which neurons will activate, you can skip the vast majority of the computational load. I have read a paper arguing that only 0.5% of the neurons are actually active in a 200-million-parameter model, so you could get up to a 200x improvement just from that.
What this tells you is that there is very little money in optimizing deep learning and that NVIDIA has made it very easy to just throw more hardware at the problem.
Oh - there are a lot of people working on optimizing AI. Amongst hobbyists, academia, and corporations alike.
The thing is, if you come up with a neat optimization that saves 30% of compute for the same results, typically instead of reducing your compute budget 30%, you instead increase your model/data size 30% and get better results.
This is hard a priori, but fairly easy post facto. Model distillation isn't a common practice yet, but it has already been demonstrated to be quite effective for specific use cases.
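For anyone unfamiliar with distillation, the core of it is training a small "student" to match a big "teacher"'s softened output distribution. A toy sketch of the standard (Hinton-style) objective; the two linear models here are trivial stand-ins for real networks:

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(32, 100)   # pretend: big pretrained model
student = torch.nn.Linear(32, 100)   # pretend: much smaller model
T = 2.0                              # temperature softens both distributions

x = torch.randn(16, 32)
with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

# KL divergence between softened student and teacher distributions, scaled by T^2.
loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean") * T * T
loss.backward()
```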
It is somewhere from 8x to 25x faster than doing dense machine learning. The speedup was higher on the original CPU implementation and the GPU paper mentions that if there isn't enough shared memory on the GPU it will have to switch to an algorithm that has more overhead.
Edit: There is a paper for sparse spiking gradient descent promising a 150x improvement. I am not sure how practical this is because spiking neural network hardware heavily limits your model size but here it is:
I wonder about this, too. OpenAI's biggest 'moat' is that their model takes so much resources to train, not that their algorithms are particularly secret.
One idea I had was to not use one single model to learn all steps of the task, but to break it up. The human brain has dedicated grammar-processing parts. It is unclear whether something like a universal grammar exists, but we have at least an innate sense of rhythm. Applied to NLP, you could heavily preprocess the input: tokenize it, annotate parts of speech. Maybe add pronunciation, so the model doesn't have to think about weird English spelling rules, and so you can deal with audio more easily later. So I would build all these little expert-knowledge black boxes and offer them as input to my network.
But there is also some inherent resource cost in large language models. If you want to store and process the knowledge of the world, it is going to be expensive no matter what. Maybe we could split the problem into two parts: Understanding language, and world knowledge (with some messy middle ground). I believe you could replace the world knowledge with a huge graph database or triple store. Not just subject-verb-object, but with attribution and certainty numbers for every fact. The idea would be to query the database at inference time. I don't know how to use this in conjunction with a transformer network like GPT-3, so you'd likely need a very different architecture.
The big benefit of this would be that it is feasible to train the language part without the world knowledge part with much less resources. But you have other benefits, too. ChatGPT is trained to "win the language game". But as they say, winning the argument does not make you right. If you have a clean fact database, you can have it weigh statements from trustworthy sources higher. You then basically have a nice natural language frontend to a logical reasoning system that can respond with facts (or better: conclusions).
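A toy sketch of the "query the database at inference time" idea (my own illustration; the naive word-overlap retrieval is just a stand-in for a proper embedding index or triple store):

```python
# Retrieve the most relevant stored facts and prepend them to the prompt
# before handing it to the generative model.
facts = [
    "The Eiffel Tower is 330 metres tall.",
    "Water boils at 100 degrees Celsius at sea level.",
    "GPT-2 has 1.5 billion parameters.",
]

def retrieve(query, k=1):
    q = set(query.lower().split())
    scored = sorted(facts, key=lambda f: -len(q & set(f.lower().split())))
    return scored[:k]

question = "How many parameters does GPT-2 have?"
context = " ".join(retrieve(question))
prompt = f"Facts: {context}\nQuestion: {question}\nAnswer:"
print(prompt)   # this augmented prompt is what the language model would actually see
```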
GPT and the human brain (at least the language/speech part) have nothing in common. We, as humans, do not use language in a generative way; it is derived from a higher or very low level of abstraction (intentions, emotions, etc.) and is explicitly used for communicating something. Even this text is based on previous knowledge, saved in an abstract way, and while writing this I must follow the syntax of the language and put words in the right order, otherwise you, the person who reads this, will not understand what I mean. While GPT can generate the same text, it does not have a motivation and has no need to communicate (whereas I just wanted to feel good by bringing some contribution to HN).
> and while writing this I must follow the syntax of the language and put words in the right order, otherwise
A good example that is not, word randomised order and kombination with Mrs Spelling and fonetic spel-ing prevent ye knot that which I wrote you to komprehend.
(My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).
A better point is that GPT-3's training set is more tokens than the number of times an average human synapse fires in a lifetime, squeezed into a network with about 3 orders of magnitude fewer parameters than the human brain has synapses.
It's wrong to model AI as anything like natural intelligence, but if someone insists, my go-to comparison (with an equivalent for image generators) is this: "Imagine someone made a rat immortal, then made it browse the web for 50,000 years. It's still a rat, despite being very well-trained."
> (My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).
At least for me it's perfectly understandable (except the "Mrs" part). This reminds me of those "did you know you can flip characters randomly and our brain can still understand the text" copypastas that can be found everywhere. I think it's probably quite similar for word order: As long as your sentence structure is not extremely complicated, you can probably get away with changing it any way you like. Just like nobody has issues understanding Yoda in Star Wars.
Although I think there are some limits to changing word order - I can imagine complicated legal documents might get impossible to decipher if you start randomizing word order.
These are conceptual "differences" that don't actually explain the mechanics of what's going on. For all you know "motivation", "intentions", etc. are also just GPT-like subsystems, in which case the underlying mechanics are not as different as you imply.
That's the hardware it runs on, not the software architecture of GPT. I could equally say that transistors are faster than synapses by the same ratio that marathon runners are faster than continental drift.
It seems to me that a lot of everyday communication is rather statistical in nature. We don’t necessarily think deeply about each word choice but instead fall back on well worn patterns and habits. We can be more deliberate about how we compose our sentences but most situations don’t call for it. It makes me wonder if we don’t all have a generative language model embedded in our brains that serves up the most likely next set of words based on our current internal state.
Here we go again. They must have something in common, because for about 90% of the tasks the language model agrees with humans, even on novel tasks.
> We, as humans, do not use language in a generative way
Oh, do you want to say we are only doing classification from a short list of classes and don't generate open ended language? Weird, I speak novel word combinations all the time.
No, what is meant is that the next word I speak/write after the current word is not based on a statistical model, but on a world model which includes a language structure based on a defined syntax and cultural variety. I actually mean what I say, while ChatGPT just parrots around weights and produces an output based purely on statistics. There is zero modeling that translates into the real world (what we normally call "understanding" and "experience").
Oh, I see. Then I agree with you, an isolated model can't do any world modelling on its own. No matter how large it is, the real world is more complex.
It might be connected to the world, of course. And it might even use toys such as simulators, code execution, math verification and fact checking to further ground itself. I was thinking about the second scenario.
The more experience I get, the more I wonder if this is really the case for us. We certainly have some kind of abstract model in our heads when thinking deeply about a problem. But in many settings - in a work meeting, or socially with friends - I think it is a much more automatic process. The satisfaction you get when saying the right thing, the dread when you say something stupid: It is just like playing a game. Maybe the old philosophical concept of society as merely "language games" is correct after all. A bit silly but I find the thought makes annoying meetings a bit more bearable.
But you are of course right with GPT, it has no inner life and only parrots. It completely lacks something like an inner state, an existence outside of the brief moment it is invoked, or anything like reflection. Reminds me of the novel "Blindsight" (which I actually haven't read yet, but heard good things about!) where there are beings that are intelligent, but not conscious.
Their biggest moat is high-quality data: both their proprietary datasets (WebText, WebText2, etc.), but also now their human-annotated data. Another, secondary moat is their expertise with training models using PPO (their RL method); they can get results that are quite a bit better than other labs. I say this moat is secondary because it's possible that you can get similar results with other RL algorithms (e.g. DeepMind using MPO), and because maybe you don't really need RL from Human Feedback and just fine-tuning on instructions is enough.
I find OpenAI having exclusive access to that kind of high-quality data more concerning than them having access to their current amount of compute and currently trained model. A couple of million dollars' worth of compute is within reach of any medium-sized research university, larger company, or any country worth mentioning. And seeing as Moore's law still applies to GPUs, the cost will only fall.
However high-quality data is scarce. I would be willing to fund a proper effort to create high-quality data.
It's not just about compute; if that were the case, then models like BLOOM and OPT, which also have 175 billion parameters, would have the same performance for real-world use cases as GPT-3, but they don't. Datasets are also very important.
An interesting outcome of the nanoGPT repo is this struggle to exactly match the Chinchilla findings[0], even after discussing it with the authors.
A larger discussion is that the scaling laws achieve loss-optimal compute time, but the pre-training loss only improves predictions on the corpus, which contains texts written by people that were wrong or whose prose was lacking. In a real system, what you want to optimize for is accuracy, composability, inventiveness.
I highly doubt this in practice on a large scale. Outside of the common phenomena of "most large NNs are undertrained" and "less, better data is sometimes better than more, worse data", there are no other obvious mechanisms to explain why a smaller model with the same or similar architecture would be better than a larger one.
I claim instead that we are still hardly scratching the surface with how we evaluate NLP systems. Also, some fields have straight-up trash evaluation schemes. Summarization and ROUGE scores are totally BS, and I find the claim that they even correlate with high-quality summaries suspect. I say this with publications in that subfield, so I have personal experience with just how crummy many summarizers are.
What do you mean by "small players have no chance"? OpenAI was founded in 2015; it used to be a "small player" which just got things right and grew with it - we're not talking of Google or Facebook investing a chunk of their billions cash. In Germany, AlephAlpha has built their own supercomputer and is training similar-sized models. It's expensive for sure, but well within the reach of startups. In France, researchers trained the similarly sized BLOOM model https://huggingface.co/bigscience/bloom. They claim it cost between $2 and $4 million.
Sure, a single researcher can't replicate this at their university, but even though OpenAI likes to publish it this way, we're not really talking about research here. Research was inventing the transformer architecture, this is just making it bigger by (very smart) engineering choices. It's something companies should do (and are doing), not researchers.
> we're not talking of Google or Facebook investing a chunk of their billions cash
OpenAI had raised $1B from Microsoft in 2019 and used it to train a 175B param model. Now, they have raised $10B and are training GPT-4 with 1.5T params. GPUs are capital intensive and as long as there are returns to bigger models, that's exactly where things will go.
It could actually work. It would be an incredibly gutsy move and I love it, and they'd probably earn a lot of respect. They’d get so much press for it. And if it held up, it’d probably be one of the things that MS is remembered for.
Small players should focus on applications of this tech.
We now know that whatever AI Models succeed in the future, they'll be trained by a huge company and finetuned to a specific use case. Small companies should be working on use cases, and then just upgrade to the latest SOTA model.
> Small players should focus on applications of this tech.
That sounds a bit condescending. We are probably at a point where the government should intervene and help establish a level playing field.
Otherwise we are going to see a deeper divide, with multibillion-dollar businesses conquering multiple markets, and a sort of neo-fiefdom situation.
This is not good.
It's not that condescending, that's today's reality. Should I feel entitled to $600k of training time that may or may not work? Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?
It's quite reasonable for small players to make use of models that have already been trained.
> Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?
Governments already routinely do that for pharmaceutical research or for nuclear (fusion) research. In fact, almost all major impact research and development was funded by the government, mostly the military. Lasers, microwaves, silicon, interconnected computers - all funded by the US tax payer, back in the golden times when you'd get laughed out of the room if you dared think about "small government". And the sums involved were ridiculously larger than the worth of a house. We're talking of billions of dollars.
Nowadays, R&D funding is way WAY more complex. Some things like AI or mRNA vaccines are mostly funded by private venture capital, some are funded by large philanthropic donors (e.g. Gates Foundation), some by the inconceivably enormous university endowments, a lot by in-house researchers at large corporations, and a select few by government grants.
The result of that complexity:
- professors have to spend an absurd percentage of their time "chasing grants" (anecdata, up to 40% [1]) instead of actually doing research
- because grants are time-restricted, it's rare to have tenure track any more
- because of the time restriction and low grant amounts, it's very hard for the support staff as well. In Germany and Austria, for example, extremely low-paid "chain contracts" are common - one contract after another, usually for a year, but sometimes as short as half a year. It's virtually impossible to have a social life if you have to uproot it for every contract, because you have to take contracts wherever they are, and forget about starting a family because it's just so damn insecure. The only ones that can make it usually come from highly privileged environments: rich parents or, rarely, partners that can support you.
Everyone in academia outside of tenured professors struggles with surviving, and the system ruthlessly grinds people to their bones. It's a disgrace.
Pharmaceutical or nuclear research doesn't really classify as "small scale" as this thread started. I know there are massive amounts of money handed out by governments to fund research, but for a 3-guy startup in a garage that's probably hopeless. Public money is cursed anyways, it's better not to touch it.
I've also read it at many places, that academic research funding is way too misaligned. It's a shame, really.
I'm not being condescending at all, we've learned that the value in AI is in the applications. If you think government should regulate the field, it should be to make AI Models a commodity, like electricity.
> In our experiments on the Pile, a standard language modeling benchmark, a 7.5 billion parameter RETRO model outperforms the 175 billion parameter Jurassic-1 on 10 out of 16 datasets and outperforms the 280B Gopher on 9 out of 16 datasets.
The research is still ongoing, although perhaps lower-profile than what appears in the press.
RETRO did get press, but it was not the first retrieval model, and in fact was not SOTA when it got published; FiD was, which later evolved into Atlas[0], published a few months ago.
How long does it take to train a human? It's useless for two years then maybe it can tell you it needs to poop.
The breakthrough will be developing this equivalent in an accessible manner and us taking care to train the thing for a couple of decades but then it becomes our friend.
Neither does OpenAI. It costs so much and still delivers so little. A human can generate breakthroughs in science and tech that can be used to reduce carbon emissions. ChatGPT can do no such thing.
What percentage of humans make meaningful contributions to advancing science or technology? The overwhelming majority of us are just worker bees servicing the needs of the human population.
I agree with you on this point. It’s also arguable that fewer people with a better education system could yield the same result with less environmental impact.
But my point, poorly explained, is that whatever ChatGPT is, it isn’t original or creative thought as a human would do it.
Chomsky’s example (which is based off Turing): Do submarines swim? Yes, they swim — if that’s what you mean by swimming.
We don't have any clear definitions for "creativity" to begin with. In practice, in these contexts, it seems to be defined as "whatever only humans can do" - that is, the goalposts are automatically moved with every AI advancement.
How could they be moved when they aren't even defined in the first place? Scientists don't even know where to begin when it comes to studying the mind and human consciousness.
But yes, scientists can look at your experiments and show that they don't have anything in common with human thought.
> What percentage of humans make meaningful contributions to advancing science or technology?
I’m a nobody that you’ve never heard of and I’ve arguably made meaningful contributions. If that’s true, don’t you think there could be way more people out there than you or sibling commenter imply?
You can't know that. Currently, 8 billion humans generate a few scientific breakthroughs per year. You'd have to run several billion ChatGPTs for a year with zero breakthroughs to have any confidence in such a claim.
With billions of GPT output streams, how do you actually discover and rank what’s significant? Screen them through some even more powerful models? I imagine it’s like a volcano eruption of text where some are absolutely brilliant and most is worthless and finding the jewels is even more demanding than generating it all.
Some theories are easily testable. For instance, ask it to write some code to efficiently solve traveling salesman problems, and then test the code on some sample problems. You can score the quality of solutions and time taken, and manually inspect the best ones.
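A minimal version of such a scoring harness might look like this (my own sketch; the nearest-neighbour heuristic is just a stand-in for whatever code the model produces):

```python
import random, math

# Score a candidate TSP heuristic by total tour length on random instances,
# compared against a random tour as a sanity baseline.
def tour_length(order, pts):
    return sum(math.dist(pts[order[i]], pts[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def nearest_neighbour(pts):                     # stand-in for model-written solver code
    unvisited, order = set(range(1, len(pts))), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(pts[order[-1]], pts[j]))
        order.append(nxt); unvisited.remove(nxt)
    return order

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(50)]
print("heuristic tour length:", round(tour_length(nearest_neighbour(pts), pts), 3))
print("random tour length:   ", round(tour_length(random.sample(range(50), 50), pts), 3))
```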
At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would. It can’t consume hundreds of math textbooks and learn the principles of math and then apply them more broadly to science textbooks and research papers. It can’t even reliably add two numbers.
Yes, brute forcing with hard AI can produce many thoughts. But the AI wouldn’t know they are correct. It couldn’t explain why. Any discovery would only be attributable to randomness. It wouldn’t be learning from itself and its priors.
> At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would.
Actually there are many indications that GPT understands the data, because its output mostly makes sense. The reason it can't assign meaning the way a human would is because a human can correlate words with other sensory data that GPT doesn't have access to. That's where GPT creates nonsense.
Think carefully about what "understanding" means in a mechanistic sense. It's a form of compression, and a few billion parameters encoding the contents of a large part of the internet seems like pretty good compression to me.
GPT doesn't display understanding of purely abstract systems, so I doubt it's an issue of lacking sensory information. It can't consistently do arithmetic, for example - and I think it would be presumptuous to insist that sensory information is a prerequisite for mathematics, even though that's how humans arrived at it.
It's not yet clear why it struggles with arithmetic. It could be data-related, could be model-related, although scaling both seems to improve the situation.
In any case, GPT could still understand non-abstract things just fine. People with low IQ also struggle with abstract reasoning, and IQ tests place GPT-3 at around 83.
I still think that this will be a major form of AI that is accessible to the public at large and it will enable productivity improvements at all levels.
I'm not joking, this is really something I think will/should happen.
Alternatively, are there ways to train on consumer graphics cards, similar to SETI@Home or Folding@Home? I would personally be happy to donate gpu time, as I imagine many others would as well.
There absolutely are! Check out hivemind (https://github.com/learning-at-home/hivemind), a general library for deep learning over the Internet, or Petals (https://petals.ml/), a system that leverages Hivemind and allows you to run BLOOM-176B (or other large language models) that is distributed over many volunteer PCs. You can join it and host some layers of the model by running literally one command on a Linux machine with Docker and a recent enough GPU.
Disclaimer: I work on these projects, both are based on our research over the past three years
Or how to apply communism to software engineering. I like that.
More seriously, the risk that a few companies become even more powerful thanks to their exclusive access to such NNs is very frightening. The worst part is, without legal restrictions, there is nothing we can do against it. And I doubt that legal restrictions will come in the next months / years.
Well at that point, some people might have the crazy crazy insight that no matter how big the model is, or how many GPUs they have, it burns up all the same.
"We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run."
That's still including margins of cloud vendors. OpenAI had Microsoft providing resources which could do that at much lower cost. It still won't be cheap but you'll be way below $5m if you buy hardware yourself, given that you're able to utilize it long enough. Especially if you set it up in a region with low electricity prices, latency doesn't matter anyway.
I think AI is going to go the way of the hard sciences where the age of tinkerers making progress by leaps and bounds in their basement is over and incremental progress is going to be the domain of universities or large private companies that can afford to throw money behind it. I would love to be proven wrong and see radical shifts in how people approach these problems. Seems like the cycle started and got to this point way too soon for AI though
My take on this is that (good) content is one of the bigger problems still, particularly also who exactly the original training data belongs to (or where it comes from). There's a certain risk (we'll see with Github CoPilot soon) it will slow down for a bit until the licensing issues are all sorted out. This can only be solved (for now) by bringing in public funding/data, which universities have always been a very good proxy for. Which also means it (usually) should be open access to the public, to some extent (and useful for the garage folks to catch up a bit). But, once we're past that, it'll be all about that giant body of pre-trained data, securely kept within the next Facebook or Microsoft, amounting to literal data gold (just much higher value at a lot less weight).
> Are there any possible technological or scientific leaps on the horizon
Yes. From 2017: "Prediction 4: The simplest 2D text encodings for neural networks will be TLs. High level TLs will be found to translate machine written programs into understandable trees."
We have something coming out that is an OOM better than anything else out there right now.
Small players will never have a chance to push the state of the art, as whatever optimization there is will also be applied at large scale with more money.
Take a leaf from Seti@Home‘s book and try to come up with a distributed, volunteer-based approach to training an open source LLM. There is already an enormous amount of suitable ML hardware on end user devices.
Good point, but perhaps a leap could take small players into territories of language models that are large enough to be useful. GPT-3 crossed that threshold
> Could this be distributed? Put all those mining GPUs to work.
Nope. It's a strictly O(n) process. If it weren't for the foresight of George Patrick Turnbull in 1668, we would not be anywhere close to these amazing results today.
In theory, yes. "Hogwild!" is an approach to asynchronous distributed training: in essence, each worker is given a chunk of data, computes the gradient, and sends it to a central authority. The authority accumulates the gradients and periodically pushes new weights.
There is also Federated Learning which seemed to start taking off, but then interest rapidly declined.
Wikimedia and other organizations that deal with moderation might want to keep this technology out of the hands of the general public for as long as possible.
There are a couple of cases where small changes in the model make training much quicker. For example, the currently leading Go AI, KataGo, requires much less time to train than AlphaGo did.
Yes. There are plenty of forward leaps; most of them are not new and are just waiting to be integrated or released:
Let's pave the road for a SkyNet hard lift-off:
- The first obvious one is the use of an external knowledge store: instead of having to store facts in the neural weights, where they struggle, just store them in a database and teach your neural network to use it. (This is also similar to something like WebGPT, where you allow your network to search the web.) This lets you have a network of 1G parameters (plus external indexes of a few TB) that is as performant as a network of 100G parameters, with better scaling properties too. You can probably gain at least 2 orders of magnitude there.
- The second leap is better neural network architectures: approximating transformers, which are quadratic in compute, with something linear (Linformer) or n log n (Reformer) can make you an order of magnitude faster simply by reducing your iteration time (a rough back-of-envelope comparison is sketched after this list). Similarly, architectures based on sparsity can give you faster computation (although some of the gains are eaten by the lower efficiency of sparse memory access patterns). Or use (analog bits) diffusion to generatively pretrain a sentence at a time instead of token by token. You can probably gain between 1 and 3 orders of magnitude here if you write and optimize everything manually (or have your advanced network/compiler optimize your code for you).
- The third leap is reduced domain: you don't have a single network that you train on everything. Training one network per domain allows you to have a smaller network that computes faster. It also allows you to focus your training on what matters to you: for example, if you want a mathematics network, its parameters are not influenced much by showing it football pictures.
There are at least 2 orders of magnitude there.
- The fourth one is external tool usage. It's related to the first one, but whereas the first one is readily differentiable, this one necessitates some reinforcement learning (that's what decision transformers are used for).
- Compression: compress everywhere. The bottlenecks are memory-bandwidth related; work in compressed form when relevant. One order of magnitude.
- Distributed training: the memory bandwidth inside a GPU is on the order of TB/s, whereas transfer to the GPU is on the order of 10GB/s. There is an advantage to having the parameters reside on the GPU, but GPU memory is limited, so distributed training (something like petals.ml) lets you increase your aggregate memory bandwidth by collaborating. Each actor can probably gain an order of magnitude, provided they can keep bad actors away.
- Use free resources: the other day Steam had 10M users with GPUs waiting around doing nothing; just release a Dwarf Fortress mod with prettier pictures and use the compute for more important tasks.
- Remove any humans in the loop: it's faster to iterate when you don't have to rely on any human, either for dataset construction or model building.
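The back-of-envelope promised in the second point above, comparing full quadratic attention against an n log n approximation at various context lengths (my own arithmetic; it only counts attention-score work, ignoring constants, the MLP, and memory traffic):

```python
import math

d = 768                                    # embedding width, GPT-2 small
for n in (1024, 8192, 65536):              # context lengths
    quadratic = n * n * d                  # ~full attention score cost
    nlogn = n * math.log2(n) * d           # ~Reformer-style LSH attention cost
    print(f"n={n:6d}  full/nlogn ratio ~ {quadratic / nlogn:6.1f}x")
```

The gap only really opens up at long contexts, which is exactly where these approximations are pitched.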
For casual readers like me: are there examples of what this can do once trained? E.g. it mentions training on Shakespeare, but gives no examples of fake Shakespearean.
Does anyone know the main differences between GPT-2 and GPT-3? Are there significant architectural changes, or is the advancement primarily from training?
Thanks. Sounds like they 10x'ed the number of parameters, which made some "magic leap" that isn't yet well understood, and fed it more data to train it on more specialized domains.
Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and nanoGPT author is trying to reproduce that here:
I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation in the paper is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on the broad themes of the paper (e.g. scale model and data approx. in tandem) and just roughly interpolating the main Figure, without sweating the details. The corresponding authors responded very kindly at first and I was able to bring the results closer but now they went dark. Still hoping to make things match, if others in LLM space can spot any issues in my own reproduction please let me know.
Oh, that's really interesting, and makes sense intuitively. From the abstract:
> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
Assuming the GPT-3 authors know this, one could surmise they 10x'ed the number of training tokens also.
Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
I could not find any samples (prompt and results). Can anyone provide samples of its quality, even if it is in a narrow field of knowledge or specific use case?
I tried GPT2, GPT-J 6B and GPT-NeoX 20B (implementation by Fabrice Bellard at textsynth.com/playground.html) but I could not find any production-quality scenario yet, only cherry-picked simple cases.
> The code itself is plain and readable: train.py is a ~300-line boilerplate training loop and model.py a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.
Thank you Andrej Karpathy for the work on AI and GPT models. It really helped me solve a problem as an entrepreneur. I started making my first few grand from AI.
As the creator of aitextgen, I'm mixed on continuing support since there doesn't seem to be as much demand as expected for small GPT models given the success and cost-effectiveness of GPT-3/ChatGPT, unfortunately.
I still have a few ideas there (including another secret approach at better text generation) but it's hard to determine ROI.
I think what you have created still has great demand. It gives devs who do not have the budget or need for the gigantic models something to train and use for their own specific language tasks.
Not everyone is trying to replicate ChatGPT results for certain tasks.
He's done it because he evidently loves it, and wants to share his hard-earned knowledge with the rest of the world.
He may be a product of the ivory tower, but he's been in the trenches. He knows firsthand how f-ing hard it is to ship a product.
And here he is, sharing useful personal code with everyone.
This github repo now has collected ~4K stars and its predecessor (minGPT) has collected ~11K stars over the past couple of years. In my experience, the number of people who clone, copy, view or otherwise use code from a repo is one to two orders of magnitude larger than the number of people who star it, so we can safely say that Andrej has helped at least a few hundred thousand -- but likely more than a million -- individuals around the world learn how to build and tinker with GPT models.
Remarkably, as I write this, no one else here has said thank you yet, so let me say it on everyone's behalf:
THANK YOU ANDREJ.
--
EDITS: I changed the wording in response to latexr's comments below.
Edit: the OP has updated their wording to make it clear they meant any kind of viewing or usage. I don’t think any of us would disagree more people use code than star repos. Original comment left below with original quote, since this has gotten a number of replies that would stop making sense with a larger edit.
> Normally, the number of people who clone or copy code from a repo is one to two orders of magnitude larger than the number of people who take the time to star it
Intuitively, I’m having trouble believing that. Starring takes considerably less effort than cloning or copying code. The “time to star” is a literal second, maybe two if you have to scroll up.
From anecdotal observation, repos with more forks and/or external contributors than stars are far from the norm. I’ve seen many mentioning they star repos as a way of bookmarking they seldom go back to, or as an easy way to send kudos to the developer even when they don’t use the project.
In no way is this a comment on the value of Andrej’s work (I’m not familiar with it). I am only interested in the source of your “orders of magnitude” claim, which if proven will update my mental model of the coding community.
I checked my 5-year-old repository with ~300 stars. It gets ~100 unique clones a month. So even if the long-run average were half of that, one order of magnitude would be quite an accurate approximation.
I think the biggest difference with a clone and a star is that a star requires an account and some vested interest in the social network of Github. Anyone who is not interested in the social aspect can just bookmark it.
I guess this differs quite a lot by target demographic. A tool for GPT will probably get a lot more stars than a plugin for some consumer software simply because it is more targeted for the audience of people who have Github accounts.
Thank you for sharing your anecdata. In my experience, the number of clones per month is much higher at first and then decays gradually until it settles into a stable run-rate, so you've likely had more than 100 x 12 x 5 = 6,000 clones over those five years -- i.e., between one and two orders of magnitude more than your 300 stars.
If I want to use a repository, my first step is to either download a released binary or clone the repository. Forking is much further down the line for me, when I've used the code, encountered a problem, fixed it, and decided to polish the fix up to make a PR. I star something when I either have used it and like it, or when I think I want to use it in the future and want to bookmark it (though the former more often than the latter). I have given out about 50% more stars than I've forked, and have probably cloned an order of magnitude more than I've forked or starred.
Of course not everyone is the same, but I'd be surprised if overall clones were less than an order of magnitude more than forks or stars, and find two or even three orders of magnitude believable depending on the target group of the repo.
Exactly. I would add that the number of clones (not forks) and file/page views is visible only to the owner of the repo, so we can only guess. (If you own a GitHub repo, you can see the recent number of clones and page views under Insights -> Traffic.)
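For repo owners who prefer the API over the web UI, something like this should work (a rough sketch using GitHub's documented traffic endpoints; OWNER, REPO and the token are placeholders, and the token needs push access to the repo):

  import requests

  OWNER, REPO = "your-user", "your-repo"   # placeholders
  TOKEN = "ghp_..."                        # personal access token (placeholder)
  headers = {"Authorization": f"token {TOKEN}",
             "Accept": "application/vnd.github+json"}

  # GitHub only retains ~14 days of traffic data, so these are short-window totals.
  for metric in ("clones", "views"):
      url = f"https://api.github.com/repos/{OWNER}/{REPO}/traffic/{metric}"
      data = requests.get(url, headers=headers).json()
      print(metric, "total:", data.get("count"), "unique:", data.get("uniques"))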
My estimate of "one to two orders of magnitude" is based on anecdotal evidence. I edited my comment to reflect as much.
I've starred maybe 2-3 repositories over the past 15 years, contributed to probably half a dozen, and used hundreds (if not more) in my applications. To me, "using" means using that project in an application you develop. Typically I get them from NPM or NuGet, and I contribute when a) the project owner thinks my feature idea is a good one or b) I run into a bug that I can fix.
Starring is just not that useful to me, so I can see why user or contributor counts would be much higher. I typically star repos only if it's an unpopular or old repository that doesn't have NPM or NuGet packages.
How many projects have you starred and how many have you cloned?
Whilst starring is simpler, the incentive is much lower than that of cloning. Especially for projects you just want to use and not contribute to or follow.
In my many years of work, I have starred fewer than 50 repos. I am sure I have cloned more than a thousand.
> How many projects have you starred and how many have you cloned?
I seldom star, but neither you nor I can be extrapolated to the general community. I have thousands of stars in some repos, and I know a significant number of those users don’t even code, let alone clone repos or copy code, they’re interested in the final product. They have GitHub accounts because it’s the way to report bugs or make feature requests.
The OP made a claim. All I’m looking to understand is if it has data backing it up or it’s just a gut feeling, because if it’s the former I’ll have learned something and made a correction of my mental model of the world. Sharing more anecdotes will leave us stuck in the same situation.
That just tells us that more people use the code than star the repo. I don’t think that’d be a surprise to anyone. The claim was that more people clone and copy code from the repo than the ones who star it, which is a different matter from the number of users.
Thank you for clarifying. I meant use. The number of clones and the number of file/page views are proxies for that. So is the number of installs via pip, conda, and other Python package management systems, in this case. I updated my comment to reflect as much.
Him doing this is not like when your average bloke does it.
He appears to be building a business and maintaining his profile. And there is nothing wrong with that - I admire him for pursuing his career in this positive and helpful way.
But random folks do this sort of thing everyday with no such career goals and little recognition, so I'm not sure it is this specific contribution that needs to be called out.
A million people building GPT models means that one in 8000 humans on earth has built one. That seems wildly off.
LinkedIn has about 100,000 profiles of data scientists. Assume generously that the actual number is 10x higher. Even without correcting for the fact that a data scientist isn't always a machine learning expert, etc. etc., there's just no way every single one of them even KNOWS what a GPT-like model is.
Not only building. Also tinkering, using, testing out of curiosity, etc. There are around ~30 million software developers worldwide today (source: googled it). Around ~7 million of them are registered users of Github (source: googled it). 1M+ seems likely to me.
BTW, I appreciate that you preceded your comment with "Pedantry time!" -- nice gesture :-)
A100s are Nvidia GPUs. You can rent them from providers like AWS or Lambda Labs. The readme has instructions for downloading the original GPT-2 weights from OpenAI. You can also train a very simple version on a smaller dataset from your laptop, as described in the README.
Unlike OpenWebText, this will run in seconds. Finetuning takes very little time, e.g. just a few minutes on a single GPU. Run an example finetuning like this:
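(Assuming the example config paths in the repo are unchanged -- check the README for the exact commands:)

  python data/shakespeare/prepare.py
  python train.py config/finetune_shakespeare.py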
Somewhat off topic: does anyone know how Bing might integrate ChatGPT into search? Is it to understand the prompt and filter results, taking the question and summarizing it to search the index? Is it to summarize all the documents into an index and search that? Or is it just to work like ChatGPT does now and generate new results from its knowledge base? I'm trying to connect the dots between a generative model like these and how it would influence search in the future. Or is Lucene-style index search on its way out in a generative world?
> reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training
For comparison, GPT-3 has more than 1000x more params (175B), and training took around 2 months on ~1,500 V100 GPUs, which costs millions of dollars in cloud compute. Gopher, with 280B params, was trained on 4,096 TPU-v3 chips, and Microsoft's Megatron-Turing NLG 530B was trained on 2,240 NVIDIA A100 cards (each card costs ~$15k).
And the most mind-blowing is PaLM from Google, with 540B params, trained on 6,144 TPU v4 chips, which costs around $10-30M in cloud compute.
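For a sense of scale on the nanoGPT side, a back-of-envelope for the quoted 38-hour run; the hourly rate below is my assumption for an 8xA100 40GB node, not a quoted price, so plug in whatever your provider actually charges:

  # Rough cost of the GPT-2 (124M) reproduction quoted above.
  hours = 38                 # training time from the quoted README figure
  hourly_rate_usd = 33.0     # assumed on-demand rate for an 8xA100 40GB node -- check current pricing
  print(f"~${hours * hourly_rate_usd:,.0f}")   # roughly $1,254 at that assumed rate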
If I trained this on a 30,000 word document could it give me a summary? Or would there be no need to train it in that case, and I could just tell it "Summarise this: <insert 30,000 word document>"?
30,000 words wouldn't be enough to train this from scratch - you'd ideally train from hundreds of millions of words at least.
30,000 words would be enough to finetune an existing model. If you did that, then the model would output text similar to the finetuning data. For example, if you finetuned it on Shakespeare, then you might be able to use the model to make a new play in Shakespeare's style.
It still has the knowledge from the main training on data from across the whole internet, so it would still know the word Shakespeare...
But you're right - the model finetuned on Shakespeare would be good at writing a new play in the style of Shakespeare, but would be bad at giving a critique of Shakespeare's works.
The context window (block size) of this model is 1024 tokens. Tokens roughly map to words (or pieces of words). So you can't ask it to summarize anything longer than about 1024 tokens in a single pass.
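If you want to check how a document measures up against that limit, here's a small sketch; it assumes the tiktoken package (the same GPT-2 BPE tokenizer nanoGPT uses), and the filename is a placeholder:

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")
  with open("my_document.txt") as f:       # placeholder path
      n_tokens = len(enc.encode(f.read()))
  # Anything well over the 1024-token block size won't fit in one context window.
  print(n_tokens, "tokens; block size is 1024")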
You can also use "Please suggest a section title for the following text".
Then that title can be used in the second round, for example with a query of the form "The following is an extract from the Introduction section of a document about the benefits and disadvantages of nuclear power in Sweden:"
I imagine you could do even better by finetuning the neural net on the document before asking for the recursive summary. Then it has all the information to work with, albeit in a compressed form.
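Here's a minimal sketch of that chunk-then-summarize idea, assuming you have some generate(prompt) sampling function (an API call, or a wrapper around nanoGPT's sampling script -- it's a stand-in, not a function from the repo), and keeping in mind that a small base GPT-2 won't follow instructions nearly as well as an instruction-tuned model:

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")
  CHUNK_TOKENS = 700   # leave room in the 1024-token window for the prompt and the output

  def chunks(text, size=CHUNK_TOKENS):
      toks = enc.encode(text)
      for i in range(0, len(toks), size):
          yield enc.decode(toks[i:i + size])

  def recursive_summary(text, generate):
      # Summarize each chunk, then summarize the concatenation of the summaries,
      # recursing until everything fits in a single context window.
      partials = [generate(f"Summarize the following text:\n\n{c}\n\nSummary:")
                  for c in chunks(text)]
      combined = "\n".join(partials)
      if len(enc.encode(combined)) <= CHUNK_TOKENS:
          return generate(f"Summarize the following text:\n\n{combined}\n\nSummary:")
      return recursive_summary(combined, generate)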
So, are there any of these projects that aren't vendor-locked to NVIDIA and are able to train large models with limited GPU RAM?
I don't mind letting my machine churn for 2-3 weeks. But I'm not looking to buy another $1,000 GPU just because CUDA is the only compute library researchers understand.
Wow, this is great. I can't wait for the video lecture, transformers are an aspect of modern machine learning that I'm not completely clear on. Andrej's lectures are brilliant - super detailed, and really answer the detailed questions I always have. Great stuff!
I imagine this might be interesting for domain-specific GPT models. Say training it on a mountain of technical documentation, or on every fanfiction published on the internet, or a sentiment analysis dataset. Of course fine-tuning GPT3 would give better results, but nanoGPT might allow you to make a much smaller model that's still good enough, to enable cheaper inference.
There's also the opportunity to play around with all the parameters fairly cheaply to find improvements. The todo section of the readme gives a small taste of that. Making bigger models works for OpenAI, but maybe the rest of us can manage to make small models simply perform better instead.
Assuming you get the 12GB version of the 3080. A 2080TI is another option. Though you can reduce precision or use one of the smaller GPT2 versions to run on smaller cards as well.
Though their roadmap doc says they're looking into finetuning existing GPT-J/T5 models for this task. So you'll probably want a 3090 (24GB VRAM) and at least 16GB of CPU RAM to run inference if/when the project is complete.
For an AI noob like me: can you use spot instances to train models? They are about 1/3rd the price on AWS compared to on demand ones, so it'd make a significant difference.
Yes, you should use them. They can be taken away from you with 2 minutes' notice. (It doesn't happen a lot in practice, though. I have been running a different instance for over a month. AWS doesn't force you off if they don't have to.)
If you are going to run a long training job, make sure you are creating checkpoints. Use persistent storage (EBS) and check the option so it doesn't get deleted when the instance is stopped, so your checkpoints remain on disk and you can easily restart.
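A minimal sketch of that checkpoint-and-resume pattern in generic PyTorch (not nanoGPT's own checkpoint code -- though if I recall correctly, train.py already writes a ckpt.pt to out_dir that you can resume from, so for that repo you mostly just need out_dir on persistent storage); the EBS mount path is a placeholder:

  import os
  import torch

  CKPT_PATH = "/mnt/ebs/ckpt.pt"   # placeholder: a path on the persistent EBS volume

  def save_ckpt(model, optimizer, step):
      # Call every N steps; losing the spot instance then costs at most N steps of work.
      torch.save({"model": model.state_dict(),
                  "optimizer": optimizer.state_dict(),
                  "step": step}, CKPT_PATH)

  def maybe_resume(model, optimizer):
      # On (re)start, pick up where the last checkpoint left off.
      if os.path.exists(CKPT_PATH):
          ckpt = torch.load(CKPT_PATH, map_location="cpu")
          model.load_state_dict(ckpt["model"])
          optimizer.load_state_dict(ckpt["optimizer"])
          return ckpt["step"]
      return 0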
Yes you can. In Oregon you could eventually get this instance at $9. I say eventually, because Spot allocation is of course not guaranteed. (And neither is On Demand... but that is a story for another day.)
As someone who's been in software for almost 25 years now, I read through this in amazement of how much new stuff still keeps coming in. This industry never stops and that makes it such a fascinating (but arguably harsh) world to be in.
Looking at this feels like seeing the source code of a 64k demo, learning about Mode 13h and trying to replicate it in Turbo Pascal.
And, much like the old days of graphics programming, there's a good chance all of this knowledge will be mostly irrelevant soon, as the abstraction layers tend to come quicker and quicker and take care of the hard foundational work underneath. Then it'll be many of us here discussing whether or not it was good to have been with it from the start, to really get it, or whether playing with the highly-abstracted components is all that's needed to succeed with it.
Either way, super cool to see the pace here and I loved the "I only have a macbook" section.
HN lets reposts through if the story hasn't had significant attention yet. This is to give good submissions multiple chances at getting attention. Past explanations:
I know it sucks when your submission was earlier and gets overlooked! We should eventually have some sort of karma-sharing to take care of this. In the meantime, it at least evens out in the long run, since the reason is randomness.