NanoGPT (github.com/karpathy)
1532 points by trekhleb on Jan 11, 2023 | 320 comments



Wow, fun to find this trending on HN this morning! I am currently also working on the associated video lecture (as the next episode of my video lecture series here https://karpathy.ai/zero-to-hero.html ), where I will build nanoGPT from scratch and aspire to spell everything out, as with the earlier videos. Hoping to get it out in ~2 weeks or so.


Openly accessible lectures and knowledge like yours have allowed many people, me included, to turn their lives around by putting in the effort and developing themselves. Thank you.


While doing my PhD some years ago (it wasn't a PhD on AI, but very much related) I trained several models with the usual stack back then (pytorch and some others in TF). I realized that a lot of this stack could be rewritten in much simpler terms without sacrificing much fidelity and/or performance in the end.

Submissions like yours, and other projects like this one (recently featured here as well) -> https://github.com/ggerganov/whisper.cpp, make it pretty clear to me that this intuition is correct.

There are a couple of tools I created back then that could push things further in this direction. Unfortunately they're not mature enough to warrant a release, but the ideas they embody are worth a look (IMHO) and I'd be happy to share them. If there's interest on your side (or from anyone else reading this thread), I'd love to talk more about it.


Your YouTube playlist combined with NanoGPT and your Lex Fridman podcast is like having a university-level degree with free internship guidance. Thank you!


Just wanted to say thank you for all the incredible work and resources you publish. I've lost track of all the different skills I've learned from you, from computer vision, RNNs, minGPT, even speedcubing :D


+1. I've benefited greatly from your content, e.g. your CNN lecture was incredibly accessible [0]. I still find transformers stubbornly elude my intuitions despite reading many descriptions. I would very much appreciate your video lecture on this topic.

[0] I think https://www.youtube.com/watch?v=LxfUGhug-iQ


I've found all of your code and lessons on youtube so incredibly useful. You're a wonderful teacher and I really appreciate all the work you've done with this!


Andrej: thank you!

--

To the mod (dang): IMHO Andrej's comment should probably be at the top of the page, not my comment. UPDATE: Looks like that's done. Thank you :-)


Thank you for your amazing work. Between cs231n and your recent videos, I've learned a ton - and you have a gift to explain things in such an easy and straightforward way, that I'm always feeling like an idiot (in a positive way) for not having grasped the concept before.


Badass! A great addition would be some content on tuning pre-trained language models for particular purposes. It would be great to have examples of things like tuning a GPT model trained on language and code to take in a context and spit out code in my custom API, or using my internal terminology. I'm not sure if this is RL-based fine-tuning or just a bunch of language-to-code examples in a fine-tuning dataset? In essence, how can we start using language to control our software?


Ty, agreed - practically speaking, most people will be interested in finetuning rather than from-scratch pretraining. I currently have some language about it in the readme, but I agree this should get more focus, docs, examples, etc.


Appreciate the work to make GPT training accessible!

Do you leave hyperparams (like learning rate, batch size) the same when switching from 8xA100 to fewer GPUs, or do these need to be adjusted?

Separately, when going from 8xA100 GPUs to a single A100 GPU, in the worst case we can expect the same model performance after training 8x as long, correct? (And likely a bit better, because we get more gradient updates in with the smaller batch size.)


Thank you for sharing your knowledge. Anything that can be done to democratize machine learning is an invaluable social service. Hats off to you.


Saying absolutely nothing new here, but your work is so damn inspiring! I wish I had such a natural connection to my work, an ability to distill complex concepts down to the fundamentals, and such inventiveness! I took your CS231N class at Stanford as well. Implementing the fundamental building blocks like backprop was fun and insightful. Thanks again for your passion and teaching!


Your tutorials are effective and concise. Thank you for them! Accessible, from-scratch knowledge on these topics is essential at this time in history and you're really making a dent in that problem.


Thanks, I love your video about back propagation where you painstakingly spell out every calculation. It was like a breath of fresh air compared to other materials out there.


Thanks for your work Andrej! I've been doing earlier lectures and this is absolutely fantastic educational content!


Thank you for your constant contributions.


Amazing work, much appreciated


Thank you for your great work!


just started watching your lectures! they are great!


I have taken several masters-level courses in Machine Learning -- and even with those credentials, I cannot recommend enough Andrej's youtube series, "Neural Networks: Zero to Hero". There, he teaches you, from scratch, how to build everything from the underlying automated gradient calculation system in pytorch, all the way up to the slower version of this model - `MinGPT`.

[1] https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...

(edit: self-promo: I'm currently working on a Typescript follow-through of this same series of video lectures, if you want to follow along with stronger types for explanation: https://github.com/Marviel/lab-grad)


I can’t believe I just spent 2 and a half hours glued to my phone in bed watching this, for absolutely no reason other than it was such an interesting intro (to a subject I’m already familiar with). Thanks for the recommendation, and thanks Andrej for making this!


How does it compare to fast.ai? As an engineer looking to learn, what should I start with?


Both are good for different things.

Fast.ai is great, but it takes the top-down, rather than bottom-up, approach. It takes you from a production-level black box that you don't understand down to the details. The benefit there is you get good high-level intuition of how it behaves at the "let me use this technology for a job" level.

Separately, the fast.ai library is also highly recommendable -- it comes with some state-of-the-art image recognition models, and its training wrappers are really helpful particularly for image-recognition dataset training.

Karpathy's "Neural Networks: Zero to Hero" video series starts at the level of individual neurons and works you up to the final product. For some reason, both this style and Karpathy's conciseness appeal to me slightly more. I'm also super detail-oriented, though -- and any level of "hand waving" (even if further explanation comes later) always bothers me. He's also got some pretty high-profile industry experience, which carries some weight with me.

But I'll say that both are really high quality -- ultimately, my recommendation would be to follow whichever one speaks most to you personally after the first hour or so.

EDIT: Per Jeremy's response below, if you want the bottom-up approach but like the fast.ai teaching style, you should check out "part 2" of the fast.ai set of tutorials, which is exactly that.


fast.ai has both - the "part 1" section is top-down, and the "part 2" section is bottom up. You can do part 2 without having done part 1. Part 2 starts with implementing matrix multiplication from scratch, then backprop from scratch, then SGD from scratch, etc.

There will be a new version of the part 2 course out in a few weeks. It even covers stuff like random number generation from scratch, convolutions from scratch, etc. It gradually works all the way up to Stable Diffusion.

@karpathy's and the fast.ai lessons work well together. They cover similar topics from different angles.

(I'm the primary creator of the fast.ai courses.)


That's awesome! I did not know that part 2 was structured this way, and will check it out. Will be really neat to see you teach stable diffusion.

Thanks for your work on fast.ai!


Jeremy @ Fast.ai says he takes this pedagogical approach because it's "proven" to be the best way to learn. He's probably right, but I do find it confusing at times, because in the beginning you're just hitting ctrl + enter on an IPYNB haha.

Maybe Karpathy's approach will speak to me more--thanks for the recommendation!


Wow, I just watched the first video and it's, hands down, the most crystal clear explanation of neural nets and backpropagation I've ever seen. Bravo.


This is really good, and I was really excited by it but then I read:

> running on a single 8XA100 40GB node in 38 hours of training

This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high-end computer to improve on this model. Until that happens, this whole field is going to be owned by big corporations.


I don't know if that's a blocker. Ordinary people commonly rent a $40k machine for 38 hours from companies like Avis and Hertz.

If training a large model now costs the same as driving to visit grandma, that seems like a pretty good deal.


That's a great comparison. For a real number, I just checked Runpod, and you can rent a system with 8xA100 for $17/hr, or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting a 1xA5000 system, and that only costs $0.44/hr, which is perfect for learning and experimentation.


It would be great if a tradeoff could be made, though. For example, train at 1/10th the speed for 1/10th of the cost.

This could correspond to taking public transport in your analogy, and would bring this within reach of most students.


Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of flops.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.


The problem with that is that, currently, available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So there sadly isn't anything out there that you can load a model this large onto, except a rack of 8x+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB, still fit on a desk, and draw normal power would be great.


Technically this is not true - there are a lot of techniques to shard models and store activations between layers or even smaller subcomponents of the network. For example, you can split the 175B-parameter BLOOM model into separate layers, load up a layer, read the previous layer's output from disk, and save the output to disk.

And NVIDIA does make cards like you are asking for - the A100 is the fast memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB so this is less true now than the P40 vs P100 gen).

Oddly, you can get close to what you are asking for with an M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.
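For illustration, here is a rough sketch of that layer-by-layer offloading idea in plain PyTorch; the checkpoint filenames and layer shapes are made up for the example, not BLOOM's actual layout:

    import torch
    import torch.nn as nn

    # Layer-by-layer offloaded inference: only one layer's weights live on the
    # GPU at a time, and activations are staged on disk between layers.
    # The "layer_i.pt" checkpoints and the layer shape are hypothetical.
    N_LAYERS = 4
    D_MODEL = 1024

    def run_offloaded(x: torch.Tensor) -> torch.Tensor:
        torch.save(x, "act_0.pt")  # stage the initial activations on disk
        for i in range(N_LAYERS):
            # Build an empty layer and load only this layer's weights.
            layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True)
            layer.load_state_dict(torch.load(f"layer_{i}.pt"))
            layer = layer.to("cuda").eval()

            with torch.no_grad():
                act = torch.load(f"act_{i}.pt").to("cuda")
                out = layer(act)

            # Save this layer's output and free the GPU before the next layer.
            torch.save(out.cpu(), f"act_{i + 1}.pt")
            del layer, act, out
            torch.cuda.empty_cache()
        return torch.load(f"act_{N_LAYERS}.pt")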


Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?


Bloom-petals


Amazing. Thank you.


No prob. I think it’s a great idea


I guess this would only become a reality if games started requiring these cards.


Well, if it used to cost you $1 for 1 hr at 1x speed, it will now take you 10 hr at 0.1x speed and, if my math checks out, still cost $1. You need to shrink the model.


But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)


The good news is that, unlike vehicles, the rate for rented compute will continue to drop


Let's not forget that rendering 3D Animations in 3DSMAX or Maya used to take days for a single frame for a complex scene, and months for a few minutes.


You have to gas it up and heaven help you if it gets a scratch or a scuff.


Great news! Cloud instances' energy usage is included in their price, and because they're remote and transient, it's impossible to permanently damage them.


I think the equivalent of not being careful and getting a dent, in this context, is leaving it open to the internet and having a bitcoin miner installed.


You free the instance and the miner is gone.


As you are paying for the resources you use that's fine.

The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.


A better fit would be if you have unlimited liability, like with AWS, and you leak your key pair. Then someone runs up a $100k bill spinning up mining instances.


but you still have to pay for network ingress/egress traffic.


Similarly maybe we should only let people rent a NanoGPT box if they are over 25 and they have to get collision insurance.


If you can fit the training into 24GB, a used RTX 3090 for $700-$800 seems like a good deal at the moment. They are about 45-65% as fast as the A100 according to https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...

So if you buy two of these cards it will take 12-13 days instead of 38 hours but only require a $2500 PC.

James Betker, who created tortoise TTS, built his own $15k machine with 8x RTX 3090 and trained the models with it. He now works for OpenAI…
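For what it's worth, a rough back-of-the-envelope check on that 12-13 day figure, assuming each 3090 is ~0.55x an A100 and near-linear scaling:

    # 8xA100 for 38 hours vs. two RTX 3090s assumed at ~0.55x an A100 each.
    a100_hours = 8 * 38               # ~304 A100-hours of work
    equiv_a100s = 2 * 0.55            # two 3090s ~= 1.1 A100-equivalents
    hours = a100_hours / equiv_a100s  # ~276 hours
    print(hours / 24)                 # ~11.5 days, in the 12-13 day ballpark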


Recommended reading:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learni...

TL;DR: You probably don't need that expensive Threadripper because 2x PCIe 4.0 x16 will not be very beneficial. Go cheap, go 2x PCIe 4.0 x8.


Any link to the $15k machine? Maybe it is cheaper now.


I think it was a DIY machine; those RTX 3090s have gotten cheaper for sure. From my experience, going beyond 4 GPUs is a pricey affair; see [§]. All but one model of the RTX 3090 require at least 3 slots.

If 4 GPUs connected via PCIe 4.0 x16 are enough, you can choose among various sTRX4 boards for 3000-series AMD Threadripper CPUs.

[§] https://www.reddit.com/r/deeplearning/comments/tw0olq/commen...

Another useful URL: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-...


It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.

Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending


Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/


Hopefully your progress gets saved in time when the spot instance inevitably gets terminated in the midst of training.


"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...


If you use Horovod Elastic, I think you can avoid this problem working across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...


If you're doing something new/custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and fine-tuning settings.

(I assume. I've never worked with GPT, but have done similar work in other domains).


After training don't you have to keep it running if you want to use it?


Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.
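As a concrete sketch of the "just download the model" path, this is roughly how you'd run a pretrained GPT-2 on a consumer GPU with the Hugging Face transformers library (the checkpoint name and prompt are just examples):

    from transformers import pipeline

    # "gpt2-xl" is the 1.5B-parameter GPT-2; the smaller "gpt2" / "gpt2-medium"
    # checkpoints fit on almost any consumer card.
    generator = pipeline("text-generation", model="gpt2-xl", device=0)

    print(generator("The meaning of life is", max_length=60)[0]["generated_text"])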


Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?


What’s required to run the model?


The biggest GPT-2 (1.5B params) takes about 10GB of VRAM, meaning it runs on an RTX 2080 Ti or the 12GB version of the RTX 3080.


What's the largest language model I can run on a 3090 with 24 GiB RAM?


Depends on precision: you can run a ~5B model at fp32 or a ~11B model at fp16, max. Int8 is really bad for real-world use cases, so I'm not mentioning it.

But if you are looking to get the performance of ChatGPT or GPT-3, then don't waste your time: all small GPT-3-like LLMs (below at least 60B params) are useless for any real-world use case, they are just toys.
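The rough arithmetic behind those limits, assuming the weights dominate and you leave ~10% headroom for activations:

    # Memory for weights ~= params * bytes_per_param; fit check for a 24 GB card.
    vram_bytes = 24e9
    for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        max_params = vram_bytes * 0.9 / bytes_per_param
        print(f"{name}: roughly {max_params / 1e9:.1f}B parameters")
    # fp32 -> ~5.4B, fp16 -> ~10.8B, int8 -> ~21.6B (quality concerns aside)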


If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Fortunately very few real world use cases need to be this general.

If you are training an LLM on a domain-specific corpus or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys", and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.


> If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Yes, thanks, that's what I meant.

> If you are training an LLM on a domain-specific corpus or finetuning on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not "toys", and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.

Agree, BERT family is a good example here.


Okay, thank you. Perfect response.


https://github.com/karpathy/nanoGPT#i-only-have-a-macbook

> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)
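To see why the vocabulary dominates at that scale, here's a quick (approximate) parameter count for the tiny config quoted above - standard GPT-2-style block arithmetic, not nanoGPT's exact bookkeeping:

    # Tiny CPU config: 4 layers, 4 heads, 64-dim embeddings, GPT-2's 50257-token vocab.
    vocab_size, n_embd, n_layer, block_size = 50257, 64, 4, 64

    token_emb = vocab_size * n_embd                   # ~3.2M params; the output head is the same size (possibly weight-tied)
    pos_emb = block_size * n_embd                     # ~4K
    body = n_layer * (4 * n_embd**2 + 8 * n_embd**2)  # attention + MLP per block, ~0.2M total

    print(f"embedding table: {token_emb/1e6:.2f}M, transformer body: {body/1e6:.2f}M")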


But how often do you need to run this? You can rent 8xA100 on Lambda Labs [0] (no affiliation) for $8.80/hr, so you should be able to run through the entire dataset for less than $350.

[0] https://lambdalabs.com/service/gpu-cloud#pricing


They are acknowledged at the bottom for supporting Andrej's research!!


A couple of weeks ago a new paper came out that shows how to train a high quality language model on a single GPU in one day.

https://arxiv.org/abs/2212.14034


If you can't fit the model on your resources, you can leverage DeepSpeed's ZeRO-Offload, which will let you train GPT-2 on a single V100 (32GB).

Alternatively, if you're doing research (with the caveat that you have to either publish, open-source, or share your results in a blog post), you can also get access to Google's TPU Research Cloud, which gives you a few v3-8s for 30 days (you can't do distributed training across them, but you can run workloads in parallel). You can also ask nicely for a pod; I've been granted access to a v3-32 for 14 days pretty trivially, which (if optimized) has more throughput than 8xA100 on transformer models.

TPUs, and even more so pods, are a bit harder to work with, and TF performs far better than PyTorch on them.

https://www.deepspeed.ai/tutorials/zero-offload/

https://medium.com/analytics-vidhya/googles-tpu-research-clo...
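For reference, a minimal sketch of what wiring up ZeRO-Offload could look like; the config keys follow the DeepSpeed docs linked above, but the model, batch size and learning rate are placeholders:

    import deepspeed
    import torch.nn as nn

    # Placeholder model; in practice this would be the GPT-2 you want to train.
    model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)

    # ZeRO stage 2 with optimizer state offloaded to CPU memory (ZeRO-Offload),
    # which is what lets a GPT-2-sized model train on a single V100.
    ds_config = {
        "train_batch_size": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        "optimizer": {"type": "AdamW", "params": {"lr": 6e-4}},
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    # The training loop then calls engine(batch), engine.backward(loss), engine.step().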


I was curious about how much this would be to rent, because definitely the cost of those servers is outside the budget! Lambda has 8xA100 40gb for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing


It seems about as likely as people being able to build big-automaker-grade cars with just the tools in their garage. More compute is going to keep producing better results, at least for LLMs.


How are universities and colleges dealing with this kind of demand for computing power? It must be hard to be able to do some courses now.


Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.


I went to a smallish state university; even there we had our own HPC center and lab. We had a proper (IIRC) 6-row HPC data center across campus, and as an undergraduate research assistant I had a continuous budget available for building Beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 Raspberry Pis to build an ARM cluster.


As far as research groups go - they get funds (project grants, donations, etc.) to purchase machines and parts, and then users have to timeshare them.

These machines are pretty much crunching numbers 24/7, and your project will get appended to a queue.


'group project'


That's to train it from scratch, though, right? If you preload the GPT2 weights you don't need to do this. You can just give it additional training on your texts.


Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.

Besides the rental options discussed below, these NVIDIA boxen don't look too big, so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.


If GPT-2 / nanoGPT needs this setup, just imagine what GPT3 / chatGPT needs!


Supposedly, even running the trained model for ChatGPT is extremely expensive, unlike the image generators, which can largely be run on a consumer device.


I don’t know anything about this, but is that this instance type on AWS? p4d.24xlarge


You can rent on AWS and other cloud providers.


So if I see it right, that would be a p4d.24xlarge instance, which goes for about $32.77 an hour nowadays, so the total training would be about $1245. Not cheap, but certainly not a nation-state budget.

Edit: I just noticed Lambda Labs. It seems they ask $8.80 per hour for an instance of this caliber, which puts the total training cost around $334. I wonder how it can be that much cheaper.
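The arithmetic, for a 38-hour run:

    hours = 38
    print(hours * 32.77)  # AWS p4d.24xlarge on-demand: ~$1245
    print(hours * 8.80)   # Lambda 8xA100 instance:     ~$334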


That is a key difference. You can't easily and cheaply rent an auto factory, but you're starting to be able to rent an LLM "training factory" once per model, and then run inference on that model much more cheaply.


Doesn't Hugging Face have dozens of freely available pretrained models like this (including various-sized implementations of GPT-2), and isn't the source available for most of them if you wanted to train them yourself?

All I see in the comments is praise for the author as a person, so I'm just wondering what's unique about this that's not available elsewhere? 730 upvotes and counting, so I assume I'm missing something...


True, but the use cases aren't the same. As he did before for other models, he has a knack for distilling the code down to beautiful, self-contained examples of high didactic value.

It's an order of magnitude easier to grok the basics from this repo than from going through (admittedly more ergonomic or performant or production-ready) huggingface repos.


Additionally, in terms of the streamlining nanoGPT purports to offer, HuggingFace's implementations play nice with optimization techniques such as ONNX/TensorRT, which will give you better performance than anything PyTorch-based, even if the gain is minimal.

That doesn't mean an ONNX-ed nanoGPT won't be better, but the field of optimized text generation isn't as new as people claim.


This is a didactic implementation. If you read the HuggingFace repo it is much more abstracted on account they implement many models in the same codebase. It's not fast or big, just easier to read and tweak.


If so, then why does the second line of its documentation say that it is "a rewrite of minGPT that prioritizes teeth over education"?


minGPT prioritized being understandable above all else, and was not very fast. This repo includes several optimizations, but it is still much more understandable than probably any other open-source implementation.


This is a dumb question about language models in general, not necessarily specific to NanoGPT: why is all the focus on training? Can I download and run a pre-trained model locally? Surely the specs required to run a model are much, much lower than those required to train the model?


I believe the training is where the architecture of the model is most apparent. You can absolutely download plenty of pre-trained models.

You will also probably need to fine tune for a specific use case, so a common approach is downloading a pre-trained model and fine tuning.

I think including the “from scratch” tuning script is educational more than anything else.


It's the equivalent of building from source versus downloading a compiled binary.

Also you can perform "fine tuning" which means you start with a trained model and train it further on your own data, allowing you to customize the model for specific tasks.


If you're only using pre-trained models, it's going to be harder to differentiate yourself. Training / specialization of models is where the moat-building is (due to access to different data sets / better ideas). By specializing / training, more of the token limit can be used for generation rather than prompting / better prompts can be made.

The lower the cost of training, the more profitable any resultant business. You can even envision businesses that train the model regularly to bring in new knowledge. The cheaper this is, the more opportunities open up.


Inference can still be a bottleneck, I think, since you usually load the whole thing into memory, which is usually 32-64GB+?


Language models range from 1 to 300+ GB when loaded. It depends on how you load them: if you load in int8, you get a 4x reduction.
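As a hedged example of the int8 route, recent transformers versions (with accelerate and bitsandbytes installed) expose 8-bit loading roughly like this; the checkpoint name is just an example:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the weights quantized to int8, roughly 4x smaller than fp32
    # (2x smaller than fp16). Needs bitsandbytes and a CUDA GPU.
    model_name = "EleutherAI/gpt-j-6B"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # let accelerate place layers on GPU/CPU
        load_in_8bit=True,   # bitsandbytes int8 quantization
    )

    inputs = tokenizer("Language models range from", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_length=40)[0]))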


Are there any possible technological or scientific leaps on the horizon that would reduce training time by an order of magnitude or more? GPT-3 took 355 years to train with incredibly expensive hardware, which means small players have no chance to push the state of the art.


As models get bigger, fewer and fewer neurons get activated by any given input. If you can somehow predict which neurons get activated, you can skip the vast majority of the computational load. I have read a paper arguing that only 0.5% of the neurons are actually active in a 200-million-parameter model, so you can get a 200x improvement just from that.

What this tells you is that there is very little money in optimizing deep learning, and that NVIDIA has made it very easy to just throw more hardware at the problem.


> very little money in optimizing deep learning

Oh - there are a lot of people working on optimizing AI. Amongst hobbyists, academia, and corporations alike.

The thing is, if you come up with a neat optimization that saves 30% of compute for the same results, typically instead of reducing your compute budget 30%, you instead increase your model/data size 30% and get better results.


Jevons paradox of data and AI. The more efficiently data is used, the more demand there is for data.


Any state of the art model takes about three weeks to train.


More an indication of human patience than task difficulty.


This is hard a-priori, but fairly easy post-facto. Model distillation isn't a common practice yet, but it has already been demonstrated to be quite effective for specific use cases.


Distillation works but somehow we see very few papers doing it at this scale.


Do you have a link to that paper by any chance? By "neurons" did they mean weights or activations?


Here is a GPU implementation.

https://ieeexplore.ieee.org/document/9635657

It is somewhere from 8x to 25x faster than doing dense machine learning. The speedup was higher on the original CPU implementation and the GPU paper mentions that if there isn't enough shared memory on the GPU it will have to switch to an algorithm that has more overhead.

By neurons I actually meant "nodes"

My comment is effectively a summary of this article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...

Edit: There is a paper for sparse spiking gradient descent promising a 150x improvement. I am not sure how practical this is because spiking neural network hardware heavily limits your model size but here it is:

https://arxiv.org/abs/2105.08810


200x improvement over Nvidia with sparse matrices or without? Nvidia supports a certain level of sparsity.


> argued that only 0.5% of the neurons are actually active in a 200 million parameter model so you can get a 200x improvement just from that

Yes, but you don't know which 0.5%, as it depends on the input text.


I wonder about this, too. OpenAI's biggest 'moat' is that their model takes so much resources to train, not that their algorithms are particularly secret.

One idea I had was to not use one single model to learn all steps of the task, but to break it up. The human brain has dedicated grammar processing parts. It is unclear whether something like a universal grammar exists, but we have at least an innate sense for rhythm. Applied to NLP, you could heavily preprocess the input. Tokenize it, annotate parts of speech. Maybe add pronunciation, so the model doesn't have to think about weird english spelling rules, and so you can deal with audio more easily later. So I would build all these little expert-knowledge black boxes and offer them as input to my network.

But there is also some inherent resource cost in large language models. If you want to store and process the knowledge of the world, it is going to be expensive no matter what. Maybe we could split the problem into two parts: Understanding language, and world knowledge (with some messy middle ground). I believe you could replace the world knowledge with a huge graph database or triple store. Not just subject-verb-object, but with attribution and certainty numbers for every fact. The idea would be to query the database at inference time. I don't know how to use this in conjunction with a transformer network like GPT-3, so you'd likely need a very different architecture.

The big benefit of this would be that it is feasible to train the language part without the world knowledge part with much less resources. But you have other benefits, too. ChatGPT is trained to "win the language game". But as they say, winning the argument does not make you right. If you have a clean fact database, you can have it weigh statements from trustworthy sources higher. You then basically have a nice natural language frontend to a logical reasoning system that can respond with facts (or better: conclusions).


GPT and the human brain (at least the language/speech part) have nothing in common. We, as humans, do not use language in a generative way; it is derived from a higher or very low level of abstraction (intentions, emotions, etc.) and is explicitly used for communicating something. Even this text is based on previous knowledge, saved in an abstract way, and while writing this I must follow the syntax of the language and write in the right order; otherwise you, the person who reads this, will not understand what I mean. While GPT can generate the same text, it does not have a motivation and has no need to communicate (while I just wanted to feel good by making some contribution on HN).

So yes, very different architecture.


> and while writing this I must follow the syntax of the language and write in the right order; otherwise

A good example that is not, word randomised order and kombination with Mrs Spelling and fonetic spel-ing prevent ye knot that which I wrote you to komprehend.

(My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).

A better point is that GPT-3's training set is more tokens than the number of times an average human synapse fires in a lifetime, squeezed into a network with about 3 orders of magnitude fewer parameters than the human brain has synapses.

It's wrong to model AI as anything like natural intelligence, but if someone insists, my go-to comparison (with an equivalent for image generators) is this: "Imagine someone made a rat immortal, then made it browse the web for 50,000 years. It's still a rat, despite being very well-trained."


> (My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).

At least for me it's perfectly understandable (except the "Mrs" part). This reminds me of those "did you know you can flip characters randomly and our brain can still understand the text" copypastas that can be found everywhere. I think it's probably quite similar for word order: as long as your sentence structure is not extremely complicated, you can probably get away with changing it any way you like. Just like nobody has issues understanding Yoda in Star Wars.

Although I think there are some limits to changing word order - I can imagine complicated legal documents might get impossible to decipher if you start randomizing word order.


These are conceptual "differences" that don't actually explain the mechanics of what's going on. For all you know "motivation", "intentions", etc. are also just GPT-like subsystems, in which case the underlying mechanics are not as different as you imply.


If they were GPT-like subsystems, humans would be emitting MWs of power instead of the ~100W we do now.

Whatever humans have, it is many orders of magnitude better…


That's the hardware it runs on, not the software architecture of GPT. I could equally say that transistors are faster than synapses by the same ratio that marathon runners are faster than continental drift.


Or biology evolved a better way to do the same or similar enough computation that we simply haven't yet discovered.


Emotion is just a "spiritual" word for a utility function. Or a terminal goal, to be more precise.


It seems to me that a lot of everyday communication is rather statistical in nature. We don’t necessarily think deeply about each word choice but instead fall back on well worn patterns and habits. We can be more deliberate about how we compose our sentences but most situations don’t call for it. It makes me wonder if we don’t all have a generative language model embedded in our brains that serves up the most likely next set of words based on our current internal state.


> GPT and human brain have nothing in common

Here we go again. They must have something in common, because for about 90% of the tasks the language model agrees with humans, even on novel tasks.

> We, as humans, do not use language in a generative way

Oh, do you want to say we are only doing classification from a short list of classes and don't generate open ended language? Weird, I speak novel word combinations all the time.


No, what I mean is that the next word I speak/write after the current word is not based on a statistical model, but on a world model which includes a language structure based on a defined syntax and cultural variety. I actually mean what I say, while ChatGPT just parrots around weights and produces output based purely on statistics. There is zero modeling which translates into the real world (what we normally call "understanding" and "experience").

As was said, a different architecture.


Oh, I see. Then I agree with you, an isolated model can't do any world modelling on its own. No matter how large it is, the real world is more complex.

It might be connected to the world, of course. And it might even use toys such as simulators, code execution, math verification and fact checking to further ground itself. I was thinking about the second scenario.


On top of it not having "motivation" to communicate, it has literally nothing to communicate in the first place.

That's the key difference. We use language to express conceptualizations. We have some kind of abstract model somewhere that we are translating.

Maybe it isn't a cohesive model either. All I can say for certain is that - whatever it is - we are expressing it.

GPT does not express. It parrots. There is no conceptualization.


The more experience I get, the more I wonder if this is really the case for us. We certainly have some kind of abstract model in our heads when thinking deeply about a problem. But in many settings - in a work meeting, or socially with friends - I think it is a much more automatic process. The satisfaction you get when saying the right thing, the dread when you say something stupid: It is just like playing a game. Maybe the old philosophical concept of society as merely "language games" is correct after all. A bit silly but I find the thought makes annoying meetings a bit more bearable.

But you are of course right with GPT, it has no inner life and only parrots. It completely lacks something like an inner state, an existence outside of the brief moment it is invoked, or anything like reflection. Reminds me of the novel "Blindsight" (which I actually haven't read yet, but heard good things about!) where there are beings that are intelligent, but not conscious.


Intelligent but not conscious would still be a few steps ahead of GPT.

We can take a concept and refactor it symbolically. GPT can't do that. All it does is find symbols that are semantically close to other symbols.


I’m not sure that those two processes are as distinct as you believe them to be.


You seem very sure they aren't, yet you have no evidence apart from your own belief that you might be correct.

That's circular reasoning.


Yup.


The biggest moat is high-quality data: both their proprietary datasets (WebText, WebText2, etc.) and now their human-annotated data. Another, secondary moat is their expertise with training models using PPO (their RL method); they can get results that are quite a bit better than other labs'. I say this moat is secondary because it's possible that you can get similar results with other RL algorithms (e.g. DeepMind using MPO), and because maybe you don't really need RL from Human Feedback - just fine-tuning on instructions may be enough.


I find OpenAI having exclusive access to that kind of high-quality data more concerning than them having access to their current amount of compute and currently trained models. A couple of million dollars' worth of compute is within reach of any medium-sized research university, larger company, or any country worth mentioning. And seeing as Moore's law still applies to GPUs, the cost will only fall.

However, high-quality data is scarce. I would be willing to fund a proper effort to create high-quality data.


It's not just about compute; if that were the case, then models like BLOOM and OPT, which also have 175 billion parameters, would have the same performance for real-world use cases as GPT-3, but they don't. Datasets are also very important.


Check out DeepMind RETRO, it's one year old already, but exactly what you say:

https://www.deepmind.com/publications/improving-language-mod...


Model size does not necessarily correlate with quality of results.

"Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher (280B)" - https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b...


An interesting outcome of the nanoGPT repo is this struggle to exactly match the Chinchilla findings[0], even after discussing it with the authors.

A larger discussion is that the scaling laws achieve loss-optimal compute time, but the pre-training loss only improves predictions on the corpus, which contains texts written by people that were wrong or whose prose was lacking. In a real system, what you want to optimize for is accuracy, composability, inventiveness.

[0]: https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...


I highly doubt this in practice on a large scale. Outside of the common phenomena of "most large NNs are undertrained" and "less, better data is sometimes better than more, worse data", there are no other obvious mechanisms to explain why a smaller model with the same or similar architecture would be better than a larger one.

I claim instead that we are still hardly scratching the surface with how we evaluate NLP systems. Also, some fields have straight-up trash evaluation schemes. Summarization and ROUGE scores are totally BS, and I find the claim that they even correlate with high-quality summaries suspect. I say this with publications in that subfield, so I have personal experience with just how crummy many summarizers are.


> there are no other obvious mechanisms to explain why a smaller model with the same or similar architecture would be better than a larger one

Overfitting?


The consensus seems to be that the majority of LMs are undertrained, not overfitting, though.


What do you mean by "small players have no chance"? OpenAI was founded in 2015; it used to be a "small player" which just got things right and grew with it - we're not talking about Google or Facebook investing a chunk of their billions in cash. In Germany, AlephAlpha has built their own supercomputer and is training similarly sized models. It's expensive for sure, but well within the possibilities of startups. In France, researchers trained the similarly sized BLOOM model https://huggingface.co/bigscience/bloom. They claim it cost between $2 and $4 million.

Sure, a single researcher can't replicate this at their university, but even though OpenAI likes to publish it this way, we're not really talking about research here. Research was inventing the transformer architecture; this is just making it bigger through (very smart) engineering choices. It's something companies should do (and are doing), not researchers.


Microsoft (using Azure DCs) built a supercomputer with 10,000 V100 GPUs exclusively for OpenAI. [0]

It is estimated that it cost around $5M in compute time to train GPT-3.

OpenAI had received billions in investment prior to launching GPT-3, including $1B from Microsoft in 2019.

[0]: https://blogs.microsoft.com/ai/openai-azure-supercomputer/


> we're not talking of Google or Facebook investing a chunk of their billions cash

OpenAI had raised $1B from Microsoft in 2019 and used it to train a 175B param model. Now, they have raised $10B and are training GPT-4 with 1.5T params. GPUs are capital intensive and as long as there are returns to bigger models, that's exactly where things will go.


I can't find any source on the 1.5T params number. I'd love to read more if you have any links to share. Thanks


AFAIK, GPT-4 is mostly rumours so far, same thing for the 1.5T number. GPT-4 is surely coming.


Maybe it will be called GPT-XP by then, with Microsoft owning half of it.


Looking forward to seeing GPT-4 recommend Linux and LibreOffice instead of Windows/Office as the logical choice of a 250-IQ ML model...


In my imagination, OpenAI does what Bungie did when MS bought them, and open-sources what used to be their crown jewels.

That said, GPT-AlephOne only makes sense if there's a preceding GPT-∞.


They have got to release GPT-3.11 For Workgroups first.


Or GPT-365


Then they can bring back the talking paperclip, but this time actually useful.


It could actually work. It would be an incredibly gutsy move and I love it, and they'd probably earn a lot of respect. They’d get so much press for it. And if it held up, it’d probably be one of the things that MS is remembered for.


Why not ask GPT itself what it wants to be called?


Or GPT One.


GPT-10 will be evergreen and 'the last version of GPT'.

And then three years later GPT-11 will be required to run the latest games.


Will 1.5T parameters be possible to run in the public way GPT-3 is? I can’t wait to see what happens with this much learning!


OpenAI was founded in 2015 by a group of billionaires who pledged $1Bn of funding. That is hardly a small scrappy start up.


> we're not talking of Google or Facebook investing a chunk of their billions cash.

On the contrary, in this thread we are mainly talking about that.


I am actually still unclear how AlephAlpha pulled that off and who funds them, since they have a rather low profile team.


Small players should focus on applications of this tech.

We now know that whatever AI Models succeed in the future, they'll be trained by a huge company and finetuned to a specific use case. Small companies should be working on use cases, and then just upgrade to the latest SOTA model.


> Small players should focus on applications of this tech.

That sounds a bit condescending. We are probably at a point where the government should intervene and help establish a level playing field. Otherwise we are going to see a deeper divide between multibillion-dollar businesses conquering multiple markets and a sort of neo-fiefdom situation. This is not good.


It's not that condescending; that's today's reality. Should I feel entitled to $600k of training time that may or may not work? Do you think the government is a good actor to judge whether my qualifications are good enough to grant me resources worth a house?

It's quite reasonable for small players to make use of models that are already trained.


> Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?

Governments already routinely do that for pharmaceutical research or for nuclear (fusion) research. In fact, almost all high-impact research and development was funded by the government, mostly the military. Lasers, microwaves, silicon, interconnected computers - all funded by the US taxpayer, back in the golden times when you'd get laughed out of the room if you dared think about "small government". And the sums involved were ridiculously larger than the worth of a house. We're talking billions of dollars.

Nowadays, R&D funding is way WAY more complex. Some things like AI or mRNA vaccines are mostly funded by private venture capital, some are funded by large philanthropic donors (e.g. Gates Foundation), some by the inconceivably enormous university endowments, a lot by in-house researchers at large corporations, and a select few by government grants.

The result of that complexity:

- professors have to spend an absurd percentage of their time "chasing grants" (anecdata, up to 40% [1]) instead of actually doing research

- because grants are time-restricted, it's rare to have tenure track any more

- because of the time restriction and low grant amounts, it's very hard for the support staff as well. In Germany and Austria, for example, extremely low-paid "chain contracts" are common - one contract after another, usually for a year, but sometimes as short as half a year. It's virtually impossible to have a social life if you have to uproot it for every contract, because you have to take contracts wherever they are, and forget about starting a family because it's just so damn insecure. The only ones that can make it usually come from highly privileged environments: rich parents or, rarely, partners that can support you.

Everyone in academia outside of tenured professors struggles with surviving, and the system ruthlessly grinds people to their bones. It's a disgrace.

[1] https://www.johndcook.com/blog/2011/04/25/chasing-grants/


Pharmaceutical or nuclear research doesn't really qualify as the "small scale" this thread started with. I know there are massive amounts of money handed out by governments to fund research, but for a 3-guy startup in a garage that's probably hopeless. Public money is cursed anyway; it's better not to touch it.

I've also read in many places that academic research funding is way too misaligned. It's a shame, really.


I'm not being condescending at all; we've learned that the value in AI is in the applications. If you think the government should regulate the field, it should be to make AI models a commodity, like electricity.


Do you think you'll get a global agreement on this? Or would China just eat America's lunch then?


Yes, see DeepMind RETRO:

> In our experiments on the Pile, a standard language modeling benchmark, a 7.5 billion parameter RETRO model outperforms the 175 billion parameter Jurassic-1 on 10 out of 16 datasets and outperforms the 280B Gopher on 9 out of 16 datasets.

https://www.deepmind.com/blog/improving-language-models-by-r...

Though, there hasn't been much follow-up research on it (or DeepMind is not publishing it).

Annotated paper: https://github.com/labmlai/annotated_deep_learning_paper_imp...


The research is still ongoing, although perhaps lower-profile than what appears in the press.

RETRO did get press, but it was not the first retrieval model, and in fact was not SOTA when it got published; FiD was, which later evolved into Atlas[0], published a few months ago.

[0]: https://github.com/facebookresearch/atlas


How long does it take to train a human? It's useless for two years then maybe it can tell you it needs to poop.

The breakthrough will be developing this equivalent in an accessible manner and us taking care to train the thing for a couple of decades but then it becomes our friend.


Yes, but to be fair, the system that does the training really sucks and doesn’t scale.


Neither does OpenAI. It costs so much and still delivers so little. A human can generate breakthroughs in science and tech that can be used to reduce carbon emissions. ChatGPT can do no such thing.


What percentage of humans make meaningful contributions to advancing science or technology? The overwhelming majority of us are just worker bees servicing the needs of the human population.


I agree with you on this point. It's also arguable that fewer people with a better education system could yield the same result with less environmental impact.

But my point, poorly explained, is that whatever ChatGPT is, it isn’t original or creative thought as a human would do it.

Chomsky’s example (which is based off Turing): Do submarines swim? Yes, they swim — if that’s what you mean by swimming.


We don't have any clear definitions for "creativity" to begin with. In practice, in these contexts, it seems to be defined as "whatever only humans can do" - that is, the goalposts are automatically moved with every AI advancement.


How could they be moved when they aren't even defined in the first place? Scientists don't even know where to begin when it comes to studying the mind and human consciousness.

But yes, scientists can look at your experiments and show that they don't have anything in common with human thought.


> What percentage of humans make meaningful contributions to advancing science or technology?

I’m a nobody that you’ve never heard of and I’ve arguably made meaningful contributions. If that’s true, don’t you think there could be way more people out there than you or sibling commenter imply?


You can't know that. Currently, 8 billion humans generate a few scientific breakthroughs per year. You'd have to run several billion ChatGPTs for a year with zero breakthroughs to have any confidence in such a claim.


With billions of GPT output streams, how do you actually discover and rank what's significant? Screen them through some even more powerful models? I imagine it's like a volcanic eruption of text, where some of it is absolutely brilliant, most is worthless, and finding the jewels is even more demanding than generating it all.


Some theories are easily testable. For instance, ask it to write some code to efficiently solve traveling salesman problems, and then test the code on some sample problems. You can score the quality of solutions and time taken, and manually inspect the best ones.


At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would. It can’t consume hundreds of math textbooks and learn the principles of math and then apply them more broadly to science textbooks and research papers. It can’t even reliably add two numbers.

Yes, brute forcing with hard AI can produce many thoughts. But the AI wouldn’t know they are correct. It couldn’t explain why. Any discovery would only be attributable to randomness. It wouldn’t be learning from itself and its priors.


> At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would.

Actually there are many indications that GPT understands the data, because its output mostly makes sense. The reason it can't assign meaning the way a human would is because a human can correlate words with other sensory data that GPT doesn't have access to. That's where GPT creates nonsense.

Think carefully about what "understanding" means in a mechanistic sense. It's a form of compression, and a few billion parameters encoding the contents of a large part of the internet seems like pretty good compression to me.


GPT doesn't display understanding of purely abstract systems, so I doubt it's an issue of lacking sensory information. It can't consistently do arithmetic, for example - and I think it would be presumptuous to insist that sensory information is a prerequisite for mathematics, even though that's how humans arrived at it.


It's not yet clear why it struggles with arithmetic. It could be data-related, could be model-related, although scaling both seems to improve the situation.

In any case, GPT could still understand non-abstract things just fine. People with low IQ also struggle with abstract reasoning, and IQ tests place GPT-3 at around 83.


I still think that this will be a major form of AI that is accessible to the public at large and it will enable productivity improvements at all levels.

I'm not joking, this is really something I think will/should happen.


Alternatively, are there ways to train on consumer graphics cards, similar to SETI@Home or Folding@Home? I would personally be happy to donate gpu time, as I imagine many others would as well.


There absolutely are! Check out hivemind (https://github.com/learning-at-home/hivemind), a general library for deep learning over the Internet, or Petals (https://petals.ml/), a system that leverages hivemind and allows you to run BLOOM-176B (or other large language models) distributed over many volunteer PCs. You can join it and host some layers of the model by running literally one command on a Linux machine with Docker and a recent enough GPU.

Disclaimer: I work on these projects, both are based on our research over the past three years


The cost of moving data from one GPU to the next will destroy performance.

The systems are moving in the opposite direction (look at the Dojo architecture or Tenstorrent).

The silver lining is that the cost of training will fall substantially with those architectures that are not based on reusing GPUs.


Work together and fuck up companies together. That's the way to go.


Or how to apply communism to software engineering. I like that.

More seriously, the risk that a few companies become even more powerful thanks to their restricted access to such NNs is very frightening. The worst part is that, without legal restrictions, there is nothing we can do about it. And I doubt legal restrictions will come in the next months / years.


Well at that point, some people might have the crazy crazy insight that no matter how big the model is, or how many GPUs they have, it burns up all the same.


What does “355 years” mean in this context? I assume it’s not human years


Claimed here, so this is presumably the reference (355 GPU Years):

https://lambdalabs.com/blog/demystifying-gpt-3

"We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run."


That's still including margins of cloud vendors. OpenAI had Microsoft providing resources which could do that at much lower cost. It still won't be cheap but you'll be way below $5m if you buy hardware yourself, given that you're able to utilize it long enough. Especially if you set it up in a region with low electricity prices, latency doesn't matter anyway.


Cumulative hours spent across training hardware.


I think AI is going to go the way of the hard sciences where the age of tinkerers making progress by leaps and bounds in their basement is over and incremental progress is going to be the domain of universities or large private companies that can afford to throw money behind it. I would love to be proven wrong and see radical shifts in how people approach these problems. Seems like the cycle started and got to this point way too soon for AI though


My take on this is that (good) content is one of the bigger problems still, particularly who exactly the original training data belongs to (or where it comes from). There's a certain risk (we'll see with GitHub Copilot soon) that progress will slow down for a bit until the licensing issues are all sorted out. This can only be solved (for now) by bringing in public funding/data, which universities have always been a very good proxy for. Which also means it (usually) should be open access to the public, to some extent (and useful for the garage folks to catch up a bit). But once we're past that, it'll be all about that giant body of pre-trained data, securely kept within the next Facebook or Microsoft, amounting to literal data gold (just much higher value at a lot less weight).


Tinkerers can fine tune a model though. Unfortunately most fine tuning seems to be outmatched at the next iteration of the model.


> Are there any possible technologal or scientific leaps on the horizon

Yes. From 2017: "Prediction 4: The simplest 2D text encodings for neural networks will be TLs. High level TLs will be found to translate machine written programs into understandable trees."

We have something coming out that is an OOM better than anything else out there right now.


I think something of the SETI@home kind will come.


small players will never have a chance to push the state of the art, as whatever optimization there is will also be applied at large scale with more money


Take a leaf from Seti@Home‘s book and try to come up with a distributed, volunteer-based approach to training an open source LLM. There is already an enormous amount of suitable ML hardware on end user devices.


Huggingface actually recently did this, but I think it's for inference on their giant models like BLOOM


Good point, but perhaps a leap could take small players into territories of language models that are large enough to be useful. GPT-3 crossed that threshold


A lot of SOTA comes from small players. It just isn't the case for LLMs.


Could this be distributed? Put all those mining GPUs to work. A lot of people like participating in public projects like this. I would!


>> GPT-3 took 355 years to train

> Could this be distributed? Put all those mining GPUs to work.

Nope. It's a strictly O(n) process. If it weren't for the foresight of George Patrick Turnbull in 1668, we would not be anywhere close to these amazing results today.


Why would an O(n) algorithm not be able to be distributed?


I couldn't find any references to George Patrick Turnbull. Is that an ancestor of yours? If so, the comment seems rather subjective.


They're being facetious about the '355 years to train' thing. ;)


OK haha good one then. Mine was a bit too subtle.


In theory, yes. "Hogwild!" is one approach to distributed training: in essence, each worker is given a chunk of data, computes the gradient, and sends it to a central authority. The authority accumulates the gradients and periodically pushes new weights (rough sketch below).

There is also Federated Learning which seemed to start taking off, but then interest rapidly declined.
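A toy sketch of that accumulate-and-step idea in plain PyTorch (a single process standing in for many workers; real systems like Hogwild! or federated learning add asynchrony, communication and fault tolerance on top):

    # Toy simulation of the pattern above: workers compute gradients on their
    # own data shards, a central authority averages them and updates weights.
    import torch

    model = torch.nn.Linear(10, 1)                 # stand-in for a big network
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    def worker_gradients(x, y):
        """What each volunteer would compute locally on its shard."""
        loss = torch.nn.functional.mse_loss(model(x), y)
        return torch.autograd.grad(loss, list(model.parameters()))

    shards = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(4)]
    grads = [worker_gradients(x, y) for x, y in shards]

    # "central authority": average the workers' gradients, then push new weights
    for p, *gs in zip(model.parameters(), *grads):
        p.grad = torch.stack(gs).mean(dim=0)
    opt.step()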


Exactly. This is inevitable imho. There is no way people will be OK with depending on a few walled-garden models.


It should be no issue if it became massively parallelized à la SETI. I wonder when Wikimedia or the Apache foundation will jump into AI.


Wikimedia and other organizations that deal with moderation might want to keep this technology out of the hands of the general public for as long as possible.


There are a couple of cases where small changes in the model make training much quicker. For example, the currently leading Go AI, KataGo, requires much less time to train than AlphaGo did.


Yes. There are plenty of forward leaps; most of them are not new and are just waiting to be integrated or released:

Let's pave the road for SkyNet hard lift-off:

-The first obvious one is use of an external knowledge store: instead of having to store facts in the neural weights, where they struggle, just store them in a database and teach your neural network to use it (a rough sketch in code follows at the end of this list). This is also similar to something like WebGPT, where you allow your network to search the web. This will let you have a network of 1G parameters (plus external indexes of a few TB) that is as performant as a network of 100G parameters, with better scaling properties too. You can probably gain at least 2 orders of magnitude there.

-The second leap is better architecture for your neural networks: approximating transformers, which are quadratic in compute, with something linear (Linformer) or n log n (Reformer) can get you an order of magnitude simply by reducing your iteration time. Similarly, architectures based on sparsity can give you faster computation (although some of the gains are eaten by the lower efficiency of sparse memory access patterns). Using (analog bits) diffusion to generatively pretrain a sentence at a time instead of token by token. You can probably gain between 1 and 3 orders of magnitude here if you write and optimize everything manually (or have your advanced network/compiler optimize your code for you).

-The third leap is reduced domain: you don't have a single network that you train on everything. Training one network per domain lets you use a smaller network that computes faster. It also lets you focus your training on what matters to you: for example, if you want a mathematics network, its parameters are not influenced much by showing it football pictures. There are at least 2 orders of magnitude there.

-The fourth one is external tool usage. It's related to the first one, but whereas the first is readily differentiable, this one necessitates some reinforcement learning (that's what decision transformers are used for).

-Compression: compress everywhere. The bottlenecks are memory-bandwidth related, so work in compressed form when relevant. One order of magnitude.

-Distributed training: the memory bandwidth inside a GPU is on the order of TB/s, whereas transfer to the GPU is on the order of 10 GB/s. There is an advantage to having the parameters reside on the GPU, but GPU memory is limited, so distributed training (something like petals.ml) lets collaborators pool memory bandwidth. Each actor can probably gain an order of magnitude, provided they can keep bad actors away.

-Use free resources: the other day Steam had 10M users with GPUs sitting around doing nothing; just release a Dwarf Fortress mod with prettier pictures and use the compute for more important tasks.

-Remove any humans in the loop: it's faster to iterate when you don't have to rely on any human, either for dataset construction or for model building.
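To make the first point concrete, a rough sketch of an external knowledge store (`embed`, `generate` and the passages are purely illustrative placeholders, not any particular library):

    # Sketch of point 1: keep facts in a vector index, not in the weights.
    # `embed` and `generate` are hypothetical stand-ins for an embedding model
    # and a (much smaller) language model.
    import numpy as np

    passages = ["GPT-2 has a context window of 1024 tokens.",
                "Paris is the capital of France."]
    index = np.stack([embed(p) for p in passages])       # (n_passages, dim)

    def answer(question, k=1):
        q = embed(question)                              # (dim,)
        top = np.argsort(-(index @ q))[:k]               # nearest passages
        context = "\n".join(passages[i] for i in top)
        return generate(f"Context:\n{context}\nQuestion: {question}\nAnswer:")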

:)


For casual readers like me: are there examples of what this can do once trained? E.g. it mentions training on Shakespeare, but gives no examples of fake Shakespearean.


The repo seems to imply that it matches GPT-2, so I imagine any analyses of GPT-2 will give you a good idea.


Does anyone know the main differences between GPT-2 and GPT-3? Are there significant architectural changes, or is the advancement primarily from training?


If you google "GPT-2 vs GPT-3" you'll find lots of overviews and comparisons, like:

* https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-showdown....

* https://bakztfuture.substack.com/p/the-chasm-between-gpt-2-a...


Thanks. Sounds like they 10x'ed the number of parameters, which made some "magic leap" that isn't yet well understood, and fed it more data to train it on more specialized domains.


Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and nanoGPT author is trying to reproduce that here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...


I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation in the paper is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on the broad themes of the paper (e.g. scale model and data approx. in tandem) and just roughly interpolating the main Figure, without sweating the details. The corresponding authors responded very kindly at first and I was able to bring the results closer but now they went dark. Still hoping to make things match, if others in LLM space can spot any issues in my own reproduction please let me know.
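For reference, the back-of-the-envelope estimate most scaling-law work (Chinchilla included) starts from is C ≈ 6·N·D, i.e. about six FLOPs per parameter per training token; with the commonly cited GPT-3 numbers it lands close to the ~3.14e23 figure quoted elsewhere in this thread:

    # Standard scaling-law approximation: training compute C ~= 6 * N * D
    # (2 FLOPs per param per token in the forward pass, ~4 in the backward).
    N = 175e9                      # GPT-3 parameters
    D = 300e9                      # GPT-3 training tokens (~300B)
    print(f"C ~= {6 * N * D:.2e} FLOPs")   # ~3.15e23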


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they 10x'ed the number of training tokens also.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.


I’m not easily finding GPT-2 use cases. Any query guidance?


The GPT family of models shines above 100B parameters. Almost nobody uses GPT2 today. It's too weak.

If you want to go with <1B model, you use a BERT which is bidirectional or a T5 that is easier to fine-tune on other tasks.


Something that immediately comes to mind is text summarization. You'll by now be used to better results from GPT-3 or recent models, though.


I could not find any samples (prompt and results). Can anyone provide samples of its quality, even if it is in a narrow field of knowledge or a specific use case? I tried GPT-2, GPT-J 6B and GPT-NeoX 20B (implementation by Fabrice Bellard at textsynth.com/playground.html) but I could not find any production-quality scenario yet, only cherry-picked simple cases.


At this model size, quality is not worth discussing. It is clearly leagues below GPT-3.


Indeed, it is like comparing the speech of a 2-year-old child to that of a college professor.


That's what I really miss to conclude if I should try it myself or not.


> The code itself is plain and readable: train.py is a ~300-line boilerplate training loop and model.py a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.

What's the best source for these weights?


Kaggle or HuggingFace


Thank you Andrej Karpathy for the work on ai and gpt models. It really helped me solve a problem as entrepreneur. I started making first few grand from ai.


May i ask how? Consulting?


To train small gpt-like models, there's also aitextgen: https://github.com/minimaxir/aitextgen


As the creator of aitextgen, I'm mixed on continuing support since there doesn't seem to be as much demand as expected for small GPT models given the success and cost-effectiveness of GPT-3/ChatGPT, unfortunately.

I still have a few ideas there (including another secret approach at better text generation) but it's hard to determine ROI.


I think what you have created still has great demand. It give devs who do not have the budget or need for the gigantic models, something to train and use for their own specific language tasks.

Not everyone is trying to replicate CHATGPT results for certain tasks.


Andrej doesn't need to do this.

He's done it because he evidently loves it, and wants to share his hard-earned knowledge with the rest of the world.

He may be a product of the ivory tower, but he's been in the trenches. He knows firsthand how f-ing hard it is to ship a product.

And here he is, sharing useful personal code with everyone.

This github repo now has collected ~4K stars and its predecessor (minGPT) has collected ~11K stars over the past couple of years. In my experience, the number of people who clone, copy, view or otherwise use code from a repo is one to two orders of magnitude larger than the number of people who star it, so we can safely say that Andrej has helped at least a few hundred thousand -- but likely more than a million -- individuals around the world learn how to build and tinker with GPT models.

Remarkably, as I write this, no one else here has said thank you yet, so let me say it on everyone's behalf:

THANK YOU ANDREJ.

--

EDITS: I changed the wording in response to latexr's comments below.


Edit: the OP has updated their wording to make it clear they meant any kind of viewing or usage. I don’t think any of us would disagree more people use code than star repos. Original comment left below with original quote, since this has gotten a number of replies that would stop making sense with a larger edit.

> Normally, the number of people who clone or copy code from a repo is one to two orders of magnitude larger than the number of people who take the time to star it

Intuitively, I’m having trouble believing that. Starring takes considerably less effort than cloning or copying code. The “time to star” is a literal second, maybe two if you have to scroll up.

From anecdotal observation, repos with more forks and/or external contributors than stars are far from the norm. I’ve seen many mentioning they star repos as a way of bookmarking they seldom go back to, or as an easy way to send kudos to the developer even when they don’t use the project.

In no way is this a comment on the value of Andrej’s work (I’m not familiar with it). I am only interested in the source of your “orders of magnitude” claim, which if proven will update my mental model of the coding community.


I checked my 5-year-old repository of ~300 stars. It gets ~100 unique clones a month. So if the average were half of that, then one order of magnitude would be quite an accurate approximation.

I think the biggest difference with a clone and a star is that a star requires an account and some vested interest in the social network of Github. Anyone who is not interested in the social aspect can just bookmark it.

I guess this differs quite a lot by target demographic. A tool for GPT will probably get a lot more stars than a plugin for some consumer software simply because it is more targeted for the audience of people who have Github accounts.


Thank you for sharing your anecdata. In my experience, the number of clones per month is much higher at first, and then decays gradually until it settles into a stable run-rate, so it's likely that you've had more than 100 x 12 x 5 clones over those five years -- i.e., between one and two orders of magnitude the number of stars, 300.


Another data point: icdiff is 13y old with 4k stars and 200 unique clones in the past month.

(This is a tool that most people install and run without any interaction with GitHub, since it is in package managers)


If I want to use a repository, my first step is to either download a released binary or clone the repository. Forking is much further down the line for me, when I've used the code, encountered a problem, fixed it, and decided to polish the fix up to make a PR. I star something when I either have used it and like it, or when I think I want to use it in the future and want to bookmark it (though the former more often than the latter). I have given out about 50% more stars than I've forked, and have probably cloned an order of magnitude more than I've forked or starred.

Of course not everyone is the same, but I'd be surprised if overall clones were less than an order of magnitude more than forks or stars, and find two or even three orders of magnitude believable depending on the target group of the repo.


Exactly. I would add that the number of clones (not forks) and file/page views is viewable only by the owner of the repo, so we can only guess. (If you own a github repo, you can see the most recent number of clones and page views by clicking on insight -> traffic.)

My estimate of "one to two orders of magnitude" is based on anecdotal evidence. I edited my comment to reflect as much.


I've starred maybe 2-3 repositories over the past 15 years, contributed to probably half a dozen, and used hundreds (if not more) in my applications. To me, using means using that project in an application you develop. Typically I get them from NPM or Nuget, and I contribute when a) the project owner thinks my feature idea is a good idea or b) I run into a bug that I can fix.

Starring is just not that useful to me so I can see why users or contributors would be much higher. I typically star repos if it's an unpopular or old repository that doesn't have NPM or Nuget packages.


How many projects have you starred and how many have you cloned?

Whilst starring is simpler, the incentive is much lower than that of cloning. Especially for projects you just want to use and not contribute to or follow.

In my many years of work, I have starred fewer than 50 repos. I am sure I have cloned more than a thousand.


> How many projects have you starred and how many have you cloned?

I seldom star, but neither you nor I can be extrapolated to the general community. I have thousands of stars in some repos, and I know a significant number of those users don’t even code, let alone clone repos or copy code, they’re interested in the final product. They have GitHub accounts because it’s the way to report bugs or make feature requests.

The OP made a claim. All I’m looking to understand is if it has data backing it up or it’s just a gut feeling, because if it’s the former I’ll have learned something and made a correction of my mental model of the world. Sharing more anecdotes will leave us stuck in the same situation.


Some repos have code that 'phones home' when run. For example, checking for updates or security vulnerabilities.

By checking the usage statistics on that server, you can get an idea how many users there are, and typically it's far higher than the number of stars.


That just tells us that more people use the code than star the repo. I don’t think that’d be a surprise to anyone. The claim was that more people clone and copy code from the repo than the ones who star it, which is a different matter from the number of users.


Thank you for clarifying. I meant use. The number of clones and the number of file/page views are proxies for that. So is the number of installs via pip, conda, and other Python package management systems, in this case. I updated my comment to reflect as much.


I’m all for thanking open source contributors, but your excessively prostrating wording is a bit much for me.


If I overdid it, I'm sorry. I promise it wasn't intentional. My comment was spur-of-the-moment, motivated by sincere gratitude :-)


[flagged]


can't tell if you're making some kind of clever quip or if this is some random spambot just entering a random reverse DNS lookup line.


Him doing this is not like when your average bloke does it.

He appears to be building a business and maintaining his profile. And there is nothing wrong with that - I admire him for pursuing his career in this positive and helpful way.

But random folks do this sort of thing everyday with no such career goals and little recognition, so I'm not sure it is this specific contribution that needs to be called out.


> I'm not sure it is this specific contribution that needs to be called out.

I go the other way. I would like to thank anyone who releases open source code, whether they cause big ripples or not.


What business is he building?


Pedantry time!

A million people building GPT models means that one in 8000 humans on earth has built one. That seems wildly off.

LinkedIn has about 100,000 profiles of data scientists. Assume generously that the actual number is 10x higher. Not correcting for the fact that a data scientist isn't always a machine learning expert, etc. etc., there's just no way every single one of them even KNOWS what a GPT-like model is.


Not only building. Also tinkering, using, testing out of curiosity, etc. There are around ~30 million software developers worldwide today (source: googled it). Around ~7 million of them are registered users of Github (source: googled it). 1M+ seems likely to me.

BTW, I appreciate that you preceded your comment with "Pedantry time!" -- nice gesture :-)


Thoughtful post! Everything so true! I am always amazed by individuals who truly are educators of the world.


I love italics. They're good.


In hindsight, yes, I may have overused them out of excitement. Sorry! :-)


Would it be possible to take all my user manuals and past customer Q&A and train on just on that to produce a customer helper chat bot?


Really cool. Can anyone answer these questions:

Should I use this or minGPT?

It says it needs 8XA100 40GB node. What is that and where do I acquire it?

Could someone else train this and then send me the model? What would be required to run it as opposed to training it?


A100s are Nvidia GPUs. You can rent them from providers like AWS or Lambda Labs. The readme has instructions for downloading the original GPT-2 weights from OpenAI. You can also train a very simple version on a smaller dataset from your laptop, as described in the README.

If you just want to play with a similar but much better model goto https://chat.openai.com


Excuse my ignorance but what can a layman do with this?


Become less lay?


To me this is the important quote:

Unlike OpenWebText this will run in seconds. Finetuning takes very little time, e.g. on a single GPU just a few minutes. Run an example finetuning like:


see also 'Cramming: Training a Language Model on a Single GPU in One Day' https://arxiv.org/abs/2212.14034 and https://github.com/JonasGeiping/cramming


Somewhat off topic: does anyone know how Bing might integrate ChatGPT into search? Is it to understand the prompt and filter results, taking the question and summarizing it to search the index? Is it to summarize all the documents into an index and search that? Or to just be like ChatGPT is now and use it to generate new results from its knowledge base? I'm trying to connect the dots between a generative model like these and how it would influence search in the future. Or is the Lucene-style index search on its way out in a generative world?


> reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training

For comparison, GPT-3 has more than 1000x more params (175B), and training took around 2 months on ~1500 V100 GPUs, which amounts to millions of dollars in cloud compute. Gopher, with 280B params, was trained on 4096 TPU v3 chips; Microsoft's Megatron-Turing NLG 530B was trained on 2240 NVIDIA A100 cards (each card costs ~15k USD). And the most mind-blowing is PaLM from Google, with 540B params, trained on 6144 TPU v4 chips, which costs around 10-30M USD in cloud compute to train.


If I trained this on a 30,000 word document could it give me a summary? Or would there be no need to train it in that case, and I could just tell it "Summarise this: <insert 30,000 word document>"?


30,000 words wouldn't be enough to train this from scratch - you'd ideally train from hundreds of millions of words at least.

30,000 words would be enough to finetune an existing model. If you did that, then the model would output text similar to the finetuning data. For example, if you finetuned it on shakespeare, then you might be able to use the model to make a new play, in shakespeare's style.
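If you want to see roughly what finetuning looks like in code, here's a minimal sketch using the Hugging Face GPT-2 checkpoint (the file name and step count are placeholders; nanoGPT has its own finetuning configs, so treat this only as an illustration of the idea):

    # Minimal finetuning sketch: continue training GPT-2 on your own text so it
    # picks up that style. Not nanoGPT's actual script, just the general idea.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

    ids = tok(open("my_corpus.txt").read(), return_tensors="pt").input_ids[0]

    model.train()
    for step in range(200):                      # a few hundred steps goes far
        i = torch.randint(0, len(ids) - 512, (1,)).item()
        batch = ids[i:i + 512].unsqueeze(0)      # random 512-token window
        loss = model(batch, labels=batch).loss   # causal LM loss
        loss.backward()
        opt.step(); opt.zero_grad()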


If you finetuned it on the text of Shakespeare's plays, how would it link that text to the string "Shakespeare"?


It still has the knowledge from the main training on data from across the whole internet, so would still know the word Shakespeare...

But you're right - the model finetuned on shakespeare would be good at writing a new play in the style of shakespeare, but would be bad at giving a critique of shakespeare's works.


The context window (block size) of this model is 1024 tokens. A token roughly maps to a word (often a bit less), so you can't ask it to summarize anything much over ~1024 words.


Yeah that's the issue I was thinking of, how to get it to summarise large documents. Has anyone any ideas?


People have had some success with the following process:

Divide your 30,000-word document into a hundred 300-word chunks. For each chunk, give as input:

    Please summarize the following text into  50 words:

    [chunk]
Join all the outputs together, and you now have a shorter document. Repeat the process recursively.

You can improve the results by doing the process again, but this time giving some context:

    Please summarize the following text, an extract of a document about [1st attempt at a summary], into  50 words:

    [chunk]
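In code, the whole recursive scheme is just a map-reduce over chunks (here `call_model` is a hypothetical stand-in for whatever API or local model you use):

    # Recursive summarization sketch. `call_model` is a placeholder for an
    # actual LM call (OpenAI API, a local model, ...).
    def llm_summarize(text, topic=""):
        about = f", an extract of a document about {topic}" if topic else ""
        return call_model(f"Please summarize the following text{about} "
                          f"into 50 words:\n\n{text}")

    def summarize(words, chunk=300, target=500, topic=""):
        if len(words) <= target:
            return " ".join(words)
        pieces = [words[i:i + chunk] for i in range(0, len(words), chunk)]
        shorter = " ".join(llm_summarize(" ".join(p), topic) for p in pieces)
        return summarize(shorter.split(), chunk, target, topic)

    # summary = summarize(open("doc.txt").read().split())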


You can also use "Please suggest a section title for the following text".

Then that title can be used in the 2nd round, for example using a query of the form "The following is an extract from the Introduction section of a document about The benefits and disadvantages of nuclear power in sweden:"


I imagine you could do even better by finetuning the neural net on the document before asking for the recursive summary. Then it has all the information to work with, albeit in a compressed form.


So, are there any of these projects that aren't vendor-locked to NVIDIA and are able to train large models with limited GPU RAM?

I don't mind letting my machine churn for 2-3 weeks. But I'm not looking to buy another $1000 GPU just because CUDA is the only compute library researchers understand.


Another cheap option is runpod

According to https://www.runpod.io/gpu-instance/pricing renting out 4x A100 40GB costs $3.56 per hour.

So that's $3.56 * 2 * 38hours = $270.56 then.


Wow, this is great. I can't wait for the video lecture, transformers are an aspect of modern machine learning that I'm not completely clear on. Andrej's lectures are brilliant - super detailed, and really answer the detailed questions I always have. Great stuff!


What's the applicability? Can you give me some examples of what can be used this for?


I imagine this might be interesting for domain-specific GPT models. Say training it on a mountain of technical documentation, or on every fanfiction published on the internet, or a sentiment analysis dataset. Of course fine-tuning GPT3 would give better results, but nanoGPT might allow you to make a much smaller model that's still good enough, to enable cheaper inference.

Also the opportunity to play around with all the parameters fairly cheaply to find improvements. The todo section of the readme gives a small taste of that. Making bigger models works for OpenAI, but maybe the rest of us manage to make small models just perform better instead.


Is there any trained model for text generation that you can run locally yet?


GPT2 can be run locally (on a somewhat beefy consumer GPU)


Can you add some info on what consumer GPU would be needed for this? Would a 3080 be able to handle this?


Yes, assuming you get the 12GB version of the 3080. A 2080 Ti is another option. You can also reduce precision or use one of the smaller GPT-2 variants to run on smaller cards.
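For example, with the Hugging Face weights, something like this should fit comfortably in 10-12GB of VRAM (model name as published on the Hub; the generation parameters are just reasonable defaults):

    # GPT-2 large (774M params) in half precision: ~1.5GB of weights, so it
    # fits easily on a 3080 / 2080 Ti class card.
    import torch
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2-large",
                         device=0, torch_dtype=torch.float16)
    out = generator("To be, or not to be:", max_new_tokens=50, do_sample=True)
    print(out[0]["generated_text"])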


Let me slightly rephrase the question: what is the best model that one can run on high-end consumer grade hardware? Let's say RTX 3090.


The original GPT-2 small (the 124M one) can run on a CPU, just slowly and not scalably.


Plenty. Huggingface alone has a ton


There’s LAION working on open source[1] version of chatGPT

[1] https://github.com/LAION-AI/Open-Assistant


Though their roadmap doc says they're looking into finetuning existing GPT-J/T5 models for this task. So you'll probably want a 3090 (24GB VRAM) and at least 16GB of CPU RAM to run inference if/when the project is complete.


This should be way higher up.


Is there a list of datasets like https://skylion007.github.io/OpenWebTextCorpus/ ?


For an AI noob like me: can you use spot instances to train models? They are about 1/3rd the price on AWS compared to on demand ones, so it'd make a significant difference.


Yes you should use them. They can be taken away from you with 2 min notice. (It doesn't happen a lot in practice though. I have been running a different instance for over a month. AWS doesn't force you if they don't have to)

If you are going to run a long training job, make sure you are creating checkpoints. Use persistent storage (EBS) and check the option so the volume doesn't get deleted if the instance is stopped; that way your checkpoints remain on disk and you can easily restart.
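The checkpointing pattern is simple enough; a sketch in plain PyTorch (paths, model setup and interval are placeholders):

    # Save/resume pattern for interruptible (spot) training. `build_model_and_opt`
    # and `train_step` are placeholders for your actual setup.
    import os, torch

    ckpt = "/mnt/ebs/checkpoint.pt"                 # lives on persistent EBS
    model, opt = build_model_and_opt()
    start = 0
    if os.path.exists(ckpt):                        # resume after interruption
        state = torch.load(ckpt)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start = state["step"] + 1

    for step in range(start, 100_000):
        train_step(model, opt)
        if step % 1000 == 0:
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, ckpt)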

I haven't tried it but prices here are much cheaper. https://vast.ai/#pricing


Yes you can. In Oregon you could eventually get this instance at $9. I say eventually, because of course Spot allocation is not guaranteed. (And neither is On Demand... but that is a story for another day.)

https://aws.amazon.com/ec2/spot/pricing/


Why not? This is the exact use case of what Spot instances seem to be for. (Not hosting a service, but just calculating something for yourself.)


Thank you so much for this! It is so impressive and I'm sure it took a lot of hard work!

Is it able to re-write articles? And where could I find a guide on how to train it?



What would I google to figure out how to productionize the output of this?

This repo trains a model--how would I prompt it and print the generated output?


I would love to see a minInstructGPT or a minRetro, or maybe something that combines instruction and retrieval into a readable codebase!


How critical are training warmups and is an iteration here the same as an epoch?


Curious to know how close that training loop is to actual openai code.


Karpathy is such a boss!


So is MSFT now extra grossly overpaying for ChatGPT?


638c7215


As someone who's been in software for almost 25 years now, I read through this in amazement of how much new stuff still keeps coming in. This industry never stops and that makes it such a fascinating (but arguably harsh) world to be in.

Looking at this feels like seeing the source code of a 64k demo, learning about Mode 13h and trying to replicate it in Turbo Pascal.

And, much like the old days of graphics programming, there's a good chance all of this knowledge will be mostly irrelevant soon, as the abstraction layers tend to come quicker and quicker and take care of the hard foundational work underneath. Then it'll be many of us here discussing whether or not it was good to have been with it from the start, to really get it, or whether playing with the highly-abstracted components is all that's needed to succeed with it.

Either way, super cool to see the pace here and I loved the "I only have a macbook" section.


It will be funny to look back from the future and think, wow, how did we get anything done with only 40GB RAM


14 hours ago: https://news.ycombinator.com/item?id=34331919

Curious why HN didn't merge the submission as it usually does. Is there a "no, submit this again" option?


HN lets reposts through if the story hasn't had significant attention yet. This is to give good submissions multiple chances at getting attention. Past explanations:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

I know it sucks when your submission was earlier and gets overlooked! We should eventually have some sort of karma-sharing to take care of this. In the meantime, it at least evens out in the long run, since the reason is randomness.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


The other post probably didn’t make it to the front page


I think the link should be: https://github.com/karpathy/nanoGPT



