While not quite your definition of toy, I have a small deep learning rig I built myself with two 4090s, and that has been enough to train several different ~200M-parameter LLMs, starting with a hand-rolled tokenizer and just vanilla PyTorch to experiment with different architectures. While it's not going to win any benchmarks or be usable for real problems (you should just be fine-tuning Llama), it has been super valuable for me to really understand exactly how these things work.
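To give a sense of scale, the "vanilla PyTorch" setup is not much more than a decoder-only transformer plus a training loop. A rough sketch of the kind of model (not my actual code; the sizes below are just illustrative numbers that happen to land around 200M parameters):

    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size=32000, d_model=1024, n_layers=12,
                     n_heads=16, max_len=1024):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            # An encoder layer plus a causal mask behaves like a GPT-style decoder block.
            block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                               batch_first=True, norm_first=True)
            self.blocks = nn.TransformerEncoder(block, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        def forward(self, idx):
            b, t = idx.shape
            pos = torch.arange(t, device=idx.device)
            x = self.tok_emb(idx) + self.pos_emb(pos)
            # Causal mask: each position may only attend to earlier positions.
            mask = torch.triu(torch.full((t, t), float("-inf"), device=idx.device), diagonal=1)
            return self.lm_head(self.blocks(x, mask=mask))

    model = TinyLM()
    print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")

Training is the usual next-token cross-entropy loop on top of that; the interesting part was swapping pieces of the block out and seeing what actually changed.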
I use devpod.sh and a PyTorch dev container I can spin up locally, with the intention of also spinning it up in the cloud to scale experiments (though I haven't done much of that yet). Still, I can recommend devpods for a reproducible environment I don't feel worried about trashing!
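For anyone who hasn't used it: devpod reads a standard devcontainer.json from the repo, so the whole environment is a file you check in. A minimal one for a GPU PyTorch box looks something like this (the image tag and GPU args are illustrative, check what your setup actually needs):

    {
      "name": "pytorch-playground",
      "image": "pytorch/pytorch:latest",
      "runArgs": ["--gpus", "all"]
    }

After that, a single devpod up gives you a clean environment, locally or against a cloud provider.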
If people are interested I can throw the git repo up now, but I have been planning on finding some time to clean it up and write a really short digest of what I learned.
Above anything I can write though, I highly recommend Andrej Karpathy's YouTube channel - https://www.youtube.com/@AndrejKarpathy
You can follow along in a Google Colab, so all you really need is a web browser. My project started as following along there and then grew when I wanted to train it to mimic my friends and me on some data I had of us chatting in Slack, which meant some architecture improvements, figuring out how to pre-train on a large corpus, etc.
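Most of the Slack part was data wrangling rather than modeling. A Slack export is basically folders of daily JSON files per channel, so the prep step is flattening those into "speaker: message" text the model can train on. Roughly like this (the field names are the standard export ones, but this is a sketch, not the exact script I used):

    import json
    from pathlib import Path

    def slack_export_to_text(export_dir: str, out_path: str) -> None:
        """Flatten a Slack export (per-channel folders of daily JSON files)
        into 'speaker: message' lines for language model training."""
        lines = []
        for day_file in sorted(Path(export_dir).rglob("*.json")):
            for msg in json.loads(day_file.read_text()):
                # Skip join/leave notices, bots, etc.; keep plain user messages.
                if msg.get("subtype") or "text" not in msg:
                    continue
                lines.append(f"{msg.get('user', 'unknown')}: {msg['text']}")
        Path(out_path).write_text("\n".join(lines))

    # slack_export_to_text("slack_export/", "chat_corpus.txt")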
I want to echo the recommendation of Andrej Karpathy's YouTube channel.
Before I started watching his videos, I thought that understanding how gradient descent actually worked and what autograd actually does under the hood was unimportant - after all, I can get a working network by just slapping together some layers and letting my ML framework of choice handle the rest (and, to the credit of modern frameworks, you can get impressively far with that assumption). Andrej's Micrograd video was what changed my mind - understanding the basics of how gradients are calculated and how they flow has made everything make so much more sense.
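If you want the one-paragraph version of what the micrograd video builds: a gradient engine is just a scalar value that remembers which values produced it, so that calling backward can walk the graph in reverse and apply the chain rule node by node. A stripped-down sketch of the idea (nowhere near the real micrograd, just the flavor):

    class Value:
        """A scalar that records its parents so gradients can flow backward."""
        def __init__(self, data, parents=()):
            self.data = data
            self.grad = 0.0
            self._parents = parents
            self._grad_fn = None  # pushes this node's gradient onto its parents

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def grad_fn():
                self.grad += out.grad
                other.grad += out.grad
            out._grad_fn = grad_fn
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def grad_fn():
                # d(out)/d(self) = other.data, d(out)/d(other) = self.data
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._grad_fn = grad_fn
            return out

        def backward(self):
            # Topologically sort the graph, then apply each local chain rule in reverse.
            order, seen = [], set()
            def build(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        build(p)
                    order.append(v)
            build(self)
            self.grad = 1.0
            for v in reversed(order):
                if v._grad_fn:
                    v._grad_fn()

    # y = a*b + a  =>  dy/da = b + 1 = 4, dy/db = a = 2
    a, b = Value(2.0), Value(3.0)
    y = a * b + a
    y.backward()
    print(a.grad, b.grad)  # 4.0 2.0

That's essentially the whole trick; PyTorch's autograd is this idea scaled up to tensors.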
If the classes at my university had been as good as what that man publishes on his YouTube channel for free, I would've actually finished my degree instead of dropping out.