> It compiles a custom kernel for every operation, allowing extreme shape specialization.
This doesn't matter. Just look at the performance achieved by CuDNN kernels (which back PyTorch), they're dynamically shaped and hit near peak. For dense linear algebra at the size of modern neural networks, optimizing for the loop bound condition won't help much.
> All tensors are lazy, so it can aggressively fuse operations.
This matters. PyTorch teams are trying to implement that now (they have LazyTensor, AITemplate, TorchDynamo), but I'm not sure of the status (it's been tried repeatedly).
> The backend is 10x+ simpler, meaning optimizing one kernel makes everything fast.
The first part of that sentence matters, the second part doesn't. Kernels are already fast and their reuse outside of being fused into each other (which you need a full linear algebra compiler to do) isn't very high. If you make sum fast, you have not made matrix multiplication fast even though MM has a sum in it. It just isn't that easy to compose operations and still hit 80+% of hardware efficiency.
But it is easier to iterate fast and build a seamless lazy compiler if your backend is simple. You can pattern match more easily and ensure you handle edge cases without insanely complicated things like alias analysis (which PyTorch has to do).
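To make that concrete, here's a toy sketch (entirely invented for illustration, not tinygrad's actual code) of why a tiny op vocabulary makes fusion a simple pattern match: with only a few op categories, "can these two nodes become one kernel?" is a set-membership test rather than alias analysis.

# Toy lazy graph; op names and structure are made up for illustration only.
ELEMENTWISE = {"RELU", "LOG", "ADD", "MUL"}

class LazyOp:
    def __init__(self, op, srcs=()):
        self.op, self.srcs = op, list(srcs)

def fuse(node):
    """Merge chains of elementwise ops into a single kernel description."""
    node.srcs = [fuse(s) for s in node.srcs]
    if node.op not in ELEMENTWISE:
        return node
    ops, leaves = [], []
    for s in node.srcs:
        if s.op in ELEMENTWISE:
            ops += getattr(s, "ops", [s.op])   # inline the child's op sequence
            leaves += s.srcs                   # and inherit its inputs
        else:
            leaves.append(s)
    fused = LazyOp(node.op, leaves)
    fused.ops = ops + [node.op]
    return fused

x, y = LazyOp("LOAD"), LazyOp("LOAD")
g = fuse(LazyOp("RELU", [LazyOp("ADD", [x, y])]))
print(g.ops)   # ['ADD', 'RELU'] -- one kernel instead of two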
While this is true for most common GEMM-looking ops, if you stray off the beaten path things get slow (odd channel sizes, batch sizes, etc...). Right now in PyTorch, GroupNorm is 2x slower than BatchNorm. There's no fundamental reason, just that the kernels loop over axes in a less-than-ideal order. Dynamic recompilation lets you change the loop order too, not just deal with boundary conditions.
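That gap is easy to measure if anyone wants to check it on their own hardware. A rough timing sketch (assumes a CUDA GPU; the tensor and layer shapes here are just picked for illustration):

import time
import torch
import torch.nn as nn

x = torch.randn(64, 128, 56, 56, device="cuda")
bn = nn.BatchNorm2d(128).cuda()
gn = nn.GroupNorm(32, 128).cuda()

def bench(mod, iters=100):
    for _ in range(10):          # warmup
        mod(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        mod(x)
    torch.cuda.synchronize()
    return (time.time() - t0) / iters * 1e3  # ms per forward pass

print(f"BatchNorm2d: {bench(bn):.3f} ms, GroupNorm: {bench(gn):.3f} ms")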
Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.
> change the loop order too
Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.
Agreed. For anything at all common, most of the gains will be from fusion, the rest is just free. PyTorch also uses tons of GPU memory after only initializing, I wonder if it's copying all the kernels in?
Jax preallocates 90% of available GPU memory when the first operation is run to minimize allocation overhead. Can PyTorch grab that VRAM for a similar reason?
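For reference, that JAX behaviour is configurable through the documented XLA client environment variables, e.g.:

import os
# must be set before JAX touches the GPU
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate on demand instead of grabbing 90%
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"   # or keep preallocation but cap the fraction
import jax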
Yes, PyTorch uses what they call a caching memory allocator[0]; basically it seems like they allocate a very large chunk of GPU memory and implement a heap on top of it. If needed, they expose some knobs and functions that let you control it and observe memory usage.
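A few of those knobs/observability hooks, for anyone curious (these are real torch.cuda APIs; assumes a CUDA build):

import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes currently backing live tensors
print(torch.cuda.memory_reserved())   # bytes the caching allocator is holding from the driver
print(torch.cuda.memory_summary())    # detailed per-pool breakdown
torch.cuda.empty_cache()              # hand unused cached blocks back to the driver
# The allocator can also be tuned via the PYTORCH_CUDA_ALLOC_CONF env var,
# e.g. PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128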
Whole-net performance at comma: when we switch from BatchNorm to GroupNorm it adds 70ms to the training step time, and it's -70ms for no norm. We also wrote a custom AllNorm that's like 10% slower than BatchNorm (and I put several hours into trying to optimize it). Obviously not indicative of everyone's experience, but my point is that BatchNorm is hyperoptimized and the others, which are pretty much the same thing, aren't.
Thanks, that's certainly helpful anecdotal evidence. Yeah, it seems like there should be an "AllNorm" implementation that covers all cases and is just fast. I was wondering because I'm currently looking at math_group_norm, which was ported from PyTorch/XLA, and it results in a really weird decomposition that I'm astonished works at all. https://github.com/pytorch/pytorch/blob/master/aten/src/ATen...
I'm also wondering if the handcoded backward passes are actually "numerically correct", because e.g. epsilon doesn't appear in them at all. Someone worked out the gradients manually for BN here: https://web.archive.org/web/20180826123459/http://cthorey.gi...
You can clearly see epsilon appearing in the output. And of course there's the whole training vs. eval mode thing with BN which GN doesn't have.
Or, better, identifying that the machine has a primitive that is better than doing each op individually. For example, a multiply-accumulate instruction vs a multiply and separate accumulate. The source code still says "a*b+c", the compiler is just expected to infer the MAC instruction.
Yep! This is an assumed optimization when it comes to modern linear algebra compilers. New primitives go way beyond FMAs: full matrix multiplies on NVIDIA/Intel and outer-product accumulates on Apple silicon. It's also expected that these are used nearly optimally (or you've got a bug).
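As a tiny illustration (JAX used here only because it makes the traced program visible): the frontend keeps a*b+c as a separate mul and add, and mapping that onto FMA / MMA hardware is left entirely to the backend codegen.

import jax

def mac(a, b, c):
    return a * b + c

# The traced program is still mul-then-add; the backend decides whether it becomes an FMA.
print(jax.make_jaxpr(mac)(1.0, 2.0, 3.0))
# prints roughly: { lambda ; a b c. let d = mul a b; e = add d c in (e,) }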
The only thing I'd recommend is exposing "eval()" or something to let users tell you when they want you to evaluate things. It'll save a ton of time when it comes to hot-fixing performance and memory use issues. It's really hard to determine when to evaluate, and although it's a fun problem to figure out, it's nice to have an escape hatch for users to just tell you. (Flashlight has explored this and written about it here: https://fl.readthedocs.io/en/latest/debugging.html?highlight...)
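Something like this is what I mean, as a toy sketch (hypothetical names, numpy standing in for the backend; tinygrad today forces evaluation via .numpy(), as shown further down the thread):

import numpy as np

class Lazy:
    def __init__(self, thunk):
        self._thunk, self._val = thunk, None

    def __add__(self, other):
        return Lazy(lambda: self.eval() + (other.eval() if isinstance(other, Lazy) else other))

    def eval(self):
        # user-visible escape hatch: materialize here, cache the result
        if self._val is None:
            self._val = self._thunk()
        return self._val

x = Lazy(lambda: np.zeros(4))
y = (x + 1) + 2
y.eval()          # the user decides exactly when the pending graph gets run
print(y._val)     # [3. 3. 3. 3.]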
> It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
>
> - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
> - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
> - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
> - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
>
> But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is a ProcessingOps.CONV and explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.
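To be fair to the README, the underlying claim does hold: a matmul is expressible with just the movement/elementwise/reduce primitives, even if the shipped code takes the explicit-CONV shortcut. A numpy sketch of the decomposition (mine, not tinygrad's code):

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# RESHAPE/EXPAND A to (3, 4, 1) and B to (1, 4, 5), MUL elementwise, then SUM over k
out = (A[:, :, None] * B[None, :, :]).sum(axis=1)

assert np.allclose(out, A @ B)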
That's cool, am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low level primitives? But you would still need a piece of sorcery that can produce high performance OpenCL code, right? And that code could be different for every device, so you would need some trial and error, benchmark-based compilation at the very least. Or would OpenCL code be generated by hand for each device?
Working on parameterizing a search space that includes more than the local group size. The end dream is some ML guided search to optimize the kernels :)
OK, generally I think you're doing exactly what I believe ML is lacking right now. Another huge opportunity is, instead of taking the average neural network and designing accelerators for it, designing hardware-friendly networks that run well on a sane accelerator designed to work with only these specialised networks (one that doesn't need 80% of the chip area for on-chip memory, for example). These might end up being completely different networks to what researchers use today. I work in this area and I think it's also possible to use the loss function to optimise the network for a specific HW.
I've done some work in the past on NN representations and you actually can represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:
There is a MAX but not a MIN? Is that because max(x,y) = -min(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?
We could have NEG instead of SUB, but with the constant folding it's a wash. DIV is already an HLOP with reciprocal (used to use POW, but that was slower). And what would you implement RELU in terms of?
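For anyone following along, the identities being discussed, spelled out in numpy (just for illustration):

import numpy as np

x = np.array([-2.0, -0.5, 1.0, 3.0])
y = np.array([ 1.0, -1.0, 0.5, 2.0])

# MIN recovered from MAX plus negation, so a MIN primitive isn't strictly needed
assert np.allclose(np.minimum(x, y), -np.maximum(-x, -y))

# RELU written via MAX against a broadcast zero -- possible, but it's such a common
# elementwise op that keeping it as its own unary primitive is cheap
relu = np.maximum(x, 0.0)
print(relu)   # [0. 0. 1. 3.]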
Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?
Einsum is an expressive way of doing elementwise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, then a summation is applied over them. The package einops provides reductions such as summation, averaging, and so on.
For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.
I can easily recommend the einops package; using einsum simplifies things significantly.
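If anyone wants to convince themselves of the equivalence above, it checks out numerically (shapes picked arbitrarily):

import numpy as np

a = np.random.randn(2, 3, 4, 5)   # b k n p
b = np.random.randn(3)            # k

broadcast = np.einsum("bknp,k->bknp", a, b)   # elementwise scale per k
assert np.allclose(broadcast, a * b[None, :, None, None])

reduced = np.einsum("bknp,k->bnp", a, b)      # same, then sum over k
assert np.allclose(reduced, a.transpose(0, 2, 3, 1) @ b)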
I must say they gained instant credibility with the minimalistic website given how fast it loaded.
Code looks simple and easy to follow, and I love how the comments are constantly mentioning hardware characteristics, making maxing the hardware the goal. It seems that it’s trying to achieve this by jitting optimal code for the operations at hand rather than hand-optimizing kernels, and betting that the small number of operations will make tuning the codegen tractable.
I haven’t kept up much with what’s happening in ML, but at least in the realm of columnar database engines, interpreting a series of hand-optimized kernels seems to be the dominant approach over compiling a vectorized query plan. Are compilers good enough at optimizing ML operations that specializing on input shape makes a difference over hand-tuned kernels?
I like this too and I don't understand the downvotes. It says a lot about the philosophy of the project. Minimalist, bold, brutalist, no-frills first principles thinking. For better and worse.
There's an interesting roadmap in the "cherry" folder of the git repo[0]. It begins by bringing up a design on FPGA and ends with selling the company for $1B+ by building accelerator cards to compete with NVIDIA:
Cherry Three (5nm tapeout)
=====
* Support DMA over PCI-E 4.0. 32 GB/s
* 16 cores
* 8M elements in on board RAM of each core (288 MB SRAM on chip)
* Shared ~16GB GDDR6 between cores. Something like 512 GB/s
* 16x 32x32x32 matmul = 32768 mults
* 1 PFLOP @ 1 ghz (finally, a petaflop chip)
* Target 300W, power savings from process shrink
* This card should be on par with a DGX A100 and sell for $2000
* At this point, we have won.
* The core Verilog is open source, all the ASIC speed tricks are not.
* Cherry will dominate the market for years to come, and will be in every cloud.
* Sell the company for $1B+ to anyone but NVIDIA
As it was recently discussed at length here on HN [0] (401 comments), George Hotz (the lead of tinygrad) is taking time off his self-driving startup comma.ai [1]. Curious if this would help or hurt tinygrad progress.
How is it compared to JAX? After TensorFlow and PyTorch, JAX seems very simple, basically an accelerated numpy with just a few additional useful features like automatic differentiation, vectorization and jit-compilation. In terms of API I don't see how you can go any simpler.
I've just tried making a loop in a jit-compiled function and it just worked:
>>> import jax
>>> def a(y):
... x = 0
... for i in range(5):
... x += y
... return x
...
>>> a(5)
25
>>> a_jit = jax.jit(a)
>>> a_jit(5)
DeviceArray(25, dtype=int32, weak_type=True)
It definitely works, JAX only sees the unrolled loop:
x = 0
x += y
x += y
x += y
x += y
x += y
return x
The reason you might need `jax.lax.fori_loop` or some such is if you have a long loop with a complex body. Replicating a complex body many times means you end up with a huge computation graph and slow compilation.
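For completeness, the rolled-up version looks like this (jax.lax.fori_loop is the real API; the loop stays a single while-loop in the compiled program instead of five copies of the body):

import jax

@jax.jit
def a_rolled(y):
    # the body takes (loop index, carry) and returns the new carry
    return jax.lax.fori_loop(0, 5, lambda i, x: x + y, 0)

print(a_rolled(5))   # 25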
Fused into one operation since the Tensor isn't resolved until I call .numpy()
kafka@tubby:/tmp$ cat fuse.py
from tinygrad.tensor import Tensor
x = Tensor.zeros(1)
for i in range(5):
  x += i
print(x.numpy())
kafka@tubby:/tmp$ OPT=2 GPU=1 DEBUG=2 python3 fuse.py
using [<pyopencl.Device 'Apple M1 Max' on 'Apple' at 0x1027f00>]
**CL** 0 elementwise_0 args 1 kernels [1, 1, 1] None OPs 0.0M/ 0.00G mem 0.00 GB tm 0.15us/ 0.00ms ( 0.03 GFLOPS)
**CL** copy OUT (1,)
[10.]
He mentioned in a recent stream that he dislikes the complexity of the XLA instruction set used by JAX. So it's less the user-facing API, and more the inner workings of the library.
There are libraries like TensorFlow and PyTorch that allow the user to define their neural net in simple, readable Python code, and they internally "compile" and optimize your neural net to run on GPUs and such.
Tinygrad is like a very, very lean PyTorch with a different philosophy -- it intends to keep the codebase and API surface very very small and focus most of its energy on optimizing the way the output neural net runs on physical hardware.
The author, George Hotz, has observed in the last few years that neural net performance is hindered by lack of optimization here, particularly around memory accesses.
It's funny that geohot/tinygrad chooses to not meet the PEP8 standards [0] just to stay on brand (<1000 lines). Black [1] or any other python autoformatter would probably 2x the lines of code.
I understand that the Python code is mostly driving faster low-level code, but I wonder how much time is effectively wasted by not using a lower-level language.
From my experience with game engines, it often turns out to be a bad idea (for performance and maintainability) to mix C/C++ and Lua or C#.
I would argue that there are performance *benefits* for a developer in running Python code: due to how programs are run in Python (Jupyter notebooks), you can basically change the program on the fly instead of recompiling and restarting it, as you would with compiled languages. And yeah, the CPU does very, very little in modern DL workloads, and it is commonplace for CPU-side Python code to be jitted and vectorized, so the performance difference isn't as large as you would think.
Another benefit to interactivity is when exploring/using bad code. In academia, you'll often be importing the worst and least-well-documented code you've ever seen.
Being able to interactively experiment with someone's 500-line, 0-documentation function is often a better path to understanding than directly reading the code.
If you care exclusively about numerical stability and performance, why _this_ set of operators (e.g., there're plenty of good reasons to include expm1 or log1p, and certainly trigonometric functions)? It'd be an interesting research problem to measure and identify the minimal subset of operators (and I suspect it'd look different from what you'd expect from an FPU).
If you care exclusively about minimalism, why not limit yourself to the Meijer-G function (or some other general-purpose alternative)?
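The expm1/log1p part at least is easy to make concrete: for tiny x, computing 1 + x rounds away most of the information before log ever sees it.

import numpy as np

x = 1e-12
print(np.log(1 + x))   # several digits already lost to the 1 + x rounding
print(np.log1p(x))     # ~1e-12, accurate to full double precision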
It was OK as an educational tool, but now they don't count the GPU implementation in the 1000 lines, so it is not that small. Considering the code style, it is closer to 20k+ lines once formatted and with the GPU code included.
It also doesn't support bfloat16 so is doomed to be 2x slower.
The actual code of tinygrad is less than 5k lines. There are also 1600 lines of tests and around 2k lines of example models. And I didn't count the unfinished support for geohot's own neural network accelerator (the Verilog for that accelerator sits in the repo too), which is abandoned.
I believe neural networks are overhyped sometimes.
They are not always the best tool for the job. There are lots of other ML techniques such as SVM, naive Bayes, k-nearest neighbors, decision trees, logistic regression, random forests, etc. that nobody is using because they lack the hype factor.
If something lacks keywords like neural network, deep learning, or reinforcement learning, then it is deemed not cool.
What happens when your model exhibits a discriminating bias? How do you find out what is going wrong? Knowing what the model pays attention to can be pretty helpful.
Didn't all recommendation engines move to two-tower-like models? I remember that it "solved" the freshness problem (i.e. when adding a new item to your catalog, how do you recommend it to users if there are no ratings/interactions). Of course, as long as you have a good model that creates item embeddings.
Regarding time series, hasn't everyone moved to attention-based models?
Not challenging your answer, just curious. I work mostly with Graph NNs and quite a bit out of touch with the rest of the field.
The black box nature of a neural net is a problem. For model based design, a bit more accuracy out of a black box doesn't really help when you need, for example, state space matrices in a control design.
I'm no expert but can you show how those techniques can be used to solve the same problems NNs can? Like SOTA image recognition, chess / go, STT, TTS etc?
NN based sentiment analysis is certainly a lot better than non-NN based techniques.
Classification depends on the problem (and mostly the data size). Boosting is certainly competitive on tabular data and widely used everywhere I've worked.
No one talks about it (except on Kaggle) because it's pretty much at a local maximum. All the improvement comes from manual feature engineering.
But modern techniques using NNs on tabular data are competitive with boosting and do away with a lot of the feature engineering. That's a really interesting development.
> NN based sentiment analysis is certainly a lot better than non-NN based techniques.
I wouldn't say this. Sentiment analysis trained on the standard datasets is one place where performance is barely better than old-school linear classifiers. They remained brittle and easy to trick until recent flexible systems based on question answering, zero-shot entailment, or lotsa instruction finetuning (improving in that order). I strongly advise against using something fine-tuned solely on sentiment datasets. It'd be a total waste.
> Sentiment analysis trained on the standard datasets is one place where performance is barely better than old-school linear classifiers
Well yeah. But why would you do that?
Do what eveyrone does: Train on large scale a language corpus (or use a pre-trained model) then finetune for sentiment analysis.
> I strongly advise against using something fine-tuned solely on sentiment datasets
Did you mean trained on sentiment datasets? I agree with that.
Otherwise, well, [1] is a decent overview of the field. I think Document Vectors using Cosine Similarity [2] at 17 is the highest rated that isn't an NN trained on a large corpus and fine-tuned on the sentiment task. Even that uses document vectors that are trained on a large language corpus.
No, I meant finetuned. I also meant finetuned when I said trained. Experience with applying finetuned sentiment classifiers to real-world data found the gain vs. the cost of running them not to be worth it. They remain nearly as brittle as cheaper classifiers and have a habit of glomming too much onto certain adjectives. They are also prone to overfitting to the finetuning data's domain. Transformers trained not specifically on sentiment but on general domains like question answering or entailment are just leagues better for sentiment tasks.
But we aren't. Outside of using AEs for embeddings and then feeding them through a boosted tree model I don't know anyone using NNs for tabular data. We all use XGBoost or Catboost, etc.
Don't think you really know the field. On my team we almost exclusively use XGBoost or other boosted tree methods because it is typically the best model for tabular data. If we were working on CV or NLP that would be a different story and for that Neural Nets are by far the best models.
I come from probably the most atheistic country in the world (CZ)
Yet I had made this bashing comment about the Bible.
IMHO anyone can believe in whatever they want. Christianity, Islam, anything.
I (and I would say every friend of mine) don't care about what you believe in, but if you publicly preach some religion, prepare to be made fun of, or take a stand and try to defend your religion with arguments. But no blind faith here.
(Personally, if I like some religion it's Shinto.)
I think it's orthogonal. There are tons of smart people who believe in God. (Knuth has already been mentioned.)
If God wanted, He could make himself apparent to everyone. Clearly that isn't the case; there is room to doubt or to believe no matter how smart or accomplished you are.
The Bible is a very specific version of a creator that the world already gives you enough evidence to disprove ten times over. You could argue that the possibility of the Bible being divine is not zero, but it's less than any epsilon people care about. The reasons it endures are cultural, which is clear from its correlations with geography and demographics.
If anybody is dealing with procrastination, watch George Hotz live-streaming 10h straight working on this library [1][2]. Does he take some supplements to do this? There is even a 19.5h stream [3].
Actually, I have a local OBS setup to record myself; just instead of streaming, I do recordings for my own inspection. The important part is to do the inspection afterwards. It works wonders.
It isn't a 19.5-hour stream; I went and randomly clicked on a couple of timestamps and stumbled upon [1]. So it's two almost-10-hour-long streams put together because they are thematically similar.
>Does he take some supplements to do this? There is even a 19.5h stream [3]
It's probably the fact he has an audience. I can't speak for him, but that'd sure as hell light a fire under my ass—or at least significantly reduce procrastination.
Also, I find in periods where I've worked ~17 hours straight that the tiredness calms my brain to the point I'm normal and makes focus easy, albeit difficult in a different way due to fatigue. There's a weird drone zone there that's nice. Not something to make a habit of, though.
If you enjoy what you are doing it's pretty easy to work on something that long, I've had gaming sessions last as long back in the day with friends and some of those games are as demanding in terms of focus as programming.
The external motivation of having an audience would also help.
I think it's a combination of: 1) he's really passionate about what he's doing, 2) he sees the problem as a real challenge, 3) he doesn't have a corporate structure on his back giving him deadlines and pressure.
Is 10 hours really _that_ strange? You are (hopefully) focusing 8 hours "straight" during work _every day_.
If you watch Hotz's streams he takes small breaks to talk with chat and to meme around (just like everyone else during their work days) and he eats lunch and whatever (again just like everyone else).
What I'm trying to say is that Hotz isn't a superman on Adderall; he is just working on stuff he is excited about.
> You are (hopefully) focusing 8 hours "straight" during work _every day_.
I refuse to believe that 8 hours straight focus every day is common.
I have about 4 - 6 hours of really focused, deep work available. 6 hours if I am really interested in the project and 4 hours on normal days. The rest is low-focus work like writing email, planning ahead, attending workshops, reading up on updates for relevant libraries, reading documentation, etc.
When I was around 15 I used to do 10 hours of x86 assembly programming, and then several hours every day after school, for a month or so in a row. My parents would have to force me away from the computer.
I attribute it to a younger brain, NO internet and NO fun distractions. At 42 I just don't see how I did it, and I know I could never be that focused. Just sitting still for 4 hours makes me feel queasy now, and I need to use my physical body in some way.
Yea, I miss youth. I'm in my 30s and all-nighters are not the same anymore :(. When I was young I'd do 2-3 in a week and with a four-hour nap I would recover.
Now after those I lay down and I can't get up for a couple hours with all this aching in my limbs lol.
I'd say 99.9% of the western world's workforce doesn't focus for 8 hours straight in the work day - it's not really even possible to in a great many jobs where there's context switching (meetings etc.)
The bigger problem is recording taking too much CPU. That's why I don't record the full work day, just chunks, whenever I feel I'm procrastinating. YouTube is an option here; I've tested it, however the CPU problem doesn't go away.
My plan is to make this obs-ndi plugin work on Ubuntu, so I will be able to record on Ubuntu and take the load off the Mac, which is my primary laptop.
PS. I forgot to read the obs-ndi instructions properly; it works OK, so now I can delegate recording to the second laptop.
It wouldn't be huge. I once recorded a week of me using my PC (so around 80 hours in total), and it was sub-100 gigs. It was 1080p with decent quality; don't remember the FPS though.
I just quickly loop through them and categorize chunks of time, just to see how I work. I have an iPhone as input, put on the table to the right, which also captures my posture - I slouch almost all the time.
There is a problem however - I work on an M1 Mac and OBS recordings take a full 4 out of 8 cores, so I actually only record when I notice I am starting to procrastinate. I remove all recordings afterwards to save space. OBS is turned on all the time though.
I wanted to use obs-ndi to combine output from another laptop, but I have some issues with it so I just record the Mac atm. I also have a powerful desktop on the side and can ssh between all of those by name, but the desktop is noisy so it's off most of the time. There is also a Raspberry Pi with a simple script with which I can turn on the desktop remotely via a Wake-on-LAN UDP packet, with DNS handled via https://www.noip.com, which my router has an integration with, but I've actually never used it. I did this setup to justify purchasing the powerful desktop in the first place :) humble brag, I know.
I think posts like this are only getting upvotes because George Hotz owns the project. I do see value in simple code, but the constraint of 1000 LOC makes little sense to me, especially when the code is formatted poorly.
This will get downvoted, but reading the comments here I don't understand the cult/respect for him.
Siding with the most successful CTF team ever (PPP), he won DEF CON two times.
He made a startup with funding that makes a cool 'niche' product.
I just think guys like Chris Lattner or Dave Cutler, who made so much impact on real computing, deserve so much more respect, but I guess the norm here is to admire this guy.
Your list lacks the reason for his initial fame: iPhone and PS3 jailbreaking.
And I think you're downplaying the achievements of Comma AI -- it may still be somewhat niche, but its product is better than Tesla Autopilot for highway driving (they aren't there on city driving / FSD yet), all with an absolutely tiny team.
Re: Comma AI. This is what it tells me about my run-of-the-mill Toyota:
> openpilot upgrades your Toyota Highlander Hybrid with automated lane centering at all speeds, and adaptive cruise control that automatically resumes from a stop.
Both are annoying artificial limitations Toyota put in place, presumably to avoid abuse by inattentive drivers.
I mean it can't change lanes. What does it do exactly?
Comma AI deliberately made lane change require a small bit of human intervention for safety reasons. The human hits the blinker and gives the wheel a tiny nudge in the direction, and then openpilot will complete the lane change and resume driving in the new lane.
The theory is that at the current level of ability of software like Tesla's and Comma's, it's probably a good idea for a human to be paying more attention during a lane change maneuver. Comma is of the opinion that the level of autonomy Teslas have is probably unnecessarily unsafe. Comma cares a lot about safety (e.g. they have much more sophisticated driver monitoring than Tesla).