tehsauce's comments

The Metal backend does currently generate quite a lot of unnecessary command buffers, but in general performance seems solid.


I haven’t gone through the paper in detail yet, but maybe someone can answer: if you remove the hidden state from an RNN, as they say they’ve done, what’s left? An MLP predicting from a single token?


They didn't remove the hidden state entirely; they just removed it from the input, forget, and update gates. I haven't digested the paper either, but I think that in the case of a GRU this means the hidden state update masking (z_t and r_t in the paper's formulas) depends only on the new input, not on the input plus the prior hidden state.
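
To make that concrete, here's a rough sketch of the kind of cell update being described, as I read it (a minGRU-style formulation in plain NumPy; the weight names and exact form are my assumptions, so check the paper's formulas):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def min_gru_step(h_prev, x_t, W_z, W_h):
      z_t = sigmoid(W_z @ x_t)   # gate depends only on the new input x_t
      h_cand = W_h @ x_t         # candidate state also depends only on x_t
      # h_prev only appears in this final linear blend
      return (1.0 - z_t) * h_prev + z_t * h_cand

Because neither z_t nor the candidate reads h_prev, the recurrence has the form h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h_cand, which is exactly the linear shape a parallel scan can exploit.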


It doesn't completely remove it; it removes certain dependencies on it so that the recurrence can be computed by parallel scan. There is still a hidden state. It bears some similarity to what was done with Mamba.


I only had a quick look, but it looks like they tweaked the state update so the model can be run with a parallel scan instead of having to compute it sequentially.


The trick is to make sure the recursive dependency stays linear; that's what enables parallel training.
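
A minimal sketch of what that buys you, assuming a recurrence of the form h_t = a_t * h_{t-1} + b_t (plain NumPy, written sequentially just to check the algebra; since the combine step is associative, the same prefixes can be computed in O(log T) parallel steps, e.g. with jax.lax.associative_scan):

  import numpy as np

  def combine(f, g):
      # Composing h -> a1*h + b1 with h -> a2*h + b2 gives h -> (a1*a2)*h + (a2*b1 + b2)
      a1, b1 = f
      a2, b2 = g
      return a1 * a2, a2 * b1 + b2

  rng = np.random.default_rng(0)
  T = 8
  a, b = rng.normal(size=T), rng.normal(size=T)

  # Sequential reference with h_0 = 0
  h, seq = 0.0, []
  for t in range(T):
      h = a[t] * h + b[t]
      seq.append(h)

  # Same values via the associative combine (a plain left fold here,
  # but associativity is what lets a parallel scan do this in log depth)
  acc, scan = (1.0, 0.0), []
  for t in range(T):
      acc = combine(acc, (a[t], b[t]))
      scan.append(acc[1])   # with h_0 = 0, h_t is just the accumulated "b" term

  assert np.allclose(seq, scan)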


The water consumed to produce a single hamburger is over 2000 liters, and the energy likely well over 100 watt-hours.

That means GPT can write more than 1000 emails using the resources it takes to feed a single person lunch. The resource efficiency of these machines is already quite astonishing.
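
Unpacking the arithmetic (this just rearranges the numbers above, not independent per-query estimates):

  burger_energy_wh = 100        # energy figure from above
  burger_water_liters = 2000    # water figure from above
  emails = 1000                 # the claimed email count

  print(burger_energy_wh / emails)             # 0.1 Wh of energy budget per email
  print(burger_water_liters * 1000 / emails)   # 2000 mL of water budget per email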


Awesome article! One thing is slightly misleading, though: the first image shows an intersection involving a non-convex shape, but it isn't revealed until much later that the algorithm only works for convex shapes, not the kind shown in that first image.


The article does discuss how the algorithm handles non-convex shapes: by breaking them into convex pieces.


Grokking is a sudden huge jump in test accuracy with increasing training steps, well after training accuracy has fully converged. Double descent is test performance increasing, decreasing, and then finally rising again as model parameters are increased.


What they share is a subversion of the naive picture that ML works simply by performing gradient descent over a loss landscape. Double descent subverts it by showing that generalization isn't monotonic in parameter count; grokking subverts it by showing learning that continues well after training accuracy has converged.

I'd put the lottery ticket hypothesis in the same bucket of "things that may happen that don't make sense at all for a simple optimization procedure."


My takeaway from the paper is that you can guide training by adding, or switching to, a more difficult loss function once you’ve got the basics right. It looks like they never got to the overfitting/grokking regime, so maybe there’s more to discover further down the training alley.


If CPU softmax were limited by memory bandwidth, these vectorization optimizations wouldn't improve performance.
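
For reference, a minimal numerically stable softmax (NumPy sketch): the inner work is a max, an exp per element, a sum, and a divide, i.e. mostly arithmetic, which is why SIMD vectorization can pay off when the rows already fit in cache:

  import numpy as np

  def softmax(x, axis=-1):
      x_max = np.max(x, axis=axis, keepdims=True)   # subtract the max so exp() can't overflow
      e = np.exp(x - x_max)
      return e / np.sum(e, axis=axis, keepdims=True)

  print(softmax(np.array([1.0, 2.0, 3.0])))   # [0.09003057 0.24472847 0.66524096]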


+1 for Vast. They're usually the cheapest and have the most supply, though some instances at the low end can be less reliable.


You might not need direct access to wave/subgroup ops to implement efficient stream compaction. There's a great old Nvidia blog post on "warp-aggregated atomics"

https://developer.nvidia.com/blog/cuda-pro-tip-optimized-fil...

where they show that their compiler is sometimes able to automatically convert global atomic operations into warp-local versions, achieving the same performance as manually written intrinsics. I was recently curious whether, 10 years later, these same optimizations had made it into other GPUs and platforms besides CUDA, so I put together a simple atomics benchmark in WebGPU.

https://github.com/PWhiddy/webgpu-atomics-benchmark

The results seem to indicate that these optimizations are accessible through WebGPU in Chrome on both macOS and Linux (with an Nvidia GPU). Note that I'm not directly testing stream compaction, just incrementing a single global atomic counter, so that case would need to be tested to know for sure whether the optimization still holds there. If you see any issues with the benchmark or this reasoning, please let me know! I'm hoping to solidify my knowledge in this area :)


500 GB/s is going to limit it to at best 1/4 the DL performance of an Nvidia GPU. I’m not sure what the floating-point performance of these FPGAs is, but I imagine that also might set a fundamental performance limit at a small fraction of a GPU.
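
The 1/4 figure is just the bandwidth ratio, assuming roughly 2 TB/s of HBM bandwidth on a current datacenter GPU (an assumption; the exact number varies by card):

  fpga_bw_gb_s = 500
  gpu_bw_gb_s = 2000    # assumed A100-class HBM bandwidth; varies by model
  print(fpga_bw_gb_s / gpu_bw_gb_s)   # 0.25 -> at best ~1/4 for bandwidth-bound workloads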


Well, I keep seeing all models quantized, and for 2-bit, 4-bit, and 1-bit quantization I had very good inference performance (in either throughput or latency) on CNNs and some RNNs on Alveo boards using FINN (so, mostly high-level synthesis and very little actual FPGA wrangling). No idea about the current status of all these; I'll read the paper though :-)


Systolic arrays are essentially how matmul is implemented in tensor cores in GPUs and TPUs.

