tehsauce's comments

The Metal backend does currently generate quite a lot of unnecessary command buffers, but in general performance seems solid.


I haven’t gone through the paper in detail yet, but maybe someone can answer: if you remove the hidden state from an RNN, as they say they’ve done, what’s left? An MLP predicting from a single token?


They didn't remove the hidden state entirely; they just removed it from the input, forget, and update gates. I haven't digested the paper either, but I think that in the case of a GRU this means the hidden state update masking (z_t and r_t in the paper's formulas) depends only on the new input, not on the input plus the prior hidden state.
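
To make that concrete, here's a rough sketch of the kind of cell update being described, as I read it (a minGRU-style formulation in plain NumPy; the weight names and exact form are my assumptions, so check the paper's formulas):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def min_gru_step(h_prev, x_t, W_z, W_h):
      z_t = sigmoid(W_z @ x_t)   # gate depends only on the new input x_t
      h_cand = W_h @ x_t         # candidate state also depends only on x_t
      # h_prev only appears in this final linear blend
      return (1.0 - z_t) * h_prev + z_t * h_cand

Because neither z_t nor the candidate reads h_prev, the recurrence has the form h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h_cand, which is exactly the linear shape a parallel scan can exploit.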


It doesn't completely remove it; it removes certain dependencies on it so that the recurrence can be computed by parallel scan. There is still a hidden state. It bears some similarity to what was done with Mamba.


I only had a quick look, but it looks like they tweaked the state update so the model can be run with a parallel scan instead of having to compute it sequentially.


The trick is to make sure the recursive dependency stays linear; that's what enables parallel training.
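
A minimal sketch of what that buys you, assuming a recurrence of the form h_t = a_t * h_{t-1} + b_t (plain NumPy, written sequentially just to check the algebra; since the combine step is associative, the same prefixes can be computed in O(log T) parallel steps, e.g. with jax.lax.associative_scan):

  import numpy as np

  def combine(f, g):
      # Composing h -> a1*h + b1 with h -> a2*h + b2 gives h -> (a1*a2)*h + (a2*b1 + b2)
      a1, b1 = f
      a2, b2 = g
      return a1 * a2, a2 * b1 + b2

  rng = np.random.default_rng(0)
  T = 8
  a, b = rng.normal(size=T), rng.normal(size=T)

  # Sequential reference with h_0 = 0
  h, seq = 0.0, []
  for t in range(T):
      h = a[t] * h + b[t]
      seq.append(h)

  # Same values via the associative combine (a plain left fold here,
  # but associativity is what lets a parallel scan do this in log depth)
  acc, scan = (1.0, 0.0), []
  for t in range(T):
      acc = combine(acc, (a[t], b[t]))
      scan.append(acc[1])   # with h_0 = 0, h_t is just the accumulated "b" term

  assert np.allclose(seq, scan)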


The water consumed to produce a single hamburger is over 2000 liters, and the energy likely well over 100 watt-hours.

That means GPT can write more than 1000 emails using the resources it takes to feed a single person lunch. The resource efficiency of these machines is already quite astonishing.
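
Unpacking the arithmetic (this just rearranges the numbers above, not independent per-query estimates):

  burger_energy_wh = 100        # energy figure from above
  burger_water_liters = 2000    # water figure from above
  emails = 1000                 # the claimed email count

  print(burger_energy_wh / emails)             # 0.1 Wh of energy budget per email
  print(burger_water_liters * 1000 / emails)   # 2000 mL of water budget per email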


Awesome article! One thing is slightly misleading, though: the first image shows an intersection involving a non-convex shape, but it isn't revealed until much later that the algorithm only works for convex shapes, not the kind shown in that first image.


The article does discuss how the algorithm handles non-convex shapes: by breaking them into convex pieces.


Grokking is a sudden huge jump in test accuracy with increasing training steps, well after training accuracy has fully converged. Double descent is test performance increasing, decreasing, and then finally rising again as model parameters are increased.


What they share is a subversion of the naive picture that ML works simply by performing gradient descent over a loss landscape. Double descent subverts it by showing that generalization isn't monotonic in parameter count; grokking subverts it by showing learning that continues well after training accuracy has converged.

I'd put the lottery ticket hypothesis in the same bucket of "things that may happen that don't make sense at all for a simple optimization procedure."


My takeaway from the paper is that you can guide training by adding, or switching to, a more difficult loss function once you’ve got the basics right. It looks like they never got to the overfitting/grokking regime, so maybe there’s more to discover further down the training alley.


If CPU softmax were limited by memory bandwidth, these vectorization optimizations wouldn't improve performance.
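
For reference, a minimal numerically stable softmax (NumPy sketch): the inner work is a max, an exp per element, a sum, and a divide, i.e. mostly arithmetic, which is why SIMD vectorization can pay off when the rows already fit in cache:

  import numpy as np

  def softmax(x, axis=-1):
      x_max = np.max(x, axis=axis, keepdims=True)   # subtract the max so exp() can't overflow
      e = np.exp(x - x_max)
      return e / np.sum(e, axis=axis, keepdims=True)

  print(softmax(np.array([1.0, 2.0, 3.0])))   # [0.09003057 0.24472847 0.66524096]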


+1 for Vast. They're usually the cheapest and have the most supply, though some instances at the low end can be less reliable.


You might not need direct access to wave/subgroup ops to implement efficient stream compaction. There's a great old Nvidia blog post on "warp-aggregated atomics"

https://developer.nvidia.com/blog/cuda-pro-tip-optimized-fil...

where they show that their compiler is sometimes able to automatically convert global atomic operations into warp-local versions, achieving the same performance as manually written intrinsics. I was recently curious whether, 10 years later, these same optimizations had made it into other GPUs and platforms besides CUDA, so I put together a simple atomics benchmark in WebGPU.

https://github.com/PWhiddy/webgpu-atomics-benchmark

The results seem to indicate that these optimizations are accessible through WebGPU in Chrome on both macOS and Linux (with an Nvidia GPU). Note that I'm not directly testing stream compaction, just incrementing a single global atomic counter, so that case would need to be tested to know for sure whether the optimization still holds there. If you see any issues with the benchmark or this reasoning, please let me know! I'm hoping to solidify my knowledge in this area :)


500 GB/s is going to limit it to at best 1/4 the DL performance of an Nvidia GPU. I’m not sure what the floating-point performance of these FPGAs is, but I imagine that also might set a fundamental performance limit at a small fraction of a GPU.
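
The 1/4 figure is just the bandwidth ratio, assuming roughly 2 TB/s of HBM bandwidth on a current datacenter GPU (an assumption; the exact number varies by card):

  fpga_bw_gb_s = 500
  gpu_bw_gb_s = 2000    # assumed A100-class HBM bandwidth; varies by model
  print(fpga_bw_gb_s / gpu_bw_gb_s)   # 0.25 -> at best ~1/4 for bandwidth-bound workloads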


Well, I keep seeing all models quantized, and for 2-bit, 4-bit, and 1-bit quantization I had very good inference performance (in either throughput or latency) on CNNs and some RNNs on Alveo boards using FINN (so, mostly high-level synthesis and very little actual FPGA wrangling). No idea about the current status of all these; I'll read the paper though :-)


Systolic arrays are essentially how matmul is implemented in tensor cores in GPUs and TPUs.

