FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention (pytorch.org)
210 points by limoce 3 months ago | 24 comments



Hi, I'm one of the authors of this blog post (Horace He), along with Driss Guessous, Yanbo Liang, and Joy Dong.

We’re quite happy with this abstraction - happy to answer any questions about it!


For those of us using the 2D NATTEN kernel from their library along with torch.compile, is this faster? Especially given all their tricks (e.g., the non-deterministic KV parallelism)?


In my (very amateurish) testing, the performance seemed pretty comparable (for non-dilated NATTEN). I need to do some proper benchmarking, though!


Is this Ampere-and-newer only, like FA2?


I believe it should run on V100 as well (although it's definitely not as well tested), and a user reported that they got it running on a T4 too.


It's interesting that optimizing a computation that can be described in a single line of math takes so much work. It took forever even to discover FlashAttention, and in the six years since transformers were invented, thousands of papers have worked on making attention faster.

Attention(Q,K,V) = Softmax(Q*K^T/sqrt(d_k))*V

FlexAttention seems to have found the right abstraction for the task.
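
For context, the abstraction is a user-supplied score_mod (and optionally a mask_mod) that rewrites each pre-softmax score while the kernel stays fused. A rough sketch, assuming PyTorch 2.5+ on a Triton-capable GPU; the relative-position bias is just an illustrative choice, and the shapes are made up:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    def relative_positional(score, b, h, q_idx, kv_idx):
        # Tweak each pre-softmax attention score; the matmuls and softmax stay fused.
        return score + (q_idx - kv_idx)

    # (batch, heads, seq_len, head_dim) -- arbitrary shapes for the example
    q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))
    out = torch.compile(flex_attention)(q, k, v, score_mod=relative_positional)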


Yeah, because the math strips the whole thing down to "I have data and I do an operation on it", while in reality we deal with multi-head attention / grouped-query attention and positional encoding.

That's all without taking into account the broadcasting done over the batch dimension.
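
To make that concrete, here is a small sketch (shapes made up) of what the one-liner actually carries around in practice, using PyTorch's built-in scaled_dot_product_attention:

    import torch
    import torch.nn.functional as F

    # The textbook equation is per-head; real implementations broadcast it over
    # (batch, heads), and masking / positional tweaks live outside the formula.
    B, H, S, D = 4, 16, 2048, 64
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([4, 16, 2048, 64])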


I would agree with this. For example, how would you represent causal attention in the standard equation?
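
In FlexAttention itself, causal attention is expressed as a mask_mod rather than something you can fold into the closed-form equation. A minimal sketch, with shapes assumed and a CUDA/Triton-capable device:

    import torch
    from torch.nn.attention.flex_attention import flex_attention, create_block_mask

    def causal(b, h, q_idx, kv_idx):
        # Keep only keys at or before the query position.
        return q_idx >= kv_idx

    B, H, S, D = 2, 8, 1024, 64
    block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S)
    q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
    out = flex_attention(q, k, v, block_mask=block_mask)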


This is true of even just matrix multiplication (A*B), of which attention has two.


For most LLM workloads today (short text chats), hundreds or a couple thousand tokens suffice, so attention doesn't dominate the compute (< 30%). But as the modalities inevitably grow, work on attention approximation/compression is going to be paramount.

Nice to see PyTorch already elegantly supporting this next step in research.


I didn't see any notice of this being CUDA only (like FlashAttention). I tried running it on my Mac M3, Python 3.11.8, following the quickstart (with the deviation of running it in a new venv), and got the following error:

    /attention-gym/.venv/lib/python3.11/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
      cpu = _conversion_method_template(device=torch.device("cpu"))
    Traceback (most recent call last):
      File "/attention-gym/attn_gym/masks/document_mask.py", line 7, in <module>
        from torch.nn.attention.flex_attention import _mask_mod_signature
    ModuleNotFoundError: No module named 'torch.nn.attention.flex_attention'


Ah, sorry, I should have put that in the blog post. This leverages Triton heavily, so it'll only work on machines that have Triton backends (at least, we've tested on Nvidia and AMD GPUs).


> FlexAttention achieves 90% of FlashAttention2’s performance in the forward pass and 85% in the backward pass.

It's very good. But note FlashAttention-3 is 1.5x - 2x faster than FlashAttention-2.


These benchmarks are on Ampere, where FA3 has no performance benefits over FA2.

On Hopper, FlexAttention is currently about 80% of FlashAttention3's performance (about 500 TFLOPs peak)


Not bad.


I've always been curious to put something together with PyTorch, but it always seemed like either a steep learning curve or there wasn't a big motivator (a project, a problem to solve, something in my daily routine to optimize).

Does anybody have a good starting point to learn with hands-on projects, ideally one that could also accommodate FlexAttention?


IMO the PyTorch getting started tutorials are really good (https://pytorch.org/tutorials/beginner/basics/intro.html).

A classifier for handwritten digits in the MNIST dataset is generally considered the "Hello World" of neural networks. I went over it in a course, but there are countless tutorials to be found online, e.g. https://www.digitalocean.com/community/tutorials/introductio...

Once you begin to understand how to handle data and how to define layers, you can start playing around with whatever your heart desires. The rabbit hole is vast and endless :)
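
To make the "Hello World" concrete, here is roughly what an MNIST training loop looks like; the architecture and hyperparameters are just reasonable defaults rather than anything from the tutorials above:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Tiny MNIST classifier: flatten -> one hidden layer -> 10 logits.
    train_ds = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        for x, y in train_dl:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        print(f"finished epoch {epoch}")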


Agreed that the PyTorch tutorials are a great place to start. Specific to FlexAttention, the blog references the accompanying attention-gym, which has a series of examples of how to use flex: https://github.com/pytorch-labs/attention-gym/


Check out Kaggle for the challenges.


This is so cool. I want to try to implement something with this right now.


Can someone do a short summary or TL;DR for this?


https://x.com/chhillee/status/1821253769147118004?s=46

Perhaps this tweet thread would be better.



Thanks, I just weaned myself off Twitter / X.



