Flash Attention V2

Recap written by William Brandon

MIT MLSys Discussion Group

Key ideas

GPU compute hierarchy background

| NVIDIA GPU concept              | CPU concept                              |
|---------------------------------|------------------------------------------|
| SM ("Streaming Multiprocessor") | Group of CPU cores sharing an L1 cache   |
| Warp scheduler                  | Superscalar CPU core with hyperthreading |
| Warp                            | Thread executing 32-wide SIMD instructions |
| Thread                          | SIMD lane                                |
| CUDA core                       | ALU                                      |
| Tensor core                     | Matrix accelerator unit                  |
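
Not part of the original recap: a minimal CUDA sketch (kernel name and launch shape are arbitrary) of where the rows of this table show up in practice. A thread block runs entirely on one SM, the block's threads are grouped into 32-wide warps issued by the warp schedulers, and each thread behaves like one SIMD lane:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hierarchy_demo() {
    int lane = threadIdx.x % 32;  // position within the warp ("SIMD lane")
    int warp = threadIdx.x / 32;  // warp index within this thread block
    if (lane == 0) {              // one printout per warp
        printf("block %d (runs on one SM), warp %d of %d, 32 lanes each\n",
               blockIdx.x, warp, (int)(blockDim.x / 32));
    }
}

int main() {
    // 4 thread blocks of 128 threads each = 4 warps per block.
    hierarchy_demo<<<4, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```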

Schematic of a full GA100 GPU (the full die has 128 SMs, more than the 108 enabled on an A100); see [3]:

Schematic of a single Ampere SM (see [3]):

Transformers background
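
As a one-equation refresher (standard notation, as in [1]): given query, key, and value matrices Q, K, V, each of shape N × d, attention computes

```latex
O = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V
```

A naive implementation materializes the N × N score matrix QKᵀ in HBM; FlashAttention computes the same O exactly while tiling the computation through on-chip SRAM, so the N × N matrix never touches HBM [1].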

FlashAttention 2 innovations
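
A sketch of the core recurrence, to make the change concrete (per [1, 2]; tile names are illustrative): for a fixed query tile Q_i streaming over key/value tiles K_j, V_j, both versions maintain a running row-max m, row-sum ℓ, and output accumulator O via online softmax:

```latex
\begin{aligned}
S^{(j)} &= Q_i K_j^\top / \sqrt{d} \\
m^{(j)} &= \max\bigl(m^{(j-1)},\ \operatorname{rowmax}(S^{(j)})\bigr) \\
\tilde{P}^{(j)} &= \exp\bigl(S^{(j)} - m^{(j)}\bigr) \\
\ell^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \operatorname{rowsum}\bigl(\tilde{P}^{(j)}\bigr) \\
O^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,O^{(j-1)} + \tilde{P}^{(j)} V_j
\end{aligned}
```

FlashAttention-2 [2] keeps O unscaled inside this loop and applies the final normalization diag(ℓ)⁻¹ just once at the end, cutting non-matmul FLOPs; it also parallelizes thread blocks over the sequence-length dimension in addition to batch and heads, and repartitions work within a block so each warp holds a slice of Q_i and reads all of K_j/V_j, avoiding shared-memory round trips between warps.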

Questions

References

[1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Tri Dao et al., 2022)

[2] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Tri Dao, 2023)

[3] NVIDIA Ampere Architecture In-Depth (NVIDIA Developer Blog, 2020), https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/