Attention Is a Memory Allocator Problem: FlashAttention-2 and PagedAttention Under the Microscope
Modern transformer performance is limited less by math and more by how precisely we move and allocate memory.