NSA essentially combines three types of attention optimizations (rough sketch after the list):
1. Compressing blocks of key/value tokens into coarse block-level representations, reducing the effective size of the KV cache
2. Selectively computing uncompressed attention over the tokens of the blocks whose compressed representations receive the highest attention scores
3. Using a sliding window for full-resolution local attention
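To make the three branches concrete, here is a minimal NumPy sketch of a single decoding step for one query and one head. It is a simplification under stated assumptions: mean pooling stands in for the paper's learned block compressor, a uniform average stands in for its learned gating over the three branches, and the names and hyperparameters (`nsa_decode_step`, `block_size`, `top_k_blocks`, `window`) are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def nsa_decode_step(q, K, V, block_size=32, top_k_blocks=4, window=64):
    """Single-query, single-head sketch of the three NSA branches.

    Simplifications (assumptions, not the paper's method): mean-pooled block
    compression instead of a learned compressor, and a uniform average of the
    branches instead of learned gating.
    """
    t, d = K.shape
    scale = 1.0 / np.sqrt(d)

    # 1) Compression branch: pool each KV block into one coarse token.
    n_blocks = t // block_size
    K_cmp = K[:n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    V_cmp = V[:n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    p_cmp = softmax(q @ K_cmp.T * scale)          # block-level attention scores
    out_cmp = p_cmp @ V_cmp

    # 2) Selection branch: keep the top-k blocks by compressed score and
    #    attend to their original, uncompressed tokens.
    top = np.argsort(p_cmp)[-top_k_blocks:]
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in top])
    out_sel = softmax(q @ K[idx].T * scale) @ V[idx]

    # 3) Sliding-window branch: full-resolution attention over recent tokens.
    K_win, V_win = K[-window:], V[-window:]
    out_win = softmax(q @ K_win.T * scale) @ V_win

    # The paper combines the branches with learned gates; average as a stand-in.
    return (out_cmp + out_sel + out_win) / 3.0

rng = np.random.default_rng(0)
d, t = 64, 512
q = rng.standard_normal(d)
K = rng.standard_normal((t, d))
V = rng.standard_normal((t, d))
print(nsa_decode_step(q, K, V).shape)  # (64,)
```

The point of the split: only the compression branch touches every position, while the selection and window branches read a small, input-dependent slice of the uncompressed KV cache.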
> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.
> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters
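For orientation, a hypothetical config object for a backbone of that shape is sketched below; only the GQA + MoE structure and the 27B-total / 3B-active parameter counts come from the quote, and every other number is an invented placeholder.

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    # From the quoted setup:
    total_params: float = 27e9       # 27B total parameters
    active_params: float = 3e9       # 3B parameters active per token (MoE routing)
    # Placeholders, NOT the paper's hyperparameters:
    num_query_heads: int = 64        # GQA: many query heads...
    num_kv_heads: int = 4            # ...share a small set of KV heads
    num_experts: int = 64            # MoE expert pool
    experts_per_token: int = 6       # experts routed per token

cfg = BackboneConfig()
print("GQA group size:", cfg.num_query_heads // cfg.num_kv_heads)
print(f"fraction of params active per token: {cfg.active_params / cfg.total_params:.1%}")
```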
Evaluated on MMLU, MMLU-Pro, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval; NSA outperforms full attention on 7 of the 9 benchmarks.
Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench
Perfect retrieval on 64k needle-in-a-haystack
The chain-of-thought eval is less convincing, but NSA does outperform full attention on AIME24.
Training speedup of 2-9x vs. FlashAttention
Decoding speedup of 4-12x vs. full attention ["expected" speedup? I didn't see a comparison to other sparse-attention mechanisms]