DeepSeek's multi-head latent attention and other KV cache tricks

(pyspur.dev)

292 points | by t55 20 days ago

78 comments