DeepSeek V3 Tech Report

Q1: Why does rotary positional encoding apply to only half of the query/key dimensions?

This is because rotary positional encoding is not compatible with multi-head latent attention, specifically the latent part. Here’s a more detailed explanation:

To begin with, consider Multi-head Latent Attention (MLA):

During inference, the attention score can be rewritten as:

$$\text{score}_{t,j} = h_t^T W_Q^T W_{UK}\, c_j^{KV} = h_t^T W_{\text{effective}}\, c_j^{KV}$$

where $W_{\text{effective}} = W_Q^T W_{UK}$.

This matrix absorption means there's no need to explicitly decompress the cached key vectors, significantly reducing memory usage and computation.
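To make the absorption concrete, here is a minimal NumPy sketch. The shapes and weight names (`W_Q`, `W_UK`, `d_latent`, etc.) are illustrative assumptions, not DeepSeek V3's actual configuration; the point is only that scoring against the cached latent $c_j^{KV}$ with a precomputed $W_{\text{effective}}$ matches first decompressing the key:

```python
# Illustrative sketch of matrix absorption in MLA-style attention.
# Shapes and names are made up for the example, not DeepSeek V3's real dimensions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 16, 32

W_Q  = rng.standard_normal((d_head, d_model))    # query projection
W_UK = rng.standard_normal((d_head, d_latent))   # key up-projection (decompression)

h_t = rng.standard_normal(d_model)               # current hidden state
c_j = rng.standard_normal(d_latent)              # cached compressed KV latent for position j

# Naive path: decompress the cached latent into a full key, then dot with the query.
q_t = W_Q @ h_t
k_j = W_UK @ c_j
score_naive = q_t @ k_j

# Absorbed path: fold W_Q^T W_UK into one matrix once, then score directly from the latent.
W_eff = W_Q.T @ W_UK                             # (d_model, d_latent), precomputed offline
score_absorbed = h_t @ W_eff @ c_j

assert np.allclose(score_naive, score_absorbed)  # same score, no per-token decompression
```

The cache only ever stores the small latent `c_j`, and the per-token work never materializes a full-size key.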

However, when we introduce Rotary Positional Embeddings (RoPE), complications arise:

Since RoPE applies position-dependent rotations, the matrix absorption optimization ($W_Q^T W_{UK}$) is no longer possible: the rotations insert a factor $R_t^T R_j$ between $W_Q^T$ and $W_{UK}$ that depends on the token positions, so the product cannot be precomputed in a position-independent way.
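The sketch below (same illustrative shapes and names as above, with a hypothetical `rope_rotation` helper) shows where the position dependence lands: the would-be effective matrix now contains $R_t^T R_j$, so it changes with every $(t, j)$ pair instead of being computable once:

```python
# Why RoPE breaks absorption: the rotation sits between W_Q^T and W_UK and
# depends on token positions. Illustrative shapes only.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, d_head = 64, 16, 32

def rope_rotation(pos, dim, base=10000.0):
    """Block-diagonal 2x2 rotation matrix applied by RoPE at a given position."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos / (base ** (2 * i / dim))
        c, s = np.cos(theta), np.sin(theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

W_Q  = rng.standard_normal((d_head, d_model))
W_UK = rng.standard_normal((d_head, d_latent))
h_t  = rng.standard_normal(d_model)
c_j  = rng.standard_normal(d_latent)

t, j = 7, 3
R_t, R_j = rope_rotation(t, d_head), rope_rotation(j, d_head)

# With RoPE, the score is h_t^T W_Q^T (R_t^T R_j) W_UK c_j.
score_rope = (R_t @ W_Q @ h_t) @ (R_j @ W_UK @ c_j)

# The "effective" matrix now depends on the pair (t, j) through R_t^T R_j,
# so it cannot be precomputed once the way W_Q^T W_UK can.
W_eff_tj = W_Q.T @ R_t.T @ R_j @ W_UK
assert np.allclose(score_rope, h_t @ W_eff_tj @ c_j)
```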

To maintain both computational efficiency and positional awareness, the solution DeepSeek V3 adopts is to apply RoPE to only a small, decoupled subset of query/key channels, which are cached directly without compression, while the remaining channels stay in the compressed latent form and keep the matrix-absorption optimization.
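A minimal sketch of this split, under the same assumptions as the earlier examples (the `nope`/`rope` names, dimensions, and `rope_rotation` helper are illustrative, not the report's exact formulation): the score is the sum of an absorbed, position-free term over the compressed channels and a small RoPE term over the decoupled channels.

```python
# Decoupled-RoPE sketch: compressed (no-RoPE) channels keep matrix absorption,
# a small RoPE part is cached as-is. Illustrative shapes and names only.
import numpy as np

rng = np.random.default_rng(2)
d_model, d_latent, d_nope, d_rope = 64, 16, 32, 8

W_Q_nope = rng.standard_normal((d_nope, d_model))   # query proj for compressed part
W_UK     = rng.standard_normal((d_nope, d_latent))  # key up-projection for compressed part
W_Q_rope = rng.standard_normal((d_rope, d_model))   # query proj for RoPE part
W_K_rope = rng.standard_normal((d_rope, d_model))   # key proj for RoPE part

W_eff = W_Q_nope.T @ W_UK                            # absorbed once, position-independent

def rope_rotation(pos, dim, base=10000.0):
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos / (base ** (2 * i / dim))
        c, s = np.cos(theta), np.sin(theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

h_t, h_j = rng.standard_normal(d_model), rng.standard_normal(d_model)
t, j = 7, 3

# Per-token cache: the small latent c_j plus the small, already-rotated RoPE key.
c_j      = rng.standard_normal(d_latent)             # compressed KV latent
k_j_rope = rope_rotation(j, d_rope) @ (W_K_rope @ h_j)

# Score = absorbed (position-free) term + decoupled RoPE (position-aware) term.
q_t_rope = rope_rotation(t, d_rope) @ (W_Q_rope @ h_t)
score = h_t @ W_eff @ c_j + q_t_rope @ k_j_rope
print(score)
```

The extra cache cost is only the few RoPE channels per token, while the bulk of the key information stays in the compressed latent.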

Q2: What is DualPipe, and how do we interpret its illustration and relate it to its key performance characteristics?

Remark. Auxiliary-loss-free strategy for expert load balancing.