MIT MLSys Discussion Group

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Summary. The paper presents strategies (chiefly for partitioning model weights across GPUs) and design considerations (chiefly for limiting communication between GPUs) that enable distributed training of language models that were extremely large by contemporary standards (8 billion parameters). This work influenced the training-system designs of subsequent large language models.
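
To make the partitioning idea concrete, below is a minimal NumPy sketch of the intra-layer (tensor-parallel) split the paper applies to a transformer MLP block: the first weight matrix is split by columns and the second by rows, so each GPU computes its shard independently and a single all-reduce (a sum over ranks) recovers the full output. This is a toy illustration, not the paper's code; the shapes, rank count, and GeLU approximation are assumptions chosen for readability.

    import numpy as np

    def gelu(x):
        # Tanh approximation of GeLU (illustrative choice, not taken from the paper's code)
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    rng = np.random.default_rng(0)
    batch, d_model, d_ff, n_ranks = 4, 8, 32, 2   # toy sizes, assumed for illustration

    X = rng.standard_normal((batch, d_model))
    A = rng.standard_normal((d_model, d_ff))      # first MLP weight
    B = rng.standard_normal((d_ff, d_model))      # second MLP weight

    # Single-GPU reference: Z = GeLU(X A) B
    Z_ref = gelu(X @ A) @ B

    # Tensor-parallel version: split A by columns and B by rows across "ranks".
    A_shards = np.split(A, n_ranks, axis=1)       # rank i holds A_i
    B_shards = np.split(B, n_ranks, axis=0)       # rank i holds B_i

    # Each rank computes GeLU(X A_i) B_i with no communication;
    # GeLU is elementwise, so it applies to each column shard independently.
    partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

    # One all-reduce (sum across ranks) recovers the full MLP output.
    Z_tp = np.sum(partials, axis=0)

    print(np.allclose(Z_ref, Z_tp))               # True: the split is mathematically exact

The point the sketch illustrates is why the paper's split avoids communication inside the block: because the nonlinearity is applied per column shard, the only required synchronization is the single all-reduce after the second matrix multiply.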

