Reading Notes: “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning”
date: Feb 24, 2025
slug: alpa
status: Published
tags: MLSys
summary:
type: Post
Motivation
Recent advancements in large-scale models have demonstrated impressive performance across a wide range of tasks. However, training such massive models on distributed clusters demands considerable engineering effort, as it involves configuring numerous task-specific or architecture-specific strategies and parameters. These configurations are often interdependent, adding further complexity to the process.
To address this challenge, automating the training process offers substantial benefits by reducing manual effort and improving efficiency. Under the conventional perspective, parallelism in distributed training is typically categorized into three types: data parallelism, operator parallelism, and pipeline parallelism. In practice, these approaches are frequently combined to maximize performance. However, the interdependence of their associated parameters creates a combinatorially explosive space of possible parallel execution plans, making manual optimization intractable.
To overcome this, Alpa introduces a compiler that automatically generates near-optimal parallel execution plans. It adopts a hierarchical view of parallelism, distinguishing between inter-operator and intra-operator parallelism, enabling a structured and efficient exploration of the execution plan space.
Approach
New Perspectives on Parallelism

In this paper, a new perspective on parallelism is proposed:
- Inter-op parallelism: the computation graph is partitioned only at the boundaries between operators (i.e., individual operators are not split)
- Intra-op parallelism: the partition happens inside operators (i.e., operators themselves are split across devices)
These two types of parallelism are orthogonal and form a hierarchy: we can first partition the model into stages with inter-op parallelism, and then partition the operators within each stage with intra-op parallelism.
Notably, inter-op parallelism incurs little communication overhead, since only P2P communication is needed at stage boundaries, but it can suffer from lower device utilization due to pipeline bubbles; intra-op parallelism achieves better device utilization, but requires collective communication (e.g., all-reduce) at re-partition points. A toy sketch of this two-level hierarchy is shown below.
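To make the hierarchy concrete, here is a minimal NumPy sketch (not Alpa code): an illustrative eight-layer model is first cut into two pipeline stages at layer boundaries (inter-op), and then, within one stage, each layer's weight matrix is sharded across that stage's devices (intra-op). The layer sizes, the even split, and the axis-0 sharding are arbitrary choices for illustration.

```python
import numpy as np

# Toy "model": 8 layers, each represented only by a 4x4 weight matrix.
layers = [np.full((4, 4), float(i)) for i in range(8)]

# Inter-op parallelism: cut the graph only at layer boundaries into pipeline
# stages; the layers themselves stay whole. Each stage gets its own devices.
stages = [layers[0:4], layers[4:8]]

# Intra-op parallelism: inside one stage, partition each layer's weight tensor
# across the devices assigned to that stage (here: 2 devices, split along axis 0).
devices_in_stage0 = 2
stage0_shards = [np.array_split(w, devices_in_stage0, axis=0) for w in stages[0]]

print(f"{len(stages)} stages; each stage-0 layer is split into "
      f"{len(stage0_shards[0])} shards of shape {stage0_shards[0][0].shape}")
```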
Core Ideas

Alpa’s core idea is to exploit the matching hierarchical structure of the parallelism space and of the device cluster:
- devices within a node are connected by high-bandwidth links (e.g., NVLink), which suit the collective communication of intra-op parallelism;
- devices across nodes are connected by slower links, which suit the point-to-point communication of inter-op parallelism.
By leveraging this hierarchical structure, Alpa defines a structured plan space that enables the automated optimization and selection of an efficient parallel execution strategy.
Alpa’s automated plan generation comprises three passes (a toy end-to-end skeleton follows the list):
- Inter-op pass: slices the model into stages and assigns each stage to a device submesh.
- Intra-op pass: splits the operators of each stage across the devices of its submesh.
- Runtime orchestration: ties everything together and generates the instructions needed for the final pipelined execution.
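As a rough mental model (not Alpa's API), the three passes compose roughly like the skeleton below; the function names and the deliberately trivial "even split" policies are placeholders for the DP and ILP searches described in the following sections.

```python
def inter_op_pass(layers, devices, num_stages):
    """Toy inter-op pass: evenly slice layers into stages and devices into submeshes."""
    def chunks(xs, k):
        step = len(xs) // k
        return [xs[i * step:(i + 1) * step] for i in range(k)]
    return chunks(layers, num_stages), chunks(devices, num_stages)

def intra_op_pass(stage, submesh):
    """Toy intra-op pass: record how each layer of a stage is sharded over its submesh."""
    return [(layer, f"sharded over {len(submesh)} devices") for layer in stage]

def orchestrate(sharded_stages, submeshes):
    """Toy orchestration: describe which submesh executes which sharded stage."""
    return [f"stage {i}: {len(stage)} layers on {submesh}"
            for i, (stage, submesh) in enumerate(zip(sharded_stages, submeshes))]

layers = [f"layer{i}" for i in range(8)]
devices = [f"gpu{i}" for i in range(4)]
stages, submeshes = inter_op_pass(layers, devices, num_stages=2)
sharded = [intra_op_pass(s, m) for s, m in zip(stages, submeshes)]
for line in orchestrate(sharded, submeshes):
    print(line)
```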
Intra-op Pass
- SPMD style: partitions each operator evenly across devices and executes the same program on every device, keeping per-device work balanced.
- Plan space: for each operator, a set of sharding strategies specifying how each tensor axis is mapped onto the device mesh (partitioned along a mesh dimension or replicated).
- Optimization objective: the total cost, defined as the sum of compute and communication costs (including the resharding cost between operators with mismatched layouts); the problem is solved with Integer Linear Programming (a brute-force stand-in for the solver is sketched below).
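A toy illustration of that objective (not the ILP itself): each operator chooses one sharding strategy with a compute + communication cost, and each edge adds a resharding cost when neighboring choices do not match. Here a brute-force search stands in for the ILP solver, and all operator names, strategies, and costs are made up.

```python
from itertools import product

# Hypothetical two-operator graph with made-up per-strategy costs
# (compute + communication for the operator itself).
node_costs = {
    "matmul1": {"row": 1.0, "col": 1.2, "replicated": 3.0},
    "matmul2": {"row": 1.1, "col": 1.0, "replicated": 3.0},
}
# resharding_cost[(u, v)][(s_u, s_v)]: cost of converting u's output layout
# into the layout v expects (0 if the layouts already match).
resharding_cost = {
    ("matmul1", "matmul2"): {
        ("row", "row"): 0.0, ("row", "col"): 0.5, ("row", "replicated"): 0.4,
        ("col", "row"): 0.5, ("col", "col"): 0.0, ("col", "replicated"): 0.4,
        ("replicated", "row"): 0.2, ("replicated", "col"): 0.2,
        ("replicated", "replicated"): 0.0,
    },
}

def cheapest_plan(node_costs, resharding_cost):
    """Brute-force stand-in for the ILP: try every combination of strategies."""
    ops = list(node_costs)
    best_plan, best_cost = None, float("inf")
    for choice in product(*(node_costs[op] for op in ops)):
        plan = dict(zip(ops, choice))
        cost = sum(node_costs[op][plan[op]] for op in ops)
        cost += sum(table[(plan[u], plan[v])]
                    for (u, v), table in resharding_cost.items())
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost

print(cheapest_plan(node_costs, resharding_cost))  # picks matching "row" shardings here
```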
Inter-op Pass
- Plan space: the product of (1) all ways to partition the operator sequence into stages and (2) all assignments of stages to device submeshes.
- Optimization objective: the total latency of the pipelined execution, with the cost of each stage-submesh pair obtained from the intra-op pass; the problem is solved with dynamic programming (a simplified DP sketch follows).
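A simplified sketch of that DP under strong assumptions: operator costs are plain scalars, a stage's latency is just the sum of its operators' costs (in Alpa these costs come from the intra-op pass and depend on the chosen submesh), and pipeline latency is modeled as the sum of stage latencies plus (num_microbatches - 1) times the slowest stage. The DP enumerates candidate values for the slowest-stage latency and, for each, minimizes the total stage latency.

```python
from itertools import accumulate

def pipeline_partition(op_costs, max_stages, num_microbatches):
    """Toy DP: split a chain of operator costs into <= max_stages contiguous stages,
    minimizing sum(stage latencies) + (num_microbatches - 1) * max(stage latency)."""
    n = len(op_costs)
    prefix = [0.0] + list(accumulate(op_costs))
    seg = lambda i, j: prefix[j] - prefix[i]          # latency of a stage = ops[i:j]

    # Every contiguous segment cost is a candidate for the bottleneck stage latency.
    candidates = sorted({seg(i, j) for i in range(n) for j in range(i + 1, n + 1)})

    INF, best = float("inf"), float("inf")
    for t_max in candidates:
        # dp[k][j] = min total latency of splitting ops[:j] into k stages,
        # with every stage no slower than t_max.
        dp = [[INF] * (n + 1) for _ in range(max_stages + 1)]
        dp[0][0] = 0.0
        for k in range(1, max_stages + 1):
            for j in range(1, n + 1):
                for i in range(j):
                    if seg(i, j) <= t_max:
                        dp[k][j] = min(dp[k][j], dp[k - 1][i] + seg(i, j))
        total = min(dp[k][n] for k in range(1, max_stages + 1))
        if total < INF:
            best = min(best, total + (num_microbatches - 1) * t_max)
    return best

print(pipeline_partition([1, 2, 3, 4, 5, 6, 7, 8], max_stages=3, num_microbatches=4))
```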
Runtime Orchestration
- Generates the instructions for the collective communication needed at the intra-op level and the P2P communication needed at the inter-op level.
- Generates the static pipeline-execution instructions for the inter-op level (a hypothetical per-stage instruction list is sketched below).
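A hypothetical illustration (not Alpa's actual instruction set) of what a static forward-pass instruction list for one pipeline stage might look like, with RECV/RUN/SEND placeholders standing in for cross-mesh communication and sharded stage execution:

```python
def stage_instructions(stage_id, num_stages, num_microbatches):
    """Emit a toy static instruction list for one pipeline stage."""
    instrs = []
    for mb in range(num_microbatches):
        if stage_id > 0:  # receive activations from the previous stage's submesh
            instrs.append(("RECV", f"activations[mb={mb}]", f"from stage {stage_id - 1}"))
        instrs.append(("RUN", f"sharded_stage_{stage_id}", f"on microbatch {mb}"))
        if stage_id < num_stages - 1:  # send activations to the next stage's submesh
            instrs.append(("SEND", f"activations[mb={mb}]", f"to stage {stage_id + 1}"))
    return instrs

for instr in stage_instructions(stage_id=1, num_stages=3, num_microbatches=2):
    print(instr)
```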
Evaluation
GPT-3, GShard MoE, and Wide-ResNet models were trained with both hand-tuned systems and Alpa.
The results demonstrate that Alpa can automate distributed deep learning training and achieve performance that matches or surpasses state-of-the-art manually optimized systems.