Reading Notes: “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”

date
Feb 23, 2025
tags
MLSys

Motivation

In recent years, scaling laws have made it evident that larger models achieve better performance across a wide range of tasks. As models keep growing, however, they no longer fit in the memory of a single accelerator. Various model-parallel approaches have been proposed to address this.
Many of these methods are task- or architecture-specific, or they introduce significant communication overhead. The overhead is largest for approaches that rely on intra-operation parallelism, which becomes a bottleneck when communication happens over slow links. Both issues make these methods hard to use in practice.
GPipe offers a task- and architecture-agnostic approach to model parallelism by pipelining execution across the stages of a model.

Approach

Core Idea

GPipe is based on the following core ideas:
  • a model is a sequence of layers
  • the model can be partitioned into a sequence of cells, each consisting of a group of consecutive layers and placed on its own accelerator
  • the execution of this sequence of cells (stages) can be pipelined to reduce device idle time
  • each mini-batch is further divided into micro-batches, which flow through the pipeline one after another
  • gradients are accumulated across micro-batches and applied once, after the whole mini-batch has been processed
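
To make the pipelining idea concrete, here is a minimal sketch of the forward-pass schedule. It assumes every cell takes exactly one time step per micro-batch (a simplification), and the function names are hypothetical, not part of any GPipe implementation. With K cells and M micro-batches, the idle ("bubble") share of device time comes out to (K - 1) / (M + K - 1), which matches the bubble overhead the paper amortizes by using many micro-batches.

```python
from typing import List, Optional


def gpipe_forward_schedule(num_stages: int, num_micro_batches: int) -> List[List[Optional[int]]]:
    """Return schedule[t][k]: the micro-batch that stage k runs at time step t.

    None marks an idle slot. Assumes every stage takes one time step
    per micro-batch, which is an idealization.
    """
    total_steps = num_micro_batches + num_stages - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for k in range(num_stages):
            m = t - k  # micro-batch m reaches stage k at step m + k
            row.append(m if 0 <= m < num_micro_batches else None)
        schedule.append(row)
    return schedule


def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of (time step, stage) slots in which a stage sits idle."""
    schedule = gpipe_forward_schedule(num_stages, num_micro_batches)
    total_slots = len(schedule) * num_stages
    idle_slots = sum(slot is None for row in schedule for slot in row)
    return idle_slots / total_slots


if __name__ == "__main__":
    for row in gpipe_forward_schedule(num_stages=4, num_micro_batches=8):
        print(" ".join("." if m is None else str(m) for m in row))
    print(bubble_fraction(4, 1), bubble_fraction(4, 8), bubble_fraction(4, 32))
```

With a single micro-batch (M = 1), only one of the K devices is busy at any time, which is the naive model-parallel case. Increasing M shrinks the idle share; the paper reports the bubble overhead to be negligible once M is at least 4 × K.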

Performance Optimizations

  • Activation checkpointing: each cell keeps only the activations at its boundary during the forward pass and recomputes the other activations it needs during the backward pass, which reduces peak memory usage.
  • Communication overhead: communication between cells happens only at cell boundaries, so only the boundary activations (and their gradients in the backward pass) are passed between neighboring devices.
  • Load balancing: a cost estimator estimates the execution cost of each stage, which helps prevent load imbalance between stages.
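
The load-balancing step can be illustrated with a small sketch. It assumes per-layer cost estimates are already available as a list of numbers and splits the layers into contiguous cells so that the most expensive cell is as cheap as possible. This min-max objective and the `partition_layers` name are illustrative assumptions, not GPipe's actual partitioning algorithm or cost estimator.

```python
from typing import List


def partition_layers(costs: List[float], num_cells: int) -> List[List[int]]:
    """Split layer indices into contiguous cells, minimizing the largest cell cost.

    costs[i] is an (assumed) estimated cost of layer i. Uses dynamic
    programming over prefixes; fine for the small layer counts of a sketch.
    """
    n = len(costs)
    if not 1 <= num_cells <= n:
        raise ValueError("need 1 <= num_cells <= number of layers")

    # prefix[i] = total cost of layers 0..i-1
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[k][i]: smallest possible max-cell cost when the first i layers form k cells
    best = [[inf] * (n + 1) for _ in range(num_cells + 1)]
    # cut[k][i]: start index of the last cell in that optimal split
    cut = [[0] * (n + 1) for _ in range(num_cells + 1)]
    best[0][0] = 0.0

    for k in range(1, num_cells + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the cut points backwards to recover the cells.
    cells = []
    end = n
    for k in range(num_cells, 0, -1):
        start = cut[k][end]
        cells.append(list(range(start, end)))
        end = start
    cells.reverse()
    return cells


if __name__ == "__main__":
    # Identical layers split evenly; a heavy first layer gets a cell of its own.
    print(partition_layers([1.0] * 8, 4))
    print(partition_layers([5, 1, 1, 1, 1, 1], 3))
```

In a pipeline, the slowest cell bounds steady-state throughput, so keeping the largest cell cost low matters. When all layers cost the same, an even split is optimal, whereas one heavy layer forces an imbalance that no contiguous partition can remove.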

Evaluation

The approach is evaluated on image classification and machine translation tasks. Some key observations:
  • Transformer gets an almost linear speedup as the number of accelerators increases, since it consists of identical blocks and the load is therefore easy to balance.
  • GPipe achieves notable speedup even without NVLink.

© Lifan Sun 2023 - 2025