Reading Notes: “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving”

date

Oct 20, 2025

Background & Motivation

Applications have diverse Service Level Objectives (SLO)

Different applications have different SLOs. For LLM applications, the two most widely used SLOs are TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token).

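When measuring the serving cost of an LLM, past work often used the system's total throughput (requests processed per second) directly as a proxy for dollars per request. The paper proposes a better proxy called goodput: the number of requests processed per second that meet their SLOs. An intuitive example: if a request misses its SLO, the user may simply abort it and start a new one, so that request does not count as useful work. In that case, high throughput does not imply high goodput.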
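
To make the distinction concrete, here is a minimal sketch that computes both metrics from a list of finished requests. The `RequestRecord` type, the sample latencies, and the SLO thresholds are all made up for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # average time per output token, in seconds


def throughput(records, window_s):
    """Completed requests per second, regardless of latency."""
    return len(records) / window_s


def goodput(records, window_s, ttft_slo_s, tpot_slo_s):
    """Completed requests per second that met both SLOs."""
    ok = sum(
        1
        for r in records
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / window_s


# Hypothetical measurements collected over a 2-second window.
records = [
    RequestRecord(ttft_s=0.20, tpot_s=0.04),
    RequestRecord(ttft_s=0.90, tpot_s=0.04),  # misses the TTFT SLO
    RequestRecord(ttft_s=0.25, tpot_s=0.15),  # misses the TPOT SLO
    RequestRecord(ttft_s=0.30, tpot_s=0.05),
]

tput = throughput(records, window_s=2.0)
gput = goodput(records, window_s=2.0, ttft_slo_s=0.4, tpot_slo_s=0.1)
```

With these numbers the system finishes 2 requests per second, but only 1 request per second counts toward goodput, because half of the requests violate one of the two SLOs.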

Therefore, to keep serving cost as low as possible, the design goal of an LLM serving system is to maximize goodput.

Characteristics of Prefill & Decoding

A key observation about LLM inference is that prefill and decoding have very different characteristics. Prefill tends to be compute bound, while decoding is often memory bound.
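
A rough way to see this is to compare arithmetic intensity (FLOPs per byte moved) for a single linear layer. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes fp16 weights and activations, a hypothetical 4096 x 4096 layer, a 512-token prompt, and a decode step for a single request with no batching.

```python
def linear_layer_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, with x of shape (num_tokens, d_in)."""
    # One multiply and one add per weight per token.
    flops = 2 * num_tokens * d_in * d_out
    # Read the weights and the input, write the output.
    elems_moved = d_in * d_out + num_tokens * d_in + num_tokens * d_out
    return flops / (bytes_per_elem * elems_moved)


# Prefill processes every prompt token in one pass (assumed 512 tokens).
prefill_intensity = linear_layer_intensity(num_tokens=512, d_in=4096, d_out=4096)

# A decode step processes one new token per request (assumed batch size 1).
decode_intensity = linear_layer_intensity(num_tokens=1, d_in=4096, d_out=4096)
```

Under these assumptions prefill reaches roughly 410 FLOPs per byte while a single decode step stays near 1, because decode has to read the full weight matrix to produce one token. This is also why batching many decode requests together improves decode efficiency.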

Existing systems run prefill and decode together on the same GPUs, which causes two problems:

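Interference: prefill is more compute-intensive, so it slows down decode and hurts TPOT. Conversely, batching many decode steps together with a prefill to improve decode efficiency also slows down the prefill, which hurts TTFT.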

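Resource Allocation: coupling prefill and decode makes it hard to improve the SLO of one task in a targeted way. For example, we may only want to improve TPOT, which mainly means speeding up decode, but under colocation we can only scale both together, which leads to over-provisioning. It is also hard to tune the parallelism strategy for each task separately.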

Approach: DistServe

Basic Ideas

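The basic idea is to separate prefill and decode onto different GPUs. After the prefill instance finishes prefill, the intermediate state (the KV cache) is transferred to a decode instance, which continues with decoding. This lets us optimize and tune each task according to its own requirements.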
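
The cost of this design is the transfer of the intermediate state. A minimal sketch of how one might estimate it, assuming standard multi-head attention with an fp16 KV cache; the model shape (40 layers, hidden size 5120), the prompt length, and the two bandwidth figures are illustrative assumptions, not numbers from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, num_tokens, bytes_per_elem=2):
    """Size of the KV cache for one request.

    Assumes standard multi-head attention (no grouped-query or
    multi-query attention), so each layer stores one key vector and
    one value vector of size hidden_size per token.
    """
    return 2 * num_layers * hidden_size * num_tokens * bytes_per_elem


def transfer_time_s(num_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time, ignoring link latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model shape and prompt length.
size = kv_cache_bytes(num_layers=40, hidden_size=5120, num_tokens=512)

# Assumed effective bandwidths, for illustration only.
slow_link_s = transfer_time_s(size, 25e9)   # a slower cross-node style link
fast_link_s = transfer_time_s(size, 300e9)  # a faster intra-node style link
```

For this hypothetical shape, one 512-token request carries about 0.42 GB of KV cache, so the choice of link changes the transfer time by an order of magnitude.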

Practical Considerations

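A few issues need to be considered:

How to choose the optimal parallelism strategy: simulation plus brute-force search

How to reduce communication overhead: optimize GPU placement so that transfers use high-bandwidth interconnects as much as possible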
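
The first item, simulation plus brute-force search, can be sketched as follows. This is a simplified illustration rather than the paper's actual algorithm: `toy_simulator` is a made-up stand-in for a real latency simulator, and its scaling constants, SLO, and base latency are invented for the example.

```python
from itertools import product


def candidate_configs(max_gpus, degrees=(1, 2, 4, 8)):
    """All (tensor_parallel, pipeline_parallel) pairs that fit in max_gpus."""
    return [
        (tp, pp)
        for tp, pp in product(degrees, degrees)
        if tp * pp <= max_gpus
    ]


def best_config(simulate_goodput, max_gpus):
    """Brute force: score every candidate by simulated goodput per GPU."""
    best = None
    for tp, pp in candidate_configs(max_gpus):
        per_gpu = simulate_goodput(tp, pp) / (tp * pp)
        if best is None or per_gpu > best[0]:
            best = (per_gpu, (tp, pp))
    return best


def toy_simulator(tp, pp, slo_s=0.4, base_latency_s=0.8):
    """Hypothetical simulator: returns requests per second that meet the SLO.

    Invented assumptions: tensor parallelism cuts latency with diminishing
    returns (tp ** 0.8), and each extra pipeline stage adds throughput at
    a 10% efficiency loss.
    """
    latency_s = base_latency_s / tp ** 0.8
    if latency_s > slo_s:
        return 0.0  # every request would miss the SLO
    return pp * 0.9 ** (pp - 1) / latency_s


per_gpu_goodput, (tp, pp) = best_config(toy_simulator, max_gpus=8)
```

With this toy simulator the search picks `tp=4, pp=1`: smaller tensor-parallel degrees miss the latency SLO entirely, while larger configurations meet it but deliver less goodput per GPU. Running the search separately for prefill and decode, each with its own simulator and SLO, is what disaggregation makes possible.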