An Architecture Overview of ML Systems
date
Jan 22, 2025
slug
arch-mlsys-overview
status
Published
tags
MLSys
summary
type
Post
Introduction
Machine learning systems (mainly machine learning frameworks) such as TensorFlow and PyTorch have accelerated research on and applications of machine learning algorithms over the past decade by providing convenient abstractions for data processing, model training, and serving.
This post gives an overview of the grand problem machine learning systems try to solve and of what contemporary machine learning systems look like.
ML Systems’ Grand Problem
In the era of large language models and huge datasets, we expect a machine learning system to be:
- Fast
- Scalable
- Memory-efficient
- Run on diverse hardware
- Energy-efficient
- Easy to program/debug/deploy
To be more concrete, we may expect the following features [1]:
- Distributed Execution: Scalable / Fast
- Accelerator Support: Run on diverse hardware
- Training & Inference Support: Easy to program/debug/deploy
- Extensibility: Easy to program/debug/deploy
ML Systems Overview

ML System Overview; figure from the UCSD CSE234 slides [2]
The above figure gives an architectural overview of machine learning systems.
Dataflow Graph & Autodiff
- Dataflow Graph: an interface layer for clients to express the computations in machine learning programs. These computations may include data processing, model structure, gradient computation, and gradient updates.
- Autodiff: a utility layer that computes gradients automatically from the dataflow graph, as sketched below.
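As a concrete illustration, here is a minimal PyTorch sketch (assuming a recent PyTorch install): the client expresses a small linear-regression computation, PyTorch records the dataflow graph as the operations execute, and autodiff walks that graph backwards to fill in gradients.

```python
import torch

# Client code expresses the computation; PyTorch records a dataflow
# graph of the operations (matmul, add, mean-squared error) as they run.
w = torch.randn(3, 1, requires_grad=True)   # model parameter (leaf node)
b = torch.zeros(1, requires_grad=True)      # model parameter (leaf node)
x = torch.randn(8, 3)                       # input batch
y = torch.randn(8, 1)                       # targets

pred = x @ w + b                            # forward pass builds the graph
loss = ((pred - y) ** 2).mean()             # scalar loss node

# Autodiff traverses the recorded graph in reverse and populates .grad
loss.backward()
print(w.grad.shape, b.grad.shape)           # torch.Size([3, 1]) torch.Size([1])
```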
Graph Optimization

Given a computational dataflow graph, the graph optimization layer tries to rewrite it into an equivalent graph with some desired property, such as shorter runtime.
Ways to do graph optimization include template matching and automatic discovery. A common technique is operator fusion, which fuses several nodes into one, as sketched below.
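One place to see operator fusion in practice is `torch.compile` (PyTorch 2.x with the default Inductor backend), which can fuse chains of pointwise nodes, such as the bias add and GELU below, into a single generated kernel. The exact fusion decisions depend on the backend, so treat this as an illustrative sketch rather than a guarantee.

```python
import torch

def bias_gelu(x, b):
    # Two separate nodes in the dataflow graph: an add followed by a gelu.
    return torch.nn.functional.gelu(x + b)

# torch.compile traces the graph; its optimizer may fuse the pointwise
# add and gelu into one kernel, so the intermediate (x + b) need not be
# materialized in memory. The result is numerically the same.
fused_bias_gelu = torch.compile(bias_gelu)

x = torch.randn(1024, 1024)
b = torch.randn(1024)
assert torch.allclose(bias_gelu(x, b), fused_bias_gelu(x, b), atol=1e-5)
```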
Parallelization

The goal of the parallelization layer is to parallelize graph execution across a device cluster, i.e., to partition the graph into subgraphs and dispatch them to different devices, possibly on different machines.
The main problems of the parallelization layer include (a toy sketch follows this list):
- how to partition the graph
- how to communicate: accounting for the nature of intra-node and inter-node connections
- how to schedule
- how to maintain consistency
- how to parallelize automatically
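As a toy illustration of the partition and communication steps, the NumPy sketch below simulates column-wise tensor parallelism of a single matmul across two hypothetical devices; the final concatenation stands in for the collective communication (an all-gather) a real system would perform over the interconnect.

```python
import numpy as np

# Toy tensor-parallel partition of Y = X @ W across two "devices":
# W is split column-wise, each device computes its output shard,
# and the shards are gathered back together.
X = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 6).astype(np.float32)

W0, W1 = np.split(W, 2, axis=1)        # shard the weight across devices
Y0 = X @ W0                            # would run on device 0
Y1 = X @ W1                            # would run on device 1
Y = np.concatenate([Y0, Y1], axis=1)   # "all-gather" of the output shards

assert np.allclose(Y, X @ W, atol=1e-5)
```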
Runtime and Scheduling
The goal of the runtime and scheduling layer is similar to that of an operating system: schedule compute, memory, and communication efficiently within the given constraints (e.g., overlap communication with computation as much as possible).
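Below is a minimal sketch of one such overlap on a single machine, assuming a CUDA device and a recent PyTorch: a side CUDA stream prefetches the next batch to the GPU while the default stream runs the model on the current batch.

```python
import torch

assert torch.cuda.is_available()        # this sketch requires a CUDA device
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()       # side stream for host-to-device copies

model = torch.nn.Linear(1024, 1024).to(device)
batches = [torch.randn(512, 1024, pin_memory=True) for _ in range(4)]

def prefetch(cpu_batch):
    # Issue the copy on the side stream so it can run concurrently
    # with compute on the default stream.
    with torch.cuda.stream(copy_stream):
        return cpu_batch.to(device, non_blocking=True)

next_batch = prefetch(batches[0])
for i in range(len(batches)):
    # Make the default stream wait until the prefetched copy has finished.
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch = next_batch
    batch.record_stream(torch.cuda.current_stream())
    if i + 1 < len(batches):
        next_batch = prefetch(batches[i + 1])   # overlaps with the compute below
    out = model(batch)                          # compute on the default stream
torch.cuda.synchronize()
```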
Operator Implementation
The dataflow graph is composed of primitive mathematical operations, which are ultimately executed on specific devices. A concrete implementation of an operator for a device is called a kernel.
This layer is concerned with implementing kernels (especially for frequently used operators such as matmul and softmax; see the sketch after this list) that:
- run fast
- run on different devices
- run at different precisions
- handle different shapes
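As a reference point, here is a naive NumPy "kernel" for row-wise softmax. A production device kernel would fuse the passes, tile rows across thread blocks, and specialize for dtype and shape, but the numerically stable structure is the same.

```python
import numpy as np

def softmax_kernel(x: np.ndarray) -> np.ndarray:
    """Naive reference kernel for row-wise softmax."""
    # Subtract the row max for numerical stability: exp of large values
    # overflows quickly, especially in low precision (fp16/bf16).
    x_max = x.max(axis=-1, keepdims=True)
    e = np.exp(x - x_max)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(4, 10).astype(np.float32)
probs = softmax_kernel(logits)
assert np.allclose(probs.sum(axis=-1), 1.0, atol=1e-6)
```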
Summary
Finally, we can draw an analogy between the layers of machine learning systems and those of traditional software systems:
- Programming Languages: Dataflow graph & Autodiff
- Compilers: Graph Optimization & Parallelization
- Operating Systems: Runtime and scheduling & Operator implementation