An Architecture Overview of ML Systems
date
Jan 22, 2025
slug
arch-mlsys-overview
status
Published
tags
MLSys
summary
type
Post
Introduction
Machine learning systems (mainly machine learning frameworks) such as TensorFlow and PyTorch have accelerated research on and applications of machine learning algorithms over the past decade by providing convenient abstractions for data processing, model training, and serving.
This post gives an overview of the grand problem machine learning systems try to solve and of what contemporary machine learning systems look like.
ML Systems’ Grand Problem
In the era of large language models and huge datasets, we expect a machine learning system to be:
- Fast
- Scalable
- Memory-efficient
- Run on diverse hardware
- Energy-efficient
- Easy to program/debug/deploy
To be more concrete, we may expect the following features [1]:
- Distributed Execution: Scalable / Fast
- Accelerator Support: Run on diverse hardware
- Training & Inference Support: Easy to program/debug/deploy
- Extensibility: Easy to program/debug/deploy
ML Systems Overview

ML System Overview; figure from the UCSD CSE234 slides [2]
The above figure gives an architectural overview of machine learning systems.
Dataflow Graph & Autodiff
- Dataflow Graph: an interface layer for clients to express the computations in machine learning programs. These computations may include data processing, model structure, gradient computation, and gradient updates.
- Autodiff: a utility layer that computes gradients automatically from the dataflow graph, as sketched below.
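As a concrete illustration, here is a minimal PyTorch sketch (assuming a recent PyTorch install): the client expresses a small linear-regression computation, PyTorch records the dataflow graph as the operations execute, and autodiff walks that graph backwards to fill in gradients.

```python
import torch

# Client code expresses the computation; PyTorch records a dataflow
# graph of the operations (matmul, add, mean-squared error) as they run.
w = torch.randn(3, 1, requires_grad=True)   # model parameter (leaf node)
b = torch.zeros(1, requires_grad=True)      # model parameter (leaf node)
x = torch.randn(8, 3)                       # input batch
y = torch.randn(8, 1)                       # targets

pred = x @ w + b                            # forward pass builds the graph
loss = ((pred - y) ** 2).mean()             # scalar loss node

# Autodiff traverses the recorded graph in reverse and populates .grad
loss.backward()
print(w.grad.shape, b.grad.shape)           # torch.Size([3, 1]) torch.Size([1])
```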
Graph Optimization

Given a computational dataflow graph, the graph optimization layer tries to rewrite it into an equivalent graph with some desired property, such as shorter runtime.
Ways to do graph optimization include template matching and automatic discovery. A common technique is operator fusion, which fuses several nodes into one, as sketched below.
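One place to see operator fusion in practice is `torch.compile` (PyTorch 2.x with the default Inductor backend), which can fuse chains of pointwise nodes, such as the bias add and GELU below, into a single generated kernel. The exact fusion decisions depend on the backend, so treat this as an illustrative sketch rather than a guarantee.

```python
import torch

def bias_gelu(x, b):
    # Two separate nodes in the dataflow graph: an add followed by a gelu.
    return torch.nn.functional.gelu(x + b)

# torch.compile traces the graph; its optimizer may fuse the pointwise
# add and gelu into one kernel, so the intermediate (x + b) need not be
# materialized in memory. The result is numerically the same.
fused_bias_gelu = torch.compile(bias_gelu)

x = torch.randn(1024, 1024)
b = torch.randn(1024)
assert torch.allclose(bias_gelu(x, b), fused_bias_gelu(x, b), atol=1e-5)
```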
Parallelization

The goal of the parallelization layer is to parallelize graph execution across a device cluster, i.e., to partition the graph into subgraphs and dispatch them to different devices, possibly on different machines.
The main problems of the parallelization layer include (a toy sketch follows this list):
- how to partition the graph
- how to communicate: accounting for the nature of intra-node and inter-node connections
- how to schedule
- how to maintain consistency
- how to parallelize automatically
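As a toy illustration of the partition and communication steps, the NumPy sketch below simulates column-wise tensor parallelism of a single matmul across two hypothetical devices; the final concatenation stands in for the collective communication (an all-gather) a real system would perform over the interconnect.

```python
import numpy as np

# Toy tensor-parallel partition of Y = X @ W across two "devices":
# W is split column-wise, each device computes its output shard,
# and the shards are gathered back together.
X = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 6).astype(np.float32)

W0, W1 = np.split(W, 2, axis=1)        # shard the weight across devices
Y0 = X @ W0                            # would run on device 0
Y1 = X @ W1                            # would run on device 1
Y = np.concatenate([Y0, Y1], axis=1)   # "all-gather" of the output shards

assert np.allclose(Y, X @ W, atol=1e-5)
```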
Runtime and Scheduling
The goal of the runtime and scheduling layer is similar to that of an operating system: schedule compute, memory, and communication efficiently within the given constraints (e.g., overlap communication with computation as much as possible).
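Below is a minimal sketch of one such overlap on a single machine, assuming a CUDA device and a recent PyTorch: a side CUDA stream prefetches the next batch to the GPU while the default stream runs the model on the current batch.

```python
import torch

assert torch.cuda.is_available()        # this sketch requires a CUDA device
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()       # side stream for host-to-device copies

model = torch.nn.Linear(1024, 1024).to(device)
batches = [torch.randn(512, 1024, pin_memory=True) for _ in range(4)]

def prefetch(cpu_batch):
    # Issue the copy on the side stream so it can run concurrently
    # with compute on the default stream.
    with torch.cuda.stream(copy_stream):
        return cpu_batch.to(device, non_blocking=True)

next_batch = prefetch(batches[0])
for i in range(len(batches)):
    # Make the default stream wait until the prefetched copy has finished.
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch = next_batch
    batch.record_stream(torch.cuda.current_stream())
    if i + 1 < len(batches):
        next_batch = prefetch(batches[i + 1])   # overlaps with the compute below
    out = model(batch)                          # compute on the default stream
torch.cuda.synchronize()
```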
Operator Implementation
The dataflow graph is composed of primitive mathematical operations, which are ultimately executed on specific devices. A concrete implementation of an operator for a device is called a kernel.
This layer is concerned with implementing kernels (especially for frequently used operators such as matmul and softmax; see the sketch after this list) that:
- run fast
- run on different devices
- run at different precisions
- handle different shapes
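As a reference point, here is a naive NumPy "kernel" for row-wise softmax. A production device kernel would fuse the passes, tile rows across thread blocks, and specialize for dtype and shape, but the numerically stable structure is the same.

```python
import numpy as np

def softmax_kernel(x: np.ndarray) -> np.ndarray:
    """Naive reference kernel for row-wise softmax."""
    # Subtract the row max for numerical stability: exp of large values
    # overflows quickly, especially in low precision (fp16/bf16).
    x_max = x.max(axis=-1, keepdims=True)
    e = np.exp(x - x_max)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(4, 10).astype(np.float32)
probs = softmax_kernel(logits)
assert np.allclose(probs.sum(axis=-1), 1.0, atol=1e-6)
```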
Summary
Finally, we can draw an analogy between the layers of machine learning systems and those of traditional software systems:
- Programming Languages: Dataflow graph & Autodiff
- Compilers: Graph Optimization & Parallelization
- Operating Systems: Runtime and scheduling & Operator implementation