Summary Notes on “Deep Learning Systems: Algorithms and Implementation” course

date
Jun 7, 2024
tags
MLSys

1 - Introduction / Logistics

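A claim (may not be true): the emergence of deep learning systems, especially Python-based deep learning frameworks, has been the biggest driving force behind the widespread adoption of deep learning.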

1. Why learn deep learning systems?

  • To build deep learning systems.
  • To use existing deep learning systems effectively.
  • Deep learning systems are fun.

2. Elements of deep learning systems

  • Compose multiple tensor operations to build modern deep learning models.
  • Transform a sequence of operations (automatic differentiation).
  • Accelerate computation via specialized hardware.
  • Extend to more hardware backends and more operators.

2 - ML Refresher / Softmax Regression

1. Basics of machine learning

  • Machine learning is a data-driven programming approach.
  • The ingredients of machine learning algorithms:
    • Hypothesis class: the “program structure,” a parameterized family of functions that maps inputs to outputs.
    • Loss function: measures how well a given hypothesis performs on the training set.
    • Optimization method: a procedure to minimize the loss function with respect to the parameters.

2. Example: Softmax Regression

  • hypothesis (linear functions): $h_\theta(x) = \theta^\top x$
  • loss function: cross-entropy loss
  • optimization method: mini-batch SGD
The old-fashioned approach: derive the gradient formula by hand and then code it up, which is quite cumbersome.
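As a sketch of what the hand derivation involves for softmax regression, here is the standard formulation for a single example $(x, y)$ with $n$ input features and $k$ classes. The symbols $z$, $e_y$ (the one-hot vector for class $y$), the step size $\alpha$, and the batch size $B$ are notation assumed here for illustration.

```latex
\begin{align*}
h_\theta(x) &= \theta^\top x, \qquad \theta \in \mathbb{R}^{n \times k} \\
\ell_{ce}(h_\theta(x), y) &= -h_y(x) + \log \sum_{j=1}^{k} \exp\big(h_j(x)\big) \\
\nabla_\theta\, \ell_{ce}(\theta^\top x, y) &= x\,(z - e_y)^\top, \qquad z = \mathrm{softmax}(\theta^\top x) \\
\theta &\leftarrow \theta - \frac{\alpha}{B} \sum_{i \in \text{batch}} x^{(i)} \big(z^{(i)} - e_{y^{(i)}}\big)^\top
\end{align*}
```

Even for this one-layer model the derivation takes some care with shapes, which is why doing it by hand does not scale to deeper models.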

3 - Manual Neural Networks / Backprop

1. From linear to non-linear hypothesis class

  • Not all data is linearly separable → we need to create non-linear features.
  • In traditional ML, non-linear features were created by hand → feature engineering.
  • The idea behind deep learning is to automate the feature engineering process.

2. Neural networks

  • A neural network refers to a particular type of hypothesis class consisting of multiple, parameterized differentiable functions (a.k.a. “layers”) composed together in any manner to form the output.
  • Why deep networks? A simple but unsatisfying answer → empirically, they perform better for a fixed parameter count.

3. Backprop

  • Backpropagation is a way to compute gradients for deep neural networks.
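  • A closer look at the gradient computation formula: each layer's gradient is built from the gradient arriving from the next layer and that layer's own local derivative.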
  • This can be generalized to successors and predecessors on a graph structure, which is exactly the process of automatic differentiation.
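The backprop recursion can be sketched for a fully connected network as follows. The notation is assumed here for illustration: $Z_{i+1} = \sigma_i(Z_i W_i)$ for layers $i = 1, \dots, L$, with $Z_i \in \mathbb{R}^{m \times n_i}$, $W_i \in \mathbb{R}^{n_i \times n_{i+1}}$, $G_i = \partial \ell / \partial Z_i$, and $\circ$ for the elementwise product.

```latex
\begin{align*}
G_{L+1} &= \frac{\partial \ell}{\partial Z_{L+1}} \\
G_i &= \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big)\, W_i^\top, \qquad i = L, \dots, 1 \\
\nabla_{W_i} \ell &= Z_i^\top \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big)
\end{align*}
```

Each $G_i$ depends only on $G_{i+1}$ and on quantities local to layer $i$. That is the pattern which carries over to arbitrary graphs: a node's adjoint is assembled from the adjoints of the nodes that consume it.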

HW - notes on hw0

Softmax numerical stability issue:

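When $x_i$ is large, $\exp(x_i)$ overflows to $\infty$, which causes $\infty / \infty = \text{nan}$.
Note that the nan is caused by inf, which in turn is caused by a large $x_i$.
If we choose $c = \max_j x_j$ and compute $\mathrm{softmax}(x)_i = \exp(x_i - c) / \sum_j \exp(x_j - c)$, which leaves the result unchanged, then we can prevent the overflow issue. Each numerator lies between 0 and 1, and the denominator is at least 1 (the term at the maximum $x_j$ contributes $\exp(0) = 1$).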
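A minimal sketch of the max-subtraction trick in pure Python. The function name `softmax` and the list-based interface are choices made here for illustration, not the hw0 API. In NumPy the naive version produces inf / inf = nan, whereas the standard library's `math.exp` raises `OverflowError` for the same inputs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    c = max(xs)                            # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(x - c) for x in xs]   # every term now lies in (0, 1]
    total = sum(exps)                      # >= 1, because the max entry contributes 1
    return [e / total for e in exps]

# math.exp(1000.0) on its own would overflow; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
```

Because the shift cancels between numerator and denominator, `softmax([1000.0, 1001.0, 1002.0])` agrees with `softmax([0.0, 1.0, 2.0])` up to rounding.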

4 - Automatic Differentiation

1. Intro to differentiation methods

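Where does differentiation fit into deep learning? Almost all deep learning model training is based on some form of gradient descent.
  • Numerical differentiation: compute the gradient numerically, directly from its definition, $\frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon e_i) - f(\theta)}{\epsilon}$, where $e_i$ is the $i$-th standard basis vector.
There is a variant formula which leads to a more numerically accurate implementation: $\frac{\partial f(\theta)}{\partial \theta_i} = \frac{f(\theta + \epsilon e_i) - f(\theta - \epsilon e_i)}{2\epsilon} + O(\epsilon^2)$.
This formula can be derived from the Taylor series expansions of $f(\theta + \epsilon e_i)$ and $f(\theta - \epsilon e_i)$.
The numerical method is computationally costly (two function evaluations per parameter), but it is still useful for testing an autodiff implementation, since it can be used for gradient checking.
  • Symbolic differentiation: derive the gradient formula by hand or with a computer, using the sum, product, and chain rules. Applied naively, the resulting expressions can require a lot of redundant computation.
  • Automatic differentiation based on a computation graph: represent the computation as a DAG, then run forward and backward passes over it to calculate the gradients.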
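Gradient checking with a centered finite difference can be sketched as below. The helper name `numerical_grad`, the test function `f`, and its hand-derived `analytic_grad` are all hypothetical examples chosen for illustration; in practice the analytic gradient would come from the autodiff implementation being tested.

```python
def numerical_grad(f, theta, eps=1e-5):
    """Centered-difference estimate of the gradient of f at theta (a list of floats)."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps    # theta + eps * e_i
        minus[i] -= eps   # theta - eps * e_i
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

def f(theta):
    x, y = theta
    return x * x * y + 3.0 * y

def analytic_grad(theta):
    # Hand-derived gradient of f, standing in for an autodiff result.
    x, y = theta
    return [2.0 * x * y, x * x + 3.0]

theta = [1.5, -2.0]
num = numerical_grad(f, theta)
ana = analytic_grad(theta)
max_err = max(abs(n - a) for n, a in zip(num, ana))
```

The check passes when `max_err` is small relative to the gradient magnitude. The cost is two evaluations of `f` per parameter, which is why this is used for testing rather than training.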

2. Automatic Differentiation

  • Forward Mode AD: computes the derivatives of every node (and hence of all outputs) with respect to one input in a single forward pass in topological order.
  • Reverse Mode AD: computes the derivatives of one output with respect to every node (and hence all inputs) in a single backward pass in reverse topological order.
In practice, a loss function usually has many inputs and a single scalar output, in which case reverse mode AD is more efficient: one backward pass yields the full gradient, whereas forward mode needs one pass per input.
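Forward mode can be sketched with dual numbers, where each value carries its derivative with respect to one chosen input. The class `Dual`, the helpers `dual_log` and `dual_sin`, and the example function are assumptions made here for illustration.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        # Product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_log(a):
    return Dual(math.log(a.val), a.dot / a.val)

def dual_sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

def f(x1, x2):
    # y = ln(x1) + x1 * x2 - sin(x2)
    return dual_log(x1) + x1 * x2 - dual_sin(x2)

# One forward pass per input: seed that input's derivative with 1.
dy_dx1 = f(Dual(2.0, 1.0), Dual(5.0, 0.0)).dot  # 1/x1 + x2
dy_dx2 = f(Dual(2.0, 0.0), Dual(5.0, 1.0)).dot  # x1 - cos(x2)
```

Two inputs already require two passes here. With millions of parameters and a single loss, that per-input cost is what makes reverse mode the practical choice.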
Modern deep learning frameworks implement reverse mode AD by extending the computation graph with new nodes for the adjoints, which enables more flexible computation, such as taking the gradient of a gradient.
The computation graph can also be generalized to non-arithmetic operations.
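A minimal scalar sketch of reverse mode AD by graph extension follows. `Node`, `topo_order`, and `gradient` are hypothetical names, not the course's needle API, and only addition and multiplication are supported. The point is that each adjoint is itself a `Node` built from graph operations, so the result can be differentiated again.

```python
class Node:
    """A scalar graph node; `inputs` are the nodes it was computed from."""
    def __init__(self, value, inputs=(), op="leaf"):
        self.value = value
        self.inputs = inputs
        self.op = op

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), "add")

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other), "mul")

    def partial_adjoints(self, out_grad):
        """Adjoint contribution for each input, expressed as new graph nodes."""
        if self.op == "add":
            return [out_grad, out_grad]
        if self.op == "mul":
            a, b = self.inputs
            return [out_grad * b, out_grad * a]
        return []  # leaf nodes have no inputs


def topo_order(root):
    """Post-order DFS: every node appears after all of its inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def gradient(output, wrt):
    """Return a Node for d(output)/d(wrt), built by extending the graph."""
    partials = {id(output): [Node(1.0)]}
    result = Node(0.0)  # returned if `wrt` does not influence `output`
    for node in reversed(topo_order(output)):
        contribs = partials.get(id(node))
        if not contribs:
            continue
        adjoint = contribs[0]
        for extra in contribs[1:]:  # sum the contributions from all consumers
            adjoint = adjoint + extra
        if node is wrt:
            result = adjoint
        for parent, part in zip(node.inputs, node.partial_adjoints(adjoint)):
            partials.setdefault(id(parent), []).append(part)
    return result


x = Node(3.0)
y = x * x * x            # y = x^3
dy = gradient(y, x)      # node representing 3 * x^2
d2y = gradient(dy, x)    # gradient of the gradient: 6 * x
```

Here `dy.value` and `d2y.value` hold the first and second derivatives at x = 3. A design that writes raw numbers into a `.grad` field during the backward pass would give the first derivative but could not be differentiated a second time.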

5 - Automatic Differentiation Implementation

1. Lazy and eager mode

The computation graph can be executed in different fashions:
  • Lazy: run the graph when needed.
  • Eager: run the graph on the fly as it is constructed.
Lazy mode leaves room for more optimization, because the whole graph is visible before anything runs. Eager mode lets graph execution overlap with graph construction, which can improve efficiency when graph construction is the bottleneck.
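The difference between the two modes can be sketched with a node that either computes its value at construction time or defers it until asked. The class `Value`, its `eager` flag, and the helpers `const` and `add` are hypothetical and chosen for illustration; `calls` only records when the addition actually runs.

```python
calls = []  # records when operations actually execute

class Value:
    """A graph node that computes immediately (eager) or on demand (lazy)."""
    def __init__(self, fn, inputs=(), eager=False):
        self.fn = fn
        self.inputs = inputs
        self.cached = None
        self.realized = False
        if eager:
            self.realize()  # eager: run as soon as the node is constructed

    def realize(self):
        """Compute (once) and return this node's value, realizing inputs first."""
        if not self.realized:
            args = [v.realize() for v in self.inputs]
            self.cached = self.fn(*args)
            self.realized = True
        return self.cached

def const(x, eager):
    return Value(lambda: x, eager=eager)

def add(a, b, eager):
    def run(u, v):
        calls.append("add")
        return u + v
    return Value(run, (a, b), eager=eager)

# Lazy: building the graph runs nothing until realize() is called.
lazy_sum = add(const(1.0, eager=False), const(2.0, eager=False), eager=False)
ran_before_realize = list(calls)
lazy_result = lazy_sum.realize()
ran_after_realize = list(calls)

# Eager: the value is already available right after construction.
eager_sum = add(const(1.0, eager=True), const(2.0, eager=True), eager=True)
eager_result = eager_sum.cached
```

In the lazy case the framework sees the complete graph before running it, which is the opening for optimizations such as operator fusion. In the eager case intermediate values are available immediately, which tends to make debugging easier.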

2. Additional Remarks

  • Tensor is the user-facing abstraction for graph values. Its data is backed by an NDArray, which is manipulated through array_api.
  • The gradient function of an operator is implemented by constructing new computation graph nodes out of tensor operations, rather than by computing raw arrays directly.
 

© Lifan Sun 2023 - 2025