Rotary Positional Encoding
Reading Notes: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”
Reading Notes: “Efficient Memory Management for Large Language Model Serving with PagedAttention”
Reading Notes: “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”
Reading Notes: “Training Compute-Optimal Large Language Models”
Reading Notes: “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”
Reading Notes: GPT Series
Reading Notes: “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning”
Reading Notes: “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”
Distributed Training Basics
Reading Notes: Megatron-LM v1
Quantization for NN Inference
Reading Notes: TVM
Reading Notes: Triton
Moore’s Law and the Future of Computing Beyond Moore’s Law
Deep Learning Performance Background
Reading Notes: MI300X vs H100 vs H200 Benchmark Part 1: Training – CUDA Moat Still Alive
An Architecture Overview of ML Systems
PMPP Reading Notes
Two Ways of Adapting LLMs for Recommender Systems