My Blog Posts

Paper Summaries

DeepSeek-V3 Technical Report

LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Fast Inference from Transformers via Speculative Decoding

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

SGLang: Efficient Execution of Structured Language Model Programs

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot

NIRVANA: Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models

Projects

Read about FairShare!

Opinions

How should AI be used in the future?