
CS 598: Systems for Generative AI (S'26)

Logistics

Lectures: 0216 Siebel Center for Computer Science, WF 12:30 PM – 1:45 PM

| Member (NetID) | Role | Office Hours |
| --- | --- | --- |
| Fan Lai (fanlai) | Instructor | 3128 Siebel Center. F 2 PM – 3 PM |
| Jimmy Shong (jimmys2)<br>Yuhan Ding (yuhand7) | TAs | Zoom. W 7:00 PM – 8:00 PM |

Canvas: ALL communication regarding this course must go through Canvas, including questions, discussions, announcements, assignments, and private messages.

Presentation slides should be submitted to Canvas (Presentation Evaluation Form).

Course Description

Learning Objectives: This course will introduce the key concepts and the state-of-the-art in practical, scalable, and fault-tolerant software systems for emerging Generative AI (GenAI). At the end of the course you will be able to:

  • Critique and evaluate the design details of state-of-the-art GenAI systems
  • Develop and utilize tools to profile and understand the performance of GenAI systems
  • Propose new research ideas on topics related to supporting practical GenAI

Structure: The course will be a mix of lectures, student presentations, seminar-style discussions, and a semester-long project on GenAI topics. We will cover GenAI topics from top conferences that take a systems view to the relevant challenges, including:

  • Basics of GenAI models from a systems perspective;
  • Systems for GenAI lifecycle (pre-training, training, fine-tuning/alignment, inference serving, and grounding);
  • GenAI for systems, and more.

Note that this course is NOT focused on AI methods. Instead, we will focus on how one can build software systems so that existing AI methods can be used in practice and new AI methods can emerge.

Prerequisites: Students are expected to have good programming skills and must have taken at least one undergraduate-level systems-related course (from operating systems, databases, distributed systems, or networking). Having an undergraduate ML/AI course is helpful but not required.

Tentative Schedule and Reading List

This is an evolving list and subject to changes due to the breakneck pace of GenAI innovations.

| Date | Readings | Presenter | Companion | Reviewer |
| --- | --- | --- | --- | --- |
| Jan 21<br>(GenAI Systems) | Introduction<br>How to Read a Paper<br>How to Give a Bad Talk<br>The Shift from Models to Compound AI Systems | Fan (Slides) | | |
| **GenAI Basics** | | | | |
| Jan 23<br>(LLM Fundamentals) | The Illustrated Transformer<br>Flash Attention | Jimmy (Slides) | | |
| Jan 28<br>(Transformers Deep Dive) | FlashAttention-V2<br>Native Sparse Attention | Fan (Slides) | | |
| Jan 30<br>(Scalable ML) | Scaling Distributed Machine Learning with the Parameter Server<br>Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Fan (Slides) | | |
| Feb 4 | No Class (Proposal Feedback Session) | Fan<br>(SC 3128) | | |
| Feb 6<br>(Diffusion Models) | The Illustrated Stable Diffusion<br>Scalable Diffusion Models with Transformers (Required) | snian2, zhangw2, tuanmn2, yaqiq2, tonyhong (Slides) | ruoanw2, yihangj3, dc29, huiren2, changl25 | porters2, bhatkar3, htong9, runying2 |
| **Pre-Training** | | | | |
| Feb 11<br>(Hybrid Parallelism) | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Required)<br>Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | porters2, bhatkar3, htong9, runying2 (Slides) | mae10, adhaw2, yuf7, namanr2, ktrikha2 | mihirs2, riyajj2, kmaka5, pjakka3, aparekh6 |
| Feb 13<br>(Training MoEs) | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Required)<br>FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models | ronita2, dsingh14, rto4, kv22, yanniz3 (Slides) | porters2, bhatkar3, htong9, runying2 | ruoanw2, yihangj3, dc29, huiren2, changl25 |
| Feb 18<br>(Fault Tolerance) | TrainVerify: Equivalence-Based Verification for Distributed LLM Training (Required)<br>Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates | mihirs2, riyajj2, kmaka5, pjakka3, aparekh6 (Slides) | yuqiliu6, jtu9 | khota2, dcku2, sankalp6, james40 |
| **Post-Training** | | | | |
| Feb 20<br>(Finetuning Techniques) | LoRA: Low-Rank Adaptation of Large Language Models (Required)<br>dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | ruoanw2, yihangj3, dc29, huiren2, changl25 | apr11, rmotati2, cc215, vc50, svasani2 | hhunma2, cud2, pranav33, enyasun2, ambikas2 |
| Feb 21 | Project Proposal Due | | | |
| Feb 25<br>(Exploiting Sparsity) | AWQ: Activation-aware Weight Quantization (Required)<br>Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation | yuqiliu6, jtu9 | mihirs2, riyajj2, kmaka5, pjakka3, aparekh6 | yoojinc3, cc213, yonghan4, weishao3, bosyuan2 |
| Feb 27<br>(RLHF Systems) | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Required)<br>AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | jinjieg2, yuchen87, yunqili4, haoqic2, jrliao2 | hhunma2, cud2, pranav33, enyasun2, ambikas2 | ma169, kaizhuo3, hsk8, cjchong3, kaidifu2 |
| Mar 4<br>(RLHF Systems) | Optimizing RLHF Training for Large Language Models with Stage Fusion (Required)<br>RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs | yoojinc3, cc213, yonghan4, weishao3, bosyuan2 | khota2, dcku2, sankalp6, james40 | jinjieg2, yuchen87, yunqili4, haoqic2, jrliao2 |
| **Inference** | | | | |
| Mar 6<br>(Inference Runtime) | Efficient Memory Management for Large Language Model Serving with PagedAttention (Required)<br>Orca: A Distributed Serving System for Transformer-Based Generative Models | raghavt3, vdar2, khombal2, aditia8, sdevata2 | ma169, kaizhuo3, hsk8, cjchong3, kaidifu2 | yuqiliu6, jtu9 |
| Mar 11<br>(Optimizing Throughput) | Fast Inference from Transformers via Speculative Decoding (Required)<br>NanoFlow: Towards Optimal Large Language Model Serving Throughput | ma169, kaizhuo3, hsk8, cjchong3, kaidifu2 | neildeo2, mshin29, chauht2, seokjoo3 | raghavt3, vdar2, khombal2, aditia8, sdevata2 |
| Mar 13<br>(Optimizing Latency) | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (Required)<br>DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | neildeo2, mshin29, chauht2, seokjoo3 | raghavt3, vdar2, khombal2, aditia8, sdevata2 | ronita2, dsingh14, rto4, kv22, yanniz3 |
| Mar 14–22 | Spring Break | | | |
| Mar 25<br>(Optimizing User Experience) | JITServe: SLO-aware LLM Serving with Imprecise Request Information (Required)<br>Mooncake: Trading More Storage for Less Computation | npunati2, srd8, obp2 | snian2, zhangw2, tuanmn2, yaqiq2, tonyhong | yuyangw5, yeyu4, hw98, leolu2 |
| Mar 27<br>(Optimizing MLLMs) | ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving (Required)<br>Cornserve: Efficiently Serving Any-to-Any Multimodal Models | hhunma2, cud2, pranav33, enyasun2, ambikas2 | jerry8, yuchen85, haorany7, pw29 | npunati2, srd8, obp2 |
| Mar 27 | Mid-semester Proposal Due | | | |
| Apr 1 | Mid-Semester Project Feedback Session | | | |
| Apr 3 | Mid-Semester Project Feedback Session | | | |
| Apr 8<br>(In-Context Caching) | IC-Cache: Efficient Large Language Model Serving via In-context Caching (Required)<br>Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models | yuyangw5, yeyu4, hw98, leolu2 | npunati2, srd8, obp2 | snian2, zhangw2, tuanmn2, yaqiq2, tonyhong |
| Apr 10<br>(Diffusion Serving) | PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference (Required)<br>StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation | khota2, dcku2, sankalp6, james40 | yuyangw5, yeyu4, hw98, leolu2 | jerry8, yuchen85, haorany7, pw29 |
| **Agentic AI Systems** | | | | |
| Apr 15<br>(RAG Systems) | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (Required)<br>AVA: Towards Agentic Video Analytics with Vision Language Models | apr11, rmotati2, cc215, vc50, svasani2 | jinjieg2, yuchen87, yunqili4, haoqic2, jrliao2 | neildeo2, mshin29, chauht2, seokjoo3 |
| Apr 17<br>(Agent for Sys) | Measuring Agents in Production (Required)<br>Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework | jerry8, yuchen85, haorany7, pw29 | jiateng5, bl61, jiajunf3 | apr11, rmotati2, cc215, vc50, svasani2 |
| Apr 22<br>(Agent Coordination) | In-the-Flow Agentic System Optimization for Effective Planning and Tool Use (Required)<br>Towards End-to-End Optimization of LLM-based Applications with Ayo | mae10, adhaw2, yuf7, namanr2, ktrikha2 | ronita2, dsingh14, rto4, kv22, yanniz3 | jiateng5, bl61, jiajunf3 |
| Apr 24<br>(Agent Failures) | Why Do Multi-Agent LLM Systems Fail? (Required)<br>Where LLM Agents Fail and How They can Learn From Failures | jiateng5, bl61, jiajunf3 | yoojinc3, cc213, yonghan4, weishao3, bosyuan2 | mae10, adhaw2, yuf7, namanr2, ktrikha2 |
| Apr 29 | Final Project Presentation | | | |
| May 1 | Final Project Presentation | | | |
| May 6 | Final Project Presentation | | | |
| May 15 | Final Report Due | | | |

Tentative Grading

Groups: Panel discussions and the research project will be performed in groups of 4–5 students. Form a group and declare your group's membership and paper preferences by Feb 1. After this date, we will form groups from the remaining students.

| Component | Weight |
| --- | --- |
| Attendance | 10% |
| Reading summary | 20% (opt-in, 14 out of 19 readings) |
| Paper presentation | 15% |
| Panel discussion | 5% (two panels) |
| Final project presentation | 15% |
| Project report | 35% (5% + 10% + 20%) |
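To make the weighting concrete, here is a minimal sketch (not official grading code) of how these components combine into a final grade; the weights come from the table above, while all per-component scores are hypothetical.

```python
# Sketch of the tentative grade weighting; scores are made-up examples.
weights = {
    "Attendance": 0.10,
    "Reading summary": 0.20,            # opt-in, 14 of 19 readings
    "Paper presentation": 0.15,
    "Panel discussion": 0.05,           # two panels
    "Final project presentation": 0.15,
    "Project report": 0.35,             # proposal 5% + mid-semester 10% + final 20%
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights cover 100%

# Hypothetical per-component scores on a 0-100 scale.
scores = {
    "Attendance": 100,
    "Reading summary": 90,
    "Paper presentation": 85,
    "Panel discussion": 95,
    "Final project presentation": 88,
    "Project report": 92,
}

final = sum(weights[k] * scores[k] for k in weights)
print(f"Final grade: {final:.1f}/100")  # -> 90.9
```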

Academic integrity: The University's Honor Code applies to all activities related to this course. All material you submit in this course (reading summaries, project reports, and presentation materials) must be your own. If you use someone else's material, you must cite it properly.

AI Tool Policy: AI tools may be used for grammar checking and refining initial brainstorms, but the final reviews and code must be authored by the student. Students are responsible for the entire content and must adhere to the Academic Integrity Policy.

Policies

Participation

Before Each Lecture: Some lectures may include one required reading. You must submit a summary of the paper to Canvas by 11:59 PM on the due date.

During Lectures: Active participation is crucial both for your own understanding and for improving the overall quality of the course. You are expected to attend all lectures (up to 2 absences are allowed for legitimate reasons) and, more importantly, to participate in class discussions. Not everyone must add something every day, but everyone is expected to have something to share over the semester.

Paper Summaries

You need to select 14 of the 19 listed papers and write a summary for each. Each summary should be written independently and include the following four paragraphs:

  • P1: The problem the paper is trying to tackle. What's the impact of the work, e.g., why is it an important problem to solve?
  • P2: The main proposed idea(s). A summary of your understanding of different components of the proposed technique, e.g., the purpose of critical design choices.
  • P3: Your perceived strengths and weaknesses of the work, e.g., novelty, significance of improvements, quality of the evaluation, ease of use.
  • P4: Is there room for improvement? If so, what idea do you have for improving the techniques?

Your paragraphs do not need to be long, as long as each one covers the key points. You can discuss the paper with other students, but all of your writing must be your own. DO NOT use AI tools to draft it!

In terms of grading criteria, each summary is worth 8 points in total; a short scoring sketch follows the list. For each summary item above, you get:

  • 2: The summary item demonstrates a clear understanding of the paper.
  • 1: The summary item misses the point of the paper.
  • 0: The summary item is missing.
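As a concrete illustration of the rubric above (the item scores here are hypothetical, not from a real summary):

```python
# Four summary items (P1-P4), each scored 0 (missing),
# 1 (misses the point), or 2 (clear understanding).
item_scores = {"P1": 2, "P2": 2, "P3": 1, "P4": 0}

assert all(s in (0, 1, 2) for s in item_scores.values())
total = sum(item_scores.values())
print(f"Summary score: {total}/8")  # -> 5/8
```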

Because you select which 14 of the 19 papers to summarize, late submissions will NOT be accepted and will receive 0 points.

Paper Presentation

The course will be conducted as a seminar. Only one group will present in each class. Each group will be assigned one lecture over the course of the semester. Presenters will be expected to cover both of the readings, not just the required one. Presentations should last at most 50 minutes without interruption. However, presenters should expect questions and interruptions throughout.

In the presentation, you should:

  • Provide a brief background to motivate the problem (e.g., simplify this by referencing previous talks).
  • Present the high level idea, approach, and/or insight (using examples, whenever appropriate) in both of the readings.
  • Discuss technical details so that the audience can understand the key points without carefully reading the papers (you can quickly skim the evaluations).
  • Explain the differences between related works as well as the additional reading.
  • Identify strengths and weaknesses of both of the readings and propose directions of future research.

The slides for a presentation must be submitted to Canvas (in *.pptx format) at least 48 hours prior to the corresponding class for feedback.

Post-Lecture Panel Discussion

To foster a deeper understanding of the papers and encourage critical thinking, lectures with paper summaries will be followed by a panel discussion. This discussion involves three distinct roles played by different student groups, covering both papers (including the optional one) and simulating an interactive and dynamic scholarly exchange.

Roles and Responsibilities

  1. The Companion (Author) Group
  • Responsibility: As authors, you are expected to defend your paper against critiques, answer questions, and discuss how you might improve or extend your research in the future, akin to writing a rebuttal during the peer-review process.
  2. The Reviewer Group
  • Responsibility: Reviewers critically assess the paper, posing challenging questions and highlighting potential weaknesses or areas for further investigation. Your goal is to engage in a constructive critique of the paper, simulating a peer-review scenario.
  3. Rest of the Class
  • Responsibility: During the panel discussions, feel free to actively ask questions and engage in the dialogue.

The lecturer will also pose challenging questions to both the companion and reviewer groups, so please come well-prepared!

Project

You will have to complete substantive work on an instructor-approved problem and make an original contribution. Surveys are not permitted as projects; instead, each project must contain a survey of background and related work.

You must meet the following milestones (unless otherwise specified in future announcements) to ensure a high-quality project at the end of the semester:

  • Turn in a 2-page draft proposal (template), plus as many pages as needed for references, by February 21. Remember to include the names and UIUC email addresses of the group members.
  • Each group must turn in a 4-page mid-semester report, plus as many pages as needed for references, via email on or before 6:00PM CST on March 27.
  • Each group must schedule a project discussion with the instructor during class hours or office hours in the week of March 30 – April 3.
  • Each group must turn in an 8-page final report, plus as many pages as needed for references, and your code via email on or before 6:00PM CST on May 15. The report must be submitted as a PDF file, with formatting similar to that of the papers you've read in the class. The self-contained (i.e., include ALL dependencies) code must be submitted to Canvas as a zip file. Each zip file containing the code must include a README file with a step-by-step guide on how to compile and run the provided code.
  • You can find how to access GPU resources here.

Acknowledgements

This course is heavily inspired by other excellent system seminar courses, particularly UMich CSE 585. Acknowledgments to SymbioticLab.
