For serverless LLM inference, how do you handle cold starts and their associated latency overhead while maintaining reliability? #3691
Replies: 1 comment
Cold starts are fundamentally a state problem. Most serverless systems reload weights, reinitialize CUDA, and rebuild runtime state from scratch; that is where the 30–120s latency on large models comes from. A different approach is to snapshot the fully initialized GPU state once and restore it directly into VRAM instead of reloading the model, which eliminates weight loading and CUDA reinitialization entirely. The hard part is making the restore path reliable: if restore is robust, cold start becomes a memory-restore problem rather than a model-loading problem, and sub-2s starts become realistic even for large models.
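To make the distinction concrete, here is a minimal conceptual sketch in plain Python. It does not touch a real GPU; `cold_start_from_scratch`, `take_snapshot`, and `cold_start_from_snapshot` are hypothetical names, and the `time.sleep` stands in for weight loading plus CUDA init. The point is only the shape of the two paths: one rebuilds state, the other copies an already-built state back into place.

```python
import copy
import time

def cold_start_from_scratch():
    """Conventional path: reload weights and rebuild runtime state.

    The sleep is a stand-in for the expensive work (weight loading,
    CUDA context creation, graph/kernel warmup).
    """
    time.sleep(0.2)
    return {"weights": list(range(1000)), "kv_cache": [], "initialized": True}

_SNAPSHOT = None

def take_snapshot(state):
    """Capture the fully initialized state once, after a normal start."""
    global _SNAPSHOT
    _SNAPSHOT = copy.deepcopy(state)

def cold_start_from_snapshot():
    """Snapshot path: restore initialized state directly.

    This is a memory copy, not a reload, so it skips the expensive
    initialization work entirely.
    """
    return copy.deepcopy(_SNAPSHOT)

if __name__ == "__main__":
    # Pay the full cost once, snapshot the result.
    t0 = time.perf_counter()
    state = cold_start_from_scratch()
    scratch_secs = time.perf_counter() - t0
    take_snapshot(state)

    # Subsequent "cold" starts restore instead of rebuilding.
    t0 = time.perf_counter()
    restored = cold_start_from_snapshot()
    restore_secs = time.perf_counter() - t0

    print(f"scratch: {scratch_secs:.3f}s, restore: {restore_secs:.3f}s")
```

In a real system the snapshot would be the device memory image plus driver/runtime state, and the reliability question is whether that image restores correctly across hosts, driver versions, and GPU types; the sketch only illustrates why the restore path removes the initialization cost.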