Introduction

The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:

  1. Keep both models loaded → Requires 2x the GPU memory (expensive, often impossible)
  2. Reload models on-demand → 30-100+ seconds per switch (slow, wasteful)

vLLM Sleep Mode

vLLM Sleep Mode offers a third way: Models hibernate in seconds and wake up fast—delivering the efficiency of on-demand loading with the speed of persistent serving.

Two Sleep Levels for Different Needs

Level 1 offloads model weights to CPU RAM and discards the KV cache; Level 2 discards both weights and KV cache, keeping only small buffers in CPU memory (details in the Sleep Levels section below). Both levels are 18-200x faster than a full reload and work seamlessly with Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP).

Why Sleep Mode Beats Fast Weight Loaders

Even with instant weight loading, every cold start pays hidden costs that Sleep Mode avoids:

| Cost | Description | Fast Weight Loaders | Sleep Mode |
|---|---|---|---|
| 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
| 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
| 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |

By keeping the process alive, Sleep Mode preserves infrastructure (#2-4) and avoids expensive reinitialization. This is why benchmarks show Sleep Mode inference is 61-88% faster than cold starts.

This post covers:

  1. Quick start: enabling Sleep Mode and driving it through the online serving API
  2. Performance benchmarks on A100 and A4000 GPUs
  3. Choosing between Sleep Level 1 and Level 2
  4. Ablation studies on warm-up and FP8 quantization
  5. A decision guide for picking the right mode

Quick Start: Using Sleep Mode

Online Serving API

Start two vLLM servers with Sleep Mode enabled:

# Terminal 1: Start Phi-3-vision
export VLLM_SERVER_DEV_MODE=1
vllm serve microsoft/Phi-3-vision-128k-instruct --enable-sleep-mode --port 8001

# Terminal 2: Start Qwen3-0.6B
export VLLM_SERVER_DEV_MODE=1
vllm serve Qwen/Qwen3-0.6B --enable-sleep-mode --port 8002

Sleep and Wake Models

# Put Phi-3-vision to sleep (Level 2 - minimal RAM usage)
curl -X POST 'localhost:8001/sleep?level=2'

# Put Qwen3-0.6B to sleep (Level 1 - weights offloaded to CPU RAM)
curl -X POST 'localhost:8002/sleep?level=1'

# Wake up Phi-3-vision for inference
curl -X POST 'localhost:8001/wake_up'
curl -X POST 'localhost:8001/collective_rpc' \
  -H 'Content-Type: application/json' \
  -d '{"method":"reload_weights"}'

# IMPORTANT: Reset prefix cache after waking (Level 2 only)
curl -X POST 'localhost:8001/reset_prefix_cache'

# Now run inference on Phi-3-vision...
# (your inference requests here)

# Put back to sleep when done
curl -X POST 'localhost:8001/sleep?level=2'

# Wake up Qwen3-0.6B
curl -X POST 'localhost:8002/wake_up'
# (Level 1 doesn't need reload_weights or reset_prefix_cache)

# Run inference on Qwen3-0.6B...

Note

For Level 2 sleep, you must call reload_weights and reset_prefix_cache after waking. Level 1 sleep doesn’t require these extra steps.

Warning

Security: The /sleep, /wake_up, /collective_rpc, and /reset_prefix_cache endpoints require VLLM_SERVER_DEV_MODE=1 and should only be exposed in trusted networks. These administrative endpoints can disrupt service and are intended for closed environments like training clusters or backend applications.
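For automation, the same endpoints can be driven from a short script. Below is a minimal Python sketch using the requests library; switch_to and MODELS are illustrative names rather than anything provided by vLLM, and the endpoints and Level 2 steps are exactly the ones shown above.

import requests

# Illustrative two-server switcher built on the endpoints shown above.
MODELS = {
    "phi3v": {"base": "http://localhost:8001", "level": 2},
    "qwen3": {"base": "http://localhost:8002", "level": 1},
}

def sleep(name: str) -> None:
    cfg = MODELS[name]
    requests.post(f"{cfg['base']}/sleep", params={"level": cfg["level"]}).raise_for_status()

def wake(name: str) -> None:
    cfg = MODELS[name]
    requests.post(f"{cfg['base']}/wake_up").raise_for_status()
    if cfg["level"] == 2:
        # Level 2 discarded the weights, so reload them and reset the prefix cache.
        requests.post(f"{cfg['base']}/collective_rpc",
                      json={"method": "reload_weights"}).raise_for_status()
        requests.post(f"{cfg['base']}/reset_prefix_cache").raise_for_status()

def switch_to(active: str, idle: str) -> None:
    sleep(idle)   # free the GPU memory held by the idle model
    wake(active)  # bring the next model back to full speed

# Serve Phi-3-vision, then hand the GPU over to Qwen3-0.6B.
switch_to("phi3v", idle="qwen3")
# ... send inference requests to port 8001 ...
switch_to("qwen3", idle="phi3v")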

Performance Overview

Let’s see how Sleep Mode performs compared to traditional model reloading.

Sleep Mode L1 vs No Sleep Mode Performance

The interactive chart below shows the total time to perform 5 model switches: running inference on Model A, switching to Model B, running inference on Model B, then repeating this pattern (A→B→A→B→A→B).

With Sleep Mode: Models sleep/wake between switches, preserving infrastructure. Without Sleep Mode: Each switch requires a full vLLM restart and reload.

Model A: Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | Model B: Qwen3-Coder-30B-A3B-Instruct (TP=1)
GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
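As a rough illustration of how such numbers can be measured (this is not the harness used for these benchmarks), the sketch below times a wake and the first request against the Qwen3-0.6B server from the Quick Start, using vLLM's OpenAI-compatible /v1/completions endpoint.

import time
import requests

BASE = "http://localhost:8002"  # Qwen3-0.6B server from the Quick Start

def timed_post(path: str, **kwargs) -> float:
    """POST to the server and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    requests.post(f"{BASE}{path}", **kwargs).raise_for_status()
    return time.perf_counter() - start

# Switch time: how long the model takes to come back from sleep.
wake_s = timed_post("/wake_up")

# First-inference time: prefill + decode of the first request after waking.
infer_s = timed_post("/v1/completions", json={
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "Explain why GPUs need sleep.",
    "max_tokens": 100,
})

print(f"wake: {wake_s:.2f}s, first inference: {infer_s:.2f}s")

# Put the model back to sleep (Level 1) before switching to the other server.
timed_post("/sleep", params={"level": 1})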

Inference Performance Boost

Beyond faster model switching, Sleep Mode also delivers faster inference times. Because models are already warmed up when woken from sleep, they skip the cold start overhead that affects freshly loaded models.

Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).
Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Why Sleep Mode Improves Inference Speed

The 61-88% inference speedup isn’t from faster weight loading—it’s from preserving expensive infrastructure that cold starts must rebuild from scratch.

What Sleep Mode Preserves:

| Component | Preserved? | Cold Start Must Pay |
|---|---|---|
| Memory allocator (CuMemAllocator) | ✅ Yes | ❌ Reinitialize every time |
| CUDA graphs | ✅ Yes | ❌ Re-capture every time |
| Process state (Python, CUDA context) | ✅ Yes | ❌ Restart every time |
| GPU kernel JIT cache | ✅ Yes (after initial warmup) | ❌ Recompile every time |

The Critical Difference:

Note

Timing varies significantly by model size, GPU generation, and configuration. See the Impact of Warm-Up section for detailed measurements showing 5-7x slowdown without warm-up.

Model Switching Performance

The most dramatic benefit of Sleep Mode is in model switching time. Waking a sleeping model is 18-20x faster than loading a fresh vLLM instance.

Model switching time: Wake from sleep vs cold start (fresh load).
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Hardware Scalability: A4000 GPU Results

Sleep Mode benefits aren’t limited to high-end GPUs. Here’s the same workload on an A4000 GPU with smaller models, demonstrating that the performance gains scale across different hardware tiers and model sizes.

Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

A4000: Inference Performance

Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).
Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

A4000: Model Switching Performance

Model switching time on A4000: Wake from sleep vs cold start (fresh load).
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Key Observations on A4000:

Sleep Levels: Choosing the Right Mode

vLLM Sleep Mode offers two levels with different tradeoffs:

Level 1 (Default): Offloads model weights to CPU memory, discards KV cache

Level 2: Discards model weights and KV cache, keeps only buffers (rope scaling tensors, etc.) in CPU
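The same two levels are also available programmatically. The sketch below uses vLLM's offline LLM API; it assumes the enable_sleep_mode flag and the sleep/wake_up methods exposed in recent vLLM releases, and shows Level 1 only (Level 2 additionally requires reloading weights after wake, as described in the Quick Start).

from vllm import LLM, SamplingParams

# Offline (in-process) sketch of Sleep Mode, assuming recent vLLM releases.
llm = LLM(model="Qwen/Qwen3-0.6B", enable_sleep_mode=True)
params = SamplingParams(max_tokens=100)

print(llm.generate(["What does Sleep Mode do?"], params)[0].outputs[0].text)

# Level 1: offload weights to CPU RAM and discard the KV cache.
llm.sleep(level=1)

# ... run another model or other GPU work here ...

# Waking restores the weights from CPU RAM; no extra reload step is needed.
llm.wake_up()
print(llm.generate(["Welcome back."], params)[0].outputs[0].text)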

Performance Comparison: Level 1 vs Level 2 vs No Sleep

Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
GPU: A100 (TP=1) | vLLM 0.11.0 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
Comparing all three modes: Level 1 (fastest), Level 2 (minimal RAM), No Sleep. Hover for exact timing.

Performance Summary:

| Mode | Total Time | Wake Time (A/B) | CPU RAM | Best For |
|---|---|---|---|---|
| No Sleep | 357.1s | N/A (full reload) | Minimal | Single model, no switching |
| Level 1 | 112.6s | 0.26s / 0.82s | High (~GB per model) | Frequent switching, ample RAM |
| Level 2 | 124.6s | 0.85s / 2.58s | Minimal (~MB per model) | Limited RAM, cost optimization |

Key Insights: Level 1 delivers the fastest total time and sub-second wakes when CPU RAM is plentiful; Level 2 trades roughly 3x slower wakes for near-zero CPU RAM; and both cut total time to roughly a third of No Sleep Mode (357.1s → 112.6s / 124.6s).

Why Level 2 is Still Faster Than No Sleep Mode

At first glance, this seems counterintuitive: Level 2 reloads weights from SSD (just like "No Sleep Mode"), so why are its model switches 23-45x faster and its total runtime 2.9x lower?

The Answer: Weight loading is only ONE of FIVE costs

When you reload a model without Sleep Mode, you pay all these costs:

| Cost | Level 2 | No Sleep Mode |
|---|---|---|
| 1. Weight load (SSD → VRAM) | ❌ Must pay | ❌ Must pay |
| 2. Process initialization | Skipped | ❌ Must pay |
| 3. Memory allocator setup | Skipped | ❌ Must pay |
| 4. CUDA graph capture | Skipped | ❌ Must pay |
| 5. GPU kernel JIT compilation | Preserved (already compiled) | ❌ Full compilation + warm-up |

Level 2 Strategy: pay for the weight reload, and only the weight reload. The process, memory allocator, CUDA graphs, and JIT-compiled kernels all stay alive across the switch.

No Sleep Mode Reality: every switch tears down the vLLM process and pays all five costs again.

The benchmark data proves it: across 5 model switches, Level 2 takes 124.6s in total versus 357.1s without Sleep Mode.

Even though both reload weights from SSD, Level 2 is 2.9x faster overall because it preserves the expensive infrastructure (process state, allocator, CUDA graphs) that No Sleep Mode must rebuild from scratch every single time.

Level 2: Inference Performance

Inference time comparison with Sleep Level 2: wake mode vs cold start.
Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Level 2: Model Switching Performance

Model switching time with Sleep Level 2: wake from sleep vs cold start.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Key Observations:

| Metric | No Sleep | Level 2 | Improvement |
|---|---|---|---|
| Total Time (5 switches) | 357.1s | 124.6s | 65% faster |
| Qwen3-0.6B Switch Time | 37.6s avg | 0.85s avg | 45x faster |
| Phi-3-vision Switch Time | 58.1s avg | 2.58s avg | 23x faster |
| Qwen3-0.6B Inference | 3.67s avg | 0.53s avg | 86% faster |
| Phi-3-vision Inference | 6.30s avg | 0.76s avg | 88% faster |
| Wake Time vs Level 1 | - | 3-10x slower | Trade CPU RAM for speed |

When to Use Level 2: CPU RAM is limited or shared across many models (each sleeping model keeps only ~MB in RAM instead of ~GB), and wake times of roughly one to three seconds plus the extra reload_weights / reset_prefix_cache calls are acceptable.

Level 1 vs Level 2 Comparison:

| Aspect | Level 1 | Level 2 |
|---|---|---|
| Model weights | Offloaded to CPU RAM | Discarded, reloaded from disk on wake |
| KV cache | Discarded | Discarded |
| CPU RAM while asleep | High (~GB per model) | Minimal (~MB per model) |
| Wake time (this benchmark) | 0.26s / 0.82s | 0.85s / 2.58s |
| Extra steps after wake | None | reload_weights + reset_prefix_cache |

Ablation Studies

Impact of Warm-Up on Sleep Mode

Does skipping the warm-up phase affect performance? Warm-up pre-compiles CUDA graphs during initial load, which can take several seconds. Let’s compare with and without warm-up.

Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
Comparing with warm-up (pre-compiled) vs without warm-up (lazy compilation). Hover for exact timing.

Key Findings:

| Metric | With Warm-Up | Without Warm-Up | Difference |
|---|---|---|---|
| Initial Load Time | 108.7s (includes 8.4s warm-up) | 101.1s (no warm-up) | 7.6s saved initially |
| First Inference (A) | 0.45s | 2.59s | 5.8x slower without warm-up |
| First Inference (B) | 0.93s | 6.61s | 7.1x slower without warm-up |
| Subsequent Inferences | 0.43s avg | 0.41s avg | No difference |
| Total Time (5 switches) | 119.5s | 119.0s | Nearly identical |

Insights: Skipping warm-up saves about 7.6s at initial load, but the first inference on each model is then 5.8-7.1x slower while CUDA graphs and kernels compile lazily. Subsequent inferences and total time are essentially unchanged, so keep warm-up enabled whenever first-request latency matters.
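If you do run without the built-in warm-up, one mitigation (sketched below, assuming the Quick Start server for Qwen3-0.6B on port 8002) is to absorb the lazy compilation with a throwaway request before real traffic arrives.

import requests

def prewarm(base: str, model: str) -> None:
    """Send a tiny throwaway completion so lazy compilation happens now,
    not on the first user-facing request."""
    requests.post(f"{base}/v1/completions", json={
        "model": model,
        "prompt": "warm-up",
        "max_tokens": 1,
    }).raise_for_status()

prewarm("http://localhost:8002", "Qwen/Qwen3-0.6B")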

Impact of Quantization on Sleep Mode

Does quantization (FP8) affect Sleep Mode performance? We tested the same workload with and without FP8 quantization on A100 GPU.

Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
Comparing BF16 (baseline) vs FP8 quantization. Hover for exact timing.

Ablation: Inference Performance (BF16 vs FP8)

Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.
Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Ablation: Model Switching (BF16 vs FP8)

Model switching time: BF16 vs FP8 quantization with Sleep Mode.
Error bars show min/max variation across multiple runs. Values displayed on bars.
GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE

Key Findings:

| Metric | BF16 | FP8 | Improvement |
|---|---|---|---|
| Total Time (5 switches) | 108.2s | 113.6s | -5% (slightly slower) |
| Qwen3-0.6B Wake Time | 0.27s avg | 0.18s avg | 33% faster |
| Phi-3-vision Wake Time | 0.90s avg | 0.78s avg | 13% faster |
| Qwen3-0.6B Inference | 0.41s avg | 0.44s avg | -7% (slightly slower) |
| Phi-3-vision Inference | 0.81s avg | 0.57s avg | 30% faster |
| Initial Load Time | 90.5s | 96.9s | -7% (longer warmup) |

Insights: FP8 shortens wake times (13-33% faster) and speeds up Phi-3-vision inference by 30%, but its longer initial load and warm-up leave this short 5-switch run about 5% slower end to end. Sleep Mode itself is fully compatible with FP8 quantization.
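For reference, the FP8 runs can be configured roughly as sketched below; this assumes vLLM's on-the-fly fp8 quantization option in the offline API, and the exact flags may vary with your vLLM version.

from vllm import LLM

# Sketch: combining on-the-fly FP8 quantization with Sleep Mode
# (assumes the quantization="fp8" option in recent vLLM releases).
llm = LLM(model="Qwen/Qwen3-0.6B", quantization="fp8", enable_sleep_mode=True)
llm.sleep(level=1)   # FP8 wakes measured 13-33% faster in the table above
llm.wake_up()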

Decision Guide: Which Sleep Level to Use?

Use Sleep Level 1 When: you switch models frequently, CPU RAM is plentiful (each sleeping model parks ~GB of weights in RAM), and you want the fastest possible sub-second wakes.

Use Sleep Level 2 When: CPU RAM is limited or you are packing many models onto one machine, and wake times of a few seconds (plus reload_weights and reset_prefix_cache) are acceptable.

Skip Sleep Mode When: you serve a single model and never switch, so there is no idle model to hibernate.

Conclusion

vLLM Sleep Mode transforms multi-model GPU serving from a 30-100 second reload penalty into sub-second switches. The benchmarks speak for themselves:

  1. Model switches are 18-45x faster than a full reload, with sub-second wakes at Level 1
  2. The first inference after waking is 61-88% faster than after a cold start
  3. Level 2 keeps CPU RAM usage to ~MB per sleeping model while still cutting total switching time by 65%
  4. The gains hold from A100-class GPUs down to an A4000, and work with TP, PP, and EP

The future of LLM serving is multi-model. Sleep Mode makes it practical today.

Acknowledgements

Special thanks to Vensen Mu, Jeff Aw, Jun Kang Chow, Tun Jian Tan, Pin Siang Tan, Amir Balwel, Ye Hur Cheong, Zhiyao Cen and Kaichao You for developing the Sleep Mode feature and for their work on this blog post.