Building Clean, Maintainable vLLM Modifications Using the Plugin System
[!NOTE] Originally posted on this Medium article. Source: https://github.com/vllm-project/vllm-ascend A...
vLLM is a fast and easy-to-use library for LLM inference and serving.
The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where...
Expanding Docker Model Runner’s Capabilities: Today, we’re excited to announce that Docker Model Runner now integrate...
[!NOTE] Originally posted on the Cohere blog. Introducing Shared Memory IPC Caching — a high-performance caching...
The Intel® Arc™ Pro B-Series GPU family delivers powerful AI capabilities with a focus on accessibility and exception...
We demonstrate an open-source bitwise consistent on-policy RL run with TorchTitan as the training engine and vLLM as ...
We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is buil...
TL;DR: For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8...
Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tu...
Introduction. The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Tra...
Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in deve...
TL;DR. Agents often call LLMs via OpenAI‑compatible endpoints, which previously returned only string-based inputs and o...
vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTor...
Introduction: Over the past several months, we’ve been collaborating closely with NVIDIA to unlock the full potential...
Introduction: We are excited to announce Day 0 support for DeepSeek-V3.2-Exp, featuring DeepSeek Sparse Attention (DS...
The first vLLM meetup in Korea was held on August 19, 2025, in Seoul, hosted by Rebellions and Red Hat with sup...
Industry Status: Inference ≠ More Is Better. Over the past year, hybrid reasoning and automatic routing have increa...
We’re excited to announce that vLLM now supports Qwen3-Next, the latest generation of foundation models from the Qwen...
Introduction: Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generati...
[!NOTE] Originally posted on Aleksa Gordic’s website. From paged attention, continuous batching, prefix caching,...
[!NOTE] This blog originated from our biweekly vLLM office hours, a community forum hosted by Red Hat with vLLM pr...
Introduction: General Language Model (GLM) is a family of foundation models created by Zhipu.ai (now renamed to Z.ai)...
TL;DR: If you hit an "illegal memory access was encountered" error, you can enable CUDA core dump to debug the issue. S...
We’re thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x ...
This article explores how MiniMax-M1’s hybrid architecture is efficiently supported in vLLM. We discuss the model’s u...
Since December 2024, through the joint efforts of the vLLM community and the Ascend team on vLLM, we have completed t...
As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedba...
The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures...
We’re excited to announce that vLLM now supports the Llama 4 herd of models: Scout (17B-16E) and Maverick (17B-128E)....
TL;DR: vLLM on AMD ROCm now has better FP8 performance! What’s new? PTPC-FP8 quantization is now supported in vLL...
Today, we are excited to announce vllm-project/aibrix: a batteries-included vLLM Kubernetes serving stack developed by ...
Motivation: Serving large models often leads to memory bottlenecks, such as the dreaded CUDA out of memory error. To ...
We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on...
We are excited to announce that vLLM inference provider is now available in Llama Stack through the collaboration bet...
TL;DR: vLLM boasts the largest open-source community, but what does it take to transform vLLM from the best singl...
TL;DR: Structured decoding allows precise control over LLM output formats vLLM now supports both outlines and X...
The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the de ...
The field of LLM inference is advancing at an unprecedented pace. With new models and features emerging weekly, the t...
TL;DR: vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-t...
Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and l...
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x hi...
We would like to share two updates with the vLLM community. Future of vLLM is Open: We are excited to see vLLM i...
Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3....
TL;DR: vLLM matches DeepSpeed-FastGen’s speed in common scenarios and surpasses it when handling longer outputs....
GitHub | Documentation | Paper. LLMs promise to fundamentally change how we use AI across all industries. However, ...