Since December 2024, through the joint efforts of the vLLM community and the Ascend team on vLLM, we have completed the Hardware Pluggable RFC. This proposal allows hardware backends to be integrated into vLLM in a decoupled manner, enabling rapid and modular support for different hardware platforms.


Why vLLM Hardware Plugin?

vLLM already supports multiple backends. However, as the number of backends continues to grow, several challenges have emerged.

Recognizing the need for a flexible and modular approach to integrating hardware backends, we proposed hardware plugins as a feasible solution.


What is the vLLM Hardware Plugin?

Before introducing the vLLM Hardware Plugin, let’s first look at the two prerequisite RFCs it builds on.

Based on these RFCs, we proposed [RFC] Hardware Pluggable, which integrates the Platform module into vLLM as a plugin. Additionally, we refactored Executor, Worker, ModelRunner, AttentionBackend, and Communicator to support hardware plugins more flexibly.

Currently, the vLLM community has successfully implemented the Platform module introduced in the RFC. Its functionality has been validated through the vllm-project/vllm-ascend and vllm-project/vllm-spyre projects. Using this plugin mechanism, we successfully integrated vLLM with the Ascend NPU and IBM Spyre backends.


How to Integrate a New Backend via the vLLM Hardware Plugin Mechanism

This section dives into integrating a new backend via the hardware plugin mechanism, from both the developer and user perspectives.

Developer Perspective

To integrate a new backend into vLLM using the hardware plugin, follow these steps:

Step 1: Create a New Project and Initialize the Platform

Start by creating a Python project for the new backend and adding a platform.py file. Then, import the Platform class from vllm.platforms and implement the required attributes and methods.

You can refer to platform.py in the vLLM Ascend project for an example.
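Below is a minimal sketch of what such a platform class might look like, assuming a hypothetical backend named "mydevice"; the attribute values and the method body are illustrative, not a definitive implementation:

from vllm.platforms import Platform, PlatformEnum

class MyPlatform(Platform):
    # Out-of-tree platforms identify themselves with the OOT enum value.
    _enum = PlatformEnum.OOT
    # Device identifiers used throughout vLLM; the values depend on your backend.
    device_name = "mydevice"
    device_type = "mydevice"

    @classmethod
    def get_device_name(cls, device_id: int = 0) -> str:
        # Query your device runtime for a human-readable device name.
        return "MyAccelerator"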

Step 2: Implement Custom Worker, Model Runner, Attention Backend, and Communicator Modules

Depending on the new backend’s requirements, implement the following modules:

# Base class for the worker that manages the device and the model's lifecycle
from vllm.worker.worker_base import WorkerBase
# Base class for the model runner that prepares inputs and executes the model
from vllm.worker.model_runner_base import ModelRunnerBase
# Abstract interface for the attention backend implementation
from vllm.attention.backends.abstract import AttentionBackend
# Base class for the device communicator used in distributed inference
from vllm.distributed.device_communicators.base_communicator import CommunicatorBase

Each of these modules subclasses the corresponding vLLM base class shown above. Again, you can refer to vLLM Ascend’s implementation for examples, or see the skeleton below for the general shape.
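As a rough illustration, a custom worker might look like the following skeleton; the class name is hypothetical, and the method bodies must be filled in with backend-specific logic:

from vllm.worker.worker_base import WorkerBase

class MyWorker(WorkerBase):
    def init_device(self):
        # Bind this worker process to its device and initialize the runtime.
        ...

    def load_model(self):
        # Load the model weights onto the device and prepare them for execution.
        ...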

Step 3: Register the Plugin

Register the plugin in setup.py using Python's entry point mechanism:

from setuptools import setup

setup(
    entry_points={'vllm.platform_plugins': ["{your_platform_name} = {code_path}:{register_function}"]}
)

Refer to setup.py in vLLM Ascend for a practical example.
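The entry point should resolve to a register function that returns the fully qualified path of your Platform class. A minimal sketch, reusing the hypothetical names from the earlier steps:

# my_plugin/__init__.py (module and class names are illustrative)
def register() -> str:
    # Return the fully qualified class path of the Platform implementation.
    return "my_plugin.platform.MyPlatform"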


User Perspective

Users only need to install vLLM and the plugin package before running. Taking vllm-ascend as an example:

pip install vllm vllm-ascend
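After installation, vLLM can be used as usual, and the plugin is discovered and activated automatically. A quick smoke test (the model name is illustrative):

from vllm import LLM

# The platform plugin is discovered via its entry point automatically,
# so no extra configuration is needed here.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
print(llm.generate("Hello, world!"))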

On startup, you will observe the following logs, which means the backend plugin is working properly:

INFO 02-06 15:49:01 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-06 15:49:01 __init__.py:32] name=ascend, value=vllm_ascend:register
… …
INFO 02-06 15:49:01 __init__.py:44] plugin ascend loaded.
INFO 02-06 15:49:01 __init__.py:181] Platform plugin ascend is activated

What’s Next?

Moving forward, we will continue collaborating with developers in the vLLM community to enhance the following aspects:

  1. Continuous enhancements to the V1 Engine and VLMs.
  2. Expanded plugin support for more modules and features, such as the scheduler, graph mode, and custom operators.
  3. Better user experience and higher performance.
  4. Maintenance and enhancement of a stable plugin architecture across hardware platforms.

We encourage everyone to try out this new feature! If you have any questions, join the vLLM Slack and participate in the #sig-extensible-hardware channel for discussions. 🚀

Acknowledgements

This flexible hardware backend plugin mechanism would not have been possible without the efforts of many vLLM contributors. We are deeply grateful to the vLLM maintainers, including Kaichao You, Simon Mo, Cyrus Leung, Robert Shaw, Michael Goin, and Jie Li, for the related refactoring, deep discussions, and quick reviews; to Xiyuan Wang, Shanshan Shen, Chenguang Li, and Mengqing Cao from the Ascend team on vLLM for the mechanism design and implementation; to Joe Runde and Yannick Schnider from the Spyre team on vLLM for the pluggable scheduler design and implementation; and to other contributors, including yancong for the extendable quantization method design and implementation and Aviv Keshet for extendable SamplingParams.