DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Zhe Liu1,
Runhui Huang1,
Rui Yang1,
Siming Yan2,
Zining Wang2,
Lu Hou2,
Di Lin3,
Xiang Bai4,
Hengshuang Zhao1,✉
1 The University of Hong Kong,
2 Yinwang Intelligent Technology Co. Ltd.,
3 Tianjin University,
4 Huazhong University of Science and Technology
✉ Corresponding author.
- Unified Spatial-aware 4D MLLM Framework. DrivePI is the first unified framework that seamlessly integrates coarse-grained linguistic spatial understanding with fine-grained 3D perception capabilities, bridging the gap between vision-action (VA) and vision-language-action (VLA) paradigms in autonomous driving. 💪
- Multi-modal Sensing. DrivePI incorporates LiDAR as a complementary sensing modality alongside camera imagery, providing high-precision 3D geometric information that better elicits the spatial understanding capabilities of MLLMs. 💪
- Fine-grained 3D Perception and Prediction. DrivePI enables accurate 3D perception (e.g., 3D occupancy) and prediction (e.g., occupancy flow), which effectively enhances interpretability and safety assurance for autonomous driving systems. 💪
- Strong Performance. Despite using only a compact 0.5B-parameter MLLM backbone (Qwen2.5), DrivePI outperforms existing VA models on 3D occupancy and occupancy flow while offering interactive capabilities comparable to existing VLA frameworks. 💪
- 2025.12.15: DrivePI paper released. 🔥
- 2025.12.15: GenieDrive (Physics-Aware Driving World Model) paper released. 🔥
- 2025.11.04: Our previous work UniLION has been released. Check out the codebase for a unified autonomous driving model with Linear Group RNNs. 🚀
- 2024.09.26: Our work LION has been accepted by NeurIPS 2024. Visit the codebase for Linear Group RNN for 3D Object Detection. 🚀
- Release the paper.
- Release checkpoints of DrivePI.
- Release all code of DrivePI.
- Release the dataset.
- Vision-Action (VA) models take visual information (LiDAR point clouds, images) as inputs and output action signals through a modular framework. While these methods achieve promising results through accurate spatial perception, they are limited in language-based scene interaction.
- Vision-Language-Action (VLA) approaches leverage the reasoning capabilities of multimodal large language models (MLLMs). These methods achieve superior interaction capabilities but often struggle due to the absence of fine-grained intermediate 3D perception and prediction.
DrivePI bridges this gap by combining the strengths of both approaches, serving as a unified Vision-Language-Action framework that is also compatible with vision-action models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture.
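Since the code has not been released yet, the snippet below is only a minimal PyTorch sketch of how such a unified multi-modal model could be wired up: point-cloud, image, and text tokens share one backbone, and parallel heads emit occupancy, flow, planning, and text outputs. All module names, token layouts, class counts, and shapes are illustrative assumptions, not the actual DrivePI implementation.

```python
# Minimal sketch of a unified multi-modal driving model (NOT the released DrivePI code).
# All module names, token layouts, class counts, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class UnifiedDrivingModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=32000, num_waypoints=6, num_classes=17):
        super().__init__()
        self.num_waypoints = num_waypoints
        # Modality-specific encoders (placeholders for the real LiDAR, image, and text tokenizers).
        self.point_encoder = nn.Sequential(nn.Linear(4, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # naive patch embedding
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Shared backbone standing in for the MLLM (e.g., a compact language model).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Parallel task heads: 3D occupancy, occupancy flow, planning, and language.
        self.occ_head = nn.Linear(d_model, num_classes)          # per-scene-token semantic logits
        self.flow_head = nn.Linear(d_model, 2)                   # per-scene-token BEV flow (vx, vy)
        self.plan_head = nn.Linear(d_model, num_waypoints * 2)   # future (x, y) waypoints
        self.text_head = nn.Linear(d_model, vocab_size)          # next-token logits for QA / reasoning

    def forward(self, points, images, text_ids):
        # points: (B, N, 4) LiDAR points; images: (B, 3, H, W); text_ids: (B, L) instruction tokens.
        pts_tok = self.point_encoder(points)                              # (B, N, D)
        img_tok = self.image_encoder(images).flatten(2).transpose(1, 2)   # (B, P, D)
        txt_tok = self.text_embed(text_ids)                               # (B, L, D)
        tokens = torch.cat([pts_tok, img_tok, txt_tok], dim=1)            # one joint token sequence
        feats = self.backbone(tokens)
        scene = feats[:, : pts_tok.size(1) + img_tok.size(1)]             # tokens read by vision heads
        text = feats[:, -txt_tok.size(1):]                                # tokens read by the text head
        return {
            "occupancy_logits": self.occ_head(scene),
            "occupancy_flow": self.flow_head(scene),
            "waypoints": self.plan_head(text[:, -1]).view(-1, self.num_waypoints, 2),
            "text_logits": self.text_head(text),
        }


if __name__ == "__main__":
    model = UnifiedDrivingModel()
    out = model(torch.randn(1, 1024, 4), torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
    print({k: tuple(v.shape) for k, v in out.items()})
```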
Our multi-stage data pipeline consists of:
- Caption Annotation: We use InternVL3-78B to generate captions of front and back views separately, then merge and polish them to create comprehensive scene descriptions.
- 4D Spatial Understanding Annotation: We leverage ground-truth occupancy and flow data to generate diverse text-occupancy and text-flow QA pairs through multi-turn conversations, enabling fine-grained 3D understanding (see the sketch after this list).
- Planning Reasoning Annotation: We create planning QA pairs based on future trajectory annotations to enhance planning interpretability, enabling the MLLM to predict future actions of the ego-vehicle.
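The dataset and annotation scripts are not released yet; as a rough illustration of the 4D spatial understanding annotation referenced above, here is a sketch of turning a ground-truth semantic occupancy grid into simple text-occupancy QA pairs. The class ids, grid layout, and question templates are assumptions, not the actual pipeline.

```python
# Illustrative sketch of deriving text-occupancy QA pairs from a ground-truth semantic grid.
# The class ids, grid layout, and question templates are assumptions, not the released pipeline.
import numpy as np

CLASS_NAMES = {1: "car", 2: "pedestrian", 3: "truck"}  # assumed subset of semantic classes


def occupancy_qa_pairs(occ_grid, voxel_size=0.4):
    """Build simple existence / coarse-localization QA pairs from an (X, Y, Z) occupancy grid."""
    qa_pairs = []
    for cls_id, name in CLASS_NAMES.items():
        voxels = np.argwhere(occ_grid == cls_id)           # indices of voxels labeled with this class
        qa_pairs.append({
            "question": f"Is there any {name} occupying space around the ego vehicle?",
            "answer": "Yes." if len(voxels) > 0 else "No.",
        })
        if len(voxels) == 0:
            continue
        center = voxels.mean(axis=0) * voxel_size          # voxel indices -> meters (grid frame)
        qa_pairs.append({
            "question": f"Roughly where is the {name} region located?",
            "answer": f"Around ({center[0]:.1f} m, {center[1]:.1f} m) in the occupancy grid frame.",
        })
    return qa_pairs


if __name__ == "__main__":
    grid = np.zeros((200, 200, 16), dtype=np.int64)
    grid[100:104, 90:94, :4] = 1                           # a block of "car" voxels for demonstration
    for qa in occupancy_qa_pairs(grid):
        print(qa["question"], "->", qa["answer"])
```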
Remarkably, with only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI, as a single unified model, matches or exceeds both existing VLA models and specialized VA models:
- Compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes.
- Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves a 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes; the arithmetic behind these relative gains is spelled out in the short snippet below.
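For reference, the relative improvements quoted above follow directly from the numbers in the tables below; this short snippet reproduces the arithmetic.

```python
# Quick check of the relative improvements quoted above (numbers taken from the result tables).
fb_occ_rayiou, drivepi_rayiou = 39.0, 49.3   # RayIoU (3D occupancy) on OpenOcc
orion_col, drivepi_col = 0.37, 0.11          # avg. collision rate (%), both with ego status
vad_l2, drivepi_l2 = 0.72, 0.49              # avg. L2 error (m), both without ego status

print(f"RayIoU gain over FB-OCC: {drivepi_rayiou - fb_occ_rayiou:.1f}")          # 10.3
print(f"Collision-rate reduction vs. ORION: {1 - drivepi_col / orion_col:.0%}")  # ~70%
print(f"L2 reduction vs. VAD: {1 - drivepi_l2 / vad_l2:.0%}")                    # ~32%
```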
| Method | VLM-based | OccScore | RayIoU (3D Occ.) | mAVE (Occ. Flow) | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|---|---|
| OccNeRF | | 28.5 | 31.7 | -- | 16.6 | 29.3 | 49.2 |
| RenderOcc | | 33.0 | 36.7 | -- | 20.3 | 32.7 | 49.9 |
| LetOccFlow | | 36.4 | 40.5 | -- | 25.5 | 39.7 | 56.3 |
| OccNet | | 35.7 | 39.7 | -- | 29.3 | 39.7 | 50.0 |
| BEVDetOcc-SF | | 33.0 | 36.7 | 1.420 | 31.6 | 37.3 | 41.1 |
| FB-Occ | | 39.2 | 39.0 | 0.591 | 32.7 | 39.9 | 44.4 |
| F-Occ | | 41.0 | 39.9 | 0.491 | 33.9 | 40.7 | 45.2 |
| CascadeFlow | | 40.9 | 39.6 | 0.470 | 33.5 | 40.3 | 45.0 |
| ALOcc-Flow-3D | | 43.0 | 41.9 | 0.556 | 35.6 | 42.8 | 47.4 |
| DrivePI (Ours) | ✓ | 49.3 | 49.3 | 0.509 | 45.0 | 50.0 | 52.9 |
| Method | VLM-based | RayIoU | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|
| RenderOcc | | 19.5 | 13.4 | 19.6 | 25.5 |
| SimpleOcc | | 22.5 | 17.0 | 22.7 | 27.9 |
| BEVFormer | | 32.4 | 26.1 | 32.9 | 38.0 |
| BEVDet-Occ | | 32.6 | 26.6 | 33.1 | 38.2 |
| FB-Occ | | 33.5 | 26.7 | 34.1 | 39.7 |
| SparseOcc | | 36.1 | 30.2 | 36.8 | 41.2 |
| OPUS | | 41.2 | 34.7 | 42.1 | 46.7 |
| DrivePI (Ours)* | ✓ | 46.0 | 42.2 | 46.7 | 49.2 |
*DrivePI trained exclusively on the 3D occupancy task of Occ3D-nuScenes.
| Method | VLM-based | Ego Status | L2 1s (m) | L2 2s (m) | L2 3s (m) | L2 avg. (m) | Col. 1s (%) | Col. 2s (%) | Col. 3s (%) | Col. avg. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ST-P3 | | | 1.33 | 2.11 | 2.90 | 2.11 | 0.23 | 0.62 | 1.27 | 0.71 |
| FF | | | 0.55 | 1.20 | 2.54 | 1.43 | 0.06 | 0.17 | 1.07 | 0.43 |
| EO | | | 0.67 | 1.36 | 2.78 | 1.60 | 0.04 | 0.09 | 0.88 | 0.33 |
| UniAD | | | 0.48 | 0.96 | 1.65 | 1.03 | 0.05 | 0.17 | 0.71 | 0.31 |
| VAD | | | 0.41 | 0.70 | 1.05 | 0.72 | 0.07 | 0.17 | 0.41 | 0.22 |
| VAD | | ✓ | 0.17 | 0.34 | 0.60 | 0.37 | 0.07 | 0.10 | 0.24 | 0.14 |
| OmniDrive | ✓ | ✓ | 0.14 | 0.29 | 0.55 | 0.33 | 0.00 | 0.13 | 0.78 | 0.30 |
| ORION | ✓ | ✓ | 0.17 | 0.31 | 0.55 | 0.34 | 0.05 | 0.25 | 0.80 | 0.37 |
| OpenDriveVLA-7B | ✓ | ✓ | 0.20 | 0.58 | 1.21 | 0.66 | 0.00 | 0.22 | 0.55 | 0.25 |
| DrivePI (Ours) | ✓ | | 0.24 | 0.46 | 0.78 | 0.49 | 0.38 | 0.27 | 0.48 | 0.38 |
| DrivePI (Ours) | ✓ | ✓ | 0.19 | 0.36 | 0.64 | 0.40 | 0.00 | 0.05 | 0.28 | 0.11 |
| Method | Exist | Count | Object | Status | Comparison | Accuracy |
|---|---|---|---|---|---|---|
| LLaMA-AdapV2 | 19.3 | 2.7 | 7.6 | 10.8 | 1.6 | 9.6 |
| LLaVA1.5 | 45.8 | 7.7 | 7.8 | 9.0 | 52.1 | 26.2 |
| LiDAR-LLM | 74.5 | 15.0 | 37.8 | 45.9 | 57.8 | 48.6 |
| BEVDet+BUTD | 83.7 | 20.9 | 48.8 | 52.0 | 67.7 | 57.0 |
| OpenDriveVLA-0.5B | 83.9 | 22.0 | 50.2 | 57.0 | 68.4 | 58.4 |
| OpenDriveVLA-3B | 84.0 | 22.3 | 50.3 | 56.9 | 68.5 | 58.5 |
| OpenDriveVLA-7B | 84.2 | 22.7 | 49.6 | 54.5 | 68.8 | 58.2 |
| DrivePI (Ours) | 85.3 | 22.4 | 57.5 | 59.1 | 68.3 | 60.7 |
| # | Text Head | Vision Head | 3D Occ. RayIoU | Occ. Flow mAVE | Planning L2 (m) | Planning Col. (%) | QA Acc. |
|---|---|---|---|---|---|---|---|
| I | ✓ | -- | -- | -- | -- | -- | 61.2 |
| II | -- | ✓ | 47.5 | 0.69 | 1.02 | 0.39 | -- |
| III | ✓ | ✓ | 49.3 | 0.51 | 0.49 | 0.38 | 60.7 |
@article{liu2025drivepi,
title={DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning},
author={Liu, Zhe and Huang, Runhui and Yang, Rui and Yan, Siming and Wang, Zining and Hou, Lu and Lin, Di and Bai, Xiang and Zhao, Hengshuang},
journal={arXiv preprint},
year={2025}
}

We thank these great works and open-source repositories: UniLION, MMDetection3D, InternVL3, LLaVA, and EMOVA.


