Xpeng’s X-Mind world model framework addresses a gap that hinders autonomous driving: the inability to predict rather than merely react. The framework, presented at the foundation model workshop of the Conference on Computer Vision and Pattern Recognition (CVPR) in Denver this month, joins X-World and X-Foresight as one of the three pillars of Xpeng’s Physical AI research program.

Figure: General architecture of X-Mind. The predictive world model is embedded within the large driving model. Recurrent Block Diffusion performs progressive denoising across hierarchical inner layers in a single forward pass to create a compact abstract sketch. The planner derives the optimal ego-vehicle trajectory based on this predicted physical future. Blue arrows indicate the training data flow; black arrows indicate inference.

Traditional autonomous driving works on a reactive perception-action loop, processing immediate visual input without modeling how the surrounding environment will evolve. X-Mind adds a visual chain of thought: it runs a spatio-temporal simulation within the system before any action is generated, allowing the vehicle to anticipate traffic conditions rather than simply respond to them.

Figure: Visualization of the structured abstract sketch. These annotations serve as high-quality supervisory signals for training the world model and include (a) dynamic traffic light states, (b) adaptive navigation objectives, and (c) speed compliance profiles. Dense, structurally salient annotations are critical for the model to learn complex physical and semantic driving rules.

Using a deep compression autoencoder, the framework’s Thought Sketch module compresses 12 predicted future frames into 96 tokens, discarding texture data irrelevant to planning while preserving road topology, traffic light states, and navigation intent.
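To put the compression figures in perspective, the short Python sketch below works through the arithmetic. Only the 12 frames and 96 tokens come from the article. The even split of tokens across frames and the uncompressed count of 256 tokens per frame are illustrative assumptions, not details Xpeng has disclosed.

```python
# Figures stated in the article.
FUTURE_FRAMES = 12
SKETCH_TOKENS = 96


def tokens_per_frame(frames: int, tokens: int) -> int:
    """Average token budget per frame, assuming an even split (an assumption)."""
    if frames <= 0 or tokens % frames != 0:
        raise ValueError("tokens must divide evenly across a positive frame count")
    return tokens // frames


def compression_ratio(raw_tokens_per_frame: int, frames: int, tokens: int) -> float:
    """Ratio of a hypothetical uncompressed token count to the sketch budget."""
    return (raw_tokens_per_frame * frames) / tokens


if __name__ == "__main__":
    # 96 tokens over 12 frames.
    print(tokens_per_frame(FUTURE_FRAMES, SKETCH_TOKENS))
    # 256 raw tokens per frame is a made-up baseline for comparison only.
    print(compression_ratio(256, FUTURE_FRAMES, SKETCH_TOKENS))
```

Under the even-split assumption, each predicted frame is summarized by 8 tokens, which gives a sense of how aggressively texture detail has to be discarded.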
The Recurrent Block Diffusion mechanism then generates future representations in a single forward pass, achieving significantly higher image quality than single-step denoising at comparable inference latency.

Figure: Overview of Recurrent Block Diffusion. The transformer layers are divided into five blocks. During training, the sketch token features in each block are replaced by linear combinations of noise and ground truth. During inference, the outputs of earlier blocks feed into subsequent blocks via Euler integration with a fixed time step, all within one large language model forward pass.

In benchmark tests, X-Mind reduced lateral and longitudinal displacement error relative to traditional vision-language-action models, with gains concentrated in complex long-tail scenarios where safety and traffic compliance are most critical. Inference latency is described as compatible with automotive-grade hardware constraints, a deployment threshold that heavier 3D reconstruction approaches cannot meet.

Figure: Qualitative comparison of future bird’s-eye-view (BEV) forecasts. The images show future spatial inference results in both day and night scenarios. Compared with baseline methods based on single-step generation (middle row), X-Mind’s Recurrent Block Diffusion (RBD) framework (bottom row) produces highly accurate and temporally consistent predictions. Notably, the RBD framework is able to predict the motion of dynamic objects even when those objects are not under ground truth (GT) supervision.

Xpeng stated that the architecture is being extended beyond autonomous driving to embodied intelligence applications.
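The Recurrent Block Diffusion description (five blocks, noisy inputs built as linear combinations of noise and ground truth during training, fixed-step Euler integration at inference) can be illustrated with a toy Python sketch. This is not Xpeng’s implementation. The function names, the uniform time spacing, the convention that t=0 is pure noise, and the `toy_block` stand-in for a block of transformer layers are all assumptions made for illustration.

```python
NUM_BLOCKS = 5           # the article states five blocks
DT = 1.0 / NUM_BLOCKS    # fixed Euler step; uniform spacing is an assumption


def noisy_input(ground_truth, noise, t):
    """Training-time mix: a linear combination of noise and ground truth.

    Assumed convention: t=0 is pure noise, t=1 is the ground truth.
    """
    return [(1.0 - t) * n + t * g for n, g in zip(noise, ground_truth)]


def toy_block(x, t, target):
    """Hypothetical stand-in for one block of transformer layers.

    Returns the velocity that carries x to the target along a straight path
    by time 1. A real block would have to predict this from context instead
    of being handed the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def rbd_inference(noise, target):
    """One pass through all blocks; each block's output feeds the next
    through a single fixed-step Euler update."""
    x = list(noise)
    for k in range(NUM_BLOCKS):
        t = k * DT
        velocity = toy_block(x, t, target)
        x = [xi + DT * vi for xi, vi in zip(x, velocity)]
    return x


if __name__ == "__main__":
    start = [0.3, -1.2, 0.8]     # stands in for sampled noise
    target = [1.0, -2.0, 0.5]    # stands in for the future sketch features
    print(rbd_inference(start, target))
```

With this idealized velocity, the five Euler steps carry the starting noise to the target features within floating-point error. The point of the sketch is only the control flow: denoising is spread across consecutive blocks inside one forward pass instead of repeating the whole network once per step.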
Information: This content was prepared and published using AutomobileMagazine’s artificial intelligence-supported publishing system, in line with the information shared by international automotive manufacturers and reliable press sources.
Automobile Magazine – English News
Source link 2026-06-30 04:53:00