2026-03-26

VLA Research Daily Pulsar


AI track: no data today

VLA track · cs.RO · cs.AI · cs.LG



19 papers · 3 papers · 22 papers in total

3-pass filtering funnel

29 RSS

24A + 5B Pass1

+7 Pass2 promotions

-4 Pass3 downgrades

🔧3 📖19 final

🔧 Technical

Practical VLA [Tongji] 2026-03-26

Point What You Mean: Visually Grounded Instruction Policy

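Point-VLA augments language instructions with visual cues such as bounding boxes to resolve referential ambiguity. It adds an automatic data annotation pipeline and is validated on real-robot referring tasks, with consistently stronger results than text-only VLAs in cluttered scenes.

cs.RO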

Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

[Tongji]

Practical VLA [Fudan] 2026-03-26

ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

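14B Diffusion Transformer world model. DPO-based post-training suppresses unphysical behaviors; trained on a dataset of 3M manipulation clips; the physics-aligned video generation could be used directly for VLA planning.

cs.RO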

Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.

[Fudan]

Practical VLA [UIUC|Chowdhary] 2026-03-26

CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation

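Cached vision-language traversability framework. A visuosemantic caching mechanism cuts online VLM queries by 85.7%; validated on a quadruped robot indoors and outdoors; zero-shot and embodiment-aware.

cs.RO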

Navigating unstructured environments requires assessing traversal risk relative to a robot's physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that detects scene novelty and reuses prior risk assessments for semantically similar frames, reducing online VLM queries by 85.7%. Furthermore, we introduce a VLM-based trajectory selection module that evaluates proposals through visual reasoning to choose the safest path given behavioral constraints. We evaluate CATNAV on a quadruped robot across indoor and outdoor unstructured environments, comparing against state-of-the-art vision-language-action baselines. Across five navigation tasks, CATNAV achieves 10 percentage point higher average goal-reaching rate and 33% fewer behavioral constraint violations.

[UIUC|Chowdhary]
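The caching idea in the CATNAV abstract (reuse an earlier risk assessment when a new frame is semantically close to a cached one, and query the VLM only for novel scenes) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the `RiskCache` class, the cosine-similarity novelty test, the threshold value, and the `fake_vlm` stand-in are all assumptions made for the example.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class RiskCache:
    """Hypothetical cache: reuse a stored risk score for similar frame embeddings."""

    def __init__(self, query_vlm: Callable[[Sequence[float]], float],
                 threshold: float = 0.9) -> None:
        self.query_vlm = query_vlm   # assumed expensive VLM call
        self.threshold = threshold   # assumed novelty threshold
        self.entries: List[Tuple[List[float], float]] = []
        self.vlm_calls = 0

    def assess(self, embedding: Sequence[float]) -> float:
        # Look for the most similar cached frame.
        best_sim = -1.0
        best_risk: Optional[float] = None
        for cached_embedding, cached_risk in self.entries:
            sim = cosine_similarity(embedding, cached_embedding)
            if sim > best_sim:
                best_sim, best_risk = sim, cached_risk
        if best_risk is not None and best_sim >= self.threshold:
            return best_risk         # familiar scene: reuse the cached score
        # Novel scene: pay for one VLM query and remember the answer.
        risk = self.query_vlm(embedding)
        self.vlm_calls += 1
        self.entries.append((list(embedding), risk))
        return risk


def fake_vlm(embedding: Sequence[float]) -> float:
    """Stand-in for a real VLM risk query (illustration only)."""
    return round(embedding[0], 2)


cache = RiskCache(fake_vlm, threshold=0.95)
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
risks = [cache.assess(frame) for frame in frames]
print(risks, cache.vlm_calls)
```

In this toy run, four frames trigger only two VLM queries because the second and fourth frames are near-duplicates of cached ones. The 85.7% reduction reported in the abstract comes from the paper's own experiments, not from anything this sketch measures.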

📖 Background reading

Background VLA [Wisconsin] 2026-03-26

Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

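Brings a differentiable world model into offline RL for inference-time MPC optimization. Adjacent to VLA rather than directly related; no robot manipulation experiments; potentially transferable to VLA inference-time optimization.

cs.LG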

Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without further refinement. We introduce an inference time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.

[Wisconsin]

Background VLA [CUHK] 2026-03-26

EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Proposes inverse-dynamics rewards to align video world models with executable robot actions, addressing the executability gap between visual generation and physical control; could be integrated directly into existing VLA inference pipelines.

cs.RO

Pass3 downgrade: Overlaps significantly with recent ⚡ OmniVTA (World Modeling) and IDM-video alignment has ≥3 pre-2024 precedents.

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.

[CUHK]
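The EVA abstract describes a reward that scores the action sequence decoded from a generated video, favoring smooth motion (velocity, acceleration, jerk) and penalizing actions outside embodiment limits. A rough single-joint sketch of that kind of reward follows. The function name, weights, time step, joint limits, and the use of plain finite differences are assumptions for illustration, not the paper's formulation.

```python
from typing import List, Sequence


def finite_difference(values: Sequence[float], dt: float) -> List[float]:
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def mean_square(values: Sequence[float]) -> float:
    """Mean of squared values; zero for an empty sequence."""
    return sum(v * v for v in values) / len(values) if values else 0.0


def executability_reward(actions: Sequence[float], dt: float = 0.1,
                         lower: float = -1.0, upper: float = 1.0,
                         w_vel: float = 1.0, w_acc: float = 1.0,
                         w_jerk: float = 1.0, w_bound: float = 10.0) -> float:
    """Negative cost: smoother, in-bounds trajectories score higher.

    `actions` is a 1-D joint trajectory decoded by an inverse dynamics model;
    all weights and limits are illustrative assumptions.
    """
    velocity = finite_difference(actions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    # Total distance by which the trajectory leaves the allowed range.
    bound_violation = sum(max(0.0, lower - a) + max(0.0, a - upper)
                          for a in actions)
    cost = (w_vel * mean_square(velocity)
            + w_acc * mean_square(acceleration)
            + w_jerk * mean_square(jerk)
            + w_bound * bound_violation)
    return -cost


smooth = [0.0, 0.1, 0.2, 0.3, 0.4]
jerky = [0.0, 0.4, 0.0, 0.4, 0.0]
print(executability_reward(smooth) > executability_reward(jerky))
```

Because the reward depends only on the decoded actions, a visually corrupted rollout that decodes into erratic or out-of-range commands is penalized without any image-level metric, which matches the abstract's point about artifacts.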

Background VLA [Colorado] 2026-03-26

Design, Mapping, and Contact Anticipation with 3D-printed Whole-Body Tactile and Proximity Sensors

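Hardware design for 3D-printed whole-body tactile and proximity sensors; contact anticipation validated on a Franka robot. Tactile sensing is an adjacent direction, but this is hardware-oriented work rather than a VLA algorithm contribution.

cs.RO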

Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space -- the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.

[Colorado]

Background VLA [Roma] 2026-03-26

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

DILLO distills a language-action world model into a fast steering layer, shifting from simulate-then-act to describe-then-act and avoiding visual simulation latency; plug-and-play.

hf-papers

Pass3 downgrade: High overlap with recent 🔧 Latent Action Diffusion; latent foresight is not a new paradigm.

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

[Roma]

Background VLA [NJU] 2026-03-26

VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

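A VLM generates imaginary rollouts to augment offline RL interaction data. Adjacent method focused on offline RL rather than VLA architecture; the abstract reports a success rate over 24% higher than baselines on unseen tasks in robotic manipulation benchmarks.

cs.LG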

Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

[NJU]

Background VLA [Buffalo|Yang] 2026-03-26

Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Gaze-regularized training framework that aligns a VLA's internal attention with human visual patterns, with no architectural changes or inference overhead; can be integrated directly into existing VLAs to improve fine-grained manipulation.

cs.CV

Pass3 downgrade: Gaze-guided robotics has ≥3 pre-2024 precedents; attention regularization is a minor training tweak.

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.

[Buffalo|Yang]
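The gaze-regularization abstract mentions two steps: turning an aggregated gaze heatmap into a patch-level distribution, and penalizing the KL divergence between that distribution and the model's attention. The sketch below illustrates both steps in plain Python. The patch pooling by summation, the KL direction (gaze relative to attention), and the smoothing constant are assumptions, since the abstract does not specify them.

```python
import math
from typing import List, Sequence


def heatmap_to_patch_distribution(heatmap: Sequence[Sequence[float]],
                                  patch: int, eps: float = 1e-8) -> List[float]:
    """Sum a 2-D gaze heatmap over non-overlapping patch cells and normalize."""
    rows, cols = len(heatmap), len(heatmap[0])
    masses = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            masses.append(sum(heatmap[i][j]
                              for i in range(r, min(r + patch, rows))
                              for j in range(c, min(c + patch, cols))))
    # Small smoothing term keeps every patch probability strictly positive.
    total = sum(masses) + eps * len(masses)
    return [(m + eps) / total for m in masses]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-8) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy 4x4 heatmap with all gaze mass in the top-left 2x2 patch.
heatmap = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
gaze = heatmap_to_patch_distribution(heatmap, patch=2)
aligned_attention = [0.97, 0.01, 0.01, 0.01]
uniform_attention = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(gaze, aligned_attention) < kl_divergence(gaze, uniform_attention))
```

In training, a term like this would be added to the task loss with some weight, so attention that already follows the gaze pattern incurs almost no penalty. Since it only affects the loss, nothing changes at inference time, consistent with the "no inference-time overhead" claim in the abstract.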

Background VLA [Tongji] 2026-03-26

VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

Training-free visual token pruning via interaction alignment. The Interaction-First paradigm preserves structurally critical regions and lowers VLA inference cost; deployable immediately on resource-constrained platforms.

cs.CV

Pass3 downgrade: Token pruning is a saturated optimization technique (pre-2024 ≥3 works); interaction alignment is incremental.

Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/

[Tongji]

Background VLA 2026-03-26

CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

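Competitive game-theoretic multi-agent RL framework for embodied visual tracking; introduces the CoMaTrack-Bench benchmark. A VLA application, but focused on tracking rather than manipulation, so the scope is narrow.

cs.AI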

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench

Background VLA [Wayne] 2026-03-26

EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation

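Symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions. Adjacent method focused on dual-arm symmetry rather than core VLA architecture; needs further validation.

cs.RO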

Robotic imitation learning has achieved impressive success in learning complex manipulation behaviors from demonstrations. However, many existing robot learning methods do not explicitly account for the physical symmetries of robotic systems, often resulting in asymmetric or inconsistent behaviors under symmetric observations. This limitation is particularly pronounced in dual-arm manipulation, where bilateral symmetry is inherent to both the robot morphology and the structure of many tasks. In this paper, we introduce EquiBim, a symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions during training. Our approach formulates physical symmetry as a group action on both observation and action spaces, and imposes an equivariance constraint on policy predictions under symmetric transformations. The framework is model-agnostic and can be seamlessly integrated into a wide range of imitation learning pipelines with diverse observation modalities and action representations, including point cloud-based and image-based policies, as well as both end-effector-space and joint-space parameterizations. We evaluate EquiBim on RoboTwin, a dual-arm robotic platform with symmetric kinematics, and evaluate it across diverse observation and action configurations in simulation. We further validate the approach on a real-world dual-arm system. Across both simulation and physical experiments, our method consistently improves performance and robustness under distribution shifts. These results suggest that explicitly enforcing physical symmetry provides a simple yet effective inductive bias for bimanual robot learning.

[Wayne]
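The equivariance constraint described in the EquiBim abstract can be written compactly. The notation below is an assumption made for illustration: $g$ is the left-right reflection, $\rho_O(g)$ and $\rho_A(g)$ are its actions on the observation and action spaces, and $\pi$ is the policy. The paper's exact loss may differ.

```latex
% Equivariance condition: mirroring the observation mirrors the action.
\pi\big(\rho_O(g)\, o\big) = \rho_A(g)\, \pi(o)

% One possible soft penalty added to the imitation loss (assumed form).
\mathcal{L}_{\mathrm{equi}}
  = \mathbb{E}_{o}\,\Big\| \pi\big(\rho_O(g)\, o\big) - \rho_A(g)\, \pi(o) \Big\|_2^2
```

For a dual-arm robot, $\rho_A(g)$ would typically swap the left-arm and right-arm action blocks and flip the sign of the lateral components, so a mirrored scene yields the mirrored behavior.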

Background VLA [NASA] 2026-03-26

Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks

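DA-DP explicitly incorporates inference delay into policy learning, correcting zero-delay trajectories to their delay-compensated counterparts. Architecture-agnostic and transferable, but a general policy improvement rather than a VLA-specific contribution.

cs.RO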

As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.

[NASA]
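One plausible reading of "correcting zero-delay trajectories to their delay-compensated counterparts" in the DA-DP abstract is a relabeling of demonstration data: an action computed from the observation at step t only executes after the inference delay, so its training target should start at step t + d, and the policy is also told d. The sketch below illustrates that reading only. The function name, the discrete-step delay, and the tuple layout are assumptions, and the paper's actual correction may be more involved.

```python
from typing import Any, List, Sequence, Tuple


def delay_compensated_pairs(observations: Sequence[Any],
                            actions: Sequence[Any],
                            delay_steps: int,
                            horizon: int) -> List[Tuple[Any, int, List[Any]]]:
    """Pair observation t with the action chunk that starts at t + delay_steps.

    Returns (observation, delay_steps, action_chunk) samples; the delay is
    included so a policy could be conditioned on it.
    """
    if delay_steps < 0 or horizon <= 0:
        raise ValueError("delay_steps must be >= 0 and horizon must be > 0")
    samples = []
    last_start = len(actions) - horizon   # last index where a full chunk fits
    for t in range(len(observations)):
        start = t + delay_steps
        if start > last_start:
            break                         # not enough future actions left
        samples.append((observations[t], delay_steps,
                        list(actions[start:start + horizon])))
    return samples


observations = ["o0", "o1", "o2", "o3", "o4", "o5"]
actions = [0, 1, 2, 3, 4, 5]
zero_delay = delay_compensated_pairs(observations, actions, delay_steps=0, horizon=2)
two_step = delay_compensated_pairs(observations, actions, delay_steps=2, horizon=2)
print(zero_delay[0], two_step[0])
```

With a two-step delay, observation `o0` is paired with actions `[2, 3]` instead of `[0, 1]`, which is the shift the zero-delay labels ignore.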

Background VLA [Harbin] 2026-03-26

Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion

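SE(3) pose-guided imitation learning for precise insertion: a diffusion policy predicts relative pose trajectories, with a goal-conditioned RGBD encoder and residual gated fusion added for robustness to pose noise. Exploratory and task-specific.

cs.RO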

Imitation learning is promising for robotic manipulation, but precise insertion in the real world remains difficult due to contact-rich dynamics, tight clearances, and limited demonstrations. Many existing visuomotor policies depend on high-dimensional RGB/point-cloud observations, which can be data-inefficient and generalize poorly under pose variations. In this paper, we study pose-guided imitation learning by using object poses in SE(3) as compact, object-centric observations for precise insertion tasks. First, we propose a diffusion policy for precise insertion that observes the relative SE(3) pose of the source object with respect to the target object and predicts a future relative pose trajectory as its action. Second, to improve robustness to pose estimation noise, we augment the pose-guided policy with RGBD cues. Specifically, we introduce a goal-conditioned RGBD encoder to capture the discrepancy between current and goal observations. We further propose a pose-guided residual gated fusion module, where pose features provide the primary control signal and RGBD features adaptively compensate when pose estimates are unreliable. We evaluate our methods on six real-robot precise insertion tasks and achieve high performance with only 7-10 demonstrations per task. In our setup, the proposed policies succeed on tasks with clearances down to 0.01 mm and demonstrate improved data efficiency and generalization over existing baselines. Code will be available at https://github.com/sunhan1997/PoseInsert.

[Harbin]
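The "relative SE(3) pose of the source object with respect to the target object" used as the observation in the pose-guided insertion abstract has a standard closed form. The symbols below are assumptions for illustration: $T_s$ and $T_t$ are the source and target object poses expressed in a common frame such as the camera or robot base, each with rotation $R$ and translation $p$.

```latex
T = \begin{bmatrix} R & p \\ 0 & 1 \end{bmatrix} \in \mathrm{SE}(3),
\qquad
T_{\mathrm{rel}} = T_t^{-1}\, T_s
  = \begin{bmatrix} R_t^{\top} R_s & R_t^{\top} (p_s - p_t) \\ 0 & 1 \end{bmatrix}
```

Because $T_{\mathrm{rel}}$ is unchanged when both objects are moved together by the same rigid transform, a policy that observes it does not depend on where the pair sits in the workspace, which fits the abstract's claim of better generalization under pose variations.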

Background VLA [CMU] 2026-03-26

DiSCo: Diffusion Sequence Copilots for Shared Autonomy

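Diffusion sequence copilot for shared autonomy, balancing conformity to expert actions against alignment with user intent. Human-robot collaboration is an adjacent direction rather than pure VLA; application-specific.

cs.RO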

Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user's goals. To significantly improve the performance of shared autonomy, we introduce Diffusion Sequence Copilots (DiSCo): a method of shared autonomy with diffusion policy that plans action sequences consistent with past user actions. DiSCo seeds and inpaints the diffusion process with user-provided actions with hyperparameters to balance conformity to expert actions, alignment with user intent, and perceived responsiveness. We demonstrate that DiSCo substantially improves task performance in simulated driving and robotic arm tasks. Project website: https://sites.google.com/view/disco-shared-autonomy/

[CMU]

Background VLA [CAS] 2026-03-26

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

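Video-tactile-action multimodal world modeling framework; tactile perception complements vision in contact-rich scenarios. The abstract reports a 90 percent average success rate and an 80 percent margin over the pi 0.5 baseline on potato chip pick-and-place; the full text is needed to verify the size of the gains.

cs.RO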

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

[CAS]

Background VLA [Beihang] 2026-03-26

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

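E3Flow unifies rectified flow with multi-modal equivariant learning; spherical harmonic representations guarantee SO(3) equivariance. On 8 MimicGen tasks the abstract reports a 3.12% higher average success rate than Spherical Diffusion Policy (SDP) and a 7x inference speedup; needs further validation.

cs.RO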

While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7x inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code: https://github.com/zql-kk/E3Flow.

[Beihang]

Background VLA [NVIDIA] 2026-03-26

Agile-VLA: Few-Shot Industrial Pose Rectification via Implicit Affordance Anchoring

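Hierarchical VLA framework for edge devices. Implicit affordance anchoring reduces reliance on semantic inference, and an asynchronous dual-stream design decouples perception (10 Hz) from control (50 Hz); deployed on a Jetson Orin Nano. Specific to industrial scenarios; transferability of the method remains to be verified.

cs.RO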

Deploying Vision-Language-Action (VLA) models on resource-constrained edge platforms encounters a fundamental conflict between high-latency semantic inference and the high-frequency control required for dynamic manipulation. To address the challenge, this paper presents Agile-VLA, a hierarchical framework designed for industrial pose reorientation tasks on edge devices such as the NVIDIA Jetson Orin Nano. The core innovation is an Implicit Affordance Anchoring mechanism that directly maps geometric visual cues, specifically centroid and rim keypoint anchors, into structured parametric action primitives, thereby substantially reducing reliance on high-latency semantic inference during closed-loop control. By decoupling perception (10 Hz) from control (50 Hz) via an asynchronous dual-stream architecture, the system effectively mitigates the frequency mismatch inherent in edge-based robot learning. Experimental results on a standard 6-DoF manipulator demonstrate that Agile-VLA achieves robust rectification of complex, irregular workpieces using only 5-shot demonstrations through extrinsic dexterity.

[NVIDIA]
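The rate decoupling in the Agile-VLA abstract (perception at 10 Hz, control at 50 Hz) means each perception result is reused for several control ticks. The sketch below is a deliberately simplified, single-threaded simulation of that timing relationship. It is not an asynchronous implementation and not the authors' code; the function name and the "anchor age" bookkeeping are assumptions for illustration.

```python
from typing import List, Tuple


def run_dual_rate(duration_s: float = 1.0, perception_hz: int = 10,
                  control_hz: int = 50) -> Tuple[int, int, List[int]]:
    """Simulate a fast control loop that reads the latest slow perception result.

    Returns (control_ticks, perception_updates, anchor_age_per_tick), where the
    age is how many control ticks old the anchors were when a command was issued.
    """
    if control_hz % perception_hz != 0:
        raise ValueError("this sketch assumes control_hz is a multiple of perception_hz")
    ticks_per_update = control_hz // perception_hz
    total_ticks = int(round(duration_s * control_hz))

    perception_updates = 0
    last_update_tick = 0
    ages: List[int] = []
    for tick in range(total_ticks):
        if tick % ticks_per_update == 0:
            # Slow stream: refresh the geometric anchors (hypothetical step).
            last_update_tick = tick
            perception_updates += 1
        # Fast stream: issue a command from the most recent anchors.
        ages.append(tick - last_update_tick)
    return total_ticks, perception_updates, ages


ticks, updates, ages = run_dual_rate()
print(ticks, updates, max(ages))
```

At these rates the controller issues five commands per perception update, so it must tolerate anchors that are up to four control ticks (80 ms) old.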

Background VLA [HKUST] 2026-03-26

Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

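Empirical study of sim-to-real generalization in dexterous manipulation across four dimensions: domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. No new method, but it offers systematic experimental insights for reference.

cs.RO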

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

[HKUST]
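
The evaluation protocol in the abstract above varies background, lighting, distractors, object types, and spatial features. A minimal sketch of how per-axis success rates could be tallied from trial records follows. The axis identifiers and the `(axis, success)` record layout are assumptions made for illustration; the abstract does not specify the released protocol's data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Variation axes named in the abstract; the identifiers themselves are hypothetical.
AXES = ("background", "lighting", "distractors", "object_type", "spatial")


def success_by_axis(trials: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (axis, success) trial records into a success rate per axis."""
    counts = defaultdict(lambda: [0, 0])  # axis -> [successes, total trials]
    for axis, ok in trials:
        if axis not in AXES:
            raise ValueError(f"unknown variation axis: {axis!r}")
        counts[axis][0] += int(ok)
        counts[axis][1] += 1
    return {axis: wins / total for axis, (wins, total) in counts.items()}
```

Reporting one rate per axis, rather than a single pooled number, makes it visible which kind of variation a policy fails under.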

Background VLA [UCSD] 2026-03-26

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

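A spatially grounded VLA for mobile manipulation: a 13-dimensional action space coordinating base, arm, and gripper, with multi-view RGB, depth, and temporal-history inputs. The method is clearly presented, but no quantitative comparison is given.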

cs.RO


Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation quality, we co-train auxiliary decoders that reconstruct interpretable intermediate signals - including global robot position, joint configurations, grasp affordances, target-object relative pose, and segmentation masks - from shared visual-language features. These objectives provide dense supervision that encourages the backbone to develop spatially grounded, manipulation-aware latent representations. Through extensive evaluation on home rearrangement tasks, our approach achieves consistent improvements across picking, placing, opening, and closing operations, substantially outperforming direct imitation learning. Our findings suggest that spatial grounding through auxiliary and multi-modal learning provides a strong direction for scaling VLA models toward general-purpose domestic robots.

[UCSD]
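
The auxiliary co-training idea in the SG-VLA abstract above amounts to adding extra supervised terms to the action objective. The sketch below shows one plausible form of that combined loss. The mean-squared-error action term, the single shared `aux_weight`, and the task identifiers are assumptions; the abstract names the auxiliary signals but not the loss functions or their weighting.

```python
from typing import Dict, Sequence

# Auxiliary signals listed in the abstract; the identifiers are hypothetical.
AUX_TASKS = (
    "robot_position",
    "joint_configuration",
    "grasp_affordance",
    "target_relative_pose",
    "segmentation_mask",
)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cotraining_loss(
    action_pred: Sequence[float],
    action_target: Sequence[float],
    aux_losses: Dict[str, float],
    aux_weight: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Action loss plus a weighted sum of auxiliary decoder losses."""
    unknown = set(aux_losses) - set(AUX_TASKS)
    if unknown:
        raise ValueError(f"unknown auxiliary tasks: {sorted(unknown)}")
    return mse(action_pred, action_target) + aux_weight * sum(aux_losses.values())
```

Here `action_pred` and `action_target` would be 13-dimensional vectors covering base, arm, and gripper. Because the auxiliary decoders share the backbone's features, their gradients act as the dense supervision the abstract describes.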

Background VLA [Berkeley] 2026-03-26

CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

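A Code-as-Policy benchmarking framework (CaP-Gym + CaP-Bench) that evaluates 12 models at different levels of abstraction. The open-access framework complements VLA methods but is not a core algorithmic contribution.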

cs.RO


"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet the effectiveness of coding agents as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that scaling agentic test-time computation (through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning) mitigates this gap and substantially improves robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing that reinforcement learning with verifiable rewards improves success rates and transfers from simulation to the real world with a minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.

[Berkeley]
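
The multi-turn interaction with structured execution feedback mentioned in the CaP-X abstract above can be sketched as a small loop: the agent emits a program over perception and control primitives, the environment executes it, and any error is returned as feedback for the next attempt. Everything named here (`FakeRobot`, `scripted_model`, `run_agent`, and the `detect`/`pick`/`place` primitives) is hypothetical and is not CaP-Gym's API. The scripted model stands in for a real language model.

```python
from typing import Callable, Dict, Optional


class FakeRobot:
    """Hypothetical toy environment with one block that can be picked and placed."""

    def __init__(self) -> None:
        self.holding = False
        self.block_at = "table"

    def detect(self, name: str) -> str:
        """Perception primitive: report where the named object is."""
        return self.block_at

    def pick(self, location: str) -> None:
        if location != self.block_at:
            raise RuntimeError(f"nothing to pick at {location!r}")
        self.holding = True

    def place(self, location: str) -> None:
        if not self.holding:
            raise RuntimeError("gripper is empty")
        self.holding = False
        self.block_at = location


def scripted_model(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a language model: a buggy first program, then a repaired one."""
    if feedback is None:
        return "place('bin')"  # forgets to pick the block first
    return "pick(detect('block'))\nplace('bin')"


def run_agent(task: str, model: Callable[[str, Optional[str]], str],
              max_turns: int = 3) -> Dict[str, object]:
    """Synthesize, execute, and retry with the execution error as feedback."""
    robot = FakeRobot()
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        program = model(task, feedback)
        primitives = {"detect": robot.detect, "pick": robot.pick, "place": robot.place}
        try:
            # Only the primitives are exposed; this is NOT a security sandbox.
            exec(program, {"__builtins__": {}}, primitives)
        except Exception as err:
            feedback = f"{type(err).__name__}: {err}"
            continue
        if robot.block_at == "bin":
            return {"success": True, "turns": turn}
        feedback = "program ran but the block is not in the bin"
    return {"success": False, "turns": max_turns}
```

With the scripted model, the first program fails with "gripper is empty" and the second attempt succeeds, so `run_agent("put the block in the bin", scripted_model)` reports success on turn 2.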

Background VLA [HKUST] 2026-03-26

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

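A unified MLLM framework that decomposes meshes into sim-ready articulated assets; its Sparse 3D VQ-VAE reduces token counts by 70%. Simulation asset generation is an adjacent direction, not directly related to VLA.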

hf-papers


High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.

[HKUST]

Pass1 filtered papers: 5

The following papers were sorted into bucket B (lower relevance) in Pass1 and did not proceed to detailed LLM review.

Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging · scientific imaging, no robotics
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles · autonomous driving, not manipulation
GenExam: A Multidisciplinary Text-to-Image Exam · text-to-image benchmark, no robotics
AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation · video segmentation, not embodied
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences · LiDAR, leans toward autonomous driving
