VLA Research Daily Pulsar

VLA Research Daily

30 papers in total

📖 Background Reading

VLA

MoE-ACT: Scaling Multi-Task Bimanual Manipulation with Sparse Language-Conditioned Mixture-of-Experts Transformers

arXiv:2603.15265v1 Announce Type: new Abstract: The ability of robots to handle multiple tasks under a unified policy is critical for deploying embodied intelligence in real-world household and industrial applications. However, out-of-distribution variation across tasks often causes severe task interference and negative transfer when training general robotic policies. To address this challenge, we propose a lightweight multi-task imitation learning framework for bimanual manipulation, termed Mix
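The abstract is cut before the method details, but the title names a sparse, language-conditioned mixture-of-experts transformer. Below is a minimal sketch of that general idea, assuming top-k routing driven by a pooled language embedding; all class names and dimensions are illustrative, not taken from the paper.

```python
# Minimal sketch (PyTorch) of a sparse, language-conditioned MoE feed-forward layer:
# a router scores experts from the pooled language embedding and only the top-k
# experts are evaluated. Top-k routing from language is an illustrative assumption,
# not a detail confirmed by the MoE-ACT abstract.
import torch
import torch.nn as nn

class LanguageConditionedMoE(nn.Module):
    def __init__(self, d_model=256, d_lang=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_lang, n_experts)   # route from language, not tokens
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, tokens, lang_emb):
        # tokens: (B, T, d_model), lang_emb: (B, d_lang)
        weights, idx = self.router(lang_emb).softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalise over chosen experts
        out = torch.zeros_like(tokens)
        for b in range(tokens.shape[0]):                    # explicit loop for clarity, not speed
            for w, e in zip(weights[b], idx[b]):
                out[b] += w * self.experts[int(e)](tokens[b])
        return tokens + out                                 # residual connection

moe = LanguageConditionedMoE()
print(moe(torch.randn(2, 10, 256), torch.randn(2, 512)).shape)  # torch.Size([2, 10, 256])
```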

VLA

HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing

arXiv:2603.15257v1 Announce Type: new Abstract: Tactile sensing is a crucial capability for Vision-Language-Action (VLA) architectures, as it enables dexterous and safe manipulation in contact-rich tasks. However, reliance on dedicated tactile hardware increases cost and reduces reproducibility across robotic platforms. We argue that tactile-aware manipulation can be learned offline and deployed without direct haptic feedback at inference. To this end, we present HapticVLA, which proceeds in two
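The abstract argues that tactile-aware manipulation can be trained offline and deployed without haptic sensing at inference. One common way to realize that idea is to treat tactile readings as a privileged, training-time-only signal; the sketch below assumes an auxiliary tactile-prediction head and is not necessarily the paper's two-stage procedure, which is truncated here.

```python
# Hypothetical sketch: tactile as a privileged training signal, vision-only deployment.
# An auxiliary head forces the visual features to predict tactile readings during
# training; at inference the tactile branch is simply unused. This is one plausible
# reading of "learned offline, no tactile at inference", not the paper's exact recipe.
import torch
import torch.nn as nn

class VisionPolicy(nn.Module):
    def __init__(self, d_obs=128, d_act=7, d_tactile=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_obs, 256), nn.ReLU())
        self.action_head = nn.Linear(256, d_act)
        self.tactile_head = nn.Linear(256, d_tactile)   # training-time auxiliary only

    def forward(self, obs):
        h = self.encoder(obs)
        return self.action_head(h), self.tactile_head(h)

policy = VisionPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_step(obs, expert_action, tactile):
    pred_act, pred_tac = policy(obs)
    loss = nn.functional.mse_loss(pred_act, expert_action) \
         + 0.1 * nn.functional.mse_loss(pred_tac, tactile)   # privileged supervision
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# At deployment only the action head is queried; no tactile hardware is needed.
action, _ = policy(torch.randn(1, 128))
```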

VLA

NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation

arXiv:2603.15186v1 Announce Type: new Abstract: Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Bu

VLA

ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

arXiv:2603.15169v1 Announce Type: new Abstract: Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the
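For background, hybrid force-position control typically splits the Cartesian axes between position tracking and force regulation via a selection matrix. The textbook loop below illustrates that split; it does not reproduce ForceVLA2's force-based prompting, and the gains are placeholders.

```python
# Textbook hybrid force/position control step (numpy): a diagonal selection matrix S
# marks which Cartesian axes are position-controlled (1) and which are
# force-controlled (0). Gains and dimensions are illustrative.
import numpy as np

S = np.diag([1, 1, 0, 1, 1, 1])       # z-axis (index 2) is force-controlled
Kp, Kf = 50.0, 0.02                    # position and force gains

def hybrid_command(x, x_des, f, f_des):
    """Return a Cartesian velocity command mixing position and force errors."""
    v_pos = Kp * (x_des - x)           # position error term
    v_frc = Kf * (f_des - f)           # force error term (e.g. press with 10 N along z)
    return S @ v_pos + (np.eye(6) - S) @ v_frc

x = np.zeros(6); x_des = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0])
f = np.zeros(6); f_des = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0])
print(hybrid_command(x, x_des, f, f_des))
```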

VLA

Master Micro Residual Correction with Adaptive Tactile Fusion and Force-Mixed Control for Contact-Rich Manipulation

arXiv:2603.15152v1 Announce Type: new Abstract: Robotic contact-rich and fine-grained manipulation remains a significant challenge due to complex interaction dynamics and the competing requirements of multi-timescale control. While current visual imitation learning methods excel at long-horizon planning, they often fail to perceive critical interaction cues like friction variations or incipient slip, and struggle to balance global task coherence with local reactive feedback. To address these cha
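One plausible reading of "micro residual correction with adaptive tactile fusion" is a base policy whose actions are nudged by a small, gated correction computed from tactile features. The sketch below makes that assumption explicit; the gating form and residual scale are illustrative, and the paper's force-mixed controller is not modeled.

```python
# Minimal numpy sketch of residual correction: a slow base policy proposes an action
# and a fast tactile-driven head adds a small, gated correction. Gating and scale
# are illustrative assumptions, not the paper's design.
import numpy as np

def corrected_action(base_action, tactile, w_res, gate_w, max_residual=0.01):
    gate = 1.0 / (1.0 + np.exp(-gate_w @ tactile))       # adaptive gate in (0, 1)
    residual = np.tanh(w_res @ tactile) * max_residual   # bounded micro-correction
    return base_action + gate * residual

rng = np.random.default_rng(0)
a = corrected_action(base_action=np.zeros(7),
                     tactile=rng.normal(size=16),
                     w_res=rng.normal(size=(7, 16)) * 0.1,
                     gate_w=rng.normal(size=16))
print(a)
```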

VLA

Confusion-Aware In-Context-Learning for Vision-Language Models in Robotic Manipulation

arXiv:2603.15134v1 Announce Type: new Abstract: Vision-language models (VLMs) have significantly improved the generalization capabilities of robotic manipulation. However, VLM-based systems often suffer from a lack of robustness, leading to unpredictable errors, particularly in scenarios involving confusable objects. Our preliminary analysis reveals that these failures are mainly caused by a shortcut learning problem inherent in VLMs, limiting their ability to accurately distinguish between conf

VLA

AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

arXiv:2603.15046v1 Announce Type: new Abstract: In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their dep
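As background on the "deep state space models" in the title, a linear SSM layer replaces attention with an O(T) recurrence over a latent state, which is what makes such backbones attractive for lightweight VLAs. The toy scan below shows only that recurrence; the parameters are random placeholders, not AnoleVLA's architecture.

```python
# Minimal numpy sketch of a linear state-space layer scanned over a sequence:
# h_t = A h_{t-1} + B x_t,  y_t = C h_t.  Random placeholder parameters.
import numpy as np

def ssm_scan(x, A, B, C):
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):        # O(T) recurrence vs. O(T^2) attention
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_state, d_out, T = 32, 64, 32, 100
A = 0.95 * np.eye(d_state)             # stable, near-identity dynamics
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
y = ssm_scan(rng.normal(size=(T, d_in)), A, B, C)
print(y.shape)  # (100, 32)
```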

VLA

ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy

arXiv:2603.14977v1 Announce Type: new Abstract: Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizi

VLA

Learning from Mistakes: Post-Training for Driving VLA with Takeover Data

arXiv:2603.14972v1 Announce Type: new Abstract: Current Vision-Language-Action (VLA) paradigms in end-to-end autonomous driving rely on offline training from static datasets, leaving them vulnerable to distribution shift. Recent post-training methods use takeover data to mitigate this by augmenting the dataset with high-quality expert takeover samples, yet they suffer from two key limitations: supervision restricted to the period after the takeover moments leads to policies with limited safety m

VLA

Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

arXiv:2603.14811v1 Announce Type: new Abstract: Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational locat

VLA

Tactile Modality Fusion for Vision-Language-Action Models

arXiv:2603.14604v1 Announce Type: new Abstract: We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact for
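The name TacFiLM suggests FiLM-style conditioning, where the tactile signal produces per-channel scale and shift applied to visual features. The sketch below shows plain FiLM under that assumption; where the fusion sits inside the VLA backbone is not specified in the truncated abstract.

```python
# Sketch (PyTorch) of FiLM-style fusion suggested by the name "TacFiLM": tactile
# input is mapped to per-channel scale (gamma) and shift (beta) modulating the
# visual feature map. Exact placement in the backbone is an assumption.
import torch
import torch.nn as nn

class TactileFiLM(nn.Module):
    def __init__(self, d_tactile=16, n_channels=256):
        super().__init__()
        self.to_gamma_beta = nn.Linear(d_tactile, 2 * n_channels)

    def forward(self, visual_feat, tactile):
        # visual_feat: (B, C, H, W), tactile: (B, d_tactile)
        gamma, beta = self.to_gamma_beta(tactile).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1 + gamma) * visual_feat + beta   # identity when gamma = beta = 0

film = TactileFiLM()
print(film(torch.randn(2, 256, 14, 14), torch.randn(2, 16)).shape)  # (2, 256, 14, 14)
```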

VLA

One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation

arXiv:2603.14522v1 Announce Type: new Abstract: Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy acros

VLA

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

arXiv:2603.14401v1 Announce Type: new Abstract: We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced d

VLA

OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

arXiv:2603.14371v1 Announce Type: new Abstract: Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computa

VLA

OmniClone: Engineering a Robust, All-Rounder Whole-Body Humanoid Teleoperation System

arXiv:2603.14327v1 Announce Type: new Abstract: Whole-body humanoid teleoperation enables humans to remotely control humanoid robots, serving as both a real-time operational tool and a scalable engine for collecting demonstrations for autonomous learning. Despite recent advances, existing systems are validated using aggregate metrics that conflate distinct motion regimes, masking critical failure modes. This lack of diagnostic granularity, compounded by tightly coupled and labor-intensive system

VLA

A Real-Time Neuro-Symbolic Ethical Governor for Safe Decision Control in Autonomous Robotic Manipulation

arXiv:2603.14221v1 Announce Type: new Abstract: Ethical decision governance has become a critical requirement for autonomous robotic systems operating in human-centered and safety-sensitive environments. This paper presents a real-time neuro-symbolic ethical governor designed to enable risk-aware supervisory control in autonomous robotic manipulation tasks. The proposed framework integrates transformer-based ethical reasoning with a probabilistic ethical risk field formulation and a threshold-ba

VLA

ToMPC: Task-oriented Model Predictive Control via ADMM for Safe Robotic Manipulation

arXiv:2603.13944v1 Announce Type: new Abstract: This paper proposes a task-oriented model predictive control (ToMPC) framework for safe and efficient robotic manipulation in open workspaces. The framework unifies collision-free motion and robot-environment interaction to address diverse scenarios. Additionally, it introduces task-oriented obstacle avoidance that leverages kinematic redundancy to enhance manipulation efficiency in obstructed environments. This complex optimization problem is solv
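For readers unfamiliar with ADMM, the solver splits the variables so that each update step is an easy subproblem. The toy example below solves a box-constrained least-squares problem with the standard scaled-dual ADMM iteration; ToMPC's dynamics and collision-avoidance terms are not modeled here.

```python
# Generic ADMM sketch (numpy): minimize 0.5*||A x - b||^2 subject to box constraints.
# Splitting x = z puts the quadratic cost on x and the constraint projection on z.
import numpy as np

def admm_box_ls(A, b, lo, hi, rho=1.0, iters=100):
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)   # u is the scaled dual variable
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))            # factor once, reuse each iteration
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))                   # quadratic subproblem
        z = np.clip(x + u, lo, hi)                      # projection subproblem
        u = u + x - z                                   # dual update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20)
print(admm_box_ls(A, b, lo=-0.5, hi=0.5))
```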

VLA

SmoothVLA: Aligning Vision-Language-Action Models with Physical Constraints via Intrinsic Smoothness Optimization

arXiv:2603.13925v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. However, existing post-training methods face a dilemma between stability and exploration: Supervised Fine-Tuning (SFT) is constrained by demonstration quality and lacks generalization, whereas Reinforcement Learning (RL) improves exploration but often induces erratic, jittery trajectories that violate physical constraints. To bridge this gap, we propos
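The abstract contrasts stable SFT with jittery RL-induced trajectories. A simple way to express a smoothness objective is a finite-difference penalty over the predicted action chunk; whether SmoothVLA's "intrinsic smoothness optimization" takes this form is not stated, so treat the loss below as illustrative.

```python
# Finite-difference smoothness penalty (PyTorch) over a predicted action chunk:
# velocity, acceleration, and jerk terms discourage jittery trajectories.
# Weights are placeholders, not the paper's values.
import torch

def smoothness_loss(actions, w_vel=1.0, w_acc=1.0, w_jerk=1.0):
    # actions: (B, T, d_action) chunk predicted by the policy
    vel  = actions[:, 1:] - actions[:, :-1]
    acc  = vel[:, 1:] - vel[:, :-1]
    jerk = acc[:, 1:] - acc[:, :-1]
    return (w_vel * vel.pow(2).mean()
            + w_acc * acc.pow(2).mean()
            + w_jerk * jerk.pow(2).mean())

chunk = torch.randn(4, 16, 7, requires_grad=True)
loss = smoothness_loss(chunk)   # add to the SFT/RL objective as a regulariser
loss.backward()
print(loss.item())
```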

VLA

GraspADMM: Improving Dexterous Grasp Synthesis via ADMM Optimization

arXiv:2603.13832v1 Announce Type: new Abstract: Synthesizing high-quality dexterous grasps is a fundamental challenge in robot manipulation, requiring adherence to diversity, kinematic feasibility (valid hand-object contact without penetration), and dynamic stability (secure multi-contact forces). The recent framework Dexonomy successfully ensures broad grasp diversity through dense sampling and improves kinematic feasibility via a simulator-based refinement method that excels at resolving exact

VLA

Building Explicit World Model for Zero-Shot Open-World Object Manipulation

arXiv:2603.13825v1 Announce Type: new Abstract: Open-world object manipulation remains a fundamental challenge in robotics. While Vision-Language-Action (VLA) models have demonstrated promising results, they rely heavily on large-scale robot action demonstrations, which are costly to collect and can hinder out-of-distribution generalization. In this paper, we propose an explicit-world-model-based framework for open-world manipulation that achieves zero-shot generalization by constructing a physi

VLA

ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

arXiv:2603.13788v1 Announce Type: new Abstract: Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D represent

VLA

Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection

arXiv:2603.13782v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, showing the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuri
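The title suggests reading a deviation signal directly from existing attention heads rather than training an external critic. A hypothetical version of that idea is sketched below: score each head by how much attention the action tokens pay to the instruction tokens, then flag a step when a calibrated head's score drops. The statistic and threshold are assumptions; the paper's criterion may differ.

```python
# Hypothetical attention-head deviation detector (numpy): per-head score = mean
# attention mass from action tokens to instruction tokens; flag when a chosen,
# pre-calibrated head falls below a threshold. Illustrative only.
import numpy as np

def head_instruction_scores(attn, instr_idx, action_idx):
    # attn: (n_heads, T, T) attention weights from one decoding step
    return attn[:, action_idx][:, :, instr_idx].mean(axis=(1, 2))   # (n_heads,)

def deviation_flag(attn, instr_idx, action_idx, head, threshold):
    return head_instruction_scores(attn, instr_idx, action_idx)[head] < threshold

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(32), size=(16, 32))   # 16 heads, rows sum to 1
print(deviation_flag(attn,
                     instr_idx=np.arange(0, 8),     # instruction token positions
                     action_idx=np.arange(24, 32),  # action token positions
                     head=3, threshold=0.2))
```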

VLA

KoopmanFlow: Spectrally Decoupled Generative Control Policy via Koopman Structural Bias

arXiv:2603.13781v1 Announce Type: new Abstract: Generative Control Policies (GCPs) show immense promise in robotic manipulation but struggle to simultaneously model stable global motions and high-frequency local corrections. While modern architectures extract multi-scale spatial features, their underlying Probability Flow ODEs apply a uniform temporal integration schedule. Compressed to a single step for real-time Receding Horizon Control (RHC), uniform ODE solvers mathematically smooth over spa
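As background on the Koopman structural bias in the title: states are lifted into observables where the dynamics become approximately linear, and the linear operator can be fit by least squares (EDMD-style). The sketch below shows only that classical step, with hand-picked observables, not how the paper couples it to a generative control policy.

```python
# EDMD-style Koopman fit (numpy): lift states into observables g(x) so that
# z_{t+1} ≈ K z_t, and estimate K by least squares. Observables are hand-picked
# placeholders, not the paper's construction.
import numpy as np

def lift(x):                                   # observables: [x, x^2, sin x]
    return np.concatenate([x, x ** 2, np.sin(x)], axis=-1)

def fit_koopman(states):
    Z = lift(states)
    Z0, Z1 = Z[:-1], Z[1:]
    K, *_ = np.linalg.lstsq(Z0, Z1, rcond=None)   # Z1 ≈ Z0 @ K
    return K.T                                    # so that z_{t+1} ≈ K.T-form @ z_t

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(scale=0.05, size=(200, 3)), axis=0)   # toy trajectory
K = fit_koopman(x)
z_next_pred = K @ lift(x[:-1]).T               # one-step prediction in lifted space
print(K.shape, z_next_pred.shape)              # (9, 9) (9, 199)
```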

VLA

Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation

arXiv:2603.13756v1 Announce Type: new Abstract: Imitation learning has demonstrated impressive results in robotic manipulation but fails under out-of-distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near-infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often im

VLA

REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

arXiv:2603.13707v1 Announce Type: new Abstract: Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution

VLA

Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

arXiv:2603.13616v1 Announce Type: new Abstract: Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is s
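To see why small hardware budgets demand more than raw success-rate comparisons, consider the standard Bayesian baseline of comparing Beta posteriors over success probability. The sketch below is that baseline only; the paper's framework explicitly goes beyond binary success, so this is context rather than the proposed method.

```python
# Bayesian A/B baseline (numpy): with few rollouts, compare Beta posteriors over
# success probability instead of raw rates. Illustrative context, not the paper's
# proposed comparison framework.
import numpy as np

def prob_a_better(succ_a, n_a, succ_b, n_b, samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p_a = rng.beta(1 + succ_a, 1 + n_a - succ_a, samples)   # Beta(1, 1) prior
    p_b = rng.beta(1 + succ_b, 1 + n_b - succ_b, samples)
    return (p_a > p_b).mean()

# 12/15 vs 9/15 successes: point estimates differ a lot, the evidence less so.
print(prob_a_better(12, 15, 9, 15))   # roughly 0.87, far from decisive
```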

VLA

Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis

arXiv:2603.13528v1 Announce Type: new Abstract: While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-

VLA

Spatially Grounded Long-Horizon Task Planning in the Wild

arXiv:2603.13433v1 Announce Type: new Abstract: Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the

VLA

Bi-HIL: Bilateral Control-Based Multimodal Hierarchical Imitation Learning via Subtask-Level Progress Rate and Keyframe Memory for Long-Horizon Contact-Rich Robotic Manipulation

arXiv:2603.13315v1 Announce Type: new Abstract: Long-horizon contact-rich robotic manipulation remains challenging due to partial observability and unstable subtask transitions under contact uncertainty. While hierarchical architectures improve temporal reasoning and bilateral imitation learning enables force-aware control, existing approaches often rely on flat policies that struggle with long-horizon coordination. We propose Bi-HIL, a bilateral control-based multimodal hierarchical imitation l

VLA

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We levera

hf-papers · Read the original