PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
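Shizhe Chen et al. · Proposes the PointACT architecture, which uses a multi-scale point cloud-action interaction mechanism to overcome existing VLAs' reliance on 2D visual representations, markedly improving manipulation accuracy and generalization in 3D space. It surpasses the state of the art on benchmarks such as LIBERO and addresses the 3D perception bottleneck of VLAs.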
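Edison Velasco-Sanchez et al. · Combines visual and tactile feedback in model predictive control (MPC) for contour-following tasks. It provides a concrete tactile-visual fusion control scheme that can be applied directly to the design of low-level VLA controllers requiring high-precision contact manipulation.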
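Benedict Quartey et al. · Achieves zero-shot skill composition by jointly learning predicates and actions, addressing the difficulty that traditional learning from demonstration (LfD) has in generalizing to unseen skill combinations. The method offers a reusable training paradigm for skill modularization and compositional generalization in VLAs.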
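Zijian Zhang et al. · Introduces a feed-forward 3D Gaussian world model to strengthen VLA policy generation, compensating for the weakness of pure imitation learning in long-term planning. The work combines world models with VLAs and offers a new perspective on raising success rates in long-horizon tasks.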
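Alex S. Huang et al. · Releases VLA-REPLICA, a low-cost, reproducible real-world VLA evaluation benchmark that fills the evaluation gap between simulation and real deployment. Teams can use the benchmark directly for robustness validation and comparative experiments on VLA models.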
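Hanxiang Ren et al. · Proposes the DISC framework, which decouples instruction conditioning from state conditioning through policy generation, eliminating the shortcut learning caused by observation leakage. The method improves VLA robustness to instruction changes and has clear value as an architectural improvement.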
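Haoran Huang et al. · Proposes a cross-view diffusion policy and a decoupled kinematics method for mobile manipulation, addressing the action-label contamination and inference latency caused by base motion. It provides an efficient engineering solution for mobile VLAs.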
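Zhuohang Li et al. · Proposes a seamless hand-arm intervention mechanism to correct the accumulated errors of VLAs in dexterous manipulation, raising long-horizon task success rates through interactive imitation learning. It offers an effective strategy for improving VLA stability in complex contact tasks.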
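Qian He et al. · Combines frequency-domain optimized chunking with locally anchored flow matching to address incoherent trajectories in visuomotor policies. The method improves the action generation quality of Diffusion Policy and can be integrated directly into existing VLA inference pipelines.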
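Ayush Agarwal et al. · Introduces COBALT, a crowdsourced data collection platform based on smartphone cloud teleoperation, aimed at lowering the cost of acquiring high-quality demonstration data at scale. It provides scalable data collection infrastructure for VLA training.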
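Yixiang Zhu et al. · Targeting the prediction-execution misalignment caused by asynchronous VLA inference, proposes DEFLECT, a counterfactual fine-tuning method based on flow-matching likelihood estimation that strengthens policy robustness to latency. It addresses a key engineering pain point in VLA deployment.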
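Bosun Liang et al. · Proposes an implicit action chunking method to eliminate the high-frequency oscillations common in reinforcement learning, achieving smooth continuous control. The method can serve as a post-processing module or directly replace the action output head of an existing VLA to improve execution stability.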
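Pietro Mazzaglia et al. · Explores hybrid training strategies, such as combining chain-of-thought (CoT) reasoning with direct action generation, to improve VLA performance. It offers concrete training techniques for balancing semantic reasoning with action execution and is of practical value.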
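Yue Feng et al. · Proposes a spatio-temporal optimal transport attention mechanism designed for contact-rich manipulation tasks in visuo-tactile imitation learning. It effectively addresses the challenges of partial observability and discontinuous dynamics and is an important algorithmic complement for tactile VLAs.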
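Senlan Yao et al. · Achieves robust dexterous in-hand manipulation with a proprioceptive Transformer using joint sensor data only. This reduces dependence on visual and tactile sensors and opens a new path to low-cost VLA deployment.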
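Kana Miyamoto et al. · Uses distribution differences in the attention mechanism to filter useful data out of failed demonstrations, improving imitation learning efficiency. The method can be applied directly at the VLA training data cleaning stage to extract value from negative samples.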
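Proposes a 3D visual representation based on structured latent points, aimed at the lack of explicit geometric cues in implicit neural fields. It is a perception-layer improvement and does not directly involve VLA decision architecture or policy learning.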
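Doguhan Yeke et al. · Addresses the "blind obedience" problem that arises when a VLM serves as the high-level planner in an embodied agent, proposing a new benchmark to evaluate its ability to refuse unsafe instructions. It is a contribution to evaluation infrastructure rather than a core algorithmic breakthrough.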
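Xiao-Ming Wu et al. · Provides a detailed recipe and best-practice guide for building strong VLA models, covering data, training, and evaluation. Although it offers no new algorithmic breakthrough, it is a valuable reference for engineering deployment and is survey-like in nature.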
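Andrew Choi et al. · Proposes an offline-to-online RL method based on self-supervised action ranking that improves sample efficiency in large state spaces. Although relevant to VLA fine-tuning, it leans toward general RL algorithms rather than VLA-specific architectural innovation.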
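Puyi Wang et al. · Generates executable program code for indoor scenes, supporting editing and simulation of articulated objects. Its main contribution is scene synthesis and simulation environment construction, which serves VLA training indirectly rather than being direct policy learning.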
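Dillon Z. Chen et al. · Combines a symbolic world model with bilevel policy learning to tackle long-horizon planning. Although it involves planning, it does not clearly show how it would be combined with large VLA models and leans toward traditional hierarchical RL.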
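Xuan Cai et al. · Presents a bio-inspired ionic thermoreceptor hardware device for thermal tactile sensing in robots. It is a novel sensor hardware innovation; although valuable for tactile VLAs, it is not an advance at the algorithm or architecture level.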
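Elle Miller et al. · Releases roto 2.0, the second generation of the robot tactile olympiad benchmark, aimed at standardizing tactile RL research. As an important dataset and benchmark release it is worth knowing about, but it is not a new method ready for immediate use.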
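Yuchen Wang et al. · Proposes WestWorld, a knowledge-encoded scalable trajectory world model that supports multiple robot systems. Its main contribution is the generalization ability of the world model; its concrete utility in VLA closed-loop control needs further validation.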
Can Li et al. · arXiv:2605.09586v2 · World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observ
Ahmet H. Güzel et al. · arXiv:2605.18803v1 · Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adve
Sebastian Stapf et al. · arXiv:2605.18813v1 · World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we sugge
Roman Kniazev et al. · arXiv:2605.18847v1 · Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell
Pu Zhao et al. · arXiv:2605.19242v1 · World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning inp
Emmy Liu et al. · arXiv:2605.19341v1 · Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or re
Jun Ma et al. · arXiv:2605.19371v1 · Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, preserving better color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classic
Zhen Fang et al. · arXiv:2605.08063v4 · Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model commu
Eric Tillmann Bill et al. · arXiv:2605.20316v1 · Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that up
Yifu Luo et al. · arXiv:2510.21583v2 · Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the n
André Eberhard et al. · arXiv:2601.15133v3 · Recent years have seen substantial progress in neural generation of text, images, and audio, supported by mature training pipelines and large-scale optimization. For graphs, however, comparable progress has been more limited. We attribute this gap to graph-specific optimization and representation challenges that undermine the effectiveness of training neural networks with backpropagation and gradient descent. We argue that representing graphs o
Ssharvien Kumar Sivakumar et al. · arXiv:2605.16530v2 · Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism,
Jingxuan Wu et al. · arXiv:2510.09060v2 · Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geome
Harsh Parikh et al. · arXiv:2605.21458v1 · Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes th
Li Ju et al. · arXiv:2601.21662v2 · Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hypers
Rohan Deb et al. · arXiv:2603.22430v2 · Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories duri
Shukai Du et al. · arXiv:2605.15419v2 · Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe tha