VLA Research Daily Pulsar

42 papers in total

⚡ Breakthroughs

🔧 Techniques

VLA [McGill University]

Hybrid Training for Vision-Language-Action Models

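Pietro Mazzaglia et al. · Explores hybrid training strategies (such as combining CoT with direct action generation) to improve VLA performance. Offers concrete training techniques for balancing semantic reasoning against action execution, which gives it practical, hands-on value.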

📖 Background Reading

VLA

roto 2.0: The Robot Tactile Olympiad

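Elle Miller et al. · Releases roto 2.0, the second-generation Robot Tactile Olympiad benchmark, which aims to standardize tactile RL research. As a significant dataset/benchmark release it is worth knowing about, though it is not a new method that can be put to use immediately.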

VLA

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

Can Li et al. · arXiv:2605.09586v2 Announce Type: replace-cross Abstract: World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observ

VLA

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Ahmet H. Güzel et al. · arXiv:2605.18803v1 Announce Type: cross Abstract: Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adve

VLA

Composition of Memory Experts for Diffusion World Models

Sebastian Stapf et al. · arXiv:2605.18813v1 Announce Type: cross Abstract: World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we sugge

VLA

Transformers Linearly Represent Highly Structured World Models

Roman Kniazev et al. · arXiv:2605.18847v1 Announce Type: cross Abstract: Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell

VLA

PhyWorld: Physics-Faithful World Model for Video Generation

Pu Zhao et al. · arXiv:2605.19242v1 Announce Type: cross Abstract: World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning inp

VLA [LAC+USC Medical Center]

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Emmy Liu et al. · arXiv:2605.19341v1 Announce Type: cross Abstract: Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or re

VLA [North Minzu University]

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Jun Ma et al. · arXiv:2605.19371v1 Announce Type: cross Abstract: Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, preserving better color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classic

VLA

Flow-OPD: On-Policy Distillation for Flow Matching Models

Zhen Fang et al. · arXiv:2605.08063v4 Announce Type: replace-cross Abstract: Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model commu

VLA

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation

Eric Tillmann Bill et al. · arXiv:2605.20316v1 Announce Type: new Abstract: Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that up

VLA

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Yifu Luo et al. · arXiv:2510.21583v2 Announce Type: replace Abstract: Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the n

VLA

Building Deep Graph Predictors with Graph Imitation Learning

André Eberhard et al. · arXiv:2601.15133v3 Announce Type: replace Abstract: Recent years have seen substantial progress in neural generation of text, images, and audio, supported by mature training pipelines and large-scale optimization. For graphs, however, comparable progress has been more limited. We attribute this gap to graph-specific optimization and representation challenges that undermine the effectiveness of training neural networks with backpropagation and gradient descent. We argue that representing graphs o

VLA [Government Mohan Kumaramangalam Medical College]

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

Ssharvien Kumar Sivakumar et al. · arXiv:2605.16530v2 Announce Type: replace Abstract: Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism,

VLA [Zhejiang Ocean University]

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

Jingxuan Wu et al. · arXiv:2510.09060v2 Announce Type: replace-cross Abstract: Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geome

VLA

Mind the Sim-to-Real Gap & Think Like a Scientist

Harsh Parikh et al. · arXiv:2605.21458v1 Announce Type: cross Abstract: Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes th

VLA

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Li Ju et al. · arXiv:2601.21662v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hypers

VLA

Inference Time Policy Optimization for Offline RL with Differentiable World Models

Rohan Deb et al. · arXiv:2603.22430v2 Announce Type: replace Abstract: Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories duri

VLA

Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design

Shukai Du et al. · arXiv:2605.15419v2 Announce Type: replace Abstract: Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe tha
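
The opening sentence of the entry above describes flow matching as regression against a target velocity along a prescribed path. As a rough illustration only, the sketch below uses the straight-line path between coupled endpoints that the abstract cites as an existing construction; it is not the paper's Lagrangian method. All function names (`interpolate`, `target_velocity`, `flow_matching_loss`) and the toy data are hypothetical, and a plain Python function stands in for the neural velocity field.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the straight-line path: x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, pairs, times):
    """Mean squared error between velocity_fn(x_t, t) and the target velocity."""
    total = 0.0
    for (x0, x1), t in zip(pairs, times):
        x_t = interpolate(x0, x1, t)
        predicted = velocity_fn(x_t, t)
        target = target_velocity(x0, x1)
        total += sum((p - u) ** 2 for p, u in zip(predicted, target))
    return total / len(pairs)

# Toy coupling (an assumption for illustration): x0 is Gaussian noise and
# x1 is x0 shifted by (2, -1), so the true velocity is the constant (2, -1).
rng = random.Random(0)
pairs = []
for _ in range(8):
    x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    x1 = [x0[0] + 2.0, x0[1] - 1.0]
    pairs.append((x0, x1))
times = [rng.random() for _ in pairs]

def shift_field(x_t, t):
    """Candidate velocity field that matches the toy coupling."""
    return [2.0, -1.0]

def zero_field(x_t, t):
    """Candidate velocity field that ignores the data entirely."""
    return [0.0, 0.0]

loss_exact = flow_matching_loss(shift_field, pairs, times)
loss_zero = flow_matching_loss(zero_field, pairs, times)
print(loss_exact, loss_zero)
```

On this toy coupling the matching field drives the loss to roughly zero, while the zero field leaves the full squared norm of the shift (about 5). In practice the candidate field would be a trained network and the endpoints would come from a noise distribution and a dataset.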