VLA Track · DEEP DIVE ARCHIVE

VLA Deep Tracking

Vision-Language-Action: end-to-end models that let robots see → think → act

METHOD FAMILY TRENDS

data thru 2026-06-10 · 713 papers · 50d window · 15 families
high
▼ 2 declining 15 families · 199 papers covered
FAMILY             7d   14d   30d   Δ7d    Δ14d   Δ30d
Lang. Grounding    60   123   240   1.11x  0.97x  1.64x
World Model        23   41    115   1.16x  0.58x  1.35x
Multi-Task         18   35    77    0.90x  0.78x  0.93x
RL Fine-tuning     17   44    103   0.73x  0.69x  1.24x
Human-Robot        16   35    78    0.80x  0.81x  0.94x
Flow Matching      12   21    77    0.60x  0.40x  0.93x
Long Horizon       12   20    52    0.60x  0.46x  0.63x
Diffusion Policy   11   24    46    0.55x  0.56x  0.55x
Dexterous Hand     8    13    25    0.40x  0.30x  0.30x
Tactile            7    14    34    0.35x  0.32x  0.41x
Mobile Manip.      5    6     12    0.25x  0.14x  0.14x
3D Repr.           4    8     13    0.20x  0.19x  0.16x
Sim-to-Real        4    7     16    0.20x  0.16x  0.19x
Cross-Embodiment   2    4     12    0.10x  0.09x  0.14x
Instr. Tuning      0    0     8     0.00x  0.00x  0.10x
COMPETITION PAIRS 6 matchups
VLA vs WAM
Paradigm war: end-to-end action prediction vs world model planning
Lang. Ground. vs World Model
72% vs 28% · ratio 2.61
Lang. Grounding: 8.4% · x1.11
World Model: 3.2% · x1.16
Lang. Grounding

Language Grounding: connecting natural language instructions to robot actions; vision-language-action alignment

World Model

World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction

Why they compete

The central paradigm war in embodied AI. VLA (Vision-Language-Action) maps observations directly to actions end to end: simple and scalable, but it needs massive data and generalizes poorly. WAM (World-Action Model) first learns how the world works, then plans actions through mental simulation: better generalization and data efficiency, but world models are often inaccurate. The boundary is blurring: Pi0.5 uses flow matching (generative, WAM-like), and GR00T adds video prediction. The winner is likely a hybrid.

ACTION HEAD ROUTE
Continuous action generation: denoising vs optimal transport
Diffusion Pol. vs Flow Matching
48% vs 52% · ratio 0.92
Diffusion Policy: 1.5% · x0.55
Flow Matching: 1.7% · x0.60
Diffusion Policy

Diffusion Policy: iterative denoising process (DDPM) to generate continuous robot actions; strong on multi-modal action distributions

Flow Matching

Flow Matching: optimal-transport-based generative model (e.g. Pi0); faster inference than diffusion with comparable quality

Why they compete

Both generate continuous actions from the same VLA backbone but take different mathematical routes: diffusion iteratively denoises random noise into actions over many steps (slow, expressive), while flow matching learns a velocity field along near-straight, optimal-transport-style paths from noise to actions, so sampling needs fewer steps (fast, efficient). If flow matching matches diffusion quality, it could replace diffusion as the default action head.
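A toy sketch of why a straight noise-to-action path is cheap to sample. It assumes a one-dimensional "action" with a single target value, for which the flow-matching velocity field along the linear path x_t = (1 - t) * x_0 + t * x_1 has the closed form v(x, t) = (target - x) / (1 - t). The names `velocity` and `sample_flow` and the point-mass target are illustrative assumptions, not taken from Pi0 or any VLA codebase; a real action head would learn the velocity field with a network conditioned on observations and language.

```python
import random


def velocity(x: float, t: float, target: float) -> float:
    """Closed-form velocity for a point-mass target (toy assumption, not a learned model)."""
    return (target - x) / (1.0 - t)


def sample_flow(x0: float, target: float, num_steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # the last step evaluates v at t = 1 - h, so 1 - t is never zero
        x = x + h * velocity(x, t, target)
    return x


if __name__ == "__main__":
    random.seed(0)
    noise = random.gauss(0.0, 1.0)  # start from a Gaussian noise sample
    for n in (1, 4, 16):
        # In this toy case the path is a straight line, so every step count
        # lands on the target up to floating-point error.
        print(n, sample_flow(noise, target=0.3, num_steps=n))
```

In this toy setting a single Euler step already reaches the target because the velocity is constant along the path. Learned velocity fields are only approximately straight, so practical flow-matching heads still use several steps, though typically fewer than a DDPM-style denoising loop.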

POST-TRAINING ROUTE
Model adaptation: supervised tuning vs reward optimization
Instr. Tuning vs RL Fine-tune
0% vs 100% · ratio 0.00
Instr. Tuning: 0.0% · x0.00
RL Fine-tuning: 2.4% · x0.73
Instr. Tuning

Instruction Tuning: supervised fine-tuning (SFT) on language-action pairs; simpler but limited to offline data distribution

RL Fine-tuning

RL Fine-tuning: post-training with PPO/GRPO reward signals or DPO preference pairs; enables online improvement beyond demonstration data

Why they compete

After pretraining a VLA, two competing strategies exist: SFT directly imitates expert demonstrations (simple, stable), while RL fine-tuning (for example GRPO on a reward signal, or DPO on preference pairs) optimizes beyond the demonstration distribution. RL can discover novel strategies but is harder to stabilize.
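A minimal sketch of the two training signals, under stated assumptions. `sft_loss` is the behavior-cloning objective (average negative log-likelihood of the expert actions), and `group_relative_advantages` is the group-normalized advantage used in GRPO-style methods, where several rollouts of the same task are scored and each reward is standardized against its group. The function names, the population standard deviation, and the `eps` term are illustrative choices rather than details taken from any specific VLA paper.

```python
from statistics import mean, pstdev
from typing import List


def sft_loss(expert_log_probs: List[float]) -> float:
    """SFT / behavior cloning: average negative log-likelihood of expert actions."""
    return -mean(expert_log_probs)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Log-probabilities the policy assigns to two demonstrated actions.
    print(sft_loss([-0.5, -1.5]))
    # Four rollouts of one task with binary success rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

The SFT signal can only pull the policy toward actions present in the demonstrations, whereas the advantage weights depend on rollout outcomes, so successful behaviors that never appeared in the data can still be reinforced.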

LEARNING SIGNAL
Training paradigm: imagination-based vs reward-based
World Model vs RL Fine-tune
58% vs 42% · ratio 1.36
World Model: 3.2% · x1.16
RL Fine-tuning: 2.4% · x0.73
World Model

World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction

RL Fine-tuning

RL Fine-tuning: post-training with PPO/GRPO reward signals or DPO preference pairs; enables online improvement beyond demonstration data

Why they compete

World models learn by predicting the future (imagination-based planning), while RL learns from reward feedback. If world models become accurate enough, they could reduce the need for expensive real-world RL exploration.

MANIPULATION SENSING
Manipulation approach: tactile feedback vs dexterous control
Tactile vs Dext. Hand
47% vs 53% · ratio 0.88
Tactile: 1.0% · x0.35
Dexterous Hand: 1.1% · x0.40
Tactile

Tactile Sensing: force/torque and GelSight contact sensors; provides direct manipulation feedback for delicate tasks

Dexterous Hand

Dexterous Hand: multi-finger manipulation control; achieves fine-grained object interaction without dedicated tactile sensors

Why they compete

Two approaches to dexterous manipulation: tactile sensing adds explicit touch feedback (hardware cost, rich signal), while dexterous hand control relies on proprioception and vision alone (simpler hardware, harder control). The winner depends on sensor cost-to-performance ratio.

TRANSFER APPROACH
Domain bridging: simulation transfer vs cross-embodiment
Sim2Real vs Cross-Embod.
67% vs 33% · ratio 2.00
Sim-to-Real: 0.6% · x0.20
Cross-Embodiment: 0.3% · x0.10
Sim-to-Real

Sim-to-Real: train in simulation, deploy on real hardware; uses domain randomization to bridge the reality gap

Cross-Embodiment

Cross-Embodiment: transfer policies across different robot morphologies; aims for universal robot foundation models

Why they compete

Sim-to-Real trains one robot in simulation then transfers (cheap data, reality gap risk), while Cross-Embodiment trains across multiple real robots directly (expensive data, natural generalization). The approaches represent different bets on where generalization should happen.
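The percentages and ratios on the matchup cards appear to be derived from the 7d counts in the METHOD FAMILY TRENDS table and the 713-paper total. This is inferred from the numbers, not documented on the page. The sketch below reproduces the Lang. Grounding vs World Model card under that assumption; the function name `matchup` and its field names are hypothetical. One card does not match exactly (World Model vs RL Fine-tune shows ratio 1.36, while 23/17 rounds to 1.35), so the page may round intermediate values differently.

```python
from typing import Dict, Optional


def matchup(left_7d: int, right_7d: int, total_papers: int) -> Dict[str, Optional[float]]:
    """Head-to-head split, count ratio, and share of all papers for two families."""
    pair_total = left_7d + right_7d
    left_pct = round(100 * left_7d / pair_total) if pair_total else 0
    right_pct = 100 - left_pct if pair_total else 0
    # Ratio of 7d counts; undefined when the right-hand family has no papers.
    ratio = round(left_7d / right_7d, 2) if right_7d else None
    return {
        "left_pct": left_pct,
        "right_pct": right_pct,
        "ratio": ratio,
        "left_share": round(100 * left_7d / total_papers, 1),
        "right_share": round(100 * right_7d / total_papers, 1),
    }


if __name__ == "__main__":
    # 7d counts from the trends table: Lang. Grounding = 60, World Model = 23.
    print(matchup(60, 23, total_papers=713))
```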

EMERGING SIGNALS

2026-06-10 · 30/163 unmatched · 7d window
2 signals
TERM                COUNT  AGE  VELOCITY   STATUS     SAMPLE
robot manipulation  11     7d   -- x-2.5   CANDIDATE  Revisiting Articulated Parts Perception in Robot M...
embodied ai         8      7d   -- x5.0    RISING     Rein3D: Reinforced 3D Indoor Scene Generation with...

TOP INSTITUTIONS

30d window · 20 labs tracked · VLA domain
12 active / 30d
#  INSTITUTION                    TOTAL  BEST  LAST SEEN
1  LIBERO Team                    3      🔧    05-29
2  Tsinghua University            13           05-31
3  Peking University              2      🔧    06-03
   NVIDIA                         4      🔧    05-28
   ETH                            4      🔧    06-02
   MIT                            1      🔧    06-09
   Microsoft                      1            05-12
   π                              1      🔧    05-26
   Hutter                         1      🔧    06-02
   Agrawal                        1      🔧    06-09
   NYU                            1      🔧    06-10
   LeCun                          1      🔧    06-10
   CMU                            5            04-17
   Berkeley                       5      🔧    05-01
   Zhejiang University            3      🔧    03-19
   Princeton                      3      🔧    04-22
   DeepMind                       3      🔧    03-19
   Stanford                       2      🔧    04-17
   Shanghai Jiao Tong University  2      📖    03-19
   Chinese Academy of Sciences    2      🔧    03-13

📐 Theory Article Library

231 · View the full library on GitHub
Last 2 weeks · 35
2 days ago · vla core

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

Read on GitHub
2 days ago · vla core

Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement

Read on GitHub
2 days ago · vla core

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Read on GitHub
2 days ago · foundation

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

Read on GitHub
2 days ago · foundation

Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement

Read on GitHub
4 days ago · vla core

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Read on GitHub
5 days ago · foundation

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Read on GitHub
5 days ago · foundation

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Read on GitHub
5 days ago · vla core

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Read on GitHub
6 days ago · foundation

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Read on GitHub
7 days ago · vla core

LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

Read on GitHub
7 days ago · vla core

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

Read on GitHub
7 days ago · foundation

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Read on GitHub
8 days ago · vla core

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Read on GitHub
8 days ago · vla core

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

Read on GitHub
10 days ago · vla core

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Read on GitHub
11 days ago · vla core

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

Read on GitHub
11 days ago · vla core

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Read on GitHub
11 days ago · vla core

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Read on GitHub
12 days ago · vla core

Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

Read on GitHub
12 days ago · vla core

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Read on GitHub
13 days ago · vla core

Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

Read on GitHub
13 days ago · vla core

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Read on GitHub
13 days ago · vla core

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Read on GitHub
14 days ago · vla core

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Read on GitHub
14 days ago · vla core

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in VLA Models

Read on GitHub
14 days ago · vla core

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Read on GitHub
14 days ago · vla core

Static-Dynamic Disentanglement for Efficient Multi-Frame Vision-Language-Action Models

Read on GitHub
More articles · all on GitHub · 196
🏛️ VLA Core  ·  103
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots
LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
WorldKV: Efficient World Memory with World Retrieval and Compression
Noise-Space Attribution and Control of Chunk-Boundary Artifact
DSSP: Diffusion State Space Policy with Full-History Encoding
Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer
Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention
SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism
CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Block-wise Adaptive Caching for Accelerating Diffusion Policy
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
Unify Robot Actions in Camera Frame
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
UniJEPA: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
Test-Time Training for Visual Foresight Vision-Language-Action Models
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
TAIL-Safe: A Task-Agnostic Safety Monitoring Framework for Imitation Learning Policies
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
Continually Evolving Skill Knowledge in Vision Language Action Model
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
Action-to-Action Flow Matching
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
MolmoAct2: Action Reasoning Models for Real-world Deployment
STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction
Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation
VLAs are Confined yet Capable of Generalizing to Novel Instructions
MotuBrain: An Advanced World Action Model for Robot Control
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects
A Pattern Language for Resilient Visual Agents
From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback
Lifting Embodied World Models for Planning and Control
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
RISE: Self-Improving Robot Policy with Compositional World Model
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models
SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
GazeVLA: Learning Human Intention for Robotic Manipulation
How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
Long-Horizon Manipulation via Trace-Conditioned VLA Planning
From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges
PhysMem: Scaling Test-time Physical Memory for Robot Manipulation
Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Long-Term Memory for VLA-based Agents in Open-World Task Execution
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data
Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots
DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
IGen: Scalable Data Generation for Robot Learning from Open-World Images
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
Iterative Compositional Data Generation for Robot Control
HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness
Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Reflection-Based Task Adaptation for Self-Improving VLA
Towards Provable Probabilistic Safety for Scalable Embodied AI Systems
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
UniLACT: Depth-Aware RGB Latent Action Learning for VLA Models
RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Learning Additively Compositional Latent Actions for Embodied AI
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception (EyeVLA)
Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision
🏗️ Foundation & Training  ·  44
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
GeCO: Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Human Data Is Robot Data in Disguise: An In-Depth Interview with Danfei Xu (2026)
A Survey of Latent Space: The "Native Thinking Space" of Language Models and a Unified Interface for Embodied Intelligence
Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
VLA Data Engineering Guide: The Full Pipeline from Collection to Training
StarVLA-α: Reducing Complexity in Vision-Language-Action Systems
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Co-training
Data Processing
Embodied AI Deep Dive: Data Flywheel & Cross-modal Transfer
DCP: Convexity-Checking Rules and CVX/CVXPY Modeling Practice (Disciplined Convex Programming)
DoRA: Weight-Decomposed Low-Rank Adaptation
Evaluation Protocols Deep Dive
Flash Attention: The Key to Efficient Transformer Inference
🏗️ Foundational Theory: ML Toolbox Track Overview
Cost Amortization for Instant LLM Updates: Doc-to-LoRA / Text-to-LoRA Let LLMs "Internalize Instantly"
Knowledge Distillation
Knowledge Insulation: Preventing Catastrophic Forgetting
What We Talk About When We Talk About KV Cache in AI Inference (KV Cache in LLM Inference)
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Core Techniques in the VLA Literature (Literature Technical Review)
Essential Math for VLA: From Intuition to Implementation
Flexible Modular Architecture Table Generator (VLA Modular Pipelines)
NeurIPS 2025 Best Papers: A Reading from the Embodied AI Perspective
VLA Paper Index
Parameter-Efficient Fine-Tuning Theory (PEFT & LoRA)
Quantization Theory
QVLA: Not All Channels Are Equal in VLA Quantization
RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization
RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
Self-Supervised Learning
Shallow-π: Knowledge Distillation for Flow-based VLAs
Transfer Learning
Transformer vs CNN: Core Architecture Comparison
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
🔧 Deployment & Hardware  ·  19
🔧 Deployment & Hardware: Practical Deployment Track Overview
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
CICC Humanoid-Robot Series 05 (Dexterous Hands) → A "Computable Constraints" Framework for VLA/Control/Hardware (theory-side notes)
Dexterous Hand Mechanics Deep Dive (revised and consolidated edition, v2)
How Hard Is It for a Robot to Open a Cola or Deal Cards? Dexterous Hands: Hardware Routes × Contact Math × Data Pyramid (interview excerpts)
EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation
GR-Dexter (ByteDance Seed): A Full-Stack "Hardware-Data-Model" Framework for Extending VLA to High-DoF Dexterous Hands
Grasp Algorithms & Simulation Platforms
House of Dextra: Cross-embodied Co-design for Dexterous Hands
Industry Perspective: Generality and the "Meta-Learning" Path (starting from a roadmap)
Isaac Lab: A GPU-Accelerated Simulation Framework for Multimodal Robot Learning
Lightning Grasp: Procedural Grasp Synthesis with Contact Fields
NVIDIA's AI Five-Layer Cake: An Infrastructure View from Energy to Robot Applications (AI Is a 5-Layer Cake)
Why NVIDIA's First Physical AI Wedge Hits Cars First
Physical Intelligence Layer: A Productization Paradigm for Robot Foundation Model APIs (The Physical Intelligence Layer)
RoboPocket: Improve Robot Policies Instantly with Your Phone
Robot Arm Kinematics, Dynamics & Control
Classification of Robot Dynamical Systems
A Three-Way Taxonomy of Robot "Open-Source Infrastructure": Showcase / Ecosystem Lock-in / Infrastructure (using RoboParty Roboto_Origin as an example)
🌊 Diffusion & Flow  ·  13

🏆 SOTA Rankings

Evo-SOTA full leaderboard · 30
CALVIN ABCD-D · saturated · avg_len
#   Model              Score  vs Prev                  Date        Paper
1   Xiaomi-Robotics-0  4.8    Flower VLA +0.13         2026-06-05  arxiv →
2   Xiaomi-Robotics-0  4.8    Flower VLA +0.13         2026-05-29  arxiv →
3   MMaDA-VLA          4.78   Xiaomi-Robotics-0 +0.03  2026-06-05  arxiv →
4   MMaDA-VLA          4.78   Xiaomi-Robotics-0 +0.03  2026-05-29  arxiv →
5   AVA-VLA            4.65   AnchorRefine +0.25       2026-06-05  arxiv →
6   AVA-VLA            4.65   AnchorRefine +0.25       2026-05-29  arxiv →
7   GR-2               4.64   DFM-VLA +0.20            2026-06-05  arxiv →
8   GR-2               4.64   DFM-VLA +0.20            2026-05-29  arxiv →
9   NS-VLA             4.56   AtomicVLA +0.29          2026-06-05  arxiv →
10  NS-VLA             4.56   AtomicVLA +0.29          2026-05-29  arxiv →
11  Flower VLA         4.35   RoboUniview +0.49        2026-06-05  arxiv →
12  Flower VLA         4.35   RoboUniview +0.49        2026-05-29  arxiv →
13  MCIL               1.82                            2026-06-05  arxiv →
14  MCIL               1.82                            2026-05-29  arxiv →
LIBERO standard-closed · saturated · average
#  Model    Score  vs Prev         Date        Paper
1  LaST-R1  99.8   PriorVLA +0.70  2026-06-05  arxiv →
2  LaST-R1  99.8   PriorVLA +0.70  2026-05-29  arxiv →
3  CORAL    99.3   SRPO +0.10      2026-06-05  arxiv →
4  CORAL    99.3   SRPO +0.10      2026-05-29  arxiv →
5  PLD      99.17  NS-VLA +0.57    2026-06-05  arxiv →
6  PLD      99.17  NS-VLA +0.57    2026-05-29  arxiv →
LIBERO Plus standard-closed · total
#  Model        Score  vs Prev           Date        Paper
1  TAG          87.24  ProGAL-VLA +1.74  2026-06-05  arxiv →
2  ACoT-VLA     86.6   pi0.5 +0.90       2026-06-05  arxiv →
3  CorridorVLA  83.21  NS-VLA +3.81      2026-06-05  arxiv →
MetaWorld non-standard · average
#  Model  Score  vs Prev          Date        Paper
1  MPI    86     iRe-VLA +3.00    2026-06-05  arxiv →
2  pi-RL  85.8   Evo-depth +1.40  2026-06-05  arxiv →
3  ALAM   85                      2026-06-05  arxiv →
RoboCasa-GR1-Tabletop standard-opensource · avg_success_rate
#  Model          Score  vs Prev               Date        Paper
1  DIAL           70.2   FrameSkip +10.70      2026-06-05  arxiv →
2  PhysBrain 1.0  64.5   JoyAI-RA 0.1 +1.30    2026-06-05  arxiv →
RoboChallenge standard-opensource · score
#  Model          Score  vs Prev               Date        Paper
1  DM0            72.25  Giga-Brain-0.1 +3.91  2026-06-05  arxiv →
2  StarVLA-alpha  54.5                         2026-06-05  arxiv →