VLA Deep Tracker
Vision-Language-Action: end-to-end models that let robots see → think → act
METHOD FAMILY TRENDS
high
COMPETITION PAIRS 6 matchups
Language Grounding: connecting natural language instructions to robot actions; vision-language-action alignment
World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction
The central paradigm war in embodied AI. VLA (Vision-Language-Action) maps observations directly to actions end-to-end: it is simple and scalable, but it needs massive data and generalizes poorly. WAM (World-Action Model) first learns how the world works, then plans actions through mental simulation, which gives better generalization and data efficiency, although world models are often inaccurate. The boundary is blurring: Pi0.5 uses flow matching (generative, WAM-like), and GR00T adds video prediction. The winner is likely a hybrid.
Diffusion Policy: iterative denoising process (DDPM) to generate continuous robot actions; strong on multi-modal action distributions
Flow Matching: optimal-transport-based generative model (e.g. Pi0); faster inference than diffusion with comparable quality
Both generate continuous actions from the same VLA backbone but take different mathematical routes: diffusion iteratively denoises random noise into actions (slow, expressive), while flow matching uses optimal transport for a direct trajectory (fast, efficient). If flow matching matches diffusion quality, it could replace it as the default action head.
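The "direct trajectory" idea can be illustrated with a toy sampler. This is a minimal sketch, not any specific model's action head: `oracle_velocity` is a hypothetical stand-in for a trained velocity network, and the single `expert_action` target is an assumption made so the straight-line path is known in closed form.

```python
# Toy flow-matching sampler: integrate a velocity field from noise to an action.
# Assumption: an oracle velocity replaces a trained network v_theta(x, t).


def oracle_velocity(x, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * target.

    On that path the remaining displacement is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always < 1, so the oracle never divides by zero
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.9, -1.4, 0.3]          # stand-in for a Gaussian noise sample
expert_action = [0.1, 0.2, -0.5]  # hypothetical 3-DoF action
one_step = euler_sample(noise, expert_action, num_steps=1)
ten_steps = euler_sample(noise, expert_action, num_steps=10)
```

Because the path here is exactly straight, one Euler step lands on the target just as ten steps do. A learned velocity field over a multi-modal action distribution is only approximately straight, so real systems still use several integration steps.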
Instruction Tuning: supervised fine-tuning (SFT) on language-action pairs; simpler but limited to offline data distribution
RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data
After pretraining a VLA, two competing strategies exist: SFT directly imitates expert demonstrations (simple, stable), while RL fine-tuning (GRPO/DPO) optimizes a reward signal to go beyond the demonstration distribution. RL can discover novel strategies but is harder to stabilize.
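To make the reward-signal side concrete, the sketch below computes GRPO-style group-relative advantages: several rollouts of the same instruction are scored, and each reward is normalized against the group mean and standard deviation. The reward values, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    Rollouts that beat the group average get a positive advantage and are
    reinforced; below-average rollouts get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids 0-division
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical success flags for four rollouts of one instruction.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)
```

No learned value function is needed for this normalization, which is one reason group-relative methods are attractive for large policies. SFT, by contrast, has no notion of advantage: every demonstrated action is imitated with equal weight.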
World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction
RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data
World models learn by predicting the future (imagination-based planning), while RL learns from reward feedback. If world models become accurate enough, they could reduce the need for expensive real-world RL exploration.
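Planning by imagination can be sketched with a random-shooting planner: candidate action sequences are rolled out inside the model, never in the real environment, and the best-scoring one is kept. Everything here is a toy assumption, including the 1-D `world_model`, the distance-based score, and the horizon and candidate counts.

```python
import random


def world_model(state, action):
    """Hypothetical learned dynamics: a 1-D point displaced by the action."""
    return state + action


def imagined_return(state, actions, goal):
    """Roll the model forward and score the final distance to the goal."""
    for a in actions:
        state = world_model(state, a)
    return -abs(goal - state)


def plan(state, goal, horizon=5, candidates=200, seed=0):
    """Random shooting: sample action sequences, keep the best imagined one."""
    rng = random.Random(seed)
    best_actions, best_score = None, float("-inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        score = imagined_return(state, actions, goal)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score


best_actions, best_score = plan(state=0.0, goal=2.0)
```

The planner is only as good as `world_model`: if the learned dynamics are wrong, the imagined scores are wrong too, which is the accuracy caveat raised above.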
Tactile Sensing: force/torque and GelSight contact sensors; provides direct manipulation feedback for delicate tasks
Dexterous Hand: multi-finger manipulation control; achieves fine-grained object interaction without dedicated sensors
Two approaches to dexterous manipulation: tactile sensing adds explicit touch feedback (hardware cost, rich signal), while dexterous hand control relies on proprioception and vision alone (simpler hardware, harder control). The winner depends on sensor cost-to-performance ratio.
Sim-to-Real: train in simulation, deploy on real hardware; uses domain randomization to bridge the reality gap
Cross-Embodiment: transfer policies across different robot morphologies; aims for universal robot foundation models
Sim-to-Real trains one robot in simulation then transfers (cheap data, reality gap risk), while Cross-Embodiment trains across multiple real robots directly (expensive data, natural generalization). The approaches represent different bets on where generalization should happen.
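Domain randomization, the Sim-to-Real side of this pair, amounts to resampling simulator parameters every episode so the policy never overfits to one setting. The sketch below assumes a hypothetical `SimParams` record and made-up ranges; a real pipeline would pass the sampled values to its simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical per-episode simulator settings."""
    friction: float
    mass_kg: float
    light_intensity: float


# Assumed randomization ranges (illustrative values only).
RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
}


def sample_params(rng):
    """Draw one uniformly randomized parameter set."""
    return SimParams(**{k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()})


rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(100)]
```

Wider ranges make it more likely that the real robot falls inside the training distribution, but they also make the learning problem harder.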
EMERGING SIGNALS
2 signals
TOP INSTITUTIONS
12 active / 30d
📐 Theory Article Library
231 articles · View the full library on GitHub →
AEGIS: A Backup Reflex for Physical AI
RhinoVLA Technical Report: a cross-embodiment VLA system for real-time on-device deployment
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training
Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
OMP: One-step MeanFlow Policy with Directional Alignment
SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image
LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World
Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections
Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning
Good Embodied Reward Models Need Bad Behavior Data
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
Dual-Stream Diffusion for World-Model Augmented VLA
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning
Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models
LEIA: Learned Environment for Interactive Architected Materials
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference
World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in VLA Models
LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Static-Dynamic Disentanglement for Efficient Multi-Frame Vision-Language-Action Models
在 GitHub 閱讀在 GitHub 阅读更多文章 · 全部在 GitHub更多文章 · 全部在 GitHub 196 篇篇
🏆 SOTA Leaderboard
Evo-SOTA full leaderboard →
30
CALVIN ABCD-D · saturated · avg_len
| # | Model | Score | vs Prev | Date | Paper |
|---|---|---|---|---|---|
| 1 | Xiaomi-Robotics-0 | 4.8 | Flower VLA +0.13 | 2026-06-05 | arxiv → |
| 2 | Xiaomi-Robotics-0 | 4.8 | Flower VLA +0.13 | 2026-05-29 | arxiv → |
| 3 | MMaDA-VLA | 4.78 | Xiaomi-Robotics-0 +0.03 | 2026-06-05 | arxiv → |
| 4 | MMaDA-VLA | 4.78 | Xiaomi-Robotics-0 +0.03 | 2026-05-29 | arxiv → |
| 5 | AVA-VLA | 4.65 | AnchorRefine +0.25 | 2026-06-05 | arxiv → |
| 6 | AVA-VLA | 4.65 | AnchorRefine +0.25 | 2026-05-29 | arxiv → |
| 7 | GR-2 | 4.64 | DFM-VLA +0.20 | 2026-06-05 | arxiv → |
| 8 | GR-2 | 4.64 | DFM-VLA +0.20 | 2026-05-29 | arxiv → |
| 9 | NS-VLA | 4.56 | AtomicVLA +0.29 | 2026-06-05 | arxiv → |
| 10 | NS-VLA | 4.56 | AtomicVLA +0.29 | 2026-05-29 | arxiv → |
| 11 | Flower VLA | 4.35 | RoboUniview +0.49 | 2026-06-05 | arxiv → |
| 12 | Flower VLA | 4.35 | RoboUniview +0.49 | 2026-05-29 | arxiv → |
| 13 | MCIL | 1.82 | | 2026-06-05 | arxiv → |
| 14 | MCIL | 1.82 | | 2026-05-29 | arxiv → |
LIBERO standard-closed · saturated · average
| # | Model | Score | vs Prev | Date | Paper |
|---|---|---|---|---|---|
| 1 | LaST-R1 | 99.8 | PriorVLA +0.70 | 2026-06-05 | arxiv → |
| 2 | LaST-R1 | 99.8 | PriorVLA +0.70 | 2026-05-29 | arxiv → |
| 3 | CORAL | 99.3 | SRPO +0.10 | 2026-06-05 | arxiv → |
| 4 | CORAL | 99.3 | SRPO +0.10 | 2026-05-29 | arxiv → |
| 5 | PLD | 99.17 | NS-VLA +0.57 | 2026-06-05 | arxiv → |
| 6 | PLD | 99.17 | NS-VLA +0.57 | 2026-05-29 | arxiv → |