VLA Deep Tracking
Vision-Language-Action: end-to-end models that let robots see → think → act
METHOD FAMILY TRENDS
COMPETITION PAIRS · 6 matchups
Language Grounding: connecting natural language instructions to robot actions; vision-language-action alignment
World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction
The central paradigm war in embodied AI. VLA (Vision-Language-Action) maps observations directly to actions end-to-end — simple and scalable, but data-hungry and prone to poor generalization. WAM (World-Action Model) first learns how the world works, then plans actions through mental simulation — better generalization and data efficiency, but world models are often inaccurate. The boundary is blurring: Pi0.5 uses flow matching (generative, WAM-like), GR00T adds video prediction. The likely winner is a hybrid.
Diffusion Policy: iterative denoising process (DDPM) to generate continuous robot actions; strong on multi-modal action distributions
Flow Matching: optimal-transport-based generative model (e.g. Pi0); faster inference than diffusion with comparable quality
Both generate continuous actions from the same VLA backbone but take different mathematical routes: diffusion iteratively denoises random noise into actions (slow, expressive), while flow matching uses optimal transport for a direct trajectory (fast, efficient). If flow matching matches diffusion quality, it could replace it as the default action head.
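The two routes can be contrasted in a toy 1-D sketch. Nothing here comes from Pi0 or any real action head: `velocity`, `diffusion_sample`, and the schedule constants are illustrative stand-ins for learned networks, all pulling a sample toward a fixed target action.

```python
import numpy as np

# Toy contrast between the two action heads. `velocity` and the
# denoising step below are hypothetical stand-ins for pretrained
# networks, pulling samples toward a target action a* = 0.7.
TARGET = 0.7

def velocity(a, t):
    # Flow matching: learned velocity field v(a, t); with linear
    # (optimal-transport) paths the exact field is (a* - a) / (1 - t).
    return (TARGET - a) / max(1.0 - t, 1e-3)

def flow_matching_sample(steps=10, seed=0):
    # Euler-integrate da/dt = v(a, t) from noise (t=0) to action (t=1).
    rng = np.random.default_rng(seed)
    a, dt = rng.normal(), 1.0 / steps
    for i in range(steps):
        a += velocity(a, i * dt) * dt
    return a

def diffusion_sample(steps=50, seed=0):
    # DDPM-style loop: repeatedly move toward the denoiser's estimate
    # and re-inject shrinking noise (schedule simplified for brevity).
    rng = np.random.default_rng(seed)
    a = rng.normal()
    for k in range(steps, 0, -1):
        a += 0.2 * (TARGET - a)                # stand-in denoising step
        a += 0.1 * (k / steps) * rng.normal()  # noise re-injection
    return a
```

With straight optimal-transport paths, the flow sampler reaches the target in far fewer network evaluations than the diffusion loop, which is the inference-speed argument made above.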
Instruction Tuning: supervised fine-tuning (SFT) on language-action pairs; simpler but limited to offline data distribution
RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data
After pretraining a VLA, two competing strategies exist: SFT directly imitates expert demonstrations (simple, stable), while RL fine-tuning (GRPO/DPO) optimizes a reward signal to go beyond the demonstration distribution. RL can discover novel strategies but is harder to stabilize.
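The contrast can be sketched on a scalar toy policy. The reward function, group size, and update names are illustrative assumptions, not taken from any GRPO/DPO paper:

```python
import numpy as np

# Toy contrast: SFT regresses onto demos; a GRPO-style step samples a
# group of actions and pushes the policy toward above-average rewards.

def sft_update(theta, demos, lr=0.1):
    # SFT / behavior cloning: gradient of mean squared error to demos.
    grad = np.mean([2 * (theta - a) for a in demos])
    return theta - lr * grad

def grpo_style_update(theta, reward, lr=0.1, group=8, sigma=0.3, seed=0):
    # Sample a group of actions, normalize rewards within the group
    # (group-relative advantage), then take a REINFORCE-style ascent
    # step for a fixed-variance Gaussian policy.
    rng = np.random.default_rng(seed)
    actions = theta + rng.normal(scale=sigma, size=group)
    rewards = np.array([reward(a) for a in actions])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = np.mean(adv * (actions - theta)) / sigma**2
    return theta + lr * grad
```

With demos centered at 0.5 but reward peaking at 1.0, iterated SFT converges to 0.5 while the RL updates drift past it toward 1.0 — the "beyond the demonstration distribution" point above.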
World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction
RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data
World models learn by predicting the future (imagination-based planning), while RL learns from reward feedback. If world models become accurate enough, they could reduce the need for expensive real-world RL exploration.
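Imagination-based planning reduces to a simple loop: roll the learned model forward under candidate action sequences and pick the best. A minimal random-shooting sketch, where `dynamics` is a toy point-robot stand-in for a learned world model:

```python
import numpy as np

# Minimal sketch of planning by imagination. `dynamics` is a toy
# stand-in (a point robot on a line); a real system would call the
# learned world model's predicted next state instead.

def dynamics(state, action):
    # Imagined one-step transition: move by the (clipped) action.
    return state + np.clip(action, -1.0, 1.0)

def imagined_return(state, actions, goal):
    # Roll the model forward in imagination and score the trajectory.
    total = 0.0
    for a in actions:
        state = dynamics(state, a)
        total -= abs(goal - state)  # reward: negative distance to goal
    return total

def plan_random_shooting(state, goal, horizon=5, candidates=256, seed=0):
    # Sample candidate action sequences, evaluate each purely in the
    # model's imagination, execute the first action of the best one.
    rng = np.random.default_rng(seed)
    seqs = rng.uniform(-1, 1, size=(candidates, horizon))
    scores = [imagined_return(state, seq, goal) for seq in seqs]
    return seqs[int(np.argmax(scores))][0]
```

Every candidate is evaluated inside the model, so no real-world interaction is spent on exploration — the trade mentioned above, contingent on the model being accurate.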
Tactile Sensing: force/torque and GelSight contact sensors; provides direct manipulation feedback for delicate tasks
Dexterous Hand: multi-finger manipulation control; achieves fine-grained object interaction without dedicated sensors
Two approaches to dexterous manipulation: tactile sensing adds explicit touch feedback (hardware cost, rich signal), while dexterous hand control relies on proprioception and vision alone (simpler hardware, harder control). The winner depends on sensor cost-to-performance ratio.
Sim-to-Real: train in simulation, deploy on real hardware; uses domain randomization to bridge the reality gap
Cross-Embodiment: transfer policies across different robot morphologies; aims for universal robot foundation models
Sim-to-Real trains one robot in simulation then transfers (cheap data, reality gap risk), while Cross-Embodiment trains across multiple real robots directly (expensive data, natural generalization). The approaches represent different bets on where generalization should happen.
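Domain randomization, the bridge mentioned in the Sim-to-Real entry, amounts to resampling the simulator's physics and rendering parameters every episode so the real world lands inside the training distribution. A minimal sketch with illustrative parameter names and ranges (not from any benchmark):

```python
import random

# Sketch of domain randomization, assuming a simulator whose physics
# can be reconfigured per episode. Ranges below are illustrative.
RANDOMIZATION_RANGES = {
    "friction":      (0.5, 1.5),    # tabletop friction coefficient
    "object_mass":   (0.05, 0.5),   # kg
    "motor_latency": (0.00, 0.04),  # seconds of actuation delay
    "light_gain":    (0.6, 1.4),    # rendering brightness multiplier
}

def sample_sim_params(rng):
    # Draw one physics/rendering configuration for the next episode.
    return {k: rng.uniform(lo, hi)
            for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

def randomized_episodes(n, seed=0):
    # A policy trained across many such draws must succeed under all
    # of them, so the real world becomes "just another sample".
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(n)]
```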
EMERGING SIGNALS · 3 signals
TOP INSTITUTIONS · 20 active in the last 30 days
📐 Theory Article Library
204 articles · Browse the full library on GitHub →
Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control
World-Model-Assisted VLA Post-Training: Research Progress and Problem Breakdown (2026)
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Human Data Is Robot Data in Disguise: An In-Depth Interview with Danfei Xu (2026)
Long-Term Memory for VLA-based Agents in Open-World Task Execution
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data
Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
A Survey of Latent Spaces: The Language Model's "Native Thinking Space" as a Unified Interface for Embodied AI
A VLA Data Engineering Guide: The Complete Pipeline from Collection to Training
HiST-AT: A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning
GR00T-N1.7: NVIDIA's Open-Source Generalist Robot Foundation Model, from Humanoids to Any Embodiment
LingBot-VLA: A Pragmatist VLA Pretrained on 20,000 Hours of Real-World Data
A Guide to Fully Open-Source VLAs: Who Is Truly Open, and Who Is "Open-Washing"
Multi-Modal Manipulation via Multi-Modal Policy Consensus
cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots
DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
IGen: Scalable Data Generation for Robot Learning from Open-World Images
3D First: The Next Representation Revolution in Embodied AI, Seen Through VGA and Spark 2.0
VGA: Robot Manipulation Is Vision-to-Geometry Mapping, Not Vision-to-Language-to-Action
π0.7: A Steerable Generalist Robot Foundation Model with Emergent Compositional Generalization
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
Iterative Compositional Data Generation for Robot Control
HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models
Spark 2.0: Fei-Fei Li's World Labs Open-Sources a 3DGS Web Rendering Engine That Loads 100-Million-Point Clouds Instantly on Mobile
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
StarVLA-α: Reducing Complexity in Vision-Language-Action Systems
Déjà Vu: Towards Experience Feedback Learning for Embodied Intelligence
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness
Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Reflection-Based Task Adaptation for Self-Improving VLA
Towards Provable Probabilistic Safety for Scalable Embodied AI Systems
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
UniLACT: Depth-Aware RGB Latent Action Learning for VLA Models
RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks
Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
More articles · 154 more, all on GitHub →
🏆 SOTA Leaderboard
Evo-SOTA full leaderboard →
CALVIN ABCD-D (saturated) · metric: avg_len
| # | Model | Score | vs Prev | Date | Paper |
|---|---|---|---|---|---|
| 1 | Xiaomi-Robotics-0 | 4.80 | Flower VLA +0.13 | 2026-04-24 | arXiv → |
| 2 | MMaDA-VLA | 4.78 | Xiaomi-Robotics-0 +0.03 | 2026-04-24 | arXiv → |
| 3 | AVA-VLA | 4.65 | TriVLA +0.28 | 2026-04-24 | arXiv → |
| 4 | GR-2 | 4.64 | DFM-VLA +0.20 | 2026-04-24 | arXiv → |
| 5 | NS-VLA | 4.56 | AtomicVLA +0.29 | 2026-04-24 | arXiv → |
| 6 | Flower VLA | 4.35 | RoboUniview +0.49 | 2026-04-24 | arXiv → |
| 7 | MCIL | 1.82 | | 2026-04-24 | arXiv → |