VLA LINE · DEEP DIVE ARCHIVE

VLA Deep Dive Tracker

Vision-Language-Action: end-to-end models that let robots see → think → act

METHOD FAMILY TRENDS

data through 2026-04-24 · 681 papers · 44-day window · 15 families
▼ 2 declining · 15 families · 203 papers covered
FAMILY 7d 14d 30d Δ7d Δ14d Δ30d
Lang. Grounding 40 78 176 1.05x 0.86x 1.26x
World Model 22 51 105 0.76x 1.04x 1.29x
Flow Matching 20 37 80 0.99x 0.75x 0.99x
Long Horizon 20 30 61 0.99x 0.74x 0.75x
Multi-Task 20 38 57 0.99x 0.94x 0.70x
RL Fine-tuning 20 42 109 0.91x 0.67x 0.91x
Tactile 14 21 37 0.69x 0.52x 0.46x
Human-Robot 13 26 63 0.64x 0.64x 0.78x
Diffusion Policy 10 13 33 0.49x 0.32x 0.41x
Cross-Embodiment 7 10 17 0.34x 0.25x 0.21x
Dexterous Hand 7 14 30 0.34x 0.34x 0.37x
Sim-to-Real 4 7 17 0.20x 0.17x 0.21x
Mobile Manip. 3 10 18 0.15x 0.25x 0.22x
3D Repr. 2 4 18 0.10x 0.10x 0.22x
Instr. Tuning 1 1 7 0.05x 0.02x 0.09x
COMPETITION PAIRS · 6 matchups
VLA vs WAM
Paradigm war: end-to-end action prediction vs world model planning
Lang. Grounding 65% (5.9% share · x1.05) vs World Model 35% (3.2% share · x0.76) · ratio 1.82
Lang. Grounding

Language Grounding: connecting natural language instructions to robot actions; vision-language-action alignment

World Model

World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction

Why they compete

The central paradigm war in embodied AI. VLA (Vision-Language-Action) maps observations directly to actions end-to-end — simple and scalable, but it needs massive data and generalizes poorly. WAM (World-Action Model) first learns how the world works, then plans actions through mental simulation — better generalization and data efficiency, but learned world models are often inaccurate. The boundary is blurring: Pi0.5 uses flow matching (generative, WAM-like), and GR00T adds video prediction. The likely winner is a hybrid of the two.
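The contrast above can be sketched in a few lines. This is a toy illustration, not either architecture: `vla_policy` stands in for an end-to-end network, `world_model` for a learned simulator, and the goal and dynamics are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([0.8, -0.2])  # hypothetical target gripper position

# --- VLA route: a policy maps the observation directly to an action ---
def vla_policy(obs):
    # Stand-in for an end-to-end network: the "move toward the goal"
    # behavior is baked into the (here, trivial) weights.
    return GOAL - obs

# --- WAM route: a learned dynamics model plus planning by imagination ---
def world_model(obs, action):
    # Stand-in for a learned simulator: next state = state + action.
    return obs + action

def wam_plan(obs, n_candidates=256):
    # Sample candidate actions, score each by simulating its outcome
    # in the world model, and pick the best one.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, 2))
    imagined = world_model(obs, candidates)            # imagined next states
    scores = -np.linalg.norm(imagined - GOAL, axis=1)  # closer is better
    return candidates[np.argmax(scores)]

obs = np.array([0.0, 0.0])
a_vla = vla_policy(obs)  # one forward pass, no simulation
a_wam = wam_plan(obs)    # many imagined futures, then a choice
```

The VLA route is a single forward pass; the WAM route spends compute on imagined rollouts but never needed the goal baked into a policy.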

ACTION HEAD ROUTE
Continuous action generation: denoising vs optimal transport
Diffusion Policy 33% (1.5% share · x0.49) vs Flow Matching 67% (2.9% share · x0.99) · ratio 0.50
Diffusion Policy

Diffusion Policy: iterative denoising process (DDPM) to generate continuous robot actions; strong on multi-modal action distributions

Flow Matching

Flow Matching: optimal-transport-based generative model (e.g. Pi0); faster inference than diffusion with comparable quality

Why they compete

Both generate continuous actions from the same VLA backbone but take different mathematical routes: diffusion iteratively denoises random noise into actions (slow, expressive), while flow matching follows an optimal-transport velocity field along a direct trajectory (fast, efficient). If flow matching reaches diffusion-level quality, it could displace diffusion as the default action head.
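The two routes can be compared on a deliberately trivial case: a dataset containing a single expert action (0.7), for which the ideal denoiser and velocity field are known in closed form. Both samplers below are oracle sketches, not trained models: `ddim_sample` runs a deterministic DDIM-style reverse process, while `flow_sample` integrates the straight-line (optimal-transport) velocity field with a handful of Euler steps.

```python
import math

TARGET = 0.7  # the single "expert action" in our toy dataset

# --- Diffusion route: deterministic DDIM reverse process, oracle denoiser ---
def ddim_sample(x, n_steps=50):
    # Linear alpha-bar schedule: abar(n_steps)=0 (pure noise), abar(0)=1 (data).
    abar = lambda t: 1.0 - t / n_steps
    for t in range(n_steps, 0, -1):
        a_t, a_prev = abar(t), abar(t - 1)
        # Oracle noise prediction (a trained eps-network would output this).
        eps = (x - math.sqrt(a_t) * TARGET) / math.sqrt(1.0 - a_t)
        # Deterministic DDIM step toward the data manifold.
        x = math.sqrt(a_prev) * TARGET + math.sqrt(1.0 - a_prev) * eps
    return x

# --- Flow-matching route: Euler integration of a straight-line ODE ---
def flow_sample(x, n_steps=8):
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (TARGET - x) / (1.0 - t)  # oracle velocity for the OT path
        x += v * dt
    return x
```

Here both recover the expert action exactly because the "networks" are oracles; the practical point is that flow matching's straight paths are what make few-step inference viable.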

POST-TRAINING ROUTE
Model adaptation: supervised tuning vs reward optimization
Instr. Tuning 5% (0.1% share · x0.05) vs RL Fine-tuning 95% (2.9% share · x0.91) · ratio 0.05
Instr. Tuning

Instruction Tuning: supervised fine-tuning (SFT) on language-action pairs; simpler but limited to offline data distribution

RL Fine-tuning

RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data

Why they compete

After pretraining a VLA, two competing strategies exist: SFT directly imitates expert demonstrations (simple, stable), while RL fine-tuning (GRPO/DPO) optimizes a reward signal to go beyond the demonstration distribution. RL can discover novel strategies but is harder to stabilize.
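A minimal numeric sketch of that difference, with made-up numbers: demonstrations cluster around action 0.3, but a hypothetical task reward peaks at 0.9. SFT can only regress onto the demos; a reward-gradient update (a zero-variance stand-in for PPO/GRPO-style optimization) moves past them.

```python
DEMOS = [0.25, 0.30, 0.35]   # offline expert data (suboptimal for this task)
OPT = 0.9                    # where the (hypothetical) reward actually peaks

def reward(a):
    return -(a - OPT) ** 2

def sft(mu, lr=0.1, steps=200):
    # Supervised fine-tuning: gradient descent of the policy mean
    # onto the demonstration distribution (imitation loss).
    for _ in range(steps):
        grad = sum(2 * (mu - d) for d in DEMOS) / len(DEMOS)
        mu -= lr * grad
    return mu

def rl_finetune(mu, lr=0.1, steps=200):
    # Reward-gradient fine-tuning: ascend d/dmu E[reward] = -2 (mu - OPT).
    # A real system estimates this gradient from sampled rollouts.
    for _ in range(steps):
        grad = -2 * (mu - OPT)
        mu += lr * grad
    return mu
```

SFT converges to the demo mean (0.3); RL fine-tuning started from that SFT solution climbs to 0.9, beyond anything in the demonstration data.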

LEARNING SIGNAL
Training paradigm: imagination-based vs reward-based
World Model 52% (3.2% share · x0.76) vs RL Fine-tuning 48% (2.9% share · x0.91) · ratio 1.10
World Model

World Model: learned environment simulator (Dreamer, UniSim); enables planning via imagination without real-world interaction

RL Fine-tuning

RL Fine-tuning: post-training with PPO/DPO/GRPO reward signals; enables online improvement beyond demonstration data

Why they compete

World models learn by predicting the future (imagination-based planning), while RL learns from reward feedback. If world models become accurate enough, they could reduce the need for expensive real-world RL exploration.
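A toy sketch of the data-efficiency argument, under an assumed linear system: fit a one-step dynamics model from three real transitions, then choose actions entirely by querying the model ("imagination") rather than the real environment. The `s' = s + 2a` dynamics and all numbers are invented for illustration.

```python
# A handful of real transitions (s, a, s') from an unknown system s' = s + 2a.
real_transitions = [(0.0, 0.1, 0.2), (0.5, -0.2, 0.1), (1.0, 0.3, 1.6)]

# Fit the world model s' = s + k*a by least squares on (s' - s) = k*a.
num = sum((sp - s) * a for s, a, sp in real_transitions)
den = sum(a * a for s, a, sp in real_transitions)
k_hat = num / den  # learned dynamics coefficient (true value: 2.0)

def imagine(s, a):
    # The learned world model: predicts the next state, no real env needed.
    return s + k_hat * a

# Plan by imagination: pick the action whose imagined outcome lands closest
# to the goal, using zero additional real-world interaction.
GOAL, s0 = 1.0, 0.2
candidates = [i / 100 - 1.0 for i in range(201)]  # action grid over [-1, 1]
best = min(candidates, key=lambda a: abs(imagine(s0, a) - GOAL))
```

Three real samples were enough to recover the dynamics; every candidate evaluation after that is free imagination, which is exactly the exploration cost a world model is meant to absorb.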

MANIPULATION SENSING
Manipulation approach: tactile feedback vs dexterous control
Tactile 67% (2.1% share · x0.69) vs Dexterous Hand 33% (1.0% share · x0.34) · ratio 2.00
Tactile

Tactile Sensing: force/torque and GelSight contact sensors; provides direct manipulation feedback for delicate tasks

Dexterous Hand

Dexterous Hand: multi-finger manipulation control; achieves fine-grained object interaction without dedicated sensors

Why they compete

Two approaches to dexterous manipulation: tactile sensing adds explicit touch feedback (hardware cost, rich signal), while dexterous hand control relies on proprioception and vision alone (simpler hardware, harder control). The winner depends on sensor cost-to-performance ratio.

TRANSFER APPROACH
Domain bridging: simulation transfer vs cross-embodiment
Sim-to-Real 36% (0.6% share · x0.20) vs Cross-Embodiment 64% (1.0% share · x0.34) · ratio 0.57
Sim-to-Real

Sim-to-Real: train in simulation, deploy on real hardware; uses domain randomization to bridge the reality gap

Cross-Embodiment

Cross-Embodiment: transfer policies across different robot morphologies; aims for universal robot foundation models

Why they compete

Sim-to-Real trains one robot in simulation then transfers (cheap data, reality gap risk), while Cross-Embodiment trains across multiple real robots directly (expensive data, natural generalization). The approaches represent different bets on where generalization should happen.
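The Sim-to-Real bet can be sketched with domain randomization on a one-dimensional toy plant (all numbers invented): a controller gain is tuned against randomized simulated dynamics, then deployed on a "real" gain it never saw during training.

```python
import random
random.seed(0)

G_REAL = 1.2                      # unknown "real" plant gain
def step(s, a, g): return s + g * a  # shared one-step dynamics

# Domain randomization: pick a feedback gain k for the controller
# a = k * (goal - s) that works on average over randomized sim gains.
sim_gains = [random.uniform(0.5, 1.5) for _ in range(1000)]
# One-step tracking error scales with |1 - g*k|, so minimizing
# E[(1 - g*k)^2] over the randomized g gives k = E[g] / E[g^2].
k = sum(sim_gains) / sum(g * g for g in sim_gains)

# Deploy on the "real" plant: a gain never sampled during training.
s, goal = 0.0, 1.0
for _ in range(20):
    s = step(s, k * (goal - s), G_REAL)
real_error = abs(goal - s)
```

Because the gain was optimized for the whole randomized range rather than any single simulator, the residual error contracts on the unseen real gain too — the core mechanism behind crossing the reality gap.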

EMERGING SIGNALS

2026-04-24 · 36/171 unmatched · 7d window
3 signals
TERM COUNT AGE VELOCITY STATUS SAMPLE
embodied ai · 10 · 23d · ++ x1.0 · CANDIDATE · GaLa: Hypergraph-Guided Visual Language Models for...
robot manipulation · 6 · 8d · + x0.0 · CANDIDATE · Bimanual Robot Manipulation via Multi-Agent In-Con...
high fidelity · 5 · 8d · ~ x-0.5 · CANDIDATE · FLASH: Fast Learning via GPU-Accelerated Simulatio...

TOP INSTITUTIONS

30d window · 20 labs tracked · VLA domain
20 active / 30d
INSTITUTION TOTAL BEST LAST SEEN ACTIVITY
1 CMU 10 04-17
2 Berkeley 6 🔧 04-17
3 NVIDIA 4 🔧 03-27
Physical Intelligence 4 04-10
Tongji 2 🔧 03-26
HKUST 2 📖 03-26
Ryoo 2 🔧 04-05
Tsinghua 15 04-01
Stanford 4 🔧 04-17
UCSD 3 📖 03-26
Princeton 3 🔧 04-22
USTC 2 🔧 04-01
PKU 2 🔧 04-01
Wisconsin 1 📖 03-26
CUHK 1 📖 03-26
Colorado 1 📖 03-26
Roma 1 📖 03-26
NJU 1 📖 03-26
Buffalo 1 📖 03-26
Wayne 1 📖 03-26

📐 Theory Article Library

204 · view the full library on GitHub
Last 2 weeks · 50
Yesterday · vla core — Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

2 days ago · vla core — World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

2 days ago · vla core — DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

3 days ago · vla core — Long-Term Memory for VLA-based Agents in Open-World Task Execution

3 days ago · vla core — From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

4 days ago · vla core — A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

4 days ago · vla core — Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data

4 days ago · vla core — Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion

4 days ago · planning — HiST-AT: A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning

5 days ago · vla core — cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots

6 days ago · vla core — DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

7 days ago · vla core — HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

7 days ago · vla core — X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

7 days ago · vla core — Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

8 days ago · vla core — IGen: Scalable Data Generation for Robot Learning from Open-World Images

8 days ago · vla core — BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

9 days ago · vla core — HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

9 days ago · vla core — StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

9 days ago · foundation — StarVLA-α: Reducing Complexity in Vision-Language-Action Systems

10 days ago · vla core — Dejavu: Towards Experience Feedback Learning for Embodied Intelligence

10 days ago · vla core — You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

10 days ago · vla core — 2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

10 days ago · vla core — Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

11 days ago · vla core — Towards Provable Probabilistic Safety for Scalable Embodied AI Systems

11 days ago · foundation — HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

12 days ago · vla core — DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

12 days ago · vla core — Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

13 days ago · vla core — HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

14 days ago · tactile — TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks

14 days ago · vla core — Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
More articles · all on GitHub · 154
🏗️ Foundation & Training  ·  30
Co-training · Data Processing · Embodied Intelligence Deep Dive: Data Flywheel & Cross-modal Transfer · DCP: Disciplined Convex Programming — Convexity Rules and CVX/CVXPY Modeling · DoRA: Weight-Decomposed Low-Rank Adaptation · Evaluation Protocols Deep Dive · Flash Attention: The Key to Efficient Transformer Inference · 🏗️ Foundations — ML Toolbox Mainline Overview · Cost Amortization for Instant LLM Updates: Doc-to-LoRA / Text-to-LoRA · Knowledge Distillation · Knowledge Insulation: Preventing Catastrophic Forgetting · What Do We Talk About When We Talk About the KV Cache in LLM Inference? · Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment · VLA Literature Technical Review · VLA Math Essentials: From Intuition to Implementation · Elastic Modular Architecture Table Generator (VLA Modular Pipelines) · NeurIPS 2025 Best Papers: An Embodied-Intelligence Reading · VLA Paper Index · PEFT & LoRA Theory · Quantization Theory · QVLA: Not All Channels Are Equal in VLA Quantization · RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization · RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation · Self-Supervised Learning · Shallow-π: Knowledge Distillation for Flow-based VLAs · Transfer Learning · Transformer vs CNN: Core Architecture Comparison · VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents · VLA Loss Functions Handbook (Practical Training Objectives for VLM-Robot Policies) · VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
🧠 Planning & Reasoning  ·  26
Safety & Alignment for VLA · BEHAVIOR-1K: Why It Is Not "a Benchmark with More Tasks" but a More Realistic Demand on Generalist Robots (A Human-Centered Embodied AI Benchmark with OmniGibson) · Task Adaptation of Vision-Language-Action Model: 1st Place Solution for the 2025 BEHAVIOR Challenge — When the Benchmark Is Hard Enough, What Does a VLA Ultimately Win With? · Benchmark Mainline Overview: From Task Worlds to Safety Constraints to World-Model Evaluators · BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments · Chain-of-Thought Reasoning · SAFE-Dict: Inference-Time Safety Control for VLA via Concept Dictionary Learning · DAC-RL: Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability · Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols · Embodied Chain-of-Thought: Letting VLAs "Think Before Acting" (Robotic Control via Embodied Chain-of-Thought Reasoning, 2024) · ENACT: Not Another Benchmark, but a Probe of Whether VLMs Have "Embodied Cognition" (Evaluating Embodied Cognition with World Modeling of Egocentric Interaction) · AgiBot: ERIQ + FACT + GenieReasoner — Quantifying the Reasoning-to-Action Transfer Loss · How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference · IAIL: Cross-robot Behavior Adaptation through Intention Alignment · IS-Bench: Measuring Not "Is It Safe" but "Does It Make Things Dangerous During Interaction" (Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks) · Motion Planning · RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks · SOMA: Strategic Orchestration and Memory-Augmented System for VLA Robustness via In-Context Adaptation · Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models · Thinker: A Vision-Language Foundation Model for Embodied Intelligence · TinyLoRA: Learning to Reason in 13 Parameters · Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation · VLA: 10 Open Challenges · Intrinsic VLA Safety: From Gradient Masking to Physical "Lobotomy" (SGTM) · VLM Promptable Representations: Injecting Common Sense into RL via Promptable Representations (PR2L, 2024/2025) · When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
🏛️ VLA Core  ·  21
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation · On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning · GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation · ABot-M0: VLA Foundation Model with Action Manifold Learning · ACT: Action Chunking with Transformers · Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models · FAST: Efficient Action Tokenization · Figure Helix 02: A Unified Locomotion-Manipulation Architecture for Full-Body End-to-End VLA (Helix 02: Full-Body Autonomy) · FocusVLA: Focused Visual Utilization for Vision-Language-Action Models · Galaxea G0: A Dual-System VLA Framework · Dissecting GR00T-N1.6 · InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation · LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models · LingBot-VLA: A Pragmatic VLA Foundation Model with a High-Throughput Training Stack · Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress · Learning Additively Compositional Latent Actions for Embodied AI · PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation · Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies · Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA · Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception (EyeVLA) · Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision
🔧 Deployment & Hardware  ·  19
🔧 Deployment & Hardware — Practical Mainline Overview · DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping · CICC Humanoid Series 05 (Dexterous Hands) → A "Computable Constraints" Framework for VLA / Control / Hardware (Theory-Side Notes) · Dexterous Hand Mechanics — Revised & Consolidated v2 · How Hard Is It for a Robot to Open a Coke or Deal Cards? Dexterous Hands: Hardware Routes × Contact Math × the Data Pyramid (Interview Digest) · EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation · GR-Dexter (ByteDance Seed): A Full-Stack Hardware-Data-Model Framework for Scaling VLA to High-DoF Dexterous Hands · Grasp Algorithms & Simulation Platforms · House of Dextra: Cross-embodied Co-design for Dexterous Hands · An Industry View: Generality and the "Meta-Learning" Path (Starting from a Roadmap) · Isaac Lab: A GPU-Accelerated Multimodal Simulation Framework for Robot Learning · Lightning Grasp: Procedural Grasp Synthesis with Contact Fields · NVIDIA's AI Is a 5-Layer Cake: An Infrastructure View from Energy to Robot Applications · Why NVIDIA's First Physical AI Wedge Hits Cars First · The Physical Intelligence Layer: Productizing Robot Foundation Model APIs · RoboPocket: Improve Robot Policies Instantly with Your Phone · Robot Arm Kinematics, Dynamics & Control · Classification of Robot Dynamical Systems · A Three-Way Split of Robot "Open-Source Infrastructure": Showcase / Ecosystem Lock-in / Infrastructure (RoboParty Roboto_Origin as a Case Study)
🎮 Reinforcement Learning  ·  16
CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning · Evo-RL: Turning π*0.6 / RECAP Real-Robot RL into Reproducible Engineering on Low-Cost Arms (Evo-RL for Open Real-World RL on SO101 and Beyond) · Dissecting GR-RL · π*0.6 / RECAP: Supervised Learning in RL Clothing? From "Supervised" Offline RL to a New VLA Post-training Paradigm · π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs · Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning · PGR: Replacing PER's Rare-Sample Overfitting with Conditional-Diffusion "Generative Replay" (Prioritized Generative Replay, ICLR 2025) · Reinforcement Learning · Discovery of Reward Function for Embodied RL · 🎮 Reinforcement Learning — VLA Post-training Mainline Overview · RLinf: RL Training Infrastructure for Embodied / Agentic AI (and What It Means for VLA+RL) · Scaling Verification Can Be More Effective than Scaling Policy Learning for VLA Alignment · U2O RL: Replacing "Task-Reward Offline Pretraining" with Unsupervised Offline Skill Pretraining (Unsupervised-to-Online Reinforcement Learning, 2024) · VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation · Practical VLA+RL Guide: Architectures, Algorithms, and Toolchains · VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents
👁️ Perception & 3D  ·  15
ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals · DKT: Diffusion Knows Transparency — Transparent-Object Perception from Video Diffusion Priors · DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale · EgoDemoGen: Egocentric Demonstration Generation for Viewpoint Generalization in Robotic Manipulation · Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching · How Language Shapes Vision: From "Bananas Are Yellow" to Engineering Lessons for VLA · Multimodal Models · PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation · 👁️ Visual Perception — 3D Understanding Mainline Overview · Visual & Multimodal Perception Techniques · Point Cloud Intelligence & SLAM · Spatial Intelligence & Coordinate Systems · State Estimation & Sensor Fusion · WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation · Zero-1-to-3: Zero-shot One Image to 3D Object
🌊 Diffusion & Flow  ·  13
🤚 Tactile Perception  ·  13
Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation · Multi-Modal Manipulation via Policy Consensus: Keeping Touch from "Dragging the Policy Down" · Soft-Robot Proprioceptive Awakening: GVS Strain Modeling + Sensitivity Ellipsoids for Estimating Shape and 3D External Forces · SuperTac: Multimodal "Electronic Skin" + the DOVE Tactile Language Model (Nature Sensors 2025) · TacRefineNet: Tactile-Only Grasp Refinement · Tactile, the Irreplaceable Modality: Why the Seemingly "Lowest-Level" Sense Matters Most in Embodied AI · 🤚 Tactile Perception — Multimodal Touch Mainline Overview · Tactile VLA · TaF-VLA: Tactile-Force Alignment for VLA · TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance — Let the Policy "See" First, Then "Feel" Precisely · UniTacHand: Unifying Touch with MANO UV Maps for Zero-Shot Human-to-Robot Skill Transfer (Unified Spatio-Tactile Representation) · Vicarious Body Maps: The Neural Basis of "Feeling What You See" in Visuo-Tactile Perception · Visual-Tactile Pretraining + Online Multitask Learning: Unlocking Human-Like Dexterity with Monocular Vision + Binary Touch

🏆 SOTA Rankings

Evo-SOTA Full Leaderboard · 30
CALVIN ABCD-D · saturated · avg_len
# Model Score vs Prev Date Paper
1 Xiaomi-Robotics-0 4.8 Flower VLA +0.13 2026-04-24 arxiv →
2 MMaDA-VLA 4.78 Xiaomi-Robotics-0 +0.03 2026-04-24 arxiv →
3 AVA-VLA 4.65 TriVLA +0.28 2026-04-24 arxiv →
4 GR-2 4.64 DFM-VLA +0.20 2026-04-24 arxiv →
5 NS-VLA 4.56 AtomicVLA +0.29 2026-04-24 arxiv →
6 Flower VLA 4.35 RoboUniview +0.49 2026-04-24 arxiv →
7 MCIL 1.82 2026-04-24 arxiv →
LIBERO standard-opensource · saturated · average
# Model Score vs Prev Date Paper
1 CORAL 99.3 SRPO +0.10 2026-04-24 arxiv →
2 PLD 99.17 NS-VLA +0.57 2026-04-24 arxiv →
3 Dual-CoT VLA 98.8 FocusVLA +0.10 2026-04-24 arxiv →
LIBERO Plus standard-closed total
# Model Score vs Prev Date Paper
1 TAG 87.24 ProGAL-VLA +1.74 2026-04-24 arxiv →
2 ACoT-VLA 86.6 pi0.5 +0.90 2026-04-24 arxiv →
3 NS-VLA 79.4 2026-04-24 arxiv →
MetaWorld non-standard average
# Model Score vs Prev Date Paper
1 MPI 86 iRe-VLA +3.00 2026-04-24 arxiv →
2 pi-RL 85.8 Evo-1 +5.20 2026-04-24 arxiv →
RoboCasa-GR1-Tabletop standard-opensource avg_success_rate
# Model Score vs Prev Date Paper
1 ABot-M0 58.3 TwinBrainVLA +3.70 2026-04-24 arxiv →
2 StarVLA-alpha (generalist) 57.3 Dual-CoT VLA +2.20 2026-04-24 arxiv →
RoboChallenge standard-opensource score
# Model Score vs Prev Date Paper
1 DM0 72.25 Giga-Brain-0.1 +3.91 2026-04-24 arxiv →
2 StarVLA-alpha (generalist) 54.5 2026-04-24 arxiv →