VTLA Plan B: Why Evo-RL Should Not Be the Main Framework
The short answer
Evo-RL is valuable, but it belongs late in the stack
After looking through Evo-RL, my conclusion is simple: it is useful for real-robot post-training and correction loops, but it should not be the main framework if the end goal is VTLA (Vision-Tactile-Language-Action).
Where Evo-RL should sit
- L1: Simulation & data generation
- L2: Visuo-tactile policy learning
- L3: Sim2Real
- L4: Real-robot deployment & continual improvement

Best place for Evo-RL: L4.
Evo-RL is strongest when you already have a working policy and want rollout, human intervention, correction, value/advantage estimation, and retraining on real hardware.
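That deployment loop can be sketched as a DAgger-style intervention buffer. This is a toy illustration only: `rollout_with_interventions`, `retrain_on_corrections`, the scalar "policy gain", and the takeover rule are all hypothetical stand-ins, not Evo-RL's actual API.

```python
def rollout_with_interventions(policy, expert, states, takeover_threshold=0.5):
    """Roll out `policy`; when its action drifts too far from what the human
    operator would do, the operator takes over and the corrected pair is
    logged for retraining (DAgger-style). All names are illustrative."""
    trajectory, corrections = [], []
    for s in states:
        a_policy, a_human = policy(s), expert(s)
        if abs(a_policy - a_human) > takeover_threshold:  # human intervenes
            trajectory.append((s, a_human))
            corrections.append((s, a_human))              # corrected label
        else:
            trajectory.append((s, a_policy))
    return trajectory, corrections

def retrain_on_corrections(policy_gain, corrections, lr=0.1):
    """Toy 'retraining': nudge a scalar policy gain toward the corrected
    actions -- a stand-in for a real gradient update on the policy."""
    for s, a in corrections:
        policy_gain += lr * (a - policy_gain * s) * s
    return policy_gain
```

In the real loop the intervention signal comes from a human operator and the update is a full policy-optimization step with value/advantage weighting; the structure, not the arithmetic, is the point.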
Recommended Plan B
Use UniVTAC or TacSL as the visuo-tactile simulation base, Diffusion Policy as the main policy backbone, and keep Evo-RL for the later real-robot continual-learning stage.
- UniVTAC / TacSL: build the visuo-tactile world and contact-rich task data.
- Diffusion Policy: stable baseline for continuous control and multimodal policy learning.
- Evo-RL: add rollout, intervention, correction, and post-training after deployment.
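At toy scale, the Diffusion Policy backbone's inference is an iterative denoising loop over an action conditioned on the observation. The sketch below uses a hypothetical `toy_denoiser` stand-in for the trained noise-prediction network and is not the real implementation:

```python
import numpy as np

def sample_action(denoiser, obs, steps=10, dim=2, seed=0):
    """Toy diffusion-policy sampling loop: start from Gaussian noise and
    repeatedly apply a denoiser conditioned on the observation.
    `denoiser(a, obs, t)` stands in for the trained network."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)           # a_T ~ N(0, I)
    for t in reversed(range(steps)):
        a = denoiser(a, obs, t)            # one denoising step
    return a

def toy_denoiser(a, obs, t):
    """Illustrative denoiser: pull the sample halfway toward a target
    action derived from the observation (here, the observation itself)."""
    target = np.asarray(obs, dtype=float)
    return a + 0.5 * (target - a)
```

The appeal of this backbone for VTLA is exactly this loop: the conditioning vector can concatenate vision, tactile, and language features without changing the sampler.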
The first task I would choose
I would start with single-arm + two-finger tactile gripper + blind insertion, such as peg insertion or USB-C dummy insertion. The task is small enough to finish, yet it genuinely needs touch: vision only handles the global alignment, and the fine adjustment after contact relies on tactile feedback.
- Simulation goal: success rate ≥ 85% under pose perturbation.
- Real-robot goal: success rate ≥ 70% over 30 consecutive trials.
- Key comparison: VTLA should beat vision-only by a meaningful margin.
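A minimal evaluation harness for those targets could look like the following; the noise level, clearance, and the two stand-in policies are invented for illustration, not measured values:

```python
import random

def success_rate(policy, trials, noise=0.02, clearance=0.01, seed=0):
    """Monte-Carlo estimate of insertion success under pose perturbation:
    a trial succeeds when the residual alignment error fits within the
    clearance. Noise, clearance, and the policies below are made up."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        pose_error = rng.gauss(0.0, noise)   # perturbed initial pose
        ok += abs(policy(pose_error)) < clearance
    return ok / trials

vision_only = lambda e: 0.5 * e   # coarse visual alignment only
vtla        = lambda e: 0.05 * e  # tactile feedback corrects residual error
```

The same harness structure carries over to the real-robot target by replacing the sampled perturbation with the 30 physical trials.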
Why not start from Evo-RL directly?
Because that couples every hard problem at once: tactile sensing, contact simulation, policy design, calibration, deployment, and RL loops. When something fails, you will not know whether the issue is the tactile model, the simulation, the backbone, or the post-training loop.
A safer sequence is:
Prove VTLA in simulation -> deploy a minimal real-robot version -> then add Evo-RL for continual improvement
A practical 3-month path
- Month 1: one task, one tactile representation, one policy, one simulator.
- Month 2: robustness and ablations; prove tactile actually helps.
- Month 3: minimal real deployment first, then connect Evo-RL for correction and retraining.