VTLA Plan B: Why Evo-RL Should Not Be the Main Framework
The short answer
Evo-RL is valuable, but it belongs late in the stack
After looking through Evo-RL, my conclusion is simple: it is useful for real-robot post-training and correction loops, but it should not be the main framework if the end goal is VTLA (Vision-Tactile-Language-Action).
Where Evo-RL should sit
- L1: Simulation & data generation
- L2: Visuo-tactile policy learning
- L3: Sim2Real
- L4: Real-robot deployment & continual improvement

Best place for Evo-RL: L4.
Evo-RL is strongest when you already have a working policy and want rollout, human intervention, correction, value/advantage estimation, and retraining on real hardware.
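That deployment loop can be sketched as a DAgger-style intervention buffer. This is a toy illustration only: `rollout_with_interventions`, `retrain_on_corrections`, the scalar "policy gain", and the takeover rule are all hypothetical stand-ins, not Evo-RL's actual API.

```python
def rollout_with_interventions(policy, expert, states, takeover_threshold=0.5):
    """Roll out `policy`; when its action drifts too far from what the human
    operator would do, the operator takes over and the corrected pair is
    logged for retraining (DAgger-style). All names are illustrative."""
    trajectory, corrections = [], []
    for s in states:
        a_policy, a_human = policy(s), expert(s)
        if abs(a_policy - a_human) > takeover_threshold:  # human intervenes
            trajectory.append((s, a_human))
            corrections.append((s, a_human))              # corrected label
        else:
            trajectory.append((s, a_policy))
    return trajectory, corrections

def retrain_on_corrections(policy_gain, corrections, lr=0.1):
    """Toy 'retraining': nudge a scalar policy gain toward the corrected
    actions -- a stand-in for a real gradient update on the policy."""
    for s, a in corrections:
        policy_gain += lr * (a - policy_gain * s) * s
    return policy_gain
```

In the real loop the intervention signal comes from a human operator and the update is a full policy-optimization step with value/advantage weighting; the structure, not the arithmetic, is the point.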
Recommended Plan B
Use UniVTAC or TacSL as the visuo-tactile simulation base, Diffusion Policy as the main policy backbone, and keep Evo-RL for the later real-robot continual-learning stage.
- UniVTAC / TacSL: build the visuo-tactile world and contact-rich task data.
- Diffusion Policy: stable baseline for continuous control and multimodal policy learning.
- Evo-RL: add rollout, intervention, correction, and post-training after deployment.
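At toy scale, the Diffusion Policy backbone's inference is an iterative denoising loop over an action conditioned on the observation. The sketch below uses a hypothetical `toy_denoiser` stand-in for the trained noise-prediction network and is not the real implementation:

```python
import numpy as np

def sample_action(denoiser, obs, steps=10, dim=2, seed=0):
    """Toy diffusion-policy sampling loop: start from Gaussian noise and
    repeatedly apply a denoiser conditioned on the observation.
    `denoiser(a, obs, t)` stands in for the trained network."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)           # a_T ~ N(0, I)
    for t in reversed(range(steps)):
        a = denoiser(a, obs, t)            # one denoising step
    return a

def toy_denoiser(a, obs, t):
    """Illustrative denoiser: pull the sample halfway toward a target
    action derived from the observation (here, the observation itself)."""
    target = np.asarray(obs, dtype=float)
    return a + 0.5 * (target - a)
```

The appeal of this backbone for VTLA is exactly this loop: the conditioning vector can concatenate vision, tactile, and language features without changing the sampler.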
The first task I would choose
I would start with single-arm + two-finger tactile gripper + blind insertion, such as peg insertion or USB-C dummy insertion. The task is small enough to finish, yet it genuinely needs touch: vision only handles the global alignment, and the fine adjustment after contact relies on tactile feedback.
- Simulation goal: success rate ≥ 85% under pose perturbation.
- Real-robot goal: success rate ≥ 70% over 30 consecutive trials.
- Key comparison: VTLA should beat vision-only by a meaningful margin.
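A minimal evaluation harness for those targets could look like the following; the noise level, clearance, and the two stand-in policies are invented for illustration, not measured values:

```python
import random

def success_rate(policy, trials, noise=0.02, clearance=0.01, seed=0):
    """Monte-Carlo estimate of insertion success under pose perturbation:
    a trial succeeds when the residual alignment error fits within the
    clearance. Noise, clearance, and the policies below are made up."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        pose_error = rng.gauss(0.0, noise)   # perturbed initial pose
        ok += abs(policy(pose_error)) < clearance
    return ok / trials

vision_only = lambda e: 0.5 * e   # coarse visual alignment only
vtla        = lambda e: 0.05 * e  # tactile feedback corrects residual error
```

The same harness structure carries over to the real-robot target by replacing the sampled perturbation with the 30 physical trials.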
Why not start from Evo-RL directly?
Because that couples every hard problem at once: tactile sensing, contact simulation, policy design, calibration, deployment, and RL loops. When something fails, you will not know whether the issue is the tactile model, the simulation, the backbone, or the post-training loop.
A safer sequence is:
Prove VTLA in simulation -> deploy a minimal real-robot version -> then add Evo-RL for continual improvement
A practical 3-month path
- Month 1: one task, one tactile representation, one policy, one simulator.
- Month 2: robustness and ablations; prove tactile actually helps.
- Month 3: minimal real deployment first, then connect Evo-RL for correction and retraining.