System Concept (DexHand × Sensing × Learning)

What problem should the sensing system solve?

Dexterous manipulation fails less often because the robot “can’t see” than because contact is nonlinear: friction is unknown, occlusion is common, hard collisions with the tabletop do happen, and small geometric errors are amplified. A DexHand sensing system should therefore be designed to make contact measurable, controllable, and replayable.

Contact · Wrench/Friction · Slip/Pre-slip · Replayability

Sensor stack (from minimum viable to “robot skin”)

  • Level-0 (default): joint encoders (q/dq), motor current/voltage, temperature/power.
  • Level-1 (stable grasp): fingertip normal force, optional wrist 6D F/T, tactile array CoP/contact area.
  • Level-2 (micro-manipulation): shear/slip cues (direct or inferred), pre-slip detection.
  • Level-3 (high information density): visuotactile or dynamic tactile arrays (for learning contact geometry).
  • Level-4 (future): palm-scale skin, nail/edge sensing for prying/insertion, near-field proximity.
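
As an illustration of the Level-1 tactile quantities, the sketch below derives a center of pressure (CoP) and a contact area from one taxel frame. It assumes a rectangular array with uniform pitch and a fixed contact threshold; the function name `contact_summary`, the parameters `pitch_mm` and `threshold`, and the example values are hypothetical and do not describe any particular sensor.

```python
from typing import List, Optional, Tuple


def contact_summary(
    pressure: List[List[float]],
    pitch_mm: float = 1.0,
    threshold: float = 0.05,
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Return (CoP as (x_mm, y_mm), contact area in mm^2) for one taxel frame.

    pressure[r][c] is the reading of the taxel in row r, column c.
    Taxels at or below `threshold` are treated as not in contact (assumption).
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    active = 0
    for r, row in enumerate(pressure):
        for c, p in enumerate(row):
            if p > threshold:
                total += p
                sum_x += p * c * pitch_mm  # pressure-weighted column position
                sum_y += p * r * pitch_mm  # pressure-weighted row position
                active += 1
    if active == 0:
        return None, 0.0  # no contact in this frame
    cop = (sum_x / total, sum_y / total)
    return cop, active * pitch_mm * pitch_mm


# Two active taxels on a 3x3 array with 2 mm pitch (made-up values).
frame = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
cop, area = contact_summary(frame, pitch_mm=2.0)
```

The CoP is expressed in the array's own grid coordinates; mapping it into the finger frame is a calibration step.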

Three-loop architecture (fast reflex / tactile coordination / slow semantics)

DexHand Sensing/Control: three-layer closed loop

  Slow (5~15 Hz): semantics & planning
  ┌───────────────────────────────────────────────────────────┐
  │ VLM/VLA: stage machine, goal selection, recovery policy   │
  └──────────────────────────────┬────────────────────────────┘
                                 │ goals + constraints
                                 ▼
  Mid (60~200 Hz): tactile coordination
  ┌───────────────────────────────────────────────────────────┐
  │ Slip/Force controller: friction margin, Δgrip, micro-move  │
  └──────────────────────────────┬────────────────────────────┘
                                 │ impedance/targets
                                 ▼
  Fast (500~1000 Hz): on-hand reflex & safety
  ┌───────────────────────────────────────────────────────────┐
  │ RT loop: encoder/current control, over-current, thermal    │
  │ derating, collision detection, emergency stop              │
  └───────────────────────────────────────────────────────────┘

This separation makes “hold steady” independent of 30 Hz vision, and makes safety independent of PC scheduling jitter.
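
One way to read “friction margin” and “Δgrip” in the mid loop is sketched below. It assumes a simple Coulomb friction model with an estimated coefficient `mu` and a clamped proportional correction. The function names, gain, and limits are illustrative assumptions, not the controller actually used.

```python
def friction_margin(f_normal: float, f_tangential: float, mu: float) -> float:
    """Distance to the Coulomb friction limit in newtons.

    A value <= 0 means the tangential load has reached mu * f_normal,
    so slip is expected under this (assumed) friction model.
    """
    return mu * f_normal - abs(f_tangential)


def grip_delta(
    margin: float,
    target_margin: float,
    gain: float,
    max_step: float,
) -> float:
    """Proportional change in grip force, clamped to +/- max_step per tick."""
    delta = gain * (target_margin - margin)
    return max(-max_step, min(max_step, delta))


# One mid-loop tick with made-up numbers: 4 N normal, 1.5 N shear, mu = 0.5.
margin = friction_margin(f_normal=4.0, f_tangential=1.5, mu=0.5)
delta = grip_delta(margin, target_margin=1.0, gain=0.5, max_step=0.2)
```

Clamping the step keeps one noisy tactile frame from producing a large grip change. The resulting target would then be handed to the fast loop as a force or impedance setpoint.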

Time sync & calibration (the system’s foundation)

  • Timestamp-at-source: tag at the device when data is produced (not when received).
  • Unified clock: PTP (IEEE 1588) preferred; hardware trigger as a fallback.
  • Alignment: ring buffer + zero-order hold (ZOH) or linear interpolation to fuse 30 Hz vision with 1 kHz proprioception.
  • Tactile calibration: taxel/visuotactile pixels → finger frame; raw signal → force proxy; cross-modal alignment with vision (e.g. point clouds) to build supervision.
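
The alignment step can be sketched as a small ring buffer that is queried at a camera frame's source timestamp. This is a minimal sketch that assumes scalar samples pushed in strictly increasing time order; the class `SampleBuffer` and its methods are hypothetical names.

```python
from bisect import bisect_right
from collections import deque
from typing import Deque, List, Tuple


class SampleBuffer:
    """Ring buffer of (source_timestamp, value) samples, pushed in time order."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest sample once capacity is reached.
        self._buf: Deque[Tuple[float, float]] = deque(maxlen=capacity)

    def push(self, t: float, value: float) -> None:
        self._buf.append((t, value))

    def _index_after(self, t: float) -> int:
        """Index of the first sample with timestamp > t."""
        times: List[float] = [ts for ts, _ in self._buf]
        i = bisect_right(times, t)
        if i == 0:
            raise ValueError("query time precedes buffered data")
        return i

    def zoh(self, t: float) -> float:
        """Zero-order hold: the latest sample with timestamp <= t."""
        return self._buf[self._index_after(t) - 1][1]

    def lerp(self, t: float) -> float:
        """Linear interpolation between the two samples bracketing t."""
        i = self._index_after(t)
        t0, v0 = self._buf[i - 1]
        if i == len(self._buf):
            return v0  # no newer sample yet: fall back to hold
        t1, v1 = self._buf[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Joint angle sampled every 1 ms (timestamps in ms), queried at a frame time of 2.5 ms.
joint = SampleBuffer(capacity=1000)
for t_ms, q in [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]:
    joint.push(t_ms, q)
held = joint.zoh(2.5)
interp = joint.lerp(2.5)
```

Hold is the natural choice for discrete signals such as a slip flag, while interpolation suits continuous signals such as joint angles. Both rely on source timestamps that come from a shared clock.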

Learning-ready data schema (what to log)

  • Vision: wrist/palm RGB(D) (+ timestamps)
  • Proprioception: q/dq/current/temp (+ timestamps)
  • Tactile: taxel map / CoP / contact area / slip flag (or visuotactile image)
  • Action: Δq/Δx or impedance parameters (what was actually commanded)
  • QC labels: replay_ok, recovery_event
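
As a rough sketch, the fields above could be grouped into one record per control step and written as JSON lines. All class and field names here (`Proprio`, `Tactile`, `Step`, `t_source`, `image_path`, and so on) are assumptions chosen for illustration. So is storing a path to the image rather than the pixels.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Proprio:
    t_source: float          # timestamp taken at the device
    q: List[float]           # joint positions
    dq: List[float]          # joint velocities
    current: List[float]     # motor currents
    temp: List[float]        # motor temperatures


@dataclass
class Tactile:
    t_source: float
    cop: Optional[List[float]]   # center of pressure, None if no contact
    contact_area: float
    slip: bool


@dataclass
class Step:
    proprio: Proprio
    tactile: Tactile
    action_dq: List[float]            # commanded joint delta
    image_path: Optional[str] = None  # reference to the RGB(D) frame
    t_image: Optional[float] = None   # source timestamp of that frame
    replay_ok: bool = False           # QC label
    recovery_event: bool = False      # QC label


step = Step(
    proprio=Proprio(t_source=12.001, q=[0.1, 0.2], dq=[0.0, 0.0],
                    current=[0.3, 0.4], temp=[41.0, 42.5]),
    tactile=Tactile(t_source=12.000, cop=[3.5, 2.0], contact_area=8.0, slip=True),
    action_dq=[0.01, -0.01],
    replay_ok=True,
)
line = json.dumps(asdict(step))  # one JSON line per logged step
record = json.loads(line)
```

Each modality keeps its own source timestamp, so alignment can be redone offline instead of being baked into the log.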

Future direction: tactile expectation & recovery as “first-class skills”

The next step is to make tactile sensing not only an input but a predictable signal: learn a tactile expectation model, and trigger recovery when the measured tactile signal deviates from the prediction. This aligns with the view that diverse, messy, fail-and-retry data is crucial for robust policies (see the Spirit-v1.5 Blog).
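
The deviation check could look roughly like the sketch below. It assumes the expectation model (not shown) outputs a predicted tactile vector at each step, and that recovery should fire only after the residual stays high for several consecutive steps. The RMS residual, the threshold, and the `RecoveryTrigger` class are illustrative assumptions.

```python
from typing import Sequence


def tactile_residual(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Root-mean-square difference between predicted and measured tactile vectors."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    n = len(predicted)
    return (sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n) ** 0.5


class RecoveryTrigger:
    """Fires when the residual exceeds `threshold` for `patience` steps in a row."""

    def __init__(self, threshold: float, patience: int) -> None:
        self.threshold = threshold
        self.patience = patience
        self._count = 0

    def update(self, residual: float) -> bool:
        # Reset the counter as soon as the signal is back within expectation.
        self._count = self._count + 1 if residual > self.threshold else 0
        return self._count >= self.patience


# Made-up sequence: prediction matches once, then contact deviates twice.
trigger = RecoveryTrigger(threshold=1.0, patience=2)
predicted = [0.0, 0.0]
frames = [[0.0, 0.0], [3.0, 3.0], [3.0, 3.0]]
fired = [trigger.update(tactile_residual(predicted, m)) for m in frames]
```

Requiring consecutive violations is one simple way to avoid reacting to a single noisy frame. The steps on which the trigger fires are natural candidates for the `recovery_event` label.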
