Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Yang, Timing; He, Sicheng; Jing, Hongyi; Yang, Jiawei; Liu, Zhijian; Zou, Chuhang; Wang, Yue

Timing Yang¹, Sicheng He¹, Hongyi Jing¹, Jiawei Yang¹, Zhijian Liu²,³,
Chuhang Zou⁴,†, Yue Wang¹,³,†

¹USC Physical Superintelligence (PSI) Lab ²University of California, San Diego ³NVIDIA ⁴Meta Reality Labs
† Joint corresponding authors

Speed-accuracy overview of Fast SAM 3D Body

Fast SAM 3D Body achieves up to 10.9× end-to-end speedup and over 10,000× faster MHR-to-SMPL conversion, enabling real-time humanoid control from a single RGB stream.

Abstract

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000×. Overall, our framework delivers up to a 10.9× end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that—unlike methods reliant on wearable IMUs—enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
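The idea of replacing iterative fitting with a direct feedforward mapping can be illustrated with a deliberately small toy problem. The sketch below is not the paper's MHR-to-SMPL converter: the 2x2 linear model `W`, its values, and every function name are hypothetical stand-ins chosen only to show why a mapping computed once offline avoids the per-frame optimization loop.

```python
# Toy illustration only: a 2x2 linear model stands in for the real body models.
# W, its values, and all function names are hypothetical, not from the paper.
W = [[2.0, 0.5],
     [0.5, 1.0]]  # assumed forward model: observation = W @ params


def forward(params):
    """Map parameters to observations with the toy linear model."""
    return [W[0][0] * params[0] + W[0][1] * params[1],
            W[1][0] * params[0] + W[1][1] * params[1]]


def fit_iteratively(obs, steps=500, lr=0.1):
    """Recover params by gradient descent on the squared error (per-frame loop)."""
    p = [0.0, 0.0]
    for _ in range(steps):
        pred = forward(p)
        r = [pred[0] - obs[0], pred[1] - obs[1]]
        # Gradient of 0.5 * ||W p - obs||^2 with respect to p is W^T r.
        g = [W[0][0] * r[0] + W[1][0] * r[1],
             W[0][1] * r[0] + W[1][1] * r[1]]
        p = [p[0] - lr * g[0], p[1] - lr * g[1]]
    return p


def invert_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


M = invert_2x2(W)  # computed once, offline


def map_feedforward(obs):
    """Recover params with a single matrix-vector product (no loop)."""
    return [M[0][0] * obs[0] + M[0][1] * obs[1],
            M[1][0] * obs[0] + M[1][1] * obs[1]]


obs = forward([0.3, -0.7])
print(fit_iteratively(obs))   # hundreds of update steps for one frame
print(map_feedforward(obs))   # one pass for the same frame
```

Both routes recover the same parameters in this toy setting, but the iterative one repeats its update loop for every frame, while the feedforward one pays its cost once up front. The real conversion is nonlinear, so the speedup reported above should not be read off this example.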

Real-Time Humanoid Teleoperation

Fast SAM 3D Body enables real-time, vision-only teleoperation of the Unitree G1 humanoid robot at ~65 ms latency per frame on an NVIDIA RTX 5090. The system directly translates SMPL kinematics for robotic control, enabling collection of whole-body manipulation policies from a single RGB stream.
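As a rough sanity check on what ~65 ms per frame means in practice, the latency can be converted to a frame rate. The helper below is a hypothetical illustration, and it assumes frames are processed one after another with no pipelining or batching, so throughput is simply the reciprocal of latency.

```python
def frames_per_second(latency_ms):
    """Convert per-frame latency in milliseconds to frames per second.

    Assumes strictly sequential processing: one frame finishes before
    the next one starts.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


fps = frames_per_second(65.0)
print(f"{fps:.1f} FPS")
```

Under that assumption, 65 ms per frame corresponds to roughly 15 frames per second.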

Object Manipulation

Lifting Box

Kneeling & Lower Body

Alternating Knee Switch

Stand Up to Half-Kneeling

Single-Knee Lift

Left Side-on Kneel

Right Side-on Kneel

Single-Knee Kneel

Upper Body & Full Body

Crouch and Rotate

Raise Hands and Squat

Arm Raise with Swivel

Upper-Body Gestures

Raise Elbow

Locomotion

Turn and Move Forward

Forward and Backward

Move Forward

Qualitative Results

Visual comparison between SAM 3D Body and our accelerated Fast SAM 3D Body on diverse in-the-wild images. Our method preserves high-fidelity reconstruction quality across challenging scenarios.

BibTeX

@article{yang2026fastsam3dbody,
      title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
      author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
      journal={arXiv preprint arXiv:2603.15603},
      year={2026}
}