We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body humanoid control. Prior trackers rely on shallow MLPs, are constrained by scarce data, and face an agility–generalization trade-off; Humanoid-GPT is instead pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic and complex behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier.
Overview of Humanoid Generative Pre-Training for Zero-Shot Motion Tracking. We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained on an unprecedented 2B-frame retargeted corpus; scaling both data and model capacity yields a universal controller that zero-shot tracks highly dynamic, unseen human motions, moving beyond the agility and generalization limits of prior MLP trackers. The figure illustrates Humanoid-GPT's real-time whole-body control and zero-shot dance generation capabilities.
Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
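The DAgger distillation in step (c) can be sketched in a few lines: roll out the current student policy, label every visited state with the expert's action, and retrain the student on the aggregated dataset. The sketch below is a minimal toy instance of that loop, not the actual pipeline — the linear "student" and "expert", the toy dynamics, and all dimensions are hypothetical stand-ins for the GPT-style tracker and the RL motion experts.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy observation/action dimension (the real tracker is far larger)

W_EXPERT = 0.1 * np.ones((DIM, DIM))

def expert_policy(obs):
    # Stand-in for one trained RL motion expert (hypothetical linear map).
    return W_EXPERT @ obs

class Student:
    """Stand-in for the GPT-style student; a linear policy keeps the sketch short."""
    def __init__(self):
        self.W = np.zeros((DIM, DIM))

    def act(self, obs):
        return self.W @ obs

    def fit(self, obs_batch, act_batch):
        # Supervised regression onto the expert's action labels (the distillation step).
        self.W = np.linalg.lstsq(obs_batch, act_batch, rcond=None)[0].T

def rollout(policy, steps=64):
    # Collect the states visited under the given policy in a toy linear environment.
    obs, visited = rng.normal(size=DIM), []
    for _ in range(steps):
        visited.append(obs.copy())
        obs = 0.9 * obs + 0.1 * policy(obs) + 0.01 * rng.normal(size=DIM)
    return visited

student = Student()
obs_data, act_data = [], []
for it in range(5):
    # DAgger: roll out the CURRENT student so the dataset covers the states the
    # student actually reaches, but always label those states with the expert.
    states = rollout(expert_policy if it == 0 else student.act)
    obs_data += states
    act_data += [expert_policy(s) for s in states]
    student.fit(np.asarray(obs_data), np.asarray(act_data))

print(np.max(np.abs(student.W - W_EXPERT)))  # near zero once distillation converges
```

The key difference from plain behavior cloning, and the reason DAgger is used here, is that training states come from the student's own rollouts, so the consolidated tracker is supervised exactly where its mistakes take it.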
Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.
| Method | Tracker | Agile | Zero-shot | #Frames |
|---|---|---|---|---|
| HumanPlus | Transformer | ✗ | ✗ | 7.2M |
| OmniH2O | MLP | ✗ | ✗ | 7.2M |
| ASAP | MLP | ✓ | ✗ | - |
| GMT | MoE-MLP | ✓ | ✗ | 6.0M |
| UniTracker | MLP | ✓ | ✗ | 7.2M |
| TWIST | MLP | ✗ | ~ | 9.2M |
| Any2Track | MLP | ✓ | ✗ | 9.1M |
| SONIC | MLP | ✓ | ✓ | 100M |
| Humanoid-GPT (ours) | Transformer | ✓ | ✓ | 2.0B |
We propose Harmonic Motion Embedding (HME) to quantify motion data diversity in a latent space. Our curated dataset exhibits both a larger embedding scale and broader latent coverage, with an approximately 4-5× increase in log-volume over AMASS.
Dataset diversity in the HME embedding space.
Data distribution visualization.
All motions illustrated are excluded from training to verify generalization capability. Our method tracks diverse, complex, and highly dynamic motions, especially various dance motions, in a zero-shot manner.
Humanoid-GPT enables zero-shot whole-body tracking of highly dynamic, unseen human motions, including dance sequences absent from the training corpus.
The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.
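Scaling curves of this kind are typically summarized by a power law, err ≈ a · N^(−b), fit as a straight line in log-log space. The snippet below shows that fitting procedure on synthetic points; the specific frame counts, errors, and exponent are made up for illustration and are not the paper's measured values.

```python
import numpy as np

# Synthetic (frames, tracking-error) points generated from err = a * N^(-b);
# in practice these would come from the data-scaling sweeps in the figure.
frames = np.array([1e6, 1e7, 1e8, 1e9, 2e9])
a_true, b_true = 5.0, 0.25
errors = a_true * frames ** (-b_true)

# A power law is linear in log-log space: log err = log a - b * log N.
slope, intercept = np.polyfit(np.log(frames), np.log(errors), 1)
b_hat, a_hat = -slope, np.exp(intercept)
print(b_hat, a_hat)  # recovers b = 0.25, a = 5.0 on this noiseless data
```

The fitted exponent b is the quantity of interest: a stable positive b across model sizes is what justifies calling the trend a scaling law rather than a one-off improvement.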
Data Scaling Curve on Zero-shot Performance
Model Scalability Comparison
We carefully optimized the deployment pipeline with ONNX export and TensorRT compilation, achieving an end-to-end inference latency under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.
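End-to-end latency numbers like these are usually obtained by timing repeated forward calls after a warm-up phase. The helper below is a generic, hypothetical benchmark harness in that style, not the project's actual measurement code; the dummy workload stands in for a compiled ONNX/TensorRT inference call.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, iters=100):
    """Wall-clock latency of `fn` in milliseconds: discard warm-up runs
    (JIT compilation, caches), then report mean and worst-case over `iters`."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples), max(samples)

# Dummy workload standing in for one end-to-end inference call.
mean_ms, worst_ms = measure_latency_ms(lambda: sum(range(1000)))
print(f"mean {mean_ms:.3f} ms, worst {worst_ms:.3f} ms")
```

Reporting the worst case alongside the mean matters for real-time control: the control loop must meet its deadline on every tick, not just on average.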
Comparison of inference latency among different optimization methods.
@article{humanoidgpt25,
title={Humanoid Generative Pre-Training for Zero-Shot Motion Tracking},
author={Qi, Zekun and Chen, Xuchuan and Wang, Jilong and Lin, Chenghuai and Lian, Yunrui and Zhang, Zhikai and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
journal={arXiv preprint arXiv:25xx.xxxxx},
year={2025}
}