Humanoid Generative Pre-Training
for Zero-Shot Motion Tracking

* equal contribution   † corresponding author
1Tsinghua University   2Galbot Inc.   3Beihang University   4Shanghai Jiao Tong University   5Peking University   6Shanghai Qi Zhi Institute

Real-time whole-body control and zero-shot dance generation capabilities of Humanoid-GPT on Unitree G1.

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers, which are constrained by scarce data and an agility–generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that Humanoid-GPT establishes a new performance frontier on both sides of this trade-off.

Humanoid-GPT Teaser

Overview of Humanoid Generative Pre-Training for Zero-Shot Motion Tracking. We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained on an unprecedented 2B-frame retargeted corpus; scaling both data and model capacity yields a universal controller that zero-shot tracks highly dynamic and unseen human motions, beyond the agility and generalization limits of prior MLP trackers. The figure illustrates the real-time whole-body control and zero-shot dance generation capabilities of Humanoid-GPT.

Highlights

1 Current humanoid motion trackers face a fundamental agility–generalization trade-off: they can either track agile motions in-domain or generalize to unseen movements, but not both.
2 We assemble an unprecedented 2B-frame motion corpus by aggregating all major mocap datasets (AMASS, LAFAN1, Motion-X++, PHUMA, MotionMillion) with large-scale in-house recordings.
3 We propose Harmonic Motion Embedding (HME), a novel metric that quantifies and categorizes motion diversity directly from the motion data itself.
4 We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained via expert distillation that achieves both highly dynamic motion tracking and unprecedented zero-shot generalization.
5 Extensive experiments demonstrate clear scaling laws: enlarging both data and model capacity yields consistent improvements in tracking accuracy and generalization.

Method Overview

Humanoid-GPT Pipeline

Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
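The DAgger distillation in step (c) can be sketched in miniature. Everything below is a hypothetical stand-in (linear "expert" and "student" policies, toy dynamics) for the paper's RL experts and Transformer tracker; the point is the DAgger loop itself: roll out the student, relabel the visited states with the expert's actions, and refit the student on the aggregated dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 8, 4

# Hypothetical stand-ins: a frozen linear "motion expert" and a linear
# "student" tracker, plus simple bounded toy dynamics.
A = 0.9 * np.eye(obs_dim)                       # passive state decay
B = rng.normal(scale=0.1, size=(obs_dim, act_dim))
W_expert = rng.normal(size=(act_dim, obs_dim))  # frozen expert policy

W_student = np.zeros((act_dim, obs_dim))        # student starts untrained
X, Y = [], []                                   # aggregated DAgger dataset

for _ in range(5):                              # DAgger iterations
    obs = rng.normal(size=obs_dim)
    for _ in range(200):                        # roll out the *student*
        X.append(obs)
        Y.append(W_expert @ obs)                # relabel with expert action
        act = W_student @ obs                   # student drives the rollout
        obs = np.tanh(A @ obs + B @ act + rng.normal(scale=0.01, size=obs_dim))
    # Supervised refit on everything collected so far (least squares).
    W_student = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)[0].T

err = np.abs(W_student - W_expert).max()
print(f"max weight error after distillation: {err:.2e}")
```

Because the student, not the expert, generates the visited states, the aggregated dataset covers exactly the distribution the student will face at deployment, which is what lets a single generalist absorb many experts.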

Comparison with Prior Works

Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.

Method               | Tracker     | Agile | Zero-shot | #Frames
HumanPlus            | Transformer |       |           | 7.2M
OmniH2O              | MLP         |       |           | 7.2M
ASAP                 | MLP         |       |           | -
GMT                  | MoE-MLP     |       |           | 6.0M
UniTracker           | MLP         |       |           | 7.2M
TWIST                | MLP         |       |           | ~9.2M
Any2Track            | MLP         |       |           | 9.1M
SONIC                | MLP         |       |           | 100M
Humanoid-GPT (ours)  | Transformer | ✓     | ✓         | 2.0B

Data Diversity Analysis

We propose Harmonic Motion Embedding (HME) to quantify motion data diversity in a latent space. Our curated dataset exhibits both a larger embedding scale and broader latent coverage, with an approximately 4–5× increase in log-volume compared with AMASS.
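The HME internals are not spelled out here, but the log-volume comparison can be illustrated with a generic latent-diversity measure: the log-determinant of the embedding covariance, which grows as the point cloud spreads over more of the latent space. The embeddings below are synthetic stand-ins, not real HME outputs.

```python
import numpy as np

def log_volume(embeddings, eps=1e-6):
    """Log-volume of a point cloud in embedding space, measured (up to a
    constant) as half the log-determinant of its covariance. Larger values
    mean the embeddings cover a bigger region of latent space."""
    emb = np.asarray(embeddings)
    cov = np.cov(emb, rowvar=False) + eps * np.eye(emb.shape[1])
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet

rng = np.random.default_rng(0)
narrow = rng.normal(scale=0.5, size=(1000, 16))  # stand-in: a small corpus
broad = rng.normal(scale=2.0, size=(1000, 16))   # stand-in: a curated corpus

gap = log_volume(broad) - log_volume(narrow)
print(f"log-volume gap: {gap:.2f}")
```

The `eps` ridge keeps the determinant finite when some latent directions are (near-)degenerate, a common situation with motion embeddings.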

Data Diversity Comparison

Dataset diversity in the HME embedding space.

Data Distribution

Data distribution visualization.

Real-World Experiments

All motions shown here are excluded from training to verify generalization. Our method tracks diverse, complex, and highly dynamic motions in a zero-shot manner, especially various dance motions.

Real-world Experiments

Demo Videos

Humanoid-GPT enables zero-shot whole-body tracking of highly dynamic and unseen human motions, including various dance sequences that were never seen during training.

Zero-shot Dance: Can Do Can Go!
Zero-shot Dance: Gokuraku Joudo
Zero-shot Dance: HuoYuanJia/Fearless
Zero-shot Dance: PokerFace
Real-time Teleoperation
Agile Motion Tracking

Scaling Laws

The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.
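A scaling law of this kind is usually summarized by fitting a power law, err ≈ c·N^(−α), which is a straight line in log-log space. The numbers below are purely illustrative, not the paper's measurements; they show only how the exponent α would be extracted from (corpus size, error) pairs.

```python
import numpy as np

# Illustrative (frames, tracking-error) pairs on an exact power law
# err = c * N**(-alpha); real values come from the scaling figures.
frames = np.array([1e7, 1e8, 5e8, 2e9])  # corpus size in frames
error = 0.5 * frames ** -0.12            # synthetic tracking error

# A power law is linear in log-log space, so an ordinary linear fit
# recovers the exponent (slope) and prefactor (intercept).
slope, log_c = np.polyfit(np.log(frames), np.log(error), 1)
alpha, c = -slope, np.exp(log_c)
print(f"fitted: err ~ {c:.3f} * N^(-{alpha:.3f})")
```

A fit like this also gives a crude extrapolation of how much more data a target error level would require, which is how scaling curves are typically read.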

Data Scaling

Data Scaling Curve on Zero-shot Performance

Model Scalability

Model Scalability Comparison

Inference Optimization

We carefully optimized the deployment pipeline with ONNX export and TensorRT compilation, achieving an end-to-end inference latency under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.
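Sub-millisecond latency claims of this kind are measured end to end with a warm-up phase followed by percentile timing. A minimal harness for that kind of measurement is sketched below; the numpy matmul is a hypothetical stand-in for the actual TensorRT engine call.

```python
import time
import numpy as np

def bench(fn, warmup=10, iters=100):
    """Warm up (JIT, caches, clocks), then time each call in milliseconds
    and report median and tail latency."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # ms
    return np.percentile(samples, 50), np.percentile(samples, 99)

# Stand-in workload; in deployment this would invoke the compiled engine.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 512)).astype(np.float32)
w = rng.normal(size=(512, 512)).astype(np.float32)
p50, p99 = bench(lambda: x @ w)
print(f"p50={p50:.3f} ms, p99={p99:.3f} ms")
```

Reporting a tail percentile alongside the median matters for real-time control, since a single slow inference step can destabilize the tracking loop even when the average latency is low.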

Inference Optimization

Comparison of inference latency among different optimization methods.

BibTeX

@article{humanoidgpt25,
  title={Humanoid Generative Pre-Training for Zero-Shot Motion Tracking},
  author={Qi, Zekun and Chen, Xuchuan and Wang, Jilong and Lin, Chenghuai and Lian, Yunrui and Zhang, Zhikai and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
  journal={arXiv preprint arXiv:25xx.xxxxx},
  year={2025}
}