We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body humanoid control. Prior trackers rely on shallow MLPs, are constrained by scarce data, and face an agility–generalization trade-off; Humanoid-GPT is instead pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic and complex behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier.
Overview of Humanoid Generative Pre-Training for Zero-Shot Motion Tracking. We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained on an unprecedented 2B-frame retargeted corpus; scaling both data and model capacity yields a universal controller that zero-shot tracks highly dynamic, unseen human motions, moving beyond the agility and generalization limits of prior MLP trackers. The figure illustrates Humanoid-GPT's real-time whole-body control and zero-shot dance generation capabilities.
Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
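The DAgger distillation in step (c) can be sketched in a few lines: roll out the current student policy, label every visited state with the expert's action, and retrain the student on the aggregated dataset. The sketch below is a minimal toy instance of that loop, not the actual pipeline — the linear "student" and "expert", the toy dynamics, and all dimensions are hypothetical stand-ins for the GPT-style tracker and the RL motion experts.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy observation/action dimension (the real tracker is far larger)

W_EXPERT = 0.1 * np.ones((DIM, DIM))

def expert_policy(obs):
    # Stand-in for one trained RL motion expert (hypothetical linear map).
    return W_EXPERT @ obs

class Student:
    """Stand-in for the GPT-style student; a linear policy keeps the sketch short."""
    def __init__(self):
        self.W = np.zeros((DIM, DIM))

    def act(self, obs):
        return self.W @ obs

    def fit(self, obs_batch, act_batch):
        # Supervised regression onto the expert's action labels (the distillation step).
        self.W = np.linalg.lstsq(obs_batch, act_batch, rcond=None)[0].T

def rollout(policy, steps=64):
    # Collect the states visited under the given policy in a toy linear environment.
    obs, visited = rng.normal(size=DIM), []
    for _ in range(steps):
        visited.append(obs.copy())
        obs = 0.9 * obs + 0.1 * policy(obs) + 0.01 * rng.normal(size=DIM)
    return visited

student = Student()
obs_data, act_data = [], []
for it in range(5):
    # DAgger: roll out the CURRENT student so the dataset covers the states the
    # student actually reaches, but always label those states with the expert.
    states = rollout(expert_policy if it == 0 else student.act)
    obs_data += states
    act_data += [expert_policy(s) for s in states]
    student.fit(np.asarray(obs_data), np.asarray(act_data))

print(np.max(np.abs(student.W - W_EXPERT)))  # near zero once distillation converges
```

The key difference from plain behavior cloning, and the reason DAgger is used here, is that training states come from the student's own rollouts, so the consolidated tracker is supervised exactly where its mistakes take it.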
Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.
| Method | Tracker | Agile | Zero-shot | #Frames |
|---|---|---|---|---|
| HumanPlus | Transformer | ✗ | ✗ | 7.2M |
| OmniH2O | MLP | ✗ | ✗ | 7.2M |
| ASAP | MLP | ✓ | ✗ | - |
| GMT | MoE-MLP | ✓ | ✗ | 6.0M |
| UniTracker | MLP | ✓ | ✗ | 7.2M |
| TWIST | MLP | ✗ | ~ | 9.2M |
| Any2Track | MLP | ✓ | ✗ | 9.1M |
| SONIC | MLP | ✓ | ✓ | 100M |
| Humanoid-GPT (ours) | Transformer | ✓ | ✓ | 2.0B |
We propose Harmonic Motion Embedding (HME) to quantify motion data diversity in a latent space. Our curated dataset exhibits both a larger embedding scale and broader latent coverage, with an approximately 4-5× increase in log-volume over AMASS.
Dataset diversity in the HME embedding space.
Data distribution visualization.
All motions illustrated are excluded from training to verify generalization capability. Our method tracks diverse, complex, and highly dynamic motions, especially various dance motions, in a zero-shot manner.
Humanoid-GPT enables zero-shot whole-body tracking of highly dynamic, unseen human motions, including dance sequences absent from the training corpus.
The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.
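Scaling curves of this kind are typically summarized by a power law, err ≈ a · N^(−b), fit as a straight line in log-log space. The snippet below shows that fitting procedure on synthetic points; the specific frame counts, errors, and exponent are made up for illustration and are not the paper's measured values.

```python
import numpy as np

# Synthetic (frames, tracking-error) points generated from err = a * N^(-b);
# in practice these would come from the data-scaling sweeps in the figure.
frames = np.array([1e6, 1e7, 1e8, 1e9, 2e9])
a_true, b_true = 5.0, 0.25
errors = a_true * frames ** (-b_true)

# A power law is linear in log-log space: log err = log a - b * log N.
slope, intercept = np.polyfit(np.log(frames), np.log(errors), 1)
b_hat, a_hat = -slope, np.exp(intercept)
print(b_hat, a_hat)  # recovers b = 0.25, a = 5.0 on this noiseless data
```

The fitted exponent b is the quantity of interest: a stable positive b across model sizes is what justifies calling the trend a scaling law rather than a one-off improvement.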
Data Scaling Curve on Zero-shot Performance
Model Scalability Comparison
We carefully optimized the deployment pipeline with ONNX export and TensorRT compilation, achieving an end-to-end inference latency under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.
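End-to-end latency numbers like these are usually obtained by timing repeated forward calls after a warm-up phase. The helper below is a generic, hypothetical benchmark harness in that style, not the project's actual measurement code; the dummy workload stands in for a compiled ONNX/TensorRT inference call.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, iters=100):
    """Wall-clock latency of `fn` in milliseconds: discard warm-up runs
    (JIT compilation, caches), then report mean and worst-case over `iters`."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples), max(samples)

# Dummy workload standing in for one end-to-end inference call.
mean_ms, worst_ms = measure_latency_ms(lambda: sum(range(1000)))
print(f"mean {mean_ms:.3f} ms, worst {worst_ms:.3f} ms")
```

Reporting the worst case alongside the mean matters for real-time control: the control loop must meet its deadline on every tick, not just on average.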
Comparison of inference latency among different optimization methods.
@article{humanoidgpt25,
title={Humanoid Generative Pre-Training for Zero-Shot Motion Tracking},
author={Qi, Zekun and Chen, Xuchuan and Wang, Jilong and Lin, Chenghuai and Lian, Yunrui and Zhang, Zhikai and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
journal={arXiv preprint arXiv:25xx.xxxxx},
year={2025}
}