Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms
This repository presents an efficient acceleration pipeline for Diffusion Transformer (DiT) based video generation models, optimized for AMD client-grade GPUs, including Navi48 dGPUs and Strix Halo iGPUs.
Built upon this pipeline, we introduce Hummingbird-XT, a new family of DiT-based text-to-video models derived from Wan2.2-5B that achieves high-quality video generation at significantly reduced inference cost. To extend the length of generated videos, we also introduce Hummingbird-XTX, an efficient autoregressive model for long-video generation based on Wan-2.1-1.3B.
Hummingbird-XT Text-to-Video Showcases
| Caption | Video |
|---|---|
| The young East Asian man with short black hair, fair skin, and monolid eyes looks ahead. A young East Asian woman with long black hair and fair skin turns to smile warmly at him. The background is blurred, focusing on their shared gaze. Realistic cinematic style. | The_young_East_Asian_man_with_short_black_hair__fair_skin__and_monolid_eyes_look.3.mp4 |
| A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about. | Stylish_woman_walks_confidently_down_a_Tokyo_street_filled_with_warm__glowing_ne.mp4 |
| Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image. | Animated_scene_features_a_close-up_of_a_short_fluffy_monster_kneeling_beside_a_m.4.mp4 |
Hummingbird-XT Image-to-Video Showcases
| Caption | Video |
|---|---|
| A back-view close-up focusing on the runner’s feet striking the track. Only subtle movement occurs—his steps land firmly, kicking a small amount of dust or rubber granules. The camera stays low and straight-on behind him, following smoothly with minimal shake. The sunlight is bright, with long shadows stretching forward. | a_back-view_close-up_focusing_on_the_runner_s_feet_striking_the_track__Only_subt.5.mp4 |
| A graceful woman stands under a majestic sandstone arch, forming a small heart shape with her fingers close to the camera while smiling warmly and radiating joy. Behind her, a smooth and elegant fountain rises gracefully, its water reflecting the warm, inviting courtyard walls in a mirror-like fashion. | A_graceful_woman_stands_under_a_majestic_sandstone_arch__forming_a_small_heart_s.1.mp4 |
| 舞台上,一名男子弹奏着一把由闪电构成的电吉他。随着音乐渐强,火花在他周围噼啪作响。突然,耀眼的光芒转为暗红色,他的双眼发出幽光,黑色的翅膀从背后羽化而出。他的皮肤变得黝黑,闪电缠绕着他的身体,他化身为一个恶魔,伫立在翻滚的烟雾和雷鸣之中。 (English translation: On stage, a man plays an electric guitar made of lightning. As the music swells, sparks crackle around him. Suddenly, the dazzling light turns dark red, his eyes glow with an eerie light, and black wings unfurl from his back. His skin darkens, lightning coils around his body, and he transforms into a demon, standing amid billowing smoke and thunder.) | man.mp4 |
Hummingbird-XTX 20-Second Video Showcases
| Caption | Video |
|---|---|
| Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field. | vsink1.mp4 |
| A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors. | 00002.mp4 |
| A cinematic wide portrait of a man with his face lit by the glow of a TV. | 00090.mp4 |
- [2026.01.09]: 🔥🔥 Released the full code and pre-trained weights of HummingbirdXT!
- [2026.01.08]: 🔥🔥 Released our blog: Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms!
Hummingbird-XT
- DiT-based text-to-video model built upon Wan2.2-5B
- Optimized for few-step inference
- Designed for efficient deployment on AMD GPUs
- Maintains competitive visual quality compared to full-step baselines
Hummingbird-XTX (Long Video Extension)
- Extends Hummingbird-XT to support efficient long video generation
- Improves temporal consistency across extended sequences
- Suitable for long-form generation scenarios
Lightweight VAE Decoder
- Lightweight and efficient VAE decoder
- Introduces 3D depthwise (DW) convolutions to reduce redundancy in the original 3D convolutions
- Achieves a 14x decoding speedup and reduces memory usage by 4x
- Preserves competitive reconstruction and generation quality
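To make the redundancy argument concrete, the following is a minimal sketch that counts weights in a standard 3D convolution versus a depthwise 3D convolution followed by a 1x1x1 pointwise convolution. The pointwise step and the layer size (256 channels, kernel size 3) are assumptions chosen for illustration; the actual decoder architecture may differ.

```python
def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard 3D convolution with a k*k*k kernel (bias ignored)."""
    return c_in * c_out * k ** 3


def dw_separable_conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise 3D conv (one k*k*k filter per input channel)
    followed by a 1x1x1 pointwise conv that mixes channels (bias ignored)."""
    depthwise = c_in * k ** 3
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
C_IN, C_OUT, K = 256, 256, 3
standard = conv3d_params(C_IN, C_OUT, K)
separable = dw_separable_conv3d_params(C_IN, C_OUT, K)

print(f"standard 3D conv weights: {standard:,}")
print(f"depthwise + pointwise:    {separable:,}")
print(f"reduction factor:         {standard / separable:.1f}x")
```

Weight count is only a proxy: the measured decoding speedup and memory savings also depend on kernel implementations and activation sizes.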
Clone this Repo:
git clone https://github.com/AMD-AGI/HummingbirdXT.git
cd HummingbirdXT
conda create -n hummingbirdxt python=3.10
conda activate hummingbirdxt
pip install -r requirements.txt
For ROCm flash-attn, you can install it from the ROCm flash-attention repository:
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
python setup.py install
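After installation, a quick sanity check can confirm that PyTorch sees the GPU. The sketch below is not part of the repository; `describe_accelerator` is a hypothetical helper name. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` namespace and set `torch.version.hip`.

```python
import importlib.util


def describe_accelerator() -> dict:
    """Report whether PyTorch and flash-attn are installed and whether a GPU is visible."""
    info = {
        "torch_installed": False,
        "hip_version": None,
        "gpu_available": False,
        "device_name": None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_accelerator())
```

If `hip_version` is `None` or `gpu_available` is `False` on an AMD machine, the installed PyTorch wheel is likely not a ROCm build or the GPU is not visible to the process.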
You can download our pre-built Docker image for better reproducibility:
docker pull panisobe/dmd_flash_image_2:latest
You can use docker run to run the image. For example:
docker run -it \
--shm-size=900g \
--name hm \
--network host \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
-v /home:/home \
panisobe/dmd_flash_image_2_release:latest
You can download the weights for all our models from our Hugging Face model page: amd/HummingbirdXT.
cd infer
bash run_t2v.sh # for text-to-video task
bash run_i2v.sh # for image-to-video task
To use the lightweight VAE decoder in generation, append the following parameters to the run command in run_t2v.sh and run_i2v.sh:
--vae_model ${VAE_PATH_ROOT}/wan22_v1_tiling_16_12 --t_block_size 16 --t_stride 12
cd long_video
bash run.sh
First, enter the train folder:
cd train
Step 1: Download the Teacher Model
We use Wan2.2-TI2V-5B as the teacher model for step distillation.
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B
Step 2: Prepare Training Datasets
We train our models on a mixture of large-scale video datasets, including MagicData, OpenVid, and HumanVid.
Please update the dataset root paths in the corresponding CSV files to match your local storage layout. The CSV files can be downloaded from our Hugging Face model page: amd/HummingbirdXT.
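Updating the root paths by hand is error-prone for large CSV files, so a small script can help. The sketch below is an illustration only: the column name `video_path`, the file names, and the root directories are hypothetical and must be adapted to the actual CSV layout.

```python
import csv
import tempfile
from pathlib import Path


def rewrite_dataset_root(csv_in, csv_out, old_root, new_root, path_column="video_path"):
    """Copy csv_in to csv_out, swapping old_root for new_root at the start of path_column.

    Returns the number of rows whose path was rewritten.
    """
    with open(csv_in, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        if path_column not in fieldnames:
            raise KeyError(f"column {path_column!r} not found in {csv_in}")
        rows = list(reader)

    changed = 0
    for row in rows:
        if row[path_column].startswith(old_root):
            row[path_column] = new_root + row[path_column][len(old_root):]
            changed += 1

    with open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return changed


# Tiny self-contained demo on a throwaway file (all paths are made up).
with tempfile.TemporaryDirectory() as tmp:
    src_csv = Path(tmp) / "demo.csv"
    dst_csv = Path(tmp) / "demo_local.csv"
    src_csv.write_text(
        "video_path,caption\n/old/root/a.mp4,a cat\n/old/root/b.mp4,a dog\n",
        encoding="utf-8",
    )
    n_changed = rewrite_dataset_root(src_csv, dst_csv, "/old/root", "/data/videos")
    demo_output = dst_csv.read_text(encoding="utf-8")

print(n_changed)
print(demo_output)
```

Writing to a new file rather than editing in place keeps the downloaded CSV available as a reference.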
Step 3: Launch Training
Start the step distillation training using the provided script:
bash running_scripts/train/dmd.sh
The training pipeline demonstrates stable loss convergence across all models.
Reference Training Configuration: GPUs: 16 × AMD MI325, Iterations: 4000, Training time: ~48 hours.
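For planning purposes, the reference configuration can be converted into a rough compute budget. This is back-of-the-envelope arithmetic that treats the approximate 48-hour figure as exact; actual throughput will vary with hardware and data loading.

```python
NUM_GPUS = 16
ITERATIONS = 4000
WALL_CLOCK_HOURS = 48  # the reference states "~48 hours"; treated as exact here

gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS
seconds_per_iteration = WALL_CLOCK_HOURS * 3600 / ITERATIONS

print(f"total budget:       {gpu_hours} GPU-hours")
print(f"time per iteration: {seconds_per_iteration:.1f} s")
```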
Step 1: ODE Initialization (Optional)
cd long_video
bash train_ode.sh
Alternatively, you can download our trained ODE initialization weights from our Hugging Face model page, amd/HummingbirdXT, and go straight to the second stage of training.
Step 2: DMD Training
bash train_dmd.sh
Table 1. Quantitative results for the text-to-video task on VBench.
| Model | Quality Score ↑ | Semantic Score ↑ | Total Score ↑ |
|---|---|---|---|
| Wan-2.2-5B-T2V w/o recap | 82.75 | 68.38 | 79.88 |
| Wan-2.2-5B-T2V with recap | 83.99 | 77.04 | 82.60 |
| Ours-T2V w/o recap | 84.07 | 54.75 | 78.20 |
| Ours-T2V with recap | 85.71 | 72.33 | 83.03 |
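The Total Score column can be reproduced from the other two columns. The sketch below assumes VBench's usual 4:1 weighting of the quality score to the semantic score; the weights are an assumption not stated in this README, although the recomputed values match the reported totals to within rounding.

```python
QUALITY_WEIGHT = 4   # assumed VBench weighting
SEMANTIC_WEIGHT = 1  # assumed VBench weighting


def vbench_total(quality: float, semantic: float) -> float:
    """Weighted average of the quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (quality, semantic, reported total) copied from Table 1.
table1 = {
    "Wan-2.2-5B-T2V w/o recap": (82.75, 68.38, 79.88),
    "Wan-2.2-5B-T2V with recap": (83.99, 77.04, 82.60),
    "Ours-T2V w/o recap": (84.07, 54.75, 78.20),
    "Ours-T2V with recap": (85.71, 72.33, 83.03),
}

for name, (quality, semantic, reported) in table1.items():
    recomputed = vbench_total(quality, semantic)
    print(f"{name:27s} recomputed={recomputed:.2f} reported={reported:.2f}")
```

Because quality carries four times the weight of semantics, the gain in quality score largely offsets the lower semantic score in the total.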
Table 2. Quantitative results for the image-to-video task on VBench.
| Model | Video-Image Subject Consistency ↑ | Video-Image Background Consistency ↑ | Quality Score ↑ |
|---|---|---|---|
| Wan-2.2-5B-I2V w/o recap | 97.89 | 99.04 | 81.43 |
| Wan-2.2-5B-I2V with recap | 97.63 | 98.95 | 81.06 |
| Ours-I2V w/o recap | 98.46 | 98.91 | 80.01 |
| Ours-I2V with recap | 98.42 | 98.99 | 80.57 |
Table 3. Runtime for generating a 121-frame video at 704×1280 resolution on server-grade (AMD Instinct™ MI300X and AMD Instinct™ MI325X) and client-grade (Strix Halo iGPU and Navi48 dGPU) GPUs.
| Model | MI300X | MI325X | Strix Halo iGPU | Navi48 dGPU |
|---|---|---|---|---|
| Wan-2.2-5B | 193.4s | 153.9s | 15000s | OOM |
| Ours | 6.5s | 3.8s | 460s | 36.4s |
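The runtimes in Table 3 translate into the following speedup factors. This is simple arithmetic on the reported numbers; Navi48 is excluded because the baseline runs out of memory there, so no ratio can be formed.

```python
# (baseline seconds, ours seconds) copied from Table 3.
runtimes_s = {
    "MI300X": (193.4, 6.5),
    "MI325X": (153.9, 3.8),
    "Strix Halo iGPU": (15000.0, 460.0),
}

speedups = {device: base / ours for device, (base, ours) in runtimes_s.items()}

for device, factor in speedups.items():
    print(f"{device:16s} {factor:.1f}x faster than Wan-2.2-5B")
```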
Table 4. Performance and efficiency comparison of different VAE decoders on AMD Instinct™ MI300X GPU.
| Model | LPIPS ↓ | PSNR ↑ | SSIM ↑ | RunTime ↓ | Memory ↓ |
|---|---|---|---|---|---|
| Wan-2.2 VAE | 0.0141 | 35.979 | 0.9598 | 31.34s | 11.37G |
| TAEW2.2 | 0.0575 | 29.599 | 0.8953 | 0.14s | 1.35G |
| Ours VAE | 0.0260 | 34.635 | 0.9483 | 2.29s | 2.71G |
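The lightweight decoder is invoked with `--t_block_size 16 --t_stride 12`. One plausible reading, which is an assumption and not confirmed by this README, is that decoding proceeds over overlapping temporal windows of 16 frames that advance by 12 frames, leaving a 4-frame overlap for blending. The sketch below only enumerates such windows; `temporal_tiles` is a hypothetical helper, and the sequence length of 31 is chosen for illustration.

```python
def temporal_tiles(num_frames: int, block_size: int = 16, stride: int = 12):
    """Return (start, end) index pairs of overlapping temporal windows covering num_frames.

    Each window spans at most block_size frames and starts stride frames after
    the previous one, so consecutive windows overlap by block_size - stride.
    """
    if not 0 < stride <= block_size:
        raise ValueError("stride must be in the range 1..block_size")
    if num_frames <= 0:
        return []
    tiles = []
    start = 0
    while True:
        end = min(start + block_size, num_frames)
        tiles.append((start, end))
        if end == num_frames:
            break
        start += stride
    return tiles


# Hypothetical sequence length, chosen only for illustration.
tiles = temporal_tiles(31)
print(tiles)
```

Processing bounded windows keeps peak memory roughly constant regardless of video length, at the cost of recomputing the overlapping frames.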
Table 5. Quantitative results for long video generation on three benchmarks.
| Model | FPS ↑ | Flicker Metric ↓ | DOVER ↑ | VBench Quality ↑ | VBench Semantic ↑ | VBench Total ↑ |
|---|---|---|---|---|---|---|
| Self-Forcing | 19.28 | 0.1010 | 84.37 | 81.99 | 80.09 | 81.61 |
| Causvid | 18.24 | 0.0972 | 82.77 | 81.96 | 77.02 | 80.97 |
| LongLive | 21.32 | 0.0947 | 84.07 | 82.86 | 81.61 | 82.61 |
| RollingForcing | 19.57 | 0.0928 | 85.16 | 82.94 | 80.61 | 82.47 |
| Ours | 26.38 | 0.0946 | 84.55 | 83.42 | 79.22 | 82.58 |
Huggingface model cards: AMD-HummingbirdXT
Full training code: AMD-AIG-AIMA/HummingbirdXT
Related work on diffusion models by the AMD team:
- AMD Hummingbird-0.9B: An Efficient Text-to-Video Diffusion Model with 4-Step Inferencing
- AMD Hummingbird Image to Video: A Lightweight Feedback-Driven Model for Efficient Image-to-Video Generation
Please refer to the following resources to get started with training on AMD ROCm™ software:
- Use the public PyTorch ROCm Docker images that enable optimized training performance out-of-the-box
- PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm — ROCm Blogs
- Accelerating Large Language Models with Flash Attention on AMD GPUs — ROCm Blogs
Our codebase builds on Wan 2.1, Wan 2.2, Self-Forcing, and VideoX-Fun. Thanks to the authors for sharing their awesome codebases!
If you find our work helpful, feel free to cite our Hummingbird-XT models and give us a star ⭐. Thank you.