AMD-AGI/HummingbirdXT




Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms

🔍 Overview

This repository presents an efficient acceleration pipeline for Diffusion Transformer (DiT) based video generation models, optimized for AMD client-grade GPUs, including Navi48 dGPUs and Strix Halo iGPUs.

Built upon this pipeline, we introduce Hummingbird-XT, a new family of DiT-based text-to-video models derived from Wan2.2-5B that achieves high-quality video generation at significantly reduced inference cost. To extend the length of generated videos, we also introduce Hummingbird-XTX, an efficient autoregressive model for long-video generation based on Wan-2.1-1.3B.

Hummingbird-XT Text-to-Video Showcases

Caption Video
The young East Asian man with short black hair, fair skin, and monolid eyes looks ahead. A young East Asian woman with long black hair and fair skin turns to smile warmly at him. The background is blurred, focusing on their shared gaze. Realistic cinematic style.
The_young_East_Asian_man_with_short_black_hair__fair_skin__and_monolid_eyes_look.3.mp4
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
Stylish_woman_walks_confidently_down_a_Tokyo_street_filled_with_warm__glowing_ne.mp4
Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.
Animated_scene_features_a_close-up_of_a_short_fluffy_monster_kneeling_beside_a_m.4.mp4

Hummingbird-XT Image-to-Video Showcases

Caption Video
A back-view close-up focusing on the runner’s feet striking the track. Only subtle movement occurs—his steps land firmly, kicking a small amount of dust or rubber granules. The camera stays low and straight-on behind him, following smoothly with minimal shake. The sunlight is bright, with long shadows stretching forward.
a_back-view_close-up_focusing_on_the_runner_s_feet_striking_the_track__Only_subt.5.mp4
A graceful woman stands under a majestic sandstone arch, forming a small heart shape with her fingers close to the camera while smiling warmly and radiating joy. Behind her, a smooth and elegant fountain rises gracefully, its water reflecting the warm, inviting courtyard walls in a mirror-like fashion.
A_graceful_woman_stands_under_a_majestic_sandstone_arch__forming_a_small_heart_s.1.mp4
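On stage, a man plays an electric guitar made of lightning. As the music swells, sparks crackle around him. Suddenly, the dazzling light turns dark red, his eyes glow eerily, and black wings unfurl from his back. His skin darkens, lightning coils around his body, and he transforms into a demon, standing amid billowing smoke and thunder.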
man.mp4

Hummingbird-XTX 20-Second Video Showcases

Caption Video
Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
vsink1.mp4
A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
00002.mp4
A cinematic wide portrait of a man with his face lit by the glow of a TV.
00090.mp4

📝 News


🧬 Models

Hummingbird-XT

  • DiT-based text-to-video model built upon Wan2.2-5B
  • Optimized for few-step inference
  • Designed for efficient deployment on AMD GPUs
  • Maintains competitive visual quality compared to full-step baselines

Hummingbird-XTX (Long Video Extension)

  • Extends Hummingbird-XT to support efficient long video generation
  • Improves temporal consistency across extended sequences
  • Suitable for long-form generation scenarios

Lightweight VAE Decoder

  • Lightweight and efficient VAE decoder

  • Introduces 3D depthwise convolution (DW Conv) to reduce the redundancy of the original 3D convolutions

  • Achieves a 14x decoding speedup and reduces memory usage by 4x

  • Preserves competitive reconstruction and generation quality
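
To see why a depthwise design reduces cost, compare the weight counts of a standard 3D convolution with those of a depthwise 3D convolution followed by a 1×1×1 pointwise convolution. The sketch below is illustrative arithmetic only. The channel count and kernel size are made-up values, and pairing the depthwise layer with a pointwise layer is an assumption, since the exact layer layout of the lightweight decoder is not described here.

```python
def conv3d_weights(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k*k*k kernel (bias ignored)."""
    return c_in * c_out * k ** 3


def depthwise_separable_weights(c_in, c_out, k):
    """Weights in a depthwise k*k*k convolution plus a 1x1x1 pointwise convolution.

    The pointwise layer is an assumption made for this comparison.
    """
    depthwise = c_in * k ** 3   # one k*k*k filter per input channel
    pointwise = c_in * c_out    # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
channels, kernel = 64, 3
standard = conv3d_weights(channels, channels, kernel)
separable = depthwise_separable_weights(channels, channels, kernel)
print(f"standard: {standard} weights, depthwise separable: {separable} weights")
print(f"reduction: {standard / separable:.1f}x")
```

The reduction grows with the kernel volume and the number of output channels, which is why the saving is larger for 3D kernels than for 2D kernels of the same width.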


⚙️ Installation

Clone this Repo:

git clone https://github.com/AMD-AGI/HummingbirdXT.git
cd HummingbirdXT

Option 1: Conda Environment

conda create -n hummingbirdxt python=3.10
conda activate hummingbirdxt
pip install -r requirements.txt

To use FlashAttention on ROCm, build and install it from the ROCm fork:

git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
python setup.py install
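
After installation, a quick import check can confirm that the packages are visible to the active environment. The sketch below only tests whether the modules can be located; it does not verify that the ROCm kernels actually run. `torch` and `flash_attn` are the usual import names for PyTorch and FlashAttention, and the helper name `find_missing` is made up for this example.

```python
import importlib.util


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(["torch", "flash_attn"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("torch and flash_attn are importable")
```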

Option 2: Docker

You can download our pre-built Docker image for better reproducibility:

docker pull panisobe/dmd_flash_image_2:latest

Start a container with docker run, using the name of the image you pulled. For example:

docker run -it \
  --shm-size=900g \
  --name hm \
  --network host \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -v /home:/home \
  panisobe/dmd_flash_image_2_release:latest
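
The `--device` flags above pass `/dev/kfd` (the ROCm compute interface) and `/dev/dri` (the GPU render nodes) into the container, so both must exist on the host. The sketch below is a small pre-flight check that can be run on the host before starting the container. The function name `check_devices` is hypothetical, and the check only tests that the paths exist, not that the current user has permission to open them.

```python
import os

# Device nodes taken from the docker run example above.
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri"]


def check_devices(paths=REQUIRED_DEVICES):
    """Map each device path to True if it exists on this host."""
    return {path: os.path.exists(path) for path in paths}


for path, present in check_devices().items():
    print(f"{path}: {'found' if present else 'missing'}")
```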

🚀 Getting Started for Video Generation

Download the weights for all of our models from our Hugging Face repository: amd/HummingbirdXT.

HummingbirdXT Video Generation

cd infer
bash run_t2v.sh # for text-to-video  task
bash run_i2v.sh # for image-to-video task

To use the lightweight VAE decoder during generation, append the following parameters to the launch command in run_t2v.sh and run_i2v.sh:

--vae_model ${VAE_PATH_ROOT}/wan22_v1_tiling_16_12 --t_block_size 16 --t_stride 12  
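
The flag names suggest that the lightweight decoder processes the video in temporal blocks of 16 frames that advance by 12 frames, so neighboring blocks overlap by 4. The following sketch shows how such overlapping windows could be laid out. It is an interpretation of `--t_block_size` and `--t_stride`, not the repository's implementation, and the function name `temporal_tiles` is hypothetical.

```python
def temporal_tiles(num_frames, block_size=16, stride=12):
    """Return (start, end) index pairs of overlapping temporal windows.

    Each window covers at most block_size frames and starts stride frames
    after the previous one; the last window is clipped to num_frames.
    """
    if block_size <= 0 or stride <= 0 or stride > block_size:
        raise ValueError("require 0 < stride <= block_size")
    if num_frames <= 0:
        return []
    tiles = []
    start = 0
    while True:
        end = min(start + block_size, num_frames)
        tiles.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return tiles


# A 40-frame sequence is covered by three windows that overlap by 4 frames.
print(temporal_tiles(40))  # [(0, 16), (12, 28), (24, 40)]
```

Processing one window at a time bounds peak memory by the block size rather than the full video length, and the overlap gives the decoder shared context for blending across window boundaries.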

HummingbirdXTX Long Video Generation

cd long_video
bash run.sh

🧪 Training & Implementation

HummingbirdXT Training

First, enter the train folder:

cd train

Step 1: Download the Teacher Model

We use Wan2.2-TI2V-5B as the teacher model for step distillation.

pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B

Step 2: Prepare Training Datasets

We train our models on a mixture of large-scale video datasets, including MagicData, OpenVid, and HumanVid.

Please update the dataset root paths in the corresponding CSV files to match your local storage layout. You can download the CSV files from our Hugging Face repository: amd/HummingbirdXT.
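
Updating the dataset root paths amounts to replacing a path prefix in one column of each CSV file. The sketch below shows one way to do this with the standard library. The column name `path` and the example rows are hypothetical, so check the header of the downloaded CSV files and adjust accordingly.

```python
import csv
import io


def rebase_csv_paths(csv_text, old_root, new_root, column="path"):
    """Replace the old_root prefix with new_root in one column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in CSV header")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        value = row[column]
        if value.startswith(old_root):
            # Only rows under the old root are rewritten; others are kept as-is.
            row[column] = new_root + value[len(old_root):]
        writer.writerow(row)
    return out.getvalue()


# Made-up rows for illustration.
example = "path,caption\n/data/openvid/a.mp4,a cat\n/other/b.mp4,a dog\n"
print(rebase_csv_paths(example, "/data", "/mnt/videos"))
```

To apply this to a file on disk, read it with `open(...).read()`, pass the text through the function, and write the result to a new file so the original stays available for comparison.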

Step 3: Launch Training

Start the step distillation training using the provided script:

bash running_scripts/train/dmd.sh

The training pipeline demonstrates stable loss convergence across all models.

Reference Training Configuration: GPUs: 16 × AMD MI325, Iterations: 4000, Training time: ~48 hours.

HummingbirdXTX Training

Step 1: ODE Initialization (Optional)

cd long_video
bash train_ode.sh

Alternatively, download our trained ODE initialization weights from our Hugging Face repository, amd/HummingbirdXT, and use them for the second training stage.

Step 2: DMD Training

bash train_dmd.sh

📊 Experimental Results

Table 1. Quantitative results for the text-to-video task on VBench.

| Model | Quality Score ↑ | Semantic Score ↑ | Total Score ↑ |
|---|---|---|---|
| Wan-2.2-5B-T2V w/o recap | 82.75 | 68.38 | 79.88 |
| Wan-2.2-5B-T2V with recap | 83.99 | 77.04 | 82.60 |
| Ours-T2V w/o recap | 84.07 | 54.75 | 78.20 |
| Ours-T2V with recap | 85.71 | 72.33 | 83.03 |
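
The Total Score in Table 1 is consistent with VBench's standard aggregation, a weighted average that gives the Quality Score a weight of 4 and the Semantic Score a weight of 1. The sketch below recomputes the totals from the reported columns. The weighting comes from the public VBench evaluation code rather than from this repository, so treat it as an assumption about how the numbers were produced. Differences in the last digit are expected because the inputs are already rounded.

```python
# Assumed VBench aggregation weights (not stated in this repository).
QUALITY_WEIGHT = 4
SEMANTIC_WEIGHT = 1


def vbench_total(quality, semantic):
    """Weighted average of the quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (quality, semantic, reported total) copied from Table 1.
table1 = {
    "Wan-2.2-5B-T2V w/o recap": (82.75, 68.38, 79.88),
    "Wan-2.2-5B-T2V with recap": (83.99, 77.04, 82.60),
    "Ours-T2V w/o recap": (84.07, 54.75, 78.20),
    "Ours-T2V with recap": (85.71, 72.33, 83.03),
}

for name, (quality, semantic, reported) in table1.items():
    total = vbench_total(quality, semantic)
    print(f"{name}: recomputed {total:.2f}, reported {reported:.2f}")
```

Because quality carries four times the weight of semantics, a model with a lower Semantic Score can still reach a higher Total Score, as the "with recap" rows show.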

Table 2. Quantitative results for the image-to-video task on VBench.

| Model | Video-Image Subject Consistency ↑ | Video-Image Background Consistency ↑ | Quality Score ↑ |
|---|---|---|---|
| Wan-2.2-5B-I2V w/o recap | 97.89 | 99.04 | 81.43 |
| Wan-2.2-5B-I2V with recap | 97.63 | 98.95 | 81.06 |
| Ours-I2V w/o recap | 98.46 | 98.91 | 80.01 |
| Ours-I2V with recap | 98.42 | 98.99 | 80.57 |

Table 3. Runtime for generating a 121-frame video at 704×1280 resolution on server-grade GPUs (AMD Instinct™ MI300X and AMD Instinct™ MI325X) and client-grade GPUs (Strix Halo iGPU and Navi48 dGPU). OOM denotes out of memory.

| Model | MI300X | MI325X | Strix Halo iGPU | Navi48 dGPU |
|---|---|---|---|---|
| Wan-2.2-5B | 193.4s | 153.9s | 15000s | OOM |
| Ours | 6.5s | 3.8s | 460s | 36.4s |
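
The speedups implied by Table 3 can be derived by dividing the baseline runtime by ours on each device. The sketch below does this with the reported numbers. The Navi48 entry has no ratio because the baseline ran out of memory there, which the code represents as `None`.

```python
# (baseline seconds, ours seconds) per device, copied from Table 3.
# None marks the out-of-memory baseline run.
runtimes = {
    "MI300X": (193.4, 6.5),
    "MI325X": (153.9, 3.8),
    "Strix Halo iGPU": (15000.0, 460.0),
    "Navi48 dGPU": (None, 36.4),
}


def speedup(baseline, ours):
    """Return baseline / ours, or None when either run did not complete."""
    if baseline is None or ours is None:
        return None
    return baseline / ours


for device, (baseline, ours) in runtimes.items():
    ratio = speedup(baseline, ours)
    if ratio is None:
        print(f"{device}: baseline did not complete, ours took {ours}s")
    else:
        print(f"{device}: {ratio:.1f}x faster")
```

On the three devices where both models finish, the ratio works out to roughly 30x to 40x.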

Table 4. Performance and efficiency comparison of different VAE decoders on AMD Instinct™ MI300X GPU.

| Model | LPIPS ↓ | PSNR ↑ | SSIM ↑ | RunTime ↓ | Memory ↓ |
|---|---|---|---|---|---|
| Wan-2.2 VAE | 0.0141 | 35.979 | 0.9598 | 31.34s | 11.37G |
| TAEW2.2 | 0.0575 | 29.599 | 0.8953 | 0.14s | 1.35G |
| Ours VAE | 0.0260 | 34.635 | 0.9483 | 2.29s | 2.71G |
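
The 14x decoding speedup and 4x memory reduction quoted in the Models section follow from Table 4 when the Wan-2.2 VAE is taken as the baseline. The sketch below recomputes both ratios from the reported runtime and memory columns; the helper name `relative_to_baseline` is made up for this example.

```python
# (runtime in seconds, memory in GB) copied from Table 4.
decoders = {
    "Wan-2.2 VAE": (31.34, 11.37),
    "TAEW2.2": (0.14, 1.35),
    "Ours VAE": (2.29, 2.71),
}


def relative_to_baseline(name, baseline="Wan-2.2 VAE"):
    """Return (speedup, memory reduction) of a decoder versus the baseline."""
    base_time, base_mem = decoders[baseline]
    time, mem = decoders[name]
    return base_time / time, base_mem / mem


for name in decoders:
    time_ratio, mem_ratio = relative_to_baseline(name)
    print(f"{name}: {time_ratio:.1f}x faster, {mem_ratio:.1f}x less memory")
```

TAEW2.2 is faster still, but Table 4 shows it pays for that with markedly lower PSNR and SSIM, so the comparison is a quality and cost trade-off rather than a pure speed ranking.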

Table 5. Quantitative results for long video generation on three benchmarks.

| Model | FPS ↑ | Flicker Metric ↓ | DOVER ↑ | VBench Quality ↑ | VBench Semantic ↑ | VBench Total ↑ |
|---|---|---|---|---|---|---|
| Self-Forcing | 19.28 | 0.1010 | 84.37 | 81.99 | 80.09 | 81.61 |
| CausVid | 18.24 | 0.0972 | 82.77 | 81.96 | 77.02 | 80.97 |
| LongLive | 21.32 | 0.0947 | 84.07 | 82.86 | 81.61 | 82.61 |
| RollingForcing | 19.57 | 0.0928 | 85.16 | 82.94 | 80.61 | 82.47 |
| Ours | 26.38 | 0.0946 | 84.55 | 83.42 | 79.22 | 82.58 |

🤗Additional Resources

Hugging Face model cards: AMD-HummingbirdXT

Full training code: AMD-AIG-AIMA/HummingbirdXT

Related work on diffusion models by the AMD team:

Please refer to the following resources to get started with training on AMD ROCm™ software:


❤️ Acknowledgement

Our codebase builds on Wan 2.1, Wan 2.2, Self-Forcing, and VideoX-Fun. Thanks to the authors for sharing their awesome codebases!


📋 Citations

If you find our work helpful, feel free to cite our Hummingbird-XT models and give us a star ⭐. Thank you.
