DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Haoyu Zhao, Zhongang Qi, Cong Wang, Qingping Zheng, Guansong Lu, Fei Chen, Hang Xu, Zuxuan Wu
Fudan University; Huawei Noah's Ark Lab

(Sound on!) DynamiCtrl is a next-generation video synthesis model for human image animation, built on the diffusion transformer (DiT) architecture. DynamiCtrl lets you animate full-body, half-body, or head movements: recreate classic movie scenes with your own photos, customize the background and clothing through textual prompts, and build digital human applications.
(Loading too slow? Click: DynamiCtrl on YouTube.)

Abstract

With diffusion transformers (DiT) excelling in video generation, their use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degradation. To address these issues, we propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. Specifically, we use a shared VAE encoder for human images and driving poses, unifying them into a common latent space, maintaining pose fidelity, and eliminating the need for an expert pose encoder during video denoising. To integrate pose control into the DiT backbone effectively, we propose Pose-adaptive Layer Norm, which injects normalized pose features into the denoising process by conditioning on visual tokens, enabling seamless and scalable pose control across DiT blocks. Furthermore, to overcome the shortcomings of text removal, we introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. Through full-attention blocks, image and pose features are aligned with text features, enhancing semantic consistency, leveraging pretrained knowledge, and enabling multi-level control. Experiments verify the superiority of DynamiCtrl on both benchmark and self-collected data (e.g., achieving the best LPIPS of 0.166), demonstrating strong character control and high-quality synthesis.
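
To make the Pose-adaptive Layer Norm idea concrete, here is a minimal PyTorch-style sketch of adaLN-style pose conditioning. The module name, tensor shapes, and the pooled pose vector are illustrative assumptions on our part, not the released implementation.

import torch
import torch.nn as nn

class PoseAdaptiveLayerNorm(nn.Module):
    """Sketch of adaLN-style pose conditioning (names/shapes are assumptions).

    Pose features regress a per-channel scale and shift that modulate the
    normalized visual tokens inside a DiT block.
    """
    def __init__(self, hidden_dim: int, pose_dim: int):
        super().__init__()
        # LayerNorm without a learned affine; modulation comes from the pose.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Regress scale (gamma) and shift (beta) from the pose features.
        self.to_scale_shift = nn.Linear(pose_dim, 2 * hidden_dim)

    def forward(self, visual_tokens, pose_feat):
        # visual_tokens: (B, N, hidden_dim); pose_feat: (B, pose_dim)
        gamma, beta = self.to_scale_shift(pose_feat).chunk(2, dim=-1)
        return self.norm(visual_tokens) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

One appeal of normalization-level modulation is that it adds only a small linear layer per block instead of an extra attention branch, which is consistent with the paper's claim of seamless, scalable pose control across DiT blocks.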

Pose-guided Human Image Animation

For each case: (a / b / c) denote (Driving video / Reference human image / Generated video)


Method Overview

(Overview figure: Mixed Video-Image Finetuning)

DynamiCtrl is a novel framework that enhances pose-guided video synthesis in DiT by emphasizing the role of text. It uses a shared VAE encoder for both reference images and driving pose videos, simplifying the framework by removing the need for a separate pose encoder. DynamiCtrl introduces Pose-adaptive Layer Norm to inject sparse pose features into the model while maintaining spatiotemporal consistency. It also aligns textual and visual features within the full-attention blocks, enabling fine-grained control over both the background and motion for the first time.
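
A rough sketch of how these pieces might fit together at one denoising step is shown below. The `vae` and `dit` objects and all argument names are hypothetical stand-ins; the point is only the data flow: one shared VAE for both conditions, pose injected into the blocks, and text kept in the full attention.

import torch

def denoise_step(dit, vae, noisy_latents, ref_image, pose_video, text_emb, t):
    # Hypothetical modules and signatures, shown only to illustrate data flow.
    with torch.no_grad():
        # One shared VAE places the reference image and the driving pose
        # frames in the same latent space as the video being denoised,
        # so no expert pose encoder is needed.
        ref_latent = vae.encode(ref_image)       # (B, C, h, w)
        pose_latents = vae.encode(pose_video)    # (B, T, C, h, w)
    # Pose latents condition the DiT blocks (e.g., via the Pose-adaptive
    # Layer Norm sketched above); text embeddings join the full-attention
    # blocks to supply global semantics (the "Joint-text" paradigm).
    return dit(noisy_latents, timestep=t, ref_latent=ref_latent,
               pose_latents=pose_latents, text_emb=text_emb)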

BibTeX

@article{zhao2025dynamictrl,
  title={DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation},
  author={Zhao, Haoyu and Qi, Zhongang and Wang, Cong and Zheng, Qingping and Lu, Guansong and Chen, Fei and Xu, Hang and Wu, Zuxuan},
  journal={arXiv preprint},
  year={2025}
}