AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Teaser of AdaHuman

Abstract

Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) a pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses, along with the corresponding 3D Gaussian Splats (3DGS) reconstruction, at each diffusion step; (2) a compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive, detailed 3D avatar. These components allow AdaHuman to generate highly realistic, standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be made publicly available for research purposes.
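
The crop-aware camera ray map can be illustrated with a short sketch. The version below is a minimal illustration under our own assumptions, not the paper's implementation: it assumes a standard pinhole camera, and the function name, argument shapes, and 6-channel origin-plus-direction layout are hypothetical. The key idea is that rays for a cropped, re-sampled patch are computed in the coordinates of the full image, so refined local body parts stay aligned with the global camera.

import torch

def crop_aware_ray_map(K, c2w, crop_box, crop_res):
    """Per-pixel ray map for a crop of the full image (illustrative sketch).

    K:        (3, 3) intrinsics tensor of the *full* image.
    c2w:      (4, 4) camera-to-world extrinsics tensor.
    crop_box: (x0, y0, x1, y1) crop in full-image pixel coordinates.
    crop_res: (H, W) resolution the crop is resampled to.
    """
    x0, y0, x1, y1 = crop_box
    H, W = crop_res
    # Pixel centers of the crop, expressed in full-image coordinates,
    # so the rays stay consistent with the uncropped camera.
    xs = x0 + (torch.arange(W, dtype=torch.float32) + 0.5) * (x1 - x0) / W
    ys = y0 + (torch.arange(H, dtype=torch.float32) + 0.5) * (y1 - y0) / H
    v, u = torch.meshgrid(ys, xs, indexing="ij")
    # Unproject through the full-image intrinsics (OpenCV convention).
    dirs_cam = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)],
        dim=-1,
    )
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = c2w[:3, 3].expand(H, W, 3)
    # 6-channel map: ray origin and unit direction per pixel.
    return torch.cat([origins, dirs_world], dim=-1)  # (H, W, 6)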

Character Reconstruction Results

Reconstruction results from in-the-wild images (SHHQ dataset), shown as interactive 3DGS viewers.
Reconstruction results on the CustomHumans dataset (single-view input).
Comparison with baseline methods.
Reconstruction gallery on the SHHQ dataset.
Animation results on SHHQ avatars driven by AMASS motions.
Animation results on MVHumanNet compared to ground truth.

Method

Overview of AdaHuman
Left: Given an RGB image of an unseen person as input, AdaHuman can (1) reconstruct a high-fidelity pixel-aligned 3D Gaussian Splat (3DGS) avatar, and (2) generate a reposed 3DGS avatar conditioned on a target pose, enabling the construction of an animatable avatar in a standard A-pose.
Right: A pose-conditioned joint 3D diffusion process generates global or local 3DGS reconstruction and reposing results. It ensures 3D consistency by feeding the generated 3DGS back into each reverse diffusion step of the multi-view avatar images.
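
As a rough illustration of this joint process, the sketch below shows a DDPM-style sampling loop in which the multi-view images and the 3DGS reconstruction are updated together. Here denoiser, reconstruct_3dgs, and render_views are hypothetical stand-ins for the multi-view diffusion model, the 3DGS reconstructor, and the splat renderer; the loop structure and the placeholder noise schedule are assumptions based on the description above.

import torch

@torch.no_grad()
def joint_3d_diffusion(denoiser, reconstruct_3dgs, render_views,
                       input_image, target_pose, cameras, num_steps=50):
    # Start from Gaussian noise for every target view.
    views = torch.randn(len(cameras), 3, 512, 512)
    for t in reversed(range(num_steps)):
        # (1) Predict clean multi-view images from the noisy ones,
        #     conditioned on the input image and the target pose.
        clean_views = denoiser(views, t, input_image, target_pose)
        # (2) Fit a 3DGS avatar to the current multi-view estimates.
        gaussians = reconstruct_3dgs(clean_views, cameras, target_pose)
        # (3) Re-render the splats so every view comes from one shared
        #     3D representation, enforcing cross-view consistency.
        rendered = render_views(gaussians, cameras)
        # (4) Re-noise the consistent renderings to the next noise level
        #     (placeholder linear schedule; real samplers differ).
        ab_prev = 1.0 - max(t - 1, 0) / num_steps
        views = ab_prev ** 0.5 * rendered \
            + (1 - ab_prev) ** 0.5 * torch.randn_like(rendered)
    return gaussians, rendered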

Citation

@misc{huang2025adahumananimatabledetailed3d,
  title={AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion}, 
  author={Yangyi Huang and Ye Yuan and Xueting Li and Jan Kautz and Umar Iqbal},
  year={2025},
  eprint={2505.24877},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.24877}, 
}

Template adapted from GLAMR.