VideoMat: Extracting PBR Materials from Video Diffusion Models

1 NVIDIA   2 University of Toronto   3 Vector Institute

Our method starts from a known 3D model and an HDR environment map. We first render videos of normal maps and three simple uniform shading conditions (diffuse, semi-specular, fully specular) lit by the provided probe. These conditions, together with a text prompt, are passed to our finetuned video model, which generates a coherent video of the object with a novel material while respecting the given lighting condition. We then pass this video to a second video model, which performs intrinsic decomposition and generates per-frame G-buffers of the material properties. Finally, the outputs of the two video models, alongside the given geometry and lighting, are passed to a differentiable path tracer, which performs multi-view reconstruction to extract high-quality PBR materials from the generated views.
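
As a rough illustration of the conditioning signals, the toy sketch below (our own simplification, not the paper's code) computes per-pixel diffuse and semi-specular shading from a normal map under a single assumed dominant light direction; the actual pipeline shades with the full HDR probe in a path tracer.

import numpy as np

def lambert(n, l):
    # Diffuse condition: clamped cosine max(n.l, 0)
    return np.clip((n * l).sum(-1, keepdims=True), 0.0, None)

def blinn_phong(n, l, v, shininess=32.0):
    # Semi-specular condition: Blinn-Phong lobe (n.h)^s
    h = (l + v) / np.linalg.norm(l + v)
    return np.clip((n * h).sum(-1, keepdims=True), 0.0, None) ** shininess

def mirror_reflect(n, v):
    # Fully specular condition: reflect the view vector about the
    # normal, then look it up in the HDR probe (lookup not shown).
    return 2.0 * (n * v).sum(-1, keepdims=True) * n - v

normals = np.random.randn(256, 256, 3)          # stand-in normal map
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
light = np.array([0.0, 0.7071, 0.7071])         # assumed dominant probe direction
view = np.array([0.0, 0.0, 1.0])

diffuse_cond = lambert(normals, light)
semi_specular_cond = blinn_phong(normals, light, view)
reflection_dirs = mirror_reflect(normals, view)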

Abstract


We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high-quality materials for 3D models given a text prompt or a single image. First, we condition a video diffusion model to respect the input geometry and lighting condition; this model produces multiple views of a given 3D model with coherent material properties. Second, we use a recent intrinsic-decomposition model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use these intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.
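
To make the final stage concrete, here is a minimal, self-contained sketch of the multi-view material optimization, with a toy differentiable shader standing in for the physically-based path tracer. All names, the toy shading model, and the single-frame setup are our illustrative assumptions; the actual method also supervises the material maps with the intrinsic G-buffers from the second video model.

import torch

def render_toy(basecolor, roughness, metallic, normals, light):
    # Toy differentiable "renderer": Lambertian diffuse plus a crude
    # gloss term. A stand-in for the differentiable path tracer.
    ndotl = (normals * light).sum(-1, keepdim=True).clamp(min=0.0)
    return basecolor * ndotl * (1 - metallic) + (1 - roughness) * ndotl * 0.2

# Per-pixel material parameters, squashed to [0,1] via sigmoid and
# optimized to match a frame of the generated video.
H, W = 128, 128
params = {k: torch.zeros(H, W, c, requires_grad=True)
          for k, c in [("basecolor", 3), ("roughness", 1), ("metallic", 1)]}
opt = torch.optim.Adam(params.values(), lr=5e-2)

normals = torch.nn.functional.normalize(torch.randn(H, W, 3), dim=-1)
light = torch.tensor([0.0, 0.0, 1.0])
target = torch.rand(H, W, 3)  # stand-in for one generated video frame

for _ in range(200):
    img = render_toy(params["basecolor"].sigmoid(),
                     params["roughness"].sigmoid(),
                     params["metallic"].sigmoid(), normals, light)
    loss = (img - target).abs().mean()  # photometric L1 loss
    opt.zero_grad()
    loss.backward()
    opt.step()

The sigmoid squashing keeps all material parameters in a valid [0,1] range during unconstrained gradient descent; in practice the loss would be accumulated over all frames and camera views of the generated video.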


Relightable, standard PBR materials from a text prompt


Given a 3D model and a prompt, we extract high quality PBR materials from finetuned video diffusion models. We show the extracted materials with three different HDR probes. 3D Model from the BlenderVault dataset. HDRI probes from Poly Haven.

Material variety


Our finetuned video model generates view-consistent multi-view images of diverse materials from text prompts, while closely respecting the input geometry and lighting.

Text-to-material generation


We compare against Paint-it [YOPM24] and DreamMat [ZLX∗24] on a 3D model from the BlenderVault dataset. We encourage the reader to compare the quality of the intrinsics (base color, roughness, metallic) to the reference. While significant deviations are expected in purely text-guided methods, the base color predicted by our method is significantly more demodulated, i.e., free of baked-in shading ("flat"), than that of the competing work. Our roughness and metallic predictions are also more faithful to the reference, though with a slight bias towards lower roughness.

Citation


@inproceedings{munkberg2025videomat,
    author    = {Jacob Munkberg and Zian Wang and Ruofan Liang and Tianchang Shen and Jon Hasselgren},
    title     = {{VideoMat: Extracting PBR Materials from Video Diffusion Models}},
    booktitle = {Eurographics Symposium on Rendering - CGF Track},
    year      = {2025}
}

Paper


VideoMat: Extracting PBR Materials from Video Diffusion Models

Jacob Munkberg, Zian Wang, Ruofan Liang, Tianchang Shen, and Jon Hasselgren

Preprint (arXiv)
Video
BibTeX