We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
The first two videos demonstrate the “texture sticking” issue in two “cinemagraphs” created using generators trained on the unaligned FFHQ-U dataset. The looping videos show small random walks around a central point in the latent space. Observe how the details (hairs, wrinkles, etc.) in the StyleGAN2 result (left) appear to be glued to the screen coordinates while the face moves under them, whereas all details transform coherently in our result (right).
The following videos show interpolations between hand-picked latent points in several datasets. Observe again how the textural detail appears fixed in the StyleGAN2 result, but transforms smoothly with the rest of the scene in the alias-free StyleGAN3.
We note, in particular, how StyleGAN3 appears to have learned to mimic camera motion in the Beaches dataset.
The following video illustrates translational equivariance, or lack thereof, in several “bridge” configurations, and aims to visually demonstrate the meaning of the EQ-T equivariance scores. In all panels, the first image is the result of running the corresponding generator with analytically translated Fourier input features. The second image has been obtained from the first by “untransforming” the pixels using the inverse translation with an extremely high-quality resampling filter. For a perfectly equivariant generator, the first two images are the same, modulo image boundaries (not shown due to light cropping) and numerical noise from the resampling. The third image visualizes the difference between the first two images. As can be seen, EQ-T scores in the 60 dB range are essentially visually perfect. Please consult the Appendix for technical details.
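To make the dB figures concrete, the following minimal NumPy sketch computes a PSNR-style score in decibels between two images that should agree for a perfectly equivariant generator. The function name eq_psnr_db, the assumed [-1, 1] pixel range, and the commented usage are illustrative choices of ours; the exact definitions of the EQ-T and EQ-R metrics are given in the Appendix of the paper.

# Minimal sketch: PSNR in decibels between two images that agree exactly for a
# perfectly equivariant generator. Names and the [-1, 1] pixel range (peak = 2)
# are illustrative; see the paper's Appendix for the exact EQ-T / EQ-R metrics.
import numpy as np

def eq_psnr_db(img_a, img_b, peak=2.0):
    """Peak signal-to-noise ratio in dB; around 60 dB reads as visually perfect."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Hypothetical usage, mirroring the video panels: first_image is the output for
# a translated input, second_image is its "untransformed" counterpart.
# score = eq_psnr_db(first_image, second_image)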
The following video illustrates rotation equivariance in a manner similar to the previous video. Our StyleGAN3-T, which was designed only for translation equivariance, fails completely, as expected. The next comparison method is a variant of StyleGAN3-T that uses a p4 symmetric G-CNN for rotation equivariance. The model shows a cyclic behavior, where the rotation is exact at multiples of 90 degrees but breaks down at intermediate angles. Our StyleGAN3-R features high-quality, though not visually perfect, rotation equivariance.
The following video illustrates the aliasing inherent to pointwise nonlinearities (here, ReLU), and our solution. Left column: the original bandlimited signal z. Its ideal version (top) is sampled (middle), and then reconstructed from the samples (bottom). As the sampling rate is high enough to capture the signal, no aliasing occurs. Middle column: applying a pointwise nonlinearity in the continuous domain (top) yields a non-smooth function due to clipping at the zero crossings. Sampling this signal (middle) and reconstructing the function from the samples (bottom) yields an aliased result, as the high frequencies created by the clipping cannot be represented by the sample grid. Right column: applying a low-pass filter to the ReLU’d function in the continuous domain (top) again yields a smooth function; sampling it (middle) allows a faithful reconstruction (bottom).
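The small 1-D NumPy/SciPy sketch below walks through the same three-step recipe: oversample, apply ReLU at the higher rate, and low-pass filter before returning to the original rate. The 4x and 16x oversampling factors and the FFT-based resampling are illustrative stand-ins for the ideal filters in the animation, not the exact filters used in the StyleGAN3 generator.

# 1-D illustration: ReLU applied directly to samples aliases, whereas applying
# it at a higher sampling rate and low-pass filtering before downsampling does
# not. FFT-based resampling (scipy.signal.resample) plays the role of an
# approximately ideal interpolation / low-pass filter for this periodic signal.
import numpy as np
from scipy.signal import resample

n = 128                                       # original sampling rate
t = np.arange(n) / n
z = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)   # bandlimited signal

# Naive: ReLU on the samples creates content above the Nyquist limit,
# which folds back into the band as aliasing.
naive = np.maximum(z, 0.0)

# Alias-aware: oversample 4x, apply ReLU, then resample back to n samples;
# the FFT resampling low-pass filters the signal before the rate is reduced.
filtered = resample(np.maximum(resample(z, 4 * n), 0.0), n)

# Reference: the same operation at 16x oversampling, standing in for the
# ideal continuous-domain result.
reference = resample(np.maximum(resample(z, 16 * n), 0.0), n)

print("max deviation from reference, naive ReLU:   ", np.abs(naive - reference).max())
print("max deviation from reference, filtered ReLU:", np.abs(filtered - reference).max())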
The video below compares StyleGAN3’s internal activations to those of StyleGAN2 (top). Our alias-free translation (middle) and rotation (bottom) equivariant networks build the image in a radically different manner, from what appear to be multi-scale phase signals that follow the features seen in the final image. Due to our alias-free construction, these signals must control both the appearance and the relative positions of image features; we hypothesize that the local oriented oscillations form a basis that enables hierarchical localization. Our construction appears to make it natural for the network to build these signals from the low-frequency input Fourier features.
The following video clarifies the slice visualization of Figure 1, right.
@inproceedings{Karras2021,
author = {Tero Karras and Miika Aittala and Samuli Laine and Erik H\"ark\"onen and Janne Hellsten and Jaakko Lehtinen and Timo Aila},
title = {Alias-Free Generative Adversarial Networks},
booktitle = {Proc. NeurIPS},
year = {2021}
}
Images, text and video files on this site are made freely available for non-commercial use under the Creative Commons CC BY-NC 4.0 license. Feel free to use any of the material in your own work, as long as you give us appropriate credit by mentioning the title and author list of our paper.
We thank David Luebke, Ming-Yu Liu, Koki Nagano, Tuomas Kynkäänniemi, and Timo Viitanen for reviewing early drafts and for helpful suggestions; Frédo Durand for early discussions; Tero Kuosmanen for maintaining our compute infrastructure; the AFHQ authors for an updated version of their dataset; and Getty Images for the training images in the BEACHES dataset.