World-Consistent Video-to-Video Synthesis

Paper (arxiv) Paper (embedded videos) Code (GitHub)

We present a GAN-based approach to generate 2D world renderings that are consistent over time and viewpoints, which was not possible with prior approaches. Our method colors the 3D point cloud of the world as the camera moves through the world, coloring new regions in a manner consistent with the already colored world. It learns to render images based on the 2D projections of the point cloud to the camera in a semantically consistent manner while robustly dealing with incorrect and incomplete point clouds. Our proposed approach further shortens the gap between classical graphics rendering and neural rendering.

Colorization of the world's 3D point cloud

Simultaneously rendered 2D output

TL;DR

Summary Video

Presentation Video

Overview

What is the task of vid2vid?

Video-to-video synthesis (vid2vid) is a powerful tool for converting high-level semantic inputs to photorealistic videos. An example of this task is shown in the video below. Given per-frame labels such as the semantic segmentation and depth map, our goal is to generate the video shown on the right side. You can imagine the inputs being generated by a game engine, with the goal of producing a hyper-realistic gaming experience. Alternatively, you might want to convert a video captured by a user to one that looks like a stylized game.

Semantic Segmentation

Depth

Rendered Video

Issues with prior work

The most relevant prior work in this area is Video-to-Video Synthesis, published at NeurIPS 2018. One of the major shortcomings of this work is that it fails to ensure long-term consistency in the output video. This is because it generates each frame based only on the past few generated frames and lacks knowledge of the 3D structure of the world being generated. Some of the issues are demonstrated in the video below, where we drive forward and then backward to the starting point.

We can immediately notice a few issues:

As highlighted at the end of the video, many features in the last frame differ from those in the first frame, even though we return to the same viewpoint. In other words, the video is not world-consistent over long timeframes.
We observe flickering in the outputs, indicating a lack of fine short-term consistency and realism in features such as the trees, the road markings, etc.

Achieving consistency across views and time

We believe that to produce realistic outputs that are consistent over time and viewpoint change, the method must be aware of the 3D structure of the world. To achieve this, we introduce the concept of guidance images, which are physically grounded estimates of what the next output frame should look like, based on how the world has been generated so far. As their name suggests, the role of these guidance images is to guide the generative model to produce colors and textures that respect previous outputs.

While prior works use optical flow to warp prior outputs, our guidance image differs from this in two respects. First, instead of using optical flow, the guidance image should be generated using the motion field, or scene flow, which describes the true motion of each 3D point in the world. Second, the guidance image should aggregate information from all past viewpoints (and thus frames), instead of only the immediately preceding frames as in vid2vid. This ensures that the generated frame is consistent with the entire history.

The figure below shows one method to generate guidance images by using point clouds and camera locations obtained by performing Structure from Motion (SfM) on an input video. In the case of a game rendering engine, the ground-truth scene flow can be obtained directly and used to generate guidance images.

Generating 3D-aware guidance images

One or more cameras with known parameters and positions travel over time $t = 0, \dots, N$. At $t = 0$, the scene is textureless and an output image is generated for this viewpoint. The output image is then back-projected onto the scene, and a guidance image for a subsequent camera position is generated by projecting the partially textured point cloud. Using this guidance image, the generative method can produce an output that is consistent across views and smooth over time. The guidance image can be noisy, misaligned, and have holes, and the generation method should be robust to such inputs.
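The projection step described in this caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes an axis-aligned pinhole camera without rotation, and the names `render_guidance`, `cam_pos`, and `focal` are hypothetical. Points that have not been colored yet are skipped, so the resulting guidance image contains holes (`None`), and a z-buffer keeps only the nearest point per pixel.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]      # 3D position in world coordinates
Color = Tuple[int, int, int]            # RGB color
Image = List[List[Optional[Color]]]     # image[v][u]; None marks a hole


def render_guidance(points: Sequence[Point],
                    colors: Sequence[Optional[Color]],
                    cam_pos: Point, focal: float,
                    width: int, height: int) -> Image:
    """Project the colored part of a point cloud into one camera view.

    Assumption: a pinhole camera at cam_pos looking along +z with no
    rotation and the principal point at the image center. Camera poses
    recovered by SfM would also include a rotation.
    """
    image: Image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for (px, py, pz), color in zip(points, colors):
        if color is None:
            continue                     # point has not been textured yet
        x, y, z = px - cam_pos[0], py - cam_pos[1], pz - cam_pos[2]
        if z <= 0:
            continue                     # point is behind the camera
        u = math.floor(focal * x / z + width / 2)
        v = math.floor(focal * y / z + height / 2)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z              # z-buffer: keep the nearest point
            image[v][u] = color
    return image


# Toy scene: two visible colored points, one occluded point, one uncolored point.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, 4.0), (-1.0, 0.0, 2.0)]
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), None]
guidance = render_guidance(points, colors, (0.0, 0.0, 0.0),
                           focal=2.0, width=4, height=4)
```

Most pixels of `guidance` remain holes in this toy scene, which mirrors why the generator has to be robust to incomplete guidance.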

As input to our model, we provide the guidance images along with the input labels such as semantic segmentation and depth maps. The inputs and generated outputs are visualized in the video below. As new 3D points become visible to the camera, colors are assigned to them by our image generator and the point cloud is updated. Note that the guidance images have holes and incorrect projections due to noisy point cloud information. Our method is robust to such noise and produces meaningful outputs.
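The point cloud update mentioned above might look roughly like the sketch below. It is an illustration under stated assumptions, not the released code: the camera is again a rotation-free pinhole model, visibility is decided by a simple per-pixel nearest-point test, and `to_pixel` and `update_point_colors` are hypothetical helper names. The key property it demonstrates is that points which already carry a color keep it, and only newly visible, uncolored points take a color from the generated frame.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def to_pixel(point: Point, cam_pos: Point, focal: float,
             width: int, height: int) -> Optional[Tuple[int, int, float]]:
    """Return (u, v, depth) for a point, or None if it is outside the view."""
    x, y, z = (point[0] - cam_pos[0], point[1] - cam_pos[1],
               point[2] - cam_pos[2])
    if z <= 0:
        return None                      # behind the camera
    u = math.floor(focal * x / z + width / 2)
    v = math.floor(focal * y / z + height / 2)
    if not (0 <= u < width and 0 <= v < height):
        return None                      # projects outside the image
    return u, v, z


def update_point_colors(points: Sequence[Point],
                        colors: Sequence[Optional[Color]],
                        frame: Sequence[Sequence[Color]],
                        cam_pos: Point, focal: float) -> List[Optional[Color]]:
    """Back-project a generated frame onto the uncolored visible points."""
    height, width = len(frame), len(frame[0])
    nearest: Dict[Tuple[int, int], Tuple[float, int]] = {}
    for i, p in enumerate(points):
        hit = to_pixel(p, cam_pos, focal, width, height)
        if hit is None:
            continue
        u, v, z = hit
        if (u, v) not in nearest or z < nearest[(u, v)][0]:
            nearest[(u, v)] = (z, i)     # nearest point wins the pixel
    updated = list(colors)
    for (u, v), (_, i) in nearest.items():
        if updated[i] is None:           # never overwrite an existing color
            updated[i] = frame[v][u]
    return updated
```

Keeping existing colors fixed is what lets a later guidance image reproduce earlier outputs when the camera returns to a previously seen region.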

Semantic Segmentation

Depth

Guidance Image

Rendered Video

Here, we show how our method solves the temporal consistency issues observed above with prior work. As the viewer returns to the starting position, the produced output is very similar to the first image, respecting previously produced textures. The output images and transitions over time also look more realistic.

The complete network architecture

Here, we visualize our generator architecture and all its components. The main component in our generator is the novel Multi-SPADE block, which is composed of multiple SPADE layers. Each SPADE layer takes in a spatial conditioning map, such as the semantic segmentation, the optical-flow-warped previous output, or the guidance image, and applies appropriate transformations to the intermediate feature maps so that the final generated output respects the required constraints and looks realistic in both the spatial and temporal domains.

The overall architecture based on the Multi-SPADE module

Each Multi-SPADE module takes input label features, warped previous frame features, and guidance images to modulate the features in each layer of our generator. The labels decide the semantic content of the output frames, the optical-flow warped previous frames ensure short-term consistency, and the guidance images ensure long-term consistency.
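To make the modulation idea concrete, here is a heavily simplified sketch of chaining SPADE-style layers, one per conditioning input. It is an assumption-laden toy, not the paper's network: features are a single flattened channel, the learned convolutions that predict the per-pixel scale and shift are replaced by hypothetical scalar affine maps (`gamma_w`, `gamma_b`, `beta_w`, `beta_b`), and chaining the layers in the order labels, warped previous frame, guidance image is one plausible reading of the description above. A real implementation would use a tensor library and learned weights.

```python
import math
from typing import List, Sequence, Tuple

# (gamma_w, gamma_b, beta_w, beta_b): stand-ins for learned convolutions.
SpadeWeights = Tuple[float, float, float, float]


def normalize(features: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Parameter-free normalization to zero mean and (near) unit variance."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]


def spade(features: Sequence[float], cond: Sequence[float],
          weights: SpadeWeights) -> List[float]:
    """Normalize, then scale and shift each pixel based on the conditioning map."""
    gamma_w, gamma_b, beta_w, beta_b = weights
    normed = normalize(features)
    return [(gamma_w * c + gamma_b) * n + (beta_w * c + beta_b)
            for n, c in zip(normed, cond)]


def multi_spade(features: Sequence[float],
                cond_maps: Sequence[Sequence[float]],
                weights: Sequence[SpadeWeights]) -> List[float]:
    """Apply one SPADE-style layer per conditioning map, in sequence."""
    out = list(features)
    for cond, w in zip(cond_maps, weights):
        out = spade(out, cond, w)
    return out


# Toy example with four pixels and three conditioning maps.
features = [1.0, 2.0, 3.0, 4.0]
labels = [0.0, 0.0, 1.0, 1.0]            # semantic label per pixel
warped_prev = [0.5, 0.5, 0.5, 0.5]       # flow-warped previous output
guidance = [0.1, 0.2, 0.3, 0.4]          # projected point cloud colors
identity = (0.0, 1.0, 0.0, 0.0)          # gamma = 1, beta = 0
shift_by_cond = (0.0, 1.0, 1.0, 0.0)     # gamma = 1, beta = conditioning value
out = multi_spade(features, [labels, warped_prev, guidance],
                  [identity, identity, shift_by_cond])
```

Because each layer's scale and shift vary per pixel with its conditioning map, every input can influence the output at exactly the locations where it carries information.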

Sample Video Generation Results

Consistent Multiview Generation

We can also simultaneously generate videos that are consistent across multiple viewpoints or users. In this setting, the first user generates the first frame and appropriate colors are assigned to the world's 3D point cloud. The 3D point cloud is then splatted/projected to the viewpoint of the second user to be used as guidance and our network generates the final semantically consistent output image after resolving holes and errors. Colors are then assigned to the newly visible 3D points and the point cloud is updated. Below are examples of stereo pair generations, with the point cloud being alternately updated by the left and right views.

Summary

Video-to-video synthesis is a powerful tool for converting semantic inputs to photorealistic videos.
Existing vid2vid methods are unable to maintain long-term consistency (such as during loop closure) because they lack knowledge of the 3D structure of the world.
We provide information about the 3D structure of the world using guidance images, which are obtained by projecting the point cloud of the world colored so far (which can be noisy and incomplete) to the current camera view.
We introduce a new architecture based on the Multi-SPADE module, which uses semantic labels, optical-flow warping, and guidance images as conditioning inputs.
Our models improve the realism, short-term consistency, and long-term consistency of generated videos.

Citation

@inproceedings{mallya2020world,
    title={World-Consistent Video-to-Video Synthesis},
    author={Arun Mallya and Ting-Chun Wang and Karan Sapra and Ming-Yu Liu},
    booktitle={Proceedings of the European Conference on Computer Vision},
    year={2020}
}

Arun Mallya*

Ting-Chun Wang*

Karan Sapra

Ming-Yu Liu

NVIDIA

Published at the European Conference on Computer Vision, 2020