Neural Volume Rendering for Dynamic Scenes

NeRF has shown incredible view synthesis results, but it requires multi-view captures of STATIC scenes.

How can we achieve view synthesis for DYNAMIC scenes from a single video? Here is what I learned from several recent efforts.

Instead of presenting Video-NeRF, Nerfie, NR-NeRF, D-NeRF, NeRFlow, NSFF (and many others!) as individual algorithms, here I try to view them from a unifying perspective and understand the pros/cons of various design choices.

Okay, here we go.

*Background*

NeRF represents the scene as a 5D continuous volumetric scene function that maps a spatial position and viewing direction to color and density. It then composites these colors/densities along each camera ray into an image via volume rendering.
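
As a reference point, here is a minimal sketch of both pieces: a toy MLP standing in for the radiance field, plus the standard volume-rendering quadrature. Positional encoding and hierarchical sampling are omitted, and all names/sizes are my placeholders.

```python
import torch
import torch.nn as nn

# Toy radiance field: (x, y, z, theta, phi) -> (r, g, b, sigma). The real
# model positionally encodes its inputs and uses a much deeper MLP.
field = nn.Sequential(nn.Linear(5, 128), nn.ReLU(), nn.Linear(128, 4))

def render_ray(samples_5d, deltas):
    """Composite N samples along one ray into a pixel color.
    samples_5d: (N, 5) position+direction, deltas: (N,) inter-sample distances."""
    out = field(samples_5d)
    rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3])
    alpha = 1.0 - torch.exp(-sigma * deltas)           # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # light surviving past each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])     # T_i depends only on samples j < i
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)         # expected color along the ray
```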

Volumetric + Implicit -> Awesome!

*Model*

One can extend NeRF to handle dynamic scenes with two types of approaches.

A) 4D (or 6D with views) function.

One direct approach is to include TIME as an additional input to learn a DYNAMIC radiance field.

e.g., Video-NeRF, NSFF, NeRFlow
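
In code, the change is a single extra input. A minimal sketch (class name and layer sizes are my placeholders, not any paper's architecture; positional encoding omitted):

```python
import torch
import torch.nn as nn

# Spatio-temporal radiance field: (x, y, z, view dir, t) -> (r, g, b, sigma).
# The only change from static NeRF is the extra time input.
class DynamicField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),  # 3 (pos) + 2 (dir) + 1 (time)
            nn.Linear(hidden, 4),             # rgb + sigma
        )

    def forward(self, x, d, t):
        # x: (N, 3) positions, d: (N, 2) view directions, t: (N, 1) times
        return self.mlp(torch.cat([x, d, t], dim=-1))
```
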
B) 3D Template with Deformation.

Inspired by non-rigid reconstruction methods, this type of approach learns a radiance field in a canonical frame (template) and predicts deformation for each frame to account for dynamics over time.

e.g., Nerfie, NR-NeRF, D-NeRF
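
A minimal sketch of this template-plus-warp structure (again with placeholder names/sizes): the warp MLP maps each observed point into the canonical frame, where the one shared radiance field lives.

```python
import torch
import torch.nn as nn

class TemplateNeRF(nn.Module):
    """Canonical radiance field + per-frame warp: a sketch of approach B."""
    def __init__(self, hidden=128):
        super().__init__()
        # Warp: (point, time) -> offset that carries it into the canonical frame.
        self.deform = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 3))
        # Radiance field defined once, in the canonical frame only.
        self.canonical = nn.Sequential(nn.Linear(5, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 4))

    def forward(self, x, d, t):
        # x: (N, 3), d: (N, 2), t: (N, 1)
        x_canonical = x + self.deform(torch.cat([x, t], dim=-1))
        return self.canonical(torch.cat([x_canonical, d], dim=-1))
```
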
*Deformation Model*

All the methods use an MLP to encode the deformation field. But how do they differ?

A) INPUT: How to encode the additional time dimension as input?

B) OUTPUT: How to parametrize the deformation field?

A) Input conditioning

One can choose to use EXPLICIT conditioning by treating the frame index t as input.

Alternatively, one can use a learnable LATENT vector for each frame.
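
Both options in a few lines (a sketch; num_frames, latent_dim, and frame_idx are placeholder values):

```python
import torch
import torch.nn as nn

num_frames, latent_dim, frame_idx = 100, 32, 42

# Option 1, EXPLICIT: condition the deformation MLP on the (normalized)
# frame index directly.
t = torch.tensor([[frame_idx / num_frames]])  # shape (1, 1)

# Option 2, LATENT: a learnable per-frame code, optimized jointly with the
# MLP weights (auto-decoder style); the MLP consumes z instead of t.
codes = nn.Embedding(num_frames, latent_dim)
z = codes(torch.tensor([frame_idx]))          # shape (1, 32)
```
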
B) Output parametrization

We can use the MLP to predict either
- dense 3D translation vectors (aka scene flow) or
- a dense rigid motion field (a per-point rotation + translation); see the sketch below.
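
Here is a sketch of the rigid-motion variant, with the per-point rotation in axis-angle form for simplicity (some methods use other SE(3) parametrizations, such as screw axes). An MLP would output the six numbers rot_axis_angle + trans per point; the scene-flow option is the translation-only special case.

```python
import torch

def apply_rigid(x, rot_axis_angle, trans):
    """Warp points x (N, 3) by a per-point rigid motion: an axis-angle
    rotation (N, 3) followed by a translation (N, 3)."""
    theta = rot_axis_angle.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = rot_axis_angle / theta                  # unit rotation axis
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Rodrigues' formula: x_rot = x cos(t) + (k x x) sin(t) + k (k.x)(1 - cos(t))
    x_rot = (x * cos
             + torch.cross(k, x, dim=-1) * sin
             + k * (k * x).sum(-1, keepdim=True) * (1.0 - cos))
    return x_rot + trans

# Scene-flow parametrization is the special case of translation only:
# x_warped = x + flow(x, t)
```
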
With these design choices in mind, we can mix-n-match to synthesize all the methods.

*Regularization*

Adding the deformation field introduces ambiguities. So we need to make it "well-behaved", e.g., the deformation field should be spatially smooth, temporally smooth, and sparse, and it should avoid local contraction and expansion.
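
The sketch below illustrates these priors with finite-difference penalties on the warp field. It is a composite of the kinds of terms these papers use, not any single method's loss; deform_fn and eps are my placeholders.

```python
import torch

def deformation_regularizers(deform_fn, x, t, eps=1e-3):
    """Toy priors on a warp field. deform_fn(x, t) -> (N, 3) offsets;
    x: (N, 3) points, t: (N, 1) times."""
    d = deform_fn(x, t)

    sparsity = d.abs().mean()                             # most of the scene should stay static

    temporal = ((deform_fn(x, t + eps) - d) ** 2).mean()  # nearby times -> similar warps

    spatial, divergence = 0.0, 0.0
    for i in range(3):                                    # probe each spatial axis
        step = torch.zeros_like(x)
        step[:, i] = eps
        grad_i = (deform_fn(x + step, t) - d) / eps       # ~ d(offset)/d(x_i), shape (N, 3)
        spatial = spatial + (grad_i ** 2).mean()          # nearby points -> similar warps
        divergence = divergence + grad_i[:, i]            # trace of the warp Jacobian
    volume = (divergence ** 2).mean()                     # nonzero divergence = local
                                                          # contraction/expansion
    return sparsity, temporal, spatial, volume

# e.g., with the TemplateNeRF sketch above:
# warp = lambda x, t: model.deform(torch.cat([x, t], dim=-1))
# s, tm, sp, vol = deformation_regularizers(warp, x, t)
```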
