Paper

1. Key Sentences

  • In this work, we address the long-standing problem of view synthesis in a new way by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images.

2. 记录

核心贡献点

  • 对具有复杂几何和材料的连续场景,提出一种基于 MLP 网络的方法,将其表示为 5D 神经辐射场(Neural radiance fields, Nerf)
    • 5D 分别是空间中的点位置 $(x, y, z)$, 方向 $(\theta, \phi)$ (3D volumes with 2D view-dependent appearence)
  • 基于经典的 volume rendering ,提出一种可微的渲染流程。我们用这个流程从标准的 RGB 图像来优化场景表示。这包含了层次化采样策略,将 MLP 的空间分配给有可视场景内容的空间
  • 提出一种位置编码,将每一个 5D 坐标输入映射到更高维的空间,这使得我们能成功用 neural radiance fields 来表示高频的场景内容

相关工作

  1. Neural 3D shape representation: 一些工作在做 map xyz coordinates to signed distance functions or occupancy fields,但依赖 3D geometry ground truth. 后续一些工作放松了对 3D ground truth 的依赖,但目前主要局限在 simple shapes with low geometric complexity. resulting in oversmoothed renderings.
  2. VIew synthesis and image-based rendering: 如果是 dense sampling of views, 那么很早之前的 light field sample interpolation 就可以解决;目前主要研究的是 sparser view sampling 下的 view synthesis. 目前主要有 2 类流行的方法:
  • Mesh-based representaions of scenes with eihter diffuse or view-dependent appearance. Gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape.

  • Volumetric representaions to address teh task of high-quality photorealistic view synthesis from a set of input RGB images. Volumetric approaches are able to realistically represent complex shapes and materials, are well-suited for gradient-based optimization, and tend to produce less visually distracting artifacts than mesh-based methods. 早期的 volumetric approaches 使用观测的 images 直接着色 voxel grids. 近期一些方法主要用一个多场景的大数据集训练一个 DNN,用来预测 a sampled volumetric representation from a set of input images, and then use eigher alpha-compositing or learned compositing along rays to render novel views. 还有工作优化了 a combination of CNNs and Sampled vexel grids for each specific scene, 这样 CNN 可以 compensate for discretization artifacts from low resolution vexel grids or allow the predicted voxel grids to vary based on input time or animation controls. 这些方法取得了非常好的效果,但是因为 time and space complexity due to there discrete sampling, 很难扩展到 high resolution imagery. Nerf 通过 encoding a continuous volume within the parameters of a MLP, only only produces significantly higher quality renderings than prior volumetric approaches, but also requires just a fractiono of the storage cost of those sampled volumetric representations.

Neural Radiance Field Scene Representation

input: 5D vector

  • 3D location $\boldsymbol{x} = (x, y, z)$, 2D viewing direction $(\theta, \phi)$
  • Direction 在实际中的表示,用的是 3 维笛卡尔单位向量 $\boldsymbol{d}$, 而不是 $(\theta, \phi)$

output: Color + volume density

  • Color: $\boldsymbol{c} = (r, g, b)$
  • Density: $\sigma$

target: 用 MLP 学习一个近似: $F_{\Theta}: (x, \boldsymbol{d}) \to (\boldsymbol{c}, \sigma)$

前向过程

  • 假设/限制:学习到的表示是 multiview 的,所以 volume density $\sigma$ 应该只与位置 $\boldsymbol{x}$ 有关而与视角 $\boldsymbol{d}$ 无关;而颜色 $\boldsymbol{c}$ 应该和 $\boldsymbol{x}$, $\boldsymbol{d}$ 都有关。
  • 网络:
    • MLP 首先用 8 层全连接(ReLU 激活, 256 channels)处理 $\boldsymbol{x}$, 输出预测密度 $\sigma$ 和 256 维特征向量 $\boldsymbol{v}$
    • 特征向量 $\boldsymbol{v}$ 和 camera rays’s viewing direction 拼接,过一个全连接层(ReLU 激活,128 channels) 输出 view-dependent RGB color.

Train/Loss

  • 先 sample 一些 5D coordinates along camera rays
  • 对这些输入,走前向过程,得到 $\sigma$ 和颜色 $\boldsymbol{c}$
  • 使用 volume rendering,基于 camera ray 组合这些值成为 1 个 image. 这个渲染过程是可微分的
  • residual between synthesized and ground truth observed image 就是 loss. 最小化这个 loss 即完成一次训练迭代