NeRF at CVPR 2023

It is now my third time writing a summary of NeRFy things at a conference. This time it is the big one: CVPR. The list of accepted papers is massive again, with 2359 papers.

What is even more astounding is that the number of NeRF papers has grown significantly. I scanned the provisional program for potential NeRF titles and manually confirmed a relationship to the NeRF field.

However, writing a summary is an entirely different task. Frank Dellaert contacted me on Twitter about the CVPR post this year and gave me the idea to automate the entire summary using an LLM and to create slides for the posts. He even provided me with access to his tools, research database, and a ChatGPT API key. Thank you so much - the idea worked wonders :).

So, I built a small program: ARCHIVE: Ai assisted Research Conference Human-readable Instant surVEy. It automatically creates a summary (thanks, OpenAI 😅) and formats the blog post below. As the blog post is rather long, I also extended the tool to auto-generate a slide deck, which showcases prominent figures from the papers (automatically extracted). The ESC key displays a slide overview, and you can use the arrow keys to move through the presentation (categories left/right, individual slides up/down). The slide title is also a link to the paper itself. The entire deck is automatically generated by the tool (including extracting images and figure descriptions).

I am extending the tool to provide other use cases, but this will be left for another blog post.

While I reread all summaries, my little tool might have gotten something wrong that I did not detect during the final read. If that happened or I missed any paper, please send me a DM on Twitter @markb_boss or via mail.

Here we go again!

Note: the images below belong to the cited papers, and the copyright belongs to the authors or the organizations that published these papers. Below, I use key figures or videos from some papers under the fair use clause of copyright law.

NeRF

Mildenhall et al. introduced NeRF at ECCV 2020 in the now seminal Neural Radiance Fields paper. As mentioned in the introduction, fantastic research progress has been made in this field, and this is my third time covering it. CVPR 2023 is no exception, and NeRF finds applications in many new areas.

What NeRF does, in short, is similar to computed tomography (CT). In CT, a rotating head measures densities along X-ray paths, and the challenge is figuring out where in 3D those measurements belong. NeRF similarly estimates where density and view-dependent radiance are located in space. It stores them in a neural volumetric scene representation using MLPs and then renders the volume to create new images. Given many posed images, photorealistic novel view synthesis becomes possible.
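
To make the rendering step concrete, here is a minimal sketch of NeRF-style volume rendering along a single ray (a toy NumPy illustration; the hypothetical `field` callable stands in for the MLP, and the near/far bounds and sample count are made-up values):

```python
import numpy as np

def render_ray(field, origin, direction, near=2.0, far=6.0, n_samples=64):
    """Alpha-composite samples along one ray, NeRF-style."""
    t = np.linspace(near, far, n_samples)           # sample depths along the ray
    points = origin + t[:, None] * direction        # (N, 3) sample positions
    sigma, rgb = field(points, direction)           # density (N,), color (N, 3)
    delta = np.diff(t, append=t[-1] + 1e10)         # distances between samples
    alpha = 1.0 - np.exp(-sigma * delta)            # opacity of each segment
    # transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                         # compositing weights
    return (weights[:, None] * rgb).sum(axis=0)     # final pixel color
```

Training then simply compares this composited color against the ground-truth pixel and backpropagates through the field.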

Fundamentals

Performance improvements with DynIBaR on complex dynamic novel view synthesis.

These papers address more fundamental problems of view-synthesis with NeRF methods.

Grid-guided Neural Radiance Fields for Large Urban Scenes: The authors propose a new methodology for high-fidelity rendering of large urban scenes using a multiresolution ground feature plane representation in combination with an MLP-based neural radiance field (NeRF). This combines the benefits of a lightweight NeRF with jointly optimized ground planes, resulting in photorealistic novel views with fine details.

Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos: ReRF is introduced as a compact neural representation for long-duration dynamic scenes enabling real-time free-view video rendering, using a global coordinate-based tiny MLP as the feature decoder. A compact grid-based approach is utilized to handle large motions in interframe features. Improved compression and faster, higher-quality video generation were demonstrated using ReRF.

SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory: SteerNeRF exploits the temporal consistency of smooth viewpoint trajectories to render novel views faster than traditional techniques. A low-resolution feature map is generated first, and a lightweight 2D neural renderer then produces the output image at the target resolution, leveraging the features of the preceding and current frames.

Compressing Volumetric Radiance Fields to 1 MB: VQRF introduces a framework for compressing volumetric radiance fields by pruning grid models and applying vector quantization to improve compactness, resulting in a 100x compression ratio with minimal loss in visual quality. The proposed approach is generalizable across multiple volumetric structures and facilitates the use of volumetric radiance fields in real-world applications.
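
As a hedged illustration of the core compression primitive (not VQRF's actual pipeline, pruning strategy, or codebook size), vector quantization replaces each grid feature by a small index into a shared codebook:

```python
import numpy as np

def vector_quantize(grid_feats, codebook):
    """Replace each grid feature vector by its nearest codebook entry;
    storing small indices plus the codebook instead of raw floats is
    where the compression comes from.
    grid_feats: (N, C) flattened grid features
    codebook:   (K, C) learned code vectors, with K << N"""
    # squared distances between every feature and every code: (N, K)
    d2 = ((grid_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)        # (N,) integer indices to store
    return codes, codebook[codes]    # indices and dequantized features

feats = np.random.randn(4096, 8).astype(np.float32)
codebook = np.random.randn(256, 8).astype(np.float32)  # 256 codes -> 1 byte/voxel
codes, dequantized = vector_quantize(feats, codebook)
```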

DINER: Disorder-Invariant Implicit Neural Representation: DINER overcomes the spectral limitations of implicit neural representations using a hash table. This allows the network to handle arbitrary ordering of input signals and generalize better across different tasks. The authors demonstrate the superiority of their approach compared to state-of-the-art algorithms across various tasks.

FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization: FreeNeRF proposes a baseline for few-shot novel view synthesis with sparse inputs using frequency regularization on NeRF’s inputs and densities. This leads to superior performance compared to existing complicated methods.
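
The frequency regularization is simple to sketch: mask out the high positional-encoding frequencies early in training and reveal them gradually (a minimal sketch of the general mechanism; the linear schedule and shapes here are illustrative, not the paper's exact recipe):

```python
import numpy as np

def freq_mask(num_freqs, step, total_steps):
    """Linearly grow the number of visible frequency bands over training,
    so early iterations only see low-frequency inputs (illustrative schedule)."""
    visible = num_freqs * step / total_steps
    return np.clip(visible - np.arange(num_freqs), 0.0, 1.0)  # (F,) soft cutoff

def masked_positional_encoding(x, num_freqs, step, total_steps):
    """Standard NeRF sin/cos encoding with the frequency mask applied."""
    freqs = 2.0 ** np.arange(num_freqs)                   # (F,)
    angles = x[..., None] * freqs                         # (..., D, F)
    enc = np.stack([np.sin(angles), np.cos(angles)], -1)  # (..., D, F, 2)
    return enc * freq_mask(num_freqs, step, total_steps)[:, None]
```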

Learning Neural Duplex Radiance Fields for Real-Time View Synthesis: This work bakes neural radiance fields into a two-layer (duplex) mesh representation for fast rendering with screen-space convolutions. The radiance information is distilled and compressed onto the duplex mesh structure, allowing fast and accurate rendering with minimal MLP evaluations per pixel. They show improved performance on standard datasets.

Neuralangelo: High-Fidelity Neural Surface Reconstruction: Neuralangelo uses a multi-resolution 3D hash grid to learn representations for dense 3D surface structures from multi-view images and videos. It also uses numerical gradients and coarse-to-fine optimization for higher-fidelity surface reconstruction.

PermutoSDF: Fast Multi-View Reconstruction with Implicit Surfaces using Permutohedral Lattices: PermutoSDF builds on neural radiance-density fields by using a permutohedral lattice to encode the SDF, achieving faster optimization and recovery of high-frequency detail. The authors' regularization scheme is crucial to recovering high-frequency geometric detail, and novel views are rendered (using sphere tracing) at a high frame rate.

Multi-Space Neural Radiance Fields: MS-NeRF represents scenes using parallel feature fields in sub-spaces to handle reflective and refractive objects. It is a modification of existing NeRF methods with small computational overheads, providing better rendering performance for complex light paths through mirrored objects.

NeRFLight: Fast and Light Neural Radiance Fields using a Shared Feature Grid: The authors propose a lightweight method for real-time view synthesis via a decoupled grid-based NeRF approach. The approach uses multiple density decoders that share one common feature grid. This results in a model that achieves real-time performance while maintaining high-quality models.

Cross-Guided Optimization of Radiance Fields with Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis: The paper proposes a differentiable framework for cross-guided optimization of single-image super-resolution and radiance fields for high-resolution novel view synthesis. By performing multi-view image super-resolution during radiance field optimization, the training views gain multi-view consistency and high-frequency details, leading to better performance in novel view synthesis.

HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes with Iterative Intertwined Regularization: HelixSurf combines traditional multi-view stereo with neural implicit surface learning to improve scene geometry reconstruction. The method uses intermediate predictions from one strategy to guide the learning of the other in an iterative process.

Neural Fourier Filter Bank: The authors propose a grid-based paradigm for spatial decomposition, which is optimized to store the information both spatially and frequency-wise. The method uses adaptive sine activation features, and the network is composed of sine activation fully connected layers, which learn to decompose signals over different scales and frequencies progressively. The proposed method shows improved performance over the state-of-the-art techniques.

Progressively Optimized Local Radiance Fields for Robust View Synthesis: The authors propose an algorithm to reconstruct a large-scale scene’s radiance field from a single video. They do so by jointly estimating camera poses and a radiance field in a progressive manner. Local radiance fields are trained to handle large and unbounded scenes within a temporal window.

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures: MobileNeRF is a new approach for neural radiance fields that replaces ray marching with polygon rasterization. This allows for fast and efficient rendering while maintaining high quality.

TMO: Textured Mesh Acquisition of Objects with a Mobile Device by using Differentiable Rendering: The authors introduce a pipeline for creating textured meshes with a single smartphone, first using RGB-D-aided structure from motion. This is followed by neural implicit surface reconstruction and differentiable rendering to generate fine-tuned texture maps that are closer to the original scene.

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction: S-MPI improves multiplane images (MPIs) by approximating the 3D scene with plane structures, producing high-quality results for both the RGBA layers and the plane poses. Unlike a standard MPI, S-MPI accounts for non-planar surfaces and multi-view consistency. A transformer-based network architecture produces the expressive S-MPI layers and their corresponding masks, poses, and RGBA contents.

Neural Vector Fields for Implicit Surface Representation and Inference: Vector Fields (VF) are proposed as a new implicit 3D shape representation, yielding faster convergence rates, higher data fidelity, and superior geometric detail compared to traditional distance-based approaches. VF's significant advantage over existing methods is that inserting a singularity in the field wherever the geometry changes enables learning of discontinuous surfaces and interior boundaries.

Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment: This paper proposes a level set alignment loss to improve the accuracy of neural signed distance function (SDF) inference from point clouds or multi-view images. The approach constrains gradients at query points to ensure better gradient consistency across the field. The authors demonstrate the effectiveness of the method on various benchmarks.

WIRE: Wavelet Implicit Neural Representations: WIRE is a Wavelet-based Implicit neural REpresentation that uses a continuous complex Gabor wavelet activation function, allowing it to achieve high accuracy and robustness. The authors show that WIRE outperforms other INR models in image denoising, super-resolution, computed tomography reconstruction, and novel view synthesis.

MixNeRF: Modeling a Ray with Mixture Density for Novel View Synthesis from Sparse Inputs: MixNeRF improves the efficiency of NeRF by using a mixture of distributions to estimate a ray’s RGB colors and a new training objective based on ray depth estimation. It outperforms other state-of-the-art methods and is more efficient in terms of both training and inference.

Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections: NeuS-HSR is a surface reconstruction framework based on implicit neural rendering that can deal with high specular reflections when objects are captured through glass. The framework parameterizes the object surface as an implicit signed distance function and decomposes the rendered image into the target object appearance and an auxiliary plane appearance. NeuS-HSR outperforms state-of-the-art approaches in reconstructing target surfaces accurately and robustly against high specular reflections.

NeuralUDF: Learning Unsigned Distance Fields for Multi-view Reconstruction of Surfaces with Arbitrary Topologies: NeuralUDF introduces a new method for reconstructing arbitrary-topology surfaces from 2D images with volume rendering. By using Unsigned Distance Functions (UDFs) and a new density function, NeuralUDF enables high-quality reconstruction of non-closed shapes, achieving comparable performance to Signed Distance Function (SDF) based methods for closed surfaces.

Sphere-Guided Training of Neural Implicit Surfaces: The method jointly trains a neural distance function with a coarse sphere-based reconstruction that guides sampling toward regions with high-frequency detail, improving sampling efficiency.

VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization: VDN-NeRF is a new method that normalizes the geometry of neural radiance fields (NeRF) by distilling invariant information encoded in the fields. This technique improves the synthesis of 3D scenes with dynamic lighting or non-Lambertian surfaces and minimizes shape-radiance ambiguity.

NeAT: Learning Neural Implicit Surfaces with Arbitrary Topologies from Multi-view Images: NeAT is a new neural rendering framework that represents 3D surfaces as a level set of a signed distance function with a validity branch. This allows for learning implicit surfaces with arbitrary topologies from multi-view images.

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes: SurfelNeRF combines explicit geometric surfel representations with NeRF rendering to enable efficient online reconstruction and high-quality rendering. The method also includes a differentiable rasterization scheme for rendering the neural surfel radiance fields.

ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision: ShadowNeuS models the shadow rays cast by light sources to reconstruct a neural SDF from single-view RGB images or the corresponding shadow information when full scene sampling is not available. The approach allows effective reconstruction of 3D models beyond the line of sight.

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision: Nerflets are introduced as a local scene representation. Each Nerflet represents local information regarding object type, location, and orientation within a scene. By joint optimization of Nerflet parameters, efficient and structure-aware 3D scene representation can be obtained without global modeling.

Regularize implicit neural representation by itself: INRR is introduced as a regularizer for the Implicit Neural Representation (INR). INRR measures the similarity between rows/columns of a matrix and integrates the smoothness of the Laplacian matrix by parameterizing learned Dirichlet Energy with a small INR. The method aims to improve the generalization of INR for signal representation.

RobustNeRF: Ignoring Distractors with Robust Losses: RobustNeRF incorporates an outlier-rejection component into NeRF training, removing moving objects and other ephemeral elements from the training data. The method is cast as a simple optimization problem and works with existing NeRF frameworks.
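
To illustrate the flavor of such a loss (a hedged sketch of a generic trimmed photometric loss, not the paper's exact estimator; `kappa` is an illustrative threshold):

```python
import torch

def robust_photometric_loss(pred, target, kappa=0.1):
    """Trimmed L2: pixels whose residual exceeds a threshold are treated
    as outliers (e.g., moving distractors) and masked out of the gradient.
    pred, target: (N, 3) predicted and ground-truth pixel colors."""
    residual = (pred - target).pow(2).mean(dim=-1)   # per-pixel squared error
    inlier = (residual < kappa).float().detach()     # hard outlier mask
    return (inlier * residual).sum() / inlier.sum().clamp(min=1.0)
```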

Multi-View Reconstruction using Signed Ray Distance Functions (SRDF): This paper introduces a new optimization framework for multi-view 3D shape reconstructions using a novel volumetric shape representation that combines differentiable rendering with local depth predictions to yield pixel-wise geometric accuracy. The approach optimizes the depths of an implicit-parameterized shape representation at each 3D location. The method outperforms existing approaches for geometry estimation over standard 3D benchmarks.

Self-supervised Super-plane for Neural 3D Reconstruction: S3PRecon introduces self-supervised super-plane constraints for neural implicit surface representation methods to handle texture-less planar regions without any annotated datasets. An iterative training scheme of grouping pixels and optimizing the reconstruction network via a super-plane constraint is used to achieve better performance than using ground truth plane segmentation.

PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces: PET-NeuS improves NeuS's MLP-based signed distance field parameterization by representing the signed distance field with a mixture of triplanes and MLPs. This yields a more expressive data structure, at the cost of noise. PET-NeuS ameliorates this noise with a new learnable positional encoding and a self-attention convolution operation.

RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis: RefSR-NeRF improves NeRF super-resolution images by first generating a low-resolution image and then using a high-resolution reference image to reconstruct high-frequency details. The authors design a novel lightweight RefSR model for learning the inverse degradation process from NeRF renderings to target HR images.

DynIBaR: Neural Dynamic Image-Based Rendering: DynIBaR synthesizes realistic views in the presence of large camera and object movements in videos by adapting image-based rendering in a scene-motion-aware manner. It is efficient and performs well on long videos of complex scenes.

F2-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories: F2-NeRF introduces a new warping method, called perspective warping, for grid-based NeRFs, enabling them to handle unbounded scenes. The method shows significant performance improvements compared to other approaches on datasets with free camera trajectories.

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction: NeuDA proposes a hierarchical approach to implicit surface reconstruction using anchor grids to capture 3D contexts. By maintaining an adaptive anchor structure, capturing different topological structures is achieved.

PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering: PlenVDB accelerates the training and inference stages of NeRFs by introducing the hierarchical, sparsely-filled VDB data structure. The method achieves faster training convergence, more compact NeRF models, and faster rendering on commodity hardware.

SeaThru-NeRF: Neural Radiance Fields in Scattering Media: SeaThru-NeRF modifies NeRF to account for the medium's transmission and scattering. The authors use SeaThru's image formation model and propose a suitable architecture. The method can also render clear views, removing the medium between the camera and the scene.

DINER: Depth-aware Image-based NEural Radiance fields: DINER uses depth information to guide the reconstruction of a volumetric neural radiance field representation for 3D object rendering. DINER achieves higher synthesis quality than the state-of-the-art methods and can capture scenes more completely with greater disparity.

Masked Wavelet Representation for Compact Neural Radiance Fields: The authors propose a method to compress grid-based neural fields and make them more efficient using wavelet transforms. Their approach includes a trainable masking technique that produces a more compact representation and achieves state-of-the-art performance within a memory budget of 2 MB.

Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields: Exact-NeRF computes the Integrated Positional Encoding in a pyramid-based, precise analytical approach, rather than an approximated conical one. The paper shows that this new approach outperforms the approximated model when scenes are distant or extended.

NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer: NeRFLiX trains an inter-viewpoint mixer to remove rendering artifacts like noise and blur in existing NeRF models, using a NeRF degradation modeling approach and inter-viewpoint aggregation.

NeUDF: Learning Neural Unsigned Distance Fields with Volume Rendering: NeUDF extends signed distance function (SDF) reconstruction algorithms to recover arbitrary shapes with open surfaces. The method uses the unsigned distance function (UDF) as the underlying representation and employs a new weight-function formulation and a normal regularization strategy for efficient volume rendering.

TINC: Tree-structured Implicit Neural Compression: TINC proposes a tree-structured approach to compress implicit neural representations of data through partitioned local MLP fitting. The parameter-sharing approach helps to capture both local and non-local correlations across the scenes.

ABLE-NeRF: Attention-Based Rendering with Learnable Embeddings for Neural Radiance Field: ABLE-NeRF improves the view-dependent effects of Neural Radiance Fields (NeRF) in volumetric rendering by using a self-attention-based framework along rays and Learnable Embeddings to capture local lighting. The method reduces the artifacts on several materials, resulting in higher-quality rendering.

Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur: The Hybrid Neural Rendering model uses a combination of neural and image-based representations to render high-fidelity, view-consistent images, even for large-scale scenes with motion blur. Additionally, the authors propose methods to simulate the blur effects and reduce the impact of blurriness during training.

Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container: ReNeuS is a novel method for recovering the 3D geometry of objects in transparent enclosures. It models the scene as two sub-spaces, using the existing NeuS method to represent the inner sub-space. A combination of volume rendering and ray tracing is used to render the model, and the geometry and appearance are then recovered by minimizing differences between real and hybrid-rendered images.

Real-Time Neural Light Field on Mobile Devices: This work introduces a mobile-friendly network architecture that runs in real time on mobile devices for neural rendering of 3D scenes, with high-resolution generation and similar image quality to NeRF. It requires low latency and little storage, saving 15-24x the storage compared with MobileNeRF.

Priors and Generative

Realistic 3D landscape synthesis from a single semantic mask using Painting 3D Nature in 2D.

Priors can either aid reconstruction or be used in a generative manner. In reconstruction, for example, priors either increase the quality of neural view synthesis or enable reconstruction from sparse image collections.

Learning 3D-aware Image Synthesis with Unknown Pose Distribution: The PoF3D method frees generative radiance fields from the requirements of 3D pose priors. It equips the generator with an efficient pose learner to infer a pose from a latent code and assigns the discriminator a task to learn pose distribution. The pose-free generator and the pose-aware discriminator are jointly trained in an adversarial manner.

Painting 3D Nature in 2D: View Synthesis of Natural Scenes from a Single Semantic Mask: The paper introduces a novel approach for 3D-aware image synthesis that can produce photorealistic multi-view consistent color images of natural scenes. The key idea is to use a semantic field as an intermediate representation and convert it to a radiance field using semantic image synthesis models. The method outperforms baseline methods and requires only a single semantic mask input.

Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion: This framework generates high-quality 3D reconstructions with accurate pose and appearance from a single image of objects with arbitrary topologies. It leverages an unconditional 3D-aware generator and a hybrid inversion scheme that refines the solution via optimization without exploiting multiple views. The network can de-render an image in as few as 10 steps, which is useful for practical applications.

Local Implicit Ray Function for Generalizable Radiance Field Representation: LIRF is a novel approach to neural rendering that aggregates information from conical frustums to construct each ray, resulting in high-quality novel view rendering.

Multiview Compressive Coding for 3D Reconstruction: MCC is a method for single-view 3D reconstruction that operates on 3D points of single objects or whole scenes. Its efficient size compression allows large-scale training from diverse RGB-D videos for learning a generalizable representation. MCC shows strong generalization to novel objects and objects captured in the wild.

Generalizable Implicit Neural Representations via Instance Pattern Composers: The authors present a new method for implicit neural representations (INRs) which improves generalization by learning a small set of weights that modulate an early layer of the MLP network while keeping the remaining MLP weights constant. The resulting pattern composition rules enable the network to represent common features across instances.

DP-NeRF: Deblurred Neural Radiance Field with Physical Scene Priors: DP-NeRF proposes a novel framework to handle blurry images for 3D reconstruction. The approach utilizes two physical priors for color consistency and 3D geometric consistency, which are derived from the actual blurring process during image acquisition by the camera. The authors show that the proposed method improves the perceptual quality of the reconstructed scene on synthetic and real scenes with both camera motion blur and defocus blur.

DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction: DIFu is a new method for single-image clothed human reconstruction. It uses projected depth information to create a voxel-aligned human reconstruction that can contain pixels with detailed 3D information, such as hair and clothing, and estimates occupancies with pixel- and voxel-aligned features. The method also includes a texture inference branch for color estimation.

HumanGen: Generating Human Radiance Fields with Explicit Priors: HumanGen is a 3D human generation method that combines various priors from 2D and 3D models of humans using an anchor image. It features a hybrid feature representation, a two-pronged design for geometry and appearance generation, and incorporates off-the-shelf 2D latent editing methods into 3D. The method generates view-consistent radiance fields with detailed geometry and realistic free-view rendering.

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation: RenderDiffusion is a diffusion-based model for 3D generation and inference that can be trained using only 2D images. The denoising process is anchored to an intermediate 3D representation, making the generated images 3D-consistent, and the method enables 2D inpainting to edit 3D scenes.

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation: A pretrained diffusion model's score can be backpropagated through the Jacobian of a differentiable renderer to optimize a 3D scene representation, for example a voxel radiance field.
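
The core identity is essentially the chain rule. Roughly, with $\theta$ the 3D scene parameters, $\pi$ a camera, and $x_\pi = g(\theta, \pi)$ the rendered image, the 2D score from the frozen diffusion model is chained through the renderer Jacobian:

$$\nabla_\theta \log p(x_\pi) = \underbrace{\frac{\partial \log p(x_\pi)}{\partial x_\pi}}_{\text{2D diffusion score}} \cdot \underbrace{\frac{\partial x_\pi}{\partial \theta}}_{\text{renderer Jacobian}}$$

Averaging this gradient over sampled viewpoints pushes the 3D representation toward scenes whose renderings look plausible to the 2D model.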

3D-Aware Multi-Class Image-to-Image Translation with NeRFs: The paper proposes a new method for 3D-aware multi-class image-to-image (I2I) translation using a combination of a 3D-aware GAN step and a 3D-aware I2I translation step. The authors introduce a new conditional architecture and training strategy for the multi-class GAN and several new techniques for the I2I translation step to improve view-consistency.

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures: This paper proposes Latent-NeRF, which adapts score distillation to latent diffusion models for text-guided 3D generation. The authors integrate sketch-shape constraints to control the 3D shape generation process and apply latent score distillation directly on 3D meshes.

ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning: The paper reports that models trained on synthetic data tend to produce sharper but less accurate volume densities. To address this, a geometry-aware contrastive learning approach is introduced, and cross-view attention is adopted. The method helps to render higher quality and better detailed images when working with synthetic-to-real novel view synthesis.

SCADE: NeRFs from Space Carving with Ambiguity-Aware Depth Estimates: SCADE improves NeRF reconstruction quality by leveraging depth estimates from monocular depth estimation models, which generalize across scenes. It uses a space carving loss to fuse multiple hypothesized depth maps from each view and distill a consistent geometry from them.

VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction: VolRecon is introduced as a more generalizable neural implicit scene reconstruction method with Signed Ray Distance Function. It projects multi-view features and combines them with volume features.

OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering: OTAvatar is a one-shot learning system for 3D facial avatars. It employs tri-plane volumetric rendering with an efficient CNN and disentangles facial identity from motion representation using a decoupling-by-inverting strategy, which allows new avatars to be created quickly from as little as one reference portrait.

HoloDiffusion: Training a 3D Diffusion Model using 2D Images: The authors address the scarcity of 3D training data and the memory cost of extending diffusion models to 3D by introducing a diffusion model that can be trained using only 2D images, with an image formation model that decouples model memory from spatial memory. The approach is competitive with existing techniques on the CO3D dataset.

NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors: NeRDi is a NeRF-based single-view 3D reconstruction method with general image priors from 2D diffusion models, using pre-trained vision-language models for conditional input and depth maps for geometric regularization.

Persistent Nature: A Generative Model of Unbounded 3D Worlds: The authors present a method for unconditional synthesis of unbounded nature scenes based on an extendable planar scene layout grid and a panoramic skydome. The renderer can move freely through the scene without auto-regression.

3D Neural Field Generation using Triplane Diffusion: TriDiff is a diffusion-based model for 3D-aware generation of neural fields. TriDiff preprocesses training data by converting meshes to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Training the diffusion model with these representations yields high-quality 3D neural fields.

Diffusion-Based Signed Distance Fields for 3D Shape Generation: SDF-Diffusion proposes a two-stage generative process using diffusion models. The first stage generates a low-resolution SDF, which the second stage refines into a high-resolution result. The network can generate complex high-resolution 3D shapes using the 3D SDF representation previously used for shape completion tasks.

Towards Unbiased Volume Rendering of Neural Implicit Surfaces with Geometry Priors: The paper proposes a new way to render signed distance functions in which the scale factor depends on the angle between the ray and the surface normal, reducing bias in volume rendering. The authors pre-train a multi-view stereo network for supervision at the zero-crossing intersection points between the implicit surface and the viewing frustum.

Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild: NPF is a novel Neural Proto-face Field that disentangles the common/specific facial cues to allow precise face modeling. It is able to learn 3D-consistent identity via uncertainty modeling and multi-image priors from photo collections. The disentangled learning methodology predicts superior 3D face shapes and textures compared to the state-of-the-art methods.

Magic3D: High-Resolution Text-to-3D Content Creation: Magic3D is a two-stage optimization framework that uses a diffusion prior to obtain a coarse model and accelerates it with a sparse 3D hash grid structure. It further optimizes a textured 3D mesh model to create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion.

DiffRF: Rendering-guided 3D Radiance Field Diffusion: DiffRF proposes volumetric radiance field synthesis based on denoising diffusion probabilistic models. The framework learns a multi-view consistent prior directly over radiance fields, guided by rendering sets of posed images, which yields good quality for image synthesis. Unlike 3D GANs, the method allows free-view synthesis from this learned prior at inference time.

DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models: DiffusioNeRF improves NeRF training by introducing regularizing RGBD patch priors. These priors are learned with a denoising diffusion model and improve the generalization of the reconstructed geometry and color fields to novel scenes.

Seeing a Rose in Five Thousand Ways: The proposed method learns object intrinsics (geometry, texture, and material) from a single image of a certain object category, such as roses, and then uses this knowledge to generate different images of the same object under changing poses and lighting conditions. The resulting model shows superior performance across various related tasks compared to existing methods.

NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models: NeuralField-LDM is a generative model that can synthesize complex 3D environments using Latent Diffusion Models for efficiency and high-quality 3D content. With a scene autoencoder, voxel grids, and latent-autoencoder, the authors improve upon existing scene generation models and demonstrate potential uses in 3D content creation applications.

SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene: SinGRAF is a 3D-aware generative model trained from only a few input images of a single scene. The 3D GAN architecture enables SinGRAF to produce different photorealistic realizations of the scene while preserving its appearance and varying the layout.

Polynomial Implicit Neural Representations For Large Diverse Datasets: Poly-INR is a new implicit neural representation technique that replaces sinusoidal positional encoding with polynomial functions. The proposed model eliminates the need for positional encodings and performs comparably to state-of-the-art generative models with far fewer trainable parameters.

NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds: NeRFVS utilizes holistic priors such as pseudo-depth maps and view coverage information to guide the learning of implicit neural representations of 3D indoor scenes. Robust depth loss and variance loss are proposed to further improve the performance, and these losses are modulated during NeRF optimization according to the view coverage information to reduce the influence of view coverage imbalance. The method achieves high-fidelity free navigation results on indoor scenes.

GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images: GM-NeRF synthesizes novel view images of human performers using a geometry-guided attention mechanism and neural rendering. This efficiently improves the perceptual quality of the synthesis and outperforms state-of-the-art methods in novel view synthesis.

GINA-3D: Learning to Generate Implicit Neural Assets in the Wild: GINA-3D uses camera and LiDAR data to learn 3D assets of vehicles and pedestrians using a generative approach. The method decouples representation learning and generative modeling into two stages with a tri-plane latent structure, which is shown to perform better than existing approaches when evaluated on a large-scale object-centric dataset.

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars: The authors propose a novel 3D GAN framework for unsupervised learning of high-quality facial avatars from unstructured 2D images. The method introduces a new 3D representation called generative texture-rasterized tri-planes. The proposed representation accurately models deformation and flexibility, enabling fine-grained expression control and animation.

NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views: NeuralLift-360 is a technique that generates a 3D object with 360-degree views that correspond well with a reference image, easing workflows for 3D artists and XR designers. It uses a depth-aware NeRF and denoising diffusion models guided by CLIP to provide coherent guidance and can be guided with rough depth estimation in the wild through a ranking loss. The method outperforms existing state-of-the-art baselines.

NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation: This paper proposes a method to finetune NeRF-GAN models to generate high-fidelity animation of real subjects based on a single image. The method includes 2D loss functions to reduce the identity gap, as well as explicit and implicit 3D regularizations to remove artifacts.

Dynamic

Neural fields to create blends of various facial expressions using BlendFields.

Capturing dynamic objects is a trend that started in previous conference years. This can either be solved with parametrization or via neural priors.

BlendFields: Few-Shot Example-Driven Facial Modeling: The authors propose a solution for fine-grained face rendering that blends sparse expressions to infer the appearance of unseen expressions. The fine-grained details are captured as appearance differences between each of the extreme poses in their method. The approach is robust and generalizes well beyond faces.

INSTA - Instant Volumetric Head Avatars: INSTA uses a neural radiance field-based pipeline to reconstruct digital avatars from a single monocular RGB video. The output model, based on a parametric face model, offers high-quality rendering and interactivity. It also performs well in unseen pose conditions.

Towards Scalable Neural Representation for Diverse Videos: D-NeRV is an INR-based framework designed for encoding long and diverse videos. It decouples visual content from motion information while introducing temporal reasoning into the implicit neural network to improve compression results. The proposed model surpasses NeRV and traditional video compression techniques while also achieving higher accuracy on action recognition tasks.

Learning Neural Volumetric Representations of Dynamic Humans in Minutes: The paper presents a novel technique to accelerate the learning process of neural radiance fields for free-viewpoint video reconstruction from sparse multi-view videos. The proposed method uses a part-based voxelized representation and a 2D motion parameterization scheme to increase convergence rates. The method is shown to achieve competitive visual quality with a much faster training process.

MagicPony: Learning Articulated 3D Animals in the Wild: MagicPony is a method for predicting detailed 3D articulated shapes and appearances of animals using only single-view images of the same category. It utilizes a novel implicit-explicit representation that combines the strengths of neural fields and meshes. MagicPony includes a self-supervised visual transformer and a viewpoint sampling technique to improve performance and generalization.

HNeRV: A Hybrid Neural Representation for Videos: HNeRV uses content-adaptive embeddings and a re-designed architecture to outperform existing methods in video regression, while also allowing for higher resolutions with fewer parameters. The method also shows advantages in video decoding speed, flexibility, and deployment. HNeRV can be used in downstream tasks like video compression and inpainting.

Parametric Implicit Face Representation for Audio-Driven Facial Reenactment: This work proposes a novel audio-driven facial reenactment framework that uses a parametric, interpretable implicit face representation. It improves audio-to-expression parameters encoding, uses conditional image synthesis, and data augmentation techniques to achieve high-quality results with more fidelity to the identities and speaking styles of speakers.

NIRVANA: Neural Implicit Representations of Videos with Adaptive Networks and Autoregressive Patch-wise Modeling: NIRVANA adopts a patch-wise approach to video compression where groups of frames are each fitted to separate networks to exploit temporal redundancy. The method uses autoregressive modeling, quantized network parameters, and scaling based on GPU use to achieve joint improvements in quality, speed, and scalability with efficient variable bitrate compression.

HyperReel: High-Fidelity 6-DoF Video with Ray-Conditioned Sampling: HyperReel is a 6-DoF video representation with a hypernetwork for ray-conditioned sample prediction. It has a compact and memory-efficient dynamic volume representation and outperforms existing approaches in visual quality, memory requirements, and frame rate.

DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos: DNeRV introduces the difference frame as an essential channel for implicit video representation, resulting in state-of-the-art performance in intraprediction and video compression on x264.

MonoHuman: Animatable Human Neural Field from Monocular Video: MonoHuman proposes a three-part pipeline to generate view-consistent and high-fidelity avatars under arbitrary novel poses. A shared bidirectional deformation module creates a pose-independent, generalizable deformation field, followed by a forward correspondence search module. Finally, a rendering network leverages multi-view consistent features to produce the final avatar. The authors show improved performance over current state-of-the-art methods.

HandNeRF: Neural Radiance Fields for Animatable Interacting Hands: HandNeRF uses pose estimation to generate a detailed explicit triangle mesh of interacting hands from multi-view images. They design a shared axis space for multiple poses, allowing each pose to add to the view space. A neural feature distillation method is used to enhance image quality while avoiding artifacts. Expensive ground-truth data is used to remove occlusions in the learning process.

FlexNeRF: Photorealistic Free-viewpoint Rendering of Moving Humans from Sparse Views: FlexNeRF provides photorealistic free-viewpoint rendering of people in motion from monocular videos. The method handles fast and complex motions under sparse views through a joint optimization approach where canonical time and pose are optimized with pose-dependent motion fields and pose-independent temporal deformations. The approach proposes novel consistency constraints and provides improved performance over existing benchmarks.

Flow Supervision for Deformable NeRF: The authors present a deformable NeRF method that uses optical flow as supervision, with improvements over baselines that don’t use flow supervision. They show that inverting the backward deformation function is not needed for computing scene flows between frames, simplifying the problem.

NeMo: 3D Neural Motion Fields from Multiple Video Instances of the Same Action: NeMo is a neural motion field optimized to reconstruct 3D human motion from multiple video instances of the same action. The method outperforms existing monocular HMR methods in terms of 2D keypoint detection and achieves better 3D reconstruction compared to baselines on a small MoCap dataset.

NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects: The paper presents a modified version of NeRF called NeRF-DS for rendering novel views from RGB video input with dynamic scenes, which is capable of modeling the reflected color of specular surfaces during motion. NeRF-DS conditions the radiance field function on surface position and orientation in the observation space and uses a mask of moving objects to guide the deformation field.

Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields: The proposed method learns spatiotemporal neural representations for scenes using neural network modules or 4D hash grids for extracting and interpolating features from space-time inputs, achieving state-of-the-art performance and/or 100 times faster training speed.

Robust Dynamic Radiance Fields: This work improves the robustness of dynamic radiance field reconstruction by jointly estimating the static and dynamic radiance fields alongside the camera parameters. The method demonstrates improved performance on challenging videos compared to state-of-the-art methods.

Distilling Neural Fields for Real-Time Articulated Shape Reconstruction: The authors present a method for real-time reconstruction of articulated 3D models from video without test-time optimization or manual 3D supervision. The method trains a fast feed-forward network using off-the-shelf video-based dynamic NeRFs as 3D supervision to reconstruct arbitrary deformations represented by articulated bones and blend skinning.

Implicit Neural Head Synthesis via Controllable Local Deformation Fields: The paper presents a method of generating personalized 3D head avatars from 2D videos, with sharper deformations and greater facial detail compared to existing methods. This is achieved through a novel formulation of multiple implicit deformation fields with local semantic rig-like control and a local control loss, as well as an attention mask mechanism.

HexPlane: A Fast Representation for Dynamic Scenes: HexPlane represents dynamic 3D scenes explicitly using six planes of learned features. Features for a spacetime point are computed efficiently by fusing vectors extracted from the six planes. Combined with a small MLP, this produces impressive results in novel view synthesis.

Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream: Instant-NVR proposes a monocular tracking and rendering system for complex human-object interactions in real-time. The authors use a hybrid deformation module and an online reconstruction strategy for efficient rendering. The system can capture the dynamic and static radiance fields for image synthesis.

DyLiN: Making Light Field Networks Dynamic: DyLiN handles non-rigid deformations in dynamic light fields in a computationally efficient manner by learning a deformation field and lifting rays into a higher-dimensional space to handle discontinuities. CoDyLiN extends this to controllable attributes in addition to non-rigid deformations.

Tensor4D: Efficient Neural 4D Decomposition for High-fidelity Dynamic Reconstruction and Rendering: Tensor4D uses a 4D tensor decomposition model for capturing dynamic 3D scenes from sparse-view camera rigs or even a monocular camera. The tensor is decomposed hierarchically into three time-aware volumes and nine compact feature planes. The proposed tensor factorization scheme allows for structural motion and detailed changes to be learned from coarse to fine.

K-Planes: Explicit Radiance Fields in Space, Time, and Appearance: K-Planes is a white-box model for radiance fields in arbitrary dimensions that represents a d-dimensional scene using d-choose-2 planes (e.g., six planes for a 4D dynamic scene), making it easy to add dimension-specific priors such as temporal smoothness. Despite using a linear feature decoder, it yields performance similar to a non-linear black-box MLP decoder, with low memory usage and fast optimization, and achieves state-of-the-art reconstruction fidelity.
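
The factorization is easy to sketch: one 2D feature plane per axis pair, with bilinear lookups fused by element-wise multiplication (a minimal sketch of the general idea, not the authors' code; grid resolutions and feature sizes are made up):

```python
import itertools
import torch
import torch.nn.functional as F

def kplanes_features(planes, coords):
    """Feature for each d-dim point: multiply bilinear samples from one
    2D feature plane per axis pair (d-choose-2 planes in total).

    planes: dict mapping axis pair (i, j) -> tensor of shape (1, C, R, R)
    coords: (N, d) points normalized to [-1, 1]
    """
    d = coords.shape[1]
    feat = None
    for i, j in itertools.combinations(range(d), 2):
        uv = coords[:, [i, j]].view(1, 1, -1, 2)               # grid_sample layout
        sampled = F.grid_sample(planes[(i, j)], uv, align_corners=True)
        sampled = sampled.view(planes[(i, j)].shape[1], -1).T  # (N, C)
        feat = sampled if feat is None else feat * sampled     # Hadamard fusion
    return feat  # (N, C); the paper decodes this with a small linear layer

# usage: a 4D (x, y, z, t) scene needs six planes
planes = {p: torch.randn(1, 16, 64, 64) for p in itertools.combinations(range(4), 2)}
feats = kplanes_features(planes, torch.rand(1024, 4) * 2 - 1)  # (1024, 16)
```

For a static 3D scene this needs only the three space planes (xy, xz, yz); a 4D dynamic scene adds the three space-time planes.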

Representing Volumetric Videos as Dynamic MLP Maps: The paper proposes a method for real-time view synthesis of dynamic 3D scenes by representing the radiance field of each frame as a set of shallow MLPs stored as “MLP maps,” and dynamically predicted by a shared 2D CNN decoder. This achieves high rendering quality with state-of-the-art efficiency and speed.

Editable and Composable

The left image shows the observation, the central one the optimization process of ONSF, and the right the optimized light position highlighted as a green dot.

This section covers NeRFs that propose composing, controlling, or editing methods.

Multi-Object Manipulation via Object-Centric Neural Scattering Functions: Object-centric neural scattering functions (OSFs) are used as object representations to enable compositional scene re-rendering under object rearrangement and varying lighting conditions. This approach leads to improved model-predictive control performance and generalization in compositional multi-object environments.

NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds: NeuralEditor is a shape-editing algorithm that works on the explicit point cloud representation of NeRF. It employs K-D tree-guided density-adaptive voxels to perform deterministic integration and optimize the neural network. The resulting point cloud is then used to perform shape editing to achieve state-of-the-art results on shape deformation tasks.

Removing Objects From Neural Radiance Fields: The paper presents a method for inpainting objects in an already generated NeRF using confidence-based view selection. A user-provided mask is used to overwrite data likely to contain the object, and the NeRF is then re-trained with individual 2D images selected by the view selection procedure.

UV Volumes for Real-time Rendering of Editable Free-view Human Performance: UV Volumes proposes an approach for rendering human performers in real time for VR/AR applications. It separates the high-frequency appearance from the 3D volume and encodes it into 2D texture stacks, which allows faster computation with shallower neural networks while maintaining editability and generalization. Other applications, like retexturing, are also made possible by this approach.

DA Wand: Distortion-Aware Selection using Neural Mesh Parameterization: DA Wand introduces a differentiable parameterization layer to pick a meaningful region of a mesh for low-distortion parameterization. A neural segmentation network learns to select the 3D region, which is then parameterized in 2D.

SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields: The paper proposes a novel 3D inpainting method for removing unwanted objects from a 3D scene. The method leverages learned 2D image inpainters and a 3D segmentation mask to address the challenges of view consistency and geometric validity. Additionally, the authors introduce a dataset comprised of real-world scenes to evaluate the effectiveness of the proposed method.

PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields: PaletteNeRF provides an efficient and realistic approach to editing the appearance of neural radiance fields. The appearance is decomposed into a set of palette colors shared across the scene and optimized alongside per-point basis functions. The method allows efficient editing of the appearance by directly modifying the color palettes. Additionally, compressed semantic features can be introduced for semantic-aware appearance editing.
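
The decomposition itself reduces to a weighted sum over a small global palette (a hedged illustration of the general idea, not the paper's exact formulation):

```python
import torch

def palette_color(weights, palette):
    """Per-point color as a weighted sum of global palette colors.
    weights: (N, K) non-negative per-point palette weights (from the field)
    palette: (K, 3) global palette colors shared across the scene"""
    return weights @ palette  # (N, 3)

palette = torch.tensor([[0.8, 0.1, 0.1],   # red-ish base color
                        [0.1, 0.2, 0.9]])  # blue-ish base color
weights = torch.rand(5, 2)                 # toy per-point weights
colors = palette_color(weights, palette)
# editing: recolor everything red-ish to green by swapping one palette entry
palette[0] = torch.tensor([0.1, 0.8, 0.1])
colors_edited = palette_color(weights, palette)
```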

Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization: LipRF uses a Lipschitz mapping to stylize 3D scenes photorealistically. LipRF couples pre-trained NeRF with 2D photorealistic style transfer and learns 3D styles from views.

OCC-NeRF: Occlusion-Free Scene Recovery via Neural Radiance Fields: The proposed method directly maps position and viewing angle to occlusion-free scene details with a neural radiance field. The scheme jointly optimizes camera parameters and the scene reconstruction in the presence of occlusions, without the need for labeled external training data.

ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects: ReNe is a dataset that contains real-world object scenes captured with one-light-at-a-time (OLAT) conditions. The dataset contains complex geometries and challenging materials. The authors perform an ablation study to identify a lightweight architecture capable of rendering objects under novel light conditions and establish a non-trivial baseline for the new dataset.

EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points: EditableNeRF models dynamic scenes by detecting key points and their weights. The key points can then be dragged and dropped, allowing scenes captured as single-camera image sequences to be edited.

StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields: StyleRF is a 3D style transfer method that performs style transformation within the feature space of a radiance field. StyleRF employs an explicit grid of high-level features to represent 3D scenes for reliable geometry restoration via volume rendering, with deferred style transformation of 2D feature maps that ensure high-quality zero-shot style transfer across a variety of new styles. The proposed method performs sampling-invariant content transformation to ensure multi-view consistency.

Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization: Ref-NPR uses a reference image as a style source to stylize a 3D scene represented by radiance fields, using view interpolation and semantic disambiguation to fill occluded areas with visuals consistent with the stylized reference.

SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field: SINE lets users easily edit NeRFs using semantic strokes or text prompts. Techniques like cyclic constraints with a proxy mesh, color compositing, and feature-cluster-based regularization stabilize the process. The authors show examples on both real-world and synthetic data that achieve high-quality multi-view consistency.

Pose Estimation

Pose alignment with L2G-NeRF compared to prior work.

Estimating the pose of objects or the camera is a fundamental problem in computer vision. This can also be done to improve the quality of scenes with noisy camera poses.

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields: L2G-NeRF is a bundle-adjustment method for finding accurate camera poses for Neural Radiance Fields in novel view synthesis. The method applies a pixel-wise local alignment, followed by a frame-wise global alignment using differentiable parameter estimation solvers. L2G-NeRF improves reconstruction and outperforms the state-of-the-art.

DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields: DBARF extends BARF to generalizable NeRFs (GeNeRFs). The geometric cost feature map used in DBARF supports self-supervised learning, allowing it to be trained across domains, in contrast to the state of the art.

AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training: AligNeRF combines convolutional layers with MLPs for high-resolution NeRF reconstruction and includes a novel training strategy and a high-frequency-aware loss to improve reconstruction quality, performing better than other state-of-the-art NeRF models in high-frequency detail recovery.

NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior: The authors propose a novel method for training and rendering NeRFs from mobile camera videos without pose priors. The method incorporates a monocular depth prior into the estimation of relative motion between frames. The authors show promising results for mobile camera use cases.

SPARF: Neural Radiance Fields from Sparse and Noisy Poses: SPARF is introduced to allow novel view synthesis from a few input views given noisy camera poses. It exploits multi-view geometry constraints to jointly refine camera poses and estimate the NeRF. SPARF sets a new state-of-the-art in the sparse-view regime on multiple datasets.

BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields: The authors present BAD-NeRF, a bundle-adjusted deblur neural radiance field, which models the physical image formation process to jointly learn camera poses and NeRF parameters and is robust to motion blur. They show superior performance over prior works on synthetic and real datasets.
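
The image formation model is compact enough to sketch: a blurred frame is modeled as the average of virtual sharp renders at poses interpolated along the camera trajectory within the exposure time. A toy version, simplified to translations with a stand-in renderer (the paper interpolates full SE(3) poses and optimizes them jointly with the NeRF):

```python
import numpy as np

def render_view(pose_t):
    """Hypothetical stand-in for a NeRF render; here the "image" is
    just a sine ramp shifted by the camera's x-translation."""
    x = np.linspace(0, 1, 64) + pose_t[0]
    return np.tile(np.sin(8 * x), (64, 1))

def render_blurred(t_start, t_end, n_virtual=9):
    """Motion-blur formation model (simplified): average sharp renders
    at poses interpolated within the exposure; a straight line between
    two translations stands in for the SE(3) trajectory."""
    taus = np.linspace(0.0, 1.0, n_virtual)
    frames = [render_view((1 - tau) * t_start + tau * t_end)
              for tau in taus]
    return np.mean(frames, axis=0)

blurred = render_blurred(np.zeros(3), np.array([0.1, 0.0, 0.0]))
```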

Level-S2fM: Structure from Motion on Neural Level Set of Implicit Surfaces: This paper introduces Level-S2fM, a neural incremental Structure-from-Motion (SfM) approach that uses coordinate MLPs to estimate camera poses and scene geometry from uncalibrated images. It addresses challenges in optimizing volumetric neural rendering with unknown camera poses and demonstrates promising results in camera pose estimation, scene geometry reconstruction, and neural implicit rendering.

Decomposition

Neural Fields meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes allows exporting captured urban objects into any graphics engine.

In this section, the radiance of NeRF is split into geometry, BRDF, and illumination. This enables consistent relighting under any illumination.

NeFII: Inverse Rendering for Reflectance Decomposition with Near-Field Indirect Illumination: The authors of this paper present an inverse rendering pipeline that considers near-field indirect illumination and uses path tracing Monte Carlo sampling. They demonstrate state-of-the-art performance in inter-reflection decomposition by introducing radiance consistency constraints between implicit neural radiance and path tracing results.

I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs: I^2-SDF is a neural radiance field-based framework that jointly estimates shapes, incident radiance, and materials for indoor scene reconstruction and editing. The radiance field is decomposed into spatially-varying materials through differentiable Monte Carlo raytracing, which enables photorealistic scene relighting and editing applications.

DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering: DANI-Net is an inverse rendering framework for uncalibrated photometric stereo problems. It incorporates differentiable shadow handling and anisotropic reflectance modeling in its design.

Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting: The proposed framework combines NeRF and CNNs for outdoor scene relighting through intrinsic image decomposition. NeRF provides richer and more reliable pseudo-labels for CNN training to predict interpretable lighting parameters, which enable realistic relighting.

NeuFace: Realistic 3D Neural Face Rendering from Multi-view Images: NeuFace introduces an approximated BRDF integration and a low-rank prior to create accurate and physically-meaningful 3D facial representations using neural rendering techniques. The method incorporates neural BRDFs into physically based rendering, allowing for the capture of complex facial geometry and appearance clues.

Neural Fields meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes: The presented method jointly reconstructs geometry, materials, and HDR lighting from posed RGB images. An explicit mesh is used to model high-order lighting effects such as shadows. The method disentangles complex geometry and materials from lighting effects for photorealistic relighting and virtual object insertion.

Multi-view Inverse Rendering for Large-scale Real-world Indoor Scenes: TexIR proposes a Texture-based Lighting representation of indoor scenes that models direct and infinite-bounce indirect lighting. A hybrid lighting representation with precomputed irradiance helps with efficiency and material optimization noise. The method enables mixed-reality applications such as material editing, novel view synthesis, and relighting.

Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting: The paper presents a 3D human reconstruction framework that adds a visibility field to the occupancy and albedo fields. The discretized visibility is predicted from coupled 3D depth and 2D image features, and a TransferLoss is proposed to improve the alignment between the visibility and occupancy fields. The proposed method improves reconstruction accuracy and achieves relighting comparable to ray-traced ground truth.

WildLight: In-the-wild Inverse Rendering with a Flashlight: The authors propose a photometric approach to inverse rendering for unknown ambient lighting. The approach exploits a smartphone’s flashlight to produce a minimal light source and decomposes image intensities into a static appearance that corresponds to ambient flux and a dynamic reflection induced by the flashlight.

TensoIR: Tensorial Inverse Rendering: TensoIR proposes an inverse rendering approach based on tensor factorization and neural fields, allowing for efficient and physically-based model estimation for multi-view images in unknown lighting conditions. This provides photorealistic novel view synthesis and relighting results.
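
TensoIR inherits TensoRF's vector-matrix factorization, where a 3D field is stored as three feature planes plus three feature lines instead of a dense voxel grid. A minimal query sketch (nearest-neighbor lookup for brevity; real implementations use (bi)linear interpolation, and this is not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 128, 8          # grid resolution, number of components
planes = rng.normal(size=(3, R, N, N))   # XY, XZ, YZ feature planes
lines  = rng.normal(size=(3, R, N))      # Z, Y, X feature lines

def density(xyz):
    """Vector-matrix factorized field lookup: each component is a
    plane value times the matching line value, summed over components.
    xyz is assumed to lie in [0, 1)^3."""
    i, j, k = np.clip((xyz * N).astype(int), 0, N - 1)
    return (planes[0, :, i, j] * lines[0, :, k]
          + planes[1, :, i, k] * lines[1, :, j]
          + planes[2, :, j, k] * lines[2, :, i]).sum()

print(density(np.array([0.3, 0.7, 0.5])))
```

The memory win is the point: three N x N planes plus three N-vectors per component instead of an N^3 grid.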

Other

Neural Fields can be used to learn a differentiable and bijective lens with Neural Lens Modeling.

Several works with excellent results in various fields.

Raw Image Reconstruction with Learned Compact Metadata: The paper proposes a framework to learn a compressed latent representation of raw images that requires less metadata than commonly compressed raw image files. The method features an sRGB-guided context model with improved entropy estimation strategies, and the compressed representation enables allocating more bits to important regions.

NeRF-Supervised Deep Stereo: This paper proposes a novel framework for training deep stereo networks without ground-truth by using a combination of Neural Rendering and NeRF-supervised training. The rendered stereo triplets are used to compensate for occlusions and depth maps as proxy labels. The proposed model shows a significant improvement over existing self-supervised methods on the Middlebury dataset.

AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation: AnyFlow generates optical flow accurately by representing it as a coordinate-based representation. It can accurately estimate flow from images of various resolutions and performs better in detail preservation of tiny objects than previous models when given low-resolution inputs.

Neural Lens Modeling: NeuroLens is an end-to-end optimizable model of lens distortion and vignetting that supports both point projection and ray casting. It enables pre-capture and post-reconstruction calibration, outperforms standard calibration methods, and remains easy to use.
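
To get intuition for an invertible lens model, here is the classical analogue: a parametric radial distortion with its fixed-point inverse. NeuroLens replaces such hand-crafted models with a learned, invertible distortion field, so treat this purely as a conceptual sketch:

```python
import numpy as np

def distort(p, k1=-0.2, k2=0.05):
    """Forward radial distortion on normalized 2D points: scale each
    point by a polynomial in its squared radius."""
    r2 = (p ** 2).sum(-1, keepdims=True)
    return p * (1 + k1 * r2 + k2 * r2 ** 2)

def undistort(q, k1=-0.2, k2=0.05, iters=20):
    """Invert the distortion by fixed-point iteration: repeatedly
    divide out the radial factor evaluated at the current estimate."""
    p = q.copy()
    for _ in range(iters):
        r2 = (p ** 2).sum(-1, keepdims=True)
        p = q / (1 + k1 * r2 + k2 * r2 ** 2)
    return p

pts = np.array([[0.3, 0.4], [-0.5, 0.2]])
assert np.allclose(undistort(distort(pts)), pts, atol=1e-6)
```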

NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation: In NeuWigs, two stages are used to model human hair for virtual reality: the first learns a latent space of 3D hair states and the second performs temporal hair transfer. The learned model outperforms the state-of-the-art and can create new hair animations without additional observations.

Neural Voting Field for Camera-Space 3D Hand Pose Estimation: NVF unifies the two-stage process used traditionally in hand pose estimation, directly predicting 3D dense local evidence and global hand geometry via point-wise voting of 3D points in the camera frustum to alleviate 2D-to-3D ambiguities.

JacobiNeRF: NeRF Shaping with Mutual Information Gradients: JacobiNeRF learns to encode correlation patterns between scene entities by maximizing their mutual information, which improves label propagation in sparse-label regimes for semantic and instance segmentation.

pCON: Polarimetric Coordinate Networks for Neural Scene Representations: Polarimetric Coordinate Networks (pCON) is an architecture designed to preserve polarimetric information, which current state-of-the-art models do not consider, and to remove the artifacts that coordinate network architectures create when reconstructing three polarimetric quantities of interest.

Neural Scene Chronology: The proposed scene representation Space-Time Radiance Field (STRF) can model discrete scene-level changes as piece-wise constant temporal step functions. By using this representation, the proposed method can reconstruct a time-varying 3D scene model from internet imagery while separating scene-level and illumination changes.
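
The step-function encoding is simple to write down: time is embedded with one Heaviside step per threshold, so the features are piece-wise constant and appearance can only change at discrete times. A sketch with evenly spaced thresholds (the even spacing is my assumption for illustration):

```python
import numpy as np

def step_encoding(t, num_steps=16):
    """Piece-wise constant temporal encoding: one binary step feature
    per threshold. Features stay constant between thresholds, so the
    conditioned radiance can only change at discrete times, which
    biases the model toward scene-level changes rather than smooth
    per-image variation."""
    thresholds = np.linspace(0.0, 1.0, num_steps, endpoint=False)
    return (t >= thresholds).astype(np.float32)

print(step_encoding(0.40))   # the thresholds below t=0.40 fire
```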

ORCa: Glossy Objects as Radiance Field Cameras: The proposed method uses glossy objects to recover the 5D environment radiance field visible to them by conversion into radiance-field cameras, which can be used to create noise-free images in real-time and to synthesize novel viewpoints. This method can also image around occluders in a scene while estimating object geometry, radiance, and the radiance field.

OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields: OReX uses a Neural Field as an interpolation prior to reconstruct 3D shapes from planar cross-sections. The approach involves iterative estimation architecture and a hierarchical input sampling scheme. A regularization scheme is employed to alleviate gradient ripples. OReX outperforms previous methods and scales well with input size.

Efficient View Synthesis and 3D-based Multi-Frame Denoising with Multiplane Feature Representations: The authors propose a 3D-based multi-frame denoising method that outperforms 2D-based methods with lower computational requirements. The approach extends the multiplane image framework with a learnable encoder-renderer pair manipulating multiplane representations in feature space for better performance.

NeRF-RPN: A general framework for object detection in NeRFs: NeRF-RPN is an object detection framework that directly operates on NeRF to detect objects in 3D. It exploits a novel voxel representation and multi-scale 3D neural volumetric features to regress the 3D bounding boxes of objects without rendering the NeRF at any viewpoint. A benchmark dataset with both synthetic and real-world data is also provided.

vMAP: Vectorised Object Mapping for Neural Field SLAM: vMAP is a dense SLAM system using MLPs to represent objects, enabling efficient, incrementally-built models without 3D priors. The authors show this approach to be more efficient and effective than previous neural field SLAM systems.

PersonNeRF: Personalized Reconstruction from Photo Collections: PersonNeRF builds a personalized neural volumetric 3D model from a collection of photos of a subject captured across multiple years, enabling rendering of the subject with arbitrary combinations of viewpoint, body pose, and appearance. It addresses sparse observations by recovering a canonical T-pose neural volumetric representation that allows appearance to change across observations while sharing a pose-dependent motion field across all of them.

Differentiable Shadow Mapping for Efficient Inverse Graphics: The authors show that pre-filtered shadow mapping can be combined with existing differentiable rasterizers to allow efficient shadow computation for inverse graphics problems. Such a technique has faster convergence than differentiable light transport simulation and allows for better results in implicit 3D reconstruction.

Depth Estimation from Indoor Panoramas with Neural Scene Representation: The proposed method for depth estimation from multi-view indoor panoramic images with the Neural Radiance Field technology outperforms previous works by a large margin in quantitative and qualitative evaluations. Two networks were developed to learn the Signed Distance Function for depth measurement and the Radiance Field from panoramas, respectively, as well as a novel spherical position embedding scheme. A geometric consistency loss leveraging surface normal further refines depth estimation.
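
For intuition, a spherical position embedding can be as simple as Fourier features over the panorama's polar and azimuth angles, making the encoding periodic where the sphere wraps around. The exact scheme in the paper may differ; this is a hypothetical sketch:

```python
import numpy as np

def spherical_embedding(theta, phi, n_freqs=4):
    """Fourier features of the polar (theta) and azimuth (phi) angles
    of a panorama pixel. Using sin/cos keeps the encoding continuous
    across the phi = 0 / 2*pi seam, unlike raw pixel coordinates."""
    feats = []
    for k in range(n_freqs):
        f = 2.0 ** k
        feats += [np.sin(f * theta), np.cos(f * theta),
                  np.sin(f * phi),   np.cos(f * phi)]
    return np.stack(feats, axis=-1)

emb = spherical_embedding(np.array(0.5), np.array(1.2))   # (16,)
```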

Neural Kernel Surface Reconstruction: This paper introduces enhancements over the Neural Kernel Field (NKF) method. The new method, SparseVox, uses compactly supported kernel functions, making it robust to noise, trainable on any dataset of dense oriented points, and capable of reconstructing millions of points in a few seconds. It also outperforms NKF for reconstructing scenes, objects, and outdoor environments.

Renderable Neural Radiance Map for Visual Navigation: The Renderable Neural Radiance Map (RNR-Map) is a grid structure that stores latent codes and extracts the radiance field of images from the grid positions. The structure serves as a visual guideline for efficient navigation and localization, supporting camera tracking, visual localization, and navigation that are fast and robust to environmental changes and actuation noise.

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis: SPARTN uses NeRFs to synthetically inject corrective noise into visual robotic manipulation policies, eliminating the need for expert supervision or additional interaction. It improves success rates by 2.8x over imitation learning without augmentation and even outperforms some methods that use online supervision.
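
The augmentation loop is easy to sketch: perturb a recorded camera pose, render the novel view with the NeRF, and label it with an action that cancels the perturbation. Everything below (the renderer stub, the translation-only noise, the correction rule) is a simplified illustration, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def render_nerf(pose_t):
    """Hypothetical NeRF render of the scene from a perturbed
    eye-in-hand camera pose (stub for illustration)."""
    return np.zeros((64, 64, 3)) + pose_t.sum()

def augment_demo(obs_pose_t, expert_action, noise_scale=0.01, k=4):
    """Corrective augmentation: perturb the recorded pose, render the
    novel view, and pair it with an action that first undoes the
    perturbation, steering the policy back onto the demonstration."""
    out = []
    for _ in range(k):
        noise = rng.normal(scale=noise_scale, size=3)
        img = render_nerf(obs_pose_t + noise)
        corrective = expert_action - noise   # cancel the drift
        out.append((img, corrective))
    return out

aug = augment_demo(np.zeros(3), np.array([0.0, 0.0, 0.01]))
```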

GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields: GazeNeRF is a 3D-aware method for gaze redirection that models the face and eye volumes separately, redirects the gaze by rotating the eye volume with a 3D rotation matrix, and composes both volumes into the final 2D image.

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects: Neural Object Field (NOF) combines 6-DoF tracking and 3D reconstruction for arbitrary objects, with learning and optimization done at the same time. It produces a high-fidelity reconstruction of objects even without visual textures and performs well in sequences with large pose changes.

Novel-view Acoustic Synthesis: ViGAS learns to synthesize the sound of an arbitrary point in 3D space from audio-visual input. The authors introduce a novel-view acoustic synthesis task, which is addressed by learning to reason about the spatial acoustics cues. To enable this work, two large-scale multi-view datasets have been collected: one synthetic and one real.

NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction from Multi-view Images: NEF is a learned implicit curve representation based on neural networks. It is optimized with view-based rendering loss and is able to output 3D feature curves without relying on 3D geometric operators or cross-view correspondence. On a synthetic benchmark, NEF outperforms existing methods.

NeuralPCI: Spatio-temporal Neural Field for 3D Point Cloud Multi-frame Non-linear Interpolation: NeuralPCI introduces an end-to-end 4D spatio-temporal neural field for 3D point cloud interpolation, extrapolation, morphing, and auto-labeling that handles the large non-linear motions of both indoor and outdoor scenarios. It achieves state-of-the-art performance on the DHB (Dynamic Human Bodies) and NL-Drive datasets.

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields: CaFi-Net can produce a canonical coordinate-based implicit representation of an object category without the need for pre-aligned datasets, using a Siamese network architecture for category-level canonicalization. It extracts features from radiance fields and estimates canonical fields with consistent 3D pose. The method is tested on a dataset of 1300 NeRF models across 13 object categories and compared to point cloud-based methods.

Semantic Ray: Learning a Generalizable Semantic Field with Cross-Reprojection Attention: The proposed S-Ray model combines semantic understanding of radiance and cross-view attention mechanisms to learn from multiple scenes efficiently while providing generalizable results.

EventNeRF: Neural Radiance Fields from a Single Colour Event Camera: The authors present an approach to dense and photorealistic view synthesis from a single color event stream. This is done with a neural radiance field trained in a self-supervised way using a tailored ray sampling strategy. The resulting method produces significantly denser and more visually appealing renderings than existing methods while being robust in challenging scenarios.

Panoptic Lifting for 3D Scene Understanding with Neural Fields: Panoptic Lifting presents a new method to learn 3D panoptic representations of in-the-wild scenes using a neural field representation trained on machine-generated 2D panoptic segmentation masks. The method accounts for inconsistencies in 2D instance identifiers and incorporates improvements to make it more robust to noisy labels. The approach shows improvement in scene-level PQ over state-of-the-art methods on several datasets.

Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving: ImplicitO introduces an implicit representation of street occupancy together with flow. The motion planner can query the model directly at continuous spatio-temporal points of interest, which reduces computational cost and avoids wasting computation on regions the planner never uses. The model also employs a global attention mechanism.
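
The key interface is a model that maps continuous (x, y, t) query points to occupancy and flow, so the planner pays only for the points it asks about. A toy stand-in with a random two-layer MLP (ImplicitO conditions the decoder on LiDAR and map features via attention; none of that is reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 3))

def occupancy_flow(points_t):
    """Toy implicit occupancy-flow head: map continuous (x, y, t)
    query points to an occupancy probability and a 2D flow vector."""
    h = np.tanh(points_t @ W1)
    out = h @ W2
    occ = 1.0 / (1.0 + np.exp(-out[:, :1]))   # sigmoid -> occupancy
    flow = out[:, 1:]                          # (dx, dy) per query
    return occ, flow

# the planner queries only the spatio-temporal points it cares about
queries = rng.uniform(-1, 1, size=(5, 3))      # (x, y, t)
occ, flow = occupancy_flow(queries)
```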

Conclusion

That was a lot of papers: compared to the previous CVPR (57 papers), the count has now grown to 175. CVPR 2023 alone had more NeRF papers than NeurIPS, ECCV, and CVPR 2022 combined (140).

I have the feeling my little ARCHIVE tool will come in handy this year.

Apart from the sheer number of papers, there are prominent trends. To no one's surprise, leveraging NeRFs for anything generative is a substantial new area, showing that NeRF and its extensions are a mighty tool for the task. I also noticed many papers in the decomposition area - very close to my heart. This is great, and here, too, NeRFs mainly play the role of a tool in the reconstruction process. There are also several papers in the SLAM field. So while Frank Dellaert originally termed it the NeRF Explosion, we may now be entering the time of the NeRFusion, where NeRF becomes a building block in many different areas.

Especially with the current speed of research and new trendy fields constantly emerging, it is exciting to see where NeRF will head next :).
