During training, they synthesise new viewpoints, like those you would use for a turntable animation. Then they use a diffusion model to optimise these frames for visual coherence. Unsurprisingly, the resulting turntable animation then looks more visually coherent than the results from other models.
In effect, I believe what happens here is that where the actual data is ambiguous, they use the diffusion model's ability to hallucinate to fill in details (details which, therefore, weren't in the training data).
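For intuition, here's a rough sketch of what that kind of loop could look like (my guess, not the paper's actual algorithm): render a novel view from the current reconstruction, let a pretrained diffusion model "repair" it, and fit the reconstruction to the repaired frame. `renderer`, `diffusion`, `add_noise`, `denoise`, and `camera_poses` are all hypothetical placeholders.

    # Hedged sketch of diffusion-guided view refinement; NOT the paper's
    # method. renderer, diffusion, camera_poses are hypothetical stand-ins.
    import torch
    import torch.nn.functional as F

    def refine_with_diffusion_prior(renderer, diffusion, camera_poses, steps=1000):
        opt = torch.optim.Adam(renderer.parameters(), lr=1e-3)
        for step in range(steps):
            pose = camera_poses[step % len(camera_poses)]
            rendered = renderer(pose)  # current guess for this novel view

            with torch.no_grad():
                # Noise the render partway, then denoise it: the diffusion
                # prior fills in whatever the input views left ambiguous.
                noised = diffusion.add_noise(rendered, t=0.5)
                target = diffusion.denoise(noised, t=0.5)

            # Pull the reconstruction towards the "repaired" frame.
            loss = F.mse_loss(rendered, target)
            opt.zero_grad()
            loss.backward()
            opt.step()

Any detail that shows up this way comes from the diffusion prior rather than from the input views, which is exactly the point below.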
But since diffusion models are known to memorise things (like the Getty watermark) and were trained on most of the internet, there is also a chance that the diffusion model memorised an actual frame of the original training data that was withheld during evaluation. So ground-truth data may be leaking into the test results, I believe.
It looks like they use their own dataset and should be able to avoid that kind of information leakage.
From their paper:
"Our base diffusion model is a re-implementation of the Latent Diffusion Model [42] that has been trained on an internal dataset of image-text pairs with input resolution 512×512×3 "
They also use multi-view datasets during training, but presumably they haven't included those in the diffusion pretraining.
"Training Dataset To learn a generalizable diffusion prior for novel view synthesis, we train on a mixture of the synthetic Objaverse [10] dataset and three real-world datasets: CO3D [38], MVImgNet [64], and RealEstate10K"
"For CO3D and RealEstate10K, we select the input views evenly from all the frames and use every 8th of the remaining frames for evaluation."
To me that sounds like the diffusion model had access to roughly 87% of all frames to memorise. Reconstructing from 3 views + diffusion is then closer to using those 3 views to recall the memorised 100 views and using those for the reconstruction.
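Back-of-the-envelope, with a hypothetical sequence length:

    # Rough arithmetic behind the ~87% figure (frame count is hypothetical).
    num_frames = 200          # hypothetical sequence length
    num_input_views = 3       # views given at reconstruction time

    remaining = num_frames - num_input_views   # frames left after the inputs
    eval_frames = remaining // 8               # every 8th held out for eval
    seen_in_training = num_frames - eval_frames

    print(f"{seen_in_training / num_frames:.1%}")   # -> 88.0%

In the limit this tends to 7/8 ≈ 87.5%, which is where the 87% above comes from.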
Is there any research out there that can be used to reconstruct a fairly accurate model for CAD purposes? Or is classic photogrammetry still the way to go there?
"SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering" has the explicit goal to make accurate meshes from gaussian splats. I haven't seen anyone use it yet, it came out in November.
It's difficult to get a triangle mesh from a NeRF, if that's what you're asking. I believe the current technique is just ray marching the NeRF at various points, which will cause some loss. NVIDIA published a paper doing that last year. There isn't an elegant way to convert it yet, AFAIK. Then again, my information is about 4 months out of date.
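For reference, the common route I've seen (a slight variation on the "sampling at various points" above) is to evaluate the density field on a regular grid and run marching cubes over it; a minimal sketch, where `query_density` is a hypothetical handle into whatever NeRF implementation you have:

    # Minimal sketch: density grid + marching cubes. The fixed-resolution
    # sampling step is exactly where the loss mentioned above comes from.
    import numpy as np
    from skimage.measure import marching_cubes

    def nerf_to_mesh(query_density, resolution=256, bound=1.0, level=25.0):
        # Sample the density field on a regular grid inside [-bound, bound]^3.
        xs = np.linspace(-bound, bound, resolution)
        pts = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
        density = query_density(pts.reshape(-1, 3)).reshape((resolution,) * 3)

        # Extract the isosurface; the density threshold is scene-dependent.
        verts, faces, _, _ = marching_cubes(density, level=level)

        # Map voxel indices back to world coordinates.
        verts = verts / (resolution - 1) * (2 * bound) - bound
        return verts, faces

Anything finer than the grid resolution is lost, and the threshold has to be tuned per scene.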
Personally I don't think NeRFs are an elegant way of representing scenes; I'd prefer something more structured than a blob of weights. But maybe it's still a good intermediary for going from images to the final form. I'm far from an expert.
Weights are just numbers; essentially, by using a neural network you are telling the system to "find the best way to represent the scene with a budget of X numbers/parameters". Modern NeRFs like instant-ngp also use some grid representations. I guess Gaussian Splatting is slightly more geometrically appealing because you get points around the surfaces that you are trying to model. These points are, however, not guaranteed to be exactly on the surface, which additional surface losses (e.g. NeuSG) try to address.
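To make the "budget of numbers" point concrete, here's a toy NeRF-style MLP (deliberately stripped down, no positional encoding, so this exact model would reconstruct scenes poorly; it's only illustrative). The entire scene lives in its weights:

    # Toy NeRF-style MLP: position in, (density, colour) out; the scene
    # is "compressed" into roughly 68k weights.
    import torch
    import torch.nn as nn

    class TinyNeRF(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),   # density + RGB
            )

        def forward(self, xyz):
            out = self.mlp(xyz)
            return out[..., :1], torch.sigmoid(out[..., 1:])  # density, rgb

    budget = sum(p.numel() for p in TinyNeRF().parameters())
    print(f"scene budget: {budget} parameters")   # ~68k numbers total

Training then just searches for the best scene representation within that fixed parameter budget.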
This was my next assumption as well. However, good luck getting a src tree working with it. I'm dying to try some of this stuff, but I don't have GPU farms available.