Given the complete lack of any actual details about performance, I would hazard a guess that this approach is barely realtime, requires top hardware, and/or delivers an unimpressive fps. I would love to get more details though.
Gaussian splats can pretty much be rendered in any off-the-shelf 3D engine with reasonable performance, and the focus of the paper is generating the splats, so there's no real reason for them to mention runtime details.
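To give a feel for why rendering them is the cheap part, here's a toy CPU sketch of the core idea (my numpy simplification, not the paper's code or a real engine integration): pinhole-project each splat, sort by depth, and alpha-composite a 2D Gaussian footprint per splat front-to-back.

```python
# Toy sketch of Gaussian-splat rendering: project, sort, alpha-blend.
# Isotropic footprints only; real renderers project the full anisotropic
# covariance and use a tile-based GPU rasterizer.
import numpy as np

def render_splats(means_cam, colors, opacities, scales, fx, fy, cx, cy, W, H):
    """means_cam: (N,3) splat centers in camera space, +z pointing forward."""
    img = np.zeros((H, W, 3))
    alpha_acc = np.zeros((H, W))              # accumulated opacity per pixel

    order = np.argsort(means_cam[:, 2])       # nearest first, for front-to-back "over"
    ys, xs = np.mgrid[0:H, 0:W]

    for i in order:
        x, y, z = means_cam[i]
        if z <= 0:
            continue                          # behind the camera
        u = fx * x / z + cx                   # pinhole projection to pixel coords
        v = fy * y / z + cy
        sigma = scales[i] * fx / z            # footprint shrinks with depth
        w = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
        a = np.clip(opacities[i] * w, 0, 1)
        contrib = (1 - alpha_acc) * a         # front-to-back "over" compositing
        img += contrib[..., None] * colors[i]
        alpha_acc += contrib
    return img
```

The production versions are just a heavily optimized form of this sort-and-blend loop, which is why off-the-shelf engines can handle it at interactive rates.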
Relightable Gaussian Codec Avatars are very, very far from your off-the-shelf splatting tech. It's fair to say that this paper is more about a way of generating them more efficiently, but in the original paper from the codec avatars team (https://arxiv.org/pdf/2312.03704) they required an A100 to run at just above 60 fps at 1024x1024.
Nothing here seems to have moved that needle.
What would practically move the needle is enough money to rent an A100 in the cloud, or even 4-6 A100s, to produce a Full HD video suitable for a regular "high quality" video call; typical video calls use about half that resolution and run at much less than 60 fps.
An A100 is $1.15 per hour at Paperspace. It's so cheap it could be profitably used to scam you out of rather modest amounts, like a few thousand dollars.
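Quick back-of-the-envelope using that rate (the GPU count is just the 4-6 figure from the comment above; actual pricing varies by provider and over time):

```python
# Cost of running a one-hour "video call" on rented A100s,
# using the $1.15/hr figure quoted above.
hourly_rate = 1.15   # USD per A100-hour
gpus = 6             # upper end of the 4-6 A100 estimate above
call_hours = 1.0

cost = hourly_rate * gpus * call_hours
print(f"~${cost:.2f} for a {call_hours:.0f}-hour call on {gpus} A100s")
# ~$6.90 -- trivial next to a payout of even a few thousand dollars.
```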
Interesting that under the "URAvatar from Phone Scan" section, the first example shows a lady with blush/flush that only appears in the center video when viewed straight on; the other angles remove it.
Seems like this would (eventually) be big for VR applications, especially if the avatar could be animated using sensors in the headset so that the expressions match the wearer's. Reminds me of the metaverse demo with Zuckerberg and Lex Fridman.
Those demo videos look great! Does anyone know how this compares to the state of the art in generating realistic, relightable models of things more broadly? For example, for video game assets?
I'm aware of traditional techniques like photogrammetry - which is neat, but the lighting always looks a bit off to me.
I don’t do video game programming, but from what I've heard about engines, lighting is controlled by the game engine and is one step in the pipeline that renders the game. Ray tracing is one such technique, where simulated light rays are traced between the light source and the 3D model to work out how the model is lit.
They are probably rendering with a simple lighting model, since in a game the lighting would be handled by a separate algorithm in the engine anyway.
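For anyone wondering what a "simple lighting model" looks like in practice, here's the classic Lambertian diffuse term with a single point light (generic graphics-101 stuff, not anything specific to this paper):

```python
# Lambertian diffuse shading from one point light: no shadows, no bounces.
import numpy as np

def lambert_shade(point, normal, albedo, light_pos, light_color):
    to_light = light_pos - point
    dist = np.linalg.norm(to_light)
    l = to_light / dist                                   # unit direction to the light
    n = normal / np.linalg.norm(normal)
    n_dot_l = max(float(np.dot(n, l)), 0.0)               # facing ratio
    falloff = 1.0 / (dist * dist)                         # inverse-square attenuation
    return albedo * light_color * n_dot_l * falloff

# Example: a surface at the origin facing up, lit from 2 units above.
c = lambert_shade(np.array([0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]),
                  np.array([0.8, 0.6, 0.5]), np.array([0.0, 2.0, 0.0]),
                  np.array([4.0, 4.0, 4.0]))
print(c)   # -> [0.8 0.6 0.5]
```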
With the computational efficiency of Gaussian splatting, this could be ground-breaking for photorealistic avatars, possibly driven by LLMs and generative audio.
This is great work, although I note that the longer you look at them, and the more examples you look at on the page, the more the wow factor drops off. The first example is exceptional, but by the time you get down to the "More from Phone Scan" video and look at any individual avatar, you find yourself deep in the uncanny valley very quickly.
I noticed that too. It also doesn't always seem to know how to map (or remove) things outside the facial region, like the hair bun in the input image, when generating the avatars.
Wow that looks pretty much solved! Is there code?
Unfortunately not yet. Also, code alone, without the training data and weights, might still require considerable effort. I also wonder how diverse their training data is, i.e. how well the solution will generalize.
I'll note that they had pretty good diversity in the test subjects shown - weight, gender, some racial diversity. I thought it was above average compared to many AI papers that aren't specifically focused on diversity as a training goal or metric. I'm curious to try this. Something tells me this is more likely to get bought and turned into a product or an offering than to be open sourced, though.
Who will use this? Handing over your photo information so someone can impersonate you in a video call to trick your family or friends?
Wear a hood and sunglasses at all times in public spaces so that it would be hard to get a good shot of you from three sides. A complex, dynamic hairdo would also help. That is, if you're at risk of being impersonated, e.g. if you possess or control something valuable. Better yet, devise non-obvious verbal protocols that you and your counter-party follow whenever a decision about something important is to be discussed over a video call. Even if that counter-party is your parent or co-founder. Maybe especially then.
Another box of nails in the coffin for the admissibility of video recordings in court, sadly.