Also recently released is "CoDeF: Content Deformation Fields for Temporally Consistent Video Processing" https://qiuyu96.github.io/CoDeF/
"We present the content deformation field (CoDeF) as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video."
You missed "the duck looks like a swan wearing a duck skin-suit" :)
I wouldn't call it entirely useless though - it makes for an interesting surreal effect. I could see something really cool coming out of this in the A Scanner Darkly movie vein.
I agree with the general sentiment that things seem to be improving over time. But it won't always be like that; at some point we'll reach an AI winter again, and then it doesn't matter if the ecosystem moved incredibly fast for N years, you can't extrapolate that to future developments.
There is all this talk about an AI winter, and somehow I just kind of doubt it. Unlike in past decades there is a ridiculous amount of money, energy, and time going into the space, and unlike 40 years ago (or even 10) there are things that can actually make money now and are not just academic research.
(See Stable Diffusion/LLaMA/ChatGPT.)
There will be businesses that actually make money on these technologies, and they will be research heavy (even a 5% improvement is a big deal) as things are still getting figured out.
I could see speed dropping back towards 2017-like rates, but I kind of doubt we will ever see a true AI winter like the '90s/early '00s.
The field is just too young with too many things as of yet untried, along with the fact that I doubt funding will dry up any time soon. (There are too many interests, from Nvidia wanting to sell more chips, to Microsoft wanting to sell more productivity, to defence, and political concerns between the US and China.)
Yes, it won't go on forever, but also this time seems qualitatively different from the past AI cycles.
(Granted I was not alive then)
> The field is just too young with too many things as of yet untried
The field has been around since the 50s with various summers and winters, with each summer having people saying it's now too big to fail, with ever-increasing resources and time being spent on it, only for it to eventually stagnate again for some time. If there is one field in computer science I wouldn't call "too young", it would be AI. The first "true" AI winter happened in the 1970s, and the second one in the late 1980s. You seem to have missed some of them by a large margin.
It's the natural movement of ecosystems that are hyped a lot. They get hyped until there is no more air, and it goes back into "building foundations" mode until someone hits gold and the cycle repeats all over again.
Not that the AI summer/winter cycle can't ever end, but people said the exact same things about how it's different this time before previous winters too. We might see plateaus after transformers and realize that we can't improve for some number of years.
I feel like that's a view from the outside looking in. There are always limits. This is a Moore's law type situation: it keeps advancing right up until it can't. That's not to say this is or isn't that case, but things only improve because magnificently smart people discover dozens of really clever little tricks. There is no guarantee that an undiscovered little trick exists in the place you hope it does.
I'm sure things will develop, but develop into flawless Midjourney-but-for-video? Literally only time will tell; it's a fool's errand to extrapolate.
One (quite convincing) theory is that anything that can be achieved by a carbon-based neural network (e.g. the human brain) can also be achieved by a silicon-based neural network. The hardware may change, but the hardware's software expressiveness shouldn't be affected, unless there is a fundamental chemistry constraint.
Since human brains during dreams (lucid or otherwise) can generate coherent scenes, and transform individual elements in a scene, diffusion-based models running on CPUs/GPUs should eventually be able to do the same.
> One (quite convincing) theory is that anything that can be achieved by a carbon-based neural network (e.g. the human brain) can also be achieved by a silicon-based neural network.
That the human brain is exactly equivalent in function to our current model of a neural network is a huge, unproven hypothesis.
It is not warranted, but it is the logical next step. You're not going to get Hollywood-quality generated video from one day to the next. See MJ results one year apart.
Indeed, technically that might not be possible due to the probabilistic nature of these models and may require a whole different technology. But one thing for sure is that enough labour and capital is going into it that the chances are not small.
Video-to-video is not too special; we've had temporally stable solutions for it for some time now, and they're even in commercially available apps. The real test is true text-to-video, which is much harder.
To my knowledge, the only open-source solution that works well for text-to-video is Zeroscope v2 XL, and v3 is coming soon. v2 is already on par with RunwayML's Gen-2, while v3 is better.
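If you want to try it, the Zeroscope v2 checkpoints run through the standard Hugging Face diffusers text-to-video pipeline, roughly like this (parameters from memory, and the `.frames` return shape has changed between diffusers versions, so treat it as a sketch rather than gospel):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Zeroscope v2 576w base checkpoint; the XL checkpoint is usually applied afterwards
# as an upscaling pass over these frames.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
).to("cuda")

prompt = "a duck swimming across a pond at sunset"
result = pipe(prompt, num_frames=24, height=320, width=576, num_inference_steps=40)

# Depending on the diffusers version, the frames may be nested one level deeper
# (result.frames[0]); adjust if the export complains about the input shape.
export_to_video(result.frames, "duck.mp4")
```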
Runway runs circles around Zeroscope. But that shouldn't be a blocker for Zeroscope to catch up and/or surpass it. Both Runway and Pika Labs deliver better quality at the moment. Evidence: struggling with all three of them.
Runway outputs the best video quality and options for video length, whilst Pika delivers better fidelity to an input image as inspiration. All of this is subject to change without notice.
In the car example the wheels don't spin, which is interesting.
The original frames have different wheel angles, so a simple text-prompted img2img frame-by-frame approach would preserve the motion, but at the cost of inter-frame consistency.
Here you get a consistent look for the scene and no rapid transitions, but the wheel motion is gone. A sketch of the naive per-frame approach I mean is below.
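Roughly like this, using the Hugging Face diffusers img2img pipeline (the model id, prompt, and strength are just placeholders, not anyone's actual settings):

```python
import torch
from pathlib import Path
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Frames previously extracted from the source video (e.g. with ffmpeg) into ./frames/.
source_frames = [Image.open(p).convert("RGB").resize((512, 512))
                 for p in sorted(Path("frames").glob("*.png"))]

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

styled = []
for frame in source_frames:
    # Each frame is restyled independently: the source motion (spinning wheels, etc.)
    # survives, but nothing ties one frame's denoising to the next, hence the flicker.
    out = pipe(
        prompt="a car driving, watercolor painting style",
        image=frame,
        strength=0.5,
        generator=torch.Generator("cuda").manual_seed(0),  # a fixed seed tames flicker a little
    ).images[0]
    styled.append(out)
```

CoDeF inverts the tradeoff: one canonical image is edited, so the look is consistent, but per-frame motion that isn't captured by the deformation field (like the wheel rotation) gets averaged away.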
Being able to condition on a video vs. just text massively simplifies the task, mainly because you get consistent camera motion and movement in the scene for free!
(1) with enough high-quality training data, «AI» models should be able to output H.265 / H.266 / AV1 directly, which could achieve simplicity and reduce artifacts by skipping an inferior compression step and leveraging temporal elements
(2) if AI video compression (as demoed by Nvidia) becomes standard, the training data and generated data will become [more] «AI-native», boosting these efforts by miles
Why would you want (1)? Having the raster frames is surely better for post-production. I agree that models should take a stab at compression, but I think it should be independent. At the end of the day you also don't want to be doing video compression on your GPU; using a dedicated chip for that is so much more efficient.
Lastly, you don't want to compress the same way all the time. For low latency we compress with no B-frames and a smallish GOP; with VOD we have a long GOP and B-frames are great for compression. A concrete sketch of the two configurations is below.
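To make the contrast concrete, roughly what the two x264 configurations look like driven from Python (the GOP sizes and B-frame counts are illustrative, not anyone's production settings):

```python
import subprocess

# Low-latency (live) encode: no B-frames, short GOP, zero-latency tuning.
low_latency = [
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264", "-tune", "zerolatency",
    "-bf", "0",    # disable B-frames entirely
    "-g", "30",    # short GOP: a keyframe roughly every second at 30 fps
    "low_latency.mp4",
]

# VOD encode: a long GOP plus a few B-frames buys noticeably better compression.
vod = [
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264", "-preset", "slow",
    "-bf", "3",    # allow up to 3 consecutive B-frames
    "-g", "250",   # long GOP: keyframes can be many seconds apart
    "vod.mp4",
]

subprocess.run(low_latency, check=True)
subprocess.run(vod, check=True)
```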
(2) works as long as we can again port the algorithms to dedicated hardware, which on mobile is a must for energy efficiency, for both encode and decode.
For some reason you are being downvoted, but you are right - that is exactly what I was looking for. The videos on the original page did not play for me or give any indication they were videos.
Thanks mate, I really wanted to see some demos myself, but OP's link didn't show me much. This is pretty exciting. I know it's in its infancy, but still... exciting times!
"We present the content deformation field (CoDeF) as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video."
The results also look very stable/impressive.