StableVideo: Text-driven consistency-aware diffusion video editing (rese1f.github.io)
210 points by satvikpendem on Aug 21, 2023 | 43 comments


Also recently released is "CoDeF: Content Deformation Fields for Temporally Consistent Video Processing" https://qiuyu96.github.io/CoDeF/

"We present the content deformation field (CoDeF) as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video."

The results also look very stable/impressive.


It's fun to watch how it blends together videos, but none of the three example videos are actually of high enough quality to be useful.

1. The boat video transforms the coast into rolling waves and looks super weird

2. The swan/duck video looks better than the others, but the lighting is obviously wrong when looking closer. Looks like a cardboard cutout bird.

3. The car video looks like a video game from the 2000s, with low quality textures and wheels not turning.

Again, super interesting to see how it can come up with these "by itself", but utterly useless at the moment.


You missed "the duck looks like a swan wearing a duck skin-suit" :)

I wouldn't call it entirely useless though - it makes for an interesting surreal effect. I could see something really cool coming out of this in the A Scanner Darkly movie vein.

I particularly liked Rusty Car in the Desert.


"Utterly useless" would be it not outputting anything at all. This is useful enough for many things.


Stability of video models has been a big issue. This is progress, not necessarily the end state.


I suppose this will be a stepping stone for it to get there.


That's what everyone seems to suppose. I don't see why it's warranted, though.


Compare where we were 1 year ago to where we are now, do the same for 5 years ago vs. now. The trajectory is pretty clear, is it not?


I agree with the general sentiment that things seem to be improving over time. But it won't always be like that, as at some point we'll reach an AI winter again, and then it doesn't matter if the ecosystem moved incredibly fast for N years; you can't extrapolate that to future developments.


There is all of this talk about an AI winter, and somehow I just kind of doubt it. Unlike in past decades there is a ridiculous amount of money, energy, and time going into the space, and unlike 40 years ago (or even 10) there are things that can actually make money now, and are not just academic research.

(See Stable Diffusion / LLaMA / ChatGPT.)

There will be businesses that actually make money on these technologies, and they will be research heavy (even a 5% improvement is a big deal) as things are still getting figured out.

I could see the pace dropping back toward 2017-like rates, but I kind of doubt we will ever see a true AI winter like the '90s/early '00s.

The field is just too young with too many things as of yet untried, along with the fact that I doubt funding will dry up any time soon. (There are too many interests, from Nvidia wanting to sell more chips, to Microsoft wanting to sell more productivity, to defence, and political concerns between the US and China.)

Yes, it won't go on forever, but also this time seems qualitatively different from the past AI cycles. (Granted I was not alive then)


> The field is just too young with too many things as of yet untried

The field has been around since the 50s with various summers and winters, with each summer having people saying it's now too big to fail, with ever increasing resources and time being spent on it, only for it to eventually stagnate again for some time. If there is one field in computer science I wouldn't call "too young", it would be AI. The first "true" AI winter happened in the mid-1970s, and the second in the late 1980s. You seem to have missed some of them by a large margin.

It's the natural movement of ecosystems that are hyped a lot. They get hyped until there is no more air, and it goes back into "building foundations" mode until someone hits gold and the cycle repeats all over again.


Not that the AI summer/winter cycle won't ever stop, but people said the exact same things about how it's different this time for previous winters too. We might see plateaus after transformers and realize that we can't improve for some number of years.


I feel like that's a view from the outside looking in. There are always limits. This is a Moore's law type situation - it keeps advancing right up until it can't. That's not to say this is or isn't that case - but things only improve because magnificently smart people discover dozens of really clever little tricks. There is no guarantee that an undiscovered little trick exists in the place you hope it does.

I'm sure things will develop, but develop into a flawless Midjourney-but-for-video? Literally only time will tell; it's a fool's errand to extrapolate.



One (quite convincing) theory is that anything that can be achieved by a carbon-based neural network (e.g. the human brain) can also be achieved by a silicon-based neural network. The hardware may change, but the hardware's software expressiveness shouldn't be affected, unless there is a fundamental chemistry constraint.

Since human brains during dreams (lucid or otherwise) can generate coherent scenes, and transform individual elements in a scene, diffusion based models running on cpu/gpus should eventually be able to do the same.


> One (quite convincing) theory is that anything that can be achieved by a carbon-based neural network (e.g. the human brain) can also be achieved by a silicon-based neural network.

That the human brain is exactly equivalent in function to our current model of a neural network is a huge, unproven hypothesis.


It is not warranted, but it is the logical next step. You're not going to get Hollywood-quality generated video from one day to the next. See MJ results one year apart.

Indeed, technically it might not be possible due to the probabilistic nature of these models, and it may require a whole different technology. But one thing is for sure: enough labour and capital is going into it that the chances are not small.


Because that was the case with generative images and audio.


All SD video results will seem useless until a creative genius figures out a hack to make them useful.


It makes me wonder if the car video might have come out better if they'd prompted to have it put in the desert, rather than in a dessert.


Did these guys just straight up solve the generative video problem?

The results are better than anything that I've ever seen.

What's the catch? Large processing times? Are the results cherry picked? Or what?

I guess it only works for video to video, but that's still amazing!


Video-to-video is not too special; we've had temporally stable solutions for that for some time now, and they're even in commercially available apps. The real test is true text-to-video, which is much harder.

To my knowledge, the only open source solution that works well for text to video is Zeroscope v2 XL, and v3 is coming soon. v2 is already on par with RunwayML's Gen-2 while v3 is better.


Runway runs circles around Zeroscope. But that shouldn't be a blocker for Zeroscope catching up and/or surpassing it. Both Runway and Pika Labs deliver better quality at the moment. Evidence: I've struggled with all three of them.

Runway outputs the best video quality and options for video length, whilst Pika delivers better fidelity to an input image used as inspiration. All of this is subject to change without notice.


In the car example the wheels don't spin, which is interesting.

The original frames have different wheel angles, so a simple text-prompted img2img frame-by-frame approach would preserve the motion, but at the cost of inter-frame consistency.

Here you get a consistent look of the scene and no rapid transitions, but the wheel motion is gone.
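
For reference, the naive frame-by-frame baseline I mean is roughly this (a sketch using Hugging Face diffusers; the prompt, strength, and so on are just illustrative, not anything from the paper):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a rusty car in the desert"
    edited_frames = []
    for frame in frames:  # frames: list of PIL images, loaded elsewhere
        # Re-seeding per frame keeps the noise identical, which tames flicker a little,
        # but each frame is still edited independently, so the look drifts.
        generator = torch.Generator("cuda").manual_seed(0)
        out = pipe(prompt=prompt, image=frame, strength=0.5,
                   guidance_scale=7.5, generator=generator).images[0]
        edited_frames.append(out)

Each frame keeps its own wheel angle, so the rotation survives, but the texture and lighting jitter between frames; StableVideo trades that the other way round.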


Being able to condition on a video vs. just text massively simplifies the task. Mainly because you get consistent camera motion and movement in the scene for free!


If I understand correctly, you have to train a custom NLA model for each video segment before actually using it.


Totally unrelated to the paper itself, but is there a GitHub Pages template to showcase a paper/project like they have?

I keep seeing this type of webpage used to promote papers but haven't found the template yet.


You can see the source of the GitHub Pages site on GitHub: https://github.com/rese1f/StableVideo/tree/web

It seems they forked from somebody else and then changed the content to match their paper.


Layman thoughts:

(1) with enough high-quality training data, «AI» models should be able to output H.265 / H.266 / AV1 directly, which could simplify the pipeline and reduce artifacts by skipping an inferior compression step and leveraging temporal elements

(2) if AI video compression (as demoed by Nvidia) becomes standard, the training data and generated data will become [more] «AI-native», boosting these efforts by miles


Why would you want (1)? Having the raster frames is surely better for post-production. I agree that models should take a stab at compression, but I think it should be independent. At the end of the day you also don't want to be doing video compression on your GPU; using a dedicated chip for that is so much more efficient. Lastly, you don't want to compress the same way all the time. For low latency we compress with no B-frames and a smallish GOP, while for VOD we have a long GOP and B-frames are great for compression.

(2) As long as we can again port the algorithms to dedicated hardware, which on mobile is a must for energy efficiency, for both encode and decode.


This looks exciting. Can't wait to see what the future holds!


Fake news. Lots and lots of fake news.


greenscreen cartoons made in the basement


Weird, it's about video but there is no example video (?)


More examples on this page:

https://github.com/rese1f/StableVideo


It's the same 3 examples shown on the web page (it's a carousel, you have to click the right/left buttons to see the others).


Yes, which is what the poster is asking for. They are seeing still images on the web page.


For some reason you are being downvoted, but you are right - that is exactly what I was looking for. The videos on the original page did not play for me or give any indication they were videos.


Thanks mate, I really wanted to see some demos myself, but OP's link didn't show me much. This is pretty exciting, I know it's in its infancy but still... exciting times!


There are a few examples at the top of the page. Check your content blocker, I guess?


I see a carousel with sample image frames. There are video examples?


You might have autoplay turned off in your browser. Those are videos, not images.


Ah, apparently. It works in Firefox on desktop Linux, but did not work on iPhone even when I clicked on them.



