VideoPoet: A large language model for zero-shot video generation (research.google)
126 points by fchyan on Dec 19, 2023 | 40 comments


The results look very impressive. The prompting, however, is a bit weird - there are suspiciously many samples with an "8k" suffix, presumably to get more photorealistic results? I really don't like that kind of stuff, where prompting becomes more like reciting sacred incantations than actually describing what you want.


"8k HD" was a prompt engineering trick from the VQGAN + CLIP and Stable Diffusion 1.X era, since they did indeed have an impact in getting photorealism as CLIP's text encoder is funny like that. When Stable Diffusion 2.X was released with a new text encoder, it broke all these tricks and people were upset.

Here's a fun demo of the impact of prompt engineering tricks back in the VQGAN + CLIP days: https://imgur.com/a/SnSIQRu

Odd to see the same trick work on a completely new text encoder though.


Prompting has always been an esoteric technological spell full of "unreal engine 5" and "8k uhd cinematic"


Hyper realistic. Kodak candid shot. Kodachrome. Stomach ache. Grotesque. Smoked meats


Large fries, chocolate shake?


Coming from SD prompting, you quickly get used to tons of different terms to maximize the quality of your output, and it's not straightforward to know which ones are signal and which are noise. It all depends on the training data.


you don't have to do that stuff; you just start doing it because someone else got a better result than you

also it is very easy for software to concatenate strings for you behind the scenes before sending it to the model
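
For example, here's a minimal sketch of that kind of silent prompt augmentation (the suffix and the build_prompt function are purely illustrative, not any particular product's code):

  # Hypothetical wrapper: the user never sees the appended "quality" tokens.
  QUALITY_SUFFIX = ", 8k uhd, cinematic, highly detailed"

  def build_prompt(user_prompt: str) -> str:
      # Append the boilerplate modifiers before the prompt goes to the model.
      return user_prompt.rstrip() + QUALITY_SUFFIX

  print(build_prompt("a panda riding a skateboard"))
  # -> a panda riding a skateboard, 8k uhd, cinematic, highly detailed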


> you don't have to do that stuff; you just start doing it because someone else got a better result than you

This is the point where humans possibly 'hallucinate' ;)


Always research, never a real product.

Back to using Runway.

Google is going to have to start getting courageous and really productize the work, and let the real world interact and decide.

At this point these academic exercises aren’t cutting it and the courageous AI companies are beating them.


I think the real takeaway is that RunwayML has no moat. Pika Labs raised $50M on the same product. In the same month, both Stability and Facebook revealed their text-to-video models. AnimateDiff is going to kick the pants off of all of them. And I can count two dozen papers - a lot of them with code - that do the same thing.

Edge models are no longer a competitive edge. Look at companies like ElevenLabs. Margins eaten to zero by a dozen companies nipping at the heels, raising on the same terms. Building the same features.

The magic in weights is gone. These companies are all clones.

It might sound like this runs counter to my point, but in actuality, think about all of the competition this opens the playing field to. A company starting next year could easily outdo RunwayML.


Ever since Gemini's demonstration, I've assumed that all the promotional cases Google presents are greatly exaggerated, especially since they don't offer trials.


The restaurant booking AI demo never materializing put me off of Google's PR stunts.


All of it is really cool, but the image-to-video generation is especially impressive; animating statue images seems really useful, as does bringing other static imagery to life.

People with great imagination are going to become sought after in the future as Imagination Architects who can put this sort of tech to good use.


This is an example of where AI improves human employment, rather than "destroying" it.

Now we can create new works by people who have the vision or imagination, but not the skill to render them in some medium. That specialisation will improve the quality of the work since it increases the pool of potential participants.

Of course artists' jobs will change, but they will not disappear.


Since they didn't release any source code, weights, or API, no one will be able to use it until someone reproduces the work from scratch from the papers.


> Imagination Architects

I bet a bunch of those old-fashioned "Artists" have a good imagination; maybe they'll use it?


The AI will add their artistic distinctiveness to its own. Their work will be adapted to service it. Resistance is futile.



Google has impressive demos, but they don’t always translate into practical products that we can use in our daily lives. And sometimes, their demos are not as realistic or honest as they seem, like the recent Gemini case.


Indeed. In addition to killing products, Google also seems to have become very good at demoing AI products that are never meaningfully released.


This is how language modeling evolved - it used to be able to only output 5-10 words that made sense at a time. Now we get 5-10 seconds of video at a time.


Given that Hollywood edits together a bunch of 5-10 second sequences, it's game over once planning AI that can compose minutes of such sequences matures.


Hands and feet being particularly difficult for current AI means that humanity still needs to be part of the equation to get good results.


For now


Does anyone know what "zero-shot" means in this context? Even the blog post doesn't mention it outside the title.


That would most likely mean that the video is the result of a prompt without any example.

Illustrated examples of how Google does it here: https://blog.research.google/2023/11/zero-shot-adaptive-prom...

EDIT: corrected the answer for factual accuracy.


I wonder why "zero-shot" rather than "one-shot".


I was indeed wrong; this is technically called zero-shot prompting.

Zero-shot prompting here means prompting without an example, e.g. "show me a panda on a skate". The opposite would be few-shot prompting, e.g. "write me a limerick like these 5 examples".
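
Roughly, the difference is just in how the prompt is built (the strings here are made-up illustrations):

  # Zero-shot: only the task description, no worked examples.
  zero_shot_prompt = "Show me a panda on a skateboard."

  # Few-shot: the same kind of request, prefixed with examples to imitate.
  examples = [
      "There once was a cat from Peru...",
      "A coder who stayed up too late...",
  ]
  few_shot_prompt = (
      "Write me a limerick like these examples:\n"
      + "\n".join("- " + e for e in examples)
      + "\nNow write one about a panda."
  )
  print(few_shot_prompt)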


In addition, there are some tasks that the model can do but was never trained on; see the newly added paper link on the website.


I interpret it as meaning that VideoPoet has not seen the things prompted. One-shot means you show it one example of something and it generates equivalents.

However, I find it very misleading, as the training data is most likely gigantic, so it's not very accurate to call it zero-shot.


Author here. We demonstrate zero-shot capability on a few tasks by chaining smaller tasks together. For example, the model was never trained on text-to-audio but we can do it by generating text-to-video followed by video-to-audio.

We just added a link to the paper on the website; you can read more about it there.
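
Conceptually the chaining is just feeding one task's output into the next; a rough sketch with made-up function names (this is not the actual model interface):

  # Illustrative stubs only -- not a released API.
  def text_to_video(prompt):
      """Generate a video clip from a text prompt."""
      ...

  def video_to_audio(video):
      """Generate an audio track conditioned on a video clip."""
      ...

  def text_to_audio(prompt):
      # Neither step was trained on text-to-audio, but chaining them
      # yields the new capability "zero-shot".
      video = text_to_video(prompt)
      return video_to_audio(video)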


Ah, thanks, that makes sense.


Why should I care about this? I don't care what Google has hidden away. The best AIs are always going to be for the most privileged.


All these prompt-to-video models (VideoPoet, Pika, Runway, etc.) are going to turn TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.

But the real game-changer is going to be on-demand personal entertainment based on a prompt, and all the pieces are falling into place.


If it's entertaining, great. If it isn't, then why would it do any better than all the other crappy videos? Just like Photoshop doesn't magically make something look good, neither does this. You still need to have an interesting idea that people would want to watch. At that point, this is just a tool that helps you achieve that idea faster.


> You still need to have an interesting idea that people would want to watch.

The point is that it is something you alone want to watch, but I agree that it needs to be entertaining. I would personally want a Grok-like AI to be generating the dialog rather than ChatGPT.

And taking it a step further, there's no reason you couldn't stop and ask for a new story if you hated it, or make it interactive where you prompt the story in a direction that interests you. At the end, if you enjoyed it, save and share? The future is what we make it...


>TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.

Media is not real in the first place, including internet content. Toxicity could be argued similarly. If realness and wholesomeness are the targets, they are already greatly missed.


If a model generates video in the woods but there is no one to see it, does it really generate video?

Seriously, I'm sure it's awesome creating all the papers but the real test is making it possible for people to use the things. Google seems to be massively failing at that.


Does it have anything to do with Google Research not having a product arm that can ship as fast as OpenAI does?


It has a lot to do with Google. MS Research is a pretty good example of how to turn industrial research into products.



