VideoPoet: A large language model for zero-shot video generation (research.google)
126 points by fchyan on Dec 19, 2023 | 40 comments


The results look very impressive. The prompting, however, is a bit weird - there are suspiciously many samples with an "8k" suffix, presumably to get more photorealistic results? I really don't like that kind of stuff, where prompting becomes more like reciting sacred incantations than actually describing what you want.


"8k HD" was a prompt engineering trick from the VQGAN + CLIP and Stable Diffusion 1.X era, since they did indeed have an impact in getting photorealism as CLIP's text encoder is funny like that. When Stable Diffusion 2.X was released with a new text encoder, it broke all these tricks and people were upset.

Here's a fun demo of the impact of prompt engineering tricks back in the VQGAN + CLIP days: https://imgur.com/a/SnSIQRu

Odd to see the same trick work on a completely new text encoder though.


Prompting has always been an esoteric technological spell full of "unreal engine 5" and "8k uhd cinematic"


Hyper realistic. Kodak candid shot. Kodachrome. Stomach ache. Grotesque. Smoked meats


Large fries, chocolate shake?


Coming from SD prompting, you quickly get used to tons of different terms to maximize the quality of your output, and it's not straightforward to know which ones are signal and which are noise. It all depends on the training data.


you don't have to do that stuff; you just start doing it because someone else got a better result than you

also it is very easy for software to concatenate strings for you behind the scenes before sending it to the model
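
For example, here's a minimal sketch of that kind of silent prompt augmentation (the suffix and the build_prompt function are purely illustrative, not any particular product's code):

  # Hypothetical wrapper: the user never sees the appended "quality" tokens.
  QUALITY_SUFFIX = ", 8k uhd, cinematic, highly detailed"

  def build_prompt(user_prompt: str) -> str:
      # Append the boilerplate modifiers before the prompt goes to the model.
      return user_prompt.rstrip() + QUALITY_SUFFIX

  print(build_prompt("a panda riding a skateboard"))
  # -> a panda riding a skateboard, 8k uhd, cinematic, highly detailed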


> you don't have to do that stuff; you just start doing it because someone else got a better result than you

This is the point where humans possibly 'hallucinate' ;)


Always research, never a real product.

Back to using Runway.

Google is going to have to start getting courageous and really productize the work, and let the real world interact and decide.

At this point these academic exercises aren’t cutting it and the courageous AI companies are beating them.


I think the real takeaway is that RunwayML has no moat. Pika Labs raised $50M on the same product. In the same month, both Stability and Facebook revealed their text-to-video models. AnimateDiff is going to kick the pants off of all of them. And I can count two dozen papers - a lot of them with code - that do the same thing.

Edge models are no longer a competitive edge. Look at companies like ElevenLabs. Margins eaten to zero by a dozen companies nipping at the heels, raising on the same terms. Building the same features.

The magic in weights is gone. These companies are all clones.

It might sound like this runs counter to my point, but in actuality, think about all of the competition this opens the playing field to. A company starting next year could easily outdo RunwayML.


Ever since Gemini's demonstration, I've assumed that all the promotional cases Google presents are greatly exaggerated, especially since they don't offer trials.


The restaurant booking AI demo never materializing put me off of Google's PR stunts.


All of it is really cool, but the image-to-video generation is especially impressive; animating statue images seems really useful, as does bringing other static imagery to life.

People with great imagination are going to become sought after in the future as Imagination Architects who can put this sort of tech to good use.


This is an example of where AI improves human employment, rather than "destroying" it.

Now we can create new works by people who have the vision or imagination, but not the skill to render them in some medium. That specialisation will improve the quality of the work since it increases the pool of potential participants.

Of course artists' jobs will change, but they will not disappear.


Since they didn't release any source code, weights, or API, no one will be able to use it until someone reproduces the work from scratch from the papers.


> Imagination Architects

I bet a bunch of those old-fashioned "Artists" have a good imagination; maybe they'll use it?


The AI will add their artistic distinctiveness to its own. Their work will be adapted to service it. Resistance is futile.



Google has impressive demos, but they don’t always translate into practical products that we can use in our daily lives. And sometimes, their demos are not as realistic or honest as they seem, like the recent Gemini case.


Indeed. In addition to killing products, Google also seems to have become very good at demoing AI products that are never meaningfully released.


This is how language modeling evolved - it used to be able to only output 5-10 words that made sense at a time. Now we get 5-10 seconds of video at a time.


Given that Hollywood edits together a bunch of 5-10 second sequences, it's game over once planning AI that can compose minutes of such sequences matures.


Hands and feet being particularly difficult for current AI means that humanity still needs to be part of the equation to get good results.


For now


Does anyone know what "zero-shot" means in this context? Even the blog post doesn't mention it outside the title.


That would most likely mean that the video is the result of a prompt without any example.

Illustrated examples of how Google does it here: https://blog.research.google/2023/11/zero-shot-adaptive-prom...

EDIT: corrected the answer for factual accuracy.


I wonder why "zero-shot" rather than "one-shot".


I was indeed wrong; this is technically called zero-shot prompting.

Zero-shot prompting here means prompting without an example, e.g. "show me a panda on a skate". The opposite would be few-shot prompting, e.g. "write me a limerick like these 5 examples".
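
Roughly, the difference is just in how the prompt is built (the strings here are made-up illustrations):

  # Zero-shot: only the task description, no worked examples.
  zero_shot_prompt = "Show me a panda on a skateboard."

  # Few-shot: the same kind of request, prefixed with examples to imitate.
  examples = [
      "There once was a cat from Peru...",
      "A coder who stayed up too late...",
  ]
  few_shot_prompt = (
      "Write me a limerick like these examples:\n"
      + "\n".join("- " + e for e in examples)
      + "\nNow write one about a panda."
  )
  print(few_shot_prompt)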


In addition, there are some tasks that the model can do but was never trained on; see the newly added paper link on the website.


I interpret it as meaning that VideoPoet has not seen the things prompted. One-shot means you show it one example of something and it generates equivalents.

However, I find it very misleading, as the training data is most likely gigantic, so it's not very accurate to call it zero-shot.


Author here. We demonstrate zero-shot capability on a few tasks by chaining smaller tasks together. For example, the model was never trained on text-to-audio but we can do it by generating text-to-video followed by video-to-audio.

We just added a link to the paper on the website; you can read more about it there.
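
Conceptually the chaining is just feeding one task's output into the next; a rough sketch with made-up function names (this is not the actual model interface):

  # Illustrative stubs only -- not a released API.
  def text_to_video(prompt):
      """Generate a video clip from a text prompt."""
      ...

  def video_to_audio(video):
      """Generate an audio track conditioned on a video clip."""
      ...

  def text_to_audio(prompt):
      # Neither step was trained on text-to-audio, but chaining them
      # yields the new capability "zero-shot".
      video = text_to_video(prompt)
      return video_to_audio(video)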


Ah, thanks, that makes sense.


Why should I care about this? I don't care what Google has hidden away. The best AIs are always going to be for the most privileged.


All these prompt-to-video models (VideoPoet, Pika, Runway, etc.) are going to turn TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.

But the real game-changer is going to be on-demand personal entertainment based on a prompt, and all the pieces are falling into place.


If it's entertaining, great. If it isn't, then why would it do any better than all the other crappy videos? Just like Photoshop doesn't magically make something look good, neither does this. You still need to have an interesting idea that people would want to watch. At that point, this is just a tool that helps you achieve that idea faster.


> You still need to have an interesting idea that people would want to watch.

The point is that it is something you alone want to watch, but I agree that it needs to be entertaining. I would personally want a Grok-like AI to be generating the dialog rather than ChatGPT.

And taking it a step further, there's no reason you couldn't stop and ask for a new story if you hated it, or make it interactive where you prompt the story in a direction that interests you. At the end, if you enjoyed it, save and share? The future is what we make it...


>TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.

Media is not real in the first place, including internet content. Toxicity could be argued similarly. If realness and wholesomeness are the targets, they are already greatly missed.


If a model generates video in the woods but there is no one to see it, does it really generate video?

Seriously, I'm sure it's awesome creating all the papers but the real test is making it possible for people to use the things. Google seems to be massively failing at that.


Does it have anything to do with Google Research not having a product arm that can ship as fast as OpenAI does?


It has a lot to do with Google. MS Research is a pretty good example of how to turn industrial research into products.



