The results look very impressive. The prompting, however, is a bit weird - there are suspiciously many samples with an "8k" suffix, presumably to get more photorealistic results? I really don't like that kind of stuff, where prompting becomes more like reciting sacred incantations than actually describing what you want.
"8k HD" was a prompt engineering trick from the VQGAN + CLIP and Stable Diffusion 1.X era, since they did indeed have an impact in getting photorealism as CLIP's text encoder is funny like that. When Stable Diffusion 2.X was released with a new text encoder, it broke all these tricks and people were upset.
Here's a fun demo of the impact of prompt engineering tricks back in the VQGAN + CLIP days: https://imgur.com/a/SnSIQRu
Odd to see the same trick work on a completely new text encoder though.
Coming from SD prompting, you quickly get used to tons of different terms to maximize the quality of your output, and it's not straightforward to know which ones are signal and which are noise. It all depends on the training data.
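If you want to check which tags actually move the needle for a given checkpoint, the cheapest test is rendering the same prompt with and without the suffix at a fixed seed. A rough sketch with the Hugging Face diffusers library (the checkpoint id and the tags are just examples, not a recommendation):

    import torch
    from diffusers import StableDiffusionPipeline

    # Any SD 1.x checkpoint will do; this id is just an example.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    base_prompt = "a lighthouse on a cliff at sunset"
    suffixes = ["", ", 8k HD", ", award-winning photograph", ", trending on artstation"]

    for i, suffix in enumerate(suffixes):
        # Fixed seed so the quality tag is the only variable between runs.
        generator = torch.Generator("cuda").manual_seed(42)
        image = pipe(base_prompt + suffix, generator=generator).images[0]
        image.save(f"lighthouse_{i}.png")

Eyeballing the four outputs side by side makes it pretty obvious which tags are signal and which are superstition for that particular model.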
I think the real takeaway is that RunwayML has no moat. Pika Labs raised $50M on the same product. In the same month, both Stability and Facebook revealed their text-to-video models. AnimateDiff is going to kick the pants off all of them. And I can count two dozen papers - a lot of them with code - that do the same thing.
Cutting-edge models are no longer a competitive edge. Look at companies like ElevenLabs: margins eaten to zero by a dozen companies nipping at their heels, raising on the same terms, building the same features.
The magic in weights is gone. These companies are all clones.
It might sound like this undercuts my point, but think about all the competition this opens the playing field to. A company starting next year could easily outdo RunwayML.
Ever since the Gemini demonstration, I've assumed that all the promotional examples Google presents are greatly exaggerated, especially since they don't offer trials.
All of it is really cool, but the image-to-video generation is especially impressive; animating statues and bringing other static imagery to life seems genuinely useful.
People with great imagination are going to become sought after in the future as Imagination Architects who can put this sort of tech to good use.
This is an example of where AI improves human employment, rather than "destroying" it.
Now people who have the vision or imagination, but not the skill to render it in some medium, can create new works. That specialisation will improve the quality of the work, since it increases the pool of potential participants.
Of course artists' jobs will change, but they will not disappear.
Since they didn't release any source code, weights, or API, no one will be able to use it until someone reproduces the work from scratch from the papers.
Google has impressive demos, but they don’t always translate into practical products that we can use in our daily lives. And sometimes, their demos are not as realistic or honest as they seem, like the recent Gemini case.
This is how language modeling evolved - it used to only be able to output 5-10 words that made sense at a time. Now we get 5-10 seconds of video at a time.
I was wrong indeed; this is technically called zero-shot prompting.
Zero-shot prompting here means prompting without examples, e.g. "show me a panda on a skateboard". The opposite would be few-shot prompting: "write me a limerick like these 5 examples".
Author here. We demonstrate zero-shot capability on a few tasks by chaining smaller tasks together. For example, the model was never trained on text-to-audio but we can do it by generating text-to-video followed by video-to-audio.
We just added a link to the paper on the website; you can read more about it there.
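For anyone skimming, the chaining amounts to simple composition of tasks the model already knows. A purely illustrative sketch - the function names below are hypothetical, since no public API or weights exist:

    # Purely illustrative: there is no public VideoPoet API, so these two
    # stubs stand in for the model's trained text-to-video and
    # video-to-audio modes described in the paper.

    def generate_video(prompt: str):
        """Hypothetical stand-in for the text-to-video task."""
        raise NotImplementedError("no public VideoPoet API")

    def generate_audio_for_video(video):
        """Hypothetical stand-in for the video-to-audio task."""
        raise NotImplementedError("no public VideoPoet API")

    def text_to_audio(prompt: str):
        # Zero-shot text-to-audio by chaining two trained tasks:
        # render a clip from the prompt, then add a soundtrack to that clip.
        video = generate_video(prompt)
        return generate_audio_for_video(video)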
All these prompt-to-video models (VideoPoet, Pika, Runway, etc.) are going to turn TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.
But the real game-changer is going to be on-demand personal entertainment based on a prompt, and all the pieces are falling into place.
If it's entertaining, great. If it isn't, then why would it do any better than all the other crappy videos? Just like Photoshop doesn't magically make something look good, neither does this. You still need to have an interesting idea that people would want to watch. At that point, this is just a tool that helps you realize that idea faster.
> You still need to have an interesting idea that people would want to watch.
The point is that it is something you alone want to watch, but I agree that it needs to be entertaining. I would personally want a Grok-like AI to be generating the dialog rather than ChatGPT.
And taking it a step further, there's no reason you couldn't stop and ask for a new story if you hated it, or make it interactive where you prompt the story in a direction that interests you.
At the end, if you enjoyed it, save and share? The future is what we make it...
> TikTok and other short video sites into toxic wastelands where everything and nothing is real anymore.
Media is not real in the first place, including internet content. Toxicity could be argued similarly. If realness and wholesomeness are the targets, they are already greatly missed.
If a model generates video in the woods but there is no one there to see it, does it really generate video?
Seriously, I'm sure it's awesome creating all these papers, but the real test is making it possible for people to use the things. Google seems to be massively failing at that.