Depending on how you're measuring it, video generation will come in less than 12 months, or it's already here.
- In the Video Diffusion paper[1] they generate 16-frame videos at 64x64. The model was never released, but an open source implementation using an Imagen-like pipeline exists[2] (no pretrained weights are publicly available either).
- The CogVideo paper[3], which is essentially a huge transformer model, can generate 480x480 videos. The code and models are open source[4], but be warned that you need a huge GPU like an A100 to run the damned thing.
The future of text to video generation will probably be video diffusion, i.e. using 3D UNets, or more likely a much more optimized variant of the UNet. Improvements on diffusion models are happening on a daily basis, and we're probably a couple of papers away from an efficiency breakthrough that lets us generate good looking videos on a high end GPU.
It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).
What is likely to be missing is the ability to control that video in useful ways directly from prompts. There will be some kind of direction, but as anyone who has spent time doing prompt-based image generation can attest, actual control isn't there yet.
I admit this gave me pause. But I'd note my comment was pretty specific:
> It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).
> What is likely to be missing is the ability to control that video in useful ways directly from prompts.
This is using live video as a source, but I think an integrated version of this, combined with some kind of (maybe game-based) interface to script it, is achievable.
> but as anyone who has spent time doing prompt-based image generation actual control isn't there yet
It's not, but there are many efforts to fix this, and they solve most problems, though I admit they're not ideal. One way you can control generation is simply by adjusting per-token weights: if part of your prompt is being ignored, you can use weights to make the guidance put more emphasis on that part of the prompt. Another way is using img2img with a simple drawing as input; this can help the model understand, say, which color you want things to be. The best tool of all is, of course, copious amounts of inpainting and composition: eyes not the right color? Inpaint them. Messed-up hand? Inpaint. Etc. If your issue is that you can't generate a specific character or style at all, there are tools like textual inversion that can create a special token representing the rough idea, assuming you already have a couple of images that represent that idea.
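To make the per-token weighting trick concrete: it usually amounts to scaling each token's embedding before it reaches the diffusion model's cross-attention, so weighted tokens pull the guidance harder. Here's a minimal numpy sketch; the function name and the toy 4-token/8-dim shapes are made up for illustration, and real tools (e.g. the `(word:1.3)` syntax in popular UIs) add parsing and renormalization on top of this idea:

```python
import numpy as np

def weight_token_embeddings(embeddings, weights):
    """Scale each token embedding by its per-token weight.

    embeddings: (seq_len, dim) array from the text encoder
    weights:    per-token emphasis, 1.0 = unchanged
    """
    w = np.asarray(weights, dtype=embeddings.dtype)
    return embeddings * w[:, None]

# toy example: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))

# emphasize token 2 (say, "red"), de-emphasize token 3
weighted = weight_token_embeddings(emb, [1.0, 1.0, 1.3, 0.7])
```

The weighted embeddings are then fed to the denoiser in place of the originals; nothing else in the pipeline changes.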
There is a real fix, though. The big issue with these prompts is that they use the CLIP text encoder, which was trained only on image captions. That means its understanding of the world is limited to whatever is represented by captions on images found on the internet, a very limited subset of language, which limits the quality of the generated embeddings: the model is not only bad at basic language, it doesn't properly understand the relationships between words. Image models that use a large language model (LLM) as the text encoder follow the prompt much better, since LLMs generate much less noisy embeddings. An LLM's embeddings carry enough information that the diffusion model can actually spell, i.e. if you ask for a sign that says something, the sign will come out with proper spelling and font choice. Sadly there's no such model open to the public yet, but stability.ai is currently training one, and we can hopefully expect its release before Christmas.
Having an image model trained with an LLM text encoder, and then applying the tricks used to improve current models, will give you an unprecedented level of control over image generation. The next step is probably an instruction-based model that takes both an image and text as input and applies the instruction to the image. For example, you give it an image of a person and the instruction "draw this in the style of pixel art" and it'll do it for you. This is already possible with img2img, but a model that listens to instructions could extend it to more useful things like "remove the background", "add another cat", "tilt the sign 90 degrees", "make everything black and white except her dress", etc.
Could you, for example, create a simple 3D render of a very simple environment, with cubes resembling houses and blobs as trees, and a camera moving through that environment, then run video generation over it the way img2img works?
That would be great.
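Frame-by-frame img2img over a rendered camera path is roughly the loop below. This is only a sketch of the control flow: `stylize` is a stand-in for a real img2img call (a real pipeline would noise the frame up to `strength` of the schedule and denoise it conditioned on a prompt), and reusing the same seed per frame mimics the common trick of fixing the seed to reduce flicker between frames:

```python
import numpy as np

def stylize(frame, strength, rng):
    """Stand-in for an img2img call: the higher the strength,
    the less of the source frame survives in the output."""
    generated = rng.standard_normal(frame.shape)  # placeholder for model output
    return (1.0 - strength) * frame + strength * generated

def stylize_clip(frames, strength=0.4, seed=42):
    """Run img2img per frame, re-seeding each time so the 'model
    output' is consistent across frames (less temporal flicker)."""
    out = []
    for frame in frames:
        rng = np.random.default_rng(seed)  # same seed every frame
        out.append(stylize(frame, strength, rng))
    return np.stack(out)

clip = np.zeros((16, 64, 64, 3))  # 16 blank frames from the toy render
styled = stylize_clip(clip, strength=0.4)
```

At low `strength` the render's geometry (the cubes and blobs) dominates and the model only restyles it; at high `strength` the output drifts from the source, which is exactly the temporal-coherence problem per-frame approaches struggle with.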
This is not really what is being suggested; this repo just navigates the latent space between 2 prompts and generates an image at specific intervals. It creates cool effects but will never be able to generate a coherent video of, for example, "a man walking on a beach".
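For reference, "navigating the latent space between 2 prompts" typically means spherically interpolating (slerp) between two embeddings or initial noise latents and decoding an image at each step, which is why the result is a morph rather than coherent motion. A minimal slerp, not tied to any particular repo:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two flat vectors,
    t in [0, 1]; follows the arc between them instead of the chord."""
    v0n = v0 / np.linalg.norm(v0)
    v1n = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if np.isclose(theta, 0.0):  # nearly parallel: plain lerp is fine
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)  # stays on the unit arc between a and b
```

Sampling `slerp(emb_a, emb_b, t)` for t from 0 to 1 and rendering each result gives exactly the smooth morphing effect described above; no frame knows anything about the previous one, which is why it can't produce a man actually walking.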
[1] https://arxiv.org/abs/2204.03458
[2] https://github.com/lucidrains/imagen-pytorch/tree/main/image...
[3] https://arxiv.org/abs/2205.15868
[4] https://github.com/THUDM/CogVideo