Depending on how you're measuring it, video generation will come in less than 12 months, or it's already here.
- In the Video Diffusion paper[1] they generate 16-frame videos at 64x64. The model was never released, but an open source implementation using an Imagen-like pipeline exists[2] (no pretrained weights are publicly available either).
- The CogVideo paper[3], which is essentially a huge transformer model, can generate 480x480 videos. The code and models are open source[4], but be warned that you need a huge GPU like an A100 to run the damned thing.
The future of text to video generation will probably be video diffusion, i.e. using 3D UNets, or more likely a much more optimized variant of the UNet. Improvements on diffusion models are happening on a daily basis, and we're probably a couple of papers away from an efficiency breakthrough that lets us generate good looking videos on a high end GPU.
It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).
What is likely to be missing is the ability to control that video in useful ways directly from prompts. There will be some kind of direction, but as anyone who has spent time doing prompt-based image generation can attest, actual control isn't there yet.
I admit this gave me pause. But I'd note my comment was pretty specific:
> It's very likely that realistic looking (to the level of Stable Diffusion) video will happen and tools to create it will be available within 12 months (maybe 80% likelihood).
> What is likely to be missing is the ability to control that video in useful ways directly from prompts.
This is using live video as a source, but I think an integrated version of this, combined with some kind of (maybe game-based) interface to script it, is achievable.
> but as anyone who has spent time doing prompt-based image generation actual control isn't there yet
It's not, but there are many efforts to fix this, and they solve most problems, though I admit they're not ideal. One way you can control generation is simply by adjusting per-token weights: if part of your prompt is being ignored, you can use weights to make the guidance put more emphasis on that part of the prompt. Another way is using img2img with a simple drawing as input; this can help the model understand, say, which color you want things to be. The best tool of all is, of course, copious amounts of inpainting and composition: eyes not the right color? Inpaint them. Messed-up hand? Inpaint. Etc. If your issue is that you can't generate a specific character or style at all, there are tools like textual inversion that can create a special token representing the rough idea, assuming you already have a couple of images that represent that idea.
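To make the per-token weighting trick concrete: it usually amounts to scaling each token's embedding before it reaches the diffusion model's cross-attention, so weighted tokens pull the guidance harder. Here's a minimal numpy sketch; the function name and the toy 4-token/8-dim shapes are made up for illustration, and real tools (e.g. the `(word:1.3)` syntax in popular UIs) add parsing and renormalization on top of this idea:

```python
import numpy as np

def weight_token_embeddings(embeddings, weights):
    """Scale each token embedding by its per-token weight.

    embeddings: (seq_len, dim) array from the text encoder
    weights:    per-token emphasis, 1.0 = unchanged
    """
    w = np.asarray(weights, dtype=embeddings.dtype)
    return embeddings * w[:, None]

# toy example: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))

# emphasize token 2 (say, "red"), de-emphasize token 3
weighted = weight_token_embeddings(emb, [1.0, 1.0, 1.3, 0.7])
```

The weighted embeddings are then fed to the denoiser in place of the originals; nothing else in the pipeline changes.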
There is a real fix, though. The big issue with these prompts is that they use the CLIP text encoder, which was trained only on image captions. That means its understanding of the world is limited to whatever is represented by captions on images found on the internet, a very limited subset of language, which limits the quality of the generated embeddings: the model is not only bad at basic language, it doesn't properly understand the relationships between words. Image models that use a large language model (LLM) as the text encoder follow the prompt much better, since LLMs generate much less noisy embeddings. An LLM's embeddings carry enough information that the diffusion model can actually spell, i.e. if you ask for a sign that says something, the sign will come out with proper spelling and font choice. Sadly there's no such model open to the public yet, but stability.ai is currently training one, and we can hopefully expect its release before Christmas.
Having an image model trained with an LLM text encoder, and then applying the tricks used to improve current models, will give you an unprecedented level of control over image generation. The next step is probably an instruction-based model that takes both an image and text as input and applies the instruction to the image. For example, you give it an image of a person and the instruction "draw this in the style of pixel art" and it'll do it for you. This is already possible with img2img, but a model that listens to instructions could extend it to more useful things like "remove the background", "add another cat", "tilt the sign 90 degrees", "make everything black and white except her dress", etc.
Could you, for example, create a simple 3D render of a very simple environment, with cubes resembling houses and blobs as trees, and a camera moving through that environment, then run video generation over it the way img2img works?
That would be great.
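Frame-by-frame img2img over a rendered camera path is roughly the loop below. This is only a sketch of the control flow: `stylize` is a stand-in for a real img2img call (a real pipeline would noise the frame up to `strength` of the schedule and denoise it conditioned on a prompt), and reusing the same seed per frame mimics the common trick of fixing the seed to reduce flicker between frames:

```python
import numpy as np

def stylize(frame, strength, rng):
    """Stand-in for an img2img call: the higher the strength,
    the less of the source frame survives in the output."""
    generated = rng.standard_normal(frame.shape)  # placeholder for model output
    return (1.0 - strength) * frame + strength * generated

def stylize_clip(frames, strength=0.4, seed=42):
    """Run img2img per frame, re-seeding each time so the 'model
    output' is consistent across frames (less temporal flicker)."""
    out = []
    for frame in frames:
        rng = np.random.default_rng(seed)  # same seed every frame
        out.append(stylize(frame, strength, rng))
    return np.stack(out)

clip = np.zeros((16, 64, 64, 3))  # 16 blank frames from the toy render
styled = stylize_clip(clip, strength=0.4)
```

At low `strength` the render's geometry (the cubes and blobs) dominates and the model only restyles it; at high `strength` the output drifts from the source, which is exactly the temporal-coherence problem per-frame approaches struggle with.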
This is not really what is being suggested; this repo just navigates the latent space between 2 prompts and generates an image at specific intervals. It creates cool effects but will never be able to generate a coherent video of, for example, "a man walking on a beach".
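For reference, "navigating the latent space between 2 prompts" typically means spherically interpolating (slerp) between two embeddings or initial noise latents and decoding an image at each step, which is why the result is a morph rather than coherent motion. A minimal slerp, not tied to any particular repo:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two flat vectors,
    t in [0, 1]; follows the arc between them instead of the chord."""
    v0n = v0 / np.linalg.norm(v0)
    v1n = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if np.isclose(theta, 0.0):  # nearly parallel: plain lerp is fine
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)  # stays on the unit arc between a and b
```

Sampling `slerp(emb_a, emb_b, t)` for t from 0 to 1 and rendering each result gives exactly the smooth morphing effect described above; no frame knows anything about the previous one, which is why it can't produce a man actually walking.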
[1] https://arxiv.org/abs/2204.03458
[2] https://github.com/lucidrains/imagen-pytorch/tree/main/image...
[3] https://arxiv.org/abs/2205.15868
[4] https://github.com/THUDM/CogVideo