Generative Image Dynamics — A Summary

Diffusion models have been around for a while now, and I've always wondered what a good use case for them could be. Should I just generate images of baby Yoda riding a bicycle, or maybe an astronaut riding a horse?
Well, a recent release [paper] by the Google Research team demonstrates a crazy good use case: generating looping videos with the dynamics we experience from motion caused by wind, water currents, respiration, and other natural factors.
The basic idea behind this paper is to bring natural object dynamics to still images, including in response to interactive user excitation. The model is trained on a large collection of motion trajectories automatically extracted from real video sequences, and learns to predict a neural stochastic motion texture, i.e. a set of coefficients of a motion basis that characterizes each pixel's trajectory into the future.
An Overview
The goal is to take an input image I0 and generate a video of length T featuring oscillatory dynamics. GID (Generative Image Dynamics) uses a frequency-coordinated diffusion sampling process to predict a per-pixel, long-term motion representation in the Fourier domain, called a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video.
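Concretely (sketching the paper's notation from memory, so treat the exact symbols as approximate): the motion texture S assigns each pixel p a small set of complex Fourier coefficients, one per modeled frequency, and an inverse FFT turns them back into a time-domain trajectory:

```latex
S(p) = \big( S_{f_0}(p), \ldots, S_{f_{K-1}}(p) \big), \qquad
\mathcal{F}(p) = \{ F_t(p) \}_{t=0}^{T-1} = \mathrm{FFT}^{-1}\!\big( S(p) \big)
```

so the pixel at position p in I0 ends up at position p + F_t(p) in frame t.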
The system comprises two modules:
- Motion prediction module
- Image-based rendering module
Motion prediction module: It consists of a Latent Diffusion Model (LDM) that predicts a neural stochastic motion texture (basically a frequency representation of per-pixel motion trajectories) for an input image I0. The predicted neural stochastic motion texture is then transformed into a sequence of motion displacement fields F using an inverse discrete Fourier transform. These fields determine the position of each input pixel at each future time step.
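Here is a minimal NumPy sketch of that last conversion step. The shapes, frequency count, and the random placeholder texture are my own assumptions, standing in for what the LDM would actually predict:

```python
import numpy as np

# Assumed sizes: K modeled frequencies, T output frames, an H x W image.
K, T, H, W = 16, 150, 256, 256

# Placeholder for the LDM's output: complex Fourier coefficients per pixel,
# for the x and y displacement components (last axis).
motion_texture = 0.1 * (np.random.randn(K, H, W, 2)
                        + 1j * np.random.randn(K, H, W, 2))

# Inverse real FFT along the frequency axis turns each pixel's spectrum into
# a displacement trajectory over T future time steps.
displacement_fields = np.fft.irfft(motion_texture, n=T, axis=0)  # (T, H, W, 2)

# At time step t, the source pixel at (x, y) in I0 moves to (x, y) + d.
ys, xs = np.mgrid[0:H, 0:W]
t = 42
target_x = xs + displacement_fields[t, ..., 0]
target_y = ys + displacement_fields[t, ..., 1]
```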
Image-based rendering module: Given the predicted motion fields, the rendering module animates the input RGB image using an image-based rendering technique that splats encoded features from the input image and decodes these splatted features into an output frame with an image synthesis network.
Splatting: in simple words, forward warping. Each source pixel (or its encoded feature) is "thrown" onto the new location given by its motion vector, and contributions that land on the same output location are blended according to their weights rather than simply overwriting each other.
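To make that concrete, here is a toy, nearest-neighbor version of forward splatting in NumPy. The paper's renderer splats learned, encoded features with predicted blending weights rather than raw pixels, so this only shows the core mechanical idea, with all names and shapes assumed:

```python
import numpy as np

def splat(features, flow, weights):
    """Toy forward splatting: push each source pixel's features to the
    location given by its motion vector and blend overlaps by weight."""
    H, W, C = features.shape
    out = np.zeros((H, W, C))
    acc = np.zeros((H, W, 1))
    ys, xs = np.mgrid[0:H, 0:W]
    # Destination of every source pixel (rounded to the nearest pixel).
    tx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    # Accumulate weighted features at their destinations, then normalize,
    # so sources landing on the same target are blended, not overwritten.
    np.add.at(out, (ty, tx), weights * features)
    np.add.at(acc, (ty, tx), weights)
    return out / np.maximum(acc, 1e-8)
```

Locations that no source pixel lands on stay empty in this toy version; in the full system, it is the image synthesis network decoding the splatted features that turns them into a clean output frame.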
Both modules work in sync with each other to generate dynamic videos from static images. The motion prediction module predicts the motion of each pixel in the input image, and the image-based rendering module uses this information to generate a sequence of output frames.
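Putting it together, a hedged end-to-end sketch might look like this; motion_ldm, encode, and decode are dummy stand-ins for the paper's trained networks (not real APIs), and splat is the toy helper from above:

```python
import numpy as np

K, T, H, W, C = 16, 150, 64, 64, 8   # assumed sizes for the sketch

def motion_ldm(image):   # dummy stand-in for the motion prediction LDM
    return 0.1 * (np.random.randn(K, H, W, 2) + 1j * np.random.randn(K, H, W, 2))

def encode(image):       # dummy stand-in for the feature encoder
    return np.repeat(image.mean(axis=-1, keepdims=True), C, axis=-1)

def decode(features):    # dummy stand-in for the image synthesis network
    return features[..., :3]

def animate(image):
    texture = motion_ldm(image)                          # motion prediction module
    displacements = np.fft.irfft(texture, n=T, axis=0)   # (T, H, W, 2)
    feats = encode(image)                                # rendering module: encode,
    weights = np.ones((H, W, 1))                         # splat, then decode each frame
    return [decode(splat(feats, displacements[t], weights)) for t in range(T)]

frames = animate(np.random.rand(H, W, 3))                # T output frames
```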
Conclusion
To conclude, this is a genuinely useful application of diffusion models, especially in the field of computer graphics. The paper describes an easy-to-follow pipeline, and the GID technique opens the door to loads of applications, from seamlessly looping videos to interactive image dynamics.
Resources for you:
Check out the following link for the paper and demo.👇
Thank you for reading!
Please feel free to share your thoughts on this topic; feedback is always welcome.

