Generative AI Models For Image Synthesis

a) Variational Autoencoders (VAEs)

Variational Autoencoder (VAE)

In a latent diffusion model (LDM), the VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is then iteratively applied to the compressed latent representation during forward diffusion.
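To make the forward process concrete, below is a minimal sketch of closed-form forward diffusion applied to an encoded latent, assuming a standard DDPM-style linear beta schedule; the function names, schedule constants, and latent shape are illustrative assumptions, not a specific model's definition.

```python
# Minimal sketch of closed-form forward diffusion on a VAE latent z0.
# Assumes a DDPM-style linear beta schedule; names/shapes are illustrative.
import torch

def make_alpha_bars(num_steps: int = 1000,
                    beta_start: float = 1e-4,
                    beta_end: float = 0.02) -> torch.Tensor:
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(z0: torch.Tensor, t: int, alpha_bars: torch.Tensor) -> torch.Tensor:
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    abar_t = alpha_bars[t]
    noise = torch.randn_like(z0)  # the Gaussian noise applied at step t
    return abar_t.sqrt() * z0 + (1.0 - abar_t).sqrt() * noise

# Example: noise a 4x64x64 latent (a common latent shape for 512x512 images).
alpha_bars = make_alpha_bars()
z0 = torch.randn(1, 4, 64, 64)  # stand-in for a VAE-encoded image latent
zt = forward_diffuse(z0, t=500, alpha_bars=alpha_bars)
```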

Variational Autoencoders (VAEs) generate new data by learning a latent representation of the input data, sampling from this representation, and decoding the samples back into data space.
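As an illustration of that idea, here is a minimal VAE sketch in PyTorch showing the reparameterization trick for sampling the latent and sampling from the prior for generation; the architecture and dimensions are illustrative assumptions, not a specific published model.

```python
# Minimal VAE sketch; architecture and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * latent_dim)  # outputs [mu, logvar]
        self.decoder = nn.Linear(latent_dim, in_dim)
        self.latent_dim = latent_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.decoder(z)

    def sample(self, n: int) -> torch.Tensor:
        # Generation: draw z from the prior N(0, I) and decode it to data space.
        z = torch.randn(n, self.latent_dim)
        return self.decoder(z)
```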

U-Net block

The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion step by step, working backwards to recover a clean latent representation. A Residual Network (ResNet) is a deep learning model used for computer vision applications; it is a convolutional neural network (CNN) architecture whose skip connections allow it to scale to hundreds or thousands of convolutional layers.
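The sketch below shows the kind of residual block such a backbone stacks; the GroupNorm/SiLU layer choices are common in diffusion U-Nets but are assumptions here, not a specific model's definition.

```python
# Minimal residual block sketch; normalization/activation choices are
# common in diffusion U-Nets but illustrative here.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection x + f(x) is what lets very deep stacks train.
        return x + self.block(x)

# Example: the block preserves the input shape.
y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
```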

Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism.
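A minimal sketch of such a cross-attention mechanism follows, assuming flattened latent-image tokens as queries and encoded conditioning tokens (e.g., text embeddings) as keys and values; all dimensions are illustrative assumptions.

```python
# Minimal cross-attention sketch: latent tokens attend to conditioning tokens.
# Dimensions are illustrative assumptions.
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, query_dim: int, context_dim: int, attn_dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(query_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(context_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(context_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, query_dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, latent_tokens, query_dim); context: (batch, cond_tokens, context_dim)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        weights = scores.softmax(dim=-1)  # each latent token weights the conditioning tokens
        return self.to_out(weights @ v)

# Example: a 64x64 latent flattened to 4096 tokens attending to 77 text tokens.
out = CrossAttention(320, 768)(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
```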

For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. Researchers point to the increased computational efficiency of training and generation, a consequence of running diffusion in the compressed latent space rather than in pixel space, as a key advantage of LDMs.
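As a usage sketch, the public CLIP ViT-L/14 checkpoint can be loaded through the Hugging Face transformers library as below; the prompt is illustrative, and downstream models may apply their own padding and post-processing conventions.

```python
# Sketch: encode a prompt with the frozen CLIP ViT-L/14 text encoder
# (the public "openai/clip-vit-large-patch14" checkpoint on Hugging Face).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
text_encoder.requires_grad_(False)  # the encoder stays fixed (frozen)

tokens = tokenizer(
    ["a photograph of an astronaut riding a horse"],  # illustrative prompt
    padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # Per-token embeddings, shape (1, 77, 768) for ViT-L/14; these are what
    # the U-Net's cross-attention layers consume as keys and values.
    embeddings = text_encoder(**tokens).last_hidden_state
```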
