# a) Variational Autoencoders (VAEs)

## Generative AI Models For Image Synthesis

#### Variational Autoencoder (VAE)

The VAE encoder compresses the image from pixel space to a lower-dimensional latent space that captures the essential semantic content of the image. During forward diffusion, Gaussian noise is iteratively applied to this compressed latent representation.
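As a minimal sketch of that noising step, the snippet below applies the closed-form forward-diffusion sample q(z_t | z_0) to a latent tensor. The linear beta schedule, 1,000 timesteps, and latent shape are illustrative assumptions, not values stated in the text.

```python
import torch

# Closed-form forward diffusion on a VAE latent z0 (assumed schedule and shapes):
# q(z_t | z_0) = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(z0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Add Gaussian noise to the latent z0 at timestep t in a single step."""
    noise = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

# Example: noise a batch of 4x64x64 latents at random timesteps
z0 = torch.randn(2, 4, 64, 64)                   # e.g. latents from a VAE encoder
t = torch.randint(0, T, (2,))
z_t = q_sample(z0, t)
```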

Variational Autoencoders (VAEs) generate new data by learning a probabilistic latent representation of the input data and then sampling from it.
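A toy PyTorch VAE illustrating this encode-sample-decode loop is sketched below; the fully connected layers, dimensions, and reparameterization trick are a generic textbook formulation rather than a specific model from the text.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Illustrative VAE: encode to a Gaussian latent, sample, then decode."""
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Training would minimize reconstruction loss + KL(q(z|x) || N(0, I))
        return self.decoder(z), mu, logvar

# Generating new data: draw z from the prior N(0, I) and decode it
vae = TinyVAE()
z = torch.randn(8, 32)
samples = vae.decoder(z)
```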

**U-Net block**

The U-Net block, built on a ResNet backbone, denoises the output of forward diffusion, running the process in reverse to recover a clean latent representation. Residual Network (ResNet) is a deep learning model used for computer vision applications. It is a Convolutional Neural Network (CNN) architecture designed so that networks with hundreds or thousands of convolutional layers can still be trained effectively.
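To illustrate the residual idea that makes such deep stacks trainable, here is a minimal residual block in PyTorch. The GroupNorm/SiLU choices and channel count are assumptions (common in diffusion U-Nets; the original ResNet uses BatchNorm and ReLU).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal convolutional residual block: output = x + F(x)."""
    def __init__(self, channels: int):
        super().__init__()
        # channels must be divisible by 8 for GroupNorm(8, ...)
        self.norm1 = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.act(self.norm1(x)))
        h = self.conv2(self.act(self.norm2(h)))
        return x + h   # skip connection lets gradients flow through deep stacks

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```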

Finally, the VAE decoder generates the final image by converting the denoised latent representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality; the encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.
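The sketch below shows a single-head cross-attention layer in which latent features act as queries over the conditioning tokens; the dimensions and the single-head simplification are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch of cross-attention: latent features attend to conditioning tokens."""
    def __init__(self, latent_dim: int, cond_dim: int, inner_dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, inner_dim, bias=False)  # queries from image latents
        self.to_k = nn.Linear(cond_dim, inner_dim, bias=False)    # keys from conditioning
        self.to_v = nn.Linear(cond_dim, inner_dim, bias=False)    # values from conditioning
        self.to_out = nn.Linear(inner_dim, latent_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(latents), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)

latents = torch.randn(1, 4096, 320)   # e.g. a flattened 64x64 U-Net feature map
text_emb = torch.randn(1, 77, 768)    # e.g. CLIP text embeddings
out = CrossAttention(320, 768)(latents, text_emb)
```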

For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. Researchers point to increased computational efficiency during both training and generation as an advantage of Latent Diffusion Models (LDMs).
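A short example of obtaining such text embeddings with the Hugging Face transformers wrappers for CLIP ViT-L/14 follows; the prompt and the 77-token padding are illustrative, and the checkpoint is assumed to be downloadable or locally cached.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hugging Face checkpoint for OpenAI's CLIP ViT-L/14 text encoder
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)
text_encoder.eval()

prompt = "a photograph of an astronaut riding a horse"   # hypothetical example prompt
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # Per-token embeddings used as the conditioning sequence for cross-attention
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)   # torch.Size([1, 77, 768])
```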
