Stable Diffusion and VAE

1) Stable Diffusion

‘Stable Diffusion’ is software that combines a ‘Diffusion Model’ with a VAE (Variational AutoEncoder), one of the four well-known Deep Learning-based Image Generative Models, for text-to-image creation. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and image-to-image translation guided by a text prompt. Stable Diffusion was developed by the start-up Stability AI in collaboration with a number of academic researchers and non-profit organizations.

Stable Diffusion is a Latent Diffusion Model, a kind of Deep Generative Neural Network. Latent diffusion models are machine learning models designed to learn the underlying structure of a dataset by mapping it to a lower-dimensional latent space. In this latent space, the relationships between different data points are easier to understand and analyze.
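
To make "lower-dimensional" concrete, here is a minimal sketch of the size reduction. It assumes the commonly cited Stable Diffusion v1 setup, in which the VAE shrinks each spatial dimension by a factor of 8 and produces 4 latent channels. Those two numbers are not stated in the text above and are used here only for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an RGB image.

    downsample and latent_channels are assumed values, not taken from the text.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many pixel values there are per latent value."""
    pixel_values = 3 * height * width  # RGB image
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return pixel_values / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)
```

Under these assumptions, the diffusion process works on far fewer values than the pixel image contains, which is one reason the model can run on consumer hardware.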

How does LoRA work in Stable Diffusion? LoRA (Low-Rank Adaptation) applies small changes to the most critical part of Stable Diffusion models: the cross-attention layers. This is the part of the model where the image and the prompt meet. Researchers found that fine-tuning only this part of the model is sufficient to achieve good results.
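
The "small changes" can be sketched as a low-rank update added to a frozen weight matrix. The example below is a simplified illustration in plain Python, not the code of any particular LoRA trainer. The alpha / rank scaling and the zero initialization of one factor follow the usual description of LoRA and are assumptions here; the matrix sizes are made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha):
    """Return W + (alpha / rank) * (B @ A) without modifying W.

    w: d_out x d_in frozen weights, b: d_out x rank, a: rank x d_in.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, chosen only for the example

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change before training

merged = lora_merge(W, B, A, alpha=2.0)

full_params = d_out * d_in            # values in the original matrix
lora_params = rank * (d_in + d_out)   # values LoRA actually trains
print(full_params, lora_params)
```

Because B starts at zero, the merged weights initially equal the original weights. Only A and B are trained, and the saving over training the full matrix grows quickly for the much larger matrices found in real attention layers.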

Latent, by definition, means “hidden”. The concept of “latent space” is important because its utility is at the core of 'deep learning': learning the features of data and simplifying data representations for the purpose of finding patterns. Latent space, or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. ‘Latent space’ refers specifically to the space from which the low-dimensional representation is drawn. ‘Embedding’ refers to the way the low-dimensional data is mapped to ("embedded in") the original higher-dimensional space. If it seems that this process is 'hidden' from you, it's because it is.
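
The idea that similar items sit closer together can be shown with a toy example. The three-dimensional vectors below are invented for illustration and do not come from any real model. Cosine similarity is one common way to measure closeness and is an assumption here, since the text does not name a measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for this sketch.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.2, 0.9],
}


def nearest(name, table):
    """Return the other item whose embedding is most similar to `name`."""
    others = (other for other in table if other != name)
    return max(others, key=lambda other: cosine_similarity(table[name], table[other]))


print(nearest("cat", embeddings))
```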

Stable Diffusion’s code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E (OpenAI), Imagen (Google) and Midjourney, which were accessible only via cloud services.

Stable Diffusion’s Training data

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web. Its 5 billion image-text pairs were classified by language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (i.e. subjective visual quality). The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of the model's training data found that, in a smaller subset of 12 million images taken from the original wider dataset, approximately 47% of the images came from 100 different domains, with Pinterest accounting for 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons.
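
The filtering step can be pictured as a simple predicate applied to each image-text pair. This is a hedged sketch, not LAION's actual pipeline. The Sample fields and the keep function are hypothetical names, and min_side=512 is a placeholder. The watermark and aesthetic defaults follow the 80% and 5-out-of-10 figures this document gives for the LAION-Aesthetics v2 5+ subset.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One image-text pair with predicted scores (hypothetical schema)."""
    caption: str
    width: int
    height: int
    watermark_prob: float   # predicted probability of a watermark
    aesthetic_score: float  # predicted rating on a 0-10 scale


def keep(sample, min_side=512, max_watermark_prob=0.8, min_aesthetic=5.0):
    """Return True if the sample passes all three filters."""
    return (
        min(sample.width, sample.height) >= min_side
        and sample.watermark_prob <= max_watermark_prob
        and sample.aesthetic_score >= min_aesthetic
    )


samples = [
    Sample("a mountain lake at dawn", 1024, 768, 0.05, 6.2),
    Sample("stock photo preview", 1024, 768, 0.93, 5.8),   # likely watermarked
    Sample("tiny thumbnail", 120, 90, 0.01, 5.5),          # too small
    Sample("blurry snapshot", 800, 600, 0.02, 3.1),        # low aesthetic score
]

kept = [s.caption for s in samples if keep(s)]
print(kept)
```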

Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. Final rounds of training additionally dropped 10% of text conditioning to improve Classifier-Free Diffusion Guidance.
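
The two halves of classifier-free guidance can be sketched in a few lines. During training, the caption is occasionally dropped so the model also learns an unconditional prediction. During sampling, the two predictions are blended. This is a simplified illustration: using an empty string for a dropped caption, the name guidance_scale, and the short lists standing in for noise tensors are all assumptions, not details from the text.

```python
import random


def maybe_drop_caption(caption, drop_prob=0.1, rng=random):
    """With probability drop_prob, replace the caption by an empty one."""
    return "" if rng.random() < drop_prob else caption


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Blend predictions: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Training side: roughly 10% of captions are dropped.
rng = random.Random(0)
captions = [maybe_drop_caption("a red bicycle", rng=rng) for _ in range(10_000)]
drop_rate = captions.count("") / len(captions)

# Sampling side: toy noise predictions, made up for the example.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = guided_noise(eps_uncond, eps_cond, guidance_scale=2.0)
print(round(drop_rate, 2), guided)
```

A scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one. Larger values push the result further in the direction suggested by the prompt.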

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.
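
The figures above imply a rough wall-clock time and hourly rate. The short calculation below uses only the three numbers given, and assumes all 256 GPUs were in use for the whole run, which the text does not state.

```python
gpu_hours = 150_000
num_gpus = 256
total_cost_usd = 600_000

# Assumption: every GPU was busy for the full duration of training.
wall_clock_hours = gpu_hours / num_gpus
wall_clock_days = wall_clock_hours / 24
cost_per_gpu_hour = total_cost_usd / gpu_hours

print(f"{wall_clock_hours:.1f} hours, about {wall_clock_days:.1f} days")
print(f"${cost_per_gpu_hour:.2f} per GPU-hour")
```

Under that assumption the run lasts a little over three weeks, at an implied rate of about $4 per GPU-hour.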

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. As a result, generated images can reinforce social biases and reflect a Western perspective; the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation.

2) Automatic1111 – Rendering UI for Stable Diffusion

‘Automatic1111’ is a browser interface for Stable Diffusion based on the Gradio library. It is one of the most popular Stable Diffusion GUIs. It is now compatible with Stable Diffusion v2 models, both the 512x512 and the 768x768 versions.

Stable Diffusion is a machine learning model. By itself it is not very user friendly: you need to write code to use it, which is kind of a hassle. Most users therefore use a GUI (Graphical User Interface) to run Stable Diffusion. Instead of writing code, we write a prompt in a text box and click some buttons.

Automatic1111 was one of the first GUIs developed for Stable Diffusion. Although it is associated with the GitHub account of its original author, Automatic1111, developing this software has been a community effort. Automatic1111 is feature-rich: you can use text-to-image, image-to-image, upscaling, and depth-to-image, and run and train custom models, all within this GUI. Many of the tutorials in this book will be demonstrated with this GUI.

NOTE: RunPod is a cloud GPU rental service with prices starting from $0.2/hour. You can use RunPod to deploy container-based GPU instances that spin up in seconds, using both public and private repositories, for GPU-intensive processing.

If you don't want to use Automatic1111 locally on your computer, and you don't want to have to set up a RunPod, there is another option. The developer of this kohya-LoRA notebook has just come out with a brand spanking new ‘Automatic1111’ notebook. It includes ControlNet-1 and the brand new ControlNet-2, as well as the ability to use your newly trained LoRA files.

You can use the Cagliostro Colab UI to generate your images; it is basically Automatic1111. Cagliostro Colab UI is an innovative and powerful notebook designed to launch Automatic1111's Stable Diffusion Web UI in Google Colab.

Cagliostro Colab offers an efficient way to utilize Stable Diffusion Web UI for your projects. (Note: Vladmandic and Anapnoe collaborated on creating the best UI/UX for Stable Diffusion. It is a forked version of Automatic1111's Stable Diffusion Web UI. Users can still use Automatic1111 by disabling the use_anapnoe_ui option. Also note that Anapnoe's fork still uses an old commit, so users who want to experience Gradio 3.23.0 should disable the use_anapnoe_ui option.)

Log in to Cagliostro Colab UI with your Google Account at:

3) ControlNet is an extension for Stable Diffusion

‘ControlNet’ is a neural network structure to control diffusion models by adding extra conditions, a game changer for AI. ControlNet is a brand new extension for Stable Diffusion, the open-source text-to-image AI tool from Stability AI. ControlNet is capable of creating an image map from an existing image, so you can control the composition and human poses of your AI-generated image.

The official implementation accompanies the paper "Adding Conditional Control to Text-to-Image Diffusion Models". It copies the weights of neural network blocks into a "locked" copy and a "trainable" copy.
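
The locked/trainable split can be sketched with a toy linear block. This is a simplified illustration in plain Python, not the official implementation. The zero-initialized connectors (zero_in and zero_out) follow the "zero convolution" idea described in the ControlNet paper and are an assumption beyond what the text above says. The class name and the square weight matrix are hypothetical.

```python
import copy


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ControlledBlock:
    """Toy block with a frozen copy and a trainable copy of the same weights."""

    def __init__(self, pretrained_weights):
        n = len(pretrained_weights)  # assumes a square n x n matrix
        self.locked = copy.deepcopy(pretrained_weights)     # never updated
        self.trainable = copy.deepcopy(pretrained_weights)  # updated in training
        # Zero-initialized connectors: the control branch contributes nothing
        # until training moves these away from zero.
        self.zero_in = [[0.0] * n for _ in range(n)]
        self.zero_out = [[0.0] * n for _ in range(n)]

    def forward(self, x, condition):
        base = linear(self.locked, x)
        injected = [xi + ci for xi, ci in zip(x, linear(self.zero_in, condition))]
        control = linear(self.zero_out, linear(self.trainable, injected))
        return [b + c for b, c in zip(base, control)]


block = ControlledBlock([[1.0, 2.0], [3.0, 4.0]])
x = [1.0, 1.0]
condition = [5.0, -5.0]  # stand-in for a pose or edge map
before_training = block.forward(x, condition)
print(before_training)
```

With the connectors at zero, the output is exactly what the original block would produce, so adding the control branch cannot degrade the pretrained model at the start of training.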
