What is ControlNet for Stable Diffusion? How to Use It?

This article provides a quick introduction to Stable Diffusion and ControlNet, along with a step-by-step process for using ControlNet to control image generation in Stable Diffusion.

What is ControlNet?

Diffusion models are among the foundation models in the generative AI space. Popular image generation tools like Stable Diffusion, Midjourney, and DALL-E leverage these models. Since their debut, these image generation tools have attracted numerous users and seen widespread adoption.

One such tool is Stable Diffusion, released by Stability AI in 2022. Stable Diffusion is an open-source latent diffusion model that can generate an image from a text prompt describing it.

ControlNet is a neural network structure used to condition, or control, the generation process in diffusion models. Focusing on image generation, it enables users to customize the generation process using certain aspects of an image, such as a depth map, pose, scribble, and many other spatial contexts.

Diffusion Models and Stable Diffusion

The concept of diffusion models is inspired by the physical process of diffusion. Take a drop of ink and drop it into a bowl of water. Initially, the exact location of the drop is visible, but over time it diffuses into the water until it disappears.

To make a neural network generate an image, we want it to learn both the concept of what the image depicts and its fine details. To do this, we follow a ‘noising process’. Take an image of a tree: initially, the model has no idea what a tree is.

We iteratively add increasing levels of noise to the image until we reach pure Gaussian noise, and train the model to approximate the noise added at each step. The goal is to learn the reverse process: given Gaussian noise, the diffusion model iterates backwards along the chain to generate new images (the ‘denoising process’) corresponding to the input prompt.

At each iteration, the model effectively answers the question: given this image (pure Gaussian noise at iteration 1), what is the noise I need to remove so that the image looks more like a tree?
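For intuition, here is a minimal sketch of the forward noising step in PyTorch. The linear beta schedule, the step count, and the helper name are assumptions for illustration, not the schedule any particular model uses.

import torch

# A minimal sketch of the forward "noising" process, assuming a simple linear
# beta schedule with 1000 steps (real schedulers in diffusers are more refined).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative fraction of signal kept

def add_noise(x0, t):
    """Jump straight to timestep t: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    eps = torch.randn_like(x0)                      # Gaussian noise
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# x0 would be a clean image tensor; the network is trained to predict eps from
# the noisy x_t, which is exactly what lets it run the chain in reverse later.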

Stable Diffusion

Stable Diffusion is a type of latent diffusion model that was proposed in the paper High-Resolution Image Synthesis with Latent Diffusion Models. In latent diffusion, the model is trained to generate latent (compressed) representations of images. Latent diffusion reduces memory and compute requirements by applying the diffusion process over a lower-dimensional latent space instead of the actual pixel space.
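For a rough sense of the savings, consider a 512x512 RGB image with a VAE that downsamples by a factor of 8 and uses 4 latent channels (these numbers are for illustration; they match the Stable Diffusion v1.5 setup):

# Rough sketch of the size difference between pixel space and latent space,
# assuming a 512x512 RGB image and an 8x-downsampling VAE with 4 latent channels.
pixel_values  = 3 * 512 * 512   # 786,432 values in pixel space
latent_values = 4 * 64 * 64     #  16,384 values in latent space
print(pixel_values / latent_values)   # -> 48.0, i.e. ~48x fewer values to denoise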

There are 3 main components in latent diffusion:

  1. An autoencoder (converts images to and from a lower-dimensional latent space)
  2. A U-Net (predicts the noise to remove at each denoising step, in latent space)
  3. A text encoder (embeds the text prompt into an embedding space that can be fed to the U-Net)

In the case of Stable Diffusion with ControlNet, the pipeline first uses the CLIP text encoder, then the diffusion U-Net together with the ControlNet, and finally the VAE decoder.
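As a quick sanity check, these components are exposed as attributes on a loaded diffusers pipeline (a small sketch; the model weights are downloaded on first use):

from diffusers import StableDiffusionPipeline

# The three components are attributes of the loaded pipeline.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
print(type(pipe.vae).__name__)           # AutoencoderKL        - the autoencoder
print(type(pipe.unet).__name__)          # UNet2DConditionModel - the U-Net
print(type(pipe.text_encoder).__name__)  # CLIPTextModel        - the text encoder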

(Figure: the Stable Diffusion pipeline with ControlNet)

Running Stable Diffusion

To run the Stable Diffusion model locally:
1. Clone the repository of Stable Diffusion
2. Install the required libraries, define the image generation pipeline using the pre-trained model, and generate the image.

!git clone https://github.com/Stability-AI/stablediffusion.git  # optional: the diffusers pipeline below does not use this repository directly
!pip install diffusers transformers accelerate scipy safetensors
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch

model_id = "runwayml/stable-diffusion-v1-5"

scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a professional photograph of an astronaut riding a horse on moon"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")

How to use ControlNet?

ControlNet manipulates the input conditions of neural network blocks to further control the overall behaviour of an entire neural network. A “network block” refers to a set of neural layers that are put together as a frequently used unit to build neural networks, e.g., “resnet” block, multi-head attention block, transformer block, etc.

To improve the image generation process, we lock the parameters of a network block and create a trainable copy that is trained with an external condition vector c (refer to the image that follows).

The intuition is to avoid overfitting while still adapting the model to new conditions. By training the trainable copy with an external condition vector, we can control the behaviour of the network without directly modifying the original model’s weights.

How ControlNet works

A special type of convolutional layer called a “zero convolution” connects the two copies of the network. Its weights and biases are initialized to zero, so at the start of training it contributes nothing and does not disturb the locked copy; as training progresses, it lets information flow from the trainable copy into the locked copy. The following image depicts the model architecture of a Stable Diffusion model with ControlNet.
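A minimal conceptual sketch of this wiring in PyTorch might look like the following. This is an illustration only, not the paper’s or diffusers’ actual implementation; the block shape, channel counts, and condition tensor c are assumptions made for the example.

import copy
import torch.nn as nn

# Conceptual sketch of the ControlNet idea: freeze the original block, train a
# copy, and join them with zero-initialized 1x1 "zero convolutions" so the copy
# contributes nothing at the start of training.
class ControlledBlock(nn.Module):
    def __init__(self, block: nn.Module, channels: int, cond_channels: int):
        super().__init__()
        self.locked = block                               # original block, weights frozen
        for p in self.locked.parameters():
            p.requires_grad = False
        self.trainable = copy.deepcopy(block)             # trainable copy of the block
        self.zero_in = nn.Conv2d(cond_channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for zero_conv in (self.zero_in, self.zero_out):   # zero-initialize both convs
            nn.init.zeros_(zero_conv.weight)
            nn.init.zeros_(zero_conv.bias)

    def forward(self, x, c):
        y = self.locked(x)                                # frozen path: behaviour unchanged
        y_c = self.trainable(x + self.zero_in(c))         # trainable copy sees the condition c
        return y + self.zero_out(y_c)                     # at initialization this equals y exactly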

The ControlNet paper discusses 14 different implementations of ControlNets with different image-based conditions to control large diffusion models.

Run Stable Diffusion with ControlNet

ControlNet gives users a way to communicate with and guide the Stable Diffusion model. This enables more customized image generation and results that are closer to the user’s vision.

In the following implementation, we use StableDiffusionControlNetPipeline with the Canny ControlNet model (an edge detector) to generate images. We will use the famous Girl with a Pearl Earring painting for the implementation.

!pip install -q diffusers==0.14.0 transformers xformers git+https://github.com/huggingface/accelerate.git
!pip install -q opencv-contrib-python
!pip install -q controlnet_aux         #a simple collection of pre-processing models for ControlNet
from diffusers import StableDiffusionControlNetPipeline
from diffusers.utils import load_image

image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
image

We detect edges by passing the image through a Canny edge detector:

import cv2
import numpy as np
from PIL import Image

image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)  # detect edges
image = image[:, :, None]                                # add a channel dimension
image = np.concatenate([image, image, image], axis=2)    # replicate to 3 channels
canny_image = Image.fromarray(image)
canny_image

Now, we load runwayml/stable-diffusion-v1-5 and the ControlNet model for Canny edges. We define a pipeline for image generation using StableDiffusionControlNetPipeline.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

We use UniPCMultistepScheduler for fast sampling of the diffusion model. During the denoising process, sampling starts from a random noise sample and iteratively predicts the noise in it and subtracts it. A fast scheduler like UniPC produces good image quality with fewer inference steps.

from diffusers import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

enable_model_cpu_offload significantly reduces memory consumption: it loads each pipeline component onto the GPU only when it is needed.

pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

We are now ready to run the ControlNet pipeline. ControlNet allows more control over the generated image, since we can target specific aspects of the generation. We define an image_grid function that takes a list of images and the number of rows and columns as input, and returns a single image grid.

The next step is to assign an input prompt to a variable and run the pipeline to generate the image.

def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    grid_w, grid_h = grid.size

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid

prompt = ", best quality, extremely detailed"
prompt = [t + prompt for t in ["taylor swift", "Jessica Watkins", "Kalpana Chawla", "Sania Mirza"]]
generator = [torch.Generator(device="cpu").manual_seed(2) for i in range(len(prompt))]

output = pipe(
    prompt,
    canny_image,
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality, sharp edges"] * len(prompt),
    generator=generator,
    num_inference_steps=20,
)

image_grid(output.images, 2, 2)
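Optionally, each generated image can also be saved to disk (the filename pattern here is just illustrative):

# Save each generated image individually.
for i, img in enumerate(output.images):
    img.save(f"controlnet_canny_{i}.png")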

Conclusion

This article provides a quick start for implementing Stable Diffusion and ControlNet. We discussed using StableDiffusionControlNetPipeline to generate targeted images.

If you want to explore Stable Diffusion further, follow the references and related resources used for this blog. Feel free to play with the attached Colab notebooks and dive deeper into diffusion models.

You can also learn how to use Hugging Face Models with LangChain.

Resources and References:

Ultra-fast ControlNet with 🧨 Diffusers

Train your ControlNet with diffusers 🧨

The latest release of Stable Diffusion models: stable-diffusion-xl-base-1.0

Adding Conditional Control to Text-to-Image Diffusion Models

High-Resolution Image Synthesis with Latent Diffusion Models

Stable Diffusion with diffusers
