Text-to-image generation tools are becoming increasingly common as AI image-generation models keep evolving. Compared to their first iterations, the latest models can generate realistic faces and legible text within generated images. One such model is InstructPix2Pix. In this article, we will look at its architecture, how to run the InstructPix2Pix model, and its applications.
What is InstructPix2Pix?
When you have an image and want to revise it in certain areas, how do you do that with generative AI? You can feed the image and an edit instruction to a suitable pre-trained image generator (or a ControlNet model). The output may contain the desired changes but still look different from the original image, because it was re-generated from scratch.
For example, you may edit a photo of a particular cat, and the edited image may show a cat with a different colour, build, ears, or some other attribute. That is where InstructPix2Pix is helpful.
InstructPix2Pix is a Stable Diffusion model that can edit images from human instructions: given an input image and a written instruction describing how to edit it, the model performs the appropriate edit.
This fine-tuned version of Stable Diffusion was proposed by researchers from the University of California, Berkeley. They generated a paired dataset using GPT-3 for text instructions and Stable Diffusion for text-to-image generation.
Using this generated paired data the researchers trained a conditional diffusion model InstructPix2Pix. The model edits images using text instructions and does not require any additional example images, full descriptions of the input/output images, or per-example fine-tuning.
Training InstructPix2Pix
Training InstructPix2Pix can be discussed in two parts: generating a paired image-editing dataset, and training a Stable Diffusion model on that dataset.
To create the paired image dataset, let us first stick to the text domain, i.e. creating text instructions. GPT-3 was fine-tuned on editing triplets whose input captions come from the LAION-Aesthetics V2 6.5+ dataset. Each triplet consists of (1) an input caption, (2) an edit instruction, and (3) an output caption.
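To make the triplet structure concrete, here is a minimal sketch of what one entry of the caption dataset looks like. The example text is the running example from the InstructPix2Pix paper; the dictionary layout is just for illustration, not the authors' actual data format.
# One editing triplet as a plain Python dict (illustrative layout only)
triplet = {
    "input_caption": "photograph of a girl riding a horse",    # caption sampled from LAION
    "edit_instruction": "have her ride a dragon",               # the written edit
    "output_caption": "photograph of a girl riding a dragon",   # caption after the edit
}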
The large training dataset is created using the creative yet sensible captions produced by the fine-tuned GPT-3. The next step is turning paired captions into paired images: a pre-trained text-to-image model transforms each pair of captions into a pair of images.
Again, as discussed earlier, the challenge of keeping the two generated images consistent with each other shows up. This problem is solved with a method called prompt-to-prompt.
Prompt-to-prompt helps multiple generations from a text-to-image model look similar to each other. It does this by borrowing the cross-attention weights for some number of the denoising steps.
However, the ideal degree of similarity varies from edit to edit. Prompt-to-prompt has a parameter that controls the similarity between the generated image pair, but it is still difficult to pick the optimal value of this parameter in advance.
Therefore, 100 image pairs are generated per caption pair, each with a random parameter value, and these are filtered with a CLIP-based metric: a directional CLIP similarity checks that the change between the two images (in CLIP space) is consistent with the change between the two captions.
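Below is a rough sketch of the directional-similarity idea using the Hugging Face transformers CLIP model. This is not the authors' filtering code; the checkpoint name and the function are illustrative assumptions.
# Directional CLIP similarity: the image change should agree with the caption change.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def directional_similarity(img_a, img_b, caption_a, caption_b):
    inputs = processor(text=[caption_a, caption_b], images=[img_a, img_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_dir = img_emb[1] - img_emb[0]   # how the image changed
    txt_dir = txt_emb[1] - txt_emb[0]   # how the caption changed
    return torch.nn.functional.cosine_similarity(img_dir, txt_dir, dim=0)
Pairs whose directional similarity falls below a chosen threshold would be discarded.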
This generated training data is then used to train a conditional Stable Diffusion model that edits images from written instructions.
How Does InstructPix2Pix Work?
InstructPix2Pix is conditioned on two inputs: the input image (cI) and the text instruction (cT). Classifier-free guidance involves training the model for both conditional and unconditional denoising. When the guidance scale (s) is greater than 1, the generated image gives more weight to the conditional prediction and less to the unconditional one.
Since InstructPix2Pix has two conditionings, two guidance scales, sI and sT, are used. These scales can be adjusted to trade off how strongly the generated samples correspond to the input image versus how strongly they follow the edit instruction.
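As a minimal sketch, the two scales combine three noise predictions from the UNet, following the guidance formula in the InstructPix2Pix paper (the function and variable names here are illustrative):
# eps_uncond: neither condition, eps_img: image condition only, eps_full: image + text.
# These would come from three forward passes of the denoising UNet.
def two_scale_cfg(eps_uncond, eps_img, eps_full, s_I=1.5, s_T=7.5):
    # Move away from the unconditional prediction towards the image-conditioned one (s_I),
    # then further towards the fully conditioned prediction (s_T).
    return (eps_uncond
            + s_I * (eps_img - eps_uncond)
            + s_T * (eps_full - eps_img))
In the diffusers pipeline used below, sT corresponds to the guidance_scale argument and sI to image_guidance_scale.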
Running InstructPix2Pix
Install diffusers and relevant dependencies:
!pip install diffusers accelerate safetensors transformers
Load the model, download the image, and edit it:
import requests
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInstructPix2PixPipeline, EulerAncestralDiscreteScheduler

# Load the pretrained InstructPix2Pix weights in half precision
model_id = "timbrooks/instruct-pix2pix"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, safety_checker=None
)
pipe.to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

url = "https://raw.githubusercontent.com/timothybrooks/instruct-pix2pix/main/imgs/example.jpg"

def download_image(url):
    # Fetch the example image and normalise its orientation and colour mode
    image = Image.open(requests.get(url, stream=True).raw)
    image = ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image

image = download_image(url)

prompt = "add classic black cooling glasses"
images = pipe(prompt, image=image, num_inference_steps=10, image_guidance_scale=1).images
images[0]
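To explore the trade-off described earlier, you can vary the two guidance scales in the pipeline call. The scale values below are arbitrary choices and the output file names are just for illustration.
# Sweep a few (text, image) guidance-scale combinations and save the results.
# guidance_scale plays the role of sT and image_guidance_scale the role of sI.
for text_scale, img_scale in [(7.5, 1.0), (7.5, 1.5), (10.0, 1.5)]:
    edited = pipe(
        prompt,
        image=image,
        num_inference_steps=20,
        guidance_scale=text_scale,
        image_guidance_scale=img_scale,
    ).images[0]
    edited.save(f"edited_t{text_scale}_i{img_scale}.png")
Higher image_guidance_scale keeps the result closer to the input photo, while higher guidance_scale pushes it to follow the instruction more aggressively.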
How Is InstructPix2Pix Different From ControlNet?
ControlNet focuses on generating new images that follow pre-set conditions such as edge maps, depth maps, or poses, while InstructPix2Pix focuses on editing existing images based on text instructions. The two can be used together for a more controlled and specific image generation and editing process.
Both ControlNet and InstructPix2Pix provide additional control and flexibility in the image generation and editing process; the choice between the two depends on the specific task at hand.
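For contrast, here is a minimal sketch of a ControlNet generation with diffusers. It assumes a Canny edge map (canny_image) has already been prepared from a reference photo, and the checkpoint names are common community choices rather than anything prescribed by this article.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# ControlNet generates a *new* image guided by a structural condition (here, Canny edges),
# rather than editing an existing image the way InstructPix2Pix does.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe_cn = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# canny_image is assumed to be a PIL image of edge detections from a reference photo
result = pipe_cn("a cat wearing classic black sunglasses", image=canny_image, num_inference_steps=20).images[0]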
Conclusion
This article explained the InstructPix2Pix paper and its implications in a simple way. The model is good at following general image-editing instructions, but it can struggle when complex instructions are passed.
Because InstructPix2Pix is trained on data generated with Stable Diffusion, the quality of the images it produces is bounded by the quality of that training data. A ControlNet-style training setup might improve the model's performance, and the quality of the training data could be further improved to boost performance.