AI Architect: Design your homes with Stable Diffusion
in collaboration with Nitin Tiwari (ML Google Developers Expert)
Looking for a makeover for your home? Or just inspired by your favorite anime character’s house? Or want your house to look minimalistic yet modern? But who wants to pay an architect a hefty amount just to get an idea 🤔
Well, behold our latest POC for your home: add images of your room, share your inspiration for how it should look, and let Stable Diffusion do its magic. 🪄
The Big Picture
AI Architect is a Stable Diffusion model fine-tuned using DreamBooth and LoRA. It takes in a handful of images along with a prompt, say “A photo of TOK Home, a well lit up living room near the beach with comfy sofa and small plants”, and generates an image of the room conditioned on your prompt.
Let’s dive deeper into the project and understand the code better.
The Dataset
Let’s start by uploading the dataset for our project. We uploaded a handful of living room images and resized them to 256 x 256.
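If you want to reproduce the resizing step, a minimal sketch using Pillow could look like this; the folder names room_images (raw photos) and resized_images (output) are hypothetical and not from the original notebook.
# Minimal resizing sketch with Pillow; the folder names here are placeholders.
import os
from PIL import Image

os.makedirs("resized_images", exist_ok=True)
for file_name in os.listdir("room_images"):
    if file_name.lower().endswith((".jpg", ".jpeg", ".png")):
        img = Image.open(os.path.join("room_images", file_name)).convert("RGB")
        img.resize((256, 256)).save(os.path.join("resized_images", file_name))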
BLIP for custom captions
Once the images are uploaded and resized, we use the BLIP image-captioning model to caption them automatically. Once done, save the images into a folder named “SDXL_train”.
import torch
from transformers import AutoProcessor, BlipForConditionalGeneration

# load the processor and the captioning model (the float16 BLIP model expects a GPU)
device = "cuda"
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16).to(device)
Token Identifier
Next step — we incorporate a token identifier (e.g., TOK) into each caption by introducing a caption prefix. The token identifier we used in this project is caption_prefix = “a photo of TOK home, “.
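To tie the captioning and prefixing steps together, here is a rough sketch (reusing blip_processor, blip_model, and device from the snippet above) that captions each image, prepends the token identifier, and writes a metadata.jsonl file with a “prompt” column inside the SDXL_train folder; the exact loop in the original notebook may differ.
# Hedged sketch: auto-caption each training image with BLIP and prepend the token identifier.
import os, json
from PIL import Image

caption_prefix = "a photo of TOK home, "
entries = []
for file_name in sorted(os.listdir("SDXL_train")):
    if not file_name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image = Image.open(os.path.join("SDXL_train", file_name)).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt").to(device, torch.float16)
    out = blip_model.generate(**inputs, max_new_tokens=50)
    caption = blip_processor.decode(out[0], skip_special_tokens=True)
    entries.append({"file_name": file_name, "prompt": caption_prefix + caption})

# Write the captions in the format expected by the --caption_column="prompt" argument used below.
with open(os.path.join("SDXL_train", "metadata.jsonl"), "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")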
Pre-processing done ✅ Let’s move forward and fine-tune our model with DreamBooth and LoRA.
Setting up hyperparameters
To ensure seamless integration of DreamBooth with LoRA on a resource-intensive pipeline like Stable Diffusion XL, we implemented the following techniques:
- Gradient checkpointing (--gradient_checkpointing) with gradient accumulation (--gradient_accumulation_steps).
- 8-bit Adam (--use_8bit_adam).
- Mixed-precision training (--mixed_precision="fp16"), i.e., 16-bit floating point.
- Specify the LoRA model repository name using --output_dir.
- Use --caption_column to indicate the name of the caption column in your dataset. In this example, “prompt” was used to save captions in the metadata file.
!accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name="SDXL_train" \
--output_dir="SDXL_LoRA_model" \
--caption_column="prompt" \
--mixed_precision="fp16" \
--instance_prompt="a photo of TOK home" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=3 \
--gradient_checkpointing \
--learning_rate=1e-4 \
--snr_gamma=5.0 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--use_8bit_adam \
--max_train_steps=500 \
--checkpointing_steps=717 \
--seed="0"
It’s Inference time ⏰
For inference, we set up a diffusion pipeline using a pre-trained diffusion model from Stability AI, then call the load_lora_weights function and pass in the LoRA weights. You can try out inference with this model via the Hugging Face link in the resources below.
Text To Image Generation
Once the pipeline is set, simply pass the prompt to the pipeline and save the generated image.
import torch
from diffusers import AutoencoderKL, DiffusionPipeline

# Setting up the pipeline with the fp16-fixed VAE and loading the fine-tuned LoRA weights
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.load_lora_weights('/content/SDXL_LoRA_model/pytorch_lora_weights.safetensors')
_ = pipe.to("cuda")
# Pass in the prompt to pipeline
prompt = "A photo of TOK home, an Indian living room basking in Republic Day morning sun, adorned with saffron, white, and green decor, and vibrant festive accents."
image = pipe(prompt=prompt, num_inference_steps=25).images[0]
# Save the SDXL image output.
image.save('/content/sdxl_output.png')
Generating video from generated image 📽️
Surprise!! Surprise!!
Diffusion models can generate videos as well. We used Stable Video Diffusion (SVD) from Stability AI to generate a video from the image generated above. SVD has a pipeline similar to SDXL's.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Set up the SVD pipeline; CPU offloading and feed-forward chunking keep VRAM usage down.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
pipe.unet.enable_forward_chunking()

# Load the image and resize it to the resolution SVD expects.
image = load_image("/content/sdxl_output.png")
resized_image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(resized_image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
Note: SVD can be really heavy on memory, so we used chunking. Chunking lets the feed-forward layers run in a loop over smaller chunks instead of processing one huge feed-forward batch at once.
But our Goal!! → Image-to-Image generation
Our ultimate goal is to upload an image of your home or room, pass in a prompt, and get back an image of the room conditioned on that prompt.
To achieve this, we reuse the fine-tuned LoRA weights from above and use the AutoPipelineForImage2Image pipeline from diffusers.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.load_lora_weights('/content/SDXL_LoRA_model/pytorch_lora_weights.safetensors')
_ = pipeline.to("cuda")
pipeline.enable_model_cpu_offload()
url = "https://i.pinimg.com/736x/e2/35/89/e235890ef901c7173f9453d779635f3e.jpg"
init_image = load_image(url)
image = init_image.resize((1024, 576))
prompt = "A cozy Indian living room glows with morning sunshine on Republic Day,
its walls decked in saffron, white, and green tapestries and art, while colorful
cushions and festive garlands add a vibrant, celebratory air."
# pass prompt and image to pipeline
image_out = pipeline(prompt, image=image, strength=0.5).images[0]
make_image_grid([image, image_out], rows=1, cols=2)
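The strength argument controls how far the pipeline is allowed to drift from the input image: lower values stay close to your original room, while higher values lean more on the prompt. Continuing from the snippet above, a quick way to compare (the 0.3/0.5/0.7 values are just illustrative) is:
# Illustrative sweep over strength values, reusing `pipeline`, `prompt`, and `image` from above.
outputs = [pipeline(prompt, image=image, strength=s).images[0] for s in (0.3, 0.5, 0.7)]
make_image_grid([image] + outputs, rows=1, cols=4)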
The Result:
You can follow the complete code for this project on Google Colaboratory, linked in the resources for you below.
Note: You might encounter a runtime restart while running Stable Video Diffusion and image-to-image inference. Both of these pipelines are memory- and GPU-intensive, so please reconnect to the runtime if the problem occurs.
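If you would rather not restart, freeing the previous pipeline before building the next one can help. A minimal sketch (not part of the original notebook) is:
# Release GPU memory held by a pipeline you are done with, e.g. the SDXL text-to-image pipeline.
import gc
import torch

del pipe  # or `del pipeline`, whichever object you no longer need
gc.collect()
torch.cuda.empty_cache()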
Conclusion
In this blog, we covered three use cases of Stable Diffusion: text-to-image, image-to-video, and image-to-image generation. Diffusion models are very powerful for exploring vision-related use cases, and the possibilities are nearly endless.
Resources For You
Code: https://github.com/NSTiwari/Stable-DiffusionXL-using-DreamBooth-and-LoRA
Demo Video: https://www.youtube.com/watch?v=X_p0E2vBRjY
Try out Text to Image model Inference: https://huggingface.co/NSTiwari/SDXL_LoRA_model
Explore Stable Diffusion more: