Generating consistent images of characters using AI

23.08.2024

If you want to use diffusion models to create consistent characters for storytelling, it is not as easy as you may have hoped. It is easy to create a single character with DALL·E or Stable Diffusion. But what if you want to create a whole story with the same characters in different environments and styles? While generative model researchers are working tirelessly to make this easier, we are not there yet.

One of our clients asked us to help them create an application that generates fairytales for kids where the same heroes progress through the story. Is it possible? Let’s see.

Every fairytale has a prince. The prince is manly, good-looking, and noble — pretty much like myself.

First, let's generate a training dataset using vanilla ChatGPT (GPT-4) with DALL·E 3 and a very detailed prompt like this:

The image depicts a storybook prince riding a majestic white horse. The prince is portrayed with fair skin, dark brown hair, and a confident expression. He is wearing a royal outfit consisting of a blue tunic with gold trim and a matching cape that drapes elegantly over the horse’s back. The cape is fastened with a red gemstone clasp. His attire is completed with red trousers, a brown leather belt with gold buckle, and tall brown boots with golden spurs. The prince’s posture is upright, showcasing a regal bearing, as he holds the horse’s reins in one hand. The horse has a muscular build, with a flowing grey mane and tail, and is outfitted with a red bridle and a saddle with gold accents. Both the prince and the horse are in motion, captured in a side profile view against a plain white background, which emphasizes their detailed rendering and the storybook illustration style of the image.

And here are the results:

Image created by AI tool DALL·E 3 — the author has the provenance and copyright

The results are not bad (they certainly look like me). The images are more or less consistent and look similar. They are not similar enough for prompting alone to produce a coherent story, but they are good enough for training.
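
If you want to reproduce this step without the ChatGPT UI, the same dataset can be generated programmatically through the OpenAI images API. This is a rough sketch; the model name, loop size, and output handling are assumptions:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full prince prompt from above, shortened here for readability
PRINCE_PROMPT = "The image depicts a storybook prince riding a majestic white horse. ..."

for i in range(1, 12):
    result = client.images.generate(
        model="dall-e-3",   # assumed model name
        prompt=PRINCE_PROMPT,
        size="1024x1024",
        n=1,                # DALL-E 3 generates one image per request
    )
    print(f"{i}: {result.data[0].url}")  # download and save as {i}.png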

Stable Diffusion

Since vanilla prompting failed miserably to solve our problem, let's try more advanced techniques. We have all heard about Stable Diffusion; in fact, I have an article about it here. What if we fine-tune Stable Diffusion embeddings for a prince with our training dataset? Isn't fine-tuning supposed to solve all problems, after all?

We are going to use this approach: https://stable-diffusion-art.com/embedding. It lets you train an embedding (textual inversion) for a character and reuse it in subsequent generations. We ran several experiments; the results follow the sketch below:
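
For reference, here is roughly how such an embedding is loaded and used with the diffusers library. The base model, embedding file name, and the <prince> token are assumptions for illustration:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base model
    torch_dtype=torch.float16,
).to("cuda")

# Load the embedding produced by textual inversion training
pipe.load_textual_inversion("prince_embedding.pt", token="<prince>")

image = pipe(
    "<prince> riding a white horse through a forest, storybook illustration"
).images[0]
image.save("prince_embedding_test.png")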

Doesn’t look a lot like what we want, does it? Consistency is horrible. The problems we see here:

  1. The generated images do not look like the original image.
  2. When we generate a scene, the model ignores all the details about the location and just generates the character. No horse either.
  3. Training is only possible through the Stable Diffusion web UI, which is not easy to automate.

This is a newbie approach though, so let's move on to the grown-up stuff.

LoRA and DreamBooth

LoRA (Low-Rank Adaptation) is a fine-tuning technique that updates only a small number of learnable parameters. Modern models are massive, so this makes it possible to adapt an existing model in reasonable time and with reasonable resources.

DreamBooth is a training technique that updates the entire text-to-image model by training on just a few images of a subject or style.

We want to update a base Stable Diffusion model with images of our prince so that it knows exactly what our prince looks like. You can optionally associate the character with a specific token, for example <prince>.

DreamBooth is needed here as an additional step: without it, fine-tuning on a limited set of images makes the model output the character only in the same angles and poses as in the training dataset, and the model generally becomes dumber. I'm not going to dive into how LoRA works here, as it is a topic for another article, but we want something like this:

For training, let's use a hosted Stable Diffusion training service (stablediffusionapi.com):

As a base model, we will use an open Midjourney-style model (prompthero/openjourney) together with the dataset above. Training params:

{
  "_token": "",
  "key": "",
  "channel": "train_in_model",
  "class_prompt": "An image of a prince on the white horse",
  "base_model_id": "midjourney",
  "endpoint": "create-request",
  "training_type": "men",
  "instance_prompt": "An image of prince_on_the_white_horse person",
  "images": [
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/1.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/10.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/11.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/2.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/3.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/4.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/5.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/6.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/7.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/8.png",
    "https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/9.png"
  ],
  "webhook": "https://stablediffusionapi.com/training_status/SmapI87z5JRrbcHFQUtTD3P2A",
  "seed": null,
  "base_model": "prompthero/openjourney",
  "max_train_steps": "2000",
  "training_id": "SmapI87z5JRrbcHFQUtTD3P2A"
}
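
Submitting this request from code is a straightforward HTTP call. Here is a minimal sketch with the requests library; the endpoint path is an assumption, so check the service documentation for the current URL:

import requests

payload = {
    "key": "YOUR_API_KEY",
    "base_model_id": "midjourney",
    "instance_prompt": "An image of prince_on_the_white_horse person",
    "class_prompt": "An image of a prince on the white horse",
    "training_type": "men",
    "max_train_steps": "2000",
    "images": [
        f"https://s3.timeweb.com/bac216d5-ai-labs-public/princedataset/{i}.png"
        for i in range(1, 12)
    ],
    "webhook": "https://example.com/training_status",  # your callback URL
}

response = requests.post(
    "https://stablediffusionapi.com/api/v3/fine_tune",  # assumed endpoint path
    json=payload,
    timeout=60,
)
print(response.json())  # contains the training_id, later used as model_id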

Training took 30 minutes. After the training, we generated images in different locations. Here are some examples:

{
  "prompt": "mdjrny-v4 style An image of prince_on_the_white_horse person riding a horse on the seashore. Show blue sky with the sun. Dreamy style.",
  "negative_prompt": "painting, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, deformed, ugly, blurry, bad anatomy, bad proportions, extra limbs, cloned face, skinny, glitchy, double torso, extra arms, extra hands, mangled fingers, missing lips, ugly face, distorted face, extra legs, anime",
  "model_id": "SmapI87z5JRrbcHFQUtTD3P2A",
  "panorama": null,
  "self_attention": null,
  "width": "512",
  "guidance": 7.5,
  "height": "768",
  "samples": "1",
  "safety_checker": "no",
  "steps": 20,
  "seed": null,
  "webhook": null,
  "track_id": null,
  "scheduler": "UniPCMultistepScheduler"
}
 
{
  "prompt": "mdjrny-v4 style An image of prince_on_the_white_horse person riding a horse through mountains",
  "negative_prompt": "painting, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, deformed, ugly, blurry, bad anatomy, bad proportions, extra limbs, cloned face, skinny, glitchy, double torso, extra arms, extra hands, mangled fingers, missing lips, ugly face, distorted face, extra legs, anime",
  "model_id": "SmapI87z5JRrbcHFQUtTD3P2A",
  "panorama": null,
  "self_attention": null,
  "width": "512",
  "guidance": 7.5,
  "height": "768",
  "samples": "1",
  "safety_checker": "no",
  "steps": 20,
  "seed": null,
  "webhook": null,
  "track_id": null,
  "scheduler": "UniPCMultistepScheduler"
}
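
Once training is finished, generation against the fine-tuned model is another HTTP call built from the payloads above. Again a sketch; the inference endpoint path is an assumption:

import requests

payload = {
    "key": "YOUR_API_KEY",
    "model_id": "SmapI87z5JRrbcHFQUtTD3P2A",  # returned by the training call
    "prompt": (
        "mdjrny-v4 style An image of prince_on_the_white_horse person "
        "riding a horse on the seashore. Show blue sky with the sun. Dreamy style."
    ),
    "negative_prompt": "painting, extra fingers, mutated hands, deformed, blurry",
    "width": "512",
    "height": "768",
    "samples": "1",
    "steps": 20,
    "guidance": 7.5,
    "scheduler": "UniPCMultistepScheduler",
}

response = requests.post(
    "https://stablediffusionapi.com/api/v4/dreambooth",  # assumed endpoint path
    json=payload,
    timeout=60,
)
print(response.json().get("output"))  # list of generated image URLs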

Looks very impressive, if you live in 1992. Here are some notes:

  1. The trained images do not look exactly the same as the original, although there is a definite similarity.
  2. Quality is worse than that of Midjourney and DALL·E 3.
  3. Training takes 30 minutes and costs $1 per session.
  4. Stable Diffusion can only handle a limited number of tokens in one go, so there are difficulties with generating detailed locations.

This approach can be used to solve the task, but its quality is not guaranteed.

Using a combination of GPT-4 Vision, GPT-4, and DALL·E 3

The GPT-4 Vision model takes an image of a hero as input and produces a very detailed description of that hero (face, clothes, hair, etc.). Then we use DALL·E 3 to generate images:

Here is a prompt example:

Generate an image according to the following description:

The Prince: The prince has a noble and confident demeanor, with fair skin and dark, wavy brown hair. He has a strong jawline and a slight stubble, and his eyes are focused and determined. He wears a royal blue, double-breasted jacket with gold embroidery and button details. Underneath, there’s a white shirt with a standing collar. A bright red cape drapes over his left shoulder and flows behind him, indicating movement.

The Horse: The horse is a powerful and elegant white steed with a muscular build, a broad chest, and an arched neck. Its mane and tail are long, flowing, and well-groomed, with a soft, silvery sheen. The horse’s tack is luxurious, with intricate red velvet and gold leaf designs. Notable are the decorative elements like the forehead band, which is adorned with red roses and golden filigree, and the saddle, which is equally ornate.

The Attire and Accessories: The prince’s attire is complemented by brown leather gloves and matching high-riding boots with subtle detailing. He carries a sword at his left hip, suggested by the hilt peeking out from under the cape. His posture is upright, projecting authority and grace.

The Environment: The setting is a mountain landscape with sharp, rugged peaks in the background. The sky is clear blue with a hint of soft clouds, suggesting it is a fair weather day. The sunlight enhances the scene with a golden hue, emphasizing the richness of the colors and the grandeur of the prince and his horse.

The action: the prince is holding a sword in his left hand

The Composition: The prince and his horse are centrally positioned in the frame, with the prince slightly turned towards the viewer, giving a three-quarter view of his face and body. The horse’s head is turned slightly to its right, showing off the decorative tack and its spirited expression.

The Color Palette: The color scheme has a harmonious blend of royal blue, gold, and red for the prince’s attire, contrasting with the white of the horse and the natural browns and greens of the mountains.

You will not always get a character that is 100% identical to the original, but it is close.

This approach gives us the maximum generation quality, since we rely on the DALL·E 3 model.

The result looks very similar to the original and is quite fast to produce, so depending on your needs this approach could work.
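
Here is a rough sketch of this pipeline using the OpenAI Python SDK. The model names, the reference image URL, and the scene wording are assumptions:

from openai import OpenAI

client = OpenAI()

# 1. Ask the vision model for an exhaustive description of the hero
description = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4 model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this character in exhaustive detail: face, hair, "
                     "clothes, accessories, colors. Do not describe the background."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/prince_reference.png"}},
        ],
    }],
).choices[0].message.content

# 2. Reuse the same description in every scene prompt
scene = "riding his horse along the seashore at sunset, storybook illustration"
image = client.images.generate(
    model="dall-e-3",
    prompt=(
        "Generate an image according to the following description:\n"
        f"{description}\n\nThe scene: the prince is {scene}"
    ),
    size="1024x1024",
)
print(image.data[0].url)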

LoRA + ControlNet + Stable Diffusion

The most comprehensive approach. ControlNet is a technique that lets you use other images as templates to control properties of the desired result image, such as pose or background, in a much more precise way:

This is a very useful thing for our task. Here is what we do:

  1. Using GPT-4 (with DALL·E), create about 100 images of that face.
  2. Do LoRA training on these images using https://stable-diffusion-art.com/train-lora/. We are training face embeddings.
  3. Using ControlNet, create a depth mask of the hero in the required position.
  4. Using ControlNet, generate our prince (see the sketch after this list).
  5. Artifacts like extra fingers are still possible, so you might still need to fix them manually.
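
A minimal sketch of steps 2 to 4 with the diffusers library might look like this; the ControlNet checkpoint, base model, LoRA path, and file names are assumptions:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# LoRA weights trained on the ~100 face images (step 2)
pipe.load_lora_weights("./prince_lora")

# Pose template extracted from a reference image (step 3)
pose_image = load_image("prince_pose.png")

# Step 4: generate the prince in the required pose
image = pipe(
    "a prince riding a white horse through the mountains, storybook illustration",
    image=pose_image,
    num_inference_steps=30,
).images[0]
image.save("prince_controlnet.png")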

This is a more complex and expensive solution, though, so only use it if it is worth it. It is still not completely automatic and will require manual work.

While I was writing this, a similar approach using ControlNet images was implemented here:

https://huggingface.co/spaces/InstantX/InstantID. The results are very good:

A photo of a person on Arrakis in a Fremen outfit.

Works well, so try it.

That is all, have fun. If you have any comments or questions, or if you create your own set of images you want to share, leave them in the comments section.
