The Chosen One: Consistent Characters in Text-to-Image Diffusion Models: Related Work

18 Jul 2024

Authors:

(1) Omri Avrahami, Google Research and The Hebrew University of Jerusalem;

(2) Amir Hertz, Google Research;

(3) Yael Vinker, Google Research and Tel Aviv University;

(4) Moab Arar, Google Research and Tel Aviv University;

(5) Shlomi Fruchter, Google Research;

(6) Ohad Fried, Reichman University;

(7) Daniel Cohen-Or, Google Research and Tel Aviv University;

(8) Dani Lischinski, Google Research and The Hebrew University of Jerusalem.

Text-to-image generation. Text-conditioned image generative models (T2I) [63, 69, 95] show unprecedented capabilities of generating high-quality images from mere natural language text descriptions, and they are quickly becoming a fundamental tool for creative vision tasks. In particular, text-to-image diffusion models [9, 30, 52, 77–79] are employed for guided image synthesis [8, 15, 18, 22, 27, 51, 87, 97] and image editing [5, 7, 10, 13, 28, 38, 47, 49, 55, 75, 84]. Image editing methods make it possible to modify an image of a given character, e.g., change its pose; however, they cannot ensure consistency of the character across novel contexts, as our problem requires.
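For context, the following is a minimal, illustrative sketch of invoking an off-the-shelf text-to-image diffusion model through the open-source Hugging Face diffusers library; the checkpoint and prompt are placeholder assumptions and are not the models used in this work.

```python
# Illustrative sketch only (assumed checkpoint and prompt, not the models used
# in this work): sampling an image from a text prompt with an off-the-shelf
# latent diffusion model via the Hugging Face `diffusers` library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Each call starts from fresh Gaussian noise, so the same prompt yields a
# different-looking character every time; this is the consistency gap
# discussed above.
image = pipe("a watercolor painting of a young wizard with a red scarf").images[0]
image.save("wizard.png")
```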

In addition, diffusion models have been used for other tasks [56, 96], such as video editing [23, 44, 45, 50, 59, 92], 3D synthesis [19, 31, 48, 58], editing [11, 25, 74, 98], and texturing [67], typography generation [35], motion generation [60, 81], and solving inverse problems [32].

Text-to-image personalization. Text-conditioned models cannot, on their own, generate images of a specific object or character. To overcome this limitation, a line of work utilizes several images of the same instance to encapsulate new priors in the generative model. Existing solutions range from optimizing text tokens [20, 85, 88] to fine-tuning the parameters of the entire model [6, 70], with recent works in between that fine-tune only a small subset of parameters [1, 17, 26, 33, 41, 71, 82]. Models trained in this manner can generate consistent images of the same subject. However, they typically require a collection of images depicting the subject, which naturally narrows their ability to generate any imaginary character. Moreover, when trained on a single input image [6], these methods tend to overfit and produce similar images with minimal diversity during inference.
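For concreteness, below is a minimal sketch of the token-optimization end of this spectrum, in the spirit of textual inversion [20]: a single new embedding is optimized while the generative model stays frozen. The embedding width, the toy surrogate loss, and the random reference latents are placeholder assumptions rather than the implementation of any cited method.

```python
# Minimal sketch of the "optimize a text token" personalization family:
# a single new embedding vector is learned for a placeholder token while the
# generative model itself stays frozen. The loss below is a toy stand-in,
# not a real diffusion denoising objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMBED_DIM = 768                                 # CLIP-style text-embedding width (assumed)

new_token = torch.randn(EMBED_DIM, requires_grad=True)   # learnable embedding for "S*"
optimizer = torch.optim.AdamW([new_token], lr=5e-3)

# Latents of a few reference images of the subject (random placeholders here).
reference_latents = torch.randn(4, EMBED_DIM)

def surrogate_diffusion_loss(token_embedding: torch.Tensor,
                             latents: torch.Tensor) -> torch.Tensor:
    """Stand-in for the frozen denoising loss: a real implementation would
    condition the U-Net on a prompt containing the new token and compare
    predicted noise against the noise added to the reference latents."""
    return F.mse_loss(token_embedding, latents.mean(dim=0))

for step in range(200):
    optimizer.zero_grad()
    loss = surrogate_diffusion_loss(new_token, reference_latents)
    loss.backward()
    optimizer.step()

# At inference, prompts containing "S*" substitute `new_token` for a vocabulary
# embedding, steering generation toward the specific subject.
```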

Unlike previous works, our method does not require an input image; instead, it generates consistent and diverse images of the same character based only on a text description. Additional works aim to bypass personalization training by introducing a dedicated personalization encoder [3, 16, 21, 37, 42, 76, 90, 93]. Given an image and a prompt, these works can produce images with a character similar to the input; however, as shown in Section 4.1, they lack consistency when generating multiple images from the same input. Concurrently, ConceptLab [66] is able to generate new members of a broad category (e.g., a new pet); in contrast, we seek a consistent instance of a character described by the input text prompt.

Story visualization. Consistent character generation is well studied in the field of story visualization. Early GAN-based works [43, 80] employ a story discriminator for image-text alignment. Recent works, such as StoryDALL-E [46] and Make-A-Story [62], utilize pre-trained T2I models for image generation, while an adapter model is trained to embed story captions and previous images into the T2I model. However, these methods cannot generalize to novel characters, as they are trained on specific datasets. More closely related, Jeong et al. [36] generate consistent storybooks by combining textual inversion with a face-swapping mechanism; their work therefore relies on images of existing human-like characters. TaleCrafter [24] presents a comprehensive pipeline for storybook visualization; however, its consistent-character module is based on an existing personalization method that requires fine-tuning on several images of the same character.

Manual methods. Other attempts at achieving consistent character generation with a generative model rely on ad hoc and manually intensive tricks, such as using text tokens of a celebrity, or a combination of celebrities [64], to create a consistent human; however, the generated characters resemble the original celebrities, and this approach does not generalize to other character types (e.g., animals). Users have also proposed to ensure consistency by manually crafting very long and elaborate text prompts [65], or by using image variations [63] and filtering them manually by similarity [65]. Others have suggested generating a full design sheet of a character, then manually filtering the best results and using them for further generation [94]. All these methods are manual, labor-intensive, and ad hoc for specific domains (e.g., humans). In contrast, our method is fully automated and domain-agnostic.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.