In the field of personalized image generation, the ability to create
images that preserve a given concept has improved significantly. However, generating
an image that naturally integrates multiple concepts into a cohesive and visually appealing composition remains challenging.
This paper introduces "InstantFamily," an approach that employs
a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our
method effectively preserves identity by integrating global and local features from a pre-trained face recognition model with text
conditions. Additionally, our masked cross-attention mechanism
enables precise control over multiple identities and their composition in the generated images. We demonstrate the effectiveness of InstantFamily
through experiments showing its superiority in generating multi-ID images
while resolving well-known multi-ID generation
problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore,
our model exhibits remarkable scalability, preserving a greater number of
identities than it was originally trained with.
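To make the idea of masked cross-attention concrete, the following is a minimal sketch, not the authors' implementation: it assumes each identity comes with a binary spatial mask, and restricts the image tokens so that a given region attends only to the embedding of the identity assigned to it. All tensor names and shapes here are illustrative assumptions.

```python
import torch


def masked_cross_attention(image_tokens, id_tokens, masks):
    """Illustrative masked cross-attention (single head, for clarity).

    image_tokens: (B, N, C)    spatial latent tokens (N = H*W)
    id_tokens:    (B, K, M, C) K identity embeddings, M tokens each
    masks:        (B, K, N)    binary mask per identity over spatial positions
    """
    B, N, C = image_tokens.shape
    K, M = id_tokens.shape[1], id_tokens.shape[2]

    q = image_tokens                      # queries from image latents
    k = id_tokens.reshape(B, K * M, C)    # keys/values from all identity tokens
    v = k

    # Raw attention scores between every spatial position and every ID token.
    attn = torch.einsum("bnc,bmc->bnm", q, k) / C ** 0.5      # (B, N, K*M)

    # Expand each identity's spatial mask over its M tokens and block
    # attention to identities outside their assigned regions.
    attn_mask = masks.unsqueeze(-1).expand(B, K, N, M)         # (B, K, N, M)
    attn_mask = attn_mask.permute(0, 2, 1, 3).reshape(B, N, K * M)
    attn = attn.masked_fill(attn_mask == 0, float("-inf"))

    weights = attn.softmax(dim=-1)
    # Positions covered by no mask receive no identity contribution.
    weights = torch.nan_to_num(weights)

    return torch.einsum("bnm,bmc->bnc", weights, v)            # (B, N, C)
```

In this reading, the masks are what give per-region control: each identity embedding can only influence the pixels it is assigned to, which is one plausible way the mechanism described above could separate multiple identities within a single composition.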