Inspiration Seeds is a generative system for supporting visual exploration. Given two images, it generates multiple visual combinations that reveal non-obvious connections between them. The process is entirely visual and does not rely on text prompts, making it useful at early stages of creative work when goals are still unclear. Rather than executing a predefined idea, the system supports exploration and discovery of new visual directions.
Ideas rarely arrive fully formed.
Exploration and inspiration are central to the design process: creators sketch, assemble inspiration boards, and gather references.
This kind of exploration is open-ended and associative.
As creators move through visual concepts, unexpected connections can emerge, sparking new ideas and helping guide creative direction.
Early exploration is open-ended and intuitive. The connections that lead to new ideas are often subtle and non-obvious, and recognizing them typically requires a trained or creative eye. Thinking in terms of abstract visual attributes, and imagining how they might combine, is inherently challenging.
Inspiration Seeds supports visual exploration through a purely visual interaction. Given two images, it generates multiple non-trivial combinations that surface hidden visual relationships, without requiring text prompts or predefined instructions.
When exploring visually, the goal is not to apply a specific edit, but to discover relationships that are not obvious in advance. This is why many existing image generation models struggle in this setting.
For example, models like Nano Banana are designed to execute explicit edits of the kind that often appear in their training data, such as inserting one object into another or replacing a visible part. As a result, when asked to combine images, they tend to fall back on these literal operations, even when prompted to be "creative":
Given a leaf and a portrait, Nano Banana in this case produced a trivial combination, replacing the earring with the leaf, even when prompted to produce a non-trivial, creative combination. Our method surfaces deeper connections: the leaf's decay pattern appears in the skin, and its aged quality carries over to the subject. This may inspire new ideas as part of an exploratory process.
Given two images $(I_A, I_B)$, our goal is to learn the mapping $f_\theta(I_A, I_B) \rightarrow I_{\text{comb}}$.
A central challenge is obtaining suitable training data: triplets of $(I_A, I_B, I_{\text{comb}})$, where the combination reflects a meaningful visual relationship rather than a superficial one.
Our key insight is to invert this problem: instead of searching for image pairs that combine well, we start from visually rich images and decompose them into two constituent visual aspects.
The original image then serves as a ground-truth combination, providing natural supervision for training.
Our task is thus reduced to learning to decompose a given image in an unconstrained way.
(1) We formulate decomposition as controlled editing in CLIP latent space, where linear directions correspond to meaningful visual transformations.
(2) We leverage CLIP Sparse Autoencoders (SAEs) to find editing directions, which expose interpretable visual attributes from CLIP embeddings.
(3) We separate groups of similar attributes by clustering the top-k most active SAE features into two groups.
(4) We define the editing direction as the difference between the two groups' centroids.
(5) The resulting directions isolate distinct visual qualities and are used to construct training pairs for our combination model.
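To make steps (2)-(4) concrete, here is a minimal sketch of how an editing direction could be derived from a CLIP image embedding and a pretrained CLIP SAE. The weight names (`W_enc`, `b_enc`, `W_dec`), the choice of `k`, and the use of k-means for the two-way grouping are illustrative assumptions, not the exact implementation.

```python
# Sketch of steps (2)-(4): derive an editing direction from CLIP SAE features.
# Assumptions: a ReLU SAE trained on CLIP image embeddings, with encoder weights
# W_enc [d_clip, n_features], bias b_enc [n_features], decoder W_dec [n_features, d_clip].
import torch
from sklearn.cluster import KMeans

def editing_direction(clip_emb, W_enc, b_enc, W_dec, k=32):
    """Return a unit direction in CLIP space separating two groups of active SAE features."""
    # (2) Sparse feature activations: only a handful of interpretable features fire.
    acts = torch.relu(clip_emb @ W_enc + b_enc)                  # [n_features]
    # (3) Cluster the decoder directions of the top-k most active features into two groups.
    top_idx = acts.topk(k).indices
    feat_dirs = W_dec[top_idx]                                   # [k, d_clip]
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feat_dirs.detach().cpu().numpy())
    labels = torch.as_tensor(labels)
    # (4) Editing direction = difference between the two group centroids.
    direction = feat_dirs[labels == 0].mean(0) - feat_dirs[labels == 1].mean(0)
    return direction / direction.norm()

# (5) Applying +/- the direction to the original embedding yields two edited embeddings
# that isolate distinct visual qualities; decoding them back to images (decoder not shown
# here) gives I_A and I_B, while the original image serves as the ground-truth I_comb:
#   emb_A = clip_emb + alpha * direction
#   emb_B = clip_emb - alpha * direction
```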
Below are some examples of the resulting decompositions:
We build a synthetic pool of diverse, visually rich images, from which we produce 2085 triplets $(I_A, I_B, I_{\text{comb}})$ using our decomposition pipeline.
Using this set, we fine-tune Flux.1 Kontext to perform the inverse task: given two images, produce a combination that captures visual aspects of both.
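As a rough illustration of the resulting training data, the sketch below defines a PyTorch dataset over the decomposition triplets. The folder layout and file names are hypothetical, and the actual Flux.1 Kontext fine-tuning code is not shown.

```python
# Illustrative triplet dataset for fine-tuning (folder layout and file names are assumed).
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """Yields (I_A, I_B, I_comb): two decomposed visual aspects and the original image."""
    def __init__(self, root, transform=None):
        self.folders = sorted(p for p in Path(root).iterdir() if p.is_dir())
        self.transform = transform

    def __len__(self):
        return len(self.folders)

    def __getitem__(self, idx):
        folder = self.folders[idx]
        # Hypothetical file names: a.png / b.png are the decomposed aspects,
        # comb.png is the original, visually rich image used as supervision.
        imgs = [Image.open(folder / name).convert("RGB")
                for name in ("a.png", "b.png", "comb.png")]
        if self.transform is not None:
            imgs = [self.transform(im) for im in imgs]
        return {"I_A": imgs[0], "I_B": imgs[1], "I_comb": imgs[2]}
```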
Some examples of exploration with our method are shown below:
Standard perceptual similarity metrics such as CLIP similarity or DreamSim reward visual similarity to the inputs. This means outputs that simply preserve or insert elements from the inputs score higher than those that transform and recombine them. Such metrics are not designed to measure whether a combination is non-trivial or unique.
We observe that trivial combinations can often be explained in a few words ("place object A into scene B"), whereas non-trivial combinations require longer descriptions to articulate what visual qualities were extracted and how they were transformed.
We prompt Gemini 2.5 Flash to describe how each output image could be reconstructed from its two source images, and use word count as a proxy for the complexity of the visual relationship. Higher word counts indicate more complex, non-trivial combinations. We also report the percentage of outputs classified as trivial patterns (copy, insertion, split).
| Method | Word Count (mean ± std) | Copy | Insertion | Split |
|---|---|---|---|---|
| Flux.1 Kontext | 23.5 ± 21.4 | 2.8% | 0.3% | 85.4% |
| Qwen-Image | 37.4 ± 19.2 | 16.2% | 18.9% | 10.6% |
| Nano Banana | 42.9 ± 15.6 | 9.1% | 19.7% | 0.3% |
| Ours | 54.8 ± 12.5 | 2.3% | 0.0% | 1.5% |
Description complexity comparison.
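A minimal sketch of the word-count metric above, assuming the `google-generativeai` Python SDK; the prompt wording and the classifier for trivial patterns (copy, insertion, split) are illustrative and may differ from what was used for the table.

```python
# Sketch of the description-complexity metric using Gemini 2.5 Flash.
# The prompt below is an assumption; the exact prompt may differ.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

PROMPT = ("Describe, as concisely as possible, how the third image could be "
          "reconstructed from the first two images.")

def description_word_count(path_a, path_b, path_out):
    images = [Image.open(p) for p in (path_a, path_b, path_out)]
    response = model.generate_content([PROMPT, *images])
    # Longer descriptions indicate a more complex, non-trivial visual relationship.
    return len(response.text.split())
```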
To validate this approach, we conducted a user study where participants classified the relationship between outputs and their input pairs. Description length increases with combination complexity: simple operations like duplication or insertion require fewer words, while texture transfer and other non-canonical relationships demand longer descriptions.
User study results.
We compare our method to a set of open-source (Flux.1 Kontext, Qwen-Image) and closed-source (Nano Banana, Reve, ChatGPT) models.
Qualitative comparison (image grid): inputs alongside outputs of Flux.1 Kontext, Qwen-Image, Nano Banana, Reve, ChatGPT, and Ours.
Flux.1 Kontext and Qwen-Image tend to preserve inputs or place them side by side without meaningful integration, inserting one object into the other's scene or simply reproducing the inputs.
Nano Banana produces similar object-level insertions, occasionally transferring some visual qualities but largely relying on scene placement.
Reve and ChatGPT sometimes generate non-trivial combinations, but also tend to produce scene compositions or overly polished outputs.
Our method integrates visual qualities from both inputs into unified forms and can surface non-literal connections. It builds on Flux.1 Kontext and is fully open source.
More results are available in the paper.