Hacker News
19 hours ago by Genego

I have been generating a few dozen images per day for storyboarding purposes. The more I try to perfect it, the easier it becomes to control the outputs and keep the entire visual story and its characters consistent across a few dozen different scenes, even controlling the time of day throughout the story. I am currently working with 7-layer prompts to control for environment, camera, subject, composition, light, colors, and overall quality (it might be overkill, but it's also an experiment).
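
For illustration, the composition looks roughly like this (a simplified sketch: the layer contents and the helper function are made-up examples, not my actual profiles):

    # Sketch of a 7-layer prompt template; layer names follow the list above,
    # contents are illustrative placeholders.
    BASE_LAYERS = {
        "environment": "rain-soaked cobblestone alley in a small coastal town",
        "camera": "35mm lens, eye level, shallow depth of field",
        "subject": "the same red-coated courier character as in earlier scenes",
        "composition": "rule of thirds, subject on the left third",
        "light": "late golden hour, warm rim light, long soft shadows",
        "colors": "muted teal and amber palette",
        "quality": "highly detailed, consistent anatomy, no text artifacts",
    }

    def compose_prompt(overrides=None):
        """Merge per-scene overrides into the base layers and join them into one prompt."""
        layers = {**BASE_LAYERS, **(overrides or {})}
        return "\n".join(f"{name}: {text}" for name, text in layers.items())

    # Per scene only a few layers change; the untouched base layers keep the
    # whole story in the same visual universe and style.
    scene_prompt = compose_prompt({
        "environment": "inside the harbor office, dust in the window light",
        "light": "overcast mid-morning, cool diffuse light",
    })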

I also created a small editing suite for myself where I can draw bounding boxes on images when they aren't perfect, and have them fixed. Either just with a prompt, or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API). It's been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
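
The bounding-box part is less exotic than it sounds: conceptually it's just drawing a red rectangle on a copy of the image and sending that copy along with the edit instruction. A rough Pillow sketch (the generate_edit call is a placeholder for whatever image API you use, not a real function):

    from PIL import Image, ImageDraw

    def mark_region(path, box, out_path="marked.png"):
        """Draw a red rectangle around the region that needs fixing."""
        img = Image.open(path).convert("RGB")
        ImageDraw.Draw(img).rectangle(box, outline="red", width=5)
        img.save(out_path)
        return out_path

    marked = mark_region("scene_04.png", box=(412, 180, 560, 330))
    instruction = (
        "Inside the red box only: fix the character's left hand so it has five "
        "fingers. Return the full image without the red box; change nothing else."
    )
    # edited = generate_edit(image=marked, prompt=instruction)  # placeholder API call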

Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.

19 hours ago by taylorhughes

We use nano banana extensively to build video storyboards, which we then turn into full motion video with a combination of img2vid models. It sounds like we're doing similar things, trying to keep images/characters/setting/style consistent across ~dozens of images (~minutes of video). You might like the product depending on what you're doing with the outputs! https://hypernatural.ai

13 hours ago by nolroz

The website lets you type in an entire prompt, then tells you to log in, then dumps your prompt and leaves you with nothing. Lame.

11 hours ago by scotty79

I noticed ChatGPT and others do exactly the same once you run out of anonymous usage. Insanely annoying.

18 hours ago by roywiggins

Your "Dracula" character is possibly the least vampiric Dracula I've ever seen tbh

14 hours ago by qmmmur

If anything, the ubiquity of AI has just revealed how many people have 0 taste. It also highlights the important role these human-centred jobs played in keeping these people from contributing to the surface of any artistic endeavour in "culture".

17 hours ago by Conscat

That looks exactly like the photos on a Spirit Halloween costume.

18 hours ago by observationist

I agree. Bruhcula? Something like that. He's a vampire, but also models and does stunts for Baywatch - too much color and vitality. Joan of Arc is way more pale.

Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.

7 hours ago by tincholio

He looks like Dracula on LinkedIn

19 hours ago by Genego

Yes we are definitely doing the same! For now I’m just familiarizing myself in this space technically and conceptually. https://edwin.genego.io/blog

19 hours ago by nashadelic

> The more I try to perfect it, the easier it becomes

I have the opposite experience: once it goes off track, it's nearly impossible to bring it back on message.

19 hours ago by Genego

How much have you experimented with it? For some stories I may generate 5 image variations of 10-20 different scenes, spend time writing down what worked and what did not, and then run the generation again (this part is mostly for research). It's certainly advancing my understanding over time and letting me control the output better. But I'm learning that it takes a huge amount of trial and error, so versioning prompts is definitely recommended, especially if you find some nuances that work for you.

14 hours ago by vunderba

> I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed. Either just with a prompt or feeding them to Claude as image and then having it write the prompt to fix the issue for me (as a workflow on the api)

Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.

You could do something where you draw a bounding box and, when you get the response back from Nano, mask that section back over the original image - using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
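
Something like this, roughly (a Pillow sketch; it assumes the model returns the full edited frame, possibly at a lower resolution):

    from PIL import Image

    def paste_back(original_path, edited_path, box):
        """Composite the edited region back over the untouched original image.

        If the model downscaled the frame (e.g. to ~1MP), resize it back to the
        original size first -- swap LANCZOS for a proper upscaler if needed.
        """
        original = Image.open(original_path).convert("RGB")
        edited = Image.open(edited_path).convert("RGB")
        if edited.size != original.size:
            edited = edited.resize(original.size, Image.Resampling.LANCZOS)
        original.paste(edited.crop(box), (box[0], box[1]))
        return original

    result = paste_back("scene_04.png", "scene_04_edited.png", box=(412, 180, 560, 330))
    result.save("scene_04_fixed.png")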

10 hours ago by Genego

No, I am using my own workflows and software for this. I made nano-banana accept my bounding boxes. Everything is possible with some good prompting: https://edwin.genego.io/blog/lpa-studio - there are some videos there of an earlier version while I am editing a story. Either send the coords and describe the location well, or draw a box around the region and tell it to return the image without the drawn box and with only the requested changes.

It also works well if you draw a box on the original image, then ask Claude for a meta-prompt that deconstructs the changes into a much more detailed prompt, and then send the original image without the boxes for the changes. It really depends on the changes you need and how long you're willing to wait.

- normal image editing response: 12-14s

- image editing response with Claude meta-prompting: 20-25s

- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s

(I use Replicate though, so the actual API may be much faster).

This way you can also get new views of a scene by zooming the image in and out on the same aspect-ratio canvas and asking it to generatively fill the white borders. So you can go from a tight inside shot to viewing the same scene from outside a house window, or from inside the car to outside the car.
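
The zoom-out trick is mostly just canvas padding before the call: place the current frame on a larger white canvas with the same aspect ratio and ask the model to fill the border (another Pillow sketch; the generation call itself is a placeholder):

    from PIL import Image

    def pad_for_outpaint(path, zoom_out=1.6):
        """Center the frame on a larger white canvas with the same aspect ratio."""
        img = Image.open(path).convert("RGB")
        w, h = img.size
        canvas = Image.new("RGB", (int(w * zoom_out), int(h * zoom_out)), "white")
        canvas.paste(img, ((canvas.width - w) // 2, (canvas.height - h) // 2))
        return canvas

    pad_for_outpaint("inside_car.png").save("inside_car_padded.png")
    prompt = ("Generatively fill the white border: show the same scene from "
              "outside the car, keeping the visible interior exactly as it is.")
    # new_view = generate_edit(image="inside_car_padded.png", prompt=prompt)  # placeholder call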

an hour ago by vunderba

Thanks, that makes sense. I'll have to give the "red bounding box overlay" a shot when there are a great deal of similar objects in the existing image.

I also have a custom pipeline/software that takes in a given prompt, rewrites it using an LLM into multiple variations, sends it to multiple GenAI models, and then uses a VLM to evaluate them for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
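
In skeleton form it's roughly this (rewrite_prompt, generate_image, and vlm_score are placeholders for whatever LLM, image-model, and VLM backends you wire in):

    MAX_LOOPS = 4        # loop limiter so it doesn't burn a small country's GDP
    TARGET_SCORE = 8     # accept anything the VLM rates at least this highly (0-10)

    def generate_best(user_prompt):
        best = None
        for attempt in range(MAX_LOOPS):
            variations = rewrite_prompt(user_prompt, n=3)      # LLM rewrite step
            for variant in variations:
                for model in ("model_a", "model_b"):           # multiple GenAI backends
                    image = generate_image(model, variant)
                    score = vlm_score(image, user_prompt)      # VLM checks prompt adherence
                    if best is None or score > best[0]:
                        best = (score, image, variant, model)
            if best and best[0] >= TARGET_SCORE:
                break                                          # good enough, stop early
        return best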

14 hours ago by rcarr

You can literally just open the image up in Preview or whatever, add a red box, circle, etc., and then say "in the area with the red square make change foo", and it will normally get rid of the red box in the generated image. Whether or not it actually makes the change you want to see is another matter though. It's been very hit or miss for me.

14 hours ago by vunderba

Yeah I could see that being useful if there were a lot of similar elements in the same image.

I also had similarly mixed results with Nano Banana, especially around asking it to "fix/restore" things (a character's hand was an anatomical mess, for example).

15 hours ago by brulard

That sounds intriguing. 7 layers - do you mean it's one prompt composed of 7 parts, like a different paragraph for each aspect? How do you send bounding box info to Banana? Does it understand something like that? What does Claude add to the process? Makes your prompt more refined? Thanks

14 hours ago by Genego

Yes, the prompt is composed of 7 different layers, where I group together coherent visual and temporal responsibilities. Depending on the scene, I usually only change 3-5 layers, but the base layers stay the same, so the scenes all appear within the same story universe and the same style. If something feels off or needs to be improved, I just adjust one layer after another to experiment with the results on the entire story, but also at the individual scene level. Over time, I have created quite a few 7-layer style profiles that work well and that I can cast onto different story universes. Keep in mind this is heavy experimentation; it may just be that there is a much easier way to do this, but I am seeing success with it. https://edwin.genego.io/blog/lpa-studio - at any point I may throw this all out and start over, depending on how well my understanding of it all develops.

Bounding boxes: I actually send an image with a red box drawn around where the requested change is needed, and 8 out of 10 times it works well. If it doesn't, I use Claude to refine the prompt. The Claude API call that I make can see the image and the prompt, as well as understanding the layering system. This is one of the 3 ways I edit; there is another where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA, though) and a minimum of 12-14 seconds of waiting on each edit/generation.
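
The Claude step is a single messages call that gets the marked-up image plus my rough instruction and returns a more detailed edit prompt. Roughly like this (the model id and the exact wording are just examples, not my production setup):

    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def refine_edit_prompt(image_path, rough_instruction):
        """Ask Claude to expand a rough 'fix this region' note into a detailed edit prompt."""
        image_b64 = base64.standard_b64encode(open(image_path, "rb").read()).decode()
        message = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model id
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                    {"type": "text",
                     "text": "The red box marks the region to change. Requested change: "
                             f"{rough_instruction}. Write a single detailed image-editing "
                             "prompt for this edit; state that the red box must not appear "
                             "in the output."},
                ],
            }],
        )
        return message.content[0].text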

7 hours ago by yard2010

This is beautiful and inspiring. This is exactly what we need right now - tools that empower artists and builders to leverage these novel technologies. Claude Code is a great example IMHO, and it's the tip of the iceberg - the future consists of a whole new world, a new mental model, and a set of constraints and capabilities so different that I can't really imagine it.

Who would have thought that we would reach this uncharted territory, with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun for us to explore!

21 hours ago by simonw

I like the Python library that accompanies this: https://github.com/minimaxir/gemimg

I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:

  GEMINI_API_KEY="..." \
  uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
    python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...

19 hours ago by sorcercode

@simonw: slight tangent but super curious how you managed to generate the preview of that gemini-cli terminal session gist - https://gistpreview.github.io/?17290c1024b0ef7df06e9faa4cb37...

is this just a manual copy/paste into a gist with some html css styling; or do you have a custom tool à la amp-code that does this more easily?

18 hours ago by simonw

I used this tool: https://tools.simonwillison.net/terminal-to-html

I made a video about building that here: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...

It works much better with Claude Code and Codex CLI because they don't mess around with scrolling in the same way as Gemini CLI does.

17 hours ago by sorcercode

very cool. frequently, i want to share my prompt + session output; this will make that super easy! thanks again for sharing!

14 hours ago by ilyakaminsky

I use Gemini CLI on a daily basis. It used to crash often and I'd lose the chat history. I found this tool called ai-cli-log [1] and it does something similar out of the box. I don't run Gemini CLI without it.

[1] https://github.com/alingse/ai-cli-log

11 hours ago by minimaxir

I just merged the PR and pushed 0.3.1 to PyPI. I also added README documentation and allowed for a `gemimg` entrypoint to the CLI via project.scripts as noted elsewhere in the thread.

20 hours ago by ctippett

Any reason for not also adding a project.scripts entry for pyproject.toml? That way the CLI (great idea btw) could be installed as a tool by uv.

19 hours ago by simonw

I decided to avoid that purely to keep changes made to the package as minimal as possible - adding a project.scripts entry means installing it adds a new command alias. My approach changes nothing other than making "python -m gemimg" do something useful.

I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!

21 hours ago by echelon

The author went to great lengths about open source early on. I wonder if they'll cover the QwenEdit ecosystem.

I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.

You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.

I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.

That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.

18 hours ago by vunderba

The Qwen-Edit images from my GenAI Image Editing Showdown site were all generated from a ComfyUI workflow on my machine - it's shockingly good for an open-weight model. It was also the only model that scored a passing grade on the Van Halen M&M test (even compared against Nanobanana)

https://genai-showdown.specr.net/image-editing

3 hours ago by irthomasthomas

Ha, I created a Van Halen M&M test for text prompts. I would include an instruction demanding that the response contain <yellow_m&m> and <red_m&m> but never <brown_m&m>. Then I would fail any LLM that did not include any M&Ms, or that wrote anything about the <brown_m&m> in the final output.

19 hours ago by msp26

> I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.

For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.

It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.

I'll keep using Claude a lot for multimodality and artifacts but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.

18 hours ago by astrange

Claude Sonnet 4.5 doesn't even feel sycophantic in the 4o way; it feels like it has BPD. It switches from desperately agreeing with you to moralizing lectures and then has a breakdown if you point out it's wrong about anything.

Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.

They're really torturing their poor models over there.

20 hours ago by minimaxir

I've been keeping an eye on Qwen-Edit/Wan 2.2 shenanigans and they are interesting; however, actually running those types of models is cumbersome, and in the end it's unclear whether it's actually worth it over the $0.04/image for Nano Banana.

20 hours ago by braebo

Takes a couple mouse clicks in ComfyUI

20 hours ago by CamperBob2

I was skeptical about the notion of running similar models locally as well, but the person who did this (https://old.reddit.com/r/StableDiffusion/comments/1osi1q0/wa... ) swears that they generated it locally, just letting a single 5090 crunch away for a week.

If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.

18 hours ago by vunderba

Good read minimaxir! From the article:

> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.

In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7b LLM in-between that takes a given prompt as an input and creates 4 variations of it before sending them all out.
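
The in-between step is tiny; it just asks the local model for N rewrites and collects them. A sketch (call_local_llm is a placeholder for however you serve Mistral 7B):

    def expand_prompt(prompt, n=4):
        """Ask a local LLM for n differently worded versions of the same image prompt."""
        rewrites = []
        for i in range(n):
            instruction = (
                "Rewrite the following image prompt with different wording and extra "
                "concrete visual detail, but keep every stated element and the layout "
                f"intact (variation {i + 1} of {n}):\n\n{prompt}"
            )
            rewrites.append(call_local_llm(instruction))  # placeholder for the Mistral 7B call
        return rewrites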

> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.

This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.

[1] - https://mordenstar.com/portfolio/zeno-paradox

15 hours ago by junon

It might also be an explicit guard against Studio Ghibli specifically after the "make me Ghibli" trend a while back, which upset Studio Ghibli (understandably so).

15 hours ago by minimaxir

It happens with other styles. The demo documentation example which attempts to transfer an image into the very-public-domain Starry Night by Van Gogh doesn't do a true style transfer: https://x.com/minimaxir/status/1963429027382694264

15 hours ago by junon

Ah interesting! Thanks for the clarification. Great article :)

a day ago by dostick

Use Google AI Studio to submit requests. To remove the watermark, open the browser developer tools, right-click on the request for the "watermark_4" image, and select the option to block it. From the next generation onward there will be no watermark!

2 hours ago by dreis_sw

So the watermark is being added to the image on the client-side? That's pretty bad

4 hours ago by billynomates

That sounds dangerous honestly. Watermarks should be mandatory for AI generated images.

2 hours ago by dymk

How would you enforce that when it's actually important? Any "bad actor" could just open Photoshop and remove it, or run a de-lobotomized model that doesn't watermark.

21 hours ago by mFixman

The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!

This looks like it's caused by 99% of the relative directions in image descriptions being given from the looker's point of view, and by the fact that 99% of the ones that aren't refer to a human rather than a skull-shaped pancake.

21 hours ago by jonas21

I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."

20 hours ago by kjeksfjes

Exactly what I was thinking too. I'm a designer, and I'm used to receiving feedback and instructions. "The left eye socket" would to me refer to what I currently see in front of me, while "its left eye socket" instantly shifts the perspective from me to the subject.

13 hours ago by bear141

I find this interesting. I've always described things from the user's point of view. Like the left side of a car, regardless of who is looking at it from what direction, is the driver side. To me, this would include a body.

21 hours ago by martin-adams

I picked up on that also. I feel that a lot of humans would also get confused about whether you mean the eye on the left, or the subject's left eye.

20 hours ago by Closi

To be honest, this is the sort of thing Nano Banana is weak at in my experience. It's absolutely amazing - but it doesn't understand left/right/up/down/shrink this/move this/rotate this, etc.

To demonstrate this weakness with the same prompts as the article, see the link below, which shows that it is a model weakness and not just a language ambiguity:

https://gemini.google.com/share/a024d11786fc

20 hours ago by ffsm8

Mmh, ime you need to discard the session/rewrite the failing prompt instead of continuing and correcting on failures. Once errors occur you've basically introduced a poison pill which will continuously make things go haywire. Spelling out what it did wrong is the most destructive thing you can do - at least in my experience.

18 hours ago by astrange

Almost no image/video models can do "upside-down" either.

17 hours ago by basch

To the point where you can say "raise the left arm" and then "raise the right arm" and get the same image with the same arm raised.

2 hours ago by zulban

Extroverts tend to expect directions from the perspective of the skull. Introverts tend to expect their own perspective for directions. It's a psychology thing, not an error.

20 hours ago by minimaxir

I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later in the post.

For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.

5 hours ago by frumiousirc

The lack of proper indentation (which you noted) in the Python fib() examples was even more apparent. The fact that both AIs you tested failed in the same way is interesting. I've not played with image generation; is this type of failure endemic?

2 hours ago by minimaxir

My hunch in that case is that the composition of the image implied left-justified text, which overrode the indentation rule.

2 hours ago by slightknack

The minimaxir/gemimg repo is pretty cool, fwiw.

Going further, one thing you can do is give Gemini 2.5 a system prompt like the following:

https://goto.isaac.sh/image-prompt

And then pass Gemini 2.5's output directly to Nano-Banana. Doing this yields very high-quality images. This is also good for style transfer and image combination. For example, if you then give Gemini 2.5 a user prompt that looks something like this:

    I would like to perform style transfer. I will provide the image generation model a photograph alongside your generated prompt. Please write a prompt to transfer the following style: {{ brief style description here }}.

You can get aesthetic, consistently styled images, like these:

https://goto.isaac.sh/image-style-transfer
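
Wiring-wise it's just two calls in sequence. A rough sketch with the google-genai SDK (the model ids are stand-ins, and the system prompt from the link above is elided):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    def styled_image(style_description, photo_bytes):
        # Step 1: have a text model write the image prompt.
        planner = client.models.generate_content(
            model="gemini-2.5-flash",  # placeholder model id
            contents="I would like to perform style transfer. Please write a prompt to "
                     f"transfer the following style: {style_description}.",
        )
        # Step 2: hand the generated prompt plus the photograph to the image model.
        result = client.models.generate_content(
            model="gemini-2.5-flash-image",  # "Nano Banana"; placeholder model id
            contents=[
                planner.text,
                types.Part.from_bytes(data=photo_bytes, mime_type="image/jpeg"),
            ],
        )
        # Return the first image part from the response, if any.
        for part in result.candidates[0].content.parts:
            if part.inline_data is not None:
                return part.inline_data.data
        return None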

2 hours ago by skocznymroczny

I like to use these AI models for generating mockup screenshots of games. I can drop in a "create a mockup screenshot of a steampunk 2D platformer in which you play as a robot" and it will give me an interesting screenshot. Then I can ask it to iterate on the style. Of course it's going to be broken in some ways and it's not even real pixel art, but it gives a good reference for quickly brainstorming some ideas.

Unfortunately I have to use ChatGPT for this; for some reason local models don't do well with such tasks. I don't know if it's just the extra prompting sauce that ChatGPT adds or if diffusion models just aren't well suited to these kinds of tasks.

a day ago by leviathant

I was kind of surprised by this line:

>Nano Banana is terrible at style transfer even with prompt engineering shenanigans

My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in SketchUp, and then in Twinmotion, but neither of those produces "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.

As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.

Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.

Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.

As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.

21 hours ago by echelon

You can't take a highly artistic image and supply it as a style reference. Nano Banana can't generalize to anything not in its training.

20 hours ago by leviathant

Fair enough! I suppose I've avoided that kind of "style transfer" for a variety of reasons; it hadn't even occurred to me that people were still interested in that. And I don't say that to open up debate on the topic, just explaining away my own ignorance/misinterpretation. Thanks
