I Made Stable Diffusion XL Smarter by Finetuning It on Bad AI-Generated Images (minimaxir.com)
331 points by minimaxir on Aug 21, 2023 | 64 comments


In general I'm really interested in the concept of personalized RLHF. As we have more and more interactions with a given generative AI system, it seems we'll start to have enough interaction data to meaningfully steer the output towards our personal preferences. I hope the UIs improve to make this as transparent as possible.

Just thinking about how to productize this flow: it should be quite easy to implement a "thumbs up/down" feedback option on every image generated in the UI, plus an optional text label to override "wrong". Then, once you have enough HF (or on a nightly schedule), you could run a batch job to re-train a new LoRA with your updated preferences.
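
Very roughly, the logging side of that could be as small as the sketch below (all names and the file layout are hypothetical, just to illustrate the flow):

    # Minimal sketch of a feedback store for the thumbs up/down idea.
    # All names here are hypothetical, not from any existing product.
    import json, time
    from dataclasses import dataclass, asdict
    from pathlib import Path

    @dataclass
    class ImageFeedback:
        image_path: str       # generated image on disk
        prompt: str           # prompt that produced it
        rating: int           # +1 thumbs up, -1 thumbs down
        label: str = "wrong"  # optional text label overriding "wrong"
        ts: float = 0.0

    LOG = Path("feedback.jsonl")

    def record(fb: ImageFeedback) -> None:
        fb.ts = time.time()
        with LOG.open("a") as f:
            f.write(json.dumps(asdict(fb)) + "\n")

    def build_training_manifest(min_examples: int = 50) -> list[dict]:
        """Nightly job: once enough negatives exist, hand them to a LoRA run."""
        if not LOG.exists():
            return []
        rows = [json.loads(line) for line in LOG.read_text().splitlines()]
        negatives = [r for r in rows if r["rating"] < 0]
        return negatives if len(negatives) >= min_examples else []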

In principle you could collect HF from the implicit tree-traversal that happens when you generate N candidate images from a prompt and then pick one to refine. Or more explicitly, have a quick UI to rank/score a batch, or a trash bin in the digital workspace to discard images you don't like at each iteration of refinement (batching that negative feedback to update your project/global LoRA later).

Going further I wonder what the fastest possible iteration loop for feedback would be? For images in particular you should be able to wire up a very short feedback loop with keypresses in response to image generation. What happens if you strap yourself to that rig for a few hours and collect ~10k preferences at 1/s? Can you get the model to be substantially more likely to output the sort of images that you're personally going to like? Also sounds pretty intense, I'm getting Clockwork Orange vibes.

I didn't spot in the article, how many `wrong` images were there? From a quick skim of the code it looks like maybe 6 per keyword with 13 keywords, so not many at all. ~100 is surprisingly little feedback to steer the model this well.


> Just thinking about how to productize this flow: it should be quite easy to implement a "thumbs up/down" feedback option on every image generated in the UI, plus an optional text label to override "wrong". Then, once you have enough HF (or on a nightly schedule), you could run a batch job to re-train a new LoRA with your updated preferences.

The AI Horde [1] (an open source distributed cluster of GPUs contributed by volunteers) has a partnership with Stability.ai to effectively do this [2]. They are contributing some GPU resources to AI Horde to run an A/B test.

If a user of one of the AI Horde UIs (Lucid Creations[3] or ArtBot[4]... made by me) requests an image using an SDXL model, they get 2 images back. One was created using SDXL v1.0. The other was created using an updated model (you don't know which is which).

You're asked to pick which image you like better of the two. That's pretty much it. The result is sent back to Stability.ai for analysis and incorporation into future image models.

EDIT: There is a similar partnership between the AI Horde and LAION to provide user-defined aesthetics ratings for the same thing[5].

[1] https://aihorde.net/

[2] https://dbzer0.com/blog/stable-diffusion-xl-beta-on-the-ai-h...

[3] https://dbzer0.itch.io/lucid-creations

[4] https://tinybots.net/artbot

[5] https://laion.ai/blog/laion-stable-horde/


> I didn't spot in the article, how many `wrong` images were there? From a quick skim of the code it looks like maybe 6 per keyword with 13 keywords, so not many at all. ~100 is surprisingly little feedback to steer the model this well.

Correct: 6 CFG values * 13 keywords = 78 images. Some of them aren't as useful, though; apparently the "random text" keyword sometimes results in images of old-school SMS applications!

LoRAs only need 4-5 images to work well, although that figure was for the older/smaller Stable Diffusion, which is why I used more images and trained the LoRA a bit longer for SDXL. The Ugly Sonic LoRA, in comparison, used about 14 images, and I suspect it overfit.
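
For anyone wanting to reproduce something similar, the generation step is roughly a double loop over keywords and CFG scales with diffusers; a minimal sketch (the keywords and CFG values below are placeholders, not the exact ones from the post):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")

    keywords = ["hands", "random text", "extra limbs"]   # placeholder list
    cfg_values = [14, 18, 22, 26, 30, 34]                # deliberately too high

    for kw in keywords:
        for cfg in cfg_values:
            image = pipe(prompt=kw, guidance_scale=cfg,
                         num_inference_steps=30).images[0]
            image.save(f"wrong_{kw.replace(' ', '_')}_{cfg}.png")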


It's really weird that this works. I can see how LoRA on a specific fine-grained concept like Ugly Sonic can work with so few samples, but naively I'd think such a diffuse concept as "!wrong" should require more bits to specify! Like, isn't the loss function already penalizing the model for being "wrong" on all generated images?

(I wonder if there is a follow-up experiment to test if this LoRA'd model actually has better loss on the original training dataset? There's a very interesting interpretability question here I think. Maybe it's just doing much better on a small subset of possible images, but is slightly worse on the remainder of the training data distribution.)


I noticed some of your bad prompts are a little "wishcasted", although that's pretty common.

People put stuff like "bad hands" into every model assuming it'll work, but it only works on NovelAI descendants, because that's based on Danbooru, which has a "bad hands" tag.


Some of the generated hands are really bad: I opted not to include them to avoid disturbing imagery.


You may be interested in the open source framework we're developing at https://github.com/agentic-ai/enact

It's still early, but the core insight is that a lot of these generative AI flows (whether text, image, single models, model chains, etc) will need to be fit via some form of feedback signal, so it makes sense to build some fundamental infrastructure to support that. One of the early demos (not currently live, but I plan on bringing it back soon) was precisely the type of flow you're talking about, although we used 'prompt refinement' as a cheap proxy for tuning the actual model weights.

Roughly, we aim to build out core Python-level infra that makes it easy to write flows in mostly native Python and then allows you to track executions of your generative flows, including executions of 'human components' such as raters. We also support time travel / rewind / replay, automatic Gradio UIs, and FastAPI (the latter two very experimental atm).

Medium term we want to make it easy to take any generative flow, wrap it in a 'human rating' flow, auto-deploy as an API or gradio UI and then fit using a number of techniques, e.g., RLHF, finetuning, A/B testing of generative subcomponents, etc, so stay tuned.

At the moment, we're focused on getting the 'bones' right, but between the quickstart (https://github.com/agentic-ai/enact/blob/main/examples/quick...) and our readme (https://github.com/agentic-ai/enact/tree/main#why-enact) you get a decent idea of where we're headed.

We're looking for people to kick the tires / contribute, so if this sounds interesting, please check it out.


> RLHF

Reinforcement Learning from Human Feedback

Aren't these systems already trained to score good things higher and bad things lower, as dictated by human feedback?


personalized RLHF is the keyword


Implicit RLHF works better than explicit.

It's just like the Mom Test: the act of asking people for a rating affects the rating they give.

You can still have the upscale flow, but you're not limited the way Discord-based Midjourney was: you can show all the full-sized images and detect, for example, that the person copied/saved/right-clicked one.


Creating art with stable diffusion has become such a fun hobby of mine. The difference between SD 1.5/2.0 and SDXL is massive, and it's impressive how quickly the quality is improving with this stuff.


>The difference between SD 1.5/2.0 and SDXL is massive,

Can you explain?

I haven't used SDXL yet, but I spent a ton of time in 1.5.

So far I gathered:

>Higher res

>higher 'quality'

But given I was using realistic vision 3 for so long, I never had a quality issue. With upscaling, I never needed higher res.


I hope you'll forgive me for a bit of a self promotion here, but I think I have an interesting example of SD 1.5 (what most people are familiar with and what most models are based off of) vs SDXL.

Before Phony Stark shut down the Twitter API, I was running a bot that created landscape images with Stable Diffusion v1.5. Its name is Mr. RossBot [1]. Check out the Twitter page for some examples of the quality.

This weekend, I finally updated the code to get it running on Mastodon. In the process, I updated the model to use SDXL [2]. It's running the exact same code otherwise to randomly generate prompts.

The image caption is a simplified version of the prompt. e.g., "Snowcapped mountain peaks with an oxbow lake at golden hour."

Behind the scenes, a whole bunch of extra descriptive stuff is added, so the prompt that SD v1.5 / SDXL get is: "beautiful painting of snowcapped mountain peaks with an oxbow lake at golden hour, concept art, trending on artstation, 8k, very sharp, extremely detailed, volumetric, beautiful lighting, serene, oil painting, wet-on-wet brush strokes, bob ross style"
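
In code it's basically just a template wrapped around the simplified caption, something like this (a sketch, not the bot's actual source):

    # Sketch of expanding the simplified caption into the full prompt.
    def build_prompt(subject: str) -> str:
        return (
            f"beautiful painting of {subject}, concept art, trending on artstation, "
            "8k, very sharp, extremely detailed, volumetric, beautiful lighting, "
            "serene, oil painting, wet-on-wet brush strokes, bob ross style"
        )

    print(build_prompt("snowcapped mountain peaks with an oxbow lake at golden hour"))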

Anyway, I feel like the quality of SDXL is sharper and it just nails subjects a lot better. It also tries to add reflections and shadows (not always correctly), whereas that didn't happen as much with SD v1.5.

I'm pretty impressed! Especially because Stability.ai had released updated models of Stable Diffusion before SDXL: SD v2.0 and SD v2.1. The results (IMHO) were absolute garbage using the same prompts.

[1] https://twitter.com/mrrossbot

[2] https://botsin.space/@MrRossBot


Here's an example using my dog - a trained checkpoint on one of the nicer SD 1.5 models and a LoRA for the SDXL ones: https://imgur.com/a/PklEKwC

The first 3 images are some of my attempts at making her into a Pokemon. Some turned out pretty good (after generating 50+ per type), but I struggled with water in particular. It was hard to get her to have a fin, especially with no additional tail.

I haven't done many in SDXL, but that's the point. I've probably generated maybe 10 images of her as a Pokemon, just from when I was first trying out the LoRA. The next 2 images are from that, and that was before I had a good ComfyUI workflow to boot.

The rest are various sample images from SDXL showing how versatile it is. In most of those, I only had to generate a few images per prompt to get something pretty darn great. In the Halo 2 one the prompt was literally "an xbox 360 screenshot of cinderdog in Halo 2, multiplayer."

And it made her into a freaking Elite, and it worked wonderfully. I previously tried to generate ones like those candyland images in 1.5 models and the foreground and background just didn't look good. In SDXL it just works.


Very cool! How many images did you use to create the LoRa of your dog? Do you have any guide to recommend?


It was about 30 images, though I'm planning on adding more and training again sometime. Either that or splitting it up between when her hair is short and when it's long, as it really changes how she looks.

This isn't what I used for my dog's LoRa but I used it for my wife and it worked better than what I was doing before (Adafactor): https://civitai.notion.site/SDXL-1-0-Training-Overview-4fb03...

I'd recommend increasing the network dimension to at least 64, if your VRAM can take it. I can do 64 with my 12GB card. At least for people, I've had better luck using a token that's a celebrity. I'm not sure how to try that with my dog - perhaps just "terrier dog" or something.


Thanks! Looks like I'll need to rent a GPU to use SDXL fine tuning. Poor old RTX2060 not gonna cut it.


That's a very low learning rate -- between 2-3 orders of magnitude lower than what I've seen for that number of steps. I'll have to give it a try.


I should have been clear - I'm using the Prodigy settings on that page, not the Adafactor one. You set the learning rate to 1 and the scheduler to cosine, but the real learning rate is figured out by the optimizer.
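
In plain PyTorch the wiring looks roughly like this (toy model and step count, just to show the lr=1 + cosine pattern with the prodigyopt package):

    import torch
    from prodigyopt import Prodigy  # pip install prodigyopt

    model = torch.nn.Linear(16, 16)                  # stand-in for the LoRA params
    optimizer = Prodigy(model.parameters(), lr=1.0)  # Prodigy adapts the real LR itself
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

    for step in range(100):
        x = torch.randn(8, 16)
        loss = (model(x) - x).pow(2).mean()          # dummy reconstruction loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()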


From my experiments it seems that SD XL understands prompts much better. While SD 1.5 is great at generating your typical "anime girl with big boobs" stuff - if you try to generate something a little bit more unusual - it usually doesn't generate exactly what you want and seems to straight up ignore large parts of the prompt.

SD XL seems to understand weird and unusual prompts a lot better.

SD XL is capable of generating 1024x1024 images without hacks like "hires fix". That's a very good thing, because hires fix sometimes introduces additional glitches while upscaling. Especially at higher denoising strength. Hires fix fixed the broken face - yay, but the subject now has 3 legs instead of two. Things like that happen far less often with SD XL.


> While SD 1.5 is great at generating your typical "anime girl with big boobs" stuff - if you try to generate something a little bit more unusual - it usually doesn't generate exactly what you want and seems to straight up ignore large parts of the prompt.

Pretty much experience with SD 1.5, but I'll give XL a try.


For simplicity, it feels like SDXL has better "defaults". You don't have to include a bunch of boilerplate keywords to wrangle it into generating good images.

The flip side is I've found it a bit harder to tweak prompts


I've found it very hard to create different styles with SDXL. If you want photorealism, anime, sci-fi, or somewhere in between, it's amazing.

But I've been trying to get it to generate equivalent quality in other styles, e.g. watercolor, abstract painting etc. It doesn't seem to be easy - the quality drops a lot and it's harder to avoid weird results like people wearing enormous hats or distorted perspective.

Admittedly I haven't spent a huge amount of time on this because generation is just a bit too slow to be enjoyable on my machine. Has anyone else had success here?


Yes, currently SDXL doesn't really beat the best SD1.5 checkpoints quality-wise. But it (and the currently available checkpoints) shows awesome promise, so give it six months or so.


The best 1.5 checkpoints are constrained in their output flexibility to achieve the quality they get though, and they don't follow prompts nearly as well as SDXL, so if the model doesn't naturally gravitate towards doing what you want it's very hard to steer it anywhere. SDXL also does a better job with full anatomy, which is the reason shared 1.5 generations tend to be torso up or portrait shots.


Currently SDXL is better than SD1.5 checkpoints at pretty much everything other than portraits (or anime drawings) of pretty women.

Unfortunately it seems that's all people want to generate, as is evident when you search for SD on Twitter.


Yes, point conceded, I should've said something about the flexibility and capability of SDXL rather than just image quality in a narrow sense.


Stable diffusion doesn’t grant the user a good imagination or taste unfortunately


It became a trend among some data scientists maybe 5 years ago to start recording every keystroke they made on their PC. I'm kind of jealous now that that data is actually useful.

I have a collection of 30,000 anime art images that I like, which I even competitively ranked for aesthetic score 5 years ago; it would come in useful for something like this.


Very cool. Will give this idea a spin soon. I'm a bit of a scientist myself too :)

Here's something interesting I did a few days ago.

- Generated images using mixture of different styles of prompts with SDXL Base Model ( using Diffusers )

- Trained a LoRA with them

- Generated again with this LoRA + Prompts used to generate the training set.

Ended up with results with enhanced effects - glitchier, weirder, high def.

Results => https://imgur.com/gallery/vUobKPK

I’m gonna train another LoRA with these generations and repeat the process obviously!

This is a pretty neat way to bypass the 77-token limit in Diffusers and develop tons more styles, now that I think about it.

You can play around with the LoRA at https://replicate.com/galleri5/nammeh ( GitHub account needed )

Will publish it to CivitAI soon.


Please consider posting the LoRA on civitai.com as well as the Stable Diffusion subreddit.

These results look pretty good; looking forward to trying it out. I hadn't realized the generative-image buzz was dying down; since I'm using it regularly, I guess it always feels buzzy to me.


I posted the original release to /r/StableDiffusion but all the comments are "why not compatible with A1111?" and I can't find a good script to do the conversion: https://www.reddit.com/r/StableDiffusion/comments/15r5k3i/i_...

Civitai has syndicated the LoRA: https://civitai.com/models/128708/sdxl-wrong-lora


You will get more users if you provide a safetensors file instead of bin/pickled tensors; a lot of people got really scared by the malware scare that went around social media a few months ago.


Thank you for the note on this. I had not heard that trojan-horse malware was already being slipped into tensor files as Python scripts. Apparently torch's pickle-based loading can execute arbitrary code embedded in the tensor file, with no filtering.

I've heard surprisingly little commentary on this topic. The full explanation of how safetensors are "safe" can be found from the developer at: https://github.com/huggingface/safetensors/discussions/111
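
For anyone curious, the safe path is just a flat dict of tensors in and out; a minimal sketch with the safetensors package:

    # safetensors stores only raw tensor bytes plus a JSON header,
    # so loading never executes code.
    import torch
    from safetensors.torch import save_file, load_file

    tensors = {"lora_up.weight": torch.randn(4, 64),
               "lora_down.weight": torch.randn(64, 4)}

    save_file(tensors, "example_lora.safetensors")    # no pickle involved
    restored = load_file("example_lora.safetensors")  # plain dict of tensors
    print(restored["lora_up.weight"].shape)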



And for a good reason. A big hunk of floating-point numbers really shouldn't be able to execute arbitrary code. Or any code at all.


I would also ask that SHA hashes are posted somewhere. It annoys me to no end how difficult it can be to confirm you are using the real model.
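
Computing one is trivial with the standard library, e.g. (sketch; the filename is a placeholder):

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256_of("model.safetensors"))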


Agreed. I feel like people (and I do this a lot as well) have a tendency to track their own habits and assume everyone else follows them. From my perspective, the gen image buzz is still as hot as ever!

If I lacked excitement for SDXL, it was because it felt like there was no massive jump in image quality to me. Sure, the size doubling is great, but it also presents a problem, as I don't always want to generate 1024x1024 images. I still use third-party-trained 1.5 models because they create damned good outputs, and I have like 5 different upscaling solutions, at least one of which will add new detail as things are upscaled.


SDXL is more resolution-agnostic than SD1.x: 768x768 works fine, though admittedly going down to 512x512 does tend to produce cropped images.


Tangentially related: for reasons I don't yet really understand, the LoRAs that I build for Stable Diffusion XL only work well if I give a pretty generic negative prompt.

These are fine-tuned on 6 photos of my face, and if I use them with positive prompts, the generated characters don't look much like me. But if I add generic negative terms like "low quality", suddenly the depiction of my face is almost exactly right.

I've trained several models and this has been true across a range of learning rates and number of training epochs.

To me, this feels like it will somehow ultimately be connected to whatever is driving minimaxir's observations in this post.


>The release went mostly under-the-radar because the generative image AI buzz has cooled down a bit. Everyone in the AI space is too busy with text-generating AI like ChatGPT (including myself!).

I disagree with this statement. The release went mostly under the radar for 2 reasons, according to the people I've talked to.

1. Higher VRAM and compute requirements

2. Perceived lower quality outputs compared to specialized SD1.5 models.

If either of these points had been different, it would have gained a lot more popularity I'm sure.

But alas, most people now simply wait and see if specialized SDXL models can actually improve upon specialized 1.5 models.


Lower quality output. It’s that.

I think most people casually associated with it find it as a toy they mess around with for a minute. The hardcore SD fans… are making hardcore I think.

XL is bad at porn. Stability got scared of what they created and tried to hedge towards “safety”. Can’t have your Kate Middleton or Emma Watson porn being TOO convincing.

People will stick with 1.5 until something is better… for porn.


This concept is not new. There are lots of "negative embeddings" on civit.ai that you put into negative prompts to fix hands and bad anatomy.


That was my previous textual inversion experiment that I mentioned in the post: https://minimaxir.com/2022/11/stable-diffusion-negative-prom...

This submission is about a negative LoRA which does not behave the same way at a technical level.


Must be the formative years spent in the nineties' contradiction field of "counter culture vs also counter culture, but counter culture that's on MTV": there's something about prompts ending with tag references like "award winning photo for vanity fair" (or whatever the promptist's standard tag suffix turns out to be in these posts) that inspires a very deep desire in me to not be part of this generative image wave.


"award winning photo for vanity fair" is more a trick for good photo composition (e.g. rule of threes) than anything else.


>A minor weakness with LoRAs is that you can only have one active at a time

Uh this isn't true at all, at least with auto1111.


IIRC it does merging/weighting behind the scenes.


I'm pretty sure that it's just serially summing the network weights, which results in an accumulated offset to the self-attention layers of the transformer. It's not doing any kind of analysis of multiple networks prior to application to make them "play nice" together; it's just looping and summing.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob...
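
A toy illustration of that loop-and-sum behavior (not the webui's actual code):

    # Each LoRA contributes an additive low-rank offset (up @ down, scaled)
    # to the same base weight; there is no cross-LoRA coordination.
    import torch

    d, rank = 64, 4
    base_weight = torch.randn(d, d)  # e.g. a self-attention projection

    loras = [  # (down, up, weight) triples; toy values standing in for real adapters
        (torch.randn(rank, d), torch.randn(d, rank), 0.8),
        (torch.randn(rank, d), torch.randn(d, rank), 0.5),
    ]

    merged = base_weight.clone()
    for down, up, w in loras:
        merged += w * (up @ down)    # just looping and summing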


Source for this?


I wonder how much of this effect is just undoing Stability's fine-tuning against inappropriate images.


This is really interesting. As mentioned in the article, this is a kind of RLHF, and that's what takes GPT-3 from a difficult-to-use LLM to a chatbot that can confuse some people into thinking it has consciousness. It makes it much more usable.

I don't know how these models are trained, but hopefully future models will include bad results as negative training data, baking it into the base model.

It's only mentioned in passing in the article, but apparently it's possible to merge LoRAs? How would you do that? I'd like to use one LoRA to include my own subjects, this LoRA to make the results better, and maybe a third one for a particular style.


Merging LoRAs is essentially taking a weighted average of the LoRA adapter weights. It's more common in other UIs.

diffusers is working on a PR for it: https://github.com/huggingface/diffusers/pull/4473
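
As a sketch of what that looks like (not the diffusers PR's implementation), merging two LoRA state dicts by weighted average is just:

    # Assumes both LoRAs were trained with the same architecture and rank.
    import torch

    def merge_loras(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
        assert sd_a.keys() == sd_b.keys(), "LoRAs must share the same keys/shapes"
        return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

    # toy example with matching keys
    a = {"lora_down": torch.randn(4, 64), "lora_up": torch.randn(64, 4)}
    b = {"lora_down": torch.randn(4, 64), "lora_up": torch.randn(64, 4)}
    merged = merge_loras(a, b, alpha=0.7)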



> XL

Extra Large? 40 times?


What? XL is the current version of Stable Diffusion.


It's already on version 40?


Extra large


It’s a Roman numeral joke.


Not a very good one.


1024 x 1024 instead of 512 x 512.


XL more likely refers to the parameter count, which is 3 billion instead of <1 billion


No, I think it is mainly because it's optimized for 1024 x 1024 images, rather than 512 x 512 as the previous version was.


It’s both. More pixel space and more parameters.



