
But you can't selectively re-train them, can you? As in, don't use elements from this part of the training data anymore, but use elements from this body of work that wasn't part of the training data? If I understand correctly you'd still need a full re-training for that.


What you can do is

- lexical filtering by applying a blacklist of artist names to the original prompt

- perceptual filtering - drop all generated images that look too close to copyrighted images in your training set

- re-captioning-based filtering - use a model to generate captions for each generated image and apply filters on the captions; you can also filter by visual style

- CLIP-based filtering, where you use embeddings to find the nearest neighbours in your training set; if they are copyrighted, you can drop the generated image (a sketch follows this list)

- or train a copyright violation detection model that takes generated images and compares them to images from the original authors
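As a rough illustration of the CLIP-based variant, here is a minimal sketch using HuggingFace's CLIP. The model name, the reference_embeddings tensor, and the 0.92 threshold are illustrative assumptions, not tuned values:

    # Minimal sketch, assuming the copyrighted reference images are already embedded.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(image: Image.Image) -> torch.Tensor:
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise for cosine similarity

    # reference_embeddings: (N, 512) tensor of embedded copyrighted images (assumed precomputed)
    def is_too_close(generated, reference_embeddings, threshold=0.92):
        sims = reference_embeddings @ embed(generated).T  # cosine similarities to all references
        return sims.max().item() > threshold              # nearest neighbour too similar -> drop

In practice you'd index the reference embeddings with an approximate-nearest-neighbour library like FAISS rather than a dense matmul, but the filtering logic is the same.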

Copyright enforcement struggles are going to be interesting to watch this decade. But I think it will slowly become irrelevant, because anything can be regenerated slightly differently until it finally passes the filters.


I was aiming more at the centralized-control angle (though I didn't make that very clear), i.e. are open-source models actually viable long-term? If only orgs with absurd amounts of compute can do updates, because those imply a full re-training, wouldn't that effectively centralize control over any such model? Is there an option to do an incremental, limited re-training?


Much of modern deep learning is actually premised on the discovery that training on a large, noisy dataset _first_, and then fine-tuning (continuing training on new data from the same weights), generally converges faster and is also more accurate.

This is part of the motivation for “foundation models”.
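A minimal PyTorch sketch of that pretrain-then-fine-tune step; MyModel, pretrained.pt, and new_data_loader are placeholders, and the learning rate is illustrative:

    import torch

    model = MyModel()                                   # hypothetical architecture
    model.load_state_dict(torch.load("pretrained.pt"))  # start from the foundation weights

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR, typical for fine-tuning
    loss_fn = torch.nn.CrossEntropyLoss()

    for inputs, targets in new_data_loader:             # the new (e.g. licensed-only) dataset
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()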

There’s another paradigm called student/teacher models, where a randomly initialized model updates its weights to match the outputs of another, pretrained model. This could (maybe?) be used to achieve the desired effect of a model that learned in a “clean room”.
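A hedged sketch of that student/teacher setup (standard knowledge distillation; teacher, student, and loader are placeholders, and the temperature is an illustrative value):

    import torch
    import torch.nn.functional as F

    T = 2.0  # softmax temperature: softens the teacher's output distribution

    teacher.eval()  # pretrained model, frozen
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    for inputs, _ in loader:  # labels unused: the student learns from the teacher alone
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        # KL divergence between softened distributions, the standard distillation loss
        loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note that this only gives a "clean room" model with respect to the data: the student never sees the teacher's training set, but it still inherits the teacher's behaviour.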


You can retrain on completely separate data - I am currently doing this.


From what I've seen, it's possible to take a version of Stable Diffusion and fine-tune it on your own training set.
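A heavily condensed sketch of what that looks like with HuggingFace diffusers; the checkpoint name, your_dataloader, and the hyperparameters are illustrative assumptions, and real fine-tuning scripts add gradient accumulation, mixed precision, EMA, etc.:

    import torch
    import torch.nn.functional as F
    from diffusers import StableDiffusionPipeline, DDPMScheduler

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    vae, unet, text_encoder = pipe.vae, pipe.unet, pipe.text_encoder
    noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

    optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # only the UNet is trained here

    for pixel_values, input_ids in your_dataloader:  # images in [-1, 1] + tokenized captions (placeholder)
        latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215  # SD v1 latent scaling factor
        noise = torch.randn_like(latents)
        timesteps = torch.randint(
            0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],)
        )
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        text_embeds = text_encoder(input_ids)[0]
        pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample
        loss = F.mse_loss(pred, noise)  # the UNet learns to predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()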



