
But you can't selectively re-train them, can you? As in, don't use elements from this part of the training data anymore, but use elements from this body of work that wasn't part of the training data? If I understand correctly you'd still need a full re-training for that.


What you can do is

- lexical filtering by applying a blacklist of artist names to the original prompt

- perceptual filtering - drop all generated images that look too close to copyrighted images in your training set

- re-captioning-based filtering - use a model to generate captions for each generated image and apply filters on the captions; you can also filter by visual style

- CLIP-based filtering, where you use embeddings to find the nearest neighbours in your training set; if they are copyrighted, you can drop the generated image (a sketch follows this list)

- or train a copyright violation detection model that takes generated images and compares them to images from the original authors
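As a rough illustration of the CLIP-based variant, here is a minimal sketch using HuggingFace's CLIP. The model name, the reference_embeddings tensor, and the 0.92 threshold are illustrative assumptions, not tuned values:

    # Minimal sketch, assuming the copyrighted reference images are already embedded.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(image: Image.Image) -> torch.Tensor:
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise for cosine similarity

    # reference_embeddings: (N, 512) tensor of embedded copyrighted images (assumed precomputed)
    def is_too_close(generated, reference_embeddings, threshold=0.92):
        sims = reference_embeddings @ embed(generated).T  # cosine similarities to all references
        return sims.max().item() > threshold              # nearest neighbour too similar -> drop

In practice you'd index the reference embeddings with an approximate-nearest-neighbour library like FAISS rather than a dense matmul, but the filtering logic is the same.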

Copyright enforcement struggles are going to be interesting to watch this decade. But I think it will slowly become irrelevant, because anything can be regenerated slightly differently until it finally passes the filters.


I was aiming more at the centralized-control angle (though I didn't make that very clear), i.e. are open-source models actually viable long-term? If only orgs with absurd amounts of compute can do updates, because those imply a full re-training, wouldn't that effectively centralize control over any such model? Is there an option to do an incremental, limited re-training?


Much of modern deep learning is actually premised on the discovery that training on a large, noisy dataset _first_, and then fine-tuning (continuing training on new data from the same weights), generally converges faster and is also more accurate.

This is part of the motivation for “foundation models”.
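A minimal PyTorch sketch of that pretrain-then-fine-tune step; MyModel, pretrained.pt, and new_data_loader are placeholders, and the learning rate is illustrative:

    import torch

    model = MyModel()                                   # hypothetical architecture
    model.load_state_dict(torch.load("pretrained.pt"))  # start from the foundation weights

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR, typical for fine-tuning
    loss_fn = torch.nn.CrossEntropyLoss()

    for inputs, targets in new_data_loader:             # the new (e.g. licensed-only) dataset
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()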

There’s another paradigm called student/teacher models, where a randomly initialized model updates its weights to match the outputs of another, pretrained model. This could (maybe?) be used to achieve the desired effect of a model that learned in a “clean room”.
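A hedged sketch of that student/teacher setup (standard knowledge distillation; teacher, student, and loader are placeholders, and the temperature is an illustrative value):

    import torch
    import torch.nn.functional as F

    T = 2.0  # softmax temperature: softens the teacher's output distribution

    teacher.eval()  # pretrained model, frozen
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    for inputs, _ in loader:  # labels unused: the student learns from the teacher alone
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        # KL divergence between softened distributions, the standard distillation loss
        loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note that this only gives a "clean room" model with respect to the data: the student never sees the teacher's training set, but it still inherits the teacher's behaviour.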


You can retrain on completely separate data - I am currently doing this.


From what I've seen, it's possible to take a version of Stable Diffusion and fine-tune it on your own training set.
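A heavily condensed sketch of what that looks like with HuggingFace diffusers; the checkpoint name, your_dataloader, and the hyperparameters are illustrative assumptions, and real fine-tuning scripts add gradient accumulation, mixed precision, EMA, etc.:

    import torch
    import torch.nn.functional as F
    from diffusers import StableDiffusionPipeline, DDPMScheduler

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    vae, unet, text_encoder = pipe.vae, pipe.unet, pipe.text_encoder
    noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

    optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # only the UNet is trained here

    for pixel_values, input_ids in your_dataloader:  # images in [-1, 1] + tokenized captions (placeholder)
        latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215  # SD v1 latent scaling factor
        noise = torch.randn_like(latents)
        timesteps = torch.randint(
            0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],)
        )
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        text_embeds = text_encoder(input_ids)[0]
        pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample
        loss = F.mse_loss(pred, noise)  # the UNet learns to predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()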



