GenAI should be as copyright-okay as search
3 points by golol on Jan 17, 2024 | 9 comments
I'm interested in the copyright-related issues arising from generative AI and I would like to hear some opinions on the following perspective. I'm not an expert on copyright, and I am thinking in some kind of (50% moral, 50% legal) sense of copyright here, as copyright and especially fair use are legally quite vaguely defined.

Let's broadly distinguish 3 types of copyright concerns with genAI: 1. training 2. commercial distribution 3. end-user

While some disagree, I think many people could agree that merely training on copyrighted works, without commercial or non-commercial distribution of the results, is kind of fine. Similarly, most people would agree that the end-user also has some role to play with regard to copyright considerations.

I think the main point causing trouble right now is the question of commercial distribution of genAI works, meaning that users give OpenAI (or any other provider) some money and a specification, and OpenAI gives them a work according to that specification.

I think this process can be implemented in a way just as copyright-okay as google image search.

If I perform a query on google images, it spits out some samples of the distribution of reasonable images, conditioned on my query. If I do the same with genAI, essentially the same thing happens, except that the distribution is somewhat smoothed out, approximated, latent-space interpolated, whatever you want to call it. That is, genAI used as search is fundamentally more original and less infringing than google images.
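
To make the analogy concrete, here is a deliberately toy sketch (pure illustration, not how any real system is built; the 2-D "corpus" and both functions are made up): plain search returns stored items verbatim, while the "generative" step samples from a smoothed, interpolated version of the same distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "corpus": each stored image reduced to a 2-D feature vector.
    corpus = rng.normal(size=(100, 2))

    def search(query_vec, k=3):
        # Classic retrieval: return the k stored items closest to the query.
        dists = np.linalg.norm(corpus - query_vec, axis=1)
        return corpus[np.argsort(dists)[:k]]  # exact copies of stored works

    def generate(query_vec, k=3, noise=0.3):
        # Toy "generative" step: blend the same neighbours and add noise,
        # i.e. sample from a smoothed/interpolated version of the corpus.
        neighbours = search(query_vec, k)
        weights = rng.dirichlet(np.ones(k))  # random convex combination
        return weights @ neighbours + rng.normal(scale=noise, size=2)

    q = np.array([0.5, -0.2])
    print(search(q))    # stored works, verbatim
    print(generate(q))  # something near, but not identical to, stored works

The point is just that, by construction, the generative path lands at least as far from the stored originals as the search path does.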

Why does google images not cause copyright outrage? Because no claims whatsoever are being made about ownership of rights. If you pay an artist for a digital painting, they are creating an original work and then selling you the rights to it. Google images does not sell you any rights, nor does it claim to own them. It is just a snapshot of a region of the distribution of reasonable images.

Afaik this is where genAI and search differ so far. I think OpenAI at the moment does claim to own the rights to generated works and sells them to you, which leads to copyright infringement by OpenAI. However, I think if they changed their general terms and conditions to remove that, then at least the commercial or non-commercial distribution aspect of genAI would be copyright-okay.

Note: I think some people would say google images is copyright-okay mainly because it will usually give you sources for the images it shows you. Well, then I think we should consider the following: paid-per-query image search that returns images only, with no links and no text. Is that okay? I personally think that since google gives no guarantee that its links actually lead you to the correct source of the image or its copyright owner, the presence of links should not protect it from copyright considerations. I.e., if google images is okay, then the paid images-only search above is also okay.




The topic is far more complex than you perhaps realize.

Consent is as big a problem as copyright.

Training data includes intellectual property that isn't legally publicly available online, for which consent was not provided, and that's without getting into the commercial aspect.

If I publish a piece of information on the internet with the intent of it reaching and helping people, I might consent to it being reused if the goal is to further access to that information, rather than a major corp profiting from my hard work with no share of that going to anyone but their shareholders and employees.

A song or a piece of art? No, I personally would not consent to having it assimilated: not just because it's exploitative when both the AI service provider and their customer make money from using my work without purchasing a license, but because they did not ask for my consent and did it anyway.

Until people understand the incredibly complex topic of consent, every other conversation lacks the necessary context.


Well, I am focusing on the commercial distribution aspect here, not the training aspect. Furthermore, for the case of image generators one could easily imagine that Midjourney could use only publicly and legally viewable/accessible artworks. That would certainly be sufficient to attain the capabilities it currently has, as most artworks are publicly available. The point is to find a way to deploy genAI which is just as powerful and is copyright-okay, at least from the commercial distribution side.

As a side note, I personally believe that if you publish something for everyone to see and interact with, your consent to what I do with it is morally not so important. There are certain things I could do which would make me mean or disrespectful to you, like making a parody of your work, but none of them go so far in being immoral that you should have any legal power to stop me from doing them.


Wow… well I'm of the opinion that folk who believe my consent is morally not so important should go back to kindergarten for a refresher course on playing nice with others.


> There are certain things I could do which would make me mean or disrespectful to you, like making a parody of your work, but none of them go so far in being immoral that you should have any legal power to stop me from doing them.

Being mean is not, and should not in general be, a crime. There is asshole-immorality and should-be-illegal-immorality. In certain extreme circumstances being mean is a crime (insult, extortion, harassment, slander etc.), but it is not in general. Just consider parody: it is often plainly mean, but it is generally fair use.


If what you said is correct then why do things like this happen?

https://80.lv/articles/meta-has-reportedly-trained-its-ai-wi...

There's plenty of stuff in the public domain. There's also a big war chest they could pay authors and publishers from, and they know what they've trained it on. Unethical.

Just because something is publicly available doesn't mean you are permitted to use it as you see fit. As we've seen, generative AI can regurgitate what it was trained on:

https://spectrum.ieee.org/midjourney-copyright

You're still missing the understanding of consent. If we are going to paint with broad strokes, are you supporting this?

https://www.independent.co.uk/news/deepfake-nude-westfield-h...

You can't tell me that's in any way OK—right?

If you can agree that isn't OK then I must also assume that you can see from my perspective that something being publicly available or even privately available doesn't mean that I have consented to you doing as you please. Consent is important. It comes before everything, including commercialization and intent.


> If what you said is correct then why do things like this happen?

You misunderstand what I am saying. I am interested in finding a way to deploy genAI that has no copyright issues from the commercial deployment side. I am not saying that this is what companies are doing right now. But I want to see if it is possible to provide the service they are providing without the copyright issues. My point is not that Midjourney didn't use non-publicly viewable data in their training; it is that they did not need to. I am confident their services can be replicated without using non-publicly viewable works. Then the copyright issues that remain are those of training on publicly viewable but not copyright-owned works, and distributing the results commercially. These I am addressing in my OP.

> Just because something is publicly available doesn't mean you are permitted to use it as you see fit.

Again I am not so interested in discussing the training aspects, as many people would, for example, agree that training on publicly available data for scientific reasons, or to make some internal model whose results never get shown to the public, is quite okay. It is imo the easier issue to deal with so it is not the focus of my post.

> If we are going to paint with broad strokes, are you supporting this?

The issue of deepfakes can be dealt with on a level that does not involve copyright. It is not relevant to genAI vs copyright for me.

> I must also assume that you can see from my perspective that something being publicly available or even privately available doesn't mean that I have consented to you doing as you please.

For me, morally, I am allowed to think about your work in any way I wish, and I am also allowed to use pen and paper, a drawing board, or a computer to assist me in my considerations of your work. It is just my responsibility that the products of my considerations do not reach the public eye if they would be harmful, criminal, infringing etc. If I draw profit from your publicly available work, that is not necessarily to your detriment.

If I use your posts on reddit to train my company's internal application-filtering model, then I am not taking anything away from you, and I am not infringing on your copyright. If, however, I train a model that copies your very funny tweeting style, and create a bot that makes twitter posts based on that, then I am potentially taking something away from you and potentially infringing on your copyright, depending on the similarity of my posts to your posts. If I create some PR-consulting company that creates tweets for customers and claims to own, and sells, the rights to that intellectual property, then I am quite likely infringing on copyright, because my model might have produced works for which you actually own the rights.


I don't misunderstand it whatsoever; you just choose to ignore that consent trumps any notion of commercialization and doing as you please, in both a legal and an ethical context.

If these things don't need to be trained on the work of others for whom they don't have consent and have not paid a license fee to, then why are they? Because it's exploitation by some of the biggest corps in the world. They should not be above the law and the most basic of ethics: consent. That is why it isn't "okay".

I am not arguing that generative AI can't be done ethically; I am pointing out that search is consensual. I can use robots.txt to opt out of being included in search results, and search engines direct traffic to the author. Authors did not consent to having their non-web content included in an LLM, for example. The problem is consent. The problem is your attitude towards consent.
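
And to be clear about how thin that opt-out mechanism is: robots.txt is just a text file that a crawler chooses to honour. A minimal sketch of the check a well-behaved crawler makes (the domain and crawler name below are placeholders, not any real product):

    import urllib.robotparser

    # Ask a site's robots.txt whether a given crawler may fetch a given page.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyImageCrawler", "https://example.com/gallery/painting.jpg"):
        print("robots.txt permits fetching this URL")
    else:
        print("the site opted out; a consent-respecting crawler skips it")

Nothing forces anyone to run that check, which is exactly why consent, not tooling, is the real issue.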

The primary purpose of the commercially and publicly available generative AI is to undercut human labour. Fact. What it serves up is frequently too similar to the existing work that it was trained on, and therein lies the problem on every level: not just one of copyright infringement, but one of consent.

You approached it from the angle of copyright, but the problem is one of consent. The reason it isn't "okay" is because consent wasn't given with these public, commercialized models.

Deepfakes are part and parcel of copyright infringement. If you can take x and make it look more like y by applying z attributes, that is exactly the mechanism by which you can make it regurgitate the intellectual property of someone who did not consent to being included.


> The primary purpose of the commercially and publicly available generative AI is to undercut human labour.

As is the primary purpose of almost all technology, so not per se a bad thing.

> What it serves up is frequently too similar to the existing work that it was trained on, and therein lies the problem on every level, not just from copyright infringement, but one of consent.

What's the difference from google images then? If the problem is only that people get to see something similar to my work, then google images is just as bad. Well, perhaps the difference for you is consent, in which case see above. But for me the difference was that OpenAI is trying to sell you rights to those things that look similar to your works. So for me, if OpenAI just stopped trying to sell you rights to its generations, it would be copyright-okay.

> You approached it from the angle of copyright, but the problem is one of consent.

Fair enough, that is true. And consent is also an interesting topic to discuss. For me personally, with consent issues one has to think about whether one is talking about being ethical, or nice, or about legal action being necessary. I personally think that the consent issues with genAI are more on the "soft" side, compared to the copyright issues. With the latter it is a very pressing issue to find out how it should be treated legally.


> If these things don't need to be trained on the work of others for whom they don't have consent and have not paid a license fee to, then why are they?

You made a point about Meta training on non-publicly viewable/freely available works, such as pirated books. That's why I made the point that using only publicly viewable data is sufficient. At the same time I said that using data for which you have neither consent nor license is indeed necessary, and for training on such data the copyright/fair use considerations are complicated.

> I am pointing out that search is consensual.

If I train a diffusion model on images from google search, and make a paid latent-space search where I just show people some pictures without claiming to sell them rights to anything, then I claim that this is just as consensual as google images. That is because I took the source images from google images, where they were consensually posted, and in the worst case my model is exactly reproducing them. More likely my model is only approximately/interpolatively reproducing them. Sure, you might change your robots.txt to update google images' behaviour while my model won't update, but I think generally if your image has been reposted, it is quite likely still searchable on google images even if you change the robots.txt for your personal blog. So it's not like google is very good at updating consent either.
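
To be concrete about what I mean by a paid latent-space search, here is a minimal sketch (random vectors stand in for real image embeddings, and the encoder is assumed to exist elsewhere; nothing here describes any actual product). It only retrieves and displays images; no rights are claimed or transferred:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical index: each catalogued image has already been mapped to a
    # latent vector by some encoder; random vectors stand in for those here.
    latents = rng.normal(size=(1000, 64))
    latents /= np.linalg.norm(latents, axis=1, keepdims=True)

    def latent_search(query_latent, k=5):
        # Return indices of the k images closest to the query in latent space.
        q = query_latent / np.linalg.norm(query_latent)
        scores = latents @ q  # cosine similarity
        return np.argsort(scores)[::-1][:k]

    query = rng.normal(size=64)
    print(latent_search(query))  # which images to show; no rights are sold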

Perhaps there is a point to be made that my diffusion-search may not just reproduce your works approximately (for which I claim that consent is kind of given), but furthermore rework or recombine them with other works. There it is perhaps true that new consent should be necessary.



