I mean, it ingested all of the content from my blog. Without my permission. It's not a major part of their corpus of data, but still -- I wasn't asked and I don't really care to donate work to large corporations like that.
So the technology is cool, but I'm firmly of the stance that they cut corners and trampled people's rights to get a product out the door. I wouldn't be entirely unhappy if this iteration of these products were sued into the ground and were forced to start over on this stuff The Right Way.
One thing I've been thinking about: it's only a matter of time before your friends load an AI assistant on their phone, and it devours every text message you have ever sent to that person, every photo you've shared together, every record of an in-person meeting. This makes me really uncomfortable.
That's what has bothered me for years now in the context of contacts on smartphones. Maybe I'm making a mistake in how I think about it, but if I refuse to share my contacts with, say, Instagram, while all of my friends share their contact lists, which include me, does it really matter whether I decline to share or not?
Another thing that bothers me is that I have lots of different personas online. On most sites I use different usernames, and I wonder if there will someday be an AI that can match all of those different online profiles to a single person, even when different usernames are used.
Anyone who reads your blog is "ingesting content" from it. That is presumably the purpose of your blog in the first place. Whether that content is used to train a human mind or an artificial one is probably not up to you as the author.
This type of comment can be seen every single time a thread about LLMs, OpenAI, or some such comes up.
And it adds nothing. I'm sorry, but saying "Whether that content is used to train a human mind or an artificial one is probably not up to you" may be worse than saying nothing at all.
First, because it admits real doubt about whether it's up to the authors of the content (IP law, fair use, the intent of the use, and many things I'm not aware of), while giving no laws as an example or frame of reference.
And second, because it compares a human mind, which we know exists, to an artificial one, which implies:
1. An LLM is an artificial mind, or close to one, whatever that is (again, not defined).
2. If such artificial minds were to exist, they would be both equivalent to and treated the same as a human one.
The number of leaps packed into a couple of sentences, added to the uncertainty of how copyright would/will work, multiplied by the number of times we read that type of comment, is getting tiresome. And it's lowering the signal-to-noise ratio.
I think you’ve missed the point. Copyright laws prevent others from copying your work without permission. (Hence the name.) Copyright laws say nothing about who can read your work.
If you want to prevent a web spider from scraping your blog, use a captcha or robots.txt. Copyright law doesn’t apply to this scenario.
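For what it's worth, robots.txt is purely advisory; a crawler has to choose to honor it. A minimal sketch of one that asks OpenAI's crawler (which announces itself as "GPTBot") to stay away while leaving other bots alone:

    # Example robots.txt: disallow OpenAI's crawler by its
    # advertised user agent, and allow all other crawlers.
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /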
I disagree, and though the GP maybe didn't have this sentiment, my personal view is that intellectual property is a bunch of crap and just because there are laws around it in our capitalist society doesn't mean that the laws are moral/just/ethical/good. IP is constantly ingested and transformed which is exactly what LLMs are doing. The fact that ChatGPT can't even accurately reproduce data from its training (it gets basic facts/dates/quotes wrong all the time) really reinforces that it's not infringing on anyone's IP.
If you're tired of responding to these comments, then stop. It's the internet; everyone is at a different place in exploring topics and having discussions. Don't pooh-pooh someone else's journey; move on with your day instead. There is no required reading (other than TFA) on Hacker News.
No. Both legally and practically, you absolutely do not.
The only thing copyright law gives you is an exclusive right to reproduce and sell it, as a whole in its original form or similar, for a limited period of time -- and to transfer that right.
Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.
Read the comments you're replying to. I didn't comment on the legality of ChatGPT training on my content; I said I didn't like it. Regardless, the act of posting content publicly does not mean I give up my copyright claim. Yes, there are fair use situations. Training ChatGPT might be one of them, but I'm not seeing a lot of concrete information one way or the other, and I am seeing arguments that ChatGPT could be considered a derivative work, which would place OpenAI in violation of my copyright.
Send some links if you see some definitive case law sorting this stuff out.
Anyone can read your blog and then post their own blog post using knowledge they learned while reading yours. ChatGPT "learned" from your blog the same way.
Since the way GPT "learns" is not materially similar to how a human learns, I don't see why this talking point is particularly relevant. Nothing stops the courts from distinguishing between an AI and a human with regard to what may be permissible.
I agree, it seems like all the arguments that the use of data by AI should have no more restrictions than the use of data by humans hinge on the implicit (or sometimes explicit) assumption that human learning and machine learning are identical. While there are parallels, there also seem to be significant differences not only in how the learning is done, but also in outcomes for the person whose data is being used. And since a major purpose of IP, copyright, etc. is at least ostensibly to protect the creators of information from negative outcomes, I don't think the outcomes can be ignored when comparing human learning to ML.
Anthropomorphizing that it "learned" is disingenuous and I expect better from the HN crowd.
If ChatGPT regurgitates verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?
A human is capable both of reciting things from memory in an infringing manner and of learning from experiences to create something new. Maybe we should tape people's mouths shut if they dare to violate copyright by reciting a copyrighted book word for word, or put them in a straitjacket if they recreate a copyrighted painting from memory.
Actually, I fear that people who say this are doing something worse than anthropomorphizing.
Often rather than claiming human aspects to the machine, they are going further, and claiming machine aspects to the human.
Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.
That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.
Like with everything in law, "intent" is paramount. Obviously it's neither the trainer's nor the end-user's goal to reproduce training-set data verbatim; quite the contrary, overfitting as such is undesirable.
Intent only goes so far. If I continually but unintentionally reproduce copyrighted works verbatim, I could still face consequences, particularly if I did not show due diligence in preventing it from happening in the first place.
I think you've made up an irrelevant argument. The work has been incorporated into a commercial product, intentionally, under the control of someone else. Software isn't humans that pay taxes, appear in court, have rights, etc.
> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.
Part of this ruling is about how the images are used -- fair use -- not just that they were stored in a particular way. If Google had been using the smaller versions of the images (thumbnails) in other ways, it could have been infringing.
> The Court said that Google did claim fair use, and that whether or not use was fair depended on four factors: the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; the nature of the copyrighted work; the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work.
Taking copyrighted material and using it to train a model is not a copyright infringement - it is sufficiently transformative and has a different use than the original images.
Note that AI models can be used for different things. There has never been an uproar over a model trained to identify objects in an image outputting the word "squirrel".
The model also, as a purely mathematical transformation of the original source material, does not get a copyright. If it needs to be protected, trade secrets are the tool to use. A model is no more copyright-worthy than taking an image and applying `gray = .299 red + .587 green + .114 blue` to it.
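For concreteness, that formula is the standard BT.601 luma approximation; a minimal sketch of the transformation in Python, assuming NumPy is available:

    # Grayscale via the luma weights quoted above (ITU-R BT.601).
    import numpy as np

    def to_grayscale(rgb: np.ndarray) -> np.ndarray:
        """Map an (H, W, 3) RGB array to an (H, W) grayscale array."""
        weights = np.array([0.299, 0.587, 0.114])
        return rgb @ weights

    # A pure-red pixel maps to a luma of 0.299.
    pixel = np.array([[[1.0, 0.0, 0.0]]])
    print(to_grayscale(pixel))  # [[0.299]]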
The output of a model is ineligible for copyright protection (in the US - and most other places).
The output of a model may fall into being a derivative work of the original content used to train the model.
It is up to the human, who has agency in asking the model to generate certain output, to verify that the output does not infringe upon other works if it is published.
Note that this responsibility of the human publishing the work is not anything new with an AI model. It is the same responsibility as if they were to copy something from Stack Overflow or commission a random person on Fiverr; it's just that those are sources we've overlooked for a long time. It is similarly quite possible for the material on those sources to be copyrighted by and/or licensed to some other entity, and the human copying it into the final product is responsible for any copyright infringement.
Saying "I copied this from Stack Overflow" or "I found this on the web" as a defense is just as good as "Copilot generated this for me" or "Stable diffusion generated this when I asked for a mouse wearing red pants" and represents a similar dereliction on part of the person publishing this content.
Actually, people have been successfully sued for plagiarizing other works because they had internalized it and accidentally regurgitated it. So. The fact that content runs through a human brain doesn’t necessarily cleanse it from copyright concerns.
There is no "actually" because you are still addressing distribution. It wouldn't be hard to have another AI that analyzes outputs for copywriter infringement and culls them as necessary.
> You don’t get to make information publicly available. But not publicly available.
But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.
Sure, blog posts are unlicensed (as far as I know), but the idea of something publicly available being subject to restrictions is nothing new.
This is a fantastic point. I can legally go pick up any strictly copyrighted book at a store and read parts of it for free, which I will then have learnt and have in my brain to share with anyone else. If I happen to have a superintelligent brain, I can potentially gain a lot more and make a lot more inferences from this one outing, and consequently add a lot of value for the people I share my info with.
But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.
Copyright just doesn't protect such cases. There's a funny exaggeration that is very illustrative: copyright protects the bugs in the code, i.e. the specific way in which the code was written. Reading it and getting inspired was never meant to break copyright.
What protects particular solutions is patents. For example, if someone were to obtain a patent on computing the GCD of large integers the usual fast way, then everyone else would have to use a different solution.
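To make the example concrete: at its core, the "usual fast way" is Euclid's algorithm, sketched below in Python (big-integer libraries typically layer refinements such as binary GCD or Lehmer's algorithm on top):

    # Euclid's algorithm: repeatedly replace (a, b) with (b, a % b)
    # until b reaches zero; the remaining a is the GCD.
    def gcd(a: int, b: int) -> int:
        while b:
            a, b = b, a % b
        return abs(a)

    print(gcd(252, 105))  # 21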
This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.
If you go read a book, memorize it, write it down later in a substantively similar form, and share it freely or sell it — yes, you might get into copyright trouble. It has happened before and it is at best a tricky gray area.
If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.
It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.
I mean in the floating-point / quantized numbers and the connections that make up the model? I'm not sure I follow; the analogy to the human brain has always been obvious -- it's even in the name (artificial neural network)...
The analogy is just that: an analogy, and a very imperfect, misleading one. The working of the brain may have motivated early research, but GPT (as instantiated in hardware) does not operate or learn in a way similar to a human brain.
Another day, another person on HN showing us how they don't understand the difference between Public Domain and Open Source or Copyleft etc.
And regardless -- the problem is that expectations of how content can be consumed are now fundamentally violated by the automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limits on the speed and scale at which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.
Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.
> People put stuff up on the Internet with the expectation of its consumption by human minds
Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.
I think it's a bit intellectually dishonest to claim an equivalence between content indexing for search engines and machine learning for LLMs. They might share an underlying harvesting technique, but their uses -- indexing for information accessibility vs. automatic content production -- are qualitatively different.
Further, for a couple of decades now, almost every site has had a robots.txt that permits content harvesting only for certain accepted purposes. So clearly people already had a sense of how they wanted their content harvested and for what purposes.
Your blog which you posted online for anyone to download and read?
Don't get me wrong, this is a grey area where copyright law and general consensus haven't caught up with new technology. But if you voluntarily stick something up online with the intent that anyone can read it, it seems a bit mean to then say "wait, no, you can't do that" if someone finds a way to materially profit off it.
You’re right! Just like Disney+ did when I watched Star Wars the other day. I’m excited to know Disney has consented to me posting Star Wars in its entirety free online.
Can you make ChatGPT produce the content of your blog post "in its entirety?" You can share the URL to a ChatGPT conversation, so it should be easy to prove the copyright violation by replying to me with two links: one to your blog post, and one to the ChatGPT conversation containing an unauthorized copy of it.
It doesn't require humans to work for free. While that's been the common default MO ever since everyone looked at Google building a search index and thought "if they're doing it, surely so can we", there are datasets made by paying people.
There are such datasets, and AI companies absolutely pay to have data curated. But I suspect it would be just unimaginably expensive to create a dataset from scratch with enough tokens to feed a model with hundreds of billions of parameters, all the while paying every participant fairly.
"fair" is somewhat undefined, as the fair-looking number for being paid for effort can be very different to the fair-looking number for being paid for the resale value of the end product on an open market.
I wonder what would an LLM trained on Google code and internal documents look like?