I mean, it ingested all of the content from my blog. Without my permission. It's not a major part of their corpus of data, but still -- I wasn't asked and I don't really care to donate work to large corporations like that.
So the technology is cool, but I'm firmly of the stance that they cut corners and trampled people's rights to get a product out the door. I wouldn't be entirely unhappy if this iteration of these products were sued into the ground and were forced to start over on this stuff The Right Way.
One thing I've been thinking about: it's only a matter of time before your friends load an AI assistant on their phone, and it devours every text message you have ever sent to that person, every photo you've shared together, every record of an in-person meeting. This makes me really uncomfortable.
That's what has bothered me for years now in the context of contacts on smartphones. Maybe I'm making a mistake in how I think about it, but if I refuse to share my contacts with, say, Instagram, while all of my friends share their contact lists, which include me, does it really matter whether I decline to share or not?
Another thing that bothers me is that I have lots of different personas online. On most sites I use different usernames, and I wonder if there will someday be an AI that can match all of those different online profiles to a single person, even when different usernames are used.
Anyone who reads your blog is "ingesting content" from it. That is presumably the purpose of your blog in the first place. Whether that content is used to train a human mind or an artificial one is probably not up to you as the author.
This type of comment can be seen every single time a thread about LLMs, OpenAI, or some such comes up.
And it adds nothing. I'm sorry, but saying "Whether that content is used to train a human mind or an artificial one is probably not up to you" may be worse than saying nothing at all.
First, because it admits real doubt about whether it's up to the authors of the content (IP law, fair use, the intent of the use, and many things I'm not aware of), while giving no laws as an example or frame of reference.
And second, because it compares a human mind, which we know exists, to an artificial one, which implies:
1. An LLM is an artificial mind, or close to one, whatever that is (again, not defined).
2. If such artificial minds were to exist, they would be both equivalent to and treated the same as a human one.
The number of leaps packed into a couple of sentences, added to the uncertainty of how copyright would/will work, multiplied by the number of times we read that type of comment, is getting tiresome. And it's lowering the signal-to-noise ratio.
I think you’ve missed the point. Copyright laws prevent others from copying your work without permission. (Hence the name.) Copyright laws say nothing about who can read your work.
If you want to prevent a web spider from scraping your blog, use a captcha or robots.txt. Copyright law doesn’t apply to this scenario.
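For what it's worth, robots.txt is purely advisory; a crawler has to choose to honor it. A minimal sketch of one that asks OpenAI's crawler (which announces itself as "GPTBot") to stay away while leaving other bots alone:

    # Example robots.txt: disallow OpenAI's crawler by its
    # advertised user agent, and allow all other crawlers.
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /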
I disagree, and though the GP maybe didn't have this sentiment, my personal view is that intellectual property is a bunch of crap and just because there are laws around it in our capitalist society doesn't mean that the laws are moral/just/ethical/good. IP is constantly ingested and transformed which is exactly what LLMs are doing. The fact that ChatGPT can't even accurately reproduce data from its training (it gets basic facts/dates/quotes wrong all the time) really reinforces that it's not infringing on anyone's IP.
If you're tired of responding to these comments, then stop. It's the internet; everyone is at a different place in exploring topics and having discussions. Don't pooh-pooh someone else's journey; move on with your day instead. There is no required reading (other than TFA) on Hacker News.
No. Both legally and practically, you absolutely do not.
The only thing copyright law gives you is an exclusive right to reproduce and sell it, as a whole in its original form or similar, for a limited period of time -- and to transfer that right.
Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.
Read the comments you're replying to. I didn't comment on the legality of ChatGPT training on my content; I said I didn't like it. Regardless, the act of posting content publicly does not mean I give up my copyright claim. Yes, there are fair use situations. Training ChatGPT might be one of them, but I'm not seeing a lot of concrete information one way or the other, and I am seeing arguments that ChatGPT could be considered a derivative work, which would place OpenAI in violation of my copyright.
Send some links if you see some definitive case law sorting this stuff out.
Anyone can read your blog and then post their own blog post using knowledge they learned while reading yours. ChatGPT "learned" from your blog the same way.
Since the way GPT "learns" is not materially similar to how a human learns, I don't see why this talking point is particularly relevant. Nothing stops the courts from distinguishing between an AI and a human with regard to what may be permissible.
I agree, it seems like all the arguments that the use of data by AI should have no more restrictions than the use of data by humans hinge on the implicit (or sometimes explicit) assumption that human learning and machine learning are identical. While there are parallels, there also seem to be significant differences not only in how the learning is done, but also in outcomes for the person whose data is being used. And since a major purpose of IP, copyright, etc. is at least ostensibly to protect the creators of information from negative outcomes, I don't think the outcomes can be ignored when comparing human learning to ML.
Anthropomorphizing that it "learned" is disingenuous and I expect better from the HN crowd.
If ChatGPT regurgitates verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?
A human is capable both of reciting things from memory in an infringing manner and of learning from experiences to create something new. Maybe we should tape people's mouths shut if they dare to violate copyright by reciting a copyrighted book word for word, or put them in a straitjacket if they recreate a copyrighted painting from memory.
Actually, I fear that people who say this are doing something worse than anthropomorphizing.
Often rather than claiming human aspects to the machine, they are going further, and claiming machine aspects to the human.
Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.
That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.
Like with everything in law, "intent" is paramount. Obviously it's neither the trainer's nor the end-user's goal to reproduce training-set data verbatim; quite the contrary, overfitting as such is undesirable.
Intent only goes so far. If I continually but unintentionally reproduce copyrighted works verbatim, I could still face consequences, particularly if I did not show due diligence in preventing it from happening in the first place.
I think you've made up an irrelevant argument. The work has been incorporated into a commercial product, intentionally, under the control of someone else. Software isn't humans that pay taxes, appear in court, have rights, etc.
> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.
Part of this ruling is about how the images are used -- fair use -- not just that they were stored in a particular way. If Google had been using the smaller versions of the images (thumbnails) in other ways, it could have been infringing.
> The Court said that Google did claim fair use, and that whether or not use was fair depended on four factors: the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; the nature of the copyrighted work; the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work.
Taking copyrighted material and using it to train a model is not a copyright infringement - it is sufficiently transformative and has a different use than the original images.
Note that AI models can be used for different things. There has never been an uproar over a model trained to identify objects in an image outputting the word "squirrel".
The model also, as a purely mathematical transformation of the original source material, does not get a copyright. If it needs to be protected, trade secrets are the tool to use. A model is no more copyright-worthy than taking an image and applying `gray = .299 red + .587 green + .114 blue` to it.
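For concreteness, that formula is the standard BT.601 luma approximation; a minimal sketch of the transformation in Python, assuming NumPy is available:

    # Grayscale via the luma weights quoted above (ITU-R BT.601).
    import numpy as np

    def to_grayscale(rgb: np.ndarray) -> np.ndarray:
        """Map an (H, W, 3) RGB array to an (H, W) grayscale array."""
        weights = np.array([0.299, 0.587, 0.114])
        return rgb @ weights

    # A pure-red pixel maps to a luma of 0.299.
    pixel = np.array([[[1.0, 0.0, 0.0]]])
    print(to_grayscale(pixel))  # [[0.299]]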
The output of a model is ineligible for copyright protection (in the US - and most other places).
The output of a model may fall into being a derivative work of the original content used to train the model.
It is up to the human, who has agency in asking the model to generate certain output, to verify that the output does not infringe upon other works if it is published.
Note that this responsibility of the human publishing the work is not anything new with an AI model. It is the same responsibility as if they were to copy something from Stack Overflow or commission a random person on Fiverr; it's just that those are sources we've overlooked for a long time. It is similarly quite possible for the material on those sources to be copyrighted by and/or licensed to some other entity, and the human copying it into the final product is responsible for any copyright infringement.
Saying "I copied this from Stack Overflow" or "I found this on the web" as a defense is just as good as "Copilot generated this for me" or "Stable diffusion generated this when I asked for a mouse wearing red pants" and represents a similar dereliction on part of the person publishing this content.
Actually, people have been successfully sued for plagiarizing other works because they had internalized it and accidentally regurgitated it. So. The fact that content runs through a human brain doesn’t necessarily cleanse it from copyright concerns.
There is no "actually" because you are still addressing distribution. It wouldn't be hard to have another AI that analyzes outputs for copywriter infringement and culls them as necessary.
> You don’t get to make information publicly available. But not publicly available.
But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.
Sure, blog posts are unlicensed (as far as I know), but the idea of something publicly available being subject to restrictions is nothing new.
This is a fantastic point. I can legally go pick up any strictly copyrighted book at a store and read parts of it for free, which I will then have learnt and have in my brain to share with anyone else. If I happen to have a superintelligent brain, I can potentially gain a lot more and make a lot more inferences from this one outing, and consequently add a lot of value for the people I share my info with.
But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.
Copyright just doesn't protect such cases. There's a funny exaggeration that is very illustrative: copyright protects the bugs in the code, i.e. the specific way in which the code was written. Reading it and getting inspired was never meant to break copyright.
What protects particular solutions is patents. For example, if someone were to obtain a patent on computing the GCD of large integers the usual fast way, then everyone else would have to use a different solution.
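To make the example concrete: at its core, the "usual fast way" is Euclid's algorithm, sketched below in Python (big-integer libraries typically layer refinements such as binary GCD or Lehmer's algorithm on top):

    # Euclid's algorithm: repeatedly replace (a, b) with (b, a % b)
    # until b reaches zero; the remaining a is the GCD.
    def gcd(a: int, b: int) -> int:
        while b:
            a, b = b, a % b
        return abs(a)

    print(gcd(252, 105))  # 21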
This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.
If you go read a book, memorize it, write it down later in a substantively similar form, and share it freely or sell it — yes, you might get into copyright trouble. It has happened before and it is at best a tricky gray area.
If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.
It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.
I mean in the floating-point / quantized numbers and the connections that make up the model? I'm not sure I follow; the analogy to the human brain has always been obvious -- it's even in the name (artificial neural network)...
The analogy is just that: an analogy, and a very imperfect, misleading one. The working of the brain may have motivated early research, but GPT (as instantiated in hardware) does not operate or learn in a way similar to a human brain.
Another day, another person on HN showing us how they don't understand the difference between Public Domain and Open Source or Copyleft etc.
And regardless -- the problem is that expectations of how content can be consumed are now fundamentally violated by the automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limits on the speed and scale at which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.
Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.
> People put stuff up on the Internet with the expectation of its consumption by human minds
Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.
I think it's a bit intellectually dishonest to claim an equivalence between content indexing for search engines and machine learning for LLMs. They might share an underlying harvesting technique, but their uses -- indexing for information accessibility vs. automatic content production -- are qualitatively different.
Further, for a couple of decades now, almost every site has had a robots.txt that permits content harvesting only for certain accepted purposes. So clearly people already had a sense of how they wanted their content harvested and for what purposes.
Your blog which you posted online for anyone to download and read?
Don't get me wrong, this is a grey area where copyright law and general consensus haven't caught up with new technology. But if you voluntarily stick something up online with the intent that anyone can read it, it seems a bit mean to then say "wait, no, you can't do that" if someone finds a way to materially profit off it.
You’re right! Just like Disney+ did when I watched Star Wars the other day. I’m excited to know Disney has consented to me posting Star Wars in its entirety free online.
Can you make ChatGPT produce the content of your blog post "in its entirety?" You can share the URL to a ChatGPT conversation, so it should be easy to prove the copyright violation by replying to me with two links: one to your blog post, and one to the ChatGPT conversation containing an unauthorized copy of it.
It doesn't require humans to work for free. While that's been the common default MO ever since everyone looked at Google building a search index and thought "if they're doing it, surely so can we", there are datasets made by paying people.
There are such datasets, and AI companies absolutely pay to have data curated. But I suspect it would be just unimaginably expensive to create a dataset from scratch with enough tokens to feed a model with hundreds of billions of parameters, all the while paying every participant fairly.
"fair" is somewhat undefined, as the fair-looking number for being paid for effort can be very different to the fair-looking number for being paid for the resale value of the end product on an open market.
I wonder what would an LLM trained on Google code and internal documents look like?