> it's the resulting content that matters, not how it's presented
What a wild thing to say. If you had a coworker who was brilliant and taught you many great things, but only screamed instead of talking, would you feel the same way?
> If a person anthropomorphizes an LLM in their mind (rather than just in their speech patterns), then they probably have pre-existing mental problems.
Correct, and that's why these tools should be built responsibly, under the assumption that people with mental problems are going to use them. It's clear from the article I linked (and my wording when linking to it) that these tools can exacerbate issues for people. ChatGPT told him that he was sane and that his mom was trying to kill him. He didn't understand what an LLM actually was.
I'm not claiming the purpose of this prompt is to get better information. Yes, it's just a prompt.
You're asserting quite a lot of bias when you say "what most people want are useful results." Maybe in our circles of software engineers or lawyers, but many people are using AI for companionship. Even for those not seeking companionship, unless you have a very clear understanding of how LLMs work, it's very easy to get caught up thinking that the chatbot you're talking to is "thinking" or "feeling". I feel companies that offer chatbots should be more responsible about this, as it can be very dangerous.
Can someone with an actual, fundamental understanding of LLMs explain to me why they think it's perfectly legal to train models on copyrighted material? I don't know enough about this. Please don't answer by asking ChatGPT.
Consider how commercial search engines are permitted to show text snippets, thumbnails, and site caches.
AI developers will most likely rely on a fair use defense. I think this has a reasonable chance of success since, while the use of a given copyrighted work may affect the market for that work (in this case, NYT's articles), it can be argued to be highly transformative. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", with "transformative" defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".
There's also potential for an "implied license", as in Field v. Google Inc., which concerned rehosting a snapshot of a site: "Google reasonably interpreted absence of meta-tags as permission to present 'Cached' links to the pages of Field's site". As far as I can tell, in this case NYT's robots.txt of the time was obeyed, and it permitted automated processing of all but one specific article, for some reason.
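For context on how that works mechanically: a well-behaved crawler fetches robots.txt first and skips any path the file disallows. Here's a minimal sketch using Python's standard urllib.robotparser; the domain and article path are hypothetical stand-ins, not NYT's actual file:

```python
# Minimal sketch of a crawler honoring robots.txt before fetching a page.
# The domain and paths are hypothetical, not NYT's actual robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# A crawler that honors the protocol only fetches what the file allows.
page = "https://www.example.com/2006/some-article.html"
if rp.can_fetch("MyCrawler", page):
    print("allowed to fetch", page)
else:
    print("disallowed by robots.txt; skipping", page)
```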
Why do you think it is legal to train students on copyrighted material? Copyright is supposed to protect against unauthorized reproduction, not unauthorized learning. That the NY Times is able to show some verbatim reproduction is a real legal issue, but that should not be extended to training generally.
Students are humans. LLMs are not. Machine "learning" is a metaphor, not what's actually happening. Stop anthropomorphizing, and show some loyalty to your species.
The loyalty argument does sound somewhat bizarre, but I think the overarching point is about whether technology use benefits humans in society or not. We should not implicitly grant LLMs owned by corporations the same rights as humans. Without some form of legislation, LLMs look like they will benefit corporations that are salivating at the profits and the prospect of reducing or eliminating the number of creative workers they need.
Why would I want to quantify it? The burden of proof is on the thief.
I have a gadget that will, with some probability, steal your life's savings. It operates through a process that is analogous to a human chewing. When engineering it, we just say for simplicity that the gadget "chews". Of course, that's only a metaphor -- machines can't chew.
But (and here's where your argument gets ridiculous), unless you can quantify the fact that my gadget can't chew, I will steal your savings. Good luck.
I think your question is incorrect. It's very likely no one thinks it's perfectly legal. There are probably many people who think it's not a big deal, though. Try coming up with a dataset that doesn't have any copyrighted material in it. Like, seriously, try. You can't use pretty much anything newer than a century old. Everything is copyrighted by default. Very few new things are explicitly in the public domain or licensed in a way that would allow usage. Now imagine LLMs trained on early 20th century newspapers, books, and letters. Do you think it would be good at generating code, or hip copy for the homepage of your next startup?
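To make the scale of that problem concrete, here's a toy sketch of the filter involved, assuming a corpus tagged with publication years; the cutoff constant and corpus entries are made-up illustrations, not a real pipeline:

```python
# Toy illustration: keep only documents old enough that their US
# copyright has plausibly expired. Cutoff and corpus are assumptions.
PUBLIC_DOMAIN_CUTOFF = 1929  # roughly "anything newer than a century old" is out

corpus = [
    {"title": "1921 newspaper column", "year": 1921},
    {"title": "2019 blog post on web frameworks", "year": 2019},
    {"title": "2023 startup landing page", "year": 2023},
]

usable = [doc for doc in corpus if doc["year"] < PUBLIC_DOMAIN_CUTOFF]
print(usable)  # only the 1921 column survives the filter
```

Nothing that survives such a filter could teach a model to write modern code or marketing copy, which is the point.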
> Now imagine LLMs trained on early 20th century newspapers, books, and letters. Do you think it would be good at generating code, or hip copy for the homepage of your next startup?
Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.
That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.
It's right around when women got the right to vote.
Trying to get it to not say anything horrible under modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.
Exactly. Copyright terms are so long that most material with expired copyright is not useful for modern uses of LLMs, and looking for modern non-copyrighted material is too hard to do quickly, with unclear usefulness. So people who grew up with the Internet, and are used to making memes with copyrighted material, are not exactly averse to doing it on a bigger scale.
1. Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music, and later to write a book about music which likely includes some of the concepts you earlier learned.
2. Neither the LLM nor the output text contain sufficient elements of the copyrighted work to qualify for copyright protection. Just like if you turned old library books into compost and sold the compost, you wouldn't expect to pay authors of those books a royalty for the compost sales.
> Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music
If you learn a little too hard, though, and reproduce the original textbook in its entirety, you'll get in trouble.
My guess is that courts will determine that the training itself is not illegal, but that either the AI companies or the users will be found liable for reproducing copyrighted work in the output, and no one will want to hold liability for that.
If the work goes beyond fair use, it is a copyright violation. It doesn't matter if it was created by a person or an AI.
Technology that makes copyright violations easier/quicker has typically been found legal if "the technology in question had significant non-infringing uses".
This makes sense. It was allowed for the content to be read and used in certain ways (e.g. search engines or as references) without substantial reproduction. The NYT would then have to flag specific generated content as infringing a specific work which could then be judged as fair use or not on a case-by-case basis. If a particular site/company was repeatedly and/or primarily using substantial content then perhaps it could be 'delisted' as search engines do for links to pirated copies of works.
It really hinges on "substantially similar". If I copy Harry Potter and change every instance of Harry Potter to Michael Rose, surely it's infringing. If I write a coming-of-age story set in a magical land, I'm probably OK. Which do you think LLMs produce?
It's likely not capable of literally giving you Harry Potter. If you specify your prompt narrowly enough that the output qualifies as fan fic, it's probably exactly what you were going for. After all, your word processor is capable of producing infringing works but is not itself an infringing work.
Fair use, probably. How many news pieces have you read that amount to "The New York Times reports..." followed by a summary of the Times' article? It's not illegal to use copyrighted works as a source, as inspiration, or to guide style.
Surely. Remember when the VCR came out and some parties absolutely freaked out? Jack Valenti said:
"I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone."
Then we invented, from whole cloth, reasons why VCRs were perfectly OK: there was a ton of money to be made, and everyone would actually be better off if the VCR was a thing. Everyone knew it, too, because the case ended up being argued after millions of VCRs were already in households.
Read about the 'fair use' doctrine and put yourself in the shoes of someone who is training a model, and see if you can argue, from their perspective, why it should be allowed.
I hold a degree from a small, regional university and I am better for it IMO. That wasn't my point... I'm more curious where and what you can actually study.
Aren't a lot of unaccredited places, like for-profit institutions, generally lower quality and more expensive? They're trying to squeeze you for student loan bucks.