Zuckerberg Approved AI Training on Pirated Books, Filings Say

dekhn · 2025-01-10T01:53:27 1736474007

I believe it was already known that anything trained on The Pile contained references to copyrighted material from scihub. It seems unlikely that folks who chose to use these sources were completely unaware of the nature of the data. Presumably, given the urgency in the last 2-3 years to be a leader in this space, a number of shortcuts were taken.

nobrains · 2025-01-10T08:12:14 1736496734

Zuck did a calculation: "Does the risk of lawsuits and bad PR outweigh the benefits of being early?".

If you remove morals from the equation, nearly every CEO would have made that same decision if in that position.

throw5959 · 2025-01-10T09:34:04 1736501644

You talk about morals, but did you consider that they are releasing the model as open source, and given that OpenAI and others do the same, Zuck is really the only current option to have a reasonably comparable open source model? Also, did you consider that it might be more moral to create an AI model than to uphold copyright law, which actually many on this site deem immoral?

IMHO this is a moral win on Zuck's side.

Havoc · 2025-01-10T01:56:15 1736474175

Stripping out the copyrights is quite damning.

There is wrongdoing and there is obvious evidence that you know what you’re doing is wrong. That really limits options on defence.

dudek1337 · 2025-01-12T00:06:37 1736640397

When I read about this I immediately thought about Aaron Swartz. He was persecuted for downloading copyrighted stuff, while Meta and other corps will get a slap on the wrist. Such a sad story. And it's the anniversary of his death today... RIP

asdefghyk · 2025-01-10T01:07:04 1736471224

It's also reported elsewhere (in media articles linked from Hacker News) that they torrented copyrighted material. AMAZING

vivzkestrel · 2025-01-10T03:49:00 1736480940

Stupid question: I have 400000 ebooks (yup pirated ones) what happens if I build an LLM with this?

rcakebread · 2025-01-10T05:16:21 1736486181

You'd still ask stupid questions?

blitzar · 2025-01-10T09:09:37 1736500177

You would have a net worth of 1bn

fooker · 2025-01-10T11:12:44 1736507564

Depends on the parameter count.

Too high? Straight to jail.

Too low? Believe it or not, straight to jail.

anothername12 · 2025-01-10T08:01:07 1736496067

You’ll be fine. It’s like laundering money.

covofeee · 2025-01-10T10:26:49 1736504809

You also need $100m to train it

wil421 · 2025-01-10T11:35:45 1736508945

Build Chappie.

stuckkeys · 2025-01-10T04:13:39 1736482419

Nothing.

gooosle · 2025-01-10T07:22:00 1736493720

You go to jail forever.

solumunus · 2025-01-10T04:42:16 1736484136

What do you imagine could happen?

palata · 2025-01-10T15:09:56 1736521796

Same old story: Meta is too big to care. What will happen? A fine? Sure, they can pay.

ungreased0675 · 2025-01-10T01:09:53 1736471393

I would speculate this is true of all the leading commercial LLM models. Don’t have enough training data? Just steal some!

Havoc · 2025-01-10T01:54:45 1736474085

On true for all - you’d need to split it by era I think

During the early Llama 1 days The Pile dataset was in heavy use by many. A bit later people figured out that a subset of it - Books3 - was especially problematic.

I’m guessing all the big houses threw that piece out in later models since it’s extra radioactive

archerx · 2025-01-10T06:37:25 1736491045

What was problematic about it?

Havoc · 2025-01-10T09:03:00 1736499780

Thousands of pirated copyrighted books

gooosle · 2025-01-10T07:21:31 1736493691

Copy some*

pera · 2025-01-10T13:29:04 1736515744

It depends who you are:

- if you are an individual then it's called "pirating copyrighted work"

- if you are a multi-billion dollar corporation then it's called "use of uncleared material for training"

throw5959 · 2025-01-10T17:40:49 1736530849

Intent matters. I'm very happy we don't live in a world where it doesn't matter.

taskforcegemini · 2025-01-12T07:16:46 1736666206

one is personal use, the other is making money off of it?

throw5959 · 2025-01-12T08:04:25 1736669065

Meta open sourced the result.

BSDobelix · 2025-01-10T09:32:40 1736501560

That's exactly the difference, one does not steal in the digital world. If I could download/copy a car I would do it ;)

dehrmann · 2025-01-10T07:27:11 1736494031

Courts have yet to decide on which it is, and it might depend on how well the model can transform vs. recite.

vidarh · 2025-01-10T11:29:03 1736508543

The point is that whatever courts decide, it is not theft. It may or may not be copyright infringement, but copyright infringement is not theft.

exe34 · 2025-01-10T11:44:12 1736509452

but muh shareholders!

bodiekane · 2025-01-10T14:27:04 1736519224

If we're going to use absurd hyperbole like "steal", I think we should just keep going further.

Zuckerberg murdered some old library books to train a model. Zuckerberg genocided training data!

Heck, everyone who read your comment here stole it. I'm so sorry for your loss.

aprilthird2021 · 2025-01-12T22:14:19 1736720059

It's not absurd hyperbole. If I took the text of, say, the NYT Bestseller this week, stored the text along with various Project Gutenberg books, then created a program to randomly deliver you a chapter from any such book, that would probably get me a lawsuit.

This is just that with lots of levels of indirection.

rurban · 2025-01-10T08:10:28 1736496628

Jail time? Or just multi-million fines.

Will he be allowed to lead Meta if convicted as a criminal?

covofeee · 2025-01-10T10:25:44 1736504744

You saw the WP cartoon right?

exe34 · 2025-01-10T11:45:00 1736509500

he should run for election!

cma · 2025-01-10T10:25:43 1736504743

A book's copyright is no more valid than a website's

pointedAt · 2025-01-16T11:16:58 1737026218

ugh, that stuff from piratebay is full of errors and "pranks"

#dataPoisoning

htrp · 2025-01-10T09:46:53 1736502413

i was under the impression that almost everyone trained on books3

alightsoul · 2025-01-10T11:43:47 1736509427

Given how things are going, maybe it will be ruled as "fair use" whereas something like controlled digital lending at the Internet Archive was ruled as "infringing". Disgusting. So AI might become the only "legal" way to access a lot of knowledge for free that you otherwise wouldn't have access to.

musicale · 2025-01-10T04:29:31 1736483371

"I'm shocked, shocked to find out that piracy is going on here!"

"Your LLM, Captain Zuckerberg."

"Oh, thank you very much!"

udev4096 · 2025-01-10T05:57:45 1736488665

Everyone knows that LLMs are trained on a shit ton of pirated content

atulvi · 2025-01-10T02:52:28 1736477548

Good. These laws are anti progress.

idiotsecant · 2025-01-10T05:42:43 1736487763

We're literally extracting, refining, and re-using the information, art, and thoughts of fellow humans to make billionaires money.

This isn't the 90s. Computing isn't about discovery, not in the big leagues. It's about grinding up authenticity and feeding it into a machine to convert it into shareholder value.

If they want the value, let them pay for it or release the models open source for all to benefit.

archerx · 2025-01-10T06:40:02 1736491202

They have released all the models for free so far unlike other companies like OpenAI who are most likely doing the same but keeping it private and proprietary.

ulfw · 2025-01-10T02:56:23 1736477783

What "progress"?

pizza · 2025-01-10T03:29:02 1736479742

Exfiltration of information from the economy

exe34 · 2025-01-10T11:46:03 1736509563

does the economy lose this information? are pages now missing from the books on your bookshelf?

pizza · 2025-01-11T09:49:57 1736588997

I didn't say it was correct, I was just saying what the position is

exe34 · 2025-01-11T14:33:10 1736605990

it's okay, you can say it's wrong!

covofeee · 2025-01-10T10:28:18 1736504898

Copyright is your friend.

horsawlarway · 2025-01-10T10:45:55 1736505955

No.

There is a theoretical implementation of copyright that is your friend.

The realities of the laws as implemented today are abusive and hostile.

palata · 2025-01-10T15:12:27 1736521947

Does it mean that they should be removed entirely? Surely we can agree on the fact that I should not be allowed to make a copy of a book, put my name on it instead of the real author, and sell it? Or even claim that I wrote it and put it on my resume?

horsawlarway · 2025-01-14T16:08:26 1736870906

> Does it mean that they should be removed entirely?

Maybe. I think it means we're at a spot where I'm not reasonably convinced that existing copyright laws are actually better than the free-for-all you're describing.

I'm definitely with you that it'd be ideal if we had a way to handle direct plagiarism like described in your comment (Although if you dropped the "put my name on it" part, I don't really see much issue).

But we also have all the fun today of companies using copyright to silence critique, shut out competitors, take educational information offline, demonetize videos they don't like, and otherwise absolutely abuse the hammers copyright law has given them (often - with no reciprocal hammer to stop this type of abuse).

And that's not even getting into the discussion of whether or not 70+ year old characters and stories should be available for modern authors to reuse and reinterpret. (even more egregious when you consider the vast majority of those tales are direct reinterpretations of older stories themselves...).

Or we can discuss whether it should really be legal to sell electronics hardware that has digital locks inside it that not only am I (the legal owner) not given the key to, but for which it is literally illegal for me to even attempt to open.

----

So basically - If I had to pick between "You'll own nothing and rent everything, and all public discourse is subject to DMCA strikes or other removal" vs "no copyright"... My vote is for "no copyright".

But the reality is I think we can strike a much better balance than those extremes, we just can't do it without upsetting large profit streams for existing, very wealthy and entrenched, entities... and usually that doesn't happen without tearing things down first.

bodiekane · 2025-01-10T14:30:33 1736519433

Copyright is the friend of the 1% and the enemy of everyone else.

(Of course, I'm using "the 1%" rhetorically, it's really more like 0.01%)

As a society, we all clearly benefit from fair use far more than we benefit from members of the copyright cartel buying another mansion or private jet.