Ebooks are 1-2 MB each, max. 81.7 TB is a lot of books, like 42-85 million boo...

weberer · 2025-02-07T11:54:04 1738929244

The article says they got datasets from Anna's Archive. It was most likely the scihub/libgen torrent which is 96.0 TB right now and contains 92,872,581 files. That's about 1 megabyte per file.
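As a rough check of the per-file figure, here is a minimal sketch using only the numbers quoted in this thread. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the comment does not state, and the variable names are made up for illustration.

```python
# Figures quoted in the comment above; decimal units are an assumption.
TORRENT_BYTES = 96.0e12      # 96.0 TB
FILE_COUNT = 92_872_581      # files in the torrent

# Average size per file, in megabytes.
avg_bytes = TORRENT_BYTES / FILE_COUNT
avg_mb = avg_bytes / 1e6
print(f"average file size: {avg_mb:.2f} MB")

# Applying the same average to the 81.7 TB figure mentioned upthread
# gives a ballpark file count for that download.
DOWNLOADED_BYTES = 81.7e12
est_files = DOWNLOADED_BYTES / avg_bytes
print(f"estimated files in 81.7 TB: {est_files / 1e6:.0f} million")
```

This is only an average: individual files range from small text-only ebooks to large scanned PDFs, as discussed further down the thread.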

https://annas-archive.org/datasets

southernplaces7 · 2025-02-09T20:47:04 1739134024

Where does one find these torrent datasets? Did they download the books in bits and pieces or as a single huge multi-TB file?

thunkingdeep · 2025-02-07T11:38:20 1738928300

I’ve got pirated books that are 70-80 MB each, I think because of the illustrations. Guess it depends on the book.

mateus1 · 2025-02-07T11:39:42 1738928382

I don’t think they’re using picture-heavy books for LLM training, no?

RIMR · 2025-02-07T11:58:25 1738929505

Just because the LLMs are trained on text doesn't mean that images weren't a part of what they downloaded.

You clean up the data after you acquire it, not before.

littlestymaar · 2025-02-07T11:47:35 1738928855

Even if they didn't use the illustrations (which isn't clear, given multimodal models), they'd still make use of the text in the books.

WithinReason · 2025-02-07T11:45:06 1738928706

Presumably they didn't create the torrent

rbanffy · 2025-02-07T11:48:40 1738928920

Whoever created it has a lot of spare hard disk space.

RIMR · 2025-02-07T11:59:03 1738929543

100TB is like 6 hard drives...

hulitu · 2025-02-07T12:26:31 1738931191

> 100TB is like 6 hard drives...

Discounted Seagates? /s

rbanffy · 2025-02-07T14:06:48 1738937208

You can get recertified 18TB drives, but still it's a lot of disk space. I simply don't have enough data.
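The "6 hard drives" estimate lines up with 18 TB drives. A minimal sketch of that arithmetic, assuming raw capacity only (no RAID or backup redundancy, no filesystem overhead), with names chosen just for illustration:

```python
import math

DATASET_TB = 100   # round figure used upthread
DRIVE_TB = 18      # recertified drive size mentioned above

# Smallest whole number of drives whose raw capacity covers the dataset.
drives_needed = math.ceil(DATASET_TB / DRIVE_TB)
spare_tb = drives_needed * DRIVE_TB - DATASET_TB
print(f"{drives_needed} drives, {spare_tb} TB to spare")
```

Any redundancy scheme would push the drive count higher than this raw-capacity figure.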

moralestapia · 2025-02-07T11:43:09 1738928589

Yes they do, there are multimodal models.

rbanffy · 2025-02-07T11:49:32 1738928972

I don't think they need to be selective. It's not like Meta can run out of storage.

mnsu · 2025-02-07T11:43:22 1738928602

For multimodal models, why not? They would probably be some of the best data.

michaelt · 2025-02-07T12:03:09 1738929789

Sometimes the PDF of a book is big because the book's packed with important illustrations and charts - like a textbook or journal paper.

Other times a PDF of a book is big because someone scanned it and didn't have trustworthy OCR, so they figured distributing images of text at 1.5 MB per page was better than risking OCR errors.

hulitu · 2025-02-07T12:25:28 1738931128

Why not? Do you think that AI doesn't enjoy porn? /s

squigz · 2025-02-07T11:46:51 1738928811

It could be anywhere from a few million to a hundred million

https://annas-archive.org/datasets