The article says they got datasets from Anna's Archive. It was most likely the scihub/libgen torrent, which is 96.0 TB right now and contains 92,872,581 files. That works out to about 1 megabyte per file on average.
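For anyone who wants to check the arithmetic, here is a quick sketch using the two figures quoted above. It assumes "TB" means decimal terabytes (10^12 bytes); with binary units the result comes out slightly higher but still rounds to about 1 MB. The function name is just for illustration.

```python
# Figures quoted in the comment above.
# Assumption: "TB" is a decimal terabyte (10**12 bytes).
TOTAL_BYTES = 96.0e12
FILE_COUNT = 92_872_581


def average_file_size_mb(total_bytes: float, file_count: int) -> float:
    """Return the mean file size in decimal megabytes (10**6 bytes)."""
    return total_bytes / file_count / 1e6


avg_mb = average_file_size_mb(TOTAL_BYTES, FILE_COUNT)
print(f"{avg_mb:.2f} MB per file")
```

This is only a mean, so it says nothing about how the sizes are spread out: a few huge scanned books and a lot of small papers would give the same number.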
Sometimes a PDF is big because it's packed with important illustrations and charts, like a textbook or journal paper.
Other times a PDF of a book is big because someone scanned it and didn't have trustworthy OCR, so they figured distributing images of text at 1.5 MB per page was better than risking OCR errors.