I think there’s a couple ways to improve it: 1. There’s a lot of variants of the...

throwup238 · 2025-06-04T09:57:50 1749031070

How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.

I tried to automate downloading just a thousand books and it was unbearably slow, from both IPFS and the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.

notpushkin · 2025-06-05T09:23:51 1749115431

For each book they store its exact location in the torrent files. You can see it on the book page, e.g.:

collection “ia” → torrent “annas-archive-ia-acsm-n.tar.torrent” → file “annas-archive-ia-acsm-n.tar” (extract) → file “notesonsynthesis0000unse.pdf”

But you should probably get it from the database dumps they provide instead of hammering the website.

So you come up with a list of books you want to prioritize, search the DB for torrent name and file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books, which you may index or skip for now, but it is certainly doable.
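A rough sketch of the "search the DB, then download only the files you need" step, assuming you've already pulled the wanted records out of a dump. The field names (`torrent`, `path`) and the sample values are made up for illustration and are not the real schema:

```python
from collections import defaultdict

# Hypothetical records, as they might look after querying a metadata dump.
# Field names and values are placeholders, not the real schema.
wanted = [
    {"id": "book-a", "torrent": "collection-1.torrent", "path": "a.pdf"},
    {"id": "book-b", "torrent": "collection-2.torrent", "path": "b.epub"},
    {"id": "book-c", "torrent": "collection-1.torrent", "path": "c.pdf"},
]

def plan_downloads(records):
    """Group wanted files by torrent so each torrent is only opened once."""
    plan = defaultdict(set)
    for rec in records:
        plan[rec["torrent"]].add(rec["path"])
    # Sort the paths so the plan is stable between runs.
    return {torrent: sorted(paths) for torrent, paths in plan.items()}

plan = plan_downloads(wanted)
for torrent, paths in sorted(plan.items()):
    print(torrent, "->", ", ".join(paths))
```

The resulting per-torrent file list is what you'd hand to a torrent client that supports selecting individual files.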

WillAdams · 2025-06-04T10:55:42 1749034542

The thing is, an ISBN identifies one edition from one publisher, and one can easily have the same text under 3 different ISBNs from one publisher (hardcover, trade paperback, mass-market paperback).

I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:

https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...

granted some predate ISBNs, one is the 3D pop-up version, so not a traditional text, and so forth, but filtering by ISBN will _not_ filter out duplicates.

There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:

https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

along with others.

notpushkin · 2025-06-05T09:25:45 1749115545

Hmmm, yeah, ISBN isn’t great for this. Is there a good way to deduplicate the books by their contents?

WillAdams · 2025-06-05T11:07:39 1749121659

LoC or Dewey Decimal with author and title (and edition?) should work.
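As a minimal sketch of what that could look like: build a key from the classification plus normalized author and title, and group ISBNs under it. The field names (`lcc`, `author`, `title`, `isbn`) and all sample values are placeholders, not real catalog data:

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def work_key(record):
    # Hypothetical fields: "lcc" is a classification string from the catalog.
    return (record.get("lcc", ""), normalize(record["author"]), normalize(record["title"]))

def group_editions(records):
    """Map each work key to the list of ISBNs that share it."""
    groups = defaultdict(list)
    for rec in records:
        groups[work_key(rec)].append(rec["isbn"])
    return dict(groups)

# Placeholder records: same work, two bindings, differently formatted metadata.
records = [
    {"isbn": "isbn-1", "lcc": "PR0000", "author": "J.R.R. Tolkien", "title": "The Hobbit"},
    {"isbn": "isbn-2", "lcc": "PR0000", "author": "J. R. R. Tolkien", "title": "the hobbit"},
]
groups = group_editions(records)
```

This only merges records whose metadata agrees after normalization. A work re-published under a different title would still land in a separate group.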

I wish there was some better book cataloging/organizing scheme --- the Online Books Page uses LoC:

https://onlinebooks.library.upenn.edu/subjects.html

and is the most workable of the indices I've used.

palmfacehn · 2025-06-04T06:11:41 1749017501

There should be a way to leverage compression when storing multiple editions of the same book.
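One way to illustrate the idea is with Python's standard `zlib`, which accepts a preset dictionary: compress a second edition using the first edition as the dictionary, so shared text is stored as back-references. The two "editions" here are synthetic stand-ins. zlib's window is only 32 KB, so this is a toy demonstration; whole books would need a compressor with a much larger window.

```python
import random
import zlib

def compress_against(base: bytes, target: bytes) -> bytes:
    """Compress target using base as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(target) + comp.flush()

def decompress_against(base: bytes, blob: bytes) -> bytes:
    """Reverse compress_against; needs the same base text."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(blob) + decomp.flush()

# Build a synthetic "first edition" from random words (kept well under 32 KB).
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 8))) for _ in range(500)]
edition_1 = " ".join(rng.choice(vocab) for _ in range(2000)).encode()

# The "second edition" is the same text plus a short new foreword.
edition_2 = b"foreword to the second edition. " + edition_1

standalone = zlib.compress(edition_2, 9)
delta = compress_against(edition_1, edition_2)

print(len(edition_2), len(standalone), len(delta))
```

The trade-off is that the second edition can no longer be read without the first, so the storage layer has to track which base each blob depends on.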

bawolff · 2025-06-04T08:05:54 1749024354

From a search quality perspective, though, you probably don't want 500 different versions of the same book popping up for a query.

palmfacehn · 2025-06-04T11:24:21 1749036261

Agreed. I would prefer to see a single result for a single title. The option of pursuing different editions should follow from there.

qingcharles · 2025-06-04T22:09:15 1749074955

And without some sort of weighting system, it wouldn't even know which one is the best one to show the user.

notpushkin · 2025-06-05T09:29:12 1749115752

We’ll also need to consider that some versions might be easier to index even though the user would prefer another version. E.g. if we have a TXT and an EPUB, we might want to index the TXT (if it’s clean enough), but present the user with the EPUB (with formatting and stuff).

But it’s not a huge problem actually: just link to the search page instead and let the user decide what they want to download.
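If you did want to split indexing and presentation by format, a minimal sketch could look like this. The preference orders and the `txt_is_clean` flag are assumptions for illustration, not anything the archive provides:

```python
# Assumed preference orders; tune to taste.
INDEX_PREFERENCE = ["txt", "epub", "pdf"]    # easiest to extract text from first
DISPLAY_PREFERENCE = ["epub", "pdf", "txt"]  # nicest to read first

def pick(formats, preference):
    """Return the first available format in preference order, or None."""
    available = {f.lower() for f in formats}
    for fmt in preference:
        if fmt in available:
            return fmt
    return None

def choose(formats, txt_is_clean=True):
    """Pick one format to index and one to show the user."""
    index_pref = INDEX_PREFERENCE
    if not txt_is_clean:
        # Skip TXT for indexing when it's too messy to be useful.
        index_pref = [f for f in INDEX_PREFERENCE if f != "txt"]
    return {
        "index": pick(formats, index_pref),
        "display": pick(formats, DISPLAY_PREFERENCE),
    }

print(choose(["TXT", "EPub"]))
print(choose(["TXT", "EPub"], txt_is_clean=False))
```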