How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.
I tried to automate downloading just a thousand books and it was unbearably slow, both from IPFS and from the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.
But probably you should get it from the database dumps they provide instead of hammering the website.
So you come up with a list of books you want to prioritize, search the DB for the torrent name and the file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books than you asked for, which you may index or skip for now, but it is certainly doable.
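A rough sketch of that lookup step, assuming the dump has already been loaded into rows that say which torrent and path each file lives in. The keys `id`, `torrent` and `path`, and the sample names, are made up for illustration; the real dump schema will differ.

```python
from collections import defaultdict

def plan_downloads(wanted_ids, records):
    """Map each torrent name to the files inside it that we actually want.

    `records` is an iterable of dicts with the hypothetical keys
    'id', 'torrent' and 'path'.
    """
    wanted = set(wanted_ids)
    plan = defaultdict(list)
    for rec in records:
        if rec["id"] in wanted:
            plan[rec["torrent"]].append(rec["path"])
    # Sort paths so the plan is stable between runs.
    return {torrent: sorted(paths) for torrent, paths in plan.items()}

# Made-up sample rows standing in for the database dump.
records = [
    {"id": "a1", "torrent": "batch_000.torrent", "path": "000/a1.epub"},
    {"id": "b2", "torrent": "batch_000.torrent", "path": "000/b2.pdf"},
    {"id": "c3", "torrent": "batch_001.torrent", "path": "001/c3.txt"},
]

plan = plan_downloads({"a1", "c3"}, records)
for torrent, paths in plan.items():
    print(torrent, paths)
```

The resulting per-torrent file list is what you would feed to a torrent client that supports selecting individual files.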
The thing is, an ISBN identifies one edition from one publisher, and one can easily have the same text under 3 different ISBNs from one publisher (hardcover, trade paperback, mass-market paperback).
I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:
Granted, some predate ISBNs, one is the 3D pop-up version (so not a traditional text), and so forth, but filtering by ISBN will _not_ filter out duplicates.
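To make the point concrete, here is a minimal sketch of grouping editions by a normalized author/title key instead of by ISBN. The record keys and the placeholder ISBN strings are assumptions for illustration, and this heuristic is crude: it merges editions that share a title, but it will not catch a work republished under a different title.

```python
import re
import unicodedata
from collections import defaultdict

def _norm(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, turn punctuation into spaces, collapse whitespace.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", stripped.casefold())
    return " ".join(cleaned.split())

def work_key(author, title):
    """Crude key that should be equal for editions of the same work."""
    return (_norm(author), _norm(title))

def group_editions(records):
    """Group ISBNs by work key; `records` uses hypothetical keys."""
    groups = defaultdict(list)
    for rec in records:
        groups[work_key(rec["author"], rec["title"])].append(rec["isbn"])
    return dict(groups)

# Placeholder ISBN strings, not real identifiers.
editions = [
    {"isbn": "placeholder-hardcover", "author": "J.R.R. Tolkien", "title": "The Hobbit"},
    {"isbn": "placeholder-paperback", "author": "J. R. R. Tolkien", "title": "the hobbit "},
    {"isbn": "placeholder-other", "author": "Hal Clement", "title": "Small Changes"},
]

groups = group_editions(editions)
for key, isbns in groups.items():
    print(key, isbns)
```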
There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:
We’ll also need to consider that some versions might be easier to index even though the user would prefer another version. E.g. if we have a TXT and an EPUB, we might want to index the TXT (if it’s clean enough), but present the user with the EPUB (with formatting and stuff).
But it’s not a huge problem actually: just link to the search page instead and let the user decide what they want to download.
1. There are a lot of variants of the same book. We only need one for the index. Perhaps for each ISBN, select the format easiest to parse.
2. We can download, convert and index top 100K books first, launch with these, and then continue indexing and adding other books.
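Point 1 could look something like the sketch below. The preference order (TXT, then EPUB, then PDF) is only an assumption about what is easiest to parse, and the record keys and file names are made up.

```python
# Assumed order of "easiest to parse"; adjust to taste.
INDEX_PREFERENCE = ["txt", "epub", "pdf"]

def pick_for_index(files):
    """Return one file per ISBN, choosing the most parse-friendly format.

    `files` is an iterable of dicts with the hypothetical keys
    'isbn', 'format' and 'path'. Unknown formats are ignored.
    """
    rank = {fmt: i for i, fmt in enumerate(INDEX_PREFERENCE)}
    best = {}
    for f in files:
        fmt = f["format"].lower()
        if fmt not in rank:
            continue
        current = best.get(f["isbn"])
        if current is None or rank[fmt] < rank[current["format"].lower()]:
            best[f["isbn"]] = f
    return best

# Made-up sample data.
files = [
    {"isbn": "isbn-A", "format": "PDF", "path": "a.pdf"},
    {"isbn": "isbn-A", "format": "epub", "path": "a.epub"},
    {"isbn": "isbn-B", "format": "txt", "path": "b.txt"},
    {"isbn": "isbn-B", "format": "epub", "path": "b.epub"},
    {"isbn": "isbn-C", "format": "djvu", "path": "c.djvu"},
]

chosen = pick_for_index(files)
for isbn, f in chosen.items():
    print(isbn, f["path"])
```

A second preference list with a different order could pick the file to present to the user, while this one picks the file to index.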