The index is tiny, not even a terabyte. Right now it's a few hundred gigabytes for ~20 million URLs. But it's stored in an extremely dense binary format.
Honestly, you may just want to roll your own solution for storing a ton of files. If you don't need a general-purpose filesystem, but only an append-only archive with some extra metadata, you can cut a lot of corners. If the store is fixed-size and append-only, you can build it in ways no off-the-shelf system can match, since you need no free-space management, no fragmentation handling, and no locking for readers.
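To make that concrete, here's a minimal C sketch of that kind of archive. The record layout, field names, and single-writer assumption are mine for illustration, not an actual on-disk format; the point is how little machinery you need once you drop general-purpose requirements.

```c
/* Minimal append-only archive sketch: each record is a small fixed
   header (metadata + payload length) followed by raw payload bytes.
   Nothing is ever rewritten, so there is no free-space map and no
   fragmentation. Single writer assumed; field names are illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

typedef struct {
    uint64_t timestamp;   /* extra metadata, e.g. crawl time */
    uint32_t flags;
    uint32_t length;      /* payload size in bytes */
} RecordHeader;

/* Returns the record's offset in the file, which becomes its
   permanent address -- store it in an index to find it again. */
int64_t archive_append(int fd, const RecordHeader *hdr, const void *payload) {
    int64_t offset = lseek(fd, 0, SEEK_END);
    if (write(fd, hdr, sizeof *hdr) != sizeof *hdr) return -1;
    if (write(fd, payload, hdr->length) != (ssize_t)hdr->length) return -1;
    return offset;
}

int main(void) {
    int fd = open("archive.dat", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return 1;
    RecordHeader hdr = { .timestamp = 1700000000, .flags = 0, .length = 5 };
    archive_append(fd, &hdr, "hello");
    close(fd);
}
```

Because records are immutable and addressed by offset, readers need no coordination with the writer at all.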
This line of thinking is a large part of why my index is so small and fast. I use a lot of purpose-built data structures designed for their exact use case. One example is a fixed-size, append-only hash map backed by memory-mapped files, which can in theory grow larger than system memory. Very good for a search engine, almost useless everywhere else.
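Here's a rough C sketch of what such a structure could look like, using mmap so the kernel pages the table in and out on demand (which is why it can exceed physical RAM). The capacity, slot layout, and names are my own illustration; a real implementation would mix keys through a hash function and handle errors more carefully.

```c
/* Fixed-size, insert-only hash map backed by a memory-mapped file.
   Capacity is chosen once at creation; there is no resizing and no
   deletion, which removes most of the complexity of a general map. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define CAPACITY (1UL << 20)   /* slot count, fixed forever (illustrative) */
#define EMPTY    0             /* key 0 is reserved to mark unused slots */

typedef struct { uint64_t key, value; } Slot;

static Slot *table;            /* mmap'd array of CAPACITY slots */

int map_open(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, CAPACITY * sizeof(Slot)) < 0) { close(fd); return -1; }
    table = mmap(NULL, CAPACITY * sizeof(Slot),
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return table == MAP_FAILED ? -1 : 0;
}

/* Linear probing; slots are only ever written once (append-only). */
int map_put(uint64_t key, uint64_t value) {
    for (uint64_t i = 0; i < CAPACITY; i++) {
        Slot *s = &table[(key + i) % CAPACITY];
        if (s->key == EMPTY || s->key == key) {
            s->key = key;
            s->value = value;
            return 0;
        }
    }
    return -1;                 /* table full: fixed size by design */
}

int map_get(uint64_t key, uint64_t *value) {
    for (uint64_t i = 0; i < CAPACITY; i++) {
        Slot *s = &table[(key + i) % CAPACITY];
        if (s->key == EMPTY) return -1;   /* miss */
        if (s->key == key) { *value = s->value; return 0; }
    }
    return -1;
}
```

Since the file is the table, there's no serialization step: close the process and reopen the map, and everything is still there, paged in lazily as it's touched.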