Given how few up-to-date English-language search indexes exist (Google, Bing, and Brave if you count it), maintaining one must be incredibly costly. I doubt they are maintaining their own.
I've been wondering: can't this be done p2p? Didn't we solve most of the technical problems in the late 90s / early 2000s, and then just abandon that entire way of thinking for some reason?
If many thousands of people care about having a free / private / distributed search engine, wouldn't it make sense for them to donate 1% of their CPU/storage/network to an indexer / db that they then all benefit from?
Well, flesh it out more and it doesn't sound solved at all.
How do you make it trustless? How do you fetch/crawl the index when it's scattered across arbitrary devices? How do you index the decentralized index? What is actually stored on nodes? When you want to do something useful with the crawled info, what does that look like?
I think you could do it hierarchically, and with redundancy.
You'd figure out a replication strategy based on observed reliability (Lindy effect + uptime %).
It would be less "5 million flaky randoms" and more "5,000 very reliable volunteers".
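To make the replication idea concrete, here is a rough sketch of how a shard might pick its replica set. Everything in it is an assumption for illustration: the `Node` fields, the log-shaped "Lindy" age factor, the one-year cap, and the simplification that nodes fail independently (real volunteer nodes probably don't).

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    days_seen: float  # how long the network has known this node
    uptime: float     # fraction of health probes answered, 0.0 to 1.0


def reliability(node: Node, lindy_cap_days: float = 365.0) -> float:
    """Observed uptime, discounted for nodes that have not been around long."""
    # Lindy-style factor: grows with age, saturates at 1.0 after the cap.
    age_factor = min(1.0, math.log1p(node.days_seen) / math.log1p(lindy_cap_days))
    return node.uptime * age_factor


def pick_replicas(nodes: list[Node], target_availability: float = 0.999) -> list[Node]:
    """Add the most reliable nodes until P(at least one is up) meets the target."""
    chosen: list[Node] = []
    p_all_down = 1.0
    for node in sorted(nodes, key=reliability, reverse=True):
        r = reliability(node)
        if r <= 0.0:
            break  # remaining nodes contribute nothing
        chosen.append(node)
        p_all_down *= 1.0 - r  # assumes independent failures
        if 1.0 - p_all_down >= target_availability:
            break
    return chosen
```

With these made-up numbers, year-old nodes at 99% uptime need only two replicas to reach three nines, while year-old nodes at 50% uptime need ten, which is roughly the "5,000 reliable volunteers vs. 5 million flaky randoms" trade-off.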
Though for the crawling layer you can and should absolutely utilize 5 million flaky randoms. That's actually the holy grail of crawling. One request per random consumer device.
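A toy version of "one request per random consumer device" could look like the scheduler below. It's only a sketch: the function name, the `per_host_limit` parameter, and the round-robin policy are all invented here, and it ignores robots.txt, retries, and verifying what the devices return.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def assign_crawl_jobs(urls, devices, per_host_limit=1):
    """Spread URLs over devices so no device hits one host more than per_host_limit times.

    Returns (assignments, unassigned): a dict of url -> device, plus the URLs
    that could not be placed because every device had used up that host.
    """
    used = defaultdict(int)  # (device, host) -> requests already assigned
    assignments = {}
    unassigned = []
    cursor = 0  # round-robin position, so the load spreads across devices

    for url in urls:
        host = urlsplit(url).hostname or ""
        for step in range(len(devices)):
            device = devices[(cursor + step) % len(devices)]
            if used[(device, host)] < per_host_limit:
                used[(device, host)] += 1
                assignments[url] = device
                cursor = (cursor + step + 1) % len(devices)
                break
        else:
            # No device had capacity left for this host.
            unassigned.append(url)
    return assignments, unassigned
```

The more devices you have, the fewer URLs end up unassigned, and from the target site's point of view each visitor makes a single request.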
I think the actual issue wouldn't be technical but one of selection: how do you decide what's worth keeping?
You could just do it on a volunteer basis. One volunteer really likes Lizard Facts and volunteers to host that. Or you could dynamically generate the "desired semantic subspace" based on the search traffic...
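As a very crude stand-in for a traffic-driven "desired semantic subspace", you could weight terms by how often they show up in recent queries and keep the documents that cover the most demand. This is a bag-of-words sketch under my own assumptions (the function names, the tokenizer, and the storage `budget` are all hypothetical); a real system would more likely use embeddings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def demand_profile(queries):
    """Map each query term to its share of all query tokens seen."""
    counts = Counter(tok for q in queries for tok in tokenize(q))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def keep_score(doc_text, profile):
    """Sum the demand weight of the distinct terms a document contains."""
    return sum(profile.get(term, 0.0) for term in set(tokenize(doc_text)))


def select_documents(docs, queries, budget):
    """Return the ids of the `budget` documents that best match query demand.

    `docs` maps a document id to its text.
    """
    profile = demand_profile(queries)
    ranked = sorted(docs, key=lambda doc_id: keep_score(docs[doc_id], profile), reverse=True)
    return ranked[:budget]
```

If the traffic is mostly about lizards, the Lizard Facts pages get kept and the rest get evicted first when storage runs short.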
Let me illustrate this with a more poetic example.
In 2015, I was working at a startup incubator hosted inside of an art academy.
I took a nap on the couch. I was the only person in the building, so my full attention was devoted to the strange sounds produced by the computers.
There were dozens of computers there. They were all on. They were all wasting hundreds of watts. They were all doing essentially nothing. Nothing useful.
I could feel the power there. I could feel, suddenly, all the computers in a thousand-mile radius. All sitting there, all wasting time and energy.
Perplexity added a Search API today; I got the following email:
> Dear API user,
> We’re excited to launch the Perplexity Search API — giving developers direct access to the same real-time, high-quality web index that powers Perplexity’s answers.
Not particularly. Indexes are sort of like railroads. They're costly to build and maintain. They have significant external costs. (For railroads, in land use. For indexes, in crawler pressure on hosting costs.)
If you build an index, you should be entitled to a return on your investment. But you should also be required to share that investment with others (at a cost to them, of course).