Do they pull their own index like brave or are they using Bing/Google in the bac...

tripplyons · 2025-09-25T20:15:52 1758831352

Based on the fact that there are very few up-to-date English-language search indexes (Google, Bing, and Brave if you count it), it must be incredibly costly. I doubt they are maintaining their own.

throwaway12345t · 2025-09-25T20:31:46 1758832306

We need more indexes

tripplyons · 2025-09-25T21:00:16 1758834016

More competition in the space would be great for me as a consumer, but the problem is that the high fixed costs make starting an index difficult.

andai · 2025-09-25T23:28:03 1758842883

I've been wondering: can't this be done p2p? Didn't we solve most of the technical problems in the late 90s / early 2000s, and then just abandon that entire way of thinking for some reason?

If many thousands of people care about having a free / private / distributed search engine, wouldn't it make sense for them to donate 1% of their CPU/storage/network to an indexer / db that they then all benefit from?

hombre_fatal · 2025-09-26T13:20:46 1758892846

Well, flesh it out more and it doesn't sound solved at all.

How do you make it trustless? How do you fetch/crawl the index when it's scattered across arbitrary devices? How do you index the decentralized index? What is actually stored on nodes? When you want to do something useful with the crawled info, what does that look like?

andai · 2025-09-26T22:28:19 1758925699

I think you could do it hierarchically, and with redundancy.

You'd figure out a replication strategy based on observed reliability (Lindy effect + uptime %).

It would be less "5 million flaky randoms" and more "5,000 very reliable volunteers".
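A minimal sketch of what that replication strategy could look like. Everything here is an assumption made for illustration: the `Node` fields, the log-of-age weighting as a stand-in for the Lindy effect, and the idea that node failures are independent (they often aren't in practice).

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    age_days: float  # how long the volunteer has been participating
    uptime: float    # observed fraction of time online, between 0 and 1


def reliability(node: Node) -> float:
    # Hypothetical score: uptime scaled by log of age, so long-lived
    # nodes rank higher (a rough Lindy-style weighting).
    return node.uptime * math.log1p(node.age_days)


def replicas_needed(uptime: float, target: float = 0.99) -> int:
    # Smallest n with 1 - (1 - uptime)**n >= target,
    # assuming every replica has the same uptime and fails independently.
    if not 0.0 < uptime < 1.0:
        raise ValueError("uptime must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - uptime))


def pick_hosts(nodes: list[Node], target: float = 0.995) -> list[Node]:
    # Greedily take the most reliable nodes until the chance that at
    # least one replica is online reaches the target.
    chosen: list[Node] = []
    p_all_down = 1.0
    for node in sorted(nodes, key=reliability, reverse=True):
        chosen.append(node)
        p_all_down *= 1 - node.uptime
        if 1 - p_all_down >= target:
            break
    return chosen
```

With hosts at 50% uptime you'd need 7 copies of a shard to hit 99% availability, which is why a small pool of reliable volunteers beats a large pool of flaky ones for storage.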

Though for the crawling layer you can and should absolutely utilize 5 million flaky randoms. That's actually the holy grail of crawling. One request per random consumer device.
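Roughly, "one request per device" could be sketched like this. The function name and the plain lists of URLs and device IDs are hypothetical; a real system would also need to verify what untrusted devices send back.

```python
import random


def assign_crawl_jobs(urls, devices, seed=None):
    # Give each URL to a distinct, randomly chosen device, so that no
    # device makes more than one request in this round.
    rng = random.Random(seed)
    n = min(len(urls), len(devices))
    picked = rng.sample(devices, n)
    assignments = dict(zip(picked, urls[:n]))
    leftover = urls[n:]  # URLs that wait for the next round
    return assignments, leftover
```

With 5 million devices in the pool, each round can fetch up to 5 million pages while every target site sees each device at most once.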

I think the actual issue wouldn't be technical but one of selection: how do you decide what's worth keeping?

You could just do it on a volunteer basis. One volunteer really likes Lizard Facts and volunteers to host that. Or you could dynamically generate the "desired semantic subspace" based on the search traffic...

andai · 2025-09-26T23:49:35 1758930575

Let me illustrate this with a more poetic example.

In 2015, I was working at a startup incubator hosted inside of an art academy.

I took a nap on the couch. I was the only person in the building, so my full attention was devoted to the strange sounds produced by the computers.

There were dozens of computers there. They were all on. They were all wasting hundreds of watts. They were all doing essentially nothing. Nothing useful.

I could feel the power there. I could feel, suddenly, all the computers in a thousand mile radius. All sitting there, all wasting time and energy.

ineedasername · 2025-09-25T20:58:52 1758833932

Do we know what OpenAI uses? Have they built their own, or do they piggyback on moneybags $MS and Bing?

tripplyons · 2025-09-25T21:01:03 1758834063

They use Bing: https://www.forbes.com/sites/katherinehamilton/2023/05/23/ch...

pzo · 2025-09-25T21:56:36 1758837396

Perplexity added a Search API today. I got the following email:

> Dear API user, We’re excited to launch the Perplexity Search API — giving developers direct access to the same real-time, high-quality web index that powers Perplexity’s answers.

tripplyons · 2025-09-28T14:12:02 1759068722

This doesn't mean they run their own index. They are likely just reselling access to whatever index they are using for their product.

JumpCrisscross · 2025-09-25T20:42:07 1758832927

> We need more indexes

Not particularly. Indexes are sort of like railroads. They're costly to build and maintain. They have significant external costs. (For railroads, in land use. For indexes, in crawler pressure on hosting costs.)

If you build an index, you should be entitled to a return on your investment. But you should also be required to share that investment with others (at a cost to them, of course).