So it's all unimportant data? I mean, sure, you can squeeze insights out of it, but if a third of it disappeared overnight it wouldn't be a big deal.
And even then anything short of obsessive mouse tracking won't be that much data.
This isn't doing much to prove that the stuff in the article matters. Maybe it does but it's not self-evident and the criticism upthread makes sense.
(Please note that I'm not naively saying the job is easy. I'm mostly wondering whether doing the hard job with all these different big-data engines, as opposed to a much simpler one, affects revenue and satisfaction by more than a tiny sliver.)
Search interaction data is some of the most valuable data in a marketplace. I never worked at Airbnb, but I have worked at companies smaller than Airbnb where improving ranking had a multi-million-dollar-per-year impact on revenue.
> And even then anything short of obsessive mouse tracking won't be that much data.
Consider tracking clicks and query / results. That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
> That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
If we estimate a search input as 50 bytes and the results as 25 4-byte ID numbers, then multiply by 100 billion searches, that's 15TB: one large hard drive or a couple of SSDs.
And a single hard drive can hold a trillion clicks.
So even at 2-3 orders of magnitude over a billion, we're not looking at huge amounts of data.
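To make that concrete, here's the back-of-envelope version in Python (the field sizes and the 100-billion-search figure are my own guesses, not anything Airbnb has published):

    # Rough record sizes behind the 15TB figure; every number here is a guess.
    QUERY_BYTES = 50                    # destination, dates, guest count, filters
    RESULTS_PER_SEARCH = 25             # one page of results
    BYTES_PER_LISTING_ID = 4            # a 32-bit listing ID
    bytes_per_search = QUERY_BYTES + RESULTS_PER_SEARCH * BYTES_PER_LISTING_ID  # 150

    SEARCHES = 100_000_000_000          # ~1 billion stays x ~100 searches each
    print(bytes_per_search * SEARCHES / 1e12)            # 15.0 TB

    BYTES_PER_CLICK = 15
    print(BYTES_PER_CLICK * 1_000_000_000_000 / 1e12)    # a trillion clicks ~ 15 TB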
And it's quite questionable whether you need special systems dedicated to keeping that click data pristine at all times.
Even using your numbers, if you want to keep, say, only 3 months of data, we're talking about 1-2 PB already. Being able to query that data across different segments and aggregate it along different dimensions is already quite far beyond what you can do with off-the-shelf PostgreSQL or SQLite.
And in general, in companies the size of Airbnb, you don't control all the data sources enough to make them super efficient, because that's organizationally impossible. So instead the data will be denormalized, etc.
There is a reason most companies with these problems use systems like BigQuery, Snowflake, and co. If it were possible to do with SQLite, a lot of them would do it.
> Even using your numbers, if you want to keep, say, only 3 months of data, we're talking about 1-2 PB already.
Am I doing the math wrong? "1 billion" was supposed to be lifetime stays, but let's say it's per year. Here's the math using 'my' numbers:
1 billion stays per year * 100 searches per stay * 150 bytes per search = 15TB per year
1 billion stays per year * 1000 page loads per stay * 15 bytes per page load = 15TB per year
How are you getting petabytes? If 3 months is 1-2 hundred million stays, you'd need to store ten million bytes per stay to reach 1-2PB. (And images don't count, they wouldn't be in the database.)
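Spelling that out (the stays-per-quarter and bytes-per-stay figures are the thread's rough assumptions, not real Airbnb numbers):

    # Sanity check on TB vs PB for a 3-month window.
    STAYS_PER_QUARTER = 150_000_000             # "1-2 hundred million stays"
    BYTES_PER_STAY = 100 * 150 + 1_000 * 15     # searches + page loads, per the estimate above

    print(STAYS_PER_QUARTER * BYTES_PER_STAY / 1e12)    # ~4.5 TB per quarter
    print(1e15 / STAYS_PER_QUARTER / 1e6)               # ~6.7 MB per stay needed to hit 1 PB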
You're right about TB vs PB, of course :) But then keep in mind the assumptions were super conservative:
* 1k qps is likely off by at least half an order of magnitude
* 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know whether an event is a click/buy/view/..., you need the doc ID, which will likely be a 16-byte UUID, etc.
* etc.
So you will reach PB scale, if not within a few months, then at least within a year. SQLite or "simple" Postgres really is not gonna cut it.
I work in search for a C2C marketplace that is smaller than Airbnb, and without going into details, we reach those orders of magnitude in BigQuery.
Okay, if you're going to try to inflate your estimate by 200x so you can get back to the petabyte range, then I'll do a more detailed comparison.
> * 1k qps is likely off by at least half an order of magnitude
Wasn't your math based on a thousand queries per second? I don't think that's unreasonably small.
And my math, in the "let's say it's 1 billion stays per year" version, assumes three thousand queries per second.
And then you're assuming 100 clicks per query, a hundred thousand page loads per second. And I'm assuming 10 clicks per query, thirty thousand page loads per second. Are those numbers way too small?
> * 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know whether an event is a click/buy/view/..., you need the doc ID, which will likely be a 16-byte UUID, etc.
Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
30 bytes per event * 30,000 events per second * 1 year = only about 2 hard drives' worth of click data. And historically their stay counts were significantly lower than they are today.
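Here's a sketch of what such a record could look like; the layout, field widths, and event codes are all made up for illustration, not anything Airbnb actually uses:

    import struct, time, uuid

    # Hypothetical packed click event: uint32 timestamp, 8-byte session hash,
    # uint16 page template ID, uint64 property ID, uint16 page number, uint8 event type.
    EVENT = struct.Struct("<I8sHQHB")           # 4+8+2+8+2+1 = 25 bytes
    record = EVENT.pack(
        int(time.time()),                       # timestamp, seconds
        uuid.uuid4().bytes[:8],                 # truncated session ID
        17,                                     # page template ID
        123_456_789,                            # property / listing ID
        2,                                      # page number within results
        1,                                      # event type: 1 = click
    )
    assert len(record) == 25                    # comfortably under a 30-byte budget

    BYTES_PER_EVENT = 30                        # round up for a spare field or two
    EVENTS_PER_SECOND = 30_000
    SECONDS_PER_YEAR = 365 * 24 * 3600
    print(BYTES_PER_EVENT * EVENTS_PER_SECOND * SECONDS_PER_YEAR / 1e12)   # ~28 TB/year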
> I work in search for a C2C marketplace that is smaller than Airbnb, and without going into details, we reach those orders of magnitude in BigQuery.
Well, there are a lot of factors here. Maybe your search results are a lot more complicated than "25-50 properties at a time". Maybe you're tracking more data than just clicks. Maybe you have very highly used pages that need hundreds of bytes of data to render. Maybe you're using UUIDs when much smaller IDs would work. Maybe you're storing large URLs when you don't need to. Maybe you're storing a copy of the browser headers a trillion times in your database.
Add a bunch of those together and I could see a company storing massively more data per page. But I'm not convinced Airbnb needs that to track queries and clicks specifically. Or that they really need sub-click-resolution data.
I agree certain orders of magnitude are harder to guess accurately. My claim that 1k qps is conservative is mostly based on:
* my current company, but I can't share more precise numbers. So I guess not very convincing :)
* however, at ~300 million bookings in 2021 for Airbnb, that's roughly 10 bookings per second. 1k qps implies only ~100 queries per booking, which would be an extremely good ratio: every time you change a parameter (price range, map range, type of place), that's a new query. (Rough arithmetic below.)
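Spelling out that second bullet (the 300 million figure is from the comment above; the rest is just division):

    BOOKINGS_2021 = 300_000_000
    SECONDS_PER_YEAR = 365 * 24 * 3600

    bookings_per_second = BOOKINGS_2021 / SECONDS_PER_YEAR
    print(bookings_per_second)          # ~9.5 per second

    QPS = 1_000
    print(QPS / bookings_per_second)    # ~105 searches per booking at 1k qps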
> Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
I agree that once you start thinking about optimization, such as interning UUIDs, etc., you could maybe get down to double-digit or even single-digit TB of data per month, e.g. by using Parquet or another columnar format, using compression, etc.
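As a toy illustration of the columnar-format point (assuming pyarrow is available; the column names and values here are invented):

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Synthetic click/view events written as Parquet with dictionary encoding + zstd.
    n = 100_000
    table = pa.table({
        "ts": list(range(1_700_000_000, 1_700_000_000 + n)),
        "event_type": [("click", "view", "buy")[i % 3] for i in range(n)],
        "listing_id": [(i * 37) % 7_000_000 for i in range(n)],
        "page": [i % 15 + 1 for i in range(n)],
    })
    pq.write_table(table, "events.parquet",
                   compression="zstd",       # cheap general-purpose compression
                   use_dictionary=True)      # low-cardinality columns intern to small ints
    print(os.path.getsize("events.parquet") / n)    # average bytes per event on disk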
But keep in mind that the teams working on event generation (mobile, web, etc.) and the teams consuming those events work completely separately. And those events have multiple uses, not just search, so they tend to be "denormalized" when stored at the source. A bit old but still relevant reference from Twitter that explains how many companies do that part: https://arxiv.org/pdf/1208.4171.pdf.
A search would contain a lot more than that. Structured events about search result pages will often contain a denormalized version of the results and how they were placed on the screen, experiment IDs, information about the user's session, and standard tracking data for anti-abuse.
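For a sense of scale, a denormalized search-impression event might look roughly like this (purely hypothetical field names and values; the point is just that the payload is kilobytes, not tens of bytes):

    import json
    import uuid

    # Hypothetical denormalized search-results event: results + placement,
    # experiment assignments, session context, anti-abuse data.
    event = {
        "event_type": "search_results_impression",
        "event_id": str(uuid.uuid4()),
        "timestamp_ms": 1700000000000,
        "session": {
            "session_id": str(uuid.uuid4()),
            "user_id_hashed": "9f2c...",
            "locale": "en-US",
            "app_version": "23.41.1",
            "device": {"os": "iOS", "model": "iPhone14,2", "screen": "390x844"},
        },
        "experiments": {"ranking_model": "treatment_b", "map_ui": "control"},
        "query": {"location": "Lisbon", "checkin": "2024-06-01",
                  "checkout": "2024-06-05", "guests": 2,
                  "filters": {"price_max": 180, "room_type": "entire_home"}},
        "results": [
            {"listing_id": str(uuid.uuid4()), "position": i, "page": 1,
             "price_shown": 120 + i, "thumbnail_variant": "v3"}
            for i in range(25)
        ],
        "anti_abuse": {"ip_hash": "ab12...", "user_agent": "Mozilla/5.0 (...)"},
    }
    print(len(json.dumps(event)))   # a few KB per search, vs ~150 bytes in the lean estimate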
You might use these data to make statements in legal documents, financial filings, etc. and therefore you’d want a good story about why those data are trustworthy.