So it's all unimportant data? I mean, sure, you can squeeze insights out of it, but if a third of it disappeared overnight it wouldn't be a big deal.
And even then anything short of obsessive mouse tracking won't be that much data.
This isn't doing much to prove that the stuff in the article matters. Maybe it does but it's not self-evident and the criticism upthread makes sense.
(Please note that I'm not naively saying the job is easy. I'm mostly wondering whether doing the hard job with all these different big-data engines, as opposed to a much simpler one, affects revenue and satisfaction by more than a tiny sliver.)
Search interaction data is some of the most valuable data in a marketplace. I never worked at Airbnb, but I have worked at companies smaller than Airbnb where improving ranking had a multi-million-dollar-per-year impact on revenue.
> And even then anything short of obsessive mouse tracking won't be that much data.
Consider tracking clicks and query / results. That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
> That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
If we estimate a search input as 50 bytes and the results as 25 4-byte ID numbers, then multiply by 100 billion searches, that's 15TB: one large hard drive or a couple of SSDs.
And a single hard drive can hold a trillion clicks.
So even at 2-3 orders of magnitude over a billion, we're not looking at huge amounts of data.
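To make that concrete, here's the back-of-envelope version in Python (the field sizes and the 100-billion-search figure are my own guesses, not anything Airbnb has published):

    # Rough record sizes behind the 15TB figure; every number here is a guess.
    QUERY_BYTES = 50                    # destination, dates, guest count, filters
    RESULTS_PER_SEARCH = 25             # one page of results
    BYTES_PER_LISTING_ID = 4            # a 32-bit listing ID
    bytes_per_search = QUERY_BYTES + RESULTS_PER_SEARCH * BYTES_PER_LISTING_ID  # 150

    SEARCHES = 100_000_000_000          # ~1 billion stays x ~100 searches each
    print(bytes_per_search * SEARCHES / 1e12)            # 15.0 TB

    BYTES_PER_CLICK = 15
    print(BYTES_PER_CLICK * 1_000_000_000_000 / 1e12)    # a trillion clicks ~ 15 TB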
And it's quite questionable whether you need special systems dedicated to keeping that click data pristine at all times.
Even using your numbers, if you want to keep, say, only 3 months of data, we're talking about 1-2 PB already. Being able to query that data across different segments and aggregate it along different dimensions is already quite far beyond what you can do with off-the-shelf PostgreSQL or SQLite.
And in general, in companies the size of Airbnb, you don't control all the data sources enough to make them super efficient, because that's organizationally impossible. So instead the data will be denormalized, etc.
There is a reason most companies with these problems use systems like BigQuery, Snowflake, and co. If it were possible to do with SQLite, a lot of them would do it.
> Even using your numbers, if you want to keep, say, only 3 months of data, we're talking about 1-2 PB already.
Am I doing the math wrong? "1 billion" was supposed to be lifetime stays, but let's say it's per year. Here's the math using 'my' numbers:
1 billion stays per year * 100 searches per stay * 150 bytes per search = 15TB per year
1 billion stays per year * 1000 page loads per stay * 15 bytes per page load = 15TB per year
How are you getting petabytes? If 3 months is 1-2 hundred million stays, you'd need to store ten million bytes per stay to reach 1-2PB. (And images don't count, they wouldn't be in the database.)
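Spelling that out (the stays-per-quarter and bytes-per-stay figures are the thread's rough assumptions, not real Airbnb numbers):

    # Sanity check on TB vs PB for a 3-month window.
    STAYS_PER_QUARTER = 150_000_000             # "1-2 hundred million stays"
    BYTES_PER_STAY = 100 * 150 + 1_000 * 15     # searches + page loads, per the estimate above

    print(STAYS_PER_QUARTER * BYTES_PER_STAY / 1e12)    # ~4.5 TB per quarter
    print(1e15 / STAYS_PER_QUARTER / 1e6)               # ~6.7 MB per stay needed to hit 1 PB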
You're right about TB vs PB, of course :) But then keep in mind the assumptions were super conservative:
* 1k qps is likely off by at least half an order of magnitude
* 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know whether an event is a click/buy/view/..., you need the doc ID, which will likely be a 16-byte UUID, etc.
* etc.
So you will reach PB scale, if not within a few months, then at least within a year. SQLite or "simple" Postgres really is not gonna cut it.
I work in search for a C2C marketplace that is smaller than Airbnb, and without going into details, we reach those orders of magnitude in BigQuery.
Okay, if you're going to try to inflate your estimate by 200x so you can get back to the petabyte range, then I'll do a more detailed comparison.
> * 1k qps is likely off by at least half an order of magnitude
Wasn't your math based on a thousand queries per second? I don't think that's unreasonably small.
And my math, in the "let's say it's 1 billion stays per year" version, assumes three thousand queries per second.
And then you're assuming 100 clicks per query, a hundred thousand page loads per second. And I'm assuming 10 clicks per query, thirty thousand page loads per second. Are those numbers way too small?
> * 1 byte per event is obviously off by several orders of magnitude; let's say just 1.5 orders of magnitude. You need to know whether an event is a click/buy/view/..., you need the doc ID, which will likely be a 16-byte UUID, etc.
Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
30 bytes per event * 30,000 events per second * 1 year = only about 2 hard drives' worth of click data. And historically their stay counts were significantly lower than they are today.
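Here's a sketch of what such a record could look like; the layout, field widths, and event codes are all made up for illustration, not anything Airbnb actually uses:

    import struct, time, uuid

    # Hypothetical packed click event: uint32 timestamp, 8-byte session hash,
    # uint16 page template ID, uint64 property ID, uint16 page number, uint8 event type.
    EVENT = struct.Struct("<I8sHQHB")           # 4+8+2+8+2+1 = 25 bytes
    record = EVENT.pack(
        int(time.time()),                       # timestamp, seconds
        uuid.uuid4().bytes[:8],                 # truncated session ID
        17,                                     # page template ID
        123_456_789,                            # property / listing ID
        2,                                      # page number within results
        1,                                      # event type: 1 = click
    )
    assert len(record) == 25                    # comfortably under a 30-byte budget

    BYTES_PER_EVENT = 30                        # round up for a spare field or two
    EVENTS_PER_SECOND = 30_000
    SECONDS_PER_YEAR = 365 * 24 * 3600
    print(BYTES_PER_EVENT * EVENTS_PER_SECOND * SECONDS_PER_YEAR / 1e12)   # ~28 TB/year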
> I work in search for a C2C marketplace that is smaller than Airbnb, and without going into details, we reach those orders of magnitude in BigQuery.
Well, there are a lot of factors here. Maybe your search results are a lot more complicated than "25-50 properties at a time". Maybe you're tracking more data than just clicks. Maybe you have very highly used pages that need hundreds of bytes of data to render. Maybe you're using UUIDs when much smaller IDs would work. Maybe you're storing large URLs when you don't need to. Maybe you're storing a copy of the browser headers a trillion times in your database.
Add a bunch of those together and I could see a company storing massively more data per page. But I'm not convinced Airbnb needs that to track queries and clicks specifically. Or that they really need sub-click-resolution data.
I agree certain orders of magnitude are harder to guess accurately. My claim that 1k qps is conservative is mostly based on:
* my current company, but I can't share more precise numbers. So I guess not very convincing :)
* however, at ~300 million bookings in 2021 for Airbnb, that's roughly 10 bookings per second. 1k qps implies only ~100 queries per booking, which would be an extremely good ratio: every time you change a parameter (price range, map range, type of place), that's a new query. (Rough arithmetic below.)
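Spelling out that second bullet (the 300 million figure is from the comment above; the rest is just division):

    BOOKINGS_2021 = 300_000_000
    SECONDS_PER_YEAR = 365 * 24 * 3600

    bookings_per_second = BOOKINGS_2021 / SECONDS_PER_YEAR
    print(bookings_per_second)          # ~9.5 per second

    QPS = 1_000
    print(QPS / bookings_per_second)    # ~105 searches per booking at 1k qps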
> Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
I agree that once you start thinking about optimization, such as interning UUIDs, etc., you could maybe get down to double-digit or even single-digit TB of data per month, e.g. by using Parquet or another columnar format, using compression, etc.
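As a toy illustration of the columnar-format point (assuming pyarrow is available; the column names and values here are invented):

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Synthetic click/view events written as Parquet with dictionary encoding + zstd.
    n = 100_000
    table = pa.table({
        "ts": list(range(1_700_000_000, 1_700_000_000 + n)),
        "event_type": [("click", "view", "buy")[i % 3] for i in range(n)],
        "listing_id": [(i * 37) % 7_000_000 for i in range(n)],
        "page": [i % 15 + 1 for i in range(n)],
    })
    pq.write_table(table, "events.parquet",
                   compression="zstd",       # cheap general-purpose compression
                   use_dictionary=True)      # low-cardinality columns intern to small ints
    print(os.path.getsize("events.parquet") / n)    # average bytes per event on disk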
But keep in mind that the teams working on event generation (mobile, web, etc.) and the teams consuming those events work completely separately. And those events have multiple uses, not just search, so they tend to be "denormalized" when stored at the source. A bit old but still relevant reference from Twitter that explains how many companies do that part: https://arxiv.org/pdf/1208.4171.pdf.
A search would contain a lot more than that. Structured events about search result pages will often contain a denormalized version of the results and how they were placed on the screen, experiment IDs, information about the user's session, and standard tracking data for anti-abuse.
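For a sense of scale, a denormalized search-impression event might look roughly like this (purely hypothetical field names and values; the point is just that the payload is kilobytes, not tens of bytes):

    import json
    import uuid

    # Hypothetical denormalized search-results event: results + placement,
    # experiment assignments, session context, anti-abuse data.
    event = {
        "event_type": "search_results_impression",
        "event_id": str(uuid.uuid4()),
        "timestamp_ms": 1700000000000,
        "session": {
            "session_id": str(uuid.uuid4()),
            "user_id_hashed": "9f2c...",
            "locale": "en-US",
            "app_version": "23.41.1",
            "device": {"os": "iOS", "model": "iPhone14,2", "screen": "390x844"},
        },
        "experiments": {"ranking_model": "treatment_b", "map_ui": "control"},
        "query": {"location": "Lisbon", "checkin": "2024-06-01",
                  "checkout": "2024-06-05", "guests": 2,
                  "filters": {"price_max": 180, "room_type": "entire_home"}},
        "results": [
            {"listing_id": str(uuid.uuid4()), "position": i, "page": 1,
             "price_shown": 120 + i, "thumbnail_variant": "v3"}
            for i in range(25)
        ],
        "anti_abuse": {"ip_hash": "ab12...", "user_agent": "Mozilla/5.0 (...)"},
    }
    print(len(json.dumps(event)))   # a few KB per search, vs ~150 bytes in the lean estimate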
You might use these data to make statements in legal documents, financial filings, etc. and therefore you’d want a good story about why those data are trustworthy.