Okay, if you're going to try to inflate your estimate by 200x so you can get back to the petabyte range, then I'll do a more detailed comparison.
> * 1k qps is likely off by at least half an order of magnitude
Wasn't your math based on a thousand queries per second? I don't think that's unreasonably small.
And my math, in the "let's say it's 1 billion stays per year" version, assumes three thousand queries per second.
And then you're assuming 100 clicks per query, so a hundred thousand page loads per second. And I'm assuming 10 clicks per query, so thirty thousand page loads per second. Are those numbers way too small?
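For reference, a quick sketch of that arithmetic (the 1 billion stays/year, ~100 queries per stay, and 10 clicks per query figures are the assumptions from this thread, not real Airbnb numbers):

```python
# Back-of-envelope: stays/year -> queries/s -> page loads/s.
# All inputs are the assumptions from this thread, not actual Airbnb figures.
SECONDS_PER_YEAR = 365 * 24 * 3600            # ~31.5 million

stays_per_year = 1_000_000_000                # "let's say it's 1 billion stays per year"
queries_per_stay = 100                        # ~100 searches before each booking
clicks_per_query = 10                         # my assumption (the parent assumed 100)

stays_per_second = stays_per_year / SECONDS_PER_YEAR           # ~32
queries_per_second = stays_per_second * queries_per_stay       # ~3,200
page_loads_per_second = queries_per_second * clicks_per_query  # ~32,000

print(f"~{stays_per_second:.0f} stays/s, ~{queries_per_second:.0f} qps, "
      f"~{page_loads_per_second:.0f} page loads/s")
```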
> * 1 byte per event is obviously off by several orders of magnitude, let's say just 1.5 orders of magnitude. You need to know if an event is click/buy/view/..., you need the doc id that will likely be a 16-byte UUID, etc.
Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
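Just to show those fields genuinely fit in that budget, here's one hypothetical fixed-width layout (field names and widths are my own illustration, not anyone's actual schema):

```python
import struct
import time
import uuid

# Hypothetical packed click-event layout (illustrative only):
#   timestamp ms (8) + session id (8) + page template id (2) +
#   property/comment/whatever id (4) + page number (1) + event type (1) = 24 bytes
EVENT_FORMAT = "<QQHIBB"

def pack_event(ts_ms, session_id, template_id, entity_id, page_no, event_type):
    return struct.pack(EVENT_FORMAT, ts_ms, session_id, template_id,
                       entity_id, page_no, event_type)

record = pack_event(
    ts_ms=int(time.time() * 1000),
    session_id=uuid.uuid4().int >> 64,  # 8 bytes of session id is plenty
    template_id=42,                     # e.g. "search results page"
    entity_id=123_456_789,              # internal listing id; 4 bytes covers ~4 billion
    page_no=3,
    event_type=1,                       # 0 = view, 1 = click, 2 = buy, ...
)
print(len(record), "bytes per event")   # 24
```

Even with a few spare bytes for padding or an extra field, that stays at or under 30.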
30 bytes per event * 30,000 events per second * 1 year ≈ 28 TB, only about 2 hard drives' worth of click data. And historically their stays were significantly fewer than they are today.
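Spelled out (the ~16 TB drive size is just my reference point for a commodity drive):

```python
# Yearly click-log volume under the assumptions above.
SECONDS_PER_YEAR = 365 * 24 * 3600

bytes_per_event = 30
events_per_second = 30_000
drive_size_tb = 16                      # a commodity ~16 TB drive (my assumption)

bytes_per_year = bytes_per_event * events_per_second * SECONDS_PER_YEAR
tb_per_year = bytes_per_year / 1e12     # ~28 TB

print(f"~{tb_per_year:.0f} TB/year, ~{tb_per_year / drive_size_tb:.1f} drives")
```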
> I work in search for a C2C marketplace that is smaller than Airbnb, and without going into details, we reach those orders of magnitude in BigQuery.
Well, there are a lot of factors here. Maybe your search results are a lot more complicated than "25-50 properties at a time". Maybe you're tracking more data than just clicks. Maybe you have very heavily used pages that need hundreds of bytes of data to render. Maybe you're using UUIDs when much smaller IDs would work. Maybe you're storing long URLs when you don't need to. Maybe you're storing a copy of browser headers a trillion times in your database.
Add a bunch of those together and I could see a company storing massively more data per page. But I'm not convinced Airbnb needs it to track queries and clicks specifically, or that they really need sub-click-resolution data.
I agree certain orders of magnitude are harder to guess accurately. My claim for 1k qps being conservative is mostly based on:
* my current company, but I can't share more precise numbers. So I guess not very convincing :)
* however, at ~300 million stays booked in 2021 for Airbnb, that's ~10 bookings per second. 1k qps implies ~100 queries per booking, which would be an extremely good conversion ratio (the arithmetic is sketched below). Every time you change a parameter (price range, map range, type of place), that's a new query.
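Roughly (using the ~300 million figure above and the 1k qps estimate):

```python
# Rough ratio check: at 1k qps, how many queries per booking?
SECONDS_PER_YEAR = 365 * 24 * 3600

bookings_per_year = 300_000_000            # the ~300M figure cited above
queries_per_second = 1_000                 # the 1k qps estimate being debated

bookings_per_second = bookings_per_year / SECONDS_PER_YEAR       # ~10
queries_per_booking = queries_per_second / bookings_per_second   # ~100

print(f"~{bookings_per_second:.0f} bookings/s, "
      f"~{queries_per_booking:.0f} queries per booking")
```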
> Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit too much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, a click/buy/view byte if that isn't implied by the page template... no need for that to be more than 30 bytes total.
I agree that once you start thinking about optimization, such as interning UUIDs, you could maybe get down to double-digit or even single-digit TB of data per month, e.g. by using Parquet or another columnar format, compression, etc.
But keep in mind that the teams working on event generation (mobile, web, etc.) and the teams consuming those events work completely separately. And those events have multiple uses, not just search, so they tend to be "denormalized" when stored at the source. A bit old but still relevant reference from Twitter that explains how many companies do that part: https://arxiv.org/pdf/1208.4171.pdf.
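To make that gap concrete, here's a toy comparison between a denormalized JSON event of the kind a shared tracking pipeline tends to emit and the ~30-byte packed record discussed above (all field names and values are invented for illustration):

```python
import json

# A made-up denormalized click event: string UUIDs, full URL, user agent, etc.
# Every field here is illustrative, not a real schema.
denormalized = {
    "event_id": "1c1a9a2e-7a7b-4f9a-9a64-2f9d1d0f6d3a",
    "event_type": "click",
    "timestamp": "2021-06-01T12:34:56.789Z",
    "session_id": "8e1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9",
    "user_id": "f0e1d2c3-b4a5-9687-7869-5a4b3c2d1e0f",
    "page_url": "https://www.example.com/s/Paris/homes?checkin=2021-07-01&checkout=2021-07-08&adults=2",
    "listing_id": "9a8b7c6d-5e4f-3a2b-1c0d-e9f8a7b6c5d4",
    "page_number": 3,
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
}

denorm_bytes = len(json.dumps(denormalized).encode("utf-8"))
compact_bytes = 30   # the packed-record estimate from earlier in the thread

print(f"denormalized: ~{denorm_bytes} bytes, compact: ~{compact_bytes} bytes, "
      f"ratio ~{denorm_bytes / compact_bytes:.0f}x")
```

That's roughly the 1-1.5 orders of magnitude being argued about, before any columnar storage or compression claws some of it back.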