> Hive SQL, Spark SQL, Scala Spark, PySpark and Presto are widely used as different execution engines
This makes me think they're doing something very, very wrong. Airbnb does not have data on the scale that would require these tools. They have 5.6 million listings, 150 million users, and 1 billion total person-stays. These numbers can easily be processed with Postgres or SQLite on single machines. Spark and Hive are for companies like Google and Facebook.
https://www.thezebra.com/resources/home/airbnb-statistics/#i...
Have you ever worked in data engineering? They're using these systems for event data, data generated through transformations (multiplicative effect on base size), data used for ML, etc.
These events aren't just being generated per stay. A company like Airbnb will have events about logins, searches, site interactions, etc. You'll also be transforming the raw data and storing it again as higher level, materialized tables.
Disclaimer: Worked at Airbnb (not on a data engineering or data infra team)
So all unimportant data? I mean sure you can squeeze insights out of that but if a third of it disappeared overnight it wouldn't be a big deal.
And even then anything short of obsessive mouse tracking won't be that much data.
This isn't doing much to prove that the stuff in the article matters. Maybe it does but it's not self-evident and the criticism upthread makes sense.
(Please note that I am not ignorantly saying the job is easy. I'm mostly wondering if it affects revenue and satisfaction by more than a tiny sliver to do the hard job with all these different big data engines as opposed to doing a much simpler job.)
Search interaction data is some of the most valuable data in a marketplace. I never worked at Airbnb, but I have worked at companies smaller than Airbnb where improving ranking had an impact of many millions of dollars per year on revenue.
> And even then anything short of obsessive mouse tracking won't be that much data.
Consider tracking clicks and queries/results. That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
> That's already 2 orders of magnitude more data than suggested by the OP, even under very conservative assumptions.
If we estimate a search input as 50 bytes and the results as 25 4-byte ID numbers, that's 150 bytes per search; multiply by 100 billion searches and that's 15TB, one hard drive or a couple SSDs.
And a hard drive full of clicks can fit a trillion.
So even at 2-3 orders of magnitude over a billion, we're not looking at huge sums of data.
And it's quite questionable whether you need special systems dedicated to keeping that click data pristine at all times.
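For concreteness, here is that back-of-envelope as a tiny Python script. The 50-byte query, the 25 4-byte result IDs, and the 15-byte click record are this thread's guesses, not measured Airbnb numbers:

```python
# Rough size of one logged search: query text plus result IDs.
QUERY_BYTES = 50                  # guessed size of a search input
RESULT_BYTES = 25 * 4             # 25 results at 4 bytes per ID
SEARCH_BYTES = QUERY_BYTES + RESULT_BYTES   # 150 bytes per search

SEARCHES = 100e9                  # ~100 billion searches (1B stays * 100 searches/stay)
print(f"search log: {SEARCH_BYTES * SEARCHES / 1e12:.0f} TB")  # -> 15 TB

CLICK_BYTES = 15                  # the 'squeezed' per-click estimate
print(f"clicks per 15 TB drive: {15e12 / CLICK_BYTES:.1e}")    # -> 1.0e+12, a trillion
```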
Even using your numbers, if you want to keep say only 3 months of data, we're talking about 1-2 PB already. Being able to query those data across different segments and aggregate into different dimensions is already quite far beyond what you can do w/ off-the-shelf PostgreSQL or SQLite.
And in general, in companies the size of Airbnb, you don't control all the data sources to be super efficient, because that's organizationally impossible. So instead the data will be denormalized, etc.
There is a reason most companies with those problems use systems like BigQuery, Snowflake, and co. If it were possible to do with SQLite, a lot of them would do it.
> Even using your numbers, if you want to keep say only 3 months of data, we're talking about 1-2 PB already.
Am I doing the math wrong? "1 billion" was supposed to be lifetime stays, but let's say it's per year. Here's the math using 'my' numbers:
1 billion stays per year * 100 searches per stay * 150 bytes per search = 15TB per year
1 billion stays per year * 1000 page loads per stay * 15 bytes per page load = 15TB per year
How are you getting petabytes? If 3 months is 1-2 hundred million stays, you'd need to store ten million bytes per stay to reach 1-2PB. (And images don't count, they wouldn't be in the database.)
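The same math as a script, so the units are easy to check (all the per-stay counts and byte sizes are this thread's assumptions, not Airbnb figures):

```python
stays_per_year = 1e9

# 100 searches per stay at 150 bytes per search:
search_tb = stays_per_year * 100 * 150 / 1e12
# 1000 page loads per stay at 15 bytes per page load:
pageload_tb = stays_per_year * 1000 * 15 / 1e12
print(search_tb, pageload_tb)   # 15.0 15.0 -- TB per year, not PB

# To hit 1-2 PB in a quarter (~1-2e8 stays) you'd need ~1e7 bytes per stay:
print(f"{1e15 / 1e8:.0e} bytes/stay")   # 1e+07
```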
you're right about TB vs PB ofc :) But then keep in mind the assumptions were super conservative:
* 1k qps is likely off by at least half an order of magnitude
* 1 byte per event is obviously off by several orders of magnitude, let's say just 1.5 orders of magnitude. You need to know if an event is click/buy/view/..., you need the doc id that will likely be a 16-byte UUID, etc.
* etc.
So you will reach PB scale if not within a few months, then at least within a year. SQLite or "simple" Postgres really is not gonna cut it.
I work in search for a C2C marketplace that is smaller than Airbnb, and w/o going into details, we reach those orders of magnitude in BigQuery.
Okay, if you're going to try to inflate your estimate by 200x so you can get back to petabyte range then I'll do a more detailed comparison.
> * 1k qps is likely off by at least half an order of magnitude
Wasn't your math based on a thousand queries per second? I don't think that's unreasonably small.
And my math, in the "let's say it's 1 billion stays per year" version, assumes three thousand queries per second.
And then you're assuming 100 clicks per query, a hundred thousand page loads per second. And I'm assuming 10 clicks per query, thirty thousand page loads per second. Are those numbers way too small?
> * 1 byte per event is obviously off by several orders of magnitude, let's say just 1.5 orders of magnitude. You need to know if an event is click/buy/view/..., you need the doc id that will likely be a 16-byte UUID, etc.
Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, click/buy/view byte if that isn't implied by page template... no need for that to be more than 30 bytes total.
30 bytes per event * 30000 events per second * 1 year = only 2 hard drives' worth of click data. And historically their stays were significantly fewer than they are today.
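As an illustration of the 30-byte claim, here is one hypothetical fixed-width layout (the field choices and widths are invented for this sketch) plus the volume math:

```python
import struct

# Hypothetical click event: timestamp (4) + session ID (8) + page template (2)
# + item ID (8) + page number (2) + event type (1) = 25 bytes, under the 30-byte budget.
CLICK = struct.Struct("<IQHQHB")
packed = CLICK.pack(1_700_000_000, 0xDEAD_BEEF_CAFE, 7, 5_612_345, 2, 1)
print(CLICK.size)   # 25

SECONDS_PER_YEAR = 365 * 24 * 3600
volume_tb = 30 * 30_000 * SECONDS_PER_YEAR / 1e12
print(f"{volume_tb:.0f} TB/year")   # ~28 TB: roughly two large hard drives
```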
> I work in search for a C2C marketplace that is smaller than Airbnb, and w/o going into details, we reach those orders of magnitude in BigQuery.
Well there's a lot of factors here. Maybe your search results are a lot more complicated than "25-50 properties at a time". Maybe you're tracking more data than just clicks. Maybe you have very highly used pages that need hundreds of bytes of data to render. Maybe you're using UUIDs when much smaller IDs would work. Maybe you're storing large URLs when you don't need to. Maybe you're storing a copy of browser headers a trillion times in your database.
Add a bunch of those together and I could see a company storing massively more data per page. But I'm not convinced Airbnb needs it to track queries and clicks specifically. Or that they really need sub-click-resolution data.
I agree certain orders of magnitude are harder to guess accurately. My claim for 1k qps being conservative is mostly based on
* my current company, but I can't share more precise numbers. So I guess not very convincing :)
* however, at ~300 million listings in 2021 for Airbnb, that's ~10 listings per second. 1k qps implies ~100 queries per listing, which would be an extremely good ratio (quick arithmetic below). Every time you change a parameter (price range, map range, type of place), that's a new query.
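The arithmetic behind that bullet, spelled out (same assumed inputs as above):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600      # ~3.15e7

listings_per_second = 300e6 / SECONDS_PER_YEAR
print(f"{listings_per_second:.0f} listings/s")            # ~10

queries_per_listing = 1_000 / listings_per_second
print(f"{queries_per_listing:.0f} queries per listing")   # ~105, i.e. roughly 100
```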
> Sure, I think 30 bytes is reasonable for a click. When I said 15 I was squeezing a bit much. But timestamp, session ID, page template ID, property/comment/whatever ID, page number, click/buy/view byte if that isn't implied by page template... no need for that to be more than 30 bytes total.
I agree that once you start thinking about optimization, such as interning UUIDs, etc., you could maybe get down to double-digit or even single-digit TB of data per month, e.g. by using Parquet or another columnar format, using compression, etc. (sketched below).
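A minimal sketch of that point, assuming pyarrow is available; the table layout is invented. Dictionary encoding effectively interns the repeated UUID strings, and zstd compresses the rest:

```python
import uuid
import pyarrow as pa
import pyarrow.parquet as pq

# Invented events table: many rows, but only a few distinct doc IDs and event types.
doc_ids = [str(uuid.uuid4()) for _ in range(100)]
n = 1_000_000
table = pa.table({
    "ts": pa.array(range(n), type=pa.int64()),
    "doc_id": pa.array([doc_ids[i % 100] for i in range(n)]),
    "event": pa.array([("click", "view", "buy")[i % 3] for i in range(n)]),
})

# Columnar layout + dictionary encoding + compression shrinks this dramatically
# compared to storing one denormalized blob per event.
pq.write_table(table, "events.parquet", compression="zstd", use_dictionary=True)
```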
But keep in mind the teams working on the event generation (mobile, web, etc.) and the teams consuming those events work completely separately. And those events have multiple uses, not just search, so they tend to be "denormalized" when stored at the source. A bit old but still relevant reference from Twitter that explains how many companies do that part: https://arxiv.org/pdf/1208.4171.pdf
A search would contain a lot more than that. Structured events about search result pages will often contain a denormalized version of the results and how they were placed on the screen, experiment IDs, information about the user's session, and standard tracking data for anti-abuse.
You might use these data to make statements in legal documents, financial filings, etc. and therefore you’d want a good story about why those data are trustworthy.
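To make that concrete, a hypothetical search-results event might look something like this (every field name is invented for illustration, not Airbnb's actual schema), which is why a single impression is far more than a few bytes:

```python
# Hypothetical denormalized search-results event.
event = {
    "event_type": "search_results",
    "ts_ms": 1700000000123,
    "session": {"id": "b1946ac9-...", "locale": "en-US", "user_agent": "Mozilla/5.0 ..."},
    "experiments": ["ranker_v42:treatment", "map_ui:control"],
    "query": {"text": "cabin lake tahoe", "filters": {"max_price": 300, "type": "entire_home"}},
    "results": [
        {"listing_id": "9f86d081-...", "position": 1, "price_shown": 245},
        {"listing_id": "60303ae2-...", "position": 2, "price_shown": 199},
        # ...a denormalized entry for every result on the screen
    ],
    "anti_abuse": {"ip_hash": "...", "fingerprint": "..."},
}
```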
Think about e.g. ranking in their search/recommendation engine. To be able to train ranking ML models, you would need to at least track the views, clicks, purchases, etc. done through their platform. For each search, you want to keep the query string and the result item IDs.
Let's say, very conservatively, they have 1000 qps on average. We're talking about hundreds of millions of events a day.
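The volume that implies, assuming the 1000 qps figure:

```python
qps = 1_000
print(f"{qps * 86_400:,} queries/day")   # 86,400,000
# A few view/click/purchase events per query puts you at
# hundreds of millions of events per day.
```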
And this video from May 2017 mentions 1B daily events (for whatever they define as an event):
https://youtu.be/70luTZU-D3E?t=102
It wouldn't surprise me if they're storing calls between microservices as "events" and they're likely logging a lot of both user data and internal services data, but that's purely a guess.
It seems like more and more companies (looking at AWS and Netflix directly) deploy ~1k microservices.
I work in a team where we manage ~14 microservices (per environment - dev, staging and production - max ~42) and find it complicated to manage and monitor...
It happens with technology sometimes, but all the time with finance. Having worked in hedge funds most of my career (as an engineer, but I see enough of the business side), it's hilarious how clueless but confident people on HN are about anything that touches finance, trading, stocks, crypto, etc. Nothing wrong with being clueless, but the hilarious part is the confidence of the posters about what they are writing, which they probably got from Medium. If you didn't know better, you'd think they know what they're talking about.