> Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at
I like this comment because it gets to the heart of a misunderstanding. I'd further correct it to say "for something it wasn't trying to be good at". DeWitt and Stonebraker just didn't understand why anyone would want this, and I can see why: change was coming faster than it ever had, from many angles. Let's travel back in time to see why:
The decade after MapReduce appeared - when I came of age as a programmer - was a fascinating time of change:
The backdrop is the aftermath of the dotcom bubble: the hype cycle had come to a close, and the web market had somewhat consolidated around a smaller set of winners who were now more proven and ready to go all in on a new business model, one that elevated doing business on the web above all else, in a way that would truly threaten brick and mortar.
Alongside that, CPU manufacturers were struggling with escalating clock speeds and jamming more transistors into a single die to keep up with Moore's Law and consumer demand, which led to the first commodity dual- and multi-core CPUs.
But I remember that most non-scientific software just couldn't make use of multiple CPUs or cores effectively yet. So we were ripe for a programming model that engineers who had never heard of Lamport could actually understand and work with: threads, locks, and socket programming in C and C++ were a rough proposition, and while MPI was certainly a thing, the scientific computing people working on supercomputers, grids, and Beowulf clusters were not the same people as the dotcom engineers using commodity hardware.
Companies pushing these boundaries wanted to do things that traditional DBMSes could not offer at a certain scale, at least not cheaply enough. The RDBMS vendors and priesthood countered that it's hard to offer that while also offering ACID and everything else a database provides, and they weren't wrong: it's hard to support an OLAP use case with the OLTP-style, System-R-ish design that dominated the market in those days. This was some of the most complicated and sophisticated software ever made, imbued with magic-like qualities from decades of academic research hardened by years of industrial use.
Then there were data-warehouse-style solutions: "appliances" locked into a specific and expensive combination of hardware and software, optimized to work well together - and also to extract millions and billions of dollars from the Fortune 500s that could afford them.
So the ethos at the booming post-dotcoms was definitely "do we really need all this crap that's getting in our way?", and we would soon find out. Couching it in formalism and calling it "MapReduce" made it sound fancier than it really was: some glue that made it easy for engineers to declaratively define how to split work into chunks, shuffle them around, and assemble them again across many computers, without having to worry about the pedestrian details in between. A corporate drone didn't have to understand /how/ it worked, just how to fill in the blanks for each step properly: a much more viable proposition than thousands of engineers writing software together with finicky locks and semaphores.
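To make that concrete, here's a minimal single-process sketch in Python of the "fill in the blanks" model. The run_mapreduce harness and the function names are mine, invented for illustration - the real thing ran distributed across thousands of machines with partitioning, retries, and fault tolerance - but the programmer-facing shape is the same: you write a map step and a reduce step, and the glue owns the splitting, shuffling, and reassembly.

    from collections import defaultdict

    def map_fn(document):
        # The blank you fill in: emit (key, value) pairs from one input chunk.
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # The other blank: all values for one key arrive already grouped.
        yield word, sum(counts)

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # The glue, collapsed into one process for illustration:
        # map each chunk, shuffle (group by key), then reduce each group.
        shuffled = defaultdict(list)
        for chunk in inputs:                      # map phase
            for key, value in map_fn(chunk):
                shuffled[key].append(value)       # shuffle phase
        results = []
        for key, values in shuffled.items():      # reduce phase
            results.extend(reduce_fn(key, values))
        return results

    print(run_mapreduce(["to be or not to be"], map_fn, reduce_fn))
    # -> [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

The whole pitch was that the hard distributed-systems problems live inside run_mapreduce, and the corporate drones only ever touch the two functions.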
The DBMS crowd thumbed their noses at this because it was truly SO primitive and wasteful compared to the sophisticated mechanisms built to preserve efficiency, mechanisms dating back to the 70s: indexes, access patterns, query optimizers, optimized storage layouts. What they didn't get was that every million dollars you didn't spend on what was essentially the space shuttle of computer software - fabulously expensive and complicated - could now buy a /lot/ more cheapo computing power duct-taped together. The question was how to leverage that. Plus, with things changing at the pace they did back then, last year's CPU could be obsolete by next year, so how well could the vendors building custom hardware even keep up, after you paid them their hefty fees? The value proposition was "it's so basic that it will run on anything, and it's future-proof" - the democratization aspect was hard for an observer to grasp at that point, because the tidal wave hadn't hit yet.
What came next was the start of a transition: from datacenters to rack mounts in colos and dedicated hosts, then to virtualization, and very soon after to the first programmable commodity clouds. Why settle for an administered Unix-like timesharing environment when you can manage everything yourself and don't have to ask for permission? Why deal with buying and maintaining hardware? This lowered the barrier for smaller companies and startups that previously could afford neither such things nor the markets that required them, and it unleashed what can only be described as a hunger for anything that could leverage the new model.
So it's not so much that worse was better, but that worse was briefly more appropriate for the times. "Do we really need all this crap that's getting in our way?" really took hold for a moment, and programmers were willing to dump anything and everything previously sacred if they thought it'd buy them scalability - schemas and complex queries, to start.
Soon after, people started figuring out how to keep all the benefits they'd gained (democratized massively parallel commodity computing) while bringing back some of the good stuff from the past. Only two years later, Google itself published the BigTable paper, describing a more sophisticated storage mechanism that optimized accesses better; it was admittedly tailored to a different use case, but it could work in conjunction with MapReduce. Academia and the VLDB / CIDR crowd were more interested now.
Some years after that came the papers for F1 and Spanner, which added back a SQL-like query engine, transactions, secondary indexes, etc. on top of a similar distributed model, in the context of WAN-distributed datacenters. Everyone preached the end of NoSQL and document databases, whitepapers were written about "NewSQL", and frustrated veterans complained about yet another fad cycle where what was old was new again.
Of course that's not what happened: the story here was how a software paradigm failed to adapt to the changing hardware climate and business needs, so capitalism ripped its guts apart and slowly reassembled them in a more context-appropriate way. Instead of storage engines we got so many things it's hard to keep up, but LevelDB comes to mind as an ancestor. Instead of locks we got Chubby and ZooKeeper. Instead of log structures we got Kafka and its ilk. Instead of query optimizer engines we got Presto. Instead of in-memory storage we got Arrow. We got a Cambrian explosion of all kinds of variations and combinations of these, but eventually the market started to settle again, and now we're in a new generation of "no, really, our product can do it all". It's the lifecycle of unbundling and rebundling. It will happen again. Very curious what will come next.
It's worth noting that, in addition to looking down on MapReduce, Stonebraker and the academic crowd were equally unimpressed by the other commodity-hardware scale-out practices of the time - including the practical-minded RDBMS sharding and caching strategies used by all the booming massive-scale startups.
In 2011, in an interview with GigaOm, Stonebraker famously called Facebook's use of sharded MySQL and Memcached "a fate worse than death" [1]. He also claimed the company should redesign its entire infrastructure, seemingly without clarifying what specific problem he thought needed to be solved.
The reader comments on that post are also quite interesting in tone.
Edit to add a disclosure: I joined Facebook's MySQL team a couple years after this, and quite enjoyed their database architecture, which certainly colors my opinion of this topic.
Thanks! :-) It's been brewing in my head for a while and I've written out shorter snippets of similar ideas over the course of the past few years.
I've taken some liberties and likely made some mistakes, but it's aping the kind of "history of technology in its social context" that I love to read about.