Shard Manager: A generic shard management framework for geo-distributed apps

gravypod · on Dec 31, 2021

Another body of research on a similar topic: https://research.google/pubs/pub46921/

neonate · on Dec 31, 2021

pdf: https://scontent.fyka2-1.fna.fbcdn.net/v/t39.8562-6/24690577...

tayo42 · on Dec 31, 2021

Some questions come to mind:

How does research at Facebook work? Is there some group that sits around researching, then goes to teams and says "implement this!"? How does that dynamic work? My company actually has that going on, and I find it annoying; we're often not aligned, imo.

Why share this in a research paper format? It makes it feel unapproachable. Personally, I find it hard to follow like this.

I don't get how it handles the replica assignments. If you have 10 servers and 1000 shards (2000 replicas), you would have too much overlap of replica ownership between the 10 servers. Somehow you would want the replica assigner to know how to do this efficiently (in this case, split the replicas between two groups of 5 servers with no overlap). Otherwise, I think every operation requires moving shards.
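To make the 5-and-5 split concrete, here is a minimal sketch. It assumes exactly two replicas per shard and an even number of servers; `assign_replicas` and the server names are hypothetical, and this only illustrates the idea in the question, not how Shard Manager actually places replicas.

```python
from collections import Counter


def assign_replicas(num_shards, servers):
    """Place two replicas per shard, one in each half of the server list.

    Hypothetical helper: the first replica goes to group A and the second
    to group B, so a shard never has both replicas in the same group.
    """
    half = len(servers) // 2
    group_a, group_b = servers[:half], servers[half:]
    placement = {}
    for shard in range(num_shards):
        # Round-robin within group A for the first replica.
        first = group_a[shard % len(group_a)]
        # Advance group B once per full pass over group A, so every
        # (A, B) server pair ends up sharing the same number of shards.
        second = group_b[(shard // len(group_a)) % len(group_b)]
        placement[shard] = (first, second)
    return placement


servers = [f"server-{i}" for i in range(10)]
placement = assign_replicas(1000, servers)

# Count how many replicas each server holds.
load = Counter(server for pair in placement.values() for server in pair)
```

With 1000 shards and 10 servers, each server ends up holding 200 of the 2000 replicas, and each shard's two replicas always sit in different groups. Because every server in one group shares shards evenly with every server in the other, losing a single server spreads its shards across all five servers of the opposite group rather than onto one peer.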

Low effort, but this feels a little like https://xkcd.com/927/

mathgladiator · on Jan 1, 2022

Here is how some of it worked. Some reasonably smart people build a system that works well enough to make progress, but then it has problems (like, world-ending problems requiring constant care). These problems end up requiring really smart people, under a decent manager, to go forth and bring the system under control.

This was what I experienced having re-architected the real-time system not once, but twice. That's right, I redesigned the system twice for a variety of reasons.

The first time was to handle even more scale and reliability, and I talked about it here: https://www.usenix.org/conference/srecon17americas/program/p...

That approach worked well, but wow, did it become unwieldy as the new features, massive scale, and much better reliability resulted in more use cases. We stuffed so much into it that it became a problem.

That problem required solving, and that was the basis for BladeRunner: https://dl.acm.org/doi/10.1145/3477132.3483572

So, to answer your question, the research generally manifests from a need. Sometimes it is accidental, but it is often driven by some kind of essential need. In FB's case, it is improving reliability, massive scale, and driving engineering pain down.

Edit: to add onto the XKCD, there is a tower of babel effect for many things. An area that interests me is protocol design for streams since we are very much in the dark ages.

tayo42 · on Jan 1, 2022

Thanks for taking the time to answer!

zdyn5 · on Dec 31, 2021

Re: #2, it’s published/presented at a research conference, so it’s inherently in research paper format. Sharing it more publicly via their website is probably a lower priority / afterthought. Only a select few higher-impact research projects at these companies get enough resources to have a dedicated page describing the work in a more accessible fashion, with pedagogical illustrations, etc.

tayo42 · on Dec 31, 2021

Makes sense, thanks!