> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper, DeepSeek talks about keeping redundant copies of some "experts" when deploying with expert parallelism, to account for the uneven load different experts receive. I imagine it only makes a difference at very high load, but I thought it was a pretty interesting technique.
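Roughly, the idea (as I understand it) looks like the sketch below -- a toy placement scheme, not DeepSeek's actual implementation; the expert counts, load numbers, and the `route` helper are all made up for illustration. You profile which experts are hottest, give each of them an extra copy on the least-loaded rank, and have the router spread a hot expert's tokens across its replicas:

```python
import random
from collections import defaultdict

# Toy sketch of redundant expert placement (illustrative only; nothing
# here is taken from the DeepSeek-V3 paper or their code).

NUM_EXPERTS = 8
NUM_RANKS = 4
EXTRA_REPLICAS = 2  # how many of the hottest experts get a second copy

# Token counts per expert from a profiling run (hypothetical numbers).
expert_load = {0: 500, 1: 80, 2: 90, 3: 480, 4: 70, 5: 60, 6: 100, 7: 120}
hot = sorted(expert_load, key=expert_load.get, reverse=True)[:EXTRA_REPLICAS]

# Base placement: one copy of each expert, round-robin across ranks.
placements = defaultdict(list)    # expert id -> ranks holding a copy
rank_weight = defaultdict(float)  # expected tokens per rank
for e in range(NUM_EXPERTS):
    r = e % NUM_RANKS
    placements[e].append(r)
    rank_weight[r] += expert_load[e]

# Replicas: put each hot expert's extra copy on the least-loaded rank that
# doesn't already hold it, then split that expert's traffic between copies.
for e in hot:
    candidates = [r for r in range(NUM_RANKS) if r not in placements[e]]
    target = min(candidates, key=rank_weight.get)
    half = expert_load[e] / 2
    rank_weight[placements[e][0]] -= half
    rank_weight[target] += half
    placements[e].append(target)

def route(expert_id: int) -> int:
    """Pick a rank for a token routed to this expert, spreading its
    traffic uniformly across the expert's replicas."""
    return random.choice(placements[expert_id])

# Simulate dispatch and count tokens landing on each rank.
rank_tokens = defaultdict(int)
for e, count in expert_load.items():
    for _ in range(count):
        rank_tokens[route(e)] += 1
print(dict(sorted(rank_tokens.items())))
```

With a single copy of each expert, the two hot ones pin whichever ranks they live on; with one extra replica apiece, the simulated per-rank token counts come out much flatter.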
It has to do with how correlated conditions are across your 100 places. Conditions on one square inch of your floor are very highly correlated with conditions on another square inch of your floor, so they couldn't experience a 1-in-100 event independently -- if one part of your floor did something unusual, the other parts probably would too. Conditions on the floor of the next room over are also pretty highly correlated, but not quite as much (maybe a fissure could open up and swallow the kitchen, but not your office). So in order for the parent comment to be true, their thousand distinct geographic places would need to be statistically independent of each other.
Of course, in practice it's quite hard to know whether conditions in one location are independent of those in another, or whether there's some degree of correlation or an underlying causal factor. This is why we have climate scientists.
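To put toy numbers on the independence point, here's a quick simulation (all parameters illustrative): each of 1,000 locations is a standard normal draw built from a shared factor plus local noise, so each location individually sees a "1-in-100" event with probability 1% no matter what, but the chance that *some* location sees one depends entirely on the correlation rho between them:

```python
import numpy as np

# Toy model: each location's "conditions" are a standard normal draw built
# from a shared factor plus local noise; rho is the pairwise correlation.
# A location has a 1-in-100 event when its draw exceeds the 99th percentile.
# All parameter values are illustrative.

rng = np.random.default_rng(0)
N_LOCATIONS = 1000
N_TRIALS = 5000
THRESHOLD = 2.326  # ~99th percentile of a standard normal

for rho in (0.0, 0.5, 0.99):
    shared = rng.standard_normal((N_TRIALS, 1))           # common factor
    local = rng.standard_normal((N_TRIALS, N_LOCATIONS))  # per-location noise
    draws = np.sqrt(rho) * shared + np.sqrt(1 - rho) * local
    p_any = (draws > THRESHOLD).any(axis=1).mean()
    print(f"rho={rho:.2f}: P(some location has a 1-in-100 event) ≈ {p_any:.3f}")
```

At rho = 0 the output is essentially 1 (it's 1 - 0.99^1000 ≈ 0.99996); as rho approaches 1 it falls back toward 0.01. A thousand places only buys you a thousand chances if those places are genuinely independent.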