Your intuition is correct - there are other ways to capture the non-stationary nature of this particular problem. We thought the example age approach was neat because it is a general technique for removing a bias inherent to any machine learning system: since examples always come from the past, you often have to be careful to prevent the system from being overly biased towards historical behavior. You don't need any additional metadata about items (what's the age of a search query?), and it's more resilient when predicting in regions the model has never seen because serving is fixed to the very end of the training window.
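A minimal sketch of what I mean, with hypothetical names and assuming a simple NumPy feature pipeline: during training the feature is the gap between the end of the training window and when the example was logged; at serving time it is pinned to zero (i.e. the end of the window), so the model is never asked to extrapolate into feature values it has never seen.

```python
import numpy as np

def example_age(event_timestamps, training_window_end):
    """Age of each example, in seconds, relative to the end of the training window."""
    return training_window_end - np.asarray(event_timestamps, dtype=np.float64)

# Training: examples logged over, say, the past 30 days.
window_end = 1_700_000_000.0                                   # unix seconds, end of training window
timestamps = window_end - np.array([0.0, 86_400.0, 30 * 86_400.0])
train_age = example_age(timestamps, window_end)                # [0, 1 day, 30 days] in seconds

# Serving: pin the feature to the end of the training window (age = 0)
# instead of letting it keep growing with wall-clock time.
serve_age = np.zeros_like(train_age)
```

The point of fixing the value at serving time is that the model only ever has to interpolate within the range of ages it saw in training, rather than extrapolate past it.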
I tend to think the focus on recent behavior is an artifact of underfitting. Research into richer temporal modeling is needed and recurrent networks seem promising.
We debated internally whether to use the "deep" moniker - AlexNet was 8 layers, so maybe the threshold is 8? The depth seems sort of irrelevant, since stacking layers is trivial once the basic architecture is in place.