"As the authors succinctly put it, “Deep neural networks easily fit random labels.” Here are three key observations from this first experiment:
- The effective capacity of neural networks is sufficient for memorising the entire data set.
- Even optimisation on random labels remains easy. In fact, training time increases by only a small constant factor compared with training on the true labels.
- Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged."
And the conclusion:
" This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. We argue that we have yet to discover a precise formal measure under which these enormous models are simple. Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons for why optimization is empirically easy must be different from the true cause of generalization. "
This paper was pretty hyped when it came out for seeming to discuss general properties of deep learning, but the details of it are a little disappointing - okay, so sufficiently big/deep networks can overfit to training data, and that's exciting how? It's a curious finding, but not one that's all that hard to believe or all that informative. Or so it seems to me. I don't see how they justify the claim that "we show how these traditional approaches fail to explain why large neural networks generalize well in practice."
I suppose the notion is that memorizing random labels implies memorization should also work on non-random labels (and thereby no generalization to the test set is needed), but it seems intuitive that proper labels and gradients with regularization will find the answer that generalizes, because that is the steepest optimization path available. I have not read it all that deeply, and not in a while, so perhaps their arguments are stronger than they appear to me.
I haven't read the paper, but I think it only really makes sense in context:
a) The traditional view of generalization would argue that neural nets are "too complicated"/have too many parameters/etc. to generalize well, and that for generalization you need more limited models.
b) To reconcile this with the practical results of CNNs, some people tried to argue that while neural nets have a lot of actual parameters, their structure reduces the effective capacity to something much smaller than the raw parameter count implies. The same argument was made for regularization.
c) This paper shows that those arguments are not satisfactory, since the networks really can fit random labels.
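As a toy illustration of (c): an over-parameterised network really will drive training accuracy on random labels to roughly 100%. This is just a small stand-in I put together, nothing like the paper's CIFAR-10/ImageNet experiments - random inputs instead of images, sizes picked arbitrarily:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # toy data: n examples with completely random labels
    n, d, num_classes = 512, 32, 10
    X = torch.randn(n, d)
    y = torch.randint(num_classes, (n,))     # labels carry no signal at all

    # over-parameterised MLP: far more weights than training examples
    model = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(),
                          nn.Linear(2048, num_classes))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(2000):                 # plain full-batch training
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
    print(f"train accuracy on random labels: {train_acc:.3f}")  # tends towards 1.0

Since the labels carry no information, test accuracy can only be at chance; the point is that the optimization itself doesn't get noticeably harder.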
"As the authors succinctly put it, “Deep neural networks easily fit random labels.” Here are three key observations from this first experiment:
-The effective capacity of neural networks is sufficient for memorising the entire data set.
-Even optimisation on random labels remains easy. In fact, training time increases by only a small constant factor compared with training on the true labels.
-Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged."
And conclusion
" This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. We argue that we have yet to discover a precise formal measure under which these enormous models are simple. Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons for why optimization is empirically easy must be different from the true cause of generalization. "
This paper was pretty hyped when it came out for seeming to discuss general properties of deep learning, but the details of it are a little dissapointing - okay so sufficiently big/deep networks can overfit to training data, and that's exciting how?... it's a curious finding, but not one that's all that hard to believe or that is all that informative. Or so it seems to me. I don't see how they justify claiming that they "we show how these traditional approaches fail to explain why large neural networks generalize well in practice."
I suppose the notion is that memorizing random labels implies memorization should also work on non-random labels (and thereby no generalzation to test set is needed), but it seems intuitive that proper labels and gradients with regularization will find the answer that generalizes because that is the steepest optimization path available. I have not read it all that deeply and not in a while, so perhaps their arguments are stronger than it appears to me, though.