[dupe] Understanding deep learning requires re-thinking generalization (acolyer.org)
169 points by mpweiher on May 13, 2017 | 35 comments



Here's the TLDR:

"As the authors succinctly put it, “Deep neural networks easily fit random labels.” Here are three key observations from this first experiment:

-The effective capacity of neural networks is sufficient for memorising the entire data set.

-Even optimisation on random labels remains easy. In fact, training time increases by only a small constant factor compared with training on the true labels.

-Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged."

And the conclusion:

" This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. We argue that we have yet to discover a precise formal measure under which these enormous models are simple. Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons for why optimization is empirically easy must be different from the true cause of generalization. "

This paper was pretty hyped when it came out for seeming to discuss general properties of deep learning, but the details of it are a little disappointing - okay, so sufficiently big/deep networks can overfit to training data, and that's exciting how? It's a curious finding, but not one that's all that hard to believe or all that informative. Or so it seems to me. I don't see how they justify the claim that "we show how these traditional approaches fail to explain why large neural networks generalize well in practice."

I suppose the notion is that memorizing random labels implies memorization should also work on non-random labels (and thereby no generalization to the test set is needed), but it seems intuitive that proper labels and gradients with regularization will find the answer that generalizes, because that is the steepest optimization path available. I have not read it all that deeply, and not in a while, so perhaps their arguments are stronger than they appear to me.
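For concreteness, the randomization test in the paper boils down to something like the sketch below (PyTorch, not the paper's code; the paper uses AlexNet/Inception-scale models on CIFAR-10 and ImageNet, but even a small MLP shows the effect):

    # Sketch of the randomized-label experiment (not the paper's code).
    # Assumes PyTorch + torchvision; the paper uses much larger models.
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    train = torchvision.datasets.CIFAR10(".", train=True, download=True,
                                         transform=T.ToTensor())
    # The only change vs. normal training: replace the labels with noise,
    # so they carry no information about the images.
    train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()
    loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(200):      # with enough capacity and epochs,
        correct = 0               # training accuracy creeps toward 1.0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
        print(epoch, correct / len(train))

Test accuracy of course stays at chance (~10%); the paper's point is that the exact same model/optimizer pipeline happily fits either real signal or pure noise.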


I haven't read the paper, but I think it only really makes sense in context:

a) The traditional view of generalization would argue that neural nets are "too complicated"/have too many parameters/etc to generalize well. And that for generalization you need more limited models.

b) To reconcile this with the practical results of CNNs, some people tried to argue that while neural nets have a lot of actual parameters, their structure reduces the effective capacity to something smaller than the raw parameter counts imply. The same argument is made for regularization.

c) This paper shows that those arguments are not satisfactory since they really can fit random labels.


As a point of reference, it is good to know that MNIST (handwritten digits) can be solved extremely well with a nearest-neighbour classifier, see: http://yann.lecun.com/exdb/mnist/

It should not be surprising that "memorize all inputs and interpolate between them" strategies are powerful, whether as part of neural networks or of other easy-to-overfit techniques (such as random forests).
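For the curious, that memorize-everything baseline is a few lines with scikit-learn (a sketch on the small 8x8 digits set bundled with sklearn; on full MNIST, plain k-NN gets roughly 95-97% test accuracy depending on preprocessing, per the page above):

    # k-nearest-neighbour "memorize all input and interpolate" baseline (sketch).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)          # 8x8 digits, a mini-MNIST
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=3)    # "training" = storing the data
    knn.fit(X_tr, y_tr)
    print(knn.score(X_te, y_te))                 # ~0.98 on this easy set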


> Take the same training data, but this time randomly jumble the labels (i.e., such that there is no longer any genuine correspondence between the label and what’s in the image).

What's a 'genuine correspondence'? The network has clearly picked out some image features that correspond to the assigned labels. Just because they're not the features you're thinking of doesn't mean they don't exist.


If the labels are random they don't correspond to anything, so any "features" are essentially noise.

The classifier is, in effect, memorising every element in the training set. It's training a compression algorithm for storing that data.

It should be noted that "training a compression algorithm" isn't always a bad thing per se, because that's how autoencoders work, which is one of the main ways to do deep learning.
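For reference, an autoencoder is literally trained to reproduce its input through a narrow bottleneck, i.e. to compress it; a bare-bones sketch, not tied to the article:

    import torch
    import torch.nn as nn

    # Minimal autoencoder sketch: squeeze 784-dim inputs through a 32-dim code
    # and reconstruct them. Minimizing reconstruction error is exactly
    # "training a compression scheme" for the training distribution.
    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
    decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                           lr=1e-3)

    x = torch.rand(64, 784)                       # stand-in for a batch of images
    for step in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(x)), x)
        loss.backward()
        opt.step()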

The key term in the article is "the effective capacity" of the model. If you have a big enough network, it can simply memorise everything you give it. This makes it difficult to know if such a model will generalise. A much smaller model won't overfit in the same way, but also might not perform as well as a larger, more sophisticated model. The problem in deep learning is nobody can tell how much of the training data has simply been saved somewhere in the model (in an obfuscated and compressed way).

There is some related research about reconstructing the training data from deep networks (which has privacy implications), but I don't have a link handy.


'genuine correspondence' refers to the high-level features that are similar across similarly labeled images. For randomly labeled images there is no such similarity.

If there are no detectable patterns from images to labels, the network does rote memorization. It learns a compact way to remember the label for each image.


Yes, the network has picked out some image features. But those features are being trained towards outputting what are random predictions, so the features are essentially noise.


They are jumbling the labels for the training data before giving it to the network. Thus they are misinforming the network about what it is supposed to learn.


This statement: "Or in other words: the model, its size, hyperparameters, and the optimiser cannot explain the generalisation performance of state-of-the-art neural networks." is not true and very misleading. Careful selection of hyperparameters and the model can clearly improve generalization - the article is making the mistake of assuming that getting to zero training error is a good or desirable thing. In fact, a large part of hyperparameter optimization is making choices that ensure generalization, and some of the fundamental choices, such as early stopping among many others, do determine how well the model generalizes. If your model has zero training error, you have likely made poor choices.


Where does the article state that zero training error is a good thing? The authors only show that almost every modern neural network can reach 0 training error, even when the labels are randomized (making generalization impossible). Hence, they can learn the dataset by heart. From that, the authors can use the test error as a generalization indicator.

Indeed, careful hyperparameter choice remains key to good generalization. As I understood it, the goal here is more to show that the correlation between the regularization of the network and its generalization power is far from being as clear as it is for other ML algorithms like SVMs.

In short, NN hyperparameters help to reach generalization, but cannot "explain" it. It's the key difference here between practice and theory.


Then what about the following sentence?

> This must be the case because the generalisation performance can vary significantly while they all remain unchanged.

Maybe it was just me, but I read an implied "alone" in the sentence you quoted, ie:

"Or in other words: the model, its size, hyperparameters, and the optimiser, alone, cannot explain the generalisation performance of state-of-the-art neural networks."


The VC dimension of neural networks is at least O(E), if not O(E^2) or worse. E is the number of edge parameters. With billion parameter networks trained on billion item datasets, there is no theoretical reason why deep learning should generalize. This means deep learning is just memorizing the training data. Evidence of this is the ease with which deep learning models are fooled.


It's at most O(E) not at least. The capacity of a deep network could be much smaller than the number of weights and this is where the VC theory stops being useful.

Deep networks can generalise to situations where even humans cannot. So the memorizing narrative doesn't survive any scrutiny.


Can you cite a source? It depends on the activation function, but as far as I know only the perceptron has a decent VC dimension due to its use of the sign function. The tanh and sigmoid result in O(E) and O(E^2) according to Wikipedia.


I don't really have a source, and am speaking from what is hearsay in the deep learning community. The results you cite are valid only for shallow networks. As you increase depth, you don't get the same increase in capacity, so even though millions of params are being used in deep networks, the capacity is not O(million).

The capacity of a million-sized shallow net might be O(million), but no one's using such a model.


I saw the formula in Abu-Mostafa's Learning from Data. I don't think it only applied to single-hidden-layer networks, but I may be wrong. Additionally, the book said the VC dimension is infinite in the general case.

I asked the question on CS Stack Exchange and no one took issue with the statement that DL has such a large VC dimension. The only counter-response was that it doesn't matter in practice due to DL's good error scores. But that still doesn't mean DL is generalizing. Good error is only a necessary condition for generalization, not a sufficient one.


Is there any good research about the VC dimension of deep neural nets and its usefulness?


Hmm... it could very well be true.

Perhaps deep networks work well because they learn to memorize "the most common patterns of auto-correlation" they see in the training data at different levels of function composition.

In fact, we do this explicitly in convolutional layers, which by design learn to represent every input sample as a combination of a finite number of fixed-size square filters.

...and the reason why deep networks might be "generalizing" so well is because not all distributions of natural data are equally likely!

In practice, objects with the same or similar labels tend to lie on or close to lower-dimensional manifolds embedded in data space.

...and this concentration of natural data distributions might be a result of the laws of Physics of the universe in which we happen to live: https://arxiv.org/abs/1608.08225

So, yes, deep learning could very well be a really fancy form of memorization.

I hadn't thought of it in this way before. Very interesting :-)
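To make the conv-layer point above concrete, a sketch (16 filters of size 3x3 is just an arbitrary example):

    import torch
    import torch.nn as nn

    # A convolutional layer is a small bank of learned fixed-size filters
    # slid over every position of the input; each output activation is the
    # response of one of these few square patterns.
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

    x = torch.rand(1, 3, 32, 32)        # one RGB-image-shaped tensor
    feature_maps = conv(x)              # shape: (1, 16, 32, 32)
    print(conv.weight.shape)            # torch.Size([16, 3, 3, 3]) -- the filter bank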


Humans also memorize - what a cat is, what a dog is, a giraffe.

Sure, we can apply even higher-level features, and we can generalize from a picture of a cat to a black-and-white version or a text description of it.

But we still memorize a lot.


Animals memorize too. But, humans generalize, and that's the secret to our success. That's why generalization is a big deal and not memorization.


Are NNs any better than a simple nearest neighbor algorithm then? It's hard for me to understand the hype if deep learning is just fancy memorization.


Yes, deep neural nets are better at many AI/cognitive tasks, as they learn to recognize (and perhaps only memorize) patterns in the data at multiple levels of function composition -- that is, at multiple levels of abstraction.

That last bit about "multiple levels" is key. Shallow models like kNN, SVMs, Gaussian Processes, etc. don't do that; they learn to recognize/memorize patterns at only one level.


Seems we could just layer any of these other techniques then. The big thing is layers, and neural networks just get traction because people think we're discovering something important about the brain and mind. So it's just PR at the end of the day. Not a true breakthrough.


First, yes, that's pretty much what deep neural networks are: layers of shallow models stacked atop each other, with each layer learning to recognize patterns at a different scale; and we train all the layers together end-to-end.

Second, it's not PR! This stacking of layers, when done right, can overcome the "curse of dimensionality." Shallow models like kNN, SVM, GP, etc. cannot overcome it; they perform poorly as the number of input features increases. For example, k-nearest-neighbors will not work with images that have millions of pixels each.

Third, I'm only scratching the surface here. There's a LOT more to deep learning than just stacking shallow models.


Is the answer not simply that "the generalisation is encoded in the data"? And that is exactly why deep learning models need huge amounts of data.

The model, size, hyperparameters, optimiser, etc - all they do is convert the data into a form that can be used to make predictions.


> Understanding deep learning requires re-thinking generalization

Deep learning is called deep because it is based on multiple levels of features corresponding to a hierarchy of notions. Of course, it can change how generalization as well as other operations are performed, but the way generalization is done is not a specific feature of deep learning.


Here's a lay thought.

The network must not have the capacity to hold all the data. It must have a capacity proportional to the number of classes of data (rather than the number of samples).

Another way to arrive at this may be: take a trained network and run it in inference mode on the training set. Group the nodes of the network into equally sized groups. As inference happens, train a smaller new group of nodes corresponding to each previous group by looking only at the inputs and outputs it exercises. Put the new subnetworks together by looking purely at the edges between the previous subnetworks. The new network is now constructed.

I have not built this. But would something like this work?


People have demonstrated that similar ideas are effective for optimizing network size - after training a highly redundant big model, it's often possible to reduce it to 1/10 of the parameters without significantly impacting performance by doing stuff like this (or even simpler, I think pruning is often effective).
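A simple version of that is plain magnitude pruning (a sketch of one variant, not any specific paper's method): zero out the weights with the smallest absolute values, then optionally fine-tune.

    import torch
    import torch.nn as nn

    # Magnitude-pruning sketch: keep only the largest-magnitude 10% of weights
    # in each linear layer (zero out the rest), then optionally fine-tune.
    def prune_smallest(model: nn.Module, keep_fraction: float = 0.1) -> None:
        for module in model.modules():
            if isinstance(module, nn.Linear):
                w = module.weight.data
                k = max(1, int(keep_fraction * w.numel()))
                threshold = w.abs().flatten().topk(k).values.min()
                w.mul_((w.abs() >= threshold).float())   # zero out small weights

    model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    # ... train the model here ...
    prune_smallest(model, keep_fraction=0.1)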



I have this talk from ICLR 2017, which I've just uploaded to YouTube: https://youtu.be/kCj51pTQPKI


Would it be possible to use machine learning to do this job better? Meaning, could machines look at what other machines are doing and better translate to us what's going on?


Use a system we don't understand to understand another system we don't understand; what could go wrong?


> what could go wrong?

That's already known: there are inputs to the network that do not make sense and yet will trigger strong responses. Think of them as inputs that have the same effect on NNs that optical illusions have on the human brain. We infer something that isn't there.

I suspect that as network architectures get better and parameter counts drop these will get harder and harder to construct.


You could use a neural network to optimize the hyperparameters of another neural network, but AFAIK it's not a really efficient way; your best bet would be to use Bayesian hyperparameter optimization.
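For example, with Optuna (a sketch; train_and_eval is a hypothetical helper that trains with the given hyperparameters and returns validation loss):

    import optuna

    # Bayesian-style hyperparameter search sketch (Optuna's default TPE sampler).
    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        hidden = trial.suggest_int("hidden_units", 32, 512)
        # train_and_eval is a hypothetical helper: train with these settings
        # and return the validation loss.
        return train_and_eval(lr=lr, hidden_units=hidden)

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)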



