The VC dimension of neural networks is at least O(E), if not O(E^2) or worse, where E is the number of edge weights (parameters). With billion-parameter networks trained on billion-item datasets, there is no theoretical reason why deep learning should generalize. This means deep learning is just memorizing the training data. Evidence of this is the ease with which deep learning models are fooled, e.g. by adversarial examples.
It's at most O(E), not at least. The capacity of a deep network can be much smaller than the number of weights, and this is where VC theory stops being useful.
Deep networks can generalise to situations where even humans cannot. So the memorizing narrative doesn't survive any scrutiny.
Can you cite a source? It depends on the activation function, but as far as I know only the perceptron has a decent VC dimension, thanks to its use of the sign function; tanh and sigmoid result in O(E) and O(E^2) respectively, according to Wikipedia.
I don't really have a source; I'm speaking from what is hearsay in the deep learning community. The results you cite are valid only for shallow networks. As you increase depth, you don't get the same increase in capacity, so even though deep networks use millions of parameters, their capacity is not O(millions).
The capacity of a million-parameter shallow net might be O(million), but no one is using such a model.
I saw the formula in Abu-Mostafa's Learning from Data. I don't think it applied only to single-hidden-layer networks, but I may be wrong. Additionally, the book said the VC dimension is infinite in the general case.
I asked the question on CS Stack Exchange and no one took issue with the statement that DL has such a large VC dimension. The only counter-response was that it doesn't matter in practice because of DL's good error scores. But that still doesn't mean DL is generalizing: good error is only a necessary condition for generalization, not a sufficient one.
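To make the "necessary, not sufficient" point concrete, here is (as a rough sketch, up to the exact constants) the VC generalization bound from Learning from Data, which is where d_VC enters:

```latex
% With probability at least 1 - \delta over a training sample of size N:
\[
  E_{\text{out}}(h) \;\le\; E_{\text{in}}(h)
    + \sqrt{\frac{8}{N}\,\ln\!\frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
  \qquad\text{where}\qquad
  m_{\mathcal{H}}(N) \,\le\, N^{d_{\mathrm{VC}}} + 1 .
\]
```

The gap term grows roughly like sqrt(d_VC * ln N / N), so if d_VC is on the order of the parameter count and N is comparable, the bound is vacuous: good in-sample error tells you nothing about out-of-sample error.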
Perhaps deep networks work well because they learn to memorize "the most common patterns of auto-correlation" they see in the training data at different levels of function composition.
In fact, we do this explicitly in convolutional layers, which by design learn to represent every input sample as a combination of a finite number of fixed-size square filters (there's a toy sketch of this at the end of this comment).
...and the reason deep networks might be "generalizing" so well is that not all distributions of natural data are equally likely!
In practice, objects with the same or similar labels tend to lie on or close to lower-dimensional manifolds embedded in data space.
...and this concentration of natural data distributions might be a result of the laws of Physics of the universe in which we happen to live: https://arxiv.org/abs/1608.08225
So, yes, deep learning could very well be a really fancy form of memorization.
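As a minimal sketch of what I mean by "a combination of a finite number of fixed-size filters" (NumPy only, with random filters standing in for learned ones, and a made-up 32x32 image):

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((32, 32))     # toy grayscale "image"
filters = rng.standard_normal((8, 3, 3))  # 8 fixed 3x3 filters (stand-ins for learned ones)

K, fh, fw = filters.shape
H, W = image.shape
responses = np.zeros((K, H - fh + 1, W - fw + 1))

# Slide every filter over the image: each output value is a dot product between
# the filter and a local patch, i.e. a local "pattern match" score.
for k in range(K):
    for i in range(H - fh + 1):
        for j in range(W - fw + 1):
            responses[k, i, j] = np.sum(image[i:i + fh, j:j + fw] * filters[k])

print(responses.shape)  # (8, 30, 30): the image re-described by 8 pattern detectors
```

A real convolutional layer does exactly this sliding dot product, except that the filter values are learned from the data and the response maps are fed to the next layer, which looks for patterns of patterns.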
I hadn't thought of it in this way before. Very interesting :-)
Yes, deep neural nets are better than shallow models at many AI/cognitive tasks because they learn to recognize (and perhaps only memorize) patterns in the data at multiple levels of function composition -- that is, at multiple levels of abstraction.
That last bit about "multiple levels" is key. Shallow models like kNN, SVMs, Gaussian Processes, etc. don't do that; they learn to recognize/memorize patterns at only one level.
Seems we could just layer any of these other techniques then. The big thing is layers, and neural networks just get traction because people think we're discovering something important about the brain and mind. So it's just PR at the end of the day. Not a true breakthrough.
First, yes, that's pretty much what deep neural networks are: layers of shallow machine-learning models stacked on top of each other, with each layer learning to recognize patterns at a different scale, and with all layers trained together end-to-end (a toy sketch follows at the end of this comment).
Second, it's not PR! This stacking of layers, when done right, can overcome the "curse of dimensionality." Shallow models like kNN, SVM, GP, etc. cannot overcome it; they perform poorly as the number of input features increases. For example, k-nearest-neighbors will not work with images that have millions of pixels each.
Third, I'm only scratching the surface here. There's a LOT more to deep learning than just stacking shallow models.
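To make the "stack of shallow models trained end-to-end" point concrete, here's a toy NumPy sketch on a made-up regression task: two linear stages with a tanh in between, one loss, and gradients flowing through both layers at once, which is what "end-to-end" means here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y is a nonlinear function of a random projection of x.
X = rng.standard_normal((256, 10))
y = np.sin(X @ rng.standard_normal(10))[:, None]

W1 = 0.1 * rng.standard_normal((10, 32))  # layer 1 ("shallow model" #1)
b1 = np.zeros(32)
W2 = 0.1 * rng.standard_normal((32, 1))   # layer 2 ("shallow model" #2)
b2 = np.zeros(1)
lr = 0.05

for step in range(2000):
    # Forward pass: each layer looks for patterns in the previous layer's output.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)

    # Backward pass: one loss, gradients for *all* layers -- end-to-end training.
    g_pred = 2 * (pred - y) / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_pre = g_h * (1 - h ** 2)            # derivative of tanh
    g_W1, g_b1 = X.T @ g_pre, g_pre.sum(axis=0)

    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"final training MSE: {loss:.4f}")
```

If you instead fit the first layer on its own and then fit the second layer on top of its frozen output, you'd have a pipeline of shallow models rather than a deep network; the joint training is a big part of what makes the stack more than the sum of its layers.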