The VC dimension of neural networks is at least O(E), if not O(E^2) or worse, where E is the number of edge weights (parameters). With billion-parameter networks trained on billion-item datasets, there is no theoretical reason why deep learning should generalize. This means deep learning is just memorizing the training data. Evidence of this is the ease with which deep learning models are fooled, e.g. by adversarial examples.
It's at most O(E), not at least. The capacity of a deep network can be much smaller than the number of weights, and this is where VC theory stops being useful.
Deep networks can generalise to situations where even humans cannot. So the memorizing narrative doesn't survive any scrutiny.
Can you cite a source? It depends on the activation function, but as far as I know only the perceptron has a decent VC dimension, thanks to its use of the sign function; tanh and sigmoid result in O(E) and O(E^2) respectively, according to Wikipedia.
I don't really have a source; I'm speaking from what is hearsay in the deep learning community. The results you cite are valid only for shallow networks. As you increase depth, you don't get the same increase in capacity, so even though deep networks use millions of parameters, their capacity is not O(millions).
The capacity of a million-parameter shallow net might be O(million), but no one is using such a model.
I saw the formula in Abu-Mostafa's Learning from Data. I don't think it applied only to single-hidden-layer networks, but I may be wrong. Additionally, the book said the VC dimension is infinite in the general case.
I asked the question on CS Stack Exchange and no one took issue with the statement that DL has such a large VC dimension. The only counter-response was that it doesn't matter in practice because of DL's good error scores. But that still doesn't mean DL is generalizing: good error is only a necessary condition for generalization, not a sufficient one.
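To make the "necessary, not sufficient" point concrete, here is (as a rough sketch, up to the exact constants) the VC generalization bound from Learning from Data, which is where d_VC enters:

```latex
% With probability at least 1 - \delta over a training sample of size N:
\[
  E_{\text{out}}(h) \;\le\; E_{\text{in}}(h)
    + \sqrt{\frac{8}{N}\,\ln\!\frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
  \qquad\text{where}\qquad
  m_{\mathcal{H}}(N) \,\le\, N^{d_{\mathrm{VC}}} + 1 .
\]
```

The gap term grows roughly like sqrt(d_VC * ln N / N), so if d_VC is on the order of the parameter count and N is comparable, the bound is vacuous: good in-sample error tells you nothing about out-of-sample error.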
Perhaps deep networks work well because they learn to memorize "the most common patterns of auto-correlation" they see in the training data at different levels of function composition.
In fact, we do this explicitly in convolutional layers, which by design learn to represent every input sample as a combination of a finite number of fixed-size square filters (there's a toy sketch of this at the end of this comment).
...and the reason deep networks might be "generalizing" so well is that not all distributions of natural data are equally likely!
In practice, objects with the same or similar labels tend to lie on or close to lower-dimensional manifolds embedded in data space.
...and this concentration of natural data distributions might be a result of the laws of Physics of the universe in which we happen to live: https://arxiv.org/abs/1608.08225
So, yes, deep learning could very well be a really fancy form of memorization.
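As a minimal sketch of what I mean by "a combination of a finite number of fixed-size filters" (NumPy only, with random filters standing in for learned ones, and a made-up 32x32 image):

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((32, 32))     # toy grayscale "image"
filters = rng.standard_normal((8, 3, 3))  # 8 fixed 3x3 filters (stand-ins for learned ones)

K, fh, fw = filters.shape
H, W = image.shape
responses = np.zeros((K, H - fh + 1, W - fw + 1))

# Slide every filter over the image: each output value is a dot product between
# the filter and a local patch, i.e. a local "pattern match" score.
for k in range(K):
    for i in range(H - fh + 1):
        for j in range(W - fw + 1):
            responses[k, i, j] = np.sum(image[i:i + fh, j:j + fw] * filters[k])

print(responses.shape)  # (8, 30, 30): the image re-described by 8 pattern detectors
```

A real convolutional layer does exactly this sliding dot product, except that the filter values are learned from the data and the response maps are fed to the next layer, which looks for patterns of patterns.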
I hadn't thought of it in this way before. Very interesting :-)
Yes, deep neural nets are better than shallow models at many AI/cognitive tasks because they learn to recognize (and perhaps only memorize) patterns in the data at multiple levels of function composition -- that is, at multiple levels of abstraction.
That last bit about "multiple levels" is key. Shallow models like kNN, SVMs, Gaussian Processes, etc. don't do that; they learn to recognize/memorize patterns at only one level.
Seems we could just layer any of these other techniques then. The big thing is layers, and neural networks just get traction because people think we're discovering something important about the brain and mind. So it's just PR at the end of the day. Not a true breakthrough.
First, yes, that's pretty much what deep neural networks are: layers of shallow machine-learning models stacked on top of each other, with each layer learning to recognize patterns at a different scale, and with all layers trained together end-to-end (a toy sketch follows at the end of this comment).
Second, it's not PR! This stacking of layers, when done right, can overcome the "curse of dimensionality." Shallow models like kNN, SVM, GP, etc. cannot overcome it; they perform poorly as the number of input features increases. For example, k-nearest-neighbors will not work with images that have millions of pixels each.
Third, I'm only scratching the surface here. There's a LOT more to deep learning than just stacking shallow models.
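To make the "stack of shallow models trained end-to-end" point concrete, here's a toy NumPy sketch on a made-up regression task: two linear stages with a tanh in between, one loss, and gradients flowing through both layers at once, which is what "end-to-end" means here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y is a nonlinear function of a random projection of x.
X = rng.standard_normal((256, 10))
y = np.sin(X @ rng.standard_normal(10))[:, None]

W1 = 0.1 * rng.standard_normal((10, 32))  # layer 1 ("shallow model" #1)
b1 = np.zeros(32)
W2 = 0.1 * rng.standard_normal((32, 1))   # layer 2 ("shallow model" #2)
b2 = np.zeros(1)
lr = 0.05

for step in range(2000):
    # Forward pass: each layer looks for patterns in the previous layer's output.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)

    # Backward pass: one loss, gradients for *all* layers -- end-to-end training.
    g_pred = 2 * (pred - y) / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_pre = g_h * (1 - h ** 2)            # derivative of tanh
    g_W1, g_b1 = X.T @ g_pre, g_pre.sum(axis=0)

    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"final training MSE: {loss:.4f}")
```

If you instead fit the first layer on its own and then fit the second layer on top of its frozen output, you'd have a pipeline of shallow models rather than a deep network; the joint training is a big part of what makes the stack more than the sum of its layers.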