The network must not have enough capacity to memorize all the training data. Its capacity should be proportional to the number of classes (rather than the number of samples).
Another way to arrive at this might be: take a trained network and run inference on the training set. Group the network's nodes into equally sized groups. As inference runs, train a smaller new group of nodes for each original group, looking only at the inputs and outputs that group actually exercises. Then wire the new subnetworks together using only the edges that connected the original subnetworks. The new, smaller network is now constructed.
I have not built this. But would something like this work?
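A rough sketch of what this might look like for a plain feed-forward MLP, where the "groups" are simply chunks of consecutive layers. Every module size and name below is made up for illustration: each smaller "student" group is trained to mimic the input/output behaviour of its corresponding original group on the training set, and the pieces chain back together because the seams keep the same widths.

    # Illustrative sketch only -- shapes, sizes and names are hypothetical.
    import torch
    import torch.nn as nn

    teacher_groups = [  # a trained network, split into two groups of layers
        nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU()),
        nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)),
    ]
    student_groups = [  # smaller counterparts with the same input/output widths
        nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 512), nn.ReLU()),
        nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 10)),
    ]

    def distill_groups(teacher_groups, student_groups, loader, epochs=1):
        for k, (t, s) in enumerate(zip(teacher_groups, student_groups)):
            opt = torch.optim.Adam(s.parameters(), lr=1e-3)
            loss_fn = nn.MSELoss()
            for _ in range(epochs):
                for x, _ in loader:  # loader yields flattened (batch, 784) inputs; labels unused
                    with torch.no_grad():
                        for prev in teacher_groups[:k]:
                            x = prev(x)      # the input exactly as the teacher group sees it
                        target = t(x)        # the teacher group's exercised output
                    loss = loss_fn(s(x), target)  # fit the small group to that mapping
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
        # "put the subnetworks together": the student groups chain directly,
        # since each was built with the same widths at the seams
        return nn.Sequential(*student_groups)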
People have demonstrated that similar ideas are effective for reducing network size: after training a large, highly redundant model, it's often possible to shrink it to roughly 1/10 of the parameters without significantly impacting performance by doing things like this (even simpler approaches work too; I think pruning is often effective).
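For instance, a minimal sketch of magnitude pruning using PyTorch's built-in torch.nn.utils.prune utilities; the model here is a toy example and the 90% figure is just the ballpark mentioned above, not something any particular model guarantees.

    # Zero out the smallest-magnitude weights, then make the pruning permanent.
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    # ... train the model as usual ...

    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.9)  # zero 90% of weights
            prune.remove(module, "weight")  # bake the mask into the weight tensor
    # A brief fine-tune afterwards typically recovers most of any lost accuracy.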