Actually, there's already a better way to perform initialization, based on the so-called lottery ticket hypothesis [1]. I haven't gotten around to reading the article yet, so I'll just regurgitate the abstract, but basically: trained networks frequently contain subnetworks, exposed by pruning, that perform on par with the full-size net using ≈20% of the parameters and substantially less training time. It turns out that with some magic algorithm described in the paper, one can initialize weights to quickly find these "winning tickets" and drastically reduce neural network size and training time.
As far as I understand, there is no quick magic algorithm to find them: you train the full architecture the long and hard way as usual, then you identify the right subnetwork, and you can retrain faster from the architecture and initialization of just that subnetwork.
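For concreteness, here is a rough sketch of one round of that train/prune/rewind loop in PyTorch; the train(model) calls are placeholders for an ordinary training loop, and the 80% pruning fraction is only illustrative, not the paper's exact recipe:

import copy
import torch
import torch.nn as nn

PRUNE_FRACTION = 0.8  # prune ~80% of the weights, keep ~20%

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())  # remember the original initialization

# 1) train the dense network the usual, expensive way
# train(model)  # placeholder for an ordinary training loop

# 2) keep only the largest-magnitude weights of the *trained* network
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:  # prune weight matrices, leave biases alone
        k = int(PRUNE_FRACTION * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()

# 3) rewind the surviving weights to their original init and retrain just the subnetwork
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])
# train(model)  # the masks have to be re-applied after every optimizer step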
> Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.
> We can divide by a number (scaling_factor) to scale down its magnitude to the right level
This argument bugs me a bit... since these numbers are represented in floating point, whose relative precision does not depend on their magnitude, what is the point of scaling them?
Furthermore, I do not believe his first example. Is torch really that bad? In Octave:
x = randn(512, 1);
A = randn(512);
y = A^100 * x;
mean(y), std(y)
gives regular numbers (9.1118e+135 and 1.9190e+137).
They are large, but far from overflowing. And this corresponds to a network 100 layers deep, which is not a realistic scenario.
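For what it's worth, I suspect the gap is just the dtype: torch tensors default to float32, whose largest finite value is about 3.4e38, while Octave's randn gives doubles. A quick comparison in torch (my reconstruction of the experiment, not the article's exact code, with a sqrt(512) scaling thrown in as one example of the scaling_factor quoted above):

import torch

def repeated_matmul(dtype, scale=1.0):
    torch.manual_seed(0)
    x = torch.randn(512, dtype=dtype)
    a = torch.randn(512, 512, dtype=dtype) * scale
    for _ in range(100):
        x = a @ x
    return x.mean().item(), x.std().item()

print(repeated_matmul(torch.float64))             # huge but finite, like the Octave run
print(repeated_matmul(torch.float32))             # blows past float32's max (~3.4e38): inf/nan
print(repeated_matmul(torch.float32, 512**-0.5))  # dividing by sqrt(512), one choice of scaling_factor: no overflow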
Independently of the scaling, I wonder whether anyone has tried to initialize a deep neural network with a low-discrepancy sequence (also called quasi-random numbers) instead of a uniform or Gaussian distribution.
Better coverage of the space of weights (fewer clumps and holes); a rough sketch of what I mean is below, after this comment.
If I understand the lottery ticket hypothesis [0] correctly, this would lead to a better exploration of the space and thus better results.
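I don't know of results either way, but it is cheap to try: torch ships a Sobol generator under torch.quasirandom. A sketch of the mechanics (the sobol_normal_ helper and the 1/sqrt(fan_in) scaling are my own choices, not a standard API):

import torch
import torch.nn as nn
from torch.quasirandom import SobolEngine

def sobol_normal_(weight, seed=0):
    # In-place init: each row of `weight` is one point of a scrambled Sobol
    # sequence, pushed through the Gaussian inverse CDF, then scaled by 1/sqrt(fan_in).
    fan_out, fan_in = weight.shape
    u = SobolEngine(dimension=fan_in, scramble=True, seed=seed).draw(fan_out)
    u = u.clamp(1e-6, 1 - 1e-6)  # keep the inverse CDF finite
    z = torch.distributions.Normal(0.0, 1.0).icdf(u)
    with torch.no_grad():
        weight.copy_(z / fan_in ** 0.5)
    return weight

layer = nn.Linear(512, 256)
sobol_normal_(layer.weight)
print(layer.weight.mean().item(), layer.weight.std().item())  # roughly 0 and 1/sqrt(512)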
Beautiful article, although I do not understand why he takes the time to do this in Python. I did it with 4 one-liners in Scilab, for free on my laptop, and understood the intent better :-)