Actually, there's already a better way to perform initialization, based on the so-called lottery ticket hypothesis [1]. I haven't gotten around to reading the article yet, so I'll just regurgitate the abstract, but basically: trained networks frequently contain subnetworks, exposed by pruning, that perform on par with the full-size net using ≈20% of the parameters and substantially less training time. It turns out that with some magic algorithm described in the paper, one can initialize weights to quickly find these "winning tickets" and drastically reduce neural network size and training time.
As far as I understand, there is no quick magic algorithm to find them: you train the full architecture the long and hard way as usual, then you identify the right subnetwork, and you can retrain faster from the architecture and initialization of just that subnetwork.
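For concreteness, here is a rough sketch of one round of that train/prune/rewind loop in PyTorch; the train(model) calls are placeholders for an ordinary training loop, and the 80% pruning fraction is only illustrative, not the paper's exact recipe:

import copy
import torch
import torch.nn as nn

PRUNE_FRACTION = 0.8  # prune ~80% of the weights, keep ~20%

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())  # remember the original initialization

# 1) train the dense network the usual, expensive way
# train(model)  # placeholder for an ordinary training loop

# 2) keep only the largest-magnitude weights of the *trained* network
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:  # prune weight matrices, leave biases alone
        k = int(PRUNE_FRACTION * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()

# 3) rewind the surviving weights to their original init and retrain just the subnetwork
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])
# train(model)  # the masks have to be re-applied after every optimizer step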
> Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.
> We can divide by a number (scaling_factor) to scale down its magnitude to the right level
This argument bugs me a bit... since these numbers are represented in floating point, whose relative precision does not depend on their magnitude, what is the point of scaling them?
Furthermore, I do not believe his first example. Is torch really that bad? In Octave:
x = randn(512, 1);
A = randn(512);
y = A^100 * x;
mean(y), std(y)
gives regular numbers (9.1118e+135 and 1.9190e+137).
They are large, but far from overflowing. And this corresponds to a network 100 layers deep, which is not a realistic scenario.
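For what it's worth, I suspect the gap is just the dtype: torch tensors default to float32, whose largest finite value is about 3.4e38, while Octave's randn gives doubles. A quick comparison in torch (my reconstruction of the experiment, not the article's exact code, with a sqrt(512) scaling thrown in as one example of the scaling_factor quoted above):

import torch

def repeated_matmul(dtype, scale=1.0):
    torch.manual_seed(0)
    x = torch.randn(512, dtype=dtype)
    a = torch.randn(512, 512, dtype=dtype) * scale
    for _ in range(100):
        x = a @ x
    return x.mean().item(), x.std().item()

print(repeated_matmul(torch.float64))             # huge but finite, like the Octave run
print(repeated_matmul(torch.float32))             # blows past float32's max (~3.4e38): inf/nan
print(repeated_matmul(torch.float32, 512**-0.5))  # dividing by sqrt(512), one choice of scaling_factor: no overflow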
Independently of the scaling, I wonder whether anyone has tried to initialize a deep neural network with a low-discrepancy sequence (also called quasi-random numbers) instead of a uniform or Gaussian distribution.
Better coverage of the space of weights (fewer clumps and holes); a rough sketch of what I mean is below, after this comment.
If I understand the lottery ticket hypothesis [0] correctly, this would lead to a better exploration of the space and thus better results.
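I don't know of results either way, but it is cheap to try: torch ships a Sobol generator under torch.quasirandom. A sketch of the mechanics (the sobol_normal_ helper and the 1/sqrt(fan_in) scaling are my own choices, not a standard API):

import torch
import torch.nn as nn
from torch.quasirandom import SobolEngine

def sobol_normal_(weight, seed=0):
    # In-place init: each row of `weight` is one point of a scrambled Sobol
    # sequence, pushed through the Gaussian inverse CDF, then scaled by 1/sqrt(fan_in).
    fan_out, fan_in = weight.shape
    u = SobolEngine(dimension=fan_in, scramble=True, seed=seed).draw(fan_out)
    u = u.clamp(1e-6, 1 - 1e-6)  # keep the inverse CDF finite
    z = torch.distributions.Normal(0.0, 1.0).icdf(u)
    with torch.no_grad():
        weight.copy_(z / fan_in ** 0.5)
    return weight

layer = nn.Linear(512, 256)
sobol_normal_(layer.weight)
print(layer.weight.mean().item(), layer.weight.std().item())  # roughly 0 and 1/sqrt(512)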
Beautiful article, although I do not understand why he takes the time to do this in Python. I did it with 4 one-liners in Scilab, for free on my laptop, and understood the intent better :-)