I am glad they evaluated this hypothesis using weight decay which is primarily thought of to induce a structured representation. My first thought was that the entire paper was useless if they didn't do this experiment.
I find it rather interesting that the structured representations go from sparse to full to sparse as a function of layer depth. I have noticed that applying weight decay penalty as an exponential function of layer depth gives improved results over using a global weight decay.
I find it rather interesting that the structured representations go from sparse to full to sparse as a function of layer depth. I have noticed that applying weight decay penalty as an exponential function of layer depth gives improved results over using a global weight decay.