PCA.reverse() of Convnet by partial initialisation.

There are many approaches to analysing convnets. In this post I’m probably adding a simple, new one, unless I’m reinventing existing stuff (insuring against Schmidhuberisation).

I had these two questions:

  1. In a convnet, which kernel is the most important?
  2. What are the optimal layer widths for an N-layer convnet? All the same? Gradually increasing?

I’m not sure whether this reverse PCA answers them.


If it were a PCA, it would find the most important kernels after training. Instead, I grow the network step by step:

  • layer width = N. Then train(), save_weights().
  • Then layer width = 2N, load_weights(), train(), save_weights().
    • When loading weights, there are only weights for N kernels per layer. The other N kernels are just randomly initialised.
  • Then layer width = 4N, load_weights(), train(), save_weights().
  • Then 8N..
  • 16N

This way the network will hopefully find the most important N kernels first, then the second-most-important N kernels, then the next 2N kernels, then 4N, then 8N.
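A minimal sketch of what the partial initialisation step could look like in NumPy. The function name widen_weights, the Keras-style kernel layout (height, width, in_channels, out_channels) and the Gaussian initialiser are assumptions for illustration, not the code behind the results in this post.

```python
import numpy as np

def widen_weights(old, new_shape, rng, scale=0.05):
    """Return a tensor of `new_shape` whose leading block holds `old`.

    Works for conv kernels (height, width, in_channels, out_channels)
    and for bias vectors alike, because it slices along every axis.
    """
    # Start from a fully random tensor of the wider shape.
    # (Gaussian init with std `scale` is an assumption; any initialiser would do.)
    new = rng.normal(0.0, scale, size=new_shape).astype(old.dtype)
    # Copy the trained values into the leading block along every axis.
    block = tuple(slice(0, s) for s in old.shape)
    new[block] = old
    return new

rng = np.random.default_rng(0)

# Pretend these came out of the width-N model (N=4).
trained = rng.normal(size=(3, 3, 4, 4)).astype(np.float32)
bias = np.zeros(4, dtype=np.float32)

# Width 2N: the first 4x4 channel block is pre-trained, the rest is random.
wide = widen_weights(trained, (3, 3, 8, 8), rng)
wide_bias = widen_weights(bias, (8,), rng)
```

Note that for every layer after the first, doubling the width also doubles the number of input channels, so the new weights that connect new inputs to old kernels are random in this sketch too. For the first layer the input channels stay fixed (e.g. 3 for RGB), and the same function applies with new_shape = (3, 3, 3, 8).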


I used this model and started with N=4.

I trained it on CIFAR-10. The results are not shocking; some of you can probably guess them. Yeah, you’re probably correct: um, it somehow works.










Blah Blah (=so-called discussions)

  • The pre-trained parts are largely preserved after being loaded into a wider model.
    • The partially initialised parameters stay stable (=local minima?) and don’t change much even if the new learning rate is big.
  • During training, the loss increases for a bit, but then starts to decrease.
  • Probably we can assume that
    • how important a kernel is == -1 * how uniform its values are (== their entropy?)
    • Look at layer 1 at N=48. The kernels at the bottom look like they’re doing nothing (which does NOT mean they are doing nothing).
  • If so, maybe we can do something to find the optimal widths? For example..
    • min(entropy(kernels of the layer)) would be the same for all layers?
    • (This is not something we’d do by measuring the cost or any performance of the network. Of course, first we make sure it works in terms of performance.)
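The post doesn’t pin down how to compute the entropy of a kernel, so here is one possible reading, as a rough sketch: treat the normalised absolute weights of each output kernel as a probability distribution and take its Shannon entropy. The function name kernel_entropies, the (height, width, in_channels, out_channels) layout and the use of absolute values are all assumptions.

```python
import numpy as np

def kernel_entropies(kernel, eps=1e-12):
    """Shannon entropy (in nats) of each output kernel's normalised |weights|.

    kernel: array of shape (height, width, in_channels, out_channels).
    Returns one entropy value per output kernel.
    """
    out_ch = kernel.shape[-1]
    # Flatten so that each column holds all weights of one output kernel.
    mags = np.abs(kernel).reshape(-1, out_ch)
    # Normalise each column into a probability distribution.
    probs = mags / (mags.sum(axis=0, keepdims=True) + eps)
    return -(probs * np.log(probs + eps)).sum(axis=0)

# Two toy 3x3 kernels with one input channel:
# kernel 0 is perfectly uniform, kernel 1 has a single non-zero weight.
toy = np.zeros((3, 3, 1, 2))
toy[..., 0] = 1.0
toy[1, 1, 0, 1] = 1.0

ent = kernel_entropies(toy)
```

Under this reading, the uniform kernel gets the maximum entropy (log of the number of weights) and the one-hot kernel gets an entropy close to zero, so a high value would flag a kernel that looks like it is doing nothing. min(ent) per layer would then be the quantity to compare across layers.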

Experiment 2

What if we do the same stuff from scratch? What do they look like?









Blah blah (2)

Are they different? Like how?

  • Unlike the partially initialised ones, all the kernels look like they’re doing something (again, which does not mean they really are doing something). Would that mean, for example, that the kernels in the first layer at N=48 are somehow redundant?
  • Does partial initialisation result in better performance? (hopefully?)
    • NO. Look at these.


  • All on validation set
  • x-axis: [epoch]
  • Solid lines: partially initialised.
  • Dashed lines: ‘from scratch’, aligned on the x-axis with the partial initialisation results for the same layer widths.
  • At the end, the results seem the same for the same N.
    • So partial initialisation neither helps nor hurts.


It would be nice if we could spot those useless, lazy kernels so that we could make the structure more computationally efficient.

