PCA.reverse() of Convnet by partial initialisation.

There are many approaches to analysing convnets. In this post I'm probably adding a simple new one, if I'm not just reinventing something that already exists (insurance against Schmidhuberisation).

I had these two questions:

  1. In a convnet, which kernels are the most important?
  2. What are the optimal layer widths for an N-layer convnet? All the same? Gradually increasing?

I'm not sure if this reverse PCA would answer them, though.

Reverse-PCA

If this were a PCA, it would find the most important kernels after training. Instead:

  • Set the layer width to N. Then train(), save_weights().
  • Then set the layer width to 2N, and load_weights(), train(), save_weights().
    • When loading, there are only weights for N channels in each layer; the other N channels are just randomly initialised.
  • Then layer width = 4N: load_weights(), train(), save_weights().
  • Then 8N...
  • Then 16N, and so on.

So the network will hopefully find the most important N kernels first, then the second most important N kernels, then the next 2N, then 4N, then 8N, and so on. A rough sketch of this loop is below.
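Something like the following, in Keras. This is only a minimal sketch: build_model is a small stand-in for the model I actually used, and the "loading" step is done by hand (copying the trained kernels into the first channels of the wider layers), since a plain load_weights() won't fit a narrower checkpoint into a wider model.

```python
from tensorflow import keras


def build_model(width):
    # Stand-in CIFAR-10 convnet; every conv layer has `width` channels.
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(width, 3, padding="same", activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(width, 3, padding="same", activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax"),
    ])


def widen_conv(old_layer, new_layer):
    # Copy the trained kernels/biases into the first channels of the wider
    # layer; the remaining channels keep their fresh random initialisation.
    old_k, old_b = old_layer.get_weights()
    new_k, new_b = new_layer.get_weights()
    ci, co = old_k.shape[2], old_k.shape[3]   # old in/out channel counts
    new_k[:, :, :ci, :co] = old_k
    new_b[:co] = old_b
    new_layer.set_weights([new_k, new_b])


(x_train, y_train), (x_val, y_val) = keras.datasets.cifar10.load_data()
x_train, x_val = x_train / 255.0, x_val / 255.0

prev = None
for width in [4, 8, 16, 32, 48]:
    model = build_model(width)
    if prev is not None:
        # The "load_weights" step: carry the narrower model's conv weights over.
        for old_l, new_l in zip(prev.layers, model.layers):
            if isinstance(new_l, keras.layers.Conv2D):
                widen_conv(old_l, new_l)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=10,          # epoch count is arbitrary here
              validation_data=(x_val, y_val))
    model.save_weights(f"inverse_pca_{width}ch.weights.h5")
    prev = model
```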

Experiments

I used this model and started with N=4.

I trained it on CIFAR-10. The results are not shocking; some of you can probably guess. Yeah, you're probably right: it somehow works.

N=4

[Figure: ipca1]

N=8

[Figure: ipca2]

N=16

[Figure: ipca3]

N=32

[Figure: ipca4]

N=48

[Figure: ipca5]

Blah Blah (=so-called discussions)

  • The pre-trained parts are largely preserved after being loaded into a wider model.
    • The partially initialised parameters stay stable (a local minimum?) and don't change much, even when the new learning rate is large.
  • During training, the loss increases a little at first, but then starts to decrease.
  • We can probably assume that
    • how important a kernel is == -1 × how uniform its values are == something like their entropy?
    • Look at layer 1 for N=48. The kernels at the bottom look like they're doing nothing (which does NOT mean they actually are doing nothing).
  • If so, maybe we can use this to find the optimal widths? For example (see the sketch after this list):
    • min(entropy(kernels of the layer)) would be the same for all layers?
    • (This isn't something we'd do by measuring the cost or any other performance of the network; of course, first we'd make sure it works in terms of performance.)
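A crude way to put a number on how "uniform" (lazy-looking) each kernel is: per-channel variance and a histogram entropy of the kernel weights. This scoring rule is only my guess at a proxy for what the bullets above hint at, nothing principled.

```python
import numpy as np


def kernel_uniformity_scores(kernel, bins=16):
    # kernel: conv weights of shape (h, w, in_ch, out_ch).
    # Returns one score per output channel: the variance of its weights and a
    # histogram-based entropy estimate. Flat, low-variance, low-entropy kernels
    # are the candidates for "doing nothing".
    h, w, ci, co = kernel.shape
    flat = kernel.reshape(-1, co)          # all weights feeding each output channel
    variance = flat.var(axis=0)
    entropy = np.empty(co)
    for c in range(co):
        counts, _ = np.histogram(flat[:, c], bins=bins)
        p = counts[counts > 0] / counts.sum()
        entropy[c] = -(p * np.log(p)).sum()
    return variance, entropy


# Usage sketch (hypothetical): rank the first conv layer's kernels of a model.
# variance, entropy = kernel_uniformity_scores(conv_layer.get_weights()[0])
# lazy = np.argsort(variance)[:8]   # the 8 flattest-looking kernels
```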

Experiment 2

What if we train the same widths from scratch? What do the kernels look like then?

N=8

[Figure: fig_inversepca_08ch_new]

N=16

[Figure: fig_inversepca_16ch_new]

N=32

[Figure: fig_inversepca_32ch_new]

N=48

[Figure: fig_inversepca_48ch_new]

Blah blah (2)

Are they different? Like how?

  • Unlike the partially initialised models, all the kernels look like they're doing something (again, that does not mean they really are doing something). Could that mean, for example, that in the first layer at N=48 the kernels are somehow redundant?
  • Does partial initialisation result in better performance? (Hopefully?)
    • NO. Look at these.

[Figure: acc_loss]

  • All on the validation set.
  • x-axis: epoch.
  • Solid lines: partially initialised.
  • Dashed lines: trained from scratch, aligned on the x-axis with the partial-initialisation results of the same layer widths.
  • In the end, the results look the same for the same N.
    • So partial initialisation neither helps nor hurts.

STILL,

it would be nice if we could spot those useless, lazy kernels, so that we could optimise the network structure computationally.
