There are many approaches to analysing convnets. In this post, I'm probably adding a simple new one, if I'm not reinventing something (insurance against Schmidhuberisation).

I had these two questions:

- In a convnet, which kernels are the most important?
- What are the optimum layer widths for an N-layer convnet? All the same? Gradually increasing?

I'm not sure if this reverse *PCA* would answer them, but let's see.

## Reverse-PCA

If it were a PCA, it would find the most important kernels after training. Instead:

- Set `layer width = N`. Then `train()` and `save_weights()`.
- Then set `layer width = 2N`, `load_weights()`, `train()`, `save_weights()`.
  - When loading weights, there are only weights for the first `N` kernels of each layer; the other `N` I just randomly initialise.
- Then `layer width = 4N`, `load_weights()`, `train()`, `save_weights()`.
- Then `8N`, `16N`, …

So the network will hopefully find the most important `N` kernels first, then the second most important `N` kernels, then the third most important `2N` kernels, then `4N`, `8N`, …
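Here is a minimal sketch of that loop, assuming TensorFlow/Keras; the toy architecture and the helper names (`build_model`, `widen_weights`) are mine for illustration, not the actual code behind these experiments:

```python
from tensorflow import keras

def build_model(width):
    """A toy convnet whose layer width (number of kernels) is a parameter."""
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(width, 3, activation="relu"),
        keras.layers.Conv2D(width, 3, activation="relu"),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10, activation="softmax"),
    ])

def widen_weights(narrow, wide):
    """Copy trained weights into the leading slots of the wider model;
    every other slot keeps its fresh random initialisation."""
    for old_layer, new_layer in zip(narrow.layers, wide.layers):
        merged = []
        for old_w, new_w in zip(old_layer.get_weights(), new_layer.get_weights()):
            w = new_w.copy()
            # The old weights occupy the leading slice along every axis,
            # which covers conv kernels, biases, and the final dense layer.
            w[tuple(slice(0, d) for d in old_w.shape)] = old_w
            merged.append(w)
        new_layer.set_weights(merged)

(x_train, y_train), (x_val, y_val) = keras.datasets.cifar10.load_data()
x_train, x_val = x_train / 255.0, x_val / 255.0

model = None
for width in (4, 8, 16, 32, 48):
    wider = build_model(width)
    if model is not None:
        widen_weights(model, wider)  # the partial load_weights() step
    wider.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    wider.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)  # train()
    wider.save_weights(f"width_{width}.weights.h5")  # save_weights()
    model = wider
```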

## Experiments

I used this model and started with `N=4`.

I trained it on CIFAR-10. The results are not shocking; some of you can probably guess. Yeah, you're probably right, um, it somehow works.

### N=4

### N=8

### N=16

### N=32

### N=48

## Blah Blah (=so-called discussions)

- The pre-trained parts are quite *preserved* after being loaded into a wider model.
  - The partially initialised parameters stay stable (= a local minimum?) and don't change much, even if the new learning rate is big.
  - During training, the loss increases for a bit, but then starts to decrease.
- Probably we can assume that:
  - how important a kernel is == -1 * how uniform the kernel values are == their entropy?
  - Look at layer 1 at `N=48`. The kernels at the bottom look like they're doing nothing (which does NOT mean they actually are doing nothing).
- If so, probably we can do something to find the optimal widths? For example: would `min(entropy(kernels of the layer))` be the same for all layers? (A rough sketch of one possible entropy measure follows after this list.)
  - (It's not something we'd decide by measuring the cost or any other performance of the network; of course, first we'd make sure it works in terms of performance.)
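I don't know exactly which entropy would be the right one, but here is a rough sketch of one possible measure (entirely my own guess, not something from the experiments above): treat each kernel's normalised absolute weights as a probability distribution, so that near-uniform, lazy-looking kernels score close to the maximum entropy.

```python
import numpy as np

def kernel_entropies(kernel, eps=1e-12):
    """Entropy-like 'laziness' score per output channel of a conv kernel.

    kernel: conv weights of shape (kh, kw, in_channels, out_channels).
    Each output channel's |weights| are normalised into a distribution;
    a near-uniform kernel scores close to the maximum possible entropy,
    log(kh * kw * in_channels).
    """
    flat = np.abs(kernel).reshape(-1, kernel.shape[-1])  # (kh*kw*in, out)
    p = flat / (flat.sum(axis=0, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=0)  # one score per kernel

# For the width idea above, something like comparing
# kernel_entropies(conv_layer.get_weights()[0]).min() across layers.
```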

## Experiment 2

What if we do the same thing from scratch? How do the kernels look?

### N=8

### N=16

### N=32

### N=48

## Blah blah (2)

Are they different? Like how?

- Unlike the partially initialised runs, all the kernels look like they're doing something (again, this does not mean they really are doing something). Would that mean, for example, that in the first layer at `N=48` they are somehow redundant?
- Does partial initialisation result in better performance? (hopefully?)
  - NO. Look at these.
  - All on the validation set.
  - x-axis: epoch.
  - Solid lines: partially initialised.
  - Dashed lines: 'from scratch', aligned on the x-axis with the partial-initialisation results of the same layer widths.
  - At the end, the results seem the same for the same `N`.
  - So partial initialisation neither helps nor hurts.

## STILL,

it would be nice if we could spot those useless, lazy kernels, so that we could optimise the network structure computationally.