A Recipe for Optimal Convolutional Neural Network Architectures

Roman Kazinnik
5 min read · Jun 26, 2018

I will talk about convolutional neural networks and how you can optimize their architecture while eliminating redundancy. Let’s get started.

The accompanying code: https://github.com/romanonly/romankazinnik_blog/tree/master/CVision/CNN

Images and Convolutional Neural Networks

A convolutional neural network (CNN) is a deep learning algorithm that takes an image as input and weighs the various objects in the image to differentiate them from one another. What makes a CNN attractive is that it requires no image-feature pre-processing: the network automatically learns features that improve model accuracy.

A CNN can also be seen as a process that creates an intuitive multi-scale representation of an image.

Essentially, a CNN is a sequence of convolutions, and each convolution can be viewed as a projection onto a lower-dimensional subspace. This projection can also be implemented as an embedding into a higher-dimensional space.
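To make the "convolution as projection" idea concrete, here is a minimal numpy sketch (the signal and the smoothing kernel are illustrative assumptions, not taken from the road-sign model): each output sample is the inner product of the kernel with a local window of the signal, and striding turns the projection into a genuine dimensionality reduction.

```python
import numpy as np

# Convolution as projection: each output sample is the inner product
# of the kernel with a local window of the signal.
signal = np.array([4.0, 2.0, 6.0, 8.0, 3.0, 1.0])
kernel = np.array([0.5, 0.5])  # illustrative smoothing kernel

projected = np.convolve(signal, kernel, mode="valid")
print(projected)  # [3.  4.  7.  5.5 2. ]

# Keeping every second sample (stride 2) projects the signal
# onto a lower-dimensional subspace.
low_res = projected[::2]
print(low_res)  # [3. 7. 2.]
```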

Multi-scale approximation is a classical area of mathematical analysis, and I apply this theory to derive rules for avoiding redundant model architectures and creating optimal CNN architectures.

As an illustration, I will compare the following three CNN architectures (a code sketch of all three follows the list):

1. The original ‘road sign’ CNN architecture.
2. The same architecture with the two concatenated sections denoted ‘2x2 max pool’ and ‘4x4 max pool’ removed from the ‘flatten’ layer.
3. A three-headed model: add an independent softmax output for each of the two concatenated sections (‘2x2 max pool’ and ‘4x4 max pool’) in the ‘flatten’ layer.
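Here is a minimal Keras sketch of the three variants. The layer sizes, input shape, and the 43 road-sign classes are assumptions for illustration; the exact architecture lives in the repository linked above.

```python
from tensorflow.keras import Input, layers, models

def backbone(inp):
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D(2)(c1)   # the '2x2 max pool' section
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D(2)(c2)   # the '4x4 max pool' section
    c3 = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)
    return p1, p2, c3

inp = Input(shape=(32, 32, 3))
p1, p2, c3 = backbone(inp)
n_classes = 43  # assumed number of road-sign classes

# 1. Original: both pooled sections concatenated into the 'flatten' layer.
flat = layers.concatenate(
    [layers.Flatten()(p1), layers.Flatten()(p2), layers.Flatten()(c3)])
model_original = models.Model(
    inp, layers.Dense(n_classes, activation="softmax")(flat))

# 2. Concatenated sections removed: only the last convolution is flattened.
model_pruned = models.Model(
    inp, layers.Dense(n_classes, activation="softmax")(layers.Flatten()(c3)))

# 3. Three-headed: an independent softmax output per section.
heads = [layers.Dense(n_classes, activation="softmax")(layers.Flatten()(t))
         for t in (p1, p2, c3)]
model_three_headed = models.Model(inp, heads)
```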

I am going to review the multi-scale theory and the decomposition into low-resolution and detail layers, and illustrate how this theory suggests treating the ‘flatten’ layer in a special way.

Model Architecture: Optimal Versus Redundant

Accurate prediction can be regarded as a search for an efficient data representation. As an illustration, think of a deep-neural-network binary classification model as a series of transformations whose goal is to find an efficient data representation that separates the two classes. In data representation, redundancy is well defined: one aims to find the smallest basis set that represents the data. The final dense layer can be viewed as that representation of the data.

An example of redundancy in modeling: if one model achieves 99% accuracy with a final dense layer of length 10, then a model architecture that needs a longer representation for the same accuracy can be viewed as redundant.

Why Avoid Redundant Model Architectures?

Efficient models require less data to achieve equivalent accuracy. The amount of data needed to reach a given accuracy can therefore be viewed as another indicator of model-architecture redundancy.

Let’s add more intuition to Projection, Multi-Scale, and Redundancy.

A Deep Neural Network Is A Sequence Of Multi-Scale Transforms.

Redundant representation can be illustrated with the example in Figure 2, which shows a multi-scale decomposition: the projection of a high-resolution object into two parts, a low-resolution coarse object and a details object.
Multi-scale decomposition is a sequence of transformations in which the original high-resolution object is decomposed into a lower-resolution version and its complementary details part.
The original high-resolution object is reconstructed exactly from the lower-resolution version and the details.
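A single decomposition step can be sketched in a few lines of numpy. This is a Haar-style averaging projection; the signal itself is an illustrative assumption.

```python
import numpy as np

x = np.array([4.0, 2.0, 6.0, 8.0, 3.0, 1.0])  # high-resolution object

low = (x[0::2] + x[1::2]) / 2.0     # low-resolution coarse part
detail = (x[0::2] - x[1::2]) / 2.0  # complementary details part

# Exact reconstruction of the original from the two parts.
x_rec = np.empty_like(x)
x_rec[0::2] = low + detail
x_rec[1::2] = low - detail
assert np.allclose(x, x_rec)
```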

Multi-Scale Transformation From High To Low Resolution.

Here is the intuition behind sequential transformations using projections and multi-scale analysis, along with their key properties:

  1. The projection operation is obtained using an inner product, such as a convolution.
  2. A lower-resolution object has some of the original high-resolution details smoothed out, as a result of the projection onto a lower-dimensional subspace.
  3. Applied successively, this yields a set of multi-scale (multi-resolution) versions of the object and the remainder levels; see the 1-D signal multi-scale decomposition example in Figure 3 and the sketch below.
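Applying the same projection successively produces the multi-resolution pyramid described in point 3; a self-contained sketch:

```python
import numpy as np

def decompose(x, n_levels):
    """Haar-style multi-scale decomposition: returns the coarsest version
    plus the detail (remainder) signal from every level."""
    details = []
    for _ in range(n_levels):
        details.append((x[0::2] - x[1::2]) / 2.0)  # remainder at this level
        x = (x[0::2] + x[1::2]) / 2.0              # next lower resolution
    return x, details

coarse, details = decompose(np.arange(16, dtype=float), 3)
print(coarse)                   # the coarsest 2-sample version
print([d[0] for d in details])  # one detail value per level
```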

Let me highlight key observations:

  • Objects tend to look similar at nearby resolutions, exhibiting differences only at distant resolution levels.
  • The low-resolution version of the object is its least-squares projection.
  • Redundancy is clearly visible when the objects at two consecutive resolution levels look almost identical.
  • Concatenating two such levels would be equivalent to duplicating the same object.
  • The remainders (‘details’) have a sparse structure, as the snippet below illustrates.
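These observations are easy to check numerically: on a smooth signal, the detail (remainder) levels carry almost no energy relative to the coarse versions, which is exactly the redundancy described above. A self-contained sketch:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 256)
x = np.sin(2 * np.pi * t)  # a smooth high-resolution signal

for level in range(1, 4):
    detail = (x[0::2] - x[1::2]) / 2.0  # remainder at this level
    x = (x[0::2] + x[1::2]) / 2.0       # next lower-resolution version
    ratio = np.sum(detail ** 2) / np.sum(x ** 2)
    print(f"level {level}: detail/coarse energy ratio = {ratio:.2e}")

# The tiny ratios show that consecutive resolution levels are nearly
# identical: concatenating them mostly duplicates the same information.
```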

CNN and Redundant Representation: Road signs example

Figure 1 shows three CNN model architectures (Keras, road-signs data set) and their training curves: Red, Green/Blue, and Black (Blue and Green show two training runs of the same model architecture). Can you find the most redundant CNN architecture?

The lowest accuracy corresponds to the most redundant CNN architecture. Which architecture produced the Black curve (most efficient), and which produced the Red one (least accurate)?

The optimal Black convergence curve corresponds to the three-headed architecture. The ‘flatten’ layer concatenates convolution outputs that carry redundant information; introducing three independent softmax optimizers lets each convolution output be optimized independently of the others.
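In Keras, making the heads independent amounts to compiling the three-output model with one loss per head. A sketch, reusing the hypothetical model_three_headed from the earlier snippet; the optimizer and the equal loss weights are assumptions, not taken from the original training code.

```python
# model_three_headed is the three-output model sketched earlier.
model_three_headed.compile(
    optimizer="adam",
    loss=["categorical_crossentropy"] * 3,  # one softmax loss per head
    loss_weights=[1.0, 1.0, 1.0],
    metrics=["accuracy"],
)
# Each head receives the same labels:
# model_three_headed.fit(x_train, [y_train, y_train, y_train], ...)
```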

The Red convergence plot was produced by the least optimal architecture, which applies a single softmax optimizer to the single output of the last convolutional layer.

The Green and Blue convergence curves were produced by training the original single-softmax architecture, with the two convolution outputs concatenated.

Figure 4 shows that increasing the training data fourfold effectively removes the redundancy effect.

What are your views on this?

Let me know in the comments below. If you want to discuss this in detail, you can also email me at roman.kazinnik@gmail.com.

Originally published at https://www.romankazinnik.com on June 26, 2018.
