Many great adventures start with naivety, and sadly this blog was one of them. My intention was to provide a high-level overview of the advancements in Deep Neural Network (DNN) training techniques during the last few years. Unfortunately, I struggle to pontificate about something I don’t comprehend well, so I started with the popular Convolutional Neural Networks and rapidly discovered a multitude of variants. No worries, how hard can it be to research each new term or concept related to Encoder/Decoder networks?

A week of spare-time reading later, I suspect that data scientists, like Monty Python's Battle of Britain fighter pilots (below) chatting about “Cabbage crates over the Briny”, have their own insider dialect!

I now better understand convolutional neural network training, but I won’t be programming one anytime soon. Certainly not before the end of the sailplane racing season.

Many years ago, as a sprog electronics student I had the option of doing a double solid-state physics or a numerical methods course. I knew and disliked Schrödinger's cat, so I learned all about using computers to solve mathematics problems. A few numerical method ideas permanently programmed my neurons:

- Zero and infinity are the trivial solutions to many problems; they carry no useful information, so they should be ignored.
- Testing for divergence early (does the next iteration move closer to or further away from an answer?) is important in numerical methods.

Apparently, these ideas also apply well to DNNs, where the gradient plays a similar role to that divergence test, as I understand it. Combine this with the concept of mapping a mathematical problem onto a different set of coordinate axes, solving it there with linear mathematics and mapping it back (just an extension of the logarithm tables concept), and you are prepared for a DNN adventure.

Here are some of the fascinating DNN training concepts and terminology I’ve researched so far. When training a DNN, the parameters (weights) within its nodes are adjusted until its behaviour matches the desired result, perhaps recognising cats, using sample training images. The adjustment of these parameters is monitored using several mathematical methods. The progression toward the optimum solution is depicted by the following graph: the bottom of the curve is the best solution, and the starting point is set by the network's initial weights.

**“Initialisation scale”** - Best practice is to initialise the DNN weights with small random numbers to break the symmetry between different units. However, if the weights are initialised to very large numbers, the non-linear sigmoid activations will saturate (in the tail regions), resulting in DNN nodes that barely change value, effectively dead neurons. If the weights are very small, then the gradients will also be small. It is therefore preferable to choose weights in an intermediate range, distributed evenly around a mean value.

The inputs in the arrow regions of the sigmoid all result in the same output – adding little value to the DNN.
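
The saturation effect can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not part of any real training framework: a single hypothetical node with 100 random inputs, and the helper names `sigmoid_slope` and `mean_slope` are invented for the example. It measures how steep the sigmoid is at the node's operating point. A slope near zero means the node has stopped responding to changes.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_slope(x):
    # Derivative of the sigmoid: at most 0.25 (at x = 0), near 0 in the tails.
    s = sigmoid(x)
    return s * (1.0 - s)


def mean_slope(weight_scale, n_inputs=100, trials=1000, seed=0):
    # Average sigmoid slope of one hypothetical node whose weights are
    # drawn uniformly from [-weight_scale, +weight_scale].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        inputs = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
        weights = [rng.uniform(-weight_scale, weight_scale) for _ in range(n_inputs)]
        pre_activation = sum(w * x for w, x in zip(weights, inputs))
        total += sigmoid_slope(pre_activation)
    return total / trials


print("small weights:", mean_slope(0.08))
print("large weights:", mean_slope(10.0))
```

With small weights the node sits in the steep middle of the sigmoid. With large weights it spends most of its time in the flat tails, which is the saturation described above.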

**“Gradient”** - The gradient is the slope of the error curve at the current point; gradient descent is the method of repeatedly stepping downhill along it. The steeper the slope, the bigger the iteration steps and the faster the movement towards the solution. As the solution is converged upon, the gradient flattens out.
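
As a minimal sketch of what "slope" means here, assume a hypothetical one-weight error curve `loss(w) = (w - 3)^2`, chosen purely for illustration with its minimum at `w = 3`. The slope can be estimated numerically by nudging the weight a little in each direction:

```python
def loss(w):
    # Hypothetical one-weight error curve with its minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w, eps=1e-6):
    # Central-difference estimate of the slope of loss at w.
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)


for w in (0.0, 2.0, 3.0):
    print(f"w = {w}: slope = {gradient(w):.4f}")
```

Far from the minimum the slope is steep, and at the minimum it is flat.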

**“Learning rate”** - The size of these iteration steps is called the learning rate. With a high learning rate, we can cover more ground each step, but we risk overshooting the lowest point since the slope is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but it will take us a very long time to get to the bottom.
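
The trade-off can be seen on a toy problem. The sketch below assumes a hypothetical one-weight error curve `(w - 3)^2` and a helper called `descend`, both invented for this example, and runs the same number of steps with three different learning rates:

```python
def descend(learning_rate, steps=20, start=0.0):
    # Plain gradient descent on the hypothetical loss (w - 3)^2.
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)     # slope of the loss at the current weight
        w -= learning_rate * grad  # step against the slope
    return w


for rate in (0.01, 0.4, 1.1):
    print(f"learning rate {rate}: w after 20 steps = {descend(rate):.3f}")
```

On this curve the lowest rate is still far from 3 after 20 steps, the middle rate lands on it, and the highest rate overshoots further on every step and diverges.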

**“Momentum”** - In the real world, training graphs are often far from the smooth curve shown above, perhaps with flat spots and intermediate dips. Momentum is added to the steps so that the iterations carry through these and converge on the optimum solution.
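
Here is a small sketch of the flat-spot case only, assuming a deliberately shallow, hypothetical error curve `0.01 * (w - 3)^2` and the commonly used "velocity" form of momentum. The function names and the 0.9 momentum value are my own choices for the example:

```python
def gradient(w):
    # Slope of a deliberately shallow, hypothetical loss 0.01 * (w - 3)^2.
    return 0.02 * (w - 3.0)


def plain_descent(steps=50, learning_rate=0.1, start=0.0):
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def momentum_descent(steps=50, learning_rate=0.1, momentum=0.9, start=0.0):
    w = start
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous step and add the new downhill push.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w


print("plain descent:   ", plain_descent())
print("momentum descent:", momentum_descent())
```

With the same learning rate and step count, the momentum version builds up speed across the shallow region and ends closer to the minimum at 3.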

**“Encoder/Decoder”** - The input data is converted into an intermediate representation (a vector), which is then converted into the required result, be it a translation into a different natural language, image recognition, etc. The conversion is a non-linear process; otherwise the data would just be scaled rather than transformed. This took me a while to digest:

**Encoder:** The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed-length vector called a context vector.
**Decoder:** The decoder is responsible for stepping through the output time steps while reading from the context vector.
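
The two roles above can be sketched with a toy recurrence. This is an assumption-heavy illustration: the weights are hand-picked constants rather than trained values, a real system would use LSTM cells, and `HIDDEN_SIZE`, `encode` and `decode` are names invented for the example. The point is only that inputs of any length end up in a context vector of the same fixed length.

```python
import math

HIDDEN_SIZE = 3  # length of the context vector (hypothetical choice)


def encode(sequence):
    # Step through the input, folding each value into a fixed-length state.
    state = [0.0] * HIDDEN_SIZE
    for x in sequence:
        state = [math.tanh(0.5 * s + 0.1 * (i + 1) * x)
                 for i, s in enumerate(state)]
    return state  # the context vector


def decode(context, n_steps):
    # Step through the output time steps, reading the context at each one.
    state = list(context)
    outputs = []
    for _ in range(n_steps):
        state = [math.tanh(0.5 * s + 0.5 * c) for s, c in zip(state, context)]
        outputs.append(sum(state))
    return outputs


short_context = encode([1.0, 2.0])
long_context = encode([1.0, 2.0, 3.0, 4.0, 5.0])
print(len(short_context), len(long_context))  # same length for both inputs
print(decode(long_context, n_steps=4))
```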

**“Attention Networks”** - When working with variable-length sequence data such as natural language, mapping the conversion to a common intermediate format is complicated. The neural network needs to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, especially those longer than the sentences in the training corpus. Attention networks address this by letting the decoder look back at the encoder's output for every input step, rather than relying on a single fixed-length vector.
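
The core idea of attention, as described in the background reading at the end of this post, is that the decoder weighs every encoder state by how relevant it is to the current output step and then takes a weighted average. The sketch below is a simplification under my own assumptions: it uses plain dot-product scoring (published attention models use other scoring functions) and made-up two-dimensional states, and `attend` is a name invented for the example.

```python
import math


def softmax(scores):
    # Turn arbitrary scores into positive weights that sum to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    # Score each encoder state against the decoder's current query.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context is the weighted average of the encoder states.
    size = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(size)]
    return weights, context


states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one state per input word
weights, context = attend([0.0, 2.0], states)
print("weights:", weights)
print("context:", context)
```

Because a fresh context is computed for every output step, a long sentence no longer has to squeeze through one fixed-length vector.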

**“Long Short-Term Memory Units (LSTMs)”** - LSTM units are cells responsible for "remembering" values over arbitrary time intervals. The value remembered is the result of the activation function, which defines the output of that DNN node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that are "ON" (1) or "OFF" (0), depending on the input. This is like the behaviour of the linear perceptron in neural networks. However, only non-linear activation functions allow such networks to compute non-trivial problems using a small number of nodes.

The following x = input, y = output graphs compare a linear function (“Identity” straight line), multiplying by 1 in this case, to other non-linear ones:

Selecting the appropriate non-linear function for your dataset has a major impact on how long your DNN takes to train, or whether it trains at all.
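
For readers who prefer numbers to graphs, here is a short sketch of the identity function next to three widely used non-linear activation functions. The selection (sigmoid, tanh, ReLU) is my own and may not match the original graphs exactly.

```python
import math


def identity(x):
    # Linear: output equals input (multiplying by 1).
    return x


def sigmoid(x):
    # Squashes any input into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Squashes any input into the range -1 to 1.
    return math.tanh(x)


def relu(x):
    # Passes positive inputs through and clamps negatives to zero.
    return max(0.0, x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}  identity={identity(x):5.1f}  "
          f"sigmoid={sigmoid(x):.3f}  tanh={tanh(x):6.3f}  relu={relu(x):.1f}")
```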

**“Training parameters”**

**One epoch** = one forward pass and one backward pass of all the training examples
**Batch size** = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need
**Number of iterations** = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
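
The relationship between these three terms is simple arithmetic, sketched below with helper names of my own invention. The sketch assumes that a final, smaller batch still counts as one iteration, which is a common convention but not the only one.

```python
import math


def iterations_per_epoch(n_examples, batch_size):
    # A partial final batch still costs one forward/backward pass.
    return math.ceil(n_examples / batch_size)


def total_iterations(n_examples, batch_size, n_epochs):
    return iterations_per_epoch(n_examples, batch_size) * n_epochs


print(iterations_per_epoch(1000, 128))            # 7 full batches plus 1 partial
print(total_iterations(1000, 128, n_epochs=10))
```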

For example, if you have 1,000 training examples (perhaps images), and your batch size is 500, then it will take 2 iterations to complete 1 epoch. Below are some typical DNN training parameters from a Google Sequence to Sequence DNN training project, prior to 2016, used for English to French translation:

**Dataset:** WMT ’14, 340M French words & 303M English words (30% used for training)
**Unique input words:** 160,000
**Unique output words:** 80,000
**DNN:** 4 layers of 1,000 LSTMs (Different LSTMs for input and output)
**Parameters:** 384M
**Batch size:** 128
**Learning rate:** 0.7
**Initialisation scale:** Uniform between -0.08 and 0.08
**Gradient norm clipped to:** 5
**Learning rate schedule:** Halved every half epoch after the first 5 epochs
**Momentum:** None
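
Two of the settings in this list are easy to picture in code. The sketch below is my own illustration, not code from the project: `init_weights` draws values from the stated initialisation range, and `clip_by_norm` rescales a gradient whose overall length exceeds the limit of 5 while keeping its direction. Both function names are hypothetical.

```python
import math
import random


def init_weights(n, scale=0.08, seed=0):
    # Small random starting weights, uniform in [-scale, +scale].
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(n)]


def clip_by_norm(gradient, max_norm=5.0):
    # If the gradient vector is longer than max_norm, shrink it to that
    # length without changing its direction; otherwise leave it alone.
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > max_norm:
        return [g * max_norm / norm for g in gradient]
    return list(gradient)


print(init_weights(5))
print(clip_by_norm([30.0, 40.0]))  # length 50, scaled down to length 5
print(clip_by_norm([0.3, 0.4]))    # length 0.5, left unchanged
```

Clipping keeps one unusually steep batch from throwing the weights far off course, which matters when no momentum or adaptive step size is there to smooth things out.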

Perhaps I can now ask intelligent questions at the “Rise of AI” event in Berlin later this month. I suspect that our marketing folks are in for a tough couple of weeks of preparation. If you are attending that event on the 17th May, please come and see us at our Verne Global stand - we would be delighted to meet you.

**Additional background reading:**

- http://rishy.github.io/ml/2017/01/05/how-to-train-your-dnn/
- https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/
- https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
- https://www.youtube.com/watch?time_continue=826&v=-uyXE7dY5H0