$$\notag \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} $$

Fitting and initializing neural networks

Neural networks are almost always fitted with gradient-based optimizers, such as variants of Stochastic Gradient Descent.[1] We defer the details of how to compute the gradients to the next note.

1 Initialization

How do we set the initial weights before calling an optimizer? Don’t set all the weights to zero! If different hidden units (adaptable basis functions) start out with the same parameters, they will all compute the same function of the inputs. Each unit will then get the same gradient vector, and be updated in the same way. Because every hidden unit remains identical to the others, our neural network function can only depend on a single linear combination of the inputs.
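
For example, here is a tiny NumPy check (my own illustration, not part of the notes) that a layer whose hidden units all share the same weights only passes on a single number, no matter how many units it has:

```python
import numpy as np

D, K = 4, 3                              # D inputs, K hidden units
rng = np.random.default_rng(0)
x = rng.standard_normal(D)

w_row = rng.standard_normal(D)
W = np.tile(w_row, (K, 1))               # every hidden unit gets identical weights
h = np.tanh(W @ x)                       # all K hidden values are equal
print(h)                                 # the layer only carries one linear combination of x
```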

Instead we usually initialize the weights randomly. Don’t simply set all the weights using randn() though! As a concrete example, if all your inputs were \(x_d\in\{-1,+1\}\), the activation \((\bw^{(k)})^\top\bx\) to hidden unit \(k\) would have zero mean, but typical size \(\sqrt{D}\) if there are \(D\) inputs. (See the review of random walks in the notes on expectations.) If your units saturate, as the logistic sigmoid does for large activations, most of the gradients will be close to zero, and it will be hard for the gradient optimizer to update the parameters to useful settings.

Summary: initialize each entry of a weight matrix that transforms \(K\) values to a small random value, for example 0.1*randn()/sqrt(K), assuming your input features are \({\sim}1\).
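
As a concrete NumPy sketch (mine, not part of the notes) of this recommendation: with \({\pm}1\) inputs, raw randn() weights give activations of typical size \(\sqrt{D}\), while the suggested scaling keeps them small:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000                                             # number of input features
x = rng.choice([-1.0, 1.0], size=D)                  # example +/-1 inputs

w_bad = rng.standard_normal(D)                       # plain randn() weights
w_good = 0.1 * rng.standard_normal(D) / np.sqrt(D)   # recommended small initialization

print(abs(w_bad @ x))    # typically ~sqrt(D) ~= 30: a logistic unit would saturate
print(abs(w_good @ x))   # typically ~0.1: well inside the sigmoid's sensitive region
```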

The MLP course points to Glorot and Bengio’s (2010) paper Understanding the difficulty of training deep feedforward networks, which suggests a scaling \(\propto 1/\sqrt{K^{(l)}+K^{(l-1)}}\), involving the number of hidden units in the layer after the weights, not just before. The argument involves the gradient computations, which we haven’t yet described in detail for neural networks, so we refer the interested reader to the paper or the MLP (2019) slides.[2]
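
If you want to try it, here is one possible NumPy version of that scaling (the \(\sqrt{2}\) constant is the one commonly used for the “Glorot normal” variant; the function name is my own):

```python
import numpy as np

def glorot_init(K_in, K_out, rng=np.random.default_rng()):
    """Weights mapping K_in values to K_out units, std proportional to 1/sqrt(K_in + K_out)."""
    return np.sqrt(2.0 / (K_in + K_out)) * rng.standard_normal((K_out, K_in))

W1 = glorot_init(K_in=100, K_out=50)   # weights between a 100-unit and a 50-unit layer
```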

Some specialized neural network architectures have particular tricks for initializing them. Do a literature search if you find yourself trying something other than a standard dense feedforward network: e.g., recurrent/recursive architectures, convolutional architectures, transformers, or memory networks. Alternatively, a pragmatic tip: if you are using a neural network toolbox, try to process your data to have similar properties to the standard datasets that are usually used to demonstrate that software. For example, similar dimensionality, means, variances, sparsity (number of non-zero features). Then any initialization tricks that the demonstrations use are more likely to carry over to your setting.

2 Local optima

The cost function for neural networks is not unimodal, and so is certainly not convex (a stronger property). We can see why by considering a neural network with two hidden units. Assume we’ve fitted the network to a (local) optimum of a cost function, so that any small change in parameters will make the network worse. Then we can find another parameter vector that will represent exactly the same function, showing that the optimum is only a local one.

To create the second parameter vector, we simply take all of the parameters associated with hidden unit one, and replace them with the corresponding parameters associated with hidden unit two. Then we take all of the parameters associated with hidden unit two and replace them with the parameters that were associated with hidden unit one. The network is really the same as before, with the hidden units labelled differently, so will have the same cost.
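
A small numerical check of this construction (my own sketch, using a one-hidden-layer network with logistic hidden units):

```python
import numpy as np

def nn(x, W1, b1, w2, b2):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))    # hidden unit values
    return w2 @ h + b2                          # scalar output

rng = np.random.default_rng(1)
D, K = 5, 2
W1 = 0.1 * rng.standard_normal((K, D)) / np.sqrt(D)
b1 = 0.1 * rng.standard_normal(K)
w2 = rng.standard_normal(K)
b2 = rng.standard_normal()

x = rng.standard_normal(D)
perm = np.array([1, 0])                         # relabel hidden unit 1 as 2 and vice versa
print(nn(x, W1, b1, w2, b2))
print(nn(x, W1[perm], b1[perm], w2[perm], b2))  # identical output, different parameter vector
```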

Models with “hidden” or “latent” representations of data usually have many equivalent parameter settings that represent the same model. When the goal of a machine learning system is to make predictions, it doesn’t matter whether the parameters are well-specified. However, it’s worth remembering that the values of individual parameters are often completely arbitrary, and can’t be interpreted in isolation.

In practice local optima don’t just correspond to permuting the hidden units. Some local optima will have better cost than others, and some will make predictions that generalize better than others. For small neural networks, one could fit many times and use the network that cross-validates the best. However, researchers pushing up against available computational resources will find it difficult to optimize a network many times.

One advantage of large neural networks is that fitting far more parameters than necessary tends to work better(!). One intuition is that there are many more ways to set the parameters to get low cost, so it’s less hard to find one good setting,[3] although it’s difficult to make rigorous statements on this issue. Understanding the difficulties that are faced in really high-dimensional optimization is an open area of research. (For example, https://arxiv.org/abs/1412.6544.)

3 Regularization by early stopping

We have referred to complex models that generalize poorly as “overfitted”. One idea to avoid “overfitting” is to fit less! That is, stop the optimization routine before it has found a local optimum of the cost function. This heuristic idea is often called “early stopping”.

The most common way to implement early stopping is to periodically monitor performance on a validation set. If the validation score is the best that we have seen so far, we save a copy of the network’s parameters. If the validation score fails to improve on that best value over some number of future checks (say 20), we stop the optimization and return the weights we’ve saved.
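
A sketch of that procedure (the names train_steps and validation_score are placeholders for whatever your training code provides, not a particular library’s API):

```python
import copy

def fit_with_early_stopping(params, train_steps, validation_score,
                            patience=20, steps_per_check=100):
    """Return the parameters from the best validation check seen so far."""
    best_score = -float('inf')
    best_params = copy.deepcopy(params)
    checks_since_best = 0
    while checks_since_best < patience:
        params = train_steps(params, steps_per_check)   # e.g. 100 SGD/Adam updates (placeholder)
        score = validation_score(params)                # e.g. held-out log-likelihood (placeholder)
        if score > best_score:
            best_score, best_params = score, copy.deepcopy(params)
            checks_since_best = 0
        else:
            checks_since_best += 1
    return best_params
```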

David MacKay’s textbook mentions early stopping (Section 39.4, p479). This book points out that stopping the optimizer prevents the weights from growing too large. Goodfellow et al.’s deep learning textbook (Chapter 7) makes a more detailed link to L2 regularization. MacKay argued that adding a regularization term to the cost function to achieve a similar effect seems more appealing: if we have a well-defined cost function, we’re not tied to a particular optimizer, and it’s probably easier to analyse what we’re doing.

However, I’ve found it hard to argue with early stopping as a pragmatic, sensible procedure. The heuristic directly checks whether continuing to fit is improving predictions for held-out data, which is what we care about. And we might save a lot of computer time by stopping early. Moreover, we can still use a regularized cost function along with early stopping.

4 Regularization by corrupting the data or model

There is a whole family of methods for regularizing models that involve adding noise to the data or model during training. As with early stopping, it can be hard to understand what objective we are fitting, and the effect of the regularizer can depend on which optimizer we are using. However, these methods are often effective…

Adding Gaussian noise to the inputs of a linear model during gradient training has the same average effect as L2 regularization.[4] We can add noise when training neural networks too. The procedure will still have a regularization effect, but one that’s harder to understand. We can also add noise to the weights or hidden units of a neural network. In some applications, adding noise has worked better than optimizing easy-to-define cost functions (like L2 regularizers).
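
To see the claim about the linear model, here is a quick sketch (following the cited references) for square loss, with input noise \(\boldsymbol{\epsilon}\) whose elements are independently drawn from \(\mathcal{N}(0,\sigma^2)\):
\[
\mathbb{E}_{\boldsymbol{\epsilon}}\!\left[\big(y - \bw^\top(\bx+\boldsymbol{\epsilon})\big)^2\right]
= \big(y - \bw^\top\bx\big)^2 - 2\big(y - \bw^\top\bx\big)\,\mathbb{E}\big[\bw^\top\boldsymbol{\epsilon}\big] + \mathbb{E}\big[(\bw^\top\boldsymbol{\epsilon})^2\big]
= \big(y - \bw^\top\bx\big)^2 + \sigma^2\,\bw^\top\bw,
\]
so on average the noisy-input cost is the original square error plus an L2 penalty \(\sigma^2\,\bw^\top\bw\) on the weights.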

Other regularization methods randomly set some of the hidden units, and hence their connections, to zero during training (“drop-out”[5]), or some of the input features to zero (as in “denoising auto-encoders”[6] or a 2006 feature-dropping regularizer). These heuristics prevent the model from fitting delicate combinations of parameters, or fits that depend on careful combinations of features. If used aggressively, “masking noise” makes it hard to fit anything! Often large models are needed when using these heuristics.
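
As an illustration, a minimal sketch of the masking idea (this is the “inverted dropout” variant that rescales the survivors at training time; the original paper instead rescales the weights at test time):

```python
import numpy as np

def dropout(h, p=0.5, rng=np.random.default_rng()):
    """Zero each hidden value independently with probability p, rescaling the survivors."""
    mask = rng.random(h.shape) >= p
    return mask * h / (1.0 - p)        # expected output equals h, so test-time behaviour matches

h = np.ones(10)
print(dropout(h))                      # roughly half the entries are zero, the rest are 2.0
```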

5 Further Reading

Most textbooks are long out-of-date when it comes to recent practical wisdom on fitting neural networks and regularization strategies. However, https://www.deeplearningbook.org/ is still fairly recent, and is a good starting point. The MLP notes are also more detailed on practical tips for deep nets.

Murphy discusses initializations in 13.4.5, non-saturating activation functions in 13.4.3 and regularization techniques in 13.5.

If you were to read about one more trick, perhaps it should be Batch Normalization (or “batch norm”), which is (just) “old” enough to be covered in the deep learning textbook. Like most ideas, it doesn’t always improve things, so experiments are required. And variants are still being actively explored.

The discussion in this note about initialization pointed out that we don’t want to saturate hidden units. Batch normalization shifts and scales the activations for a unit across a training batch to have a target mean and variance. When applying this trick, gradient-based training of neural nets often works better. In hindsight, it’s surprising that batch normalization is so recent: it’s a simple idea that someone could have come up with in an earlier decade, but didn’t.
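
The core of the idea only takes a few lines. A rough sketch (ignoring the learned scale-and-shift parameters and the running averages used at test time):

```python
import numpy as np

def batch_norm(A, eps=1e-5):
    """Standardize activations A, shape (batch_size, num_units), per unit across the batch."""
    mu = A.mean(axis=0)
    var = A.var(axis=0)
    return (A - mu) / np.sqrt(var + eps)   # each unit now has ~zero mean, unit variance

A = 3 + 10 * np.random.default_rng(2).standard_normal((32, 4))   # badly-scaled activations
print(batch_norm(A).std(axis=0))                                 # approximately [1, 1, 1, 1]
```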


  1. Adam (https://arxiv.org/abs/1412.6980) has now been popular for some time, although pure SGD is still in use too.

  2. https://www.inf.ed.ac.uk/teaching/courses/mlp/2019-20/lectures/mlp06-enc.pdf

  3. The high-level idea is old, but a recent (2018) analysis described the idea that some parts of a large network “get lucky” and identify good features as “The Lottery Ticket Hypothesis”, https://arxiv.org/abs/1803.03635

  4. Non-examinable: there’s a sketch in these slides: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec9.pdf. More detail in Bishop’s (1995) neural network textbook, section 9.3.

  5. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

  6. http://icml2008.cs.helsinki.fi/papers/592.pdf