The Data Science Lab

Neural Network Momentum Using Python

for j in range(self.ni):
  x_values[j] = trainData[idx, j]  # inputs are the first ni columns
for j in range(self.no):
  t_values[j] = trainData[idx, j+self.ni]  # targets are the last no columns
self.computeOutputs(x_values)
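
Note that there is an array-slicing shortcut idiom specific to Python that you could use here. A minimal sketch, assuming trainData, x_values and t_values are NumPy arrays as in the demo:

# alternative: NumPy slicing instead of the explicit loops (a sketch)
x_values = trainData[idx, 0:self.ni]                # first ni columns are inputs
t_values = trainData[idx, self.ni:self.ni+self.no]  # last no columns are targets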

I avoided the slicing idiom in the demo in case you want to refactor the code to a language that doesn't support slicing. The first step in back-propagation is to compute the output node signals:

# 1. compute output node signals
for k in range(self.no):
  derivative = (1 - self.oNodes[k]) * self.oNodes[k]
  oSignals[k] = derivative * (self.oNodes[k] - t_values[k])

The value of the variable named derivative assumes a neural network classifier that uses softmax output activation. To compute the signal, the demo assumes the goal is to minimize mean squared error and uses "output minus target" rather than "target minus output," which would change the sign of the weight delta.
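
If you used the "target minus output" convention instead (a hypothetical change, not what the demo does), the signal line would flip sign, and the weight deltas computed later would be added with a positive rather than a negative learning-rate factor:

# hypothetical: "target minus output" signal convention
oSignals[k] = derivative * (t_values[k] - self.oNodes[k])
# ...paired later with, for example: delta = +1.0 * learnRate * hoGrads[j,k]

Next, the hidden-to-output weight gradients are computed: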

# 2. compute hidden-to-output weight gradients
for j in range(self.nh):
  for k in range(self.no):
    hoGrads[j, k] = oSignals[k] * self.hNodes[j]

The output node signal is combined with the associated input from the associated hidden node to give the gradient. Next, the gradients for the output node biases are computed:

# 3. compute output node bias gradients
for k in range(self.no):
  obGrads[k] = oSignals[k] * 1.0

You can think of a bias as a special kind of weight that has a constant, dummy 1.0 associated input. Here I use an explicit 1.0 value only to emphasize that idea, so you can omit it if you wish. Next, the hidden node signals are computed:

# 4. compute hidden node signals
for j in range(self.nh):
  sum = 0.0
  for k in range(self.no):
    sum += oSignals[k] * self.hoWeights[j,k]
  derivative = (1 - self.hNodes[j]) * (1 + self.hNodes[j]) 
  hSignals[j] = derivative * sum

The demo neural network uses the tanh activation function for the hidden nodes. The local variable named derivative holds the calculus derivative of the tanh function, expressed in terms of the hidden node's output value. So, if you change the hidden node activation function, you'd have to change the calculation of this derivative variable.
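
For example, if the hidden nodes used logistic sigmoid activation instead of tanh (a hypothetical change, not part of the demo), only the derivative line would change:

# hypothetical: derivative for logistic sigmoid hidden nodes
derivative = self.hNodes[j] * (1 - self.hNodes[j])

Next, the input-to-hidden weight gradients and the hidden node bias gradients are calculated: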


# 5. compute input-to-hidden weight gradients
for i in range(self.ni):
  for j in range(self.nh):
    ihGrads[i, j] = hSignals[j] * self.iNodes[i]

# 6. compute hidden node bias gradients
for j in range(self.nh):
  hbGrads[j] = hSignals[j] * 1.0 

As before, a gradient is composed of a signal and an associated input term, and the dummy 1.0 input value for the hidden biases can be dropped. Notice that the code hasn't used momentum yet. After the gradients have been calculated, the weight and bias values can be updated in any order. First, the demo updates the input-to-hidden node weights:

# 1. update input-to-hidden weights
for i in range(self.ni):
  for j in range(self.nh):
    delta = -1.0 * learnRate * ihGrads[i,j]
    self.ihWeights[i,j] += delta
    self.ihWeights[i,j] += momentum * ih_prev_weights_delta[i,j]
    ih_prev_weights_delta[i,j] = delta 

After the weight delta is calculated, the weight is updated. Then a second update is performed using the value of the previous delta, scaled by the momentum factor. On the very first training iteration the previous delta still holds its initial value of 0.0, so the momentum term contributes nothing, which is fine. Before exiting the loop, the previous delta is replaced by the current delta value.
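
To make the pattern concrete, here is the same logic applied to a single scalar weight, using made-up illustrative values (not taken from the demo):

# momentum update on one weight, with illustrative values
learnRate = 0.05
momentum = 0.75
grad = -1.20         # current gradient for this weight
prev_delta = 0.0300  # delta applied on the previous iteration
w = 0.4000
delta = -1.0 * learnRate * grad  # basic delta = +0.0600
w += delta                       # w = 0.4600
w += momentum * prev_delta       # momentum boost: w = 0.4825
prev_delta = delta               # save for the next iteration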

Next, the hidden node biases are updated in the same way:

# 2. update hidden node biases
for j in range(self.nh):
  delta = -1.0 * learnRate * hbGrads[j]
  self.hBiases[j] += delta
  self.hBiases[j] += momentum * h_prev_biases_delta[j]
  h_prev_biases_delta[j] = delta  

Next, the hidden-to-output weights and the output node biases are updated using the same pattern:

# 3. update hidden-to-output weights
for j in range(self.nh):
  for k in range(self.no):
    delta = -1.0 * learnRate * hoGrads[j,k]
    self.hoWeights[j,k] += delta
    self.hoWeights[j,k] += momentum * ho_prev_weights_delta[j,k]
    ho_prev_weights_delta[j,k] = delta

# 4. update output node biases
for k in range(self.no):
  delta = -1.0 * learnRate * obGrads[k]
  self.oBiases[k] += delta
  self.oBiases[k] += momentum * o_prev_biases_delta[k]
  o_prev_biases_delta[k] = delta

Notice that all of the updates use the same fixed learning rate. Advanced versions of back-propagation use an adaptive learning rate that changes as training progresses. The main training loop finishes by updating the epoch counter and printing a progress message, and then method NeuralNetwork.train wraps up like so:

...
    epoch += 1
    if epoch % 10 == 0:
      mse = self.meanSquaredError(trainData)
      print("epoch = " + str(epoch) + " ms error = %0.4f " % mse)
  # end while

  result = self.getWeights()
  return result
# end train

Here, a progress message with the current mean squared error is displayed every 10 iterations. You might want to parameterize the interval. You can also print additional diagnostic information here. The final values of the weights and biases are fetched by class method getWeights and returned by method train as a convenience.
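
For example, the progress interval could be passed into the train method as a parameter instead of being hard-coded. A minimal sketch, where the name interval is mine, not part of the demo:

# sketch: configurable progress interval instead of the hard-coded 10
if epoch % interval == 0:
  mse = self.meanSquaredError(trainData)
  print("epoch = " + str(epoch) + " ms error = %0.4f " % mse)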

Wrapping Up
Momentum is almost always used when training a neural network with back-propagation. Exactly why momentum is so effective is a bit subtle. Suppose in the early stages of training a weight delta is calculated to be +4.32, and then on the next iteration the delta for the same weight is calculated to be +3.95. Training is "headed in the right direction" so to speak, and momentum adds a boost to the weight change, which makes training faster. Suppose in the middle stages of training the weight deltas start oscillating between positive and negative values. Using momentum will dampen the oscillation, making the resulting model more accurate.
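
For example, with a momentum factor of 0.50 (an arbitrary illustrative value), the second update would change the weight by 3.95 + (0.50 * 4.32) = +6.11 rather than just +3.95, a bigger step in the same direction.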

Momentum isn’t a cure-all. When training a neural network, you must experiment with different momentum factor values. In some situations, using no momentum (or equivalently, a momentum factor of 0.0) leads to better results than using momentum. However, in most scenarios, using momentum gives you faster training and better predictive accuracy.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
