The Data Science Lab

Neural Network Back-Propagation Using Python

Listing 2: NeuralNetwork Class Structure
class NeuralNetwork:
  def __init__(self, numInput, numHidden, numOutput, seed): ... 
  def setWeights(self, weights): ... 
  def getWeights(self): ... 
  def initializeWeights(self): ... 
  def computeOutputs(self, xValues): ...
  def train(self, trainData, maxEpochs, learnRate): ... 
  def accuracy(self, tdata): ...
  def meanSquaredError(self, tdata): ...  

  @staticmethod
  def hypertan(x): ...

  @staticmethod	  
  def softmax(oSums): ...

  @staticmethod
  def totalWeights(nInput, nHidden, nOutput): ...

# end class NeuralNetwork
----------------------------------------------------------------------------------

The NeuralNetwork.train method implements the back-propagation algorithm. The definition begins:

def train(self, trainData, maxEpochs, learnRate):
  hoGrads = np.zeros(shape=[self.nh, self.no], dtype=np.float32)
  obGrads = np.zeros(shape=[self.no], dtype=np.float32)
  ihGrads = np.zeros(shape=[self.ni, self.nh], dtype=np.float32)
  hbGrads = np.zeros(shape=[self.nh], dtype=np.float32)
...

Each weight and bias has an associated gradient. The prefix "ho" stands for "hidden-to-output." Similarly, "ob" means "output bias," "ih" means "input-to-hidden" and "hb" means "hidden bias." Class members ni, nh, and no are the number of input, hidden, and output nodes, respectively. When working with neural networks, it's common, but not required, to work with the float32 rather than float64 data type.
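
As a small illustration of the data type choice, the sketch below allocates the same gradient matrix in both types and compares storage size. The sizes (4 inputs, 5 hidden nodes) are assumed values for illustration only.

```python
import numpy as np

# Assumed sizes for illustration: 4 input nodes, 5 hidden nodes.
ni, nh = 4, 5

ihGrads32 = np.zeros(shape=[ni, nh], dtype=np.float32)
ihGrads64 = np.zeros(shape=[ni, nh], dtype=np.float64)

# Same shape either way; float32 stores 4 bytes per value, float64 stores 8.
print(ihGrads32.shape, ihGrads32.nbytes)  # (4, 5) 80
print(ihGrads64.shape, ihGrads64.nbytes)  # (4, 5) 160
```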

Next, two scratch arrays are instantiated:

oSignals = np.zeros(shape=[self.no], dtype=np.float32)
hSignals = np.zeros(shape=[self.nh], dtype=np.float32)

Each hidden and output node has an associated signal that’s essentially a gradient without its input term. These arrays are mostly for coding convenience.

The main training loop is prepared like so:

epoch = 0
x_values = np.zeros(shape=[self.ni], dtype=np.float32)
t_values = np.zeros(shape=[self.no], dtype=np.float32)
numTrainItems = len(trainData)
indices = np.arange(numTrainItems)

The x_values and t_values arrays hold the feature values (sepal length and width, and petal length and width) and the target values (such as 1, 0, 0) of a single training item, respectively. The indices array holds the integers 0 through 119, one for each of the 120 training items, and is used to shuffle the order in which training items are processed. The training loop begins with:

while epoch < maxEpochs:
  self.rnd.shuffle(indices)
  for ii in range(numTrainItems):
    idx = indices[ii]
...
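
The shuffling step in the loop above can be tried in isolation. This is a minimal sketch that assumes the class's rnd member is a NumPy RandomState object (the constructor isn't shown here), and it uses a made-up training set size of 6.

```python
import numpy as np

rnd = np.random.RandomState(1)      # assumption: self.rnd is created like this
numTrainItems = 6                   # small made-up training set size
indices = np.arange(numTrainItems)

for epoch in range(2):
    rnd.shuffle(indices)            # in-place: same integers, new order
    print("epoch", epoch, "visit order:", indices)
```

Because the shuffle is in-place, every index still appears exactly once per epoch, so each training item is processed once per pass.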

The shuffle function uses the Fisher-Yates mini-algorithm to scramble the order of the training indices, so on each pass through the inner loop, variable idx refers to the current training item being processed. Inside the main loop, the input and target values are peeled off the current training item, and then the output node values are computed using the input values and the current weights and bias values:

for j in range(self.ni):
  x_values[j] = trainData[idx, j]	
for j in range(self.no):
  t_values[j] = trainData[idx, j+self.ni]
self.computeOutputs(x_values) 

Note that instead of copying the input values from the matrix of training items into an intermediate x_values array and then transferring those values to the input nodes, you could copy the input values directly. The computeOutputs method stores and returns the output values, but the explicit return value is ignored here.
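
If you prefer, the two copy loops can be replaced by NumPy slicing. The sketch below uses two made-up training rows and assumes the layout described above (ni feature columns followed by no target columns).

```python
import numpy as np

ni, no = 4, 3   # assumed sizes: four features, three target values

# Two made-up training rows: features first, then one-hot target values.
trainData = np.array([
    [5.1, 3.5, 1.4, 0.2, 1, 0, 0],
    [7.0, 3.2, 4.7, 1.4, 0, 1, 0]], dtype=np.float32)

idx = 1
x_values = trainData[idx, 0:ni]         # the first ni columns
t_values = trainData[idx, ni:ni + no]   # the next no columns

print(x_values.shape, t_values.shape)   # (4,) (3,)
```

Keep in mind that basic slices are views into trainData rather than copies, so add .copy() if you intend to modify the values.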

The first step in back-propagation is to compute the output node signals:

# 1. compute output node signals
for k in range(self.no):
  derivative = (1 - self.oNodes[k]) * self.oNodes[k]
  oSignals[k] = derivative * (self.oNodes[k] - t_values[k])

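Recall that the derivative variable holds the derivative of the softmax activation function, here the derivative of each output node value with respect to its own pre-activation sum, which is (1 - output) * output. The oSignals variable includes that derivative and the output minus target value. Next, the hidden-to-output weight gradients are computed: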

# 2. compute hidden-to-output weight gradients using output signals
for j in range(self.nh):
  for k in range(self.no):
    hoGrads[j, k] = oSignals[k] * self.hNodes[j]

The output node signal is combined with the associated input from the associated hidden node to give the gradient. As I mentioned earlier, the oSignals array is mostly for convenience and you can compute the values into the hoGrads matrix directly if you wish. Next, the gradients for the output node biases are computed:

# 3. compute output node bias gradients using output signals
for k in range(self.no):
  obGrads[k] = oSignals[k] * 1.0

You can think of a bias as a special kind of weight that has a constant, dummy 1.0 associated input. Here, I use an explicit 1.0 value only to emphasize that idea, so you can omit it if you wish. Next, the hidden node signals are computed:

# 4. compute hidden node signals
for j in range(self.nh):
  sum = 0.0
  for k in range(self.no):
    sum += oSignals[k] * self.hoWeights[j,k]
  derivative = (1 - self.hNodes[j]) * (1 + self.hNodes[j]) 
  hSignals[j] = derivative * sum

This is the trickiest part of back-propagation. The sum variable accumulates the product of output node signals and hidden-to-output weights. This isn't at all obvious. You can find a good explanation of how this works by reading the Wikipedia article on back-propagation. Recall that the NeuralNetwork class has a hardcoded tanh hidden node activation function. The derivative variable holds the calculus derivative of the tanh function. So, if you change the hidden node activation function to logistic sigmoid or ReLU, you'd have to change the calculation of this derivative variable.
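
To make the activation swap concrete, here is a sketch of the three derivatives, each written in terms of the node's output value y (as the demo code does) rather than its input sum. The function names are hypothetical and are not part of the NeuralNetwork class.

```python
import numpy as np

def tanh_deriv(y):
    # y = tanh(z); d/dz tanh(z) = 1 - y^2 = (1 - y) * (1 + y)
    return (1.0 - y) * (1.0 + y)

def sigmoid_deriv(y):
    # y = 1 / (1 + exp(-z)); d/dz = y * (1 - y)
    return y * (1.0 - y)

def relu_deriv(y):
    # y = max(0, z); slope is 1 where y > 0, else 0 (0 at z = 0 by convention)
    return np.where(y > 0.0, 1.0, 0.0)

# Quick finite-difference sanity check at an arbitrary point z = 0.3.
z, h = 0.3, 1.0e-6
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2.0 * h)
print(np.isclose(tanh_deriv(np.tanh(z)), numeric))
```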

Next, the input-to-hidden weight gradients, and the hidden node bias gradients are calculated:

# 5. compute input-to-hidden weight gradients using hidden signals
for i in range(self.ni):
  for j in range(self.nh):
    ihGrads[i, j] = hSignals[j] * self.iNodes[i]

# 6. compute hidden node bias gradients using hidden signals
for j in range(self.nh):
  hbGrads[j] = hSignals[j] * 1.0 

As before, a gradient is composed of a signal and an associated input term, and the dummy 1.0 input value for the hidden biases can be dropped.
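
The six steps above can also be sketched with NumPy array operations instead of explicit loops. This is an equivalent formulation under assumed sizes and made-up node values, not the demo program's code; it uses the default float64 type for brevity.

```python
import numpy as np

rng = np.random.RandomState(0)          # arbitrary seed for made-up values
ni, nh, no = 4, 5, 3                    # assumed network sizes

iNodes = rng.rand(ni)                   # stand-in input node values
hNodes = np.tanh(rng.randn(nh))         # stand-in tanh hidden node values
oNodes = np.array([0.2, 0.5, 0.3])      # stand-in softmax outputs
t_values = np.array([0.0, 1.0, 0.0])    # one-hot target values
hoWeights = rng.randn(nh, no)           # stand-in hidden-to-output weights

# 1. output node signals
oSignals = (1 - oNodes) * oNodes * (oNodes - t_values)
# 2. hidden-to-output gradients: hoGrads[j, k] = hNodes[j] * oSignals[k]
hoGrads = np.outer(hNodes, oSignals)
# 3. output bias gradients (the dummy input is 1.0)
obGrads = oSignals.copy()
# 4. hidden node signals: sum over k of oSignals[k] * hoWeights[j, k],
#    times the tanh derivative
hSignals = (1 - hNodes) * (1 + hNodes) * (hoWeights @ oSignals)
# 5. input-to-hidden gradients: ihGrads[i, j] = iNodes[i] * hSignals[j]
ihGrads = np.outer(iNodes, hSignals)
# 6. hidden bias gradients (the dummy input is 1.0)
hbGrads = hSignals.copy()

print(hoGrads.shape, obGrads.shape, ihGrads.shape, hbGrads.shape)
# (5, 3) (3,) (4, 5) (5,)
```

The explicit loops are easier to follow and to modify; the array form is shorter and usually faster.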

If you imagine the input-to-output mechanism as going from left to right (input to hidden to output), the gradients must be computed from right to left (hidden-to-output gradients, then input-to-hidden gradients). After all the gradients have been computed, you can update the weights in either order. The demo program starts by updating the input-to-hidden weights:

# 1. update input-to-hidden weights
for i in range(self.ni):
  for j in range(self.nh):
    delta = -1.0 * learnRate * ihGrads[i,j]
    self.ihWeights[i, j] += delta 

The weight delta is the learning rate times the gradient. Here, I multiply by -1 and then add the delta because error is assumed to use (target - output)^2 and so the gradient has an (output - target) term. I used this somewhat awkward approach to follow the Wikipedia entry on back-propagation. Of course you can drop the multiply by -1 and just subtract the delta if you wish.
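
A quick check of that equivalence, using made-up weight and gradient values: adding a negated delta and subtracting the plain delta give the same result.

```python
import numpy as np

learnRate = 0.05
# Made-up values for illustration only.
ihWeights = np.array([[0.10, -0.20], [0.30, 0.40]], dtype=np.float32)
ihGrads = np.array([[0.5, -1.0], [0.0, 2.0]], dtype=np.float32)

w_add = ihWeights.copy()
w_add += -1.0 * learnRate * ihGrads   # style used in the demo: add a negative delta

w_sub = ihWeights.copy()
w_sub -= learnRate * ihGrads          # alternative: subtract the delta

print(np.allclose(w_add, w_sub))      # True
```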

Next, the hidden node biases are updated:

# 2. update hidden node biases
for j in range(self.nh):
  delta = -1.0 * learnRate * hbGrads[j]
  self.hBiases[j] += delta  

If you look at the loop structures carefully, you'll notice that you can combine updating the input-to-hidden weights and updating the hidden biases if you wish. Next, the hidden-to-output weights and the output node biases are updated using these statements:

# 3. update hidden-to-output weights
for j in range(self.nh):
  for k in range(self.no):
    delta = -1.0 * learnRate * hoGrads[j,k]
    self.hoWeights[j, k] += delta
			
# 4. update output node biases
for k in range(self.no):
  delta = -1.0 * learnRate * obGrads[k]
  self.oBiases[k] += delta 

Notice that all updates use the same learning rate. A more advanced weight-update technique called Adam ("adaptive moment estimation") was published in 2015. Adam adapts the effective learning rate for each weight and uses a few other tricks, and is considered state-of-the-art.
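
For comparison with the fixed-rate updates above, here is a minimal sketch of one Adam update, following the commonly published form of the algorithm. The function name adam_step and its default hyperparameter values are assumptions for illustration; none of this is part of the demo program.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learnRate=0.001,
              beta1=0.9, beta2=0.999, eps=1.0e-8):
    # Running averages of the gradient and of the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction; t is the 1-based update count.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Each weight gets its own effective step size.
    w = w - learnRate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
for t in range(1, 201):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, learnRate=0.05)
print(w)
```

Note the extra state: the m and v arrays must be kept for every weight and bias between updates, which the basic update rule doesn't need.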

The main training loop finishes by incrementing the epoch counter and printing a progress message, and then method NeuralNetwork.train wraps up, like so:

...
    epoch += 1
	  
    if epoch % 10 == 0:
      mse = self.meanSquaredError(trainData)
      print("epoch = " + str(epoch) + " ms error = %0.4f " % mse)
  # end while

  result = self.getWeights()
  return result
# end train

Here, a progress message is displayed every 10 epochs. You might want to parameterize the interval. You can also print additional diagnostic information here. The final values of the weights and biases are fetched by the getWeights method and returned by method train as a convenience.
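
One way to parameterize the reporting interval is sketched below. The helper name should_report and its interval parameter are hypothetical; in the class you'd more likely add an extra parameter to train and test it inline.

```python
def should_report(epoch, interval=10):
    # Report on every interval-th epoch; an interval of 0 disables reporting.
    return interval > 0 and epoch % interval == 0

for epoch in range(1, 31):
    if should_report(epoch, interval=10):
        print("epoch = " + str(epoch))
# prints epoch = 10, epoch = 20 and epoch = 30, one per line
```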

Wrapping Up
The Python language is too slow to create serious neural networks from scratch. But implementing a neural network in Python gives you a complete understanding of what goes on behind the scenes when you use a sophisticated machine learning library like CNTK or TensorFlow. And being able to implement a neural network from scratch lets you experiment with custom algorithms.

The version of back-propagation presented in this article is basic. In future articles I'll show you how to implement momentum and mini-batch training -- two important techniques that increase training speed. Another important variation is to use a different measure of error called cross entropy error.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
