Feedforward Neural Networks#
With all of the prior material established, I can now turn to the goal of this Jupyter Book: writing a general feedforward neural network (FNN) function. This function won't be particularly optimised - indeed, it's fairly slow in practice - but it does work.
Since neural networks are by their nature fairly modular, I'll be taking a modular approach to programming this function. I'll start by writing the code for the various activation functions, cost functions and their derivatives, weight initialisation functions, and functions to manage which ones to use given an input string.
From there, I’ll write an FNN function that uses all of these internally.
import numpy as np
import copy
from typing import Callable
import matplotlib.pyplot as plt
Activation Functions#
First, I’ll implement the activation functions and their derivatives.
Note that we don’t really need the softmax derivative as the derivative of the cost with respect to the output nodes’ weighted inputs is a known quantity. Because of this, there is a placeholder function that returns None (this is a little inefficient - but avoids me having to rewrite the whole network!):
def sigmoid(x: np.ndarray) -> np.ndarray:
"""
Element-wise sigmoid function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return 1.0 / (1.0 + np.exp(-x))
def deriv_sigmoid(x: np.ndarray) -> np.ndarray:
"""
Element-wise sigmoid derivative function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return sigmoid(x)*(1-sigmoid(x))
def softmax(x: np.ndarray) -> np.ndarray:
"""
Element-wise softmax function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
shiftx = x - np.max(x) #creates numerical stability w/out changing result
expx = np.exp(shiftx)
return expx / np.sum(expx, axis=0)
def deriv_softmax(x: np.ndarray) -> None:
"""
Placeholder function for softmax derivative. Not necessary because of
the formula for dCost/dZ for the output layer
"""
return None
def tanh(x: np.ndarray) -> np.ndarray:
"""
Element-wise tanh function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
    return np.tanh(x)  # equivalent to (e^x - e^-x)/(e^x + e^-x), but avoids overflow for large |x|
def deriv_tanh(x: np.ndarray) -> np.ndarray:
"""
Element-wise tanh derivative function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return 1 - np.power(tanh(x),2)
def relu(x: np.ndarray) -> np.ndarray:
"""
Element-wise ReLU function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return x * (x>=0)
def deriv_relu(x: np.ndarray) -> np.ndarray:
"""
Element-wise ReLU derivative function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return (x>=0).astype("int")
def linear(x: np.ndarray) -> np.ndarray:
"""
Element-wise linear function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return x
def deriv_linear(x: np.ndarray) -> np.ndarray:
"""
Element-wise linear derivative function
Args:
x: array of values
Returns: array of transformed values of the same shape as x
"""
return np.ones(x.shape)
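Before moving on, here's a quick sanity check of my own (not part of the network itself): comparing each analytical derivative against a central finite-difference approximation. The test points below are arbitrary values chosen to avoid the kink in ReLU at zero.
# Sketch of a derivative check: compare each hand-written derivative against
# a central finite-difference approximation at a few arbitrary test points.
test_x = np.array([-2.5, -1.0, -0.3, 0.4, 1.2, 2.7])
eps = 1e-6
for f, df in [(sigmoid, deriv_sigmoid), (tanh, deriv_tanh),
              (relu, deriv_relu), (linear, deriv_linear)]:
    numeric = (f(test_x + eps) - f(test_x - eps)) / (2 * eps)
    print(f.__name__, np.allclose(df(test_x), numeric, atol=1e-5))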
Cost Functions#
Next, the cost functions and their derivatives. We don't actually need to compute the cost itself to estimate a neural network - only its derivative enters the gradient calculations - but I think it's useful to demonstrate how it can be done.
As with softmax, we also don’t need the derivative of cross-entropy as the derivative of the cost with respect to the output nodes’ weighted inputs is a known quantity.
def mse(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
MSE cost function
Args:
yhat: Array of predicted values
y: Array of true values
Returns: An array of costs
"""
return np.power(yhat-y,2)
def deriv_mse(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
MSE cost derivative function
Args:
yhat: Array of predicted values
y: Array of true values
    Returns: An array of cost derivatives
"""
return 2*(yhat-y)
def binary_cross_entropy(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
Binary cross entropy cost function
Args:
yhat: Array of predicted values
y: Array of true values
Returns: An array of costs
"""
return -y * np.log(yhat) - (1-y)*np.log(1-yhat)
def deriv_binary_cross_entropy(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
Binary cross entropy cost derivatives function
Args:
yhat: Array of predicted values
y: Array of true values
Returns: An array of cost derivatives
"""
return -(y/yhat) + (1-y)/(1-yhat)
def cross_entropy(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
Cross entropy cost function
Args:
yhat: Array of predicted values
y: Array of true values
Returns: An array of costs
"""
return np.sum(-y*np.log(yhat),axis=0)
def cross_entropy_deriv(yhat: np.ndarray, y: np.ndarray) -> np.ndarray:
"""
Placeholder cross entropy cost derivatives function. Not required
in practice as softmax dZ term is known
Args:
yhat: Array of predicted values
y: Array of true values
Returns: An array of cost derivatives
"""
return None
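To make the claim above concrete, here's a small numerical illustration of my own: for a softmax output with a cross-entropy cost, the derivative of the cost with respect to the output layer's weighted inputs is simply softmax(z) - y, which is why neither deriv_softmax nor cross_entropy_deriv is ever actually called.
# Sketch: finite-difference check that dCost/dZ = softmax(z) - y for one
# (arbitrary) observation with a one-hot target.
z_demo = np.array([[0.2], [1.5], [-0.7]])  # weighted inputs of the output layer
y_demo = np.array([[0.0], [1.0], [0.0]])   # one-hot target
eps = 1e-6
numeric_dZ = np.zeros_like(z_demo)
for j in range(z_demo.shape[0]):
    z_plus, z_minus = z_demo.copy(), z_demo.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric_dZ[j] = (cross_entropy(softmax(z_plus), y_demo)
                     - cross_entropy(softmax(z_minus), y_demo)) / (2 * eps)
print(np.allclose(numeric_dZ, softmax(z_demo) - y_demo, atol=1e-5))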
Managing Activation and Cost Function Choice#
To aid in managing all of these, I also wrote some functions that return the relevant functions based on an input string. This means the final FNN function will take strings as inputs to describe the hidden and output activations. I could also have passed the functions directly, but this approach made the code a little clearer.
def get_activation(name: str) -> tuple[Callable, Callable]:
"""
A function to manage which activation function and
derivative to use given input string
Args:
name: the name of the activation function
Returns: The activation function and its derivative
"""
if name == "sigmoid":
return sigmoid, deriv_sigmoid
if name == "softmax":
return softmax, deriv_softmax
if name == "tanh":
return tanh, deriv_tanh
if name == "relu":
return relu, deriv_relu
if name == "linear":
return linear, deriv_linear
def get_cost(name: str) -> tuple[Callable, Callable]:
"""
A function to manage which cost function and
derivative to use given input string
Args:
name: the name of the cost function
Returns: The cost function and its derivative
"""
if name == "mse":
return mse, deriv_mse
if name == "bce":
return binary_cross_entropy, deriv_binary_cross_entropy
if name == "ce":
return cross_entropy, cross_entropy_deriv
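For illustration (nothing that follows depends on this), retrieving functions by name works like this:
# Illustrative usage of the dispatch helpers
act, act_deriv = get_activation("sigmoid")
cost, cost_der = get_cost("bce")
print(act(np.zeros((2, 1))))                    # sigmoid(0) = 0.5
print(cost(np.array([0.9]), np.array([1.0])))   # -log(0.9), approx. 0.105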
Weight Initialisation#
An important step in setting up a neural network is initialising the weights. The network won't be successfully estimated if we initialise all of the weights at 0, and it's possible to initialise the weights in such a way that exploding or vanishing gradients are encouraged from the start. To avoid this, I've implemented some established weight initialisation schemes here:
def xavier_init(shape: tuple[int]) -> np.ndarray:
"""
Xavier initialisation of weights
Args:
shape: tuple giving the shape of the weights
Returns: np.ndarray of initialised weights
"""
num_input = shape[1]
num_output = shape[0]
W = np.random.randn(num_output, num_input) / num_input
return W
def he_init(shape: tuple[int]) -> np.ndarray:
"""
He initialisation of weights
Args:
shape: tuple giving the shape of the weights
Returns: np.ndarray of initialised weights
"""
num_input = shape[1]
num_output = shape[0]
W = np.random.randn(num_output, num_input) * np.sqrt(2/num_input)
return W
def linear_init(shape: tuple[int]) -> np.ndarray:
"""
Simple initialisation of weights
Args:
shape: tuple giving the shape of the weights
Returns: np.ndarray of initialised weights
"""
num_input = shape[1]
num_output = shape[0]
W = np.random.randn(num_output, num_input)/num_input
return W
def get_init(name: str) -> Callable:
"""
A function to manage which weights initalisation
to use
Args:
name: the name of the activation function
Returns: An initialisation function
"""
if name == "sigmoid" or name == "softmax" or name == "tanh":
return xavier_init
if name == "relu":
return he_init
if name == "linear":
return linear_init
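As a quick check of my own that the initialisers behave as intended, the empirical standard deviation of He-initialised weights should be close to sqrt(2/num_input):
# Sketch: check that get_init dispatches correctly and that He initialisation
# gives weights with standard deviation close to sqrt(2 / num_input).
np.random.seed(0)
W_demo = get_init("relu")((256, 512))    # He init for a layer with 512 inputs, 256 outputs
print(W_demo.shape)                      # (256, 512)
print(round(np.std(W_demo), 4), round(np.sqrt(2 / 512), 4))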
Forward Propagation#
Here I implement the forward propagation algorithm. This takes the input features matrix X, the weights, the biases, and the activation functions, and propagates the data forward through the network to produce a prediction.
Note that since A_list stores X as its first entry (for ease of computation in both the forward and backward passes), its indices are offset by one relative to Z_list: A_list[i] is layer i's activation (with A_list[0] = X), while Z_list[i] is layer i+1's weighted input. So on the first pass of the loop (i=0), A_list[0] is X and Z_list[0] is the first layer's weighted input.
def forward_pass(X: np.ndarray,
W_list: list[np.ndarray],
B_list: list[np.ndarray],
activation_hidden: Callable,
activation_output: Callable,
num_pass: int):
"""
A function that performs forward-propagation through a network
Args:
X: matrix of input features
    W_list: A list of num_features_out by num_features_in weight
        matrices, where num_features_out is the number of outputs
        of the layer and num_features_in is the number of inputs to
        the layer
    B_list: A list of num_features_out by 1 bias vectors, where
        num_features_out is the number of outputs of the layer
    activation_hidden: Activation function for hidden layers
    activation_output: Activation function for the output layer
    num_pass: Number of forward passes
    Returns: Z_list, a list of layer weighted inputs, and A_list, a list of
        layer activations. A_list additionally contains X as its first entry
"""
Z_list = []
A_list = [X] #for use in backprop
for i in range(num_pass):
Z_list.append(np.dot(W_list[i],A_list[i]) + B_list[i])
if i < (num_pass-1):
A_list.append(activation_hidden(Z_list[i]))
elif i == (num_pass-1):
A_list.append(activation_output(Z_list[i]))
return Z_list, A_list
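To illustrate the indexing described above, here's a toy example of my own: a network with 3 input features, one hidden layer of 4 nodes, and 2 outputs, run on 5 observations. Note how A_list has one more entry than Z_list, with X at the front.
# Toy forward pass to illustrate the shapes and the A_list/Z_list offset
np.random.seed(1)
X_toy = np.random.randn(3, 5)                        # 3 features, 5 observations
W_toy = [xavier_init((4, 3)), xavier_init((2, 4))]   # 3 -> 4 -> 2 network
B_toy = [np.zeros((4, 1)), np.zeros((2, 1))]
Z_toy, A_toy = forward_pass(X=X_toy, W_list=W_toy, B_list=B_toy,
                            activation_hidden=tanh, activation_output=linear,
                            num_pass=2)
print([a.shape for a in A_toy])   # [(3, 5), (4, 5), (2, 5)] - A_toy[0] is X_toy
print([z.shape for z in Z_toy])   # [(4, 5), (2, 5)]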
Back Propagation#
And finally, here I implement the backpropagation algorithm. Note the slightly different computation of dCost/dZ for the output layer depending on whether softmax is used.
As with forward propagation, A_list contains X as an additional first element, so its indices behave slightly differently from those of the other lists.
def back_pass(Y: np.ndarray,
W_list: list[np.ndarray],
Z_list: list[np.ndarray],
A_list: list[np.ndarray],
cost_deriv: Callable,
activation_hidden_deriv: Callable,
activation_output_deriv: Callable,
num_pass: int,
num_obs: int,
softmax: bool):
"""
A function to perform backpropagation through the network
Args:
Y: array of output variables
W_list: list of layer weights
Z_list: list of layer weighted inputs
A_list: list of layer activations
cost_deriv: Cost derivative function
activation_hidden_deriv: Hidden layer activation func derivative
activation_output_deriv: Output layer activation func derivative
num_pass: Number of forward/backward passes
    num_obs: Number of training examples/observations
    softmax: Whether the output layer uses a softmax activation
Returns: dW_list, a list of weight derivatives, and dB_list, a list of
bias derivatives
"""
    if softmax:
        dZ = A_list[num_pass] - Y
    else:
        dZ = np.multiply(cost_deriv(A_list[num_pass], Y), activation_output_deriv(Z_list[num_pass-1]))
dW_list = [np.dot(dZ, A_list[num_pass-1].T)/num_obs]
dB_list = [np.sum(dZ, axis=1, keepdims=True)/num_obs]
for i in range(num_pass-1, 0, -1):
dZ = np.multiply(np.dot(W_list[i].T, dZ), activation_hidden_deriv(Z_list[i-1]))
dW_list = [np.dot(dZ, A_list[i-1].T)/num_obs] + dW_list
dB_list = [np.sum(dZ, axis=1, keepdims=True)/num_obs] + dB_list
return dW_list, dB_list
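Backpropagation is notoriously easy to get subtly wrong, so here's a finite-difference gradient check of my own (a sketch on a made-up toy network, not part of the original code): perturb a single weight, recompute the cost that the fnn function below uses (the summed MSE divided by the number of observations), and compare the result against the corresponding entry of dW_list.
# Sketch of a gradient check on a toy 3 -> 4 -> 2 network with an MSE cost
np.random.seed(2)
n_obs = 5
X_chk = np.random.randn(3, n_obs)
Y_chk = np.random.randn(2, n_obs)
W_chk = [xavier_init((4, 3)), xavier_init((2, 4))]
B_chk = [np.zeros((4, 1)), np.zeros((2, 1))]

def total_cost(W_list):
    # Same overall cost as used in the fnn function below: sum of MSE / num_obs
    _, A = forward_pass(X=X_chk, W_list=W_list, B_list=B_chk,
                        activation_hidden=tanh, activation_output=linear,
                        num_pass=2)
    return np.sum(mse(A[2], Y_chk)) / n_obs

Z_chk, A_chk = forward_pass(X=X_chk, W_list=W_chk, B_list=B_chk,
                            activation_hidden=tanh, activation_output=linear,
                            num_pass=2)
dW_chk, dB_chk = back_pass(Y=Y_chk, W_list=W_chk, Z_list=Z_chk, A_list=A_chk,
                           cost_deriv=deriv_mse, activation_hidden_deriv=deriv_tanh,
                           activation_output_deriv=deriv_linear,
                           num_pass=2, num_obs=n_obs, softmax=False)

# Numerically estimate the gradient of one weight and compare
eps = 1e-6
W_plus, W_minus = copy.deepcopy(W_chk), copy.deepcopy(W_chk)
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric_grad = (total_cost(W_plus) - total_cost(W_minus)) / (2 * eps)
print(np.isclose(numeric_grad, dW_chk[0][1, 2]))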
A Feedforward Neural Network Function#
With all of the above functions implemented, it's time to put them all together. This function does a few things:
Puts data in desired format (num_features by num_observations, instead of vice versa)
Uses the functions above to prepare the activation functions, cost function, and their derivatives
Runs a training loop. During a single cycle of the loop, it will:
Forward-propagate the data to calculate the weighted inputs and activations
Back-propagate to compute the gradients of the weights and the biases
Update the weights according to gradient descent
This loop has two break conditions:
The max number of iterations is reached
The change in weights becomes sufficiently small
Finally, the function produces three outputs, two of which I'll use in this Jupyter Book:
The network’s predicted values for the input data
The overall cost of the network
A function built from the forward propagation function above that will generate predictions for new data
There's no reason I couldn't have included other outputs, such as the estimated weights, the final difference, or the final number of iterations. I don't make use of the cost output below, but since the cost function isn't actually needed to estimate the network, returning the cost stops it from being entirely superfluous to requirements.
def fnn(X,
Y,
hidden,
learn_rate=0.03,
hidden_activation="relu",
output_activation="linear",
loss="mse",
tol=0.0001,
max_iter=10000):
"""
    A feedforward neural network function that allows an arbitrary number of
    hidden layers and nodes per hidden layer
Args:
X: A num_obs by num_features input matrix
Y: A num_obs by num_outputs matrix
hidden: A list of length num_hidden_layers where the nth entry describes
the number of nodes for the nth hidden layer.
learn_rate: Learning rate of the network.
hidden_activation: String denoting activation function for hidden layers.
output_activation: String denoting activation function for output layer.
loss: String denoting the loss function
tol: Minimum difference for break criteria in the training loop
    max_iter: Maximum number of training iterations.
    Returns: the network's predictions for X, the overall cost, and a function
        that generates predictions for new data
"""
# Data will be aligned in a different format to the network
X = X.T
Y = Y.T
# Initialise network: core numbers
shape_X = X.shape
num_obs = shape_X[1]
indices = list(range(num_obs))
num_features = shape_X[0]
shape_Y = Y.shape
num_output = shape_Y[0]
num_hidden = len(hidden)
num_pass = num_hidden + 1 #passes between layers
# Initialise network: weights W:
hidden_init, output_init = get_init(hidden_activation), get_init(output_activation)
W_list = [hidden_init((hidden[0], num_features))]
for i in range(1, num_hidden):
W_list.append(hidden_init((hidden[i], hidden[i-1])))
W_list.append(output_init((num_output, hidden[num_hidden-1])))
# Initialise network: biases B
B_list = []
for i in range(num_hidden):
B_list.append(np.zeros((hidden[i],1)))
B_list.append(np.zeros((num_output, 1)))
# Initalise network: activation + cost funcs and their derivatives
activation_hidden, activation_hidden_deriv = get_activation(hidden_activation)
activation_output, activation_output_deriv = get_activation(output_activation)
cost_func, cost_deriv = get_cost(loss)
softmax_bool = output_activation == "softmax"
# Training loop
iter = 0
diff = np.inf
while (iter < max_iter and diff > tol):
# Compute forward pass
Z_list, A_list = forward_pass(X=X,
W_list=copy.deepcopy(W_list),
B_list=copy.deepcopy(B_list),
activation_hidden=activation_hidden,
activation_output=activation_output,
num_pass=num_pass)
# Backpropagate to find gradients
dW_list, dB_list = back_pass(cost_deriv=cost_deriv,
Y=copy.deepcopy(Y),
activation_output_deriv=activation_output_deriv,
activation_hidden_deriv=activation_hidden_deriv,
W_list=copy.deepcopy(W_list),
A_list=copy.deepcopy(A_list),
Z_list=copy.deepcopy(Z_list),
num_pass=num_pass,
num_obs=num_obs,
softmax=softmax_bool)
# Update params (w_new = w_old - learn_rate * gradient(w_old))
W_old = copy.deepcopy(W_list)
B_old = copy.deepcopy(B_list)
new_diff = 0
for i in range(num_pass):
W_list[i] = W_list[i] - (learn_rate * dW_list[i])
B_list[i] = B_list[i] - (learn_rate * dB_list[i])
new_diff += np.sum(np.abs(W_list[i] - W_old[i])) + np.sum(np.abs(B_list[i] - B_old[i]))
diff = new_diff
# Increment iterations
iter += 1
# Predicted values of the network
prediction = A_list[num_pass].T
# Cost of the network
cost = np.sum(cost_func(A_list[num_pass], Y)) / num_obs
# Prediction function for new data
def fnn_pred_func(newdata):
new_X = newdata.T
_, A_list = forward_pass(X=new_X,
W_list=W_list,
B_list=B_list,
activation_hidden=activation_hidden,
activation_output=activation_output,
num_pass=num_pass)
return A_list[num_pass].T
# Return Yhat
return prediction, cost, fnn_pred_func
Example 1: Learning XOR#
With the network function implemented, it’s time to put it to the test. For the first test, I’ll run it on XOR. If you’re unfamiliar with XOR, it’s a logical operation read as ‘exclusive OR’. It takes two binary inputs. If one but not the other is true, then it returns true. If both inputs are false, or if both inputs are true, it returns false.
X = np.array([[0,0],
[0,1],
[1,0],
[1,1]])
Y = np.array([[0],[1],[1],[0]])
print(X)
print(Y)
[[0 0]
[0 1]
[1 0]
[1 1]]
[[0]
[1]
[1]
[0]]
The difficulty of XOR is that it's plainly non-linear in the inputs: without interaction terms, it's impossible to model with any kind of linear model. The power of neural networks, however, is that they can approximate essentially any function, including non-linear ones like XOR.
In this case, all we need is a single hidden layer with 2 neurons. With this, we should have enough to estimate a reasonably accurate network.
Since none of the inputs are negative, ReLU would be a poor choice of activation function: a neuron initialised with negative weights would be 'dead' from the start (unless an initialisation that prevented negative weights was used). I therefore chose a sigmoid activation for the hidden layer.
Similarly, since the output layer is binary, I used a sigmoid output activation and binary cross-entropy loss function.
Since all the possible feature combinations of XOR are known, I examined only the predicted output for the input data in this case.
np.random.seed(42)
pred, _, _ = fnn(X=X,
Y=Y,
hidden=[2],
learn_rate=0.9,
hidden_activation="sigmoid",
output_activation="sigmoid",
loss="bce")
print(np.round(pred,2))
[[0.]
[1.]
[1.]
[0.]]
As you can see, the network successfully estimates the correct values for XOR. This demonstrates the network's ability to approximate non-linear functions through the use of hidden layers and appropriate activation functions.
Example 2: Digit Classification#
XOR is, however, a fairly simple example - so let's turn to something more involved. I'm now going to test my FNN function on one of the sklearn toy datasets. In particular, I'm going to test it on the handwritten digits dataset.
This is a dataset of 8x8-pixel images of digits ranging from 0 to 9. I'll attempt to classify the images based on the individual pixel values. First, I'll download the data and split it into training and test sets:
# Download data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
data, target = load_digits(n_class=10, return_X_y=True, as_frame=False)
target = target.reshape((target.shape[0],1))
# Split data
indices = list(range(data.shape[0]))
train_indices, test_indices = train_test_split(indices, test_size=0.2, stratify=target)
train_X = data[train_indices,:]
train_Y = target[train_indices,:]
test_X = data[test_indices,:]
test_Y = target[test_indices,:]
# One-hot encode the targets
num_classes = 10
def one_hot(vec):
oh = np.eye(num_classes)[vec].reshape((vec.shape[0],-1))
return oh
train_Y_oh = one_hot(train_Y)
test_Y_oh = one_hot(test_Y)
With the data prepared, I’ll now train a network on it. Unlike XOR, there is not a limited set of possible feature combinations, so this time the emphasis will be on how well the model predicts new data. I thus focus on the prediction function output.
Since this dataset is more complex (64 features, 10 classes), I used a more complex network setup: a network with two hidden layers. The first hidden layer contains 20 neurons and the second 10 neurons, for a total of 30 hidden-layer neurons.
For similar reasons as in the XOR case, I used a sigmoid activation function for the hidden layers. This time, however, since the output is not one of two but one of many classes, I used a softmax output activation and a cross-entropy loss function.
np.random.seed(3298)
_, _, pred_func = fnn(X=train_X,
Y=train_Y_oh,
hidden=[20,10],
learn_rate=0.9,
hidden_activation="sigmoid",
output_activation="softmax",
loss="ce",
max_iter=10000)
With the network estimated, I'll now get its predictions. Since I don't want to compare against raw probabilities, I'll take the class the network considers likeliest for each test observation:
prediction = pred_func(test_X)
prediction_vec = prediction.argmax(1,keepdims=True)
To visualise this, let’s look at the first image in the test dataset:
img = test_X[0]
plt.figure(figsize=(2,2))
plt.imshow(img.reshape(8,8), cmap='gray')
[Figure: 8x8 grayscale image of the first test digit]
And let’s compare this to our network’s prediction:
print(f"Predicted number: {prediction_vec[0][0]}")
Predicted number: 5
We got the correct value!
Of course, we don't want to look only at a single value. So let's compute the percentage of the labels that were correctly predicted:
np.sum(test_Y == prediction_vec) / len(test_Y)
0.95
95% of predictions are correct! Pretty good going as far as handwritten digit recognition goes.
This concludes the demonstration of the feedforward neural network function. It works for arbitrarily sized data and arbitrarily sized networks, although it will probably be fairly slow in practice for larger datasets and networks. The ultimate goal of this Jupyter Book has therefore been realised.