Categories
Offsites

Transfer learning in TensorFlow 2 tutorial

In this post, I’m going to cover the very important deep learning concept called transfer learning. Transfer learning is the process whereby one uses neural network models trained in a related domain to accelerate the development of accurate models in your more specific domain of interest. For instance, a deep learning practitioner can use one of the state-of-the-art image classification models, already trained, as a starting point for their own, more specialized, image classification task. In this tutorial, I’ll be showing you how to perform transfer learning using an advanced, pre-trained image classification model – ResNet50 – to improve a more specific image classification task – the cats vs dogs classification problem. In particular, I’ll be showing you how to do this using TensorFlow 2. The code for this tutorial, in a Google Colaboratory notebook format, can be found on this site’s Github repository here. This code borrows some components from the official TensorFlow tutorial.


Eager to build deep learning systems? Get the book here


What are the benefits of transfer learning?

Transfer learning has many benefits, these are:

  1. It speeds up learning: For state of the art results in deep learning, one often needs to build very deep networks with many layers. In order to train such networks, one needs lots of data, computational power and time. These three things are often not readily available.
  2. It needs less data: As will be shown, transfer learning usually only adds a few extra layers to the pre-trained model, and the weights in the pre-trained model are generally fixed. Therefore, during the fine tuning of the model, only those few extra layers, or a small subset of the total number of layers, is subjected to training. This requires much less data to get good results.
  3. You can leverage the expert tuning of state-of-the-art models: As anyone who has been involved in building deep learning systems can tell you, it requires a lot of patience and tuning of the models to get the best results. By utilizing pre-trained, state-of-the-art models, you can skip a lot of this arduous work and rely on the efforts of experts in the field.

For these reasons, if you are performing some image recognition task, it may be worth using some of the pre-trained, state-of-the-art image classification models, like ResNet, DenseNet, InceptionNet and so on. How does one use these pre-trained models?

How to create a transfer learning model

To create a transfer learning model, all that is required is to take the pre-trained layers and “bolt on” your own network. This could be either at the beginning or end of the pre-trained model. Usually, one disables the pre-trained layer weights and only trains the “bolted on” layers which have been added. For image classification transfer learning, one usually takes the convolutional neural network (CNN) layers from the pre-trained model and adds one or more densely connected “classification” layers at the end (for more on convolutional neural networks, see this tutorial).  The pre-trained CNN layers act as feature extractors / maps, and the classification layer/s at the end can be “taught” to “interpret” these image features. The transfer learning model architecture that will be used in this example is shown below:  

 

Transfer learning TensorFlow 2 architecture

ResNet50 transfer learning architecture

The full ResNet50 model shown in the image above, in addition to a Global Average Pooling (GAP) layer, contains a 1000 node dense / fully connected layer which acts as a “classifier” of the 2048 (4 x 4) feature maps output from the ResNet CNN layers. For more on Global Average Pooling, see my tutorial. In this transfer learning task, we’ll be removing these last two layers (GAP and Dense layer) and replacing these with our own GAP and dense layer (in this example, we have a binary classification task – hence the output size is only 1). The GAP layer has no trainable parameters, but the dense layer obviously does – these will be the only parameters trained in this example. All of this is performed quite easily in TensorFlow 2, as will be shown in the next section.

Transfer learning in TensorFlow 2

In this example, we’ll be using the pre-trained ResNet50 model and transfer learning to perform the cats vs dogs image classification task. I’ll also train a smaller CNN from scratch to show the benefits of transfer learning. To access the image dataset, we’ll be using the tensorflow_datasets package which contains a number of common machine learning datasets. To load the data, the following commands can be run:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

A few things to note about the code snippet above. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. The variables cat_train, cat_valid and cat_test are TensorFlow Dataset objects – to learn more about these, check out my previous post. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

First, we’ll build a smaller CNN image classifier which will be trained from scratch.

A smaller CNN model

In the code below, a 3 x CNN layer head, a GAP layer and a final densely connected output layer is created. The Keras API, which is the encouraged approach for TensorFlow 2, is used in the model definition below. For more on Keras, see this and this tutorial.

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

average_pool = tf.keras.Sequential()
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

standard_model = tf.keras.Sequential([
    head, 
    average_pool
])

To train the model we run:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/standard_model', update_freq='batch')]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
               validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss function is ‘binary cross-entropy’, due to the fact that the cats vs dogs image classification task is a binary classification problem (i.e. 0 = cat, 1 = dog or vice-versa). Running the code above, after 7 epochs, gives a training accuracy of around 89% and a validation accuracy of around 85%. Next we’ll see how this compares to the transfer learning case.

ResNet50 transfer learning example

To download the ResNet50 model, you can utilize the tf.keras.applications object to download the ResNet50 model in Keras format with trained parameters. To do so, run the following code:

IMG_SHAPE = (IMAGE_SIZE, IMAGE_SIZE, 3)
res_net = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=IMG_SHAPE)

The weights argument ‘imagenet’ denotes that the weights to be used are those generated by being trained on the ImageNet dataset. The include_top argument states that we only want the CNN-feature maps part of the ResNet50 model – not its final GAP and dense connected layers. Finally, we need to specify what input shape we want the model being setup to receive. Next, we need to disable the training of the parameters within this Keras model. This is performed really easily:

res_net.trainable = False

Next we create a Global Average Pooling layer, along with a final densely connected output layer with sigmoid activation. Then the model is combined using the Keras sequential framework where Keras models can be chained together:

global_average_layer = layers.GlobalAveragePooling2D()
output_layer = layers.Dense(1, activation='sigmoid')
tl_model = tf.keras.Sequential([
  res_net,
  global_average_layer,
  output_layer
])

That’s all that’s required – TensorFlow 2 and Keras make many deep learning tasks quite easy. Running tl_model.summary() gives the following output:

Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712  
_________________________________________________________________
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2049      
=================================================================
Total params: 23,589,761
Trainable params: 2,049
Non-trainable params: 23,587,712
_________________________________________________________________

As can be observed, while the total number of parameters is large (i.e. 23 million) the number of trainable parameters, corresponding to the weights of the final output layer, is only 2,049. 

To train the model we run:

tl_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/transer_learning_model', update_freq='batch')]

tl_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
             validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Comparing the models

The graphs below from TensorBoard show the relative performance of the small CNN model trained from scratch and the ResNet50 transfer learning model:

Transfer learning TensorFlow 2 training accuracy comparison

Transfer learning training accuracy comparison (blue – ResNet50, pink – smaller CNN model)

Transfer learning TensorFlow 2 validation accuracy comparison

Transfer learning validation accuracy comparision (red – ResNet50 model, green – smaller CNN model)

The results above show that the ResNet50 model reaches higher levels of both training and validation accuracy much quicker than the smaller CNN model that was trained from scratch. This illustrates the benefit of using these powerful pre-trained models as a starting point for your more domain specific deep learning tasks. I hope this post has been a help and given you a good understanding of the benefits of transfer learning, and also how to implement it easily in TensorFlow 2.  


Eager to build deep learning systems? Get the book here

The post Transfer learning in TensorFlow 2 tutorial appeared first on Adventures in Machine Learning.

Categories
Offsites

Double Q reinforcement learning in TensorFlow 2

In previous posts (here and here), deep Q reinforcement learning was introduced. In these posts, examples were presented where neural networks were used to train an agent to act within an environment to maximize rewards. The neural network was trained using something called Q-learning. However, deep Q learning (DQN) has a flaw – it can be unstable due to biased estimates of future rewards, and this slows learning. In this post, I’ll introduce Double Q learning which can solve this bias problem and produce better Q-learning outcomes. We’ll be running a Double Q network on a modified version of the Cartpole reinforcement learning environment. We’ll also be developing the network in TensorFlow 2 – at the time of writing, TensorFlow 2 is in beta and installation instructions can be found here. The code examined in this post can be found here.


Eager to build deep learning systems in TensorFlow 2? Get the book here


A recap of deep Q learning

As mentioned above, you can go here and here to review deep Q learning. However, a quick recap is in order. The goal of the neural network in deep Q learning is to learn the function $Q(s_t, a_t; theta_t)$. At a given time in the game / episode, the agent will be in a state $s_t$. This state is fed into the network, and various Q values will be returned for each of the possible actions $a_t$ from state $s_t$. The $theta_t$ refers to the parameters of the neural network (i.e. all the weight and bias values).

The agent chooses an action based on an epsilon-greedy policy $pi$. This policy is a combination of randomly selected actions combined with the output of the deep Q neural network – with the probability of a randomly selected action decreasing over the training time. When the deep Q network is used to select an action, it does so by taking the maximum Q value returned over all the actions, for state $s_t$. For example, if an agent is in state 1, and this state has 4 possible actions which the agent can perform, it will output 4 Q values. The action which has the highest Q value is the action which will be selected. This can be expressed as:

$$a = argmax Q(s_t, a; theta_t)$$

Where the argmax is performed over all the actions / output nodes of the neural network. That’s how actions are chosen in deep Q learning. How does training occur? It occurs by utilising the Q-learning / Bellman equation. The equation looks like this:

$$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$

How does this read? For a given action a from state $s_{t}$, we want to train the network to predict the following:

  • The immediate reward for taking this action $r_{t+1}$, plus
  • The discounted reward for the best possible action in the subsequent state ($s_{t+1}$)

If we are successful in training the network to predict these values, the agent will consistently chose the action which gives the best immediate reward ($r_{t+1}$) plus the discounted future rewards of future states $gamma max_{{a}}Q(s_{t+1}, a;theta_t)$. The $gamma$ term is the discount term, which places less value on future reward than present rewards (but usually only marginally).

In deep Q learning, the game is repeatedly played and the states, actions and rewards are stored in memory as a list of tuples or an array – ($s_t$, $a$, $r_t$, $s_{t+1}$). Then, for each training step, a random batch of these tuples is extracted from memory and the $Q_{target}(s_t, a_t)$ is calculated and compared to the value produced from the current network $Q(s_t, a_t)$ – the mean squared difference between these two values is used as the loss function to train the neural network.

That’s a fairly brief recap of deep Q learning – for a more extended treatment see here and here. The next section will explain the problems with standard deep Q learning.

The problem with deep Q learning

The problem of deep Q learning has to do with the way it sets the target values:

$$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$

Namely, the issue is with the $max$ value. This part of the equation is supposed to estimate the value of the rewards for future actions if action is taken from the current state $s_t$. That’s a bit of a mouthful, but just consider it as trying to estimate the optimal future rewards $r_future$ if action a is taken.

The problem is that in many environments, there is random noise. Therefore, as an agent explores an environment, it is not directly observing or $r_future$, but something like $r + epsilon$, where $epsilon$ is the noise. In such an environment, after repeated playing of the game, we would hope that the network would learn to make unbiased estimates of the expected value of the rewards – so E[r]. If it can do this, we are in a good spot – the network should pick out the best actions for current and future rewards, despite the presence of noise.

This is where the $max$ operation is a problem – it produces biased estimates of the future rewards, not the unbiased estimates we require for optimal results. An example will help explain this better. Consider the environment below. The agent starts in state A and at each state can move left or right. The states C, D and F are terminal states – the game ends once these points are reached. The r values are the rewards the agent receives when transitioning from state to state.  

Deep Q network bias illustration - Double Q network tutorial

Deep Q network bias illustration

All the rewards are deterministic except for the rewards when transitioning from states B to C and B to D. The rewards for these transitions are randomly drawn from a normal distribution with a mean of 1 and a standard deviation of 4.

We know the expected rewards, E[r] from taking either action (B to C or B to D) is 1 – however, there is a lot of noise associated with these rewards. Regardless, on average, the agent should ideally learn to always move to the left from A, towards E and finally F where always equals 2.

Let’s consider the $Q_{target}$ expression for these cases. Let’s set $gamma$ to be 0.95. The $Q_target$ expression to move to the left from is: $Q_{target} = 0 + 0.95 * max([0, 2]) = 1.9$. The two action options from E are to either move right (r = 0) or left (r = 2). The maximum of these is obviously 2, and hence we get the result 1.9.

What about in the opposite direction, moving right from A? In this case, it is $Q_{target} = 0 + 0.95 * max([N(1, 4), N(1, 4)])$. We can explore the long term value of this “moving right” action by using the following code snippet:

import numpy as np

Ra = np.zeros((10000,))
Rc = np.random.normal(1, 4, 10000)
Rd = np.random.normal(1, 4, 10000)

comb = np.vstack((Ra, Rc, Rd)).transpose()

max_result = np.max(comb, axis=1)

print(np.mean(Rc))
print(np.mean(Rd))
print(np.mean(max_result))

Here a 10,000 iteration trial is created of what the $max$ term will yield in the long term of running a deep Q agent in this environment. Ra is the reward for moving back to the left towards A (always zero, hence np.zeros()). Rc and Rd are both normal distributions, with mean 1 and standard deviation of 4. Combining all these options together and taking the maximum for each trial gives us what the trial-by-trial $max$ term will be (max_result). Finally, the expected values (i.e. the means) of each quantity are printed. As expected, the mean of Rc and Rd are approximately equal to 1 – the mean which we set for their distributions. However, the expected value / mean from the $max$ term is actually around 3!

You can see the problem here. Because the $max$ term is always taking the maximum value from the random draws of the rewards, it tends to be positively biased and does not give a true indication of the expected values of the rewards for a move in this direction (i.e. 1). As such, an agent using the deep Q learning methodology will not chose the optimal action from (i.e. move left) but will rather tend to move right!

Therefore, in noisy environments, it can be seen that deep Q learning will tend to overestimate rewards. Eventually, deep Q learning will converge to a reasonable solution, but it is potentially much slower than it needs to be. A further problem occurs in deep Q learning which can cause instability in the training process. Consider that in deep Q learning the same network both choses the best action and determines the value of choosing said actions. There is a feedback loop here which can exacerbate the previously mentioned reward overestimation problem, and further slow down the learning process. This is clearly not ideal, and this is why Double Q learning was developed.

An introduction to Double Q reinforcement learning

The paper that introduced Double Q learning initially proposed the creation of two separate networks which predicted $Q^A$ and $Q^B$ respectively. These networks were trained on the same environment / problem, but were each randomly updated. So, say, 50% of the time, $Q^A$ was updated based on a certain random set of training tuples, and 50% of the time $Q^B$ was updated on a different random set of training tuples. Importantly, the update or target equation for network A had an estimate of the future rewards from network B – not itself. This new approach does two things:

  1. The A and B networks are trained on different training samples – this acts to remove the overestimation bias, as, on average, if network A sees a high noisy reward for a certain action, it is likely that network B will see a lower reward – hence the noise effects will cancel
  2. There is a decoupling between the choice of the best action and the evaluation of the best action

The algorithm from the original paper is as follows:

Original Double Q algorithm

Original Double Q algorithm

As can be observed, first an action is chosen from either $Q^A(s_t,.)$ or $Q^B(s_t,.)$ and the rewards, next state, action etc. are stored in the memory. Then either UPDATE(A) or UPDATE(B) is chosen randomly. Next, for the state $s_{t+1}$ (or s’ in the above) the predicted Q value for all actions from this state are taken from network A or B, and the action with the highest predicted Q value is chosen, a*. Note that, within UPDATE(A), this action is chosen from the output of the $Q^A$ network.

Next, you’ll see something interesting. Consider the update equation for $Q^A$ above – I’ll represent it in more familiar, neural network based notation below:

$$Q^A_{target} = r_{t+1} + gamma Q^B(s_{t+1}, a*)$$

Notice that, while the best action a* from the next state ($s_{t+1}$) is chosen from network A, the discounted reward for taking that future action is extracted from network B. This removes any bias associated with the $argmax$ from network A, and also decouples the choice of actions from the evaluation of the value of such actions (i.e. breaks the feedback loop). This is the heart of the Double Q reinforcement learning.

The Double DQN network

The same author of the original Double Q algorithm shown above proposed an update of the algorithm in this paper. This updated algorithm can still legitimately be called a Double Q algorithm, but the author called it Double DQN (or DDQN) to disambiguate. The main difference in this algorithm is the removal of the randomized back-propagation based updating of two networks A and B. There are still two networks involved, but instead of training both of them, only a primary network is actually trained via back-propagation. The other network, often called the target network, is periodically copied from the primary network. The update operation for the primary network in the Double DQN network looks like the following:

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$

Alternatively, keeping in line with the previous representation:

$$a* = argmax Q(s_{t+1}, a; theta_t)$$

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*); theta^-_t)$$

Notice that, as per the previous algorithm, the action a* with the highest Q value from the next state ($s_{t+1}$) is extracted from the primary network, which has weights $theta_t$. This primary network is also often called the “online” network – it is the network from which action decisions are taken. However, notice that, when determining $Q_{target}$, the discounted Q value is taken from the target network with weights $theta^-_t$. Therefore, the actions for the agent to take are extracted from the online network, but the evaluation of the future rewards are taken from the target network. So far, this is similar to the UPDATE(A) step shown in the previous Double Q algorithm.

The difference in this algorithm is that the target network weights ($theta^-_t$) are not trained via back-propagation – rather they are periodically copied from the online network. This reduces the computational overhead of training two networks by back-propagation. This copying can either be a periodic “hard copy”, where the weights are copied from the online network to the target network with no modification, or a more frequent “soft copy” can occur, where the existing target weight values and the online network values are blended. In the example which will soon follow, soft copying will be performed every training iteration, under the following rule:

$$theta^- = theta^- (1-tau) + theta tau$$

With $tau$ being a small constant (i.e. 0.05).

This DDQN algorithm achieves both decoupling between the action choice and evaluation, and it has been shown to remove the bias of deep Q learning. In the next section, I’ll present a code walkthrough of a training algorithm which contains options for both standard deep Q networks and Double DQNs.

A Double Q network example in TensorFlow 2

In this example, I’ll present code which trains a double Q network on the Cartpole reinforcement learning environment. This environment is implemented in OpenAI gym, so you’ll need to have that package installed before attempting to run or replicate. The code for this example can be found on this site’s Github repo.

First, we declare some constants and create the environment:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
LAMBDA = 0.0005
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

Notice the epsilon greedy policy parameters (MIN_EPSILON, MAX_EPSILON, LAMBDA) which dictate how long the exploration period of the training should last. GAMMA is the discount rate of future rewards. The final constant RANDOM_REWARD_STD will be explained later in more detail.

It can be observed that the CartPole environment has a state size of 4, and the number of actions available are extracted directly from the environment (there are only 2 of them). Next the primary (or online) network and the target network are created using the Keras Sequential API:

primary_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

target_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')

The code above is fairly standard Keras model definitions, with dense layers and ReLU activations, and He normal initializations (for further information, see these posts: Keras, ReLU activations and initialization). Notice that only the primary network is compiled, as this is the only network which will be trained via the Adam optimizer.

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(50000)

Next a generic Memory class object is created. This holds all the ($s_t$, a, $r_t$, $s_{t+1}$) tuples which are stored during training, and includes functionality to extract random samples for training. In this example, we’ll be using a Memory instance with a maximum sample buffer of 50,000 rows.

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

The function above executes the epsilon greedy action policy. As explained in previous posts on deep Q learning, the epsilon value is slowly reduced and the action selection moves from the random selection of actions to actions selected from the primary network. A final training function needs to be reviewed, but first we’ll examine the main training loop:

num_episodes = 1000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DoubleQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = False
steps = 0
for i in range(num_episodes):
    state = env.reset()
    cnt = 0
    avg_loss = 0
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, reward, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))

        loss = train(primary_network, memory, target_network if double_q else None)
        avg_loss += loss

        state = next_state

        # exponentially decay the eps value
        steps += 1
        eps = MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * steps)

        if done:
            avg_loss /= cnt
            print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.3f}, eps: {eps:.3f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', cnt, step=i)
                tf.summary.scalar('avg loss', avg_loss, step=i)
            break

        cnt += 1

Starting from the num_episodes loop, we can observe that first the environment is reset, and the current state of the agent returned. A while True loop is then entered into, which is only exited when the environment returns the signal that the episode has been completed. The code will render the Cartpole environment if the relevant variable has been set to True.

The next line shows the action selection, where the primary network is fed into the previously examined choose_network function, along with the current state and the epsilon value. This action is then fed into the environment by calling the env.step() command. This command returns the next state that the agent has entered ($s_{t+1}$), the reward ($r_{t+1}$) and the done Boolean which signifies if the episode has been completed.

The Cartpole environment is completely deterministic, with no randomness involved except in the initialization of the environment. Because Double Q learning is superior to deep Q learning especially when there is randomness in the environment, the Cartpole environment has been externally transformed into a stochastic environment on the next line. Normally, the reward from the Cartpole environment is a deterministic value of 1.0 for every step the pole stays upright. Here, however, the reward is replaced with a sample from a normal distribution, with mean 1.0 and standard deviation equal to the constant RANDOM_REWARD_STD.

In the first pass – RANDOM_REWARD_STD is set to 0.0 to transform the environment back to a deterministic case, but this will be changed in the next example run.

After this, the memory is added to and the primary network is trained.

Notice that the target_network is only passed to the training function if the double_q variable is set to True. If double_q is set to False, the training function defaults to standard deep Q learning. Finally the state is updated, and if the environment has signalled the episode has ended, some logging is performed and the while loop is exited.

It is now time to review the train function, which is where most of the work takes place:

def train(primary_network, memory, target_network=None):
    if memory.num_samples < BATCH_SIZE * 3:
        return 0
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    if target_network is not None:
        # update target network parameters slowly from primary network
        for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
            t.assign(t * (1 - TAU) + e * TAU)
    return loss

The first line is a bypass of this function if the memory does not contain more than 3 x the batch size – this is to ensure no training of the primary network takes place until there is a reasonable amount of samples within the memory.

Next, a batch is extracted from the memory – this is a list of tuples. The individual state, actions and reward values are then extracted and converted to numpy arrays using Python list comprehensions. Note that the next_state values are set to zeros if the raw next_state values are None – this only happens when the episode has terminated.

Next the sampled states ($s_t$) are passed through the network – this returns the values $Q(s_t, a; theta_t)$. The next line extracts the Q values from the primary network for the next states ($s_{t+1}$). Next, we want to start constructing our target_q values ($Q_target$). These are the “labels” which will be supplied to the primary network to train towards.

Note that the target_q values are the same as the prim_qt ($Q(s_t, a; theta_t)$) values except for the index corresponding to the action chosen. So, for instance, let’s say a single sample of the prim_qt values are [0.5, -0.5] – but the action chosen from $s_t$ was 0. We only want to update the 0.5 value  while training, the remaining values in target_q remain equal to prim_qt (i.e. [update, -0.5]). Therefore, in the next line, we create target_q by simply converting prim_qt from a tensor into its numpy equivalent. This is basically a copy of the values from prim_qt t0 target_q. We convert to numpy also, as it is easier to deal with indexing in numpy than TensorFlow at this stage.

To affect these updates, we create a new variable updates. The first step is to set the update values to the sampled rewards – the $r_{t+1}$ values are the same regardless of whether we are performing deep Q learning or Double Q learning. In the following lines, these update values will be added to in order to capture the discounted future reward terms. The next line creates an array called valid_idxs. This array is to hold all those samples in the batch which don’t include a case where next_state is zero. When next_state is zero, this means that the episode has terminated. In those cases, only the first term of the equation below remains ($r_{t+1}$):

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*); theta^-_t)$$

Seeing as update already includes the first term, any further additions to update need to exclude these indexes.

The next line, batch_idxs, is simply a numpy arange which counts out the number of samples within the batch. This is included to ensure that the numpy indexing / broadcasting to follow works properly.

The next line switches depending on whether Double Q learning has been enabled or not. If target_network is None, then standard deep Q learning ensures. In such a case, the following term is calculated and added to updates (which already includes the reward term):

$$gamma max Q(s_{t+1}, a; theta)$$

Alternatively, if target_network is not None, then Double Q learning is performed. The first line:

prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)

calculates the following equation shown earlier:

$$a* = argmax Q(s_{t+1}, a; theta_t)$$

The next line extracts the Q values from the target network for state $s_{t+1}$ and assigns this to variable q_from_target. Finally, the update term has the following added to it:

$$gamma Q(s_{t+1}, a*); theta^-_t)$$

Notice, that the numpy indexing extracts from q_from_target all the valid batch samples, and within those samples, all the highest Q actions drawn from the primary network (i.e. a*).

Finally, the target_q values corresponding to the actions from state $s_t$ are updated with the update array.

Following this, the primary network is trained on this batch of data using the Keras train_on_batch. The last step in the function involves copying the primary or online network values into the target network. This can be varied so that this step only occurs every X amount of training steps (especially when one is doing a “hard copy”). However, as stated previously, in this example we’ll be doing a “soft copy” and therefore every training step involves the target network weights being moved slightly towards the primary network weights. As can be observed, for every trainable variable in both the primary and target networks, the target network trainable variables are assigned new values updated via the previously presented formula:

$$theta^- = theta^- (1-tau) + theta tau$$

That (rather lengthy) explanation concludes the discussion of how Double Q learning can be implemented in TensorFlow 2. Now it is time to examine the results of the training.

Double Q results for a deterministic case

In the first case, we are going to examine the deterministic training case when RANDOM_REWARD_STD is set to 0.0. The TensorBoard graph below shows the results:

 

Double Q deterministic case

Double Q deterministic case (blue – Double Q, red – deep Q)

As can be observed, in both the Double Q and deep Q training cases, the networks converge on “correctly” solving the Cartpole problem – with eventual consistent rewards of 180-200 per episode (a total reward of 200 is the maximum available per episode in the Cartpole environment). The Double Q case shows slightly better performance in reaching the “solved” state than the deep Q network implementation. This is likely due to better stability in decoupling the choice and evaluation of the actions, but it is not a conclusive result in this rather simple deterministic environment.

However, what happens when we increase the randomness by elevated RANDOM_REWARD_STD > 0?

Double Q results for a stochastic case

The results below show the case when RANDOM_REWARD_STD is increased to 1.0 – in this case, the rewards are drawn from a random normal distribution of mean 1.0 and standard deviation of 1.0:

 

Double Q stochastic case

Double Q stochastic case (blue – Double Q, red – deep Q)

As can be seen, in this case, the Double Q network significantly outperforms the deep Q training methodology. This demonstrates the effect of biasing in the deep Q training methodology, and the advantages of using Double Q learning in your reinforcement learning tasks.

I hope this post was helpful in increasing your understanding of both deep Q and Double Q reinforcement learning. Keep an eye out for future posts on reinforcement learning.


Eager to build deep learning systems in TensorFlow 2? Get the book here


 

 

The post Double Q reinforcement learning in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Introduction to ResNet in TensorFlow 2

In previous tutorials, I’ve explained convolutional neural networks (CNN) and shown how to code them. The convolutional layer has proven to be a great success in the area of image recognition and processing in machine learning. However, state of the art techniques don’t involve just a few CNN layers. Rather, they can be very deep, consisting of 10s to >100 numbers of layers. One of the most successful CNN architectures developed has been the ResNet architecture. It was first introduced in 2015 (see this paper) and won the ILSVRC 2015 image classification task. The winning ResNet consisted of a whopping 152 layers, and in order to successfully make a network that deep, a significant innovation in CNN architecture was developed for ResNet. This innovation will be discussed in this post, and an example ResNet architecture will be developed in TensorFlow 2 and compared to a standard architecture. Because of the training requirements for this task, I have developed the code in Google Colaboratory (which gives free GPU time – see my tutorial here), and the notebook can be found on this site’s Github repository.


Eager to build deep learning systems in TensorFlow 2? Get the book here


Introduction to the ResNet architecture

The degradation problem

The vanishing gradient problem was an initial barrier to making neural networks deeper and more powerful. However, as explained in this post, the problem has now largely been solved through the use of ReLU activations and batch normalization. Given this is true, and given enough computational power and data, we should be able to stack many CNN layers and dramatically increase classification accuracy, right? Well – to a degree. An early architecture, called the VGG-19 architecture, had 19 layers. However, this is a long way off the 152 layers of the version of ResNet that won the ILSVRC 2015 image classification task. The reason deeper networks were not successful prior to the ResNet architecture was due to something called the degradation problem. Note, this is not the vanishing gradient problem, but something else. It was observed that making the network deeper led to higher classification errors. One might think this is due to overfitting of the data – but not so fast, the degradation problem leads to higher training errors too! Consider the diagrams below from the original ResNet paper:

Illustration of degradation problem that ResNet solves

Illustration of degradation problem that ResNet solves

Note that the 56-layer network has higher test and training errors. Theoretically, this doesn’t make much sense. Let’s say the 20-layer network learns some mapping H(x) that gives a training error of 10%. If another 36 layers are added, we would expect that the error would at least not be any worse than 10%. Why? Well, the 36 extra layers, at worst, could just learn identity functions. In other words, the extra 36 layers could just learn to pass through the output from the first 20-layers of the network. This would give the same error of 10%. This doesn’t seem to happen though. It appears neural networks aren’t great at learning the identity function in deep architectures. Not only don’t they learn the identity function (and hence pass through the 20 layer error rate), they make things worse. Beyond a certain number of layers, they begin to degrade the performance of the network compared to shallower implementations. Here is where the ResNet architecture comes in.

The ResNet solution

The ResNet solution relies on making the identity function option explicit in the architecture, rather than relying on the network itself to learn the identity function where appropriate. It consists of building networks which consist of the following CNN blocks:

ResNet building block

ResNet building block from here

In the diagram above, the input tensor x enters the building block. This input then splits. On one path, the input is processed by two stacked convolutional layers (called a “weight layer” in the above). This path is the “standard” CNN processing part of the building block. The ResNet innovation is the “identity” path. Here, the input x is simply added to the output of the CNN component of the building block, F(x). The output from the block is then F(x) + x with a final ReLU activation applied at the end. This identity path in the ResNet building block allows the neural network to more easily pass through any abstractions learnt in previous layers. Alternatively, it can more easily build incremental abstractions on top of the abstractions learnt in the previous layers. What do I mean by this? The diagram below may help:

ResNet layers and abstractions

Layers and abstractions

Generally speaking, as CNN layers are added to a network, the network during training will learn lower level abstractions in the early layers (i.e lines, colours, corners, basic shapes etc.) and higher level abstractions in the later layers (groups of geometries, objects etc.). Let’s say that, when trying to classify an aircraft in an image, there are some mid-level abstractions which reliably signal that an aircraft is present. Say the shape of a jet engine near a wing (this is just an example). These abstractions might be able to be learnt in, say, 10 layers.  

However, if we add an additional 20 or more layers after these first 10 layers, these reliable signals may get degraded / obfuscated. The ResNet architecture gives the network a more explicit chance of muting further CNN abstractions on some filters by driving F(x) to zero, with the output of the block defaulting to its input x. Not only that, the ResNet architecture allows blocks to “tinker” more easily with the input. This is because the block only has to learn the incremental difference between the previous layer abstraction and the optimal output H(x). In other words, it has to learn F(x) = H(x) – x. This is a residual expression, hence the name ResNet. This, theoretically at least, should be easier to learn than the full expression H(x).

An (somewhat tortured) analogy might assist here. Say you are trying to draw the picture of a tree. Someone hands you a picture of a pencil outline of the main structure of the tree – the trunk, large branches, smaller branches etc. Now say you are somewhat proud, and you don’t want too much help in drawing the picture. So, you rub out parts of the pencil outline of the tree that you were handed. You then proceed to add some detail to the picture you were handed, but you have to redraw parts that you already rubbed out. This is kind of like the case of a standard non-ResNet network. Because layers seem to struggle to reproduce an identity function, at each subsequent layer they essentially erase or degrade some of the previous level abstractions and these need to be re-estimated (at least to an extent).

Alternatively, you, the artist, might not be too proud and you happily accept the pencil outline that you received. It is much easier to then add new details to what you have already been given. This is like what the ResNet blocks do – they take what they are give i.e. x and just make tweaks to it by adding F(x). This analogy isn’t perfect, but it should give you an idea of what is going on here, and how the ResNet blocks help the learning along.

A full 34-layer version of ResNet is (partially) illustrated below (from the original paper):

ResNet architecture

ResNet-34 architecture (partial)

The diagram above shows roughly the first half of the ResNet 34-layer architecture, along with the equivalent layers of the VGG-19 architecture and a “plain” version of the ResNet architecture. The “plain” version has the same CNN layers, but lacks the identity path previously presented in the ResNet building block. These identity paths can be seen looping around every second CNN layer on the right hand side of the ResNet (“residual”) architecture.

In the next section, I’m going to show you how to build a ResNet architecture in TensorFlow 2/Keras. In the example, we’ll compare both the “plain” and “residual” networks on the CIFAR-10 classification task. Note that for computational ease, I’ll only include 10 ResNet blocks.

Building ResNet in TensorFlow 2

As discussed previously, the code for this example can be found on this site’s Github repository. Importing the CIFAR-10 dataset can be performed easily by using the Keras datasets API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import datetime as dt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

We then perform some pre-processing of the training and test data. This pre-processing includes image renormalization (converting the data so it resides in the range [0,1]) and centrally cropping the image to 75% of it’s normal extents. Data augmentation is also performed by randomly flipping the image about the centre axis. This is performed using the TensorFlow Dataset API – more details on the code below can be found in this, this post and my book.

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
train_dataset = train_dataset.repeat()

valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
valid_dataset = valid_dataset.repeat()

In this example, to build the network, we’re going to use the Keras Functional API, in the TensorFlow 2 context. Here is what the ResNet model definition looks like:

inputs = keras.Input(shape=(24, 24, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(3)(x)

num_res_net_blocks = 10
for i in range(num_res_net_blocks):
    x = res_net_block(x, 64, 3)

x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation='softmax')(x)

res_net_model = keras.Model(inputs, outputs)

First, we specify the input dimensions to Keras. The raw CIFAR-10 images have a size of (32, 32, 3) – but because we are performing central cropping of 75%, the post-processed images are of size (24, 24, 3). Next, we create 2 standard CNN layers, with 32 and 64 filters respectively (for more on convolutional layers, see this post and my book). The filter window sizes are 3 x 3, in line with the original ResNet architectures. Next some max pooling is performed and then it is time to produce some ResNet building blocks. In this case, 10 ResNet blocks are created by calling the res_net_block() function:

def res_net_block(input_data, filters, conv_size):
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
  x = layers.BatchNormalization()(x)
  x = layers.Conv2D(filters, conv_size, activation=None, padding='same')(x)
  x = layers.BatchNormalization()(x)
  x = layers.Add()([x, input_data])
  x = layers.Activation('relu')(x)
  return x

The first few lines of this function are standard CNN layers with Batch Normalization, except the 2nd layer does not have an activation function (this is because one will be applied after the residual addition part of the block). After these two layers, the residual addition part, where the input data is added to the CNN output (F(x)), is executed. Here we can make use of the Keras Add layer, which simply adds two tensors together. Finally, a ReLU activation is applied to the result of this addition and the outcome is returned.

After the ResNet block loop is finished, some final layers are added. First, a final CNN layer is added, followed by a Global Average Pooling (GAP) layer (for more on GAP layers, see here). Finally, we have a couple of dense classification layers with a dropout layer in between. This model was trained over 30 epochs and then an alternative “plain” model was also created. This was created by taking the same architecture but replacing the res_net_block function with the following function:

def non_res_block(input_data, filters, conv_size):
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
  x = layers.BatchNormalization()(x)
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(x)
  x = layers.BatchNormalization()(x)
  return x

Note that this function is simply two standard CNN layers, with no residual components included. The training code is as follows:

callbacks = [
  # Write TensorBoard logs to `./logs` directory
  keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")), write_images=True),
]

res_net_model.compile(optimizer=keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
res_net_model.fit(train_dataset, epochs=30, steps_per_epoch=195,
          validation_data=valid_dataset,
          validation_steps=3, callbacks=callbacks)

ResNet training and validation results

The accuracy results of the training of these two models can be observed below:

ResNet vs "plain" training accuracy

ResNet (red) vs “plain” (pink) training accuracy

 

ResNet vs "plain" testing accuracy

ResNet (blue) vs “plain” (green) training accuracy

As can be observed there is around a 5-6% improvement in the training accuracy from a ResNet architecture compared to the “plain” non-ResNet architecture. I have run this comparison a number of times and the 5-6% gap is consistent across the runs. These results illustrate the power of the ResNet idea, even for a relatively shallow 10 layer ResNet architecture. As demonstrated in the original paper, this effect will be more pronounced in deeper networks. Note that this network is not very well optimized, and the accuracy could be improved by running for more iterations. However, it is enough to show the benefits of the ResNet architecture. In future posts, I’ll demonstrate other ResNet-based architectures which can achieve even better results.  


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Introduction to ResNet in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Dueling Q networks in TensorFlow 2

In this post, we’ll be covering Dueling Q networks for reinforcement learning in TensorFlow 2. This reinforcement learning architecture is an improvement on the Double Q architecture, which has been covered here. In this tutorial, I’ll introduce the Dueling Q network architecture, it’s advantages and how to build one in TensorFlow 2. We’ll be running the code on the Open AI gym‘s CartPole environment so that readers can train the network quickly and easily. In future posts, I’ll be showing results on Atari environments which are more complicated. For an introduction to reinforcement learning, check out this post and this post. All the code for this tutorial can be found on this site’s Github repo.


Eager to build deep learning systems in TensorFlow 2? Get the book here


A recap of Double Q learning

As discussed in detail in this post, vanilla deep Q learning has some problems. These problems can be boiled down to two main issues:

  1. The bias problem: vanilla deep Q networks tend to overestimate rewards in noisy environments, leading to non-optimal training outcomes
  2. The moving target problem: because the same network is responsible for both the choosing of actions and the evaluation of actions, this leads to training instability

With regards to (1) – say we have a state with two possible actions, each giving noisy rewards. Action a returns a random reward based on a normal distribution with a mean of 2 and a standard deviation of 1 – N(2, 1). Action b returns a random reward from a normal distribution of N(1, 4). On average, action a is the optimal action to take in this state – however, because of the argmax function in deep Q learning, action will tend to be favoured because of the higher standard deviation / higher random rewards.

For (2) – let’s consider another state, state 1, with three possible actions a, b, and c. Let’s say we know that b is the optimal action. However, when we first initialize the neural network, in state 1, action tends to be chosen. When we’re training our network, the loss function will drive the weights of the network towards choosing action b. However, next time we are in state 1, the parameters of the network have changed to such a degree that now action is chosen. Ideally, we would have liked the network to consistently chose action in state 1 until it was gradually trained to chose action b. But now the goal posts have shifted, and we are trying to move the network from to instead of to b – this gives rise to instability in training. This is the problem that arises when you have the same network both choosing actions and evaluating the worth of actions.

To overcome this problem , Double Q learning proposed the following way of determining the target Q value: $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$ Here $theta_t$ refers to the primary network parameters (weights) at time t, and $theta^-_t$ refers to something called the target network parameters at time t. This target network is a kind of delayed copy of the primary network. As can be observed, the optimal action in state t + 1 is chosen from the primary network ($theta_t$) but the evaluation or estimate of the Q value of this action is determined from the target network ($theta^-_t$).

This can be shown more clearly by the equations below: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ By doing this two things occur. First, different networks are used to chose the actions and evaluate the actions. This breaks the moving target problem mentioned earlier. Second, the primary network and the target network have essentially been trained on different samples from the memory bank of states and actions (the target network is “trained” on older samples than the primary network). Because of this, any bias due to environmental randomness should be “smoothed out”. As was shown in my previous post on Double Q learning, there is a significant improvement in using Double Q learning instead of vanilla deep Q learning. However, a further improvement can be made on the Double Q idea – the Dueling Q architecture, which will be covered in the next section.

Dueling Q introduction

The Dueling Q architecture trades on the idea that the evaluation of the Q function implicitely calculates two quantities:

  • V(s) – the value of being in state s
  • A(s, a) – the advantage of taking action in state s

These values, along with the Q function, Q(s, a), are very important to understand, so we will do a deep dive of these concepts here. Let’s first examine the generalised formula for the value function V(s): $$V^{pi}(s) = mathbb{E} left[ sum_{i=1}^T gamma^{i – 1}r_{i}right]$$ The formula above means that the value function at state s, operating under a policy $pi$, is the summation of future discounted rewards starting from state s. In other words, if an agent starts at s, it is the sum of all the rewards the agent collects operating under a given policy $pi$. The $mathbb{E}$ is the expectation operator.

Let’s consider a basic example. Let’s assume an agent is playing a game with a set number of turns. In the second-to-last turn, the agent is in state s. From this state, it has 3 possible actions, with a reward of 10, 50 and 100 respectively. Let’s say that the policy for this agent is a simple random selection. Because this is the last set of actions and rewards in the game, due to the game finishing next turn, there are no discounted future rewards. The value for this state and the random action policy is: $$V^{pi}(s) = mathbb{E} left[randomleft(10, 50, 100)right)right] = 53.333$$ Now, clearly this policy is not going to produce optimum outcomes. However, we know that for the optimum policy, the value of this state would be: $$V^*(s) = max (10, 50, 100) = 100$$ If you recall, from Q learning theory, the optimal action in this state is: $$a* = argmax Q(s_{t+1}, a)$$ and the optimal Q value from this action in this state would be: $$Q(s, a^*) = max (10, 50, 100) = 100$$ Therefore, under the optimal (deterministic) policy we have: $$Q(s,a^*) = V(s)$$ However, what if we aren’t operating under the optimal policy (yet)? Let’s return to the case where our policy is  simple random action selection. In such a case, the Q function at state s could be described as (remember there are no future discounted rewards, and V(s) = 53.333): $$Q(s, a) =  V(s) + (-43.33, -3.33,  46.67) = (10, 50, 100)$$ The term (-43.33, -3.33, 46.67) under such an analysis is called the Advantage function A(s, a). The Advantage function expresses the relative benefits of the various actions possible in state s.  The Q function can therefore be expressed as: $$Q(s, a) = V(s) + A(s, a)$$ Under the optimum policy we have $A(s, a^*) = 0$, $V(s) = 100$ and therefore: $$Q(s, a) = V(s) + A(s, a) = 100 + (-90, -50, 0) = (10, 50, 100)$$ Now the question becomes, why do we want to decompose the Q function in this way? Because there is a difference between the value of a particular state s and the actions proceeding from that state. Consider a game where, from a given state s*, all actions lead to the agent dying and ending the game. This is an inherently low value state to be in, and who cares about the actions which one can take in such a state? It is pointless for the learning algorithm to waste training resources trying to find the best actions to take. In such a state, the Q values should be based solely on the value function V, and this state should be avoided. The converse case also holds – some states are just inherently valuable to be in, regardless of the effects of subsequent actions.

Consider these images taken from the original Dueling Q paper – showing the value and advantage components of the Q value in the Atari game Enduro:

Dueling Q - Atari Enduro - value and advantage highlights

Atari Enduro – value and advantage highlights

In the Atari Enduro game, the goal of the agent is to pass as many cars as possible. “Running into” a car slows the agent’s car down and therefore reduces the number of cars which will be overtaken. In the images above, it can be observed that the value stream considers the road ahead and the score. However, the advantage stream, does not “pay attention” to anything much when there are no cars visible. It only begins to register when there are cars close by and an action is required to avoid them. This is a good outcome, as when no cars are in view, the network should not be trying to determine which actions to take as this is a waste of training resources. This is the benefit of splitting value and advantage functions.

Now, you could argue that, because the Q function inherently contains both the value and advantage functions anyway, the neural network should learn to separate out these components regardless. Indeed, it may do. However, this comes at a cost. If the ML engineer already knows that it is important to try and separate these values, why not build them into the architecture of the network and save the learning algorithm the hassle? That is essentially what the Dueling Q network architecture does. Consider the image below showing the original architecture:

Dueling Q architecture

Dueling Q architecture

First, notice that the first part of architecture is common, with CNN input filters and a common Flatten layer (for more on convolutional neural networks, see this tutorial). After the flatten layer, the network bifurcates – with separate densely connected layers. The first densely connected layer produces a single output corresponding to V(s). The second densely connected layer produces n outputs, where is the number of available actions – and each of these outputs is the expression of the advantage function. These value and advantage functions are then aggregated in the Aggregation layer to produce Q values estimations for each possible action in state s. These Q values can then be trained to approach the target Q values, generated via the Double Q mechanism i.e.: $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$ The idea is that, through these separate value and advantage streams, the network will learn to produce accurate estimates of the values and advantages, improving learning performance. What goes on in the aggregation layer? One might think we could just add the V(s) and A(s, a) values together like so: $$Q(s, a) = V(s) + A(s, a)$$ However, there is an issue here and it’s called the problem of identifiabilty. This problem in the current context can be stated as follows: given Q, there is no way to uniquely identify V or A. What does this mean? Say that the network is trying to learn some optimal Q value for action a. Given this Q value, can we uniquely learn a V(s) and A(s, a) value? Under the formulation above, the answer is no.

Let’s say the “true” value of being in state s is 50 i.e. V(s) = 50. Let’s also say the “true” advantage in state s for action is 10. This will give a Q value, Q(s, a) of 60 for this state and action. However, we can also arrive at the same Q value for a learned V(s) of, say, 0, and an advantage function A(s, a) = 60. Or alternatively, a learned V(s) of -1000 and an advantage A(s, a) of 1060. In other words, there is no way to guarantee the “true” values of V(s) and A(s, a) are being learned separately and uniquely from each other. The commonly used solution to this problem is to instead perform the following aggregation function: $$Q(s,a) = V(s) + A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ Here the advantage function value is normalized with respect to the mean of the advantage function values over all actions in state s.

In TensorFlow 2.0, we can create a common “head” network, consisting of introductory layers which act to process the images or other environmental / state inputs. Then, two separate streams are created using densely connected layers which learn the value and advantage estimates, respectively. These are then combined in a special aggregation layer which calculates the equation above to finally arrive at Q values. Once the network architecture is specified in accordance with the above description, the training proceeds in the same fashion as Double Q learning. The agent actions can be selected either directly from the output of the advantage function, or from the output Q values. Because the Q values differ from the advantage values only by the addition of the V(s) value (which is independent of the actions), the argmax-based selection of the best action will be the same regardless of whether it is extracted from the advantage or the Q values of the network.

In the next section, the implementation of a Dueling Q network in TensorFlow 2.0 will be demonstrated.

Dueling Q network in TensorFlow 2

In this section we will be building a Dueling Q network in TensorFlow 2. However, the code will be written so that both Double Q and Dueling Q networks will be able to be constructed with the simple change of a boolean identifier. The environment that the agent will train in is Open AI Gym’s CartPole environment. In this environment, the agent must learn to move the cart platform back and forth in order to stop a pole falling too far below the vertical axis. While Dueling Q was originally designed for processing images, with its multiple CNN layers at the beginning of the model, in this example we will be replacing the CNN layers with simple dense connected layers. Because training reinforcement learning agents using images only (i.e. Atari RL environments) takes a long time, in this introductory post, only a simple environment is used for training the model. Future posts will detail how to efficiently train in Atari RL environments. All the code for this tutorial can be found on this site’s Github repo.

First of all, we declare some constants that will be used in the model, and initiate the CartPole environment:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
EPSILON_MIN_ITER = 5000
DELAY_TRAINING = 300
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

The MAX_EPSILON and MIN_EPSILON variables define the maximum and minimum values of the epsilon-greedy variable which will determine how often random actions are chosen. Over the course of the training, the epsilon-greedy parameter will decay from MAX_EPSILON gradually to MIN_EPSILON. The EPSILON_MIN_ITER value specifies how many training steps it will take before the MIN_EPSILON value is obtained. The DELAY_TRAINING constant specifies how many iterations should occur, with the memory buffer being filled, before training of the network is undertaken. The GAMMA value is the future reward discount value used in the Q-target equation, and TAU is the merging rate of the weight values between the primary network and the target network as per the Double Q learning algorithm. Finally, RANDOM_REWARD_STD is the standard deviation of the rewards that introduces some stochastic behaviour into the otherwise deterministic CartPole environment.

After the definition of all these constants, the CartPole environment is created and the state size and number of actions are defined.

Model definition

The next step in the code is to create a Keras model inherited class which defines the Double or Dueling Q network:

class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.dense1 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.dense1(input)
        x = self.dense2(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

Let’s go through the above line by line. First, a number of parameters are passed to this model as part of its initialization – these include the size of the hidden layers of the advantage and value streams, the number of actions in the environment and finally a Boolean variable, dueling, to specify whether the network should be a standard Double Q network or a Dueling Q network. The first two model layers defined are simple Keras densely connected layers, dense1 and dense2. These layers have ReLU activations and use the He normal weight initialization. The next two layers defined adv_dense and adv_out pertain to the advantage stream of the network, provided we are discussing a Dueling Q network architecture. If in fact the network is to be a Double Q network (i.e. dueling == False), then these names are a bit misleading and will simply be a third densely connected layer followed by the output Q layer (adv_out). However, keeping with the Dueling Q terminology, the first dense layer associated with the advantage stream is simply another standard dense layer of size = hidden_size. The final layer in this stream, adv_out is a dense layer with only num_actions outputs – each of these outputs will learn to estimate the advantage of all the actions in the given state (A(s, a)).

If the network is specified to be a Dueling Q network (i.e. dueling == True), then the value stream is also created. Again, a standard densely connected layer of size = hidden_size is created (v_dense). Then a final, single node dense layer is created to output the single value estimation (V(s)) for the given state. These layers specify the advantage and value streams respectively. Now the aggregation layer is to be created. This aggregation layer is created by using two Keras layers – a Lambda layer and an Add layer. The Lambda layer allows the developer to specify some user-defined operation to perform on the inputs to the layer. In this case, we want the layer to calculate the following: $$A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ This is calculated easily by using the lambda x: x – tf.reduce_mean(x) expression in the Lambda layer. Finally, we need a simple Keras addition layer to add this mean-normalized advantage function to the value estimation.

This completes the explanation of the layer definitions in the model. The call method in this model definition then applies these various layers to the state inputs of the model. The following two lines execute the Dueling Q aggregation function:

norm_adv = self.lambda_layer(adv) 
combined = self.combine([v, norm_adv])

Note that first the mean-normalizing lambda function is applied to the output from the advantage stream. This normalized advantage is then added to the value stream output to produce the final Q values (combined). Now that the model class has been defined, it is time to instantiate two models – one for the primary network and the other for the target network:

primary_network = DQModel(30, num_actions, True)
target_network = DQModel(30, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

After the primary_network and target_network have been created, only the primary_network is compiled as only the primary network is actually trained using the optimization function. As per Double Q learning, the target network is instead moved slowly “towards” the primary network by the gradual merging of weight values. Initially however, the target network trainable weights are set to be equal to the primary network trainable variables, using the TensorFlow assign function.

Other functions

The next function to discuss is the target network updating which is performed during training. In Double Q network training, there are two options for transitioning the target network weights towards the primary network weights. The first is to perform a wholesale copy of the weights every N training steps. Alternatively, the weights can be moved towards the primary network gradually every training iteration as follows:

def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

As can be observed, the new target weight variables are a weighted average between the current weight values and the primary network weights – with the weighting factor equal to TAU. The next code snippet is the definition of the memory class:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(500000)

This class takes tuples of (state, action, reward, next state) values and appends them to a memory list, which is randomly sampled from when required during training. The next function defines the epsilon-greedy action selection policy:

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

If a random number sampled between the interval 0 and 1 falls below the current epsilon value, a random action is selected. Otherwise, the current state is passed to the primary model – from which the Q values for each action are returned. The action with the highest Q value, selected by the numpy argmax function, is returned.

The next function is the train function, where the training of the primary network takes place:

def train(primary_network, memory, target_network):
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    # extract the best action from the next state
    prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
    # get all the q values for the next state
    q_from_target = target_network(next_states)
    # add the discounted estimated reward from the selected action (prim_action_tp1)
    updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    # update the q target to train towards
    target_q[batch_idxs, actions] = updates
    # run a training batch
    loss = primary_network.train_on_batch(states, target_q)
    return loss

For a more detailed explanation of this function, see my Double Q tutorial. However, the basic operations that are performed are expressed in the following formulas: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ The best action from the next state, a*,  is selected from the primary network (weights = $theta_t$). However, the Q value for this action in the next state ($s_{t+1}$) is extracted from the target network (weights = $theta^-_t$). A Keras train_on_batch operation is performed by passing a batch of states and subsequent target Q values, and the loss is finally returned from this function.

The main Dueling Q training loop

The main training loop which trains our Dueling Q network is shown below:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DuelingQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
steps = 0
for i in range(num_episodes):
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    state = env.reset()
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, _, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        tot_reward += reward
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))

        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss

        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * 
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else 
                MIN_EPSILON
        steps += 1

        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', cnt, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            break

        state = next_state
        cnt += 1

Again, this training loop has been explained in detail in the Double Q tutorial. However, some salient points are worth highlighting. First, Double and Dueling Q networks are superior to vanilla Deep Q networks especially in the cases where there is some stochastic component to the environment. As the CartPole environment is deterministic, some stochasticity is added in the reward. Normally, every time step in the episode results in a reward of 1 i.e. the CartPole has survived another time step – good job. However, in this case I’ve added a reward which is sampled from a normal distribution with a mean of 1.0 but a standard deviation of RANDOM_REWARD_STD. This adds the requisite uncertainty which makes Double and Dueling Q networks clearly superior to Deep Q networks – see my Double Q tutorial for a demonstration of this.

Another point to highlight is that training of the primary (and by extension target) network until DELAY_TRAINING steps have been exceeded. Also, the epsilon value for the epsilon-greedy action selection policy doesn’t decay until these DELAY_TRAINING steps have been exceeded.

Dueling Q vs Double Q results

A comparison of the training progress with respect to the deterministic reward of the agent in the CartPole environment under Double Q and Dueling Q architectures can be observed in the figure below, with the x-axis being the number of episodes:

Dueling Q (light blue) vs Double Q (dark blue) rewards in the CartPole environment

Dueling Q (light blue) vs Double Q (dark blue) rewards in the CartPole environment

As can be observed, there is a slightly higher performance of the Double Q network with respect to the Dueling Q network. However, the performance difference is fairly marginal, and may be within the variation arising from the random weight initialization of the networks. There is also the issue of the Dueling Q network being slightly more complicated due to the additional value stream. As such, on a fairly simple environment like the CartPole environment, the benefits of Dueling Q over Double Q may not be realized. However, in more complex environments like Atari environments, it is likely that the Dueling Q architecture will be superior to Double Q (this is what the original Dueling Q paper has shown). Future posts will demonstrate the Dueling Q architecture in Atari environments.


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Dueling Q networks in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Atari Space Invaders and Dueling Q RL in TensorFlow 2

In previous posts (here and here) I introduced Double Q learning and the Dueling Q architecture. These followed on from posts about deep Q learning, and showed how double Q and dueling Q learning is superior to vanilla deep Q learning. However, these posts only included examples of simplistic environments like the OpenAI Cartpole environment. These types of environments are good to learn on, but more complicated environments are both more interesting and fun. They also demonstrate better the complexities of implementing deep reinforcement learning in realistic cases. In this post, I’ll use similar code to that shown in my Dueling Q TensorFlow 2 but in this case apply it to the Open AI Atari Space Invaders environment. All code for this post can be found on this site’s Github repository. Also, as mentioned in the title, the example code for this post is written using TensorFlow 2. TensorFlow 2 is now released and installation instructions can be found here.


Eager to build deep learning systems in TensorFlow 2? Get the book here


Double and Dueling Q learning recap

Double Q recap

Double Q learning was created to address two problems with vanilla deep Q learning. These are:

  1. Using the same network to both choose the best action and evaluate the quality of that action is a source of feedback / learning instability.
  2. The max function used in calculating the target Q value (see formula below), which the neural network is to learn, tends to bias the network towards high, noisy, rewards. This again hampers learning and makes it more erratic

The problematic Bellman equation is shown below: $$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$ The Double Q solution to the two problems above involves creating another target network, which is initially created with weights equal to the primary network. However, during training the primary network and the target network are allowed to “drift” apart. The primary network is trained as per usual, but the target network is not. Instead, the target network weights are either periodically (but not frequently) set equal to the primary network weights, or they are only gradually “blended” with the primary network in a weighted average fashion. The benefit then comes from the fact that in Double Q learning, the Q value of the best action in the next state ($s_{t + 1}$) is extracted from the target network, not the primary network. The primary network is still used to evaluate what the best action will be, a*, by taking an argmax of the outputs from the primary network, but the Q value for this action is evaluated from the target network. This can be observed in the formulation below: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ Notice the different weights involved in the formulas above – the best action, a*, is calculated from the network with $theta_t$ weights – this is the primary network weights. However the $Q_{target}$ calculation uses the target network, with weights $theta^-_t$, to estimate the Q value for this chosen action. This Double Q methodology decouples the choosing of an action from the evaluation of the Q value of such an action. This provides more stability to the learning – for more details and a demonstration of the superiority of the Double Q methodology over vanilla Deep Q learning, see this post.

Dueling Q recap

The Dueling Q architecture, discussed in detail in this post, is an improvement to the Double Q network. It uses the same methodology of a target and a primary network, with periodic updates or blending of the target network weights to the primary network weights. However, it builds two important concepts into the architecture of the network. These are the advantage and value functions:

  • Advantage function A(s, a): The advantage function is the relative benefit of choosing a certain action in state s over the other possible actions in state s
  • Value function V(s): The value function is the value of being in state s, independent of the relative benefits of the actions within that state

The Q function is the simple addition of these two functions: $$Q(s, a) = V(s) + A(s, a)$$ The motivation of splitting these two functions explicitly in the architecture is that there can be inherently good or bad states for the agent to be in, regardless of the relative benefit of any actions within that state. For instance, in a certain state, all actions may lead to the agent “dying” in a game – this is an inherently bad state to be in, and there is no need to waste computational resources trying to determine the best action in this state. The converse can also be true. Ideally, this “splitting” into the advantage function and value function should be learnt implicitly during training. However, the Dueling Q architecture makes this split explicit, which acts to improve training. The Dueling Q architecture can be observed in the figure below:  

Dueling Q architecture

Dueling Q architecture

It can be observed that in the Dueling Q architecture, there are common Convolutional Neural Network layers which perform image processing. The output from these layers is then flattened and the network then bifurcates into a Value function stream V(s) and an Advantage function stream A(s, a). The output of these separate streams are then aggregated in a special layer, before finally outputting Q values from the network. The aggregation layer does not perform a simple addition of the Value and Advantage streams – this would result in problems of identifiability (for more details on this, see the original Dueling Q post). Instead, the following aggregation function is performed: $$Q(s,a) = V(s) + A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ In this post, I’ll demonstrate how to use the Dueling Q architecture to train an agent in TensorFlow 2 to play Atari Space Invaders. However, in this post I will concentrate on the extra considerations required to train the agent via an image stream from an Atari game. For more extra details, again, refer to the original Dueling Q post.

Considerations for training in an Atari environment

Training reinforcement learning agents on Atari environments is hard – it can be a very time consuming process as the environment complexity is high, especially when the agent needs to visually interpret objects direct from images. As such, each environment needs to be considered to determine legitimate ways of reducing the training burden and improving the performance. Three methods will be used in this post:

  1. Converting images to greyscale
  2. Reducing the image size
  3. Stacking frames

Converting Atari images to greyscale and reducing the image size

The first, relatively easy, step in reducing the computational training burden is to convert all the incoming Atari images from depth-3 RGB colour images to depth-1 greyscale images. This reduces the number of input CNN filters required in the first layer by 3. Another step which can be performed to reduce the size of the input CNN filters is to resize the image inputs to make them smaller. There is obviously a limit in the reduction of the image sizes before learning performance is affected, however, in this case, a halving of the image size by rescaling is possible without affecting performance too much. The original image sizes from the Atari Space Invaders game are (210, 160, 3) – after converting to greyscale and resizing by half, the new image size is (105, 80, 1). Both of these operations are easy enough to implement in TensorFlow 2:

def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image

Stacking image frames

The next step that is commonly performed when training agents on Atari games is the practice of stacking image frames, and feeding all these frames into the input CNN layers. The purpose of this is to allow the neural network to get some sense of direction of the objects moving within the image. Consider a single, static image – examining such an image on its own will give no information about which direction any of the objects moving within this image are travelling (or their respective speeds). Therefore, for each sample fed into the neural network, a stack of frames is presented to the input – this gives the neural network both time and spatial information to work with. The input dimension to the network are not, then, of size (105, 80, 1) but rather (105, 80, NUM_FRAMES). In this case, we’ll use 3 frames to feed into the network i.e. NUM_FRAMES = 3. The specifics of how these stacked frames are stored, extracted and updated will be revealed as we step through the code in the next section. Additional steps can be taken to improve performance in complex Atari environment and similar cases. These include the skipping of frames and prioritised experience replay (PER). However, these have not been implemented in this example. A future post will discuss the benefits of PER and how to implement it.

Atari Space Invaders TensorFlow 2 implementation

The section below details the TensorFlow 2 implementation of training an agent on the Atari Space Invaders environment. In this post, comprehensive details of the Dueling Q architecture and training implementation will not be given – for a step by step discussion on these details, see my Dueling Q introductory post. However, detailed information will be given about the specific new steps required to train in the Atari environment. As stated at the beginning of the post, all code can be found on this site’s Github repository.

Model definition

First we define the Double/Dueling Q model class with its structure:

env = gym.make("SpaceInvaders-v0")
num_actions = env.action_space.n


class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

primary_network.compile(optimizer=keras.optimizers.Adam(), loss=tf.keras.losses.Huber())

In the code above, first the Space Invaders environment is created. After this, the DQModel class is defined as a keras.Model base class. In this model, you can observe that first a number of convolutional layers are created, then a flatten layer and dedicated fully connected layers to enact the value and advantage streams. This structure is then implemented in the model call function. After this model class has been defined, two versions of it are implemented corresponding to the primary_network and the target_network – as discussed above, both of these will be utilised in the Double Q component of the learning. The target_network weights are then set to be initially equal to the primary_network weights. Finally the primary_network is compiled for training using an Adam optimizer and a Huber loss function. As stated previously, for more details see this post.

The Memory class

Next we will look at the Memory class, which is to hold all the previous experiences of the agent. This class is a little more complicated in the Atari environment case, due to the necessity of dealing with stacked frames:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._actions = np.zeros(max_memory, dtype=np.int32)
        self._rewards = np.zeros(max_memory, dtype=np.float32)
        self._frames = np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], max_memory), dtype=np.float32)
        self._terminal = np.zeros(max_memory, dtype=np.bool)
        self._i = 0

In the class __init__ function, it can be observed that all the various memory buffers (for actions, rewards etc.) are initialized according to max_memory at the get-go. This is in opposition to a memory approach which involves appending to lists. This is performed so that it can be determined whether there will be a memory problem during training from the very beginning (as opposed to the code falling over after you’ve already been running it for 3 days!). It also increases the efficiency of the memory allocation process (as appending / growing memory dynamically is an inefficient process). You’ll also observe the creation of a counter variable, self._i. This is to record the present location of stored samples in the memory buffer, and will ensure that the memory is not overflowed. The next function within the class shows how samples are stored within the class:

def add_sample(self, frame, action, reward, terminal):
    self._actions[self._i] = action
    self._rewards[self._i] = reward
    self._frames[:, :, self._i] = frame[:, :, 0]
    self._terminal[self._i] = terminal
    if self._i % (self._max_memory - 1) == 0 and self._i != 0:
        self._i = BATCH_SIZE + NUM_FRAMES + 1
    else:
        self._i += 1

As will be shown shortly, for every step in the Atari environment, the current image frame, the action taken, the reward received and whether the state is terminal (i.e. the agent ran out of lives and the game ends) is stored in memory. Notice that nothing special as yet is being done with the stored frames – they are simply stored in order as the game progresses. The frame stacking process occurs during the sample extraction method to be covered next. One thing to notice is that once self._i reaches max_memory the index is reset back to the beginning of the memory buffer (but offset by the batch size and the number of frames). This reset means that, once the memory buffer reaches it’s maximum size, it will begin to overwrite the older samples. The next method in the class governs how random sampling from the memory buffer occurs:

def sample(self):
    if self._i < BATCH_SIZE + NUM_FRAMES + 1:
        raise ValueError("Not enough memory to extract a batch")
    else:
        rand_idxs = np.random.randint(NUM_FRAMES + 1, self._i, size=BATCH_SIZE)
        states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        next_states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        for i, idx in enumerate(rand_idxs):
            states[i] = self._frames[:, :, idx - 1 - NUM_FRAMES:idx - 1]
            next_states[i] = self._frames[:, :, idx - NUM_FRAMES:idx]
        return states, self._actions[rand_idxs], self._rewards[rand_idxs], next_states, self._terminal[rand_idxs]

First, a simple check is performed to ensure there are enough samples in the memory to actually extract a batch. If so, a set of random indices rand_idxs is selected. These random integers are selected from a range with a lower bound of NUM_FRAMES + 1 and an upper bound of self._i. In other words, it is possible to select any indices from the start of the memory buffer to the current filled location of the buffer – however, because NUM_FRAMES of images prior to the selected indices is extracted, indices less than NUM_FRAMES are not allowed. The number of random indices selected is equal to the batch size.

Next, some numpy arrays are initialised which will hold the current states and the next states – in this example, these are of size (32, 105, 80, 3) where 3 is the number of frames to be stacked (NUM_FRAMES). A loop is then entered into for each of the randomly selected memory indices. As can be observed, the states batch row is populated by the stored frames ranging from idx – 1 – NUM_FRAMES to idx – 1. In other words, it is the 3 frames including and prior to the randomly selected index idx – 1. Alternatively, the batch row for next_states is the 3 frames including and prior to the randomly selected index idx (think of a window of 3 frames shifted along by 1 position). These variables states and next_states are then returned from this function, along with the corresponding actions, rewards and terminal flags. The terminal flags communicate whether the game finished for during the randomly selected states. Finally, the memory class is instantiated with the memory size as the argument:

memory = Memory(200000)

The memory size should ideally be as large as possible, but considerations must be given to the amount of memory available on whatever computing platform is being used to run the training.

Miscellaneous functions

The following two functions are standard functions to choose the actions and update the target network:

def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                           POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)).numpy()))


def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

The choose_action function performs the  epsilon-greedy action selection policy, where a random action is selected if a random value falls below eps, otherwise it is selected by choosing the action with the highest Q value from the network. The update_network function slowly shifts the target network weights towards the primary network weights in accordance with the Double Q learning methodology. The next function deals with the “state stack” which is an array which holds the last NUM_FRAMES of the episode:

def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack

This function takes the existing state stack array, and the newest state to be added. It then shuffles all the existing frames within the state stack “back” one position. In other words, the most recent state, in this case, sitting in row 2 of the state stack, if shuffled back to row 1. The frame / state in row 1 is shuffled to row 0. Finally, the most recent state or frame is stored in the newly vacated row 2 of the state stack. The state stack is required so that it can be fed into the neural network in order to choose actions, and its updating can be observed in the main training loop, as will be reviewed shortly.

The Dueling Q / Double Q training function

Next up is the training function:

def train(primary_network, memory, target_network=None):
    states, actions, rewards, next_states, terminal = memory.sample()
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = terminal != True
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    return loss

This train function is very similar to the train function reviewed in my first Dueling Q tutorial. Essentially, it first extracts batches of data from the memory buffer. Next the Q values from the current state (states) and the following states (next_states) are extracted from the primary network – these values are returned in prim_qt and prim_qtp1 respectively (where qtp1 refers to the Q values for the time t + 1). Next, the target Q values are initialized from the prim_qt values. After this, the updates variable is created – this holds the target Q values for the actions. These target values will be the Q values which the network will “step towards” during the optimization step – hence the name “target” Q values. 

The variable valid_idxs specifies those indices which don’t include terminal states – obviously for terminal states (states where the game ended), there are no future rewards to discount from, so the target value for these states is the rewards value. For other states, which do have future rewards, these need to be discounted and added to the current reward for the target Q values. If no target_network is provided, it is assumed vanilla Q learning should be used to provide the discounted target Q values. If not, double Q learning is implemented.

According to that methodology, first the a* actions are selected which are those actions with the highest Q values in the next state (t + 1). These actions are taken from the primary network, using the numpy argmax function. Next, the Q values from the target network are extracted from the next state (t + 1). Finally, the updates value is incremented for valid indices by adding the discounted future Q values from the target network, for the actions a* selected from the primary network. Finally, the network is trained using the Keras train_on_batch function.

The main Atari training loop

Now it is time to review the main training loop:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DuelingQSI_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = True
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape((POST_PROCESS_IMAGE_SIZE[0],
                                                                            POST_PROCESS_IMAGE_SIZE[1],
                                                                            NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        state_stack = process_state_stack(state_stack, next_state)
        # store in memory
        memory.add_sample(next_state, action, reward, done)

        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network if double_q else None)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss

        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * 
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else 
                MIN_EPSILON
        steps += 1

        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {tot_reward}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i)
            break

        cnt += 1

This training loop is very similar to the training loop in my Dueling Q tutorial, so for a detailed review, please see that post. The main differences relate to how the frame stacking is handled. First, you’ll notice at the start of the loop that the environment is reset, and the first state / image is extracted. This state or image is pre-processed and then repeated NUM_FRAMES times and reshaped to create the first state or frame stack, of size (105, 80, 3) in this example. Another point to note is that a gif recording function has been created which is called every GIF_RECORDING_FREQ episodes. This function involves simply outputting every frame to a gif so that the training progress can be monitored by observing actual gameplay. As such, there is a frame list which is filled whenever each GIF_RECORDING_FREQ episode comes around, and this frame list is passed to the gif recording function. Check out the code for this tutorial for more details. Finally, it can be observed that after every state, the state stack is processed by shuffling along each recorded frame / state in that stack. 

Space Invader Atari training results

The image below shows how the training progresses through each episode with respect to the total reward received for each episode:    

Atari Space Invaders - Dueling Q training reward

Atari Space Invaders – Dueling Q training reward

As can be observed from the plot above, the reward steadily increases over 1500 episodes of game play. Note – if you wish to replicate this training on your own, you will need GPU processing support in order to reduce the training timeframes to a reasonable level. In this case, I utilised the Google Cloud Compute Engine and a single GPU. The gifs below show the progress of the agent in gameplay between episode 50 and episode 1450:

Atari Space Invaders - gameplay episode 50

Atari Space Invaders – gameplay episode 50

 

Atari Space Invaders - gameplay episode 1450

Atari Space Invaders – gameplay episode 1450

As can be observed, after 50 epsiodes the agent still moves around randomly and is quickly killed, achieving a score of only 60 points. However, after 1450 episodes, the agent can be seen to be playing the game much more effectively, even having learnt to destroy the occasional purple “master ship” flying overhead to gain extra points. 

This post has demonstrated how to effectively train agents to operate in Atari environments such as Space Invaders. In particular it has demonstrated how to use the Dueling Q reinforcement learning algorithm to train the agent. A future post will demonstrate how to make the training even more efficient using the Prioritised Experience Replay (PER) approach. 


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Atari Space Invaders and Dueling Q RL in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Policy Gradient Reinforcement Learning in TensorFlow 2

In a series of recent posts, I have been reviewing the various Q based methods of deep reinforcement learning (see here, here, here, here and so on). Deep Q based reinforcement learning operates by training a neural network to learn the Q value for each action of an agent which resides in a certain state s of the environment. The policy which guides the actions of the agent in this paradigm operates by a random selection of actions at the beginning of training (the epsilon greedy method), but then the agent will select actions based on the highest Q value predicted in each state s. The Q value is simply an estimation of future rewards which will result from taking action a. An alternative to the deep Q based reinforcement learning is to forget about the Q value and instead have the neural network estimate the optimal policy directly. Reinforcement learning methods based on this idea are often called Policy Gradient methods.

This post will review the REINFORCE or Monte-Carlo version of the Policy Gradient methodology. This methodology will be used in the Open AI gym Cartpole environment. All code used and explained in this post can be found on this site’s Github repository.  


Eager to build deep learning systems in TensorFlow 2? Get the book here


Policy Gradients and their theoretical foundation

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.  

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(theta) = mathbb{E}_{pi_theta} left[sum_{t=0}^{T-1} gamma^t r_t right]$$
 
The function above means that we are attempting to find a policy ($pi$) with parameters ($theta$) which maximises the expected value of the sum of the discounted rewards of an agent in an environment. Therefore, we need to find a way of varying the parameters of the policy $theta$ such that the expected value of the discounted rewards are maximised. In the deep reinforcement learning case, the parameters $theta$ are the parameters of the neural network.
 
Note the difference to the deep Q learning case – in deep Q based learning, the parameters we are trying to find are those that minimise the difference between the actual Q values (drawn from experiences) and the Q values predicted by the network. However, in Policy Gradient methods, the neural network directly determines the actions of the agent – usually by using a softmax output and sampling from this. 
 
Also note that, because environments are usually non-deterministic, under any given policy ($pi_theta$) we are not always going to get the same reward. Rather, we are going to be sampling from some probability function as the agent operates in the environment, and therefore we are trying to maximise the expected sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.  
 
Ok, so we want to learn the optimal $theta$. The way we generally learn parameters in deep learning is by performing some sort of gradient based search of $theta$. So we want to iteratively execute the following:
 
$$theta leftarrow theta + alpha nabla J(theta)$$
 
So the question is, how do we find $nabla J(theta)$?

Finding the Policy Gradient

First, let’s make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in x multiplied by the probability of x, like so:

$$mathbb{E}[f(x)] = sum_x P(x)f(x)$$
 
Ok, so what does the cashing out of the expectation in $J(theta)$ look like? First, we have to define the function which produces the rewards, i.e. the rewards equivalent of $f(x)$ above. Let’s call this $R(tau)$ (where
$R(tau) = sum_{t=0}^{T-1}r_t$, ignoring discounting for the moment).  The value $tau$ is the trajectory of the agent “moving” through the environment. It can be defined as:
 
$$tau = (s_0, a_0, r_0, s_1, a_1, r_1, ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$
 
The trajectory, as can be seen, is the progress of the agent through an episode of a game of length T. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(tau)$ – but what does this actually look like in a reinforcement learning environment? 
 
It consists of the two components – the probabilistic policy function which yields an action $a_t$ from states $s_t$ with a certain probability, and a probability that state $s_{t+1}$ will result from taking action $a_t$ from state $s_t$. The latter probabilistic component is uncertain due to the random nature of many environments. These two components operating together will “roll out” the trajectory of the agent $tau$. 
 
$P(tau)$ looks like:
 
$$P(tau) = prod_{t=0}^{T-1} P_{pi_{theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)$$
 
If we take the first step, starting in state $s_0$ – our neural network will produce a softmax output with each action assigned a certain probability. The action is then selected by weighted random sampling subject to these probabilities – therefore, we have a probability of action $a_0$ being selected according to $P_{pi_{theta}}(a_t|s_t)$. This probability is determined by the policy $pi$ which in turn is parameterised according to $theta$ (i.e. a neural network with weights $theta$.  The next term will be $P(s_1|s_0,a_0)$ which expresses any non-determinism in the environment.
 
(Note: the vertical line in the probability functions above are conditional probabilities. $P_{pi_{theta}}(a_t|s_t)$ refers to the probability of action $a_t$ being selected, given the agent is in state $s_t$). 
 
These probabilities are multiplied out over all the steps in the episode of length T to produce the trajectory $tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(tau)$ and $R(tau)$ into the original expectation and take the derivative to get to $nabla J(theta)$ which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two. 
 
First, let’s take the log derivative of $P(tau)$ with respect to $theta$ i.e. $nabla_theta$ and work out what we get:
 
 
$$nabla_theta log P(tau) = nabla log left(prod_{t=0}^{T-1} P_{pi_{theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)right) $$
$$ =nabla_theta left[sum_{t=0}^{T-1} (log P_{pi_{theta}}(a_t|s_t) + log P(s_{t+1}|s_t,a_t)) right]$$
$$ =nabla_theta sum_{t=0}^{T-1}log P_{pi_{theta}}(a_t|s_t)$$
 
The reason we are taking the log will be made clear shortly. As can be observed, when the log is taken of the multiplicative operator ($prod$) this is converted to a summation (as multiplying terms within a log function is equivalent to adding them separately). In the final line, it can be seen that taking the derivative with respect to the parameters ($theta$) removes the dynamics of the environment ($log P(s_{t+1}|s_t,a_t))$) as these are independent of the neural network parameters / $theta$. 
 
Let’s go back to our original expectation function, substituting in our new trajectory based functions, and apply the derivative (again ignoring discounting for simplicity):
 
$$J(theta) = mathbb{E}[R(tau)]$$
$$ = smallint P(tau) R(tau)$$
$$nabla_theta J(theta) = nabla_theta smallint P(tau) R(tau)$$
 
So far so good. Now, we are going to utilise the following rule which is sometimes called the “log-derivative” trick:
 
$$frac{nabla_theta p(X,theta)}{p(X, theta)} = nabla_theta log p(X,theta)$$
 
We can then apply the $nabla_{theta}$ operator within the integral, and cajole our equation so that we get the $frac{nabla_{theta} P{tau}}{P(tau)}$ expression like so:
 
$$nabla_theta J(theta)=int P(tau) frac{nabla_theta P(tau)}{P(tau)} R(tau)$$
 
Then, using the log-derivative trick and applying the definition of expectation, we arrive at:
 
$$nabla_theta J(theta)=mathbb{E}left[R(tau) nabla_theta logP(tau)right]$$
 
We can them substitute our previous derivation of $nabla_{theta} log P(tau)$ into the above to arrive at:
 
$$nabla_theta J(theta) =mathbb{E}left[R(tau) nabla_theta sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|s_t)right]$$
 
This is now close to the point of being something we can work with in our learning algorithm. Let’s take it one step further by recognising that, during our learning process, we are randomly sampling trajectories from the environment, and hoping to make informed training steps. Therefore, we can recognise that, to maximise the expectation above, we need to maximise it with respect to its argument i.e. we maximise:
 
$$nabla_theta J(theta) sim R(tau) nabla_theta sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|s_t)$$
 
Recall that $R(tau)$ is equal to $R(tau) = sum_{t=0}^{T-1}r_t$ (ignoring discounting). Therefore, we have two summations that need to be multiplied out, element by element. It turns out that after doing this, we arrive at an expression like so:
 
$$nabla_theta J(theta) sim left(sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|r_t)right)left(sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’} right)$$
 
As can be observed, there are two main components that need to be multiplied. However, one should note the differences in the bounds of the summation terms in the equation above – these will be explained in the next section.
 

Calculating the Policy Gradient

The way we compute the gradient as expressed above in the REINFORCE method of the Policy Gradient algorithm involves sampling trajectories through the environment to estimate the expectation, as discussed previously. This REINFORCE method is therefore a kind of Monte-Carlo algorithm. Let’s consider this a bit more concretely. 
 
Let’s say we initialise the agent and let it play a trajectory $tau$ through the environment. The actions of the agent will be selected by performing weighted sampling from the softmax output of the neural network – in other words, we’ll be sampling the action according to $P_{pi_{theta}}(a_t|r_t)$. At each step in the trajectory, we can easily calculate $log P_{pi_{theta}}(a_t|r_t)$ by simply taking the log of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial states $s_0$ as input, and it would produce a vector of actions $a_0$ with pseudo-probabilities generated by the softmax operation in the final layer. 
 
What about the second part of the $nabla_theta J(theta)$ equation – $sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’}$? We can see that the summation term starts at $t’ = t + 1 = 1$. The summation then goes from t=1 to the total length of the trajectory T – in other words, from t=1 to the total length of the episode. Let’s say the episode length was 4 states long – this term would then look like $gamma^0 r_1 + gamma^1 r_2 + gamma^2 r_3$, where $gamma$ is the discounting factor and is < 1. 
 
Straight-forward enough. However, you may have realised that, in order to calculate the gradient $nabla_theta J(theta)$ at the first step in the trajectory/episode, we need to know the reward values of all subsequent states in the episode. Therefore, in order to execute this method of learning, we can only take gradient learning steps after the full episode has been played to completion. Only after the episode is complete can we perform the training step. 
 
We are almost ready to move onto the code part of this tutorial. However, this is a good place for a quick discussion about how we would actually implement the calculations $nabla_theta J(theta)$ equation in TensorFlow 2 / Keras. It turns out we can just use the standard cross entropy loss function to execute these calculations. Recall that cross entropy is defined as (for a deeper explanation of entropy, cross entropy, information and KL divergence, see this post):
 
$$CE = -sum p(x) log(q(x))$$
 
Which is just the summation between one function $p(x)$ multiplied by the log of another function $q(x)$ over the possible values of the argument x. If we look at the source code of the Keras implementation of cross-entropy, we can see the following:
 
Keras output of cross-entropy loss function

Keras output of cross-entropy loss function

The output tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of output is calculated in the above. The target value, for our purposes, can be all the discounted rewards calculated at each step in the trajectory, and will be of size (num_steps_in_episode, 1). The summation of the multiplication of these terms is then calculated (reduce_sum). Gradient based training in TensorFlow 2 is generally a minimisation of the loss function, however, we want to maximise the calculation as discussed above. The good thing is, the sign of cross entropy calculation shown above is inverted – so we are good to go. 

To call this training step utilising Keras, all we have to do is execute something like the following:

network.train_on_batch(states, discounted_rewards)

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through network, apply the softmax function, and this will become the output variable in the Keras source code snippet above. Likewise, discounted_rewards is the same as target in the source code snippet above. 

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

Policy Gradient reinforcement learning in TensorFlow 2 and Keras

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{pi_{theta}}(a_t|r_t)$ with the state as the input:

GAMMA = 0.95

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy',optimizer=keras.optimizers.Adam())

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities to approximate $P_{pi_{theta}}(a_t|r_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser. 

The next part of the code chooses the action from the output of the model:

def get_action(network, state, num_actions):
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
    return selected_action

As can be seen, first the softmax output is extracted from the network by inputing the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values. 

The next function is the main function involved in executing the training step:

def update_network(network, rewards, states, actions, num_actions):
    reward_sum = 0
    discounted_rewards = []
    for reward in rewards[::-1]:  # reverse buffer r
        reward_sum = reward + GAMMA * reward_sum
        discounted_rewards.append(reward_sum)
    discounted_rewards.reverse()
    discounted_rewards = np.array(discounted_rewards)
    # standardise the rewards
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    states = np.vstack(states)
    loss = network.train_on_batch(states, discounted_rewards)
    return loss

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from t + 1 to T according to $sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’}$. The input argument rewards is a list of all the rewards achieved at each step in the episode. The rewards[::-1] operation reverses the order of the rewards list, so the first run through the for loop will deal with last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the for loop is executed. Let’s say that the episode length is equal to 4 – $r_3$ will refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + gamma r_3$, $r_1 + gamma r_2 + gamma^2 r_3$, $r_0 + gamma r_1 + gamma^2 r_2 + gamma^3 r_3$]

 

This list is in reverse to the order of the actual state value list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the for loop reverses the list (discounted_rewards.reverse()). 

Next, the list is converted into a numpy array, and the rewards are normalised to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras train_on_batch function, which was detailed earlier. 

The next part of the code is the main episode and training loop:

num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    states = []
    actions = []
    while True:
        action = get_action(network, state, num_actions)
        new_state, reward, done, _ = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)

        if done:
            loss = update_network(network, rewards, states, actions, num_actions)
            tot_reward = sum(rewards)
            print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', tot_reward, step=episode)
                tf.summary.scalar('avg loss', loss, step=episode)
            break

        state = new_state

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the done flag is returned from the environment signifying that the episode is complete. At the end of the episode, the training step is performed on the network by running update_network. Finally, the rewards and loss are logged in the train_writer for viewing in TensorBoard. 

The training results can be observed below:

Training progress of Policy Gradient RL on Cartpole environment

Training progress of Policy Gradient RL in Cartpole environment

As can be observed, the rewards steadily progress until they “top out” at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the user can verify that repeated runs of this version of Policy Gradient training has a high variance in its outcomes. Therefore, improvements in the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.

 


Eager to build deep learning systems in TensorFlow 2? Get the book here


 

 

The post Policy Gradient Reinforcement Learning in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Bayes Theorem, maximum likelihood estimation and TensorFlow Probability

A growing trend in deep learning (and machine learning in general) is a probabilistic or Bayesian approach to the problem. Why is this? Simply put – a standard deep learning model produces a prediction, but with no statistically robust understanding of how confident the model is in the prediction. This is important in the understanding of the limitations of model predictions, and also if one wants to do probabilistic modeling of any kind. There are also other applications, such as probabilistic programming and being able to use domain knowledge, but more on that in another post. The TensorFlow developers have addressed this problem by creating TensorFlow Probability. This post will introduce some basic Bayesian concepts, specifically the likelihood function and maximum likelihood estimation, and how these can be used in TensorFlow Probability for the modeling of a simple function.

The code contained in this tutorial can be found on this site’s Github repository.


Eager to build deep learning systems in TensorFlow 2? Get the book here

 

Bayes theorem and maximum likelihood estimation

Bayes theorem is one of the most important statistical concepts a machine learning practitioner or data scientist needs to know. In the machine learning context, it can be used to estimate the model parameters (e.g. the weights in a neural network) in a statistically robust way. It can also be used in model selection e.g. choosing which machine learning model is the best to address a given problem. I won’t be going in-depth into all the possible uses of Bayes theorem here, however, but I will be introducing the main components of the theorem.

Bayes theorem can be shown in a fairly simple equation involving conditional probabilities as follows:

$$P(theta vert D) = frac{P(D vert theta) P(theta)}{P(D)}$$

In this representation, the variable $theta$ corresponds to the model parameters (i.e. the values of the weights in a neural network), and the variable $D$ corresponds to the data that we are using to estimate the $theta$ values. Before I talk about what conditional probabilities are, I’ll just quickly point out three terms in this formula which are very important to familiarise yourself with, as they come up in the literature all the time. It is worthwhile memorizing what these terms refer to:

$P(theta vert D)$ – this is called the posterior

$P(D vert theta)$ – this is called the likelihood

$P(theta)$ – this is called the prior

I’m going to explain what all these terms refer to shortly, but first I’ll make a quick detour to discuss conditional probability for those who may not be familiar. If you are already familiar with conditional probability, feel free to skip this section.

Conditional probability

Conditional probability is an important statistical concept that is thankfully easy to understand, as it forms a part of our everyday reasoning. Let’s say we have a random variable called RT which represents whether it will rain today – it is a discrete variable and can take on the value of either 1 or 0, denoting whether it will rain today or not. Let’s say we are in a fairly dry environment, and by consulting some long-term rainfall records we know that RT=1 about 10% of the time, and therefore RT=0 about 90% of the time. This fully represents the probability function for RT which can be written as P(RT). Therefore, we have some prior knowledge of what P(RT) is in the absence of any other determining factors.

Ok, so what does P(RT) look like if we know it rained yesterday? Is it the same or is it different? Well, let’s say the region we are in gets most of its rainfall due to big weather systems that can last for days or weeks – in this case, we have good reason to believe that P(RT) will be different given the fact that it rained yesterday. Therefore, the probability P(RT) is now conditioned on our understanding of another random variable P(RY) which represents whether it has rained yesterday. The way of showing this conditional probability is by using the vertical slash symbol $vert$ – so the conditional probability that it will rain today given it rained yesterday looks like the following: $P(RT=1 vert RY = 1)$. Perhaps for this reason the probability that it will rain today is no longer 10%, but maybe will rise to 30%, so $P(RT=1 vert RY = 1) = 0.3$

We could also look at other probabilities, such as $P(RT=1 vert RY = 0)$ or $P(RT=0 vert RY = 1)$ and so on. To generalize this relationship we would just write $P(RT vert RY)$.

Now that you have an understanding of conditional probabilities, let’s move on to explaining Bayes Theorem (which contains two conditional probability functions) in more detail.

Bayes theorem in more detail

The posterior

Ok, so as I stated above, it is time to delve into the meaning of the individual terms of Bayes theorem. Let’s first look at the posterior term – $P(theta vert D)$. This term can be read as: given we have a certain dataset $D$, what is the probability of our parameters $theta$? This is the term we want to maximize when varying the parameters of a model according to a dataset – by doing so, we find those parameters $theta$ which are most probable given the model we are using and the training data supplied. The posterior is on the left-hand side of the equation of Bayes Theorem, so if we want to maximize the posterior we can do this by maximizing the right-hand side of the equation.

Let’s have a look at the terms on the right-hand side.

The likelihood

The likelihood is expressed as $P(D vert theta)$ and can be read as: given this parameter $theta$, which defines some process of generating data, what is the probability we would see this given set of data $D$? Let’s say we have a scattering of data-points – a good example might be the heights of all the members of a classroom full of kids. We can define a model that we assume is able to generate or represent this data – in this case, the Normal distribution is a good choice. The parameters that we are trying to determine in the Normal distribution is the tuple ($mu$, $sigma$) – the mean and variance of the Normal distribution.

So the likelihood $P(D vert theta)$ in this example is the probability of seeing this sample of measured heights given different values of the mean and variance of the Normal distribution function. There is some more mathematical precision needed here (such as the difference between a probability distribution and a probability density function, discrete samples etc.) but this is ok for our purposes of coming to a conceptual understanding.

I’ll come back to the concept of the likelihood shortly when we discuss maximum likelihood estimation, but for now, let’s move onto the prior.

The prior

The prior probability $P(theta)$, as can be observed, is not a conditioned probability distribution. It is simply a representation of the probability of the parameters prior to any other consideration of data or evidence. You may be puzzled as to what the point of this probability is. In the context of machine learning or probabilistic programming, it’s purpose is to enable us to specify some prior understanding of what the parameters should actually be, and the prior probability distribution it should be drawn from.

Returning to the example of the heights of kids in a classroom. Let’s say the teacher is a pretty good judge of heights, and therefore he or she can come to the problem with a rough prior estimate of what the mean height would be. Let’s say he or she guesses that the average height is around 130cm. He can then put a prior around the mean parameter $mu$ of, say, a normal distribution with a mean of 130cm.

The presence of the prior in the Bayes theorem allows us to introduce expert knowledge or prior beliefs into the problem, which aids the finding of the optimal parameters $theta$. These prior beliefs are then updated by the data collected $D$ – with the updating occurring through the action of the likelihood function.

The graph below is an example of the evolution of a prior distribution function exposed to some set of data:

The evolution of the prior - Bayes Theorem - Maximum likelihood estimation

The evolution of the prior distribution towards the evidence / data

Here we can see that, through the application of the Bayes Theorem, we can start out with a certain set of prior beliefs in the form of a prior distribution function, but by applying the evidence or data through the likelihood $P(D vert theta)$, the posterior estimate $P(theta vert D)$ moves closer to “reality”.

The data

The final term in Bayes Theorem is the unconditioned probability distribution of the process that generated the data $P(D)$. In machine learning applications, this distribution is often unknown – but thankfully, it doesn’t matter. This distribution acts as a normalization constant and has nothing to say about the parameters we are trying to estimate $theta$. Therefore, because we are trying to simply maximize the right-hand side of the equation, it drops out of any derivative calculation that is made in order to find the maximum. So in the context of machine learning and estimating parameters, this term can be safely ignored. Given this understanding, the form of Bayes Theorem that we are mostly interested in for machine learning purposes is as follows: $$P(theta vert D) propto P(D vert theta) P(theta)$$

Given this formulation, all we are concerned about is either maximizing the right-hand side of the equation or by simulating the sampling of the posterior itself (not covered in this post).

How to estimate the posterior

Now that we have reviewed conditional probability concepts and Bayes Theorem, it is now time to consider how to apply Bayes Theorem in practice to estimate the best parameters in a machine learning problem. There are a number of ways of estimating the posterior of the parameters in a machine learning problem. These include maximum likelihood estimation, maximum a posterior probability (MAP) estimation, simulating the sampling from the posterior using Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling, and so on. In this post, I will just be considering maximum likelihood estimation (MLE) with other methods being considered in future content on this site.

Maximum likelihood estimation (MLE)

What happens if we just throw our hands up in the air with regards to the prior $P(theta)$ and say we don’t know anything about the best parameters to describe the data? In that case, the prior becomes a uniform or un-informative prior – in that case, $P(theta)$ becomes a constant (same probability no matter what the parameter values are), and our Bayes Theorem reduces to:

$$P(theta vert D) propto P(D vert theta)$$

If this is the case, all we have to do is maximize the likelihood $P(D vert theta)$ and by doing so we will also find the maximum of the posterior – i.e. the parameter with the highest probability given our model and data – or, in short, an estimate of the optimal parameters. If we have a way of calculating $P(D vert theta)$ while varying the parameters $theta$, we can then feed this into some sort of optimizer to calculate:

$$underset{theta}{operatorname{argmax}} P(D vert theta)$$
 
Nearly always, instead of maximizing $P(D vert theta)$ the log of $P(D vert theta)$ is maximized. Why? If we were doing the calculations by hand, we would need to calculate the derivative of the product of multiple exponential functions (as probability functions like the Normal distribution have exponentials in them) which is tricky. Because logs are monotonically increasing functions, they have maximums at the same point as the non-log function. So in other words, the maximum likelihood will occur at the same parameter value as the maximum of the log likelihood. By taking the log of the likelihood, products turn into sums and this makes derivative calculations a whole lot easier.
 
Finally, some optimizers in machine learning packages such as TensorFlow only minimize loss functions, so we need to invert the sign of the loss function in order to maximize it. In that case, for maximum likelihood estimation, we would minimize the negative log likelihood, or NLL, and get the same result.
 
Let’s look at a simple example of maximum likelihood estimation by using TensorFlow Probability.

TensorFlow Probability and maximum likelihood estimation

For the simple example of maximum likelihood estimation that is to follow, TensorFlow Probability is overkill – however, TensorFlow Probability is a great extension of TensorFlow into the statistical domain, so it is worthwhile introducing MLE by utilizing it. The Jupyter Notebook containing this example can be found at this site’s Github repository. Note this example is loosely based on the TensorFlow tutorial found here. In this example, we will be estimating linear regression parameters based on noisy data. These parameters can obviously be solved using analytical techniques, but that isn’t as interesting. First, we import some libraries and generate the noisy data:

import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pylab as plt
tfd = tfp.distributions

x_range = np.arange(0, 10, 0.1)
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + np.random.normal(0, 3.0, len(x_range)) + intercept

Plotting our noisy regression line looks like the following:

Noisy regression line - Maximum likelihood estimation

Noisy regression line

Next, let’s set up our little model to predict the underlying regression function from the noisy data:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(1),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x, scale=1)),
])

So here we have a simple Keras sequential model (for more detail on Keras and TensorFlow, see this post). The first layer is a Dense layer with one node. Given each Dense layer has one bias input by default – this layer equates to generating a simple line with a gradient and intercept: $xW + b$ where x is the input data, W is the single input weight and b is the bias weight. So the first Dense layer produces a line with a trainable gradient and y-intercept value.

The next layer is where TensorFlow Probability comes in. This layer allows you to create a parameterized probability distribution, with the parameter being “fed in” from the output of previous layers. In this case, you can observe that the lambda x, which is the output from the previous layer, is defining the mean of a Normal distribution. In this case, the scale (i.e. the standard deviation) is fixed to 1.0. So, using TensorFlow probability, our model no longer will just predict a single value for each input (as in a non-probabilistic neural network) – no, instead the output is actually a Normal distribution. In that case, to actually predict values we need to call statistical functions from the output of the model. For instance:

  • model(np.array([[1.0]])).sample(10) will produce a random sample of 10 outputs from the Normal distribution, parameterized by the input value 1.0 fed through the first Dense layer
  • model(np.array([[1.0]])).mean() will produce the mean of the distribution, given the input
  • model(np.array([[1.0]])).stddev() will produce the standard deviation of the distribution, given the input

and so on. We can also calculate the log probability of the output distribution, as will be discussed shortly. Next, we need to set up our “loss” function – in this case, our “loss” function is actually just the negative log likelihood (NLL):

def neg_log_likelihood(y_actual, y_predict):
  return -y_predict.log_prob(y_actual)

In the above, the y_actual values are the actual noisy training samples. The values y_predict are actually a tensor of parameterized Normal probability distributions – one for each different training input. So, for instance, if one training input is 5.0, the corresponding y_predict value will be a Normal distribution with a mean value of, say, 12. Another training input may have a value 10.0, and the corresponding y_predict will be a Normal distribution with a mean value of, say, 20, and so on. Therefore, for each y_predict and y_actual pair, it is possible to calculate the log probability of that actual value occurring given the predicted Normal distribution.

To make this more concrete – let’s say for a training input value 5.0, the corresponding actual noisy regression value is 8.0. However, let’s say the predicted Normal distribution has a mean of 10.0 (and a fixed variance of 1.0). Using the formula for the log probability / log likelihood of a Normal distribution:

$$ell_x(mu,sigma^2) = – ln sigma – frac{1}{2} ln (2 pi) – frac{1}{2} Big( frac{x-mu}{sigma} Big)^2$$

Substituting in the example values mentioned above:

$$ell_x(10.0,1.0) = – ln 1.0 – frac{1}{2} ln (2 pi) – frac{1}{2} Big( frac{8.0-10.0}{1.0} Big)^2$$

We can calculate the log likelihood from the y_predict distribution and the y_actual values. Of course, TensorFlow Probability does this for us by calling the log_prob method on the y_predict distribution. Taking the negative of this calculation, as I have done in the function above, gives us the negative log likelihood value that we need to minimize to perform MLE.

After the loss function, it is now time to compile the model, train it, and make some predictions:

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.05), loss=neg_log_likelihood)
model.fit(x_range, lin_reg, epochs=500, verbose=False)

yhat = model(x_range)
mean = yhat.mean()

As can be observed, the model is compiled using our custom neg_log_likelihood function as the loss. Because this is just a toy example, I am using the full dataset as both the train and test set. The estimated regression line is simply the mean of all the predicted distributions, and plotting it produces the following:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()
plt.close("all") plt.scatter(x_range, lin_reg) plt.plot(x_range, mean, label='predicted') plt.plot(x_range, x_range * grad + intercept, label='ground truth') plt.legend(loc="upper left") plt.show()

TensorFlow Probability based regression using maximum likelihood estimation

Another example with changing variance

Another, more interesting, example is to use the model to predict not only the mean but also the changing variance of a dataset. In this example, the dataset consists of the same trend but the noise variance increases along with the values:

def noise(x, grad=0.5, const=2.0):
  return np.random.normal(0, grad * x + const)

x_range = np.arange(0, 10, 0.1)
noise = np.array(list(map(noise, x_range)))
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + intercept + noise

plt.scatter(x_range, lin_reg)
plt.show()
linear regression with increasing noise variance

Linear regression with increasing noise variance

The new model looks like the following:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(2),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x[:, 0], scale=1e-3 + tf.math.softplus(0.3 * x[:, 1]))),
])

In this case, we have two nodes in the first layer, ostensibly to predict both the mean and standard deviation of the Normal distribution, instead of just the mean as in the last example. The mean of the distribution is assigned to the output of the first node (x[:, 0]) and the standard deviation / scale is set to be equal to a softplus function based on the output of the second node (x[:, 1]). After training this model on the same data and using the same loss as the previous example, we can predict both the mean and standard deviation of the model like so:

mean = yhat.mean()
upper = mean + 2 * yhat.stddev()
lower = mean - 2 * yhat.stddev()

In this case, the upper and lower variables are the 2-standard deviation upper and lower bounds of the predicted distributions. Plotting this produces:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.fill_between(x_range, lower, upper, alpha=0.1)
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()
Regression prediction with increasing variance

Regression prediction with increasing variance

As can be observed, the model is successfully predicting the increasing variance of the dataset, along with the mean of the trend. This is a limited example of the power of TensorFlow Probability, but in future posts I plan to show how to develop more complicated applications like Bayesian Neural Networks. I hope this post has been useful for you in getting up to speed in topics such as conditional probability, Bayes Theorem, the prior, posterior and likelihood function, maximum likelihood estimation and a quick introduction to TensorFlow Probability. Look out for future posts expanding on the increasingly important probabilistic side of machine learning.


Eager to build deep learning systems in TensorFlow 2? Get the book here


 

The post Bayes Theorem, maximum likelihood estimation and TensorFlow Probability appeared first on Adventures in Machine Learning.

Categories
Offsites

Python TensorFlow Tutorial – Build a Neural Network

Updated for TensorFlow 2

Google’s TensorFlow has been a hot topic in deep learning recently.  The open source software, designed to allow efficient computation of data flow graphs, is especially suited to deep learning tasks.  It is designed to be executed on single or multiple CPUs and GPUs, making it a good option for complex deep learning tasks.  In its most recent incarnation – version 1.0 – it can even be run on certain mobile operating systems.  This introductory tutorial to TensorFlow will give an overview of some of the basic concepts of TensorFlow in Python.  These will be a good stepping stone to building more complex deep learning networks, such as Convolution Neural Networks, natural language models, and Recurrent Neural Networks in the package.  We’ll be creating a simple three-layer neural network to classify the MNIST dataset.  This tutorial assumes that you are familiar with the basics of neural networks, which you can get up to scratch with in the neural networks tutorial if required.  To install TensorFlow, follow the instructions here. The code for this tutorial can be found in this site’s GitHub repository.  Once you’re done, you also might want to check out a higher level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

First, let’s have a look at the main ideas of TensorFlow.

1.0 TensorFlow graphs

TensorFlow is based on graph based computation – “what on earth is that?”, you might say.  It’s an alternative way of conceptualising mathematical calculations.  Consider the following expression $a = (b + c) * (c + 2)$.  We can break this function down into the following components:

begin{align}
d &= b + c \
e &= c + 2 \
a &= d * e
end{align}

Now we can represent these operations graphically as:

TensorFlow tutorial - simple computational graph

Simple computational graph

This may seem like a silly example – but notice a powerful idea in expressing the equation this way: two of the computations ($d=b+c$ and $e=c+2$) can be performed in parallel.  By splitting up these calculations across CPUs or GPUs, this can give us significant gains in computational times.  These gains are a must for big data applications and deep learning – especially for complicated neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).  The idea behind TensorFlow is to the ability to create these computational graphs in code and allow significant performance improvements via parallel operations and other efficiency gains.

We can look at a similar graph in TensorFlow below, which shows the computational graph of a three-layer neural network.

TensorFlow tutorial - data flow graph

TensorFlow data flow graph

The animated data flows between different nodes in the graph are tensors which are multi-dimensional data arrays.  For instance, the input data tensor may be 5000 x 64 x 1, which represents a 64 node input layer with 5000 training samples.  After the input layer, there is a hidden layer with rectified linear units as the activation function.  There is a final output layer (called a “logit layer” in the above graph) that uses cross-entropy as a cost/loss function.  At each point we see the relevant tensors flowing to the “Gradients” block which finally flows to the Stochastic Gradient Descent optimizer which performs the back-propagation and gradient descent.

Here we can see how computational graphs can be used to represent the calculations in neural networks, and this, of course, is what TensorFlow excels at.  Let’s see how to perform some basic mathematical operations in TensorFlow to get a feel for how it all works.

2.0 A Simple TensorFlow example

So how can we make TensorFlow perform the little example calculation shown above – $a = (b + c) * (c + 2)$? First, there is a need to introduce TensorFlow variables.  The code below shows how to declare these objects:

import tensorflow as tf
# create TensorFlow variables
const = tf.Variable(2.0, name="const")
b = tf.Variable(2.0, name='b')
c = tf.Variable(1.0, name='c')

As can be observed above, TensorFlow variables can be declared using the tf.Variable function.  The first argument is the value to be assigned to the variable. The second is an optional name string which can be used to label the constant/variable – this is handy for when you want to do visualizations.  TensorFlow will infer the type of the variable from the initialized value, but it can also be set explicitly using the optional dtype argument.  TensorFlow has many of its own types like tf.float32, tf.int32 etc.

The objects assigned to the Python variables are actually TensorFlow tensors. Thereafter, they act like normal Python objects – therefore, if you want to access the tensors you need to keep track of the Python variables. In previous versions of TensorFlow, there were global methods of accessing the tensors and operations based on their names. This is no longer the case.

To examine the tensors stored in the Python variables, simply call them as you would a normal Python variable. If we do this for the “const” variable, you will see the following output:

<tf.Variable ‘const:0′ shape=() dtype=float32, numpy=2.0>

This output gives you a few different pieces of information – first, is the name ‘const:0’ which has been assigned to the tensor. Next is the data type, in this case, a TensorFlow float 32 type. Finally, there is a “numpy” value. TensorFlow variables in TensorFlow 2 can be converted easily into numpy objects. Numpy stands for Numerical Python and is a crucial library for Python data science and machine learning. If you don’t know Numpy, what it is, and how to use it, check out this site. The command to access the numpy form of the tensor is simply .numpy() – the use of this method will be shown shortly.

Next, some calculation operations are created:

# now create some operations
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

Note that d and e are automatically converted to tensor values upon the execution of the operations. TensorFlow has a wealth of calculation operations available to perform all sorts of interactions between tensors, as you will discover as you progress through this book.  The purpose of the operations shown above are pretty obvious, and they instantiate the operations b + c, c + 2.0, and d * e. However, these operations are an unwieldy way of doing things in TensorFlow 2. The operations below are equivalent to those above:

d = b + c
e = c + 2
a = d * e

To access the value of variable a, one can use the .numpy() method as shown below:

print(f”Variable a is {a.numpy()}”)

The computational graph for this simple example can be visualized by using the TensorBoard functionality that comes packaged with TensorFlow. This is a great visualization feature and is explained more in this post. Here is what the graph looks like in TensorBoard:

TensorFlow tutorial - simple graph

Simple TensorFlow graph

The larger two vertices or nodes, b and c, correspond to the variables. The smaller nodes correspond to the operations, and the edges between the vertices are the scalar values emerging from the variables and operations.

The example above is a trivial example – what would this look like if there was an array of b values from which an array of equivalent a values would be calculated? TensorFlow variables can easily be instantiated using numpy variables, like the following:

b = tf.Variable(np.arange(0, 10), name='b')

Calling b shows the following:

<tf.Variable ‘b:0′ shape=(10,) dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

Note the numpy value of the tensor is an array. Because the numpy variable passed during the instantiation is a range of int32 values, we can’t add it directly to c as c is of float32 type. Therefore, the tf.cast operation, which changes the type of a tensor, first needs to be utilized like so:

d = tf.cast(b, tf.float32) + c

Running the rest of the previous operations, using the new b tensor, gives the following value for a:

Variable a is [ 3.  6.  9. 12. 15. 18. 21. 24. 27. 30.]

In numpy, the developer can directly access slices or individual indices of an array and change their values directly. Can the same be done in TensorFlow 2? Can individual indices and/or slices be accessed and changed? The answer is yes, but not quite as straight-forwardly as in numpy. For instance, if b was a simple numpy array, one could easily execute the following b[1] = 10 – this would change the value of the second element in the array to the integer 10.

b[1].assign(10)

This will then flow through to a like so:

Variable a is [ 3. 33.  9. 12. 15. 18. 21. 24. 27. 30.]

The developer could also run the following, to assign a slice of b values:

b[6:9].assign([10, 10, 10])

A new tensor can also be created by using the slice notation:

f = b[2:5]

The explanations and code above show you how to perform some basic tensor manipulations and operations. In the section below, an example will be presented where a neural network is created using the Eager paradigm in TensorFlow 2. It will show how to create a training loop, perform a feed-forward pass through a neural network and calculate and apply gradients to an optimization method.

3.0 A Neural Network Example

In this section, a simple three-layer neural network build in TensorFlow is demonstrated.  In following chapters more complicated neural network structures such as convolution neural networks and recurrent neural networks are covered.  For this example, though, it will be kept simple.

In this example, the MNIST dataset will be used that is packaged as part of the TensorFlow installation. This MNIST dataset is a set of 28×28 pixel grayscale images which represent hand-written digits.  It has 60,000 training rows, 10,000 testing rows, and 5,000 validation rows. It is a very common, basic, image classification dataset that is used in machine learning.

The data can be loaded by running the following:

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

As can be observed, the Keras MNIST data loader returns Python tuples corresponding to the training and test set respectively (Keras is another deep learning framework, now tightly integrated with TensorFlow, as mentioned earlier). The data sizes of the tuples defined above are:

  • x_train: (60,000 x 28 x 28)
  • y_train: (60,000)
  • x_test: (10,000 x 28 x 28)
  • y_test: (10,000)

The x data is the image information – 60,000 images of 28 x 28 pixels size in the training set. The images are grayscale (i.e black and white) with maximum values, specifying the intensity of whites, of 255. The x data will need to be scaled so that it resides between 0 and 1, as this improves training efficiency. The y data is the matching image labels – signifying what digit is displayed in the image. This will need to be transformed to “one-hot” format.

When using a standard, categorical cross-entropy loss function (this will be shown later), a one-hot format is required when training classification tasks, as the output layer of the neural network will have the same number of nodes as the total number of possible classification labels. The output node with the highest value is considered as a prediction for that corresponding label. For instance, in the MNIST task, there are 10 possible classification labels – 0 to 9. Therefore, there will be 10 output nodes in any neural network performing this classification task. If we have an example output vector of [0.01, 0.8, 0.25, 0.05, 0.10, 0.27, 0.55, 0.32, 0.11, 0.09], the maximum value is in the second position / output node, and therefore this corresponds to the digit “1”. To train the network to produce this sort of outcome when the digit “1” appears, the loss needs to be calculated according to the difference between the output of the network and a “one-hot” array of the label 1. This one-hot array looks like [0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

This conversion is easily performed in TensorFlow, as will be demonstrated shortly when the main training loop is covered.

One final thing that needs to be considered is how to extract the training data in batches of samples. The function below can handle this:

def get_batch(x_data, y_data, batch_size):
    idxs = np.random.randint(0, len(y_data), batch_size)
    return x_data[idxs,:,:], y_data[idxs]

As can be observed in the code above, the data to be batched i.e. the x and y data is passed to this function along with the batch size. The first line of the function generates a random vector of integers, with random values between 0 and the length of the data passed to the function. The number of random integers generated is equal to the batch size. The x and y data are then returned, but the return data is only for those random indices chosen. Note, that this is performed on numpy array objects – as will be shown shortly, the conversion from numpy arrays to tensor objects will be performed “on the fly” within the training loop.

There is also the requirement for a loss function and a feed-forward function, but these will be covered shortly.

# Python optimisation variables
epochs = 10
batch_size = 100

# normalize the input images by dividing by 255.0
x_train = x_train / 255.0
x_test = x_test / 255.0
# convert x_test to tensor to pass through model (train data will be converted to
# tensors on the fly)
x_test = tf.Variable(x_test)

First, the number of training epochs and the batch size are created – note these are simple Python variables, not TensorFlow variables. Next, the input training and test data, x_train and x_test, are scaled so that their values are between 0 and 1. Input data should always be scaled when training neural networks, as large, uncontrolled, inputs can heavily impact the training process. Finally, the test input data, x_test is converted into a tensor. The random batching process for the training data is most easily performed using numpy objects and functions. However, the test data will not be batched in this example, so the full test input data set x_test is converted into a tensor.

The next step is to setup the weight and bias variables for the three-layer neural network.  There are always L1 number of weights/bias tensors, where L is the number of layers.  These variables are defined in the code below:

# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random.normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random.normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random.normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random.normal([10]), name='b2')

The weight and bias variables are initialized using the tf.random.normal function – this function creates tensors of random numbers, drawn from a normal distribution. It allows the developer to specify things like the standard deviation of the distribution from which the random numbers are drawn.

Note the shape of the variables. The W1 variable is a [784, 300] tensor – the 784 nodes are the size of the input layer. This size comes from the flattening of the input images – if we have 28 rows and 28 columns of pixels, flattening these out gives us 1 row or column of 28 x 28 = 784 values.  The 300 in the declaration of W1 is the number of nodes in the hidden layer. The W2 variable is a [300, 10] tensor, connecting the 300-node hidden layer to the 10-node output layer. In each case, a name is given to the variable for later viewing in TensorBoard – the TensorFlow visualization package. The next step in the code is to create the computations that occur within the nodes of the network. If the reader recalls, the computations within the nodes of a neural network are of the following form:

$$z = Wx + b$$

$$h=f(z)$$

Where W is the weights matrix, x is the layer input vector, b is the bias and f is the activation function of the node. These calculations comprise the feed-forward pass of the input data through the neural network. To execute these calculations, a dedicated feed-forward function is created:

def nn_model(x_input, W1, b1, W2, b2):
    # flatten the input image from 28 x 28 to 784
    x_input = tf.reshape(x_input, (x_input.shape[0], -1))
    x = tf.add(tf.matmul(tf.cast(x_input, tf.float32), W1), b1)
    x = tf.nn.relu(x)
    logits = tf.add(tf.matmul(x, W2), b2)
    return logits

Examining the first line, the x_input data is reshaped from (batch_size, 28, 28) to (batch_size, 784) – in other words, the images are flattened out. On the next line, the input data is then converted to tf.float32 type using the TensorFlow cast function. This is important – the x­_input data comes in as tf.float64 type, and TensorFlow won’t perform a matrix multiplication operation (tf.matmul) between tensors of different data types. This re-typed input data is then matrix-multiplied by W1 using the TensorFlow matmul function (which stands for matrix multiplication). Then the bias b1 is added to this product. On the line after this, the ReLU activation function is applied to the output of this line of calculation. The ReLU function is usually the best activation function to use in deep learning – the reasons for this are discussed in this post.

The output of this calculation is then multiplied by the final set of weights W2, with the bias b2 added. The output of this calculation is titled logits. Note that no activation function has been applied to this output layer of nodes (yet). In machine/deep learning, the term “logits” refers to the un-activated output of a layer of nodes.

The reason no activation function has been applied to this layer is that there is a handy function in TensorFlow called tf.nn.softmax_cross_entropy_with_logits. This function does two things for the developer – it applies a softmax activation function to the logits, which transforms them into a quasi-probability (i.e. the sum of the output nodes is equal to 1). This is a common activation function to apply to an output layer in classification tasks. Next, it applies the cross-entropy loss function to the softmax activation output. The cross-entropy loss function is a commonly used loss in classification tasks. The theory behind it is quite interesting, but it won’t be covered in this book – a good summary can be found here. The code below applies this handy TensorFlow function, and in this example,  it has been nested in another function called loss_fn:

def loss_fn(logits, labels):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                                              logits=logits))
    return cross_entropy

The arguments to softmax_cross_entropy_with_logits are labels and logits. The logits argument is supplied from the outcome of the nn_model function. The usage of this function in the main training loop will be demonstrated shortly. The labels argument is supplied from the one-hot y values that are fed into loss_fn during the training process. The output of the softmax_cross_entropy_with_logits function will be the output of the cross-entropy loss value for each sample in the batch. To train the weights of the neural network, the average cross-entropy loss across the samples needs to be minimized as part of the optimization process. This is calculated by using the tf.reduce_mean function, which, unsurprisingly, calculates the mean of the tensor supplied to it.

The next step is to define an optimizer function. In many examples within this book, the versatile Adam optimizer will be used. The theory behind this optimizer is interesting, and is worth further examination (such as shown here) but won’t be covered in detail within this post. It is basically a gradient descent method, but with sophisticated averaging of the gradients to provide appropriate momentum to the learning. To define the optimizer, which will be used in the main training loop, the following code is run:

# setup the optimizer
optimizer = tf.keras.optimizers.Adam()

The Adam object can take a learning rate as input, but for the present purposes, the default value is used.

3.1 Training the network

Now that the appropriate functions, variables and optimizers have been created, it is time to define the overall training loop. The training loop is shown below:

total_batch = int(len(y_train) / batch_size)
for epoch in range(epochs):
    avg_loss = 0
    for i in range(total_batch):
        batch_x, batch_y = get_batch(x_train, y_train, batch_size=batch_size)
        # create tensors
        batch_x = tf.Variable(batch_x)
        batch_y = tf.Variable(batch_y)
        # create a one hot vector
        batch_y = tf.one_hot(batch_y, 10)
        with tf.GradientTape() as tape:
            logits = nn_model(batch_x, W1, b1, W2, b2)
            loss = loss_fn(logits, batch_y)
        gradients = tape.gradient(loss, [W1, b1, W2, b2])
        optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2]))
        avg_loss += loss / total_batch
    test_logits = nn_model(x_test, W1, b1, W2, b2)
    max_idxs = tf.argmax(test_logits, axis=1)
    test_acc = np.sum(max_idxs.numpy() == y_test) / len(y_test)
    print(f"Epoch: {epoch + 1}, loss={avg_loss:.3f}, test set      accuracy={test_acc*100:.3f}%")

print("nTraining complete!")

Stepping through the lines above, the first line is a calculation to determine the number of batches to run through in each training epoch – this will ensure that, on average, each training sample will be used once in the epoch.  After that, a loop for each training epoch is entered. An avg_cost variable is initialized to keep track of the average cross entropy cost/loss for each epoch. The next line is where randomised batches of samples are extracted (batch_x and batch_y) from the MNIST training dataset, using the get_batch() function that was created earlier.

Next, the batch_x and batch_y numpy variables are converted to tensor variables. After this, the label data stored in batch_y as simple integers (i.e. 2 for handwritten digit “2” and so on) needs to be converted to “one hot” format, as discussed previously. To do this, the tf.one_hot function can be utilized – the first argument to this function is the tensor you wish to convert, and the second argument is the number of distinct classes. This transforms the batch_y tensor from size (batch_size, 1) to (batch_size, 10).

The next line is important. Here the TensorFlow GradientTape API is introduced. In previous versions of TensorFlow a static graph of all the operations and variables was constructed. In this paradigm, the gradients that were required to be calculated could be determined by reading from the graph structure. However, in Eager mode, all tensor calculations are performed on the fly, and TensorFlow doesn’t know which variables and operations you are interested in calculating gradients for. The Gradient Tape API is the solution for this. Whatever variables and operations you wish to calculate gradients over you supply to the “with GradientTape() as tape:” context manager. In a neural network, this involves all the variables and operations involved in the feed-forward pass through your network, along with the evaluation of the loss function. Note that if you call a function within the gradient tape context, all the operations performed within that function (and any further nested functions), will be captured for gradient calculation as required.

As can be observed in the code above, the feed forward pass and the loss function evaluation are encapsulated in the functions which were explained earlier: nn_model and loss_fn. By executing these functions within the gradient tape context manager, TensorFlow knows to keep track of all the variables and operation outcomes to ensure they are ready for gradient computations. Following the function calls nn_model and loss_fn within the gradient tape context, we have the place where the gradients of the neural network are calculated.

Here, the gradient tape is accessed via its name (tape in this example) and the gradient function is called tape.gradient(). The first argument to this function is the dependent variable of the differentiation, and the second argument is the independent variable/s. In other words, if we were trying to calculate the derivative dy/dx, the first argument would be y and the second would be x for this function.  In the context of a neural network, we are trying to calculate dL/dw and dL/db where L is the loss, w represents the weights and b the weights of the bias connections. Therefore, in the code above, the reader can observe that the first argument is the loss output from loss_fn and the second argument is a list of all the weight and bias variables through-out the simple neural network.

The next line is where these gradients are zipped together with the weight and bias variables and passed to the optimizer to perform the gradient descent step. This is executed easily using the optimizer’s apply_gradients() function.

The line following this is the accumulation of the average loss within the epoch. This constitutes the inner-epoch training loop. In the outer epoch training loop, after each epoch of training, the accuracy of the model on the test set is evaluated.

To determine the accuracy, first the test set images are passed through the neural network model using nn_model. This returns the logits from the model (the un-activated outputs from the last layer). The “prediction” of the model is then calculated from these logits – whatever output node has the highest logits value, this constitutes the digit prediction of the model. To determine what the highest logit value is for each test image, we can use the tf.argmax() function. This function mimics the numpy argmax() function, which returns the index of the highest value in an array/tensor. The logits output from the model in this case will be of the following dimensions: (test_set_size, 10) – we want the argmax function to find the maximum in each of the “column” dimensions i.e. across the 10 output nodes. The “row” dimension corresponds to axis=0, and the column dimension corresponds to axis=1. Therefore, supplying the axis=1 argument to tf.argmax() function creates (test_set_size, 1) integer predictions.

In the following line, these max_idxs are converted to a numpy array (using .numpy()) and asserted to be equal to the test labels (also integers – you will recall that we did not convert the test labels to a one-hot format). Where the labels are equal, this will return a “true” value, which is equivalent to an integer of 1 in numpy, or alternatively a “false” / 0 value. By summing up the results of these assertions, we obtain the number of correct predictions. Dividing this by the total size of the test set, the test set accuracy is obtained.

Note: if some of these explanations aren’t immediately clear, it is a good idea to jump over to the code supplied for this chapter and running it within a standard Python development environment. Insert a breakpoint in the code that you want to examine more closely – you can then inspect all the tensor sizes, convert them to numpy arrays, apply operations on the fly and so on. This is all possible within TensorFlow 2 now that the default operating paradigm is Eager execution.

The epoch number, average loss and accuracy are then printed, so one can observe the progress of the training. The average loss should be decreasing on average after every epoch – if it is not, something is going wrong with the network, or the learning has stagnated. Therefore, it is an important variable to monitor. On running this code, something like the following output should be observed:

Epoch: 1, cost=0.317, test set accuracy=94.350%

Epoch: 2, cost=0.124, test set accuracy=95.940%

Epoch: 3, cost=0.085, test set accuracy=97.070%

Epoch: 4, cost=0.065, test set accuracy=97.570%

Epoch: 5, cost=0.052, test set accuracy=97.630%

Epoch: 6, cost=0.048, test set accuracy=97.620%

Epoch: 7, cost=0.037, test set accuracy=97.770%

Epoch: 8, cost=0.032, test set accuracy=97.630%

Epoch: 9, cost=0.027, test set accuracy=97.950%

Epoch: 10, cost=0.022, test set accuracy=98.000%

Training complete!

As can be observed, the loss declines monotonically, and the test set accuracy steadily increases. This shows that the model is training correctly. It is also possible to visualize the training progress using TensorBoard, as shown below:

TensorFlow tutorial - TensorBoard accuracy plot

TensorBoard plot of the increase in accuracy over 10 epochs

I hope this tutorial was instructive and helps get you going on the TensorFlow journey.  Just a reminder, you can check out the code for this post here.  I’ve also written an article that shows you how to build more complex neural networks such as convolution neural networks, recurrent neural networks, and Word2Vec natural language models in TensorFlow.  You also might want to check out a higher level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

Have fun!

The post Python TensorFlow Tutorial – Build a Neural Network appeared first on Adventures in Machine Learning.

Categories
Misc

Brain image segmentation with torch

When what is not enough

True, sometimes it’s vital to distinguish between different
kinds of objects. Is that a car speeding towards me, in which case
I’d better jump out of the way? Or is it a huge Doberman (in
which case I’d probably do the same)? Often in real life though,
instead of coarse-grained classification, what is needed is
fine-grained segmentation.

Zooming in on images, we’re not looking for a single label;
instead, we want to classify every pixel according to some
criterion:

  • In medicine, we may want to distinguish between different cell
    types, or identify tumors.

  • In various earth sciences, satellite data are used to segment
    terrestrial surfaces.

  • To enable use of custom backgrounds, video-conferencing software
    has to be able to tell foreground from background.

Image segmentation is a form of supervised learning: Some kind
of ground truth is needed. Here, it comes in form of a mask – an
image, of spatial resolution identical to that of the input data,
that designates the true class for every pixel. Accordingly,
classification loss is calculated pixel-wise; losses are then
summed up to yield an aggregate to be used in optimization.

The “canonical” architecture for image segmentation is U-Net
(around since 2015).

U-Net

Here is the prototypical U-Net, as depicted in the original
Rönneberger et al.paper (Ronneberger, Fischer, and Brox
2015).

Of this architecture, numerous variants exist. You could use
different layer sizes, activations, ways to achieve downsizing and
upsizing, and more. However, there is one defining characteristic:
the U-shape, stabilized by the “bridges” crossing over
horizontally at all levels.

In a nutshell, the left-hand side of the U resembles the
convolutional architectures used in image classification. It
successively reduces spatial resolution. At the same time, another
dimension – the channels dimension – is used to build up a
hierarchy of features, ranging from very basic to very
specialized.

Unlike in classification, however, the output should have the
same spatial resolution as the input. Thus, we need to upsize again
– this is taken care of by the right-hand side of the U. But, how
are we going to arrive at a good per-pixel classification, now that
so much spatial information has been lost?

This is what the “bridges” are for: At each level, the input
to an upsampling layer is a concatenation of the previous layer’s
output – which went through the whole compression/decompression
routine – and some preserved intermediate representation from the
downsizing phase. In this way, a U-Net architecture combines
attention to detail with feature extraction.

Brain image segmentation

With U-Net, domain applicability is as broad as the architecture
is flexible. Here, we want to detect abnormalities in brain scans.
The dataset, used in Buda, Saha, and Mazurowski (2019), contains
MRI images together with manually created
FLAIR
abnormality segmentation masks. It is available on
Kaggle.

Nicely, the paper is accompanied by a GitHub
repository
. Below, we closely follow (though not exactly
replicate) the authors’ preprocessing and data augmentation
code.

As is often the case in medical imaging, there is notable class
imbalance in the data. For every patient, sections have been taken
at multiple positions. (Number of sections per patient varies.)
Most sections do not exhibit any lesions; the corresponding masks
are colored black everywhere.

Here are three examples where the masks do indicate
abnormalities:

Let’s see if we can build a U-Net that generates such masks
for us.

Data

Before you start typing, here is a
Colaboratory notebook
to conveniently follow along.

We use pins to obtain the data. Please see this
introduction
if you haven’t used that package before.

# deep learning (incl. dependencies) library(torch) library(torchvision) # data wrangling library(tidyverse) library(zeallot) # image processing and visualization library(magick) library(cowplot) # dataset loading library(pins) library(zip) torch_manual_seed(777) set.seed(777) # use your own kaggle.json here pins::board_register_kaggle(token = "~/kaggle.json") files <- pins::pin_get("mateuszbuda/lgg-mri-segmentation", board = "kaggle", extract = FALSE)

The dataset is not that big – it includes scans from 110
different patients – so we’ll have to do with just a training
and a validation set. (Don’t do this in real life, as you’ll
inevitably end up fine-tuning on the latter.)

train_dir <- "data/mri_train" valid_dir <- "data/mri_valid" if(dir.exists(train_dir)) unlink(train_dir, recursive = TRUE, force = TRUE) if(dir.exists(valid_dir)) unlink(valid_dir, recursive = TRUE, force = TRUE) zip::unzip(files, exdir = "data") file.rename("data/kaggle_3m", train_dir) # this is a duplicate, again containing kaggle_3m (evidently a packaging error on Kaggle) # we just remove it unlink("data/lgg-mri-segmentation", recursive = TRUE) dir.create(valid_dir)

Of those 110 patients, we keep 30 for validation. Some more file
manipulations, and we’re set up with a nice hierarchical
structure, with train_dir and valid_dir holding their per-patient
sub-directories, respectively.

valid_indices <- sample(1:length(patients), 30) patients <- list.dirs(train_dir, recursive = FALSE) for (i in valid_indices) { dir.create(file.path(valid_dir, basename(patients[i]))) for (f in list.files(patients[i])) { file.rename(file.path(train_dir, basename(patients[i]), f), file.path(valid_dir, basename(patients[i]), f)) } unlink(file.path(train_dir, basename(patients[i])), recursive = TRUE) }

We now need a dataset that knows what to do with these
files.

Dataset

Like every torch dataset, this one has initialize() and
.getitem() methods. initialize() creates an inventory of scan and
mask file names, to be used by .getitem() when it actually reads
those files. In contrast to what we’ve seen in previous posts,
though , .getitem() does not simply return input-target pairs in
order. Instead, whenever the parameter random_sampling is true, it
will perform weighted sampling, preferring items with sizable
lesions. This option will be used for the training set, to counter
the class imbalance mentioned above.

The other way training and validation sets will differ is use of
data augmentation. Training images/masks may be flipped, re-sized,
and rotated; probabilities and amounts are configurable.

An instance of brainseg_dataset encapsulates all this
functionality:

brainseg_dataset <- dataset( name = "brainseg_dataset", initialize = function(img_dir, augmentation_params = NULL, random_sampling = FALSE) { self$images <- tibble( img = grep( list.files( img_dir, full.names = TRUE, pattern = "tif", recursive = TRUE ), pattern = 'mask', invert = TRUE, value = TRUE ), mask = grep( list.files( img_dir, full.names = TRUE, pattern = "tif", recursive = TRUE ), pattern = 'mask', value = TRUE ) ) self$slice_weights <- self$calc_slice_weights(self$images$mask) self$augmentation_params <- augmentation_params self$random_sampling <- random_sampling }, .getitem = function(i) { index <- if (self$random_sampling == TRUE) sample(1:self$.length(), 1, prob = self$slice_weights) else i img <- self$images$img[index] %>% image_read() %>% transform_to_tensor() mask <- self$images$mask[index] %>% image_read() %>% transform_to_tensor() %>% transform_rgb_to_grayscale() %>% torch_unsqueeze(1) img <- self$min_max_scale(img) if (!is.null(self$augmentation_params)) { scale_param <- self$augmentation_params[1] c(img, mask) %<-% self$resize(img, mask, scale_param) rot_param <- self$augmentation_params[2] c(img, mask) %<-% self$rotate(img, mask, rot_param) flip_param <- self$augmentation_params[3] c(img, mask) %<-% self$flip(img, mask, flip_param) } list(img = img, mask = mask) }, .length = function() { nrow(self$images) }, calc_slice_weights = function(masks) { weights <- map_dbl(masks, function(m) { img <- as.integer(magick::image_data(image_read(m), channels = "gray")) sum(img / 255) }) sum_weights <- sum(weights) num_weights <- length(weights) weights <- weights %>% map_dbl(function(w) { w <- (w + sum_weights * 0.1 / num_weights) / (sum_weights * 1.1) }) weights }, min_max_scale = function(x) { min = x$min()$item() max = x$max()$item() x$clamp_(min = min, max = max) x$add_(-min)$div_(max - min + 1e-5) x }, resize = function(img, mask, scale_param) { img_size <- dim(img)[2] rnd_scale <- runif(1, 1 - scale_param, 1 + scale_param) img <- transform_resize(img, size = rnd_scale * img_size) mask <- transform_resize(mask, size = rnd_scale * img_size) diff <- dim(img)[2] - img_size if (diff > 0) { top <- ceiling(diff / 2) left <- ceiling(diff / 2) img <- transform_crop(img, top, left, img_size, img_size) mask <- transform_crop(mask, top, left, img_size, img_size) } else { img <- transform_pad(img, padding = -c( ceiling(diff / 2), floor(diff / 2), ceiling(diff / 2), floor(diff / 2) )) mask <- transform_pad(mask, padding = -c( ceiling(diff / 2), floor(diff / 2), ceiling(diff / 2), floor(diff / 2) )) } list(img, mask) }, rotate = function(img, mask, rot_param) { rnd_rot <- runif(1, 1 - rot_param, 1 + rot_param) img <- transform_rotate(img, angle = rnd_rot) mask <- transform_rotate(mask, angle = rnd_rot) list(img, mask) }, flip = function(img, mask, flip_param) { rnd_flip <- runif(1) if (rnd_flip > flip_param) { img <- transform_hflip(img) mask <- transform_hflip(mask) } list(img, mask) } )

After instantiation, we see we have 2977 training pairs and 952
validation pairs, respectively:

train_ds <- brainseg_dataset( train_dir, augmentation_params = c(0.05, 15, 0.5), random_sampling = TRUE ) length(train_ds) # 2977 valid_ds <- brainseg_dataset( valid_dir, augmentation_params = NULL, random_sampling = FALSE ) length(valid_ds) # 952

As a correctness check, let’s plot an image and associated
mask:

par(mfrow = c(1, 2), mar = c(0, 1, 0, 1)) img_and_mask <- valid_ds[27] img <- img_and_mask[[1]] mask <- img_and_mask[[2]] img$permute(c(2, 3, 1)) %>% as.array() %>% as.raster() %>% plot() mask$squeeze() %>% as.array() %>% as.raster() %>% plot()

With torch, it is straightforward to inspect what happens when
you change augmentation-related parameters. We just pick a pair
from the validation set, which has not had any augmentation applied
as yet, and call valid_ds$<augmentation_func()> directly.
Just for fun, let’s use more “extreme” parameters here than
we do in actual training. (Actual training uses the settings from
Mateusz’ GitHub repository, which we assume have been carefully
chosen for optimal performance.1)

img_and_mask <- valid_ds[77] img <- img_and_mask[[1]] mask <- img_and_mask[[2]] imgs <- map (1:24, function(i) { # scale factor; train_ds really uses 0.05 c(img, mask) %<-% valid_ds$resize(img, mask, 0.2) c(img, mask) %<-% valid_ds$flip(img, mask, 0.5) # rotation angle; train_ds really uses 15 c(img, mask) %<-% valid_ds$rotate(img, mask, 90) img %>% transform_rgb_to_grayscale() %>% as.array() %>% as_tibble() %>% rowid_to_column(var = "Y") %>% gather(key = "X", value = "value", -Y) %>% mutate(X = as.numeric(gsub("V", "", X))) %>% ggplot(aes(X, Y, fill = value)) + geom_raster() + theme_void() + theme(legend.position = "none") + theme(aspect.ratio = 1) }) plot_grid(plotlist = imgs, nrow = 4)

Now we still need the data loaders, and then, nothing keeps us
from proceeding to the next big task: building the model.

batch_size <- 4 train_dl <- dataloader(train_ds, batch_size) valid_dl <- dataloader(valid_ds, batch_size)

Model

Our model nicely illustrates the kind of modular code that comes
“naturally” with torch. We approach things top-down, starting
with the U-Net container itself.

unet takes care of the global composition – how far “down”
do we go, shrinking the image while incrementing the number of
filters, and then how do we go “up” again?

Importantly, it is also in the system’s memory. In forward(),
it keeps track of layer outputs seen going “down”, to be added
back in going “up”.

unet <- nn_module( "unet", initialize = function(channels_in = 3, n_classes = 1, depth = 5, n_filters = 6) { self$down_path <- nn_module_list() prev_channels <- channels_in for (i in 1:depth) { self$down_path$append(down_block(prev_channels, 2 ^ (n_filters + i - 1))) prev_channels <- 2 ^ (n_filters + i -1) } self$up_path <- nn_module_list() for (i in ((depth - 1):1)) { self$up_path$append(up_block(prev_channels, 2 ^ (n_filters + i - 1))) prev_channels <- 2 ^ (n_filters + i - 1) } self$last = nn_conv2d(prev_channels, n_classes, kernel_size = 1) }, forward = function(x) { blocks <- list() for (i in 1:length(self$down_path)) { x <- self$down_path[[i]](x) if (i != length(self$down_path)) { blocks <- c(blocks, x) x <- nnf_max_pool2d(x, 2) } } for (i in 1:length(self$up_path)) { x <- self$up_path[[i]](x, blocks[[length(blocks) - i + 1]]$to(device = device)) } torch_sigmoid(self$last(x)) } )

unet delegates to two containers just below it in the hierarchy:
down_block and up_block. While down_block is “just” there for
aesthetic reasons (it immediately delegates to its own workhorse,
conv_block), in up_block we see the U-Net “bridges” in
action.

down_block <- nn_module( "down_block", initialize = function(in_size, out_size) { self$conv_block <- conv_block(in_size, out_size) }, forward = function(x) { self$conv_block(x) } ) up_block <- nn_module( "up_block", initialize = function(in_size, out_size) { self$up = nn_conv_transpose2d(in_size, out_size, kernel_size = 2, stride = 2) self$conv_block = conv_block(in_size, out_size) }, forward = function(x, bridge) { up <- self$up(x) torch_cat(list(up, bridge), 2) %>% self$conv_block() } )

Finally, a conv_block is a sequential structure containing
convolutional, ReLU, and dropout layers.

conv_block <- nn_module( "conv_block", initialize = function(in_size, out_size) { self$conv_block <- nn_sequential( nn_conv2d(in_size, out_size, kernel_size = 3, padding = 1), nn_relu(), nn_dropout(0.6), nn_conv2d(out_size, out_size, kernel_size = 3, padding = 1), nn_relu() ) }, forward = function(x){ self$conv_block(x) } )

Now instantiate the model, and possibly, move it to the GPU:

device <- torch_device(if(cuda_is_available()) "cuda" else "cpu") model <- unet(depth = 5)$to(device = device)

Optimization

We train our model with a combination of cross entropy and

dice loss
.

The latter, though not shipped with torch, may be implemented
manually:

calc_dice_loss <- function(y_pred, y_true) { smooth <- 1 y_pred <- y_pred$view(-1) y_true <- y_true$view(-1) intersection <- (y_pred * y_true)$sum() 1 - ((2 * intersection + smooth) / (y_pred$sum() + y_true$sum() + smooth)) } dice_weight <- 0.3

Optimization uses stochastic gradient descent (SGD), together
with the one-cycle learning rate scheduler introduced in the
context of
image classification with torch
.

optimizer <- optim_sgd(model$parameters, lr = 0.1, momentum = 0.9) num_epochs <- 20 scheduler <- lr_one_cycle( optimizer, max_lr = 0.1, steps_per_epoch = length(train_dl), epochs = num_epochs )

Training

The training loop then follows the usual scheme. One thing to
note: Every epoch, we save the model (using torch_save()), so we
can later pick the best one, should performance have degraded
thereafter.

train_batch <- function(b) { optimizer$zero_grad() output <- model(b[[1]]$to(device = device)) target <- b[[2]]$to(device = device) bce_loss <- nnf_binary_cross_entropy(output, target) dice_loss <- calc_dice_loss(output, target) loss <- dice_weight * dice_loss + (1 - dice_weight) * bce_loss loss$backward() optimizer$step() scheduler$step() list(bce_loss$item(), dice_loss$item(), loss$item()) } valid_batch <- function(b) { output <- model(b[[1]]$to(device = device)) target <- b[[2]]$to(device = device) bce_loss <- nnf_binary_cross_entropy(output, target) dice_loss <- calc_dice_loss(output, target) loss <- dice_weight * dice_loss + (1 - dice_weight) * bce_loss list(bce_loss$item(), dice_loss$item(), loss$item()) } for (epoch in 1:num_epochs) { model$train() train_bce <- c() train_dice <- c() train_loss <- c() for (b in enumerate(train_dl)) { c(bce_loss, dice_loss, loss) %<-% train_batch(b) train_bce <- c(train_bce, bce_loss) train_dice <- c(train_dice, dice_loss) train_loss <- c(train_loss, loss) } torch_save(model, paste0("model_", epoch, ".pt")) cat(sprintf("nEpoch %d, training: loss:%3f, bce: %3f, dice: %3fn", epoch, mean(train_loss), mean(train_bce), mean(train_dice))) model$eval() valid_bce <- c() valid_dice <- c() valid_loss <- c() i <- 0 for (b in enumerate(valid_dl)) { i <<- i + 1 c(bce_loss, dice_loss, loss) %<-% valid_batch(b) valid_bce <- c(valid_bce, bce_loss) valid_dice <- c(valid_dice, dice_loss) valid_loss <- c(valid_loss, loss) } cat(sprintf("nEpoch %d, validation: loss:%3f, bce: %3f, dice: %3fn", epoch, mean(valid_loss), mean(valid_bce), mean(valid_dice))) }
Epoch 1, training: loss:0.304232, bce: 0.148578, dice: 0.667423 Epoch 1, validation: loss:0.333961, bce: 0.127171, dice: 0.816471 Epoch 2, training: loss:0.194665, bce: 0.101973, dice: 0.410945 Epoch 2, validation: loss:0.341121, bce: 0.117465, dice: 0.862983 [...] Epoch 19, training: loss:0.073863, bce: 0.038559, dice: 0.156236 Epoch 19, validation: loss:0.302878, bce: 0.109721, dice: 0.753577 Epoch 20, training: loss:0.070621, bce: 0.036578, dice: 0.150055 Epoch 20, validation: loss:0.295852, bce: 0.101750, dice: 0.748757

Evaluation

In this run, it is the final model that performs best on the
validation set. Still, we’d like to show how to load a saved
model, using torch_load() .

Once loaded, put the model into eval mode:

saved_model <- torch_load("model_20.pt") model <- saved_model model$eval()

Now, since we don’t have a separate test set, we already know
the average out-of-sample metrics; but in the end, what we care
about are the generated masks. Let’s view some, displaying ground
truth and MRI scans for comparison.

# without random sampling, we'd mainly see lesion-free patches eval_ds <- brainseg_dataset(valid_dir, augmentation_params = NULL, random_sampling = TRUE) eval_dl <- dataloader(eval_ds, batch_size = 8) batch <- eval_dl %>% dataloader_make_iter() %>% dataloader_next() par(mfcol = c(3, 8), mar = c(0, 1, 0, 1)) for (i in 1:8) { img <- batch[[1]][i, .., drop = FALSE] inferred_mask <- model(img$to(device = device)) true_mask <- batch[[2]][i, .., drop = FALSE]$to(device = device) bce <- nnf_binary_cross_entropy(inferred_mask, true_mask)$to(device = "cpu") %>% as.numeric() dc <- calc_dice_loss(inferred_mask, true_mask)$to(device = "cpu") %>% as.numeric() cat(sprintf("nSample %d, bce: %3f, dice: %3fn", i, bce, dc)) inferred_mask <- inferred_mask$to(device = "cpu") %>% as.array() %>% .[1, 1, , ] inferred_mask <- ifelse(inferred_mask > 0.5, 1, 0) img[1, 1, ,] %>% as.array() %>% as.raster() %>% plot() true_mask$to(device = "cpu")[1, 1, ,] %>% as.array() %>% as.raster() %>% plot() inferred_mask %>% as.raster() %>% plot() }

We also print the individual cross entropy and dice losses;
relating those to the generated masks might yield useful
information for model tuning.

Sample 1, bce: 0.088406, dice: 0.387786} Sample 2, bce: 0.026839, dice: 0.205724 Sample 3, bce: 0.042575, dice: 0.187884 Sample 4, bce: 0.094989, dice: 0.273895 Sample 5, bce: 0.026839, dice: 0.205724 Sample 6, bce: 0.020917, dice: 0.139484 Sample 7, bce: 0.094989, dice: 0.273895 Sample 8, bce: 2.310956, dice: 0.999824

While far from perfect, most of these masks aren’t that bad
– a nice result given the small dataset!

Wrapup

This has been our most complex torch post so far; however, we
hope you’ve found the time well spent. For one, among
applications of deep learning, medical image segmentation stands
out as highly societally useful. Secondly, U-Net-like architectures
are employed in many other areas. And finally, we once more saw
torch’s flexibility and intuitive behavior in action.

Thanks for reading!

Buda, Mateusz, Ashirbani Saha, and Maciej A. Mazurowski. 2019.
“Association of Genomic Subtypes of Lower-Grade Gliomas with
Shape Features Automatically Extracted by a Deep Learning
Algorithm.” Computers in Biology and Medicine 109: 218–25.
https://doi.org/https://doi.org/10.1016/j.compbiomed.2019.05.002.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015.
“U-Net: Convolutional Networks for Biomedical Image
Segmentation.” CoRR abs/1505.04597. http://arxiv.org/abs/1505.04597.

  1. Yes, we did a few experiments, confirming that more augmentation
    isn’t better … what did I say about inevitably ending up doing
    optimization on the validation set …?↩︎

Categories
Misc

Python TensorFlow Tutorial – Build a Neural Network

Updated for TensorFlow 2

Google’s TensorFlow has been a hot topic in deep learning
recently. The open source software, designed to allow efficient
computation of data flow graphs, is especially suited to deep
learning tasks. It is designed to be executed on single or
multiple CPUs and GPUs, making it a good option for complex deep
learning tasks.  In its most recent incarnation – version 1.0 –
it can even be run on certain mobile operating systems.  This
introductory tutorial to TensorFlow will give an overview of some
of the basic concepts of TensorFlow in Python.  These will be a
good stepping stone to building more complex deep learning
networks, such as
Convolution Neural Networks
,
natural language models
, and
Recurrent Neural Networks
in the package.  We’ll be creating a
simple three-layer neural network to classify the MNIST dataset. 
This tutorial assumes that you are familiar with the basics of
neural networks, which you can get up to scratch with in the

neural networks tutorial
if required.  To install TensorFlow,
follow the instructions here. The code for this
tutorial can be found in this
site’s GitHub repository
.  Once you’re done, you also might
want to check out a higher level deep learning library that sits on
top of TensorFlow called Keras – see
my Keras tutorial
.

First, let’s have a look at the main ideas of TensorFlow.

1.0 TensorFlow graphs

TensorFlow is based on graph based computation – “what on
earth is that?”, you might say.  It’s an alternative way
of conceptualising mathematical calculations.  Consider the
following expression $a = (b + c) * (c + 2)$.  We can break this
function down into the following components:

begin{align}
d &= b + c \
e &= c + 2 \
a &= d * e
end{align}

Now we can represent these operations graphically as:

TensorFlow tutorial - simple computational graph

Simple computational graph

This may seem like a silly example – but notice a powerful
idea in expressing the equation this way: two of the computations
($d=b+c$ and $e=c+2$) can be performed in parallel.  By splitting
up these calculations across CPUs or GPUs, this can give us
significant gains in computational times.  These gains are a must
for big data applications and deep learning – especially for
complicated neural network architectures such as Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).  The
idea behind TensorFlow is to the ability to create these
computational graphs in code and allow significant performance
improvements via parallel operations and other efficiency
gains.

We can look at a similar graph in TensorFlow below, which shows
the computational graph of a three-layer neural network.

TensorFlow tutorial - data flow graph

TensorFlow data flow graph

The animated data flows between different nodes in the graph are
tensors which are multi-dimensional data arrays.  For instance, the
input data tensor may be 5000 x 64 x 1, which represents a 64 node
input layer with 5000 training samples.  After the input layer,
there is a hidden layer with rectified
linear units
as the activation function.  There is a final
output layer (called a “logit layer” in the above graph) that
uses cross-entropy as a cost/loss function.  At each point we see
the relevant tensors flowing to the “Gradients” block which
finally flows to the
Stochastic Gradient Descent
optimizer which performs the
back-propagation and gradient descent.

Here we can see how computational graphs can be used to
represent the calculations in neural networks, and this, of course,
is what TensorFlow excels at.  Let’s see how to perform some basic
mathematical operations in TensorFlow to get a feel for how it all
works.

2.0 A Simple TensorFlow example

So how can we make TensorFlow perform the little example
calculation shown above – $a = (b + c) * (c + 2)$? First, there
is a need to introduce TensorFlow variables.  The code below shows
how to declare these objects:

import tensorflow as tf
# create TensorFlow variables
const = tf.Variable(2.0, name="const")
b = tf.Variable(2.0, name='b')
c = tf.Variable(1.0, name='c')

As can be observed above, TensorFlow variables can be declared
using the tf.Variable function.  The first argument is the value to
be assigned to the variable. The second is an optional name string
which can be used to label the constant/variable – this is handy
for when you want to do visualizations.  TensorFlow will infer the
type of the variable from the initialized value, but it can also be
set explicitly using the optional dtype argument.  TensorFlow has
many of its own types like tf.float32, tf.int32 etc.

The objects assigned to the Python variables are actually
TensorFlow tensors. Thereafter, they act like normal Python objects
– therefore, if you want to access the tensors you need to keep
track of the Python variables. In previous versions of TensorFlow,
there were global methods of accessing the tensors and operations
based on their names. This is no longer the case.

To examine the tensors stored in the Python variables, simply
call them as you would a normal Python variable. If we do this for
the “const” variable, you will see the following output:

<tf.Variable ‘const:0′ shape=() dtype=float32,
numpy=2.0>

This output gives you a few different pieces of information –
first, is the name ‘const:0’ which has been assigned to the
tensor. Next is the data type, in this case, a TensorFlow float 32
type. Finally, there is a “numpy” value. TensorFlow variables
in TensorFlow 2 can be converted easily into numpy objects. Numpy
stands for Numerical Python and is a crucial library for Python
data science and machine learning. If you don’t know Numpy, what
it is, and how to use it, check out this site. The command to access the
numpy form of the tensor is simply .numpy() – the use of this
method will be shown shortly.

Next, some calculation operations are created:

# now create some operations
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

Note that d and e are automatically converted to tensor values
upon the execution of the operations. TensorFlow has a wealth of
calculation operations available to perform all sorts of
interactions between tensors, as you will discover as you progress
through this book.  The purpose of the operations shown above are
pretty obvious, and they instantiate the operations b + c, c + 2.0,
and d * e. However, these operations are an unwieldy way of doing
things in TensorFlow 2. The operations below are equivalent to
those above:

d = b + c
e = c + 2
a = d * e

To access the value of variable a, one can use the .numpy()
method as shown below:

print(f”Variable a is {a.numpy()}”)

The computational graph for this simple example can be
visualized by using the TensorBoard functionality that comes
packaged with TensorFlow. This is a great visualization feature and
is explained more in
this post
. Here is what the graph looks like in
TensorBoard:

TensorFlow tutorial - simple graph

Simple TensorFlow graph

The larger two vertices or nodes, b and c, correspond to the
variables. The smaller nodes correspond to the operations, and the
edges between the vertices are the scalar values emerging from the
variables and operations.

The example above is a trivial example – what would this look
like if there was an array of b values from which an array of
equivalent a values would be calculated? TensorFlow variables can
easily be instantiated using numpy variables, like the
following:

b = tf.Variable(np.arange(0, 10), name='b')

Calling b shows the following:

<tf.Variable ‘b:0′ shape=(10,) dtype=int32, numpy=array([0,
1, 2, 3, 4, 5, 6, 7, 8, 9])>

Note the numpy value of the tensor is an array. Because the
numpy variable passed during the instantiation is a range of int32
values, we can’t add it directly to c as c is of float32 type.
Therefore, the tf.cast operation, which changes the type of a
tensor, first needs to be utilized like so:

d = tf.cast(b, tf.float32) + c

Running the rest of the previous operations, using the new b
tensor, gives the following value for a:

Variable a is [ 3.  6.  9. 12. 15. 18. 21. 24. 27. 30.]

In numpy, the developer can directly access slices or individual
indices of an array and change their values directly. Can the same
be done in TensorFlow 2? Can individual indices and/or slices be
accessed and changed? The answer is yes, but not quite as
straight-forwardly as in numpy. For instance, if b was a simple
numpy array, one could easily execute the following b[1] = 10 –
this would change the value of the second element in the array to
the integer 10.

b[1].assign(10)

This will then flow through to a like so:

Variable a is [ 3. 33.  9. 12. 15. 18. 21. 24. 27. 30.]

The developer could also run the following, to assign a slice of
b values:

b[6:9].assign([10, 10, 10])

A new tensor can also be created by using the slice
notation:

f = b[2:5]

The explanations and code above show you how to perform some
basic tensor manipulations and operations. In the section below, an
example will be presented where a neural network is created using
the Eager paradigm in TensorFlow 2. It will show how to create a
training loop, perform a feed-forward pass through a neural network
and calculate and apply gradients to an optimization method.

3.0 A Neural Network Example

In this section, a simple three-layer neural network build in
TensorFlow is demonstrated.  In following chapters more complicated
neural network structures such as convolution neural networks and
recurrent neural networks are covered.  For this example, though,
it will be kept simple.

In this example, the MNIST dataset will be used that is packaged
as part of the TensorFlow installation. This MNIST dataset is a set
of 28×28 pixel grayscale images which represent hand-written
digits.  It has 60,000 training rows, 10,000 testing rows, and
5,000 validation rows. It is a very common, basic, image
classification dataset that is used in machine learning.

The data can be loaded by running the following:

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

As can be observed, the Keras MNIST data loader returns Python
tuples corresponding to the training and test set respectively
(Keras is another deep learning framework, now tightly integrated
with TensorFlow, as mentioned earlier). The data sizes of the
tuples defined above are:

  • x_train: (60,000 x 28 x 28)
  • y_train: (60,000)
  • x_test: (10,000 x 28 x 28)
  • y_test: (10,000)

The x data is the image information – 60,000 images of 28 x 28
pixels size in the training set. The images are grayscale (i.e
black and white) with maximum values, specifying the intensity of
whites, of 255. The x data will need to be scaled so that it
resides between 0 and 1, as this improves training efficiency. The
y data is the matching image labels – signifying what digit is
displayed in the image. This will need to be transformed to
“one-hot” format.

When using a standard, categorical cross-entropy loss function
(this will be shown later), a one-hot format is required when
training classification tasks, as the output layer of the neural
network will have the same number of nodes as the total number of
possible classification labels. The output node with the highest
value is considered as a prediction for that corresponding label.
For instance, in the MNIST task, there are 10 possible
classification labels – 0 to 9. Therefore, there will be 10
output nodes in any neural network performing this classification
task. If we have an example output vector of [0.01, 0.8, 0.25,
0.05, 0.10, 0.27, 0.55, 0.32, 0.11, 0.09], the maximum value is in
the second position / output node, and therefore this corresponds
to the digit “1”. To train the network to produce this sort of
outcome when the digit “1” appears, the loss needs to be
calculated according to the difference between the output of the
network and a “one-hot” array of the label 1. This one-hot
array looks like [0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

This conversion is easily performed in TensorFlow, as will be
demonstrated shortly when the main training loop is covered.

One final thing that needs to be considered is how to extract
the training data in batches of samples. The function below can
handle this:

def get_batch(x_data, y_data, batch_size):
    idxs = np.random.randint(0, len(y_data), batch_size)
    return x_data[idxs,:,:], y_data[idxs]

As can be observed in the code above, the data to be batched
i.e. the x and y data is passed to this function along with the
batch size. The first line of the function generates a random
vector of integers, with random values between 0 and the length of
the data passed to the function. The number of random integers
generated is equal to the batch size. The x and y data are then
returned, but the return data is only for those random indices
chosen. Note, that this is performed on numpy array objects – as
will be shown shortly, the conversion from numpy arrays to tensor
objects will be performed “on the fly” within the training
loop.

There is also the requirement for a loss function and a
feed-forward function, but these will be covered shortly.

# Python optimisation variables
epochs = 10
batch_size = 100

# normalize the input images by dividing by 255.0
x_train = x_train / 255.0
x_test = x_test / 255.0
# convert x_test to tensor to pass through model (train data will be converted to
# tensors on the fly)
x_test = tf.Variable(x_test)

First, the number of training epochs and the batch size are
created – note these are simple Python variables, not TensorFlow
variables. Next, the input training and test data, x_train and
x_test, are scaled so that their values are between 0 and 1. Input
data should always be scaled when training neural networks, as
large, uncontrolled, inputs can heavily impact the training
process. Finally, the test input data, x_test is converted into a
tensor. The random batching process for the training data is most
easily performed using numpy objects and functions. However, the
test data will not be batched in this example, so the full test
input data set x_test is converted into a tensor.

The next step is to setup the weight and bias variables for the
three-layer neural network.  There are always L – 1 number of
weights/bias tensors, where L is the number of layers.  These
variables are defined in the code below:

# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random.normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random.normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random.normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random.normal([10]), name='b2')

The weight and bias variables are initialized using the
tf.random.normal function – this function creates tensors of
random numbers, drawn from a normal distribution. It allows the
developer to specify things like the standard deviation of the
distribution from which the random numbers are drawn.

Note the shape of the variables. The W1 variable is a [784, 300]
tensor – the 784 nodes are the size of the input layer. This size
comes from the flattening of the input images – if we have 28
rows and 28 columns of pixels, flattening these out gives us 1 row
or column of 28 x 28 = 784 values.  The 300 in the declaration of
W1 is the number of nodes in the hidden layer. The W2 variable is a
[300, 10] tensor, connecting the 300-node hidden layer to the
10-node output layer. In each case, a name is given to the variable
for later viewing in TensorBoard – the TensorFlow visualization
package. The next step in the code is to create the computations
that occur within the nodes of the network. If the reader recalls,
the computations within the nodes of a neural network are of the
following form:

$$z = Wx + b$$

$$h=f(z)$$

Where W is the weights matrix, x is the layer input vector, b is
the bias and f is the activation function of the node. These
calculations comprise the feed-forward pass of the input data
through the neural network. To execute these calculations, a
dedicated feed-forward function is created:

def nn_model(x_input, W1, b1, W2, b2):
    # flatten the input image from 28 x 28 to 784
    x_input = tf.reshape(x_input, (x_input.shape[0], -1))
    x = tf.add(tf.matmul(tf.cast(x_input, tf.float32), W1), b1)
    x = tf.nn.relu(x)
    logits = tf.add(tf.matmul(x, W2), b2)
    return logits

Examining the first line, the x_input data is reshaped from
(batch_size, 28, 28) to (batch_size, 784) – in other words, the
images are flattened out. On the next line, the input data is then
converted to tf.float32 type using the TensorFlow cast function.
This is important – the x­_input data comes in as tf.float64
type, and TensorFlow won’t perform a matrix multiplication
operation (tf.matmul) between tensors of different data types. This
re-typed input data is then matrix-multiplied by W1 using the
TensorFlow matmul function (which stands for matrix
multiplication). Then the bias b1 is added to this product. On the
line after this, the ReLU activation function is applied to the
output of this line of calculation. The ReLU function is usually
the best activation function to use in deep learning – the
reasons for this are discussed in
this post
.

The output of this calculation is then multiplied by the final
set of weights W2, with the bias b2 added. The output of this
calculation is titled logits. Note that no activation function has
been applied to this output layer of nodes (yet). In machine/deep
learning, the term “logits” refers to the un-activated output
of a layer of nodes.

The reason no activation function has been applied to this layer
is that there is a handy function in TensorFlow called
tf.nn.softmax_cross_entropy_with_logits. This function does two
things for the developer – it applies a softmax activation
function to the logits, which transforms them into a
quasi-probability (i.e. the sum of the output nodes is equal to 1).
This is a common activation function to apply to an output layer in
classification tasks. Next, it applies the cross-entropy loss
function to the softmax activation output. The cross-entropy loss
function is a commonly used loss in classification tasks. The
theory behind it is quite interesting, but it won’t be covered in
this book – a good summary can be found
here
. The code below applies this handy TensorFlow function,
and in this example,  it has been nested in another function called
loss_fn:

def loss_fn(logits, labels):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                                              logits=logits))
    return cross_entropy

The arguments to softmax_cross_entropy_with_logits are labels
and logits. The logits argument is supplied from the outcome of the
nn_model function. The usage of this function in the main training
loop will be demonstrated shortly. The labels argument is supplied
from the one-hot y values that are fed into loss_fn during the
training process. The output of the
softmax_cross_entropy_with_logits function will be the output of
the cross-entropy loss value for each sample in the batch. To train
the weights of the neural network, the average cross-entropy loss
across the samples needs to be minimized as part of the
optimization process. This is calculated by using the
tf.reduce_mean function, which, unsurprisingly, calculates the mean
of the tensor supplied to it.

The next step is to define an optimizer function. In many
examples within this book, the versatile Adam optimizer will be
used. The theory behind this optimizer is interesting, and is worth
further examination (such as shown here)
but won’t be covered in detail within this post. It is basically
a gradient descent method, but with sophisticated averaging of the
gradients to provide appropriate momentum to the learning. To
define the optimizer, which will be used in the main training loop,
the following code is run:

# setup the optimizer
optimizer = tf.keras.optimizers.Adam()

The Adam object can take a learning rate as input, but for the
present purposes, the default value is used.

3.1 Training the network

Now that the appropriate functions, variables and optimizers
have been created, it is time to define the overall training loop.
The training loop is shown below:

total_batch = int(len(y_train) / batch_size)
for epoch in range(epochs):
    avg_loss = 0
    for i in range(total_batch):
        batch_x, batch_y = get_batch(x_train, y_train, batch_size=batch_size)
        # create tensors
        batch_x = tf.Variable(batch_x)
        batch_y =..