Categories
Offsites

Using GANs to Create Fantastical Creatures

Creating art for digital video games takes a high degree of artistic creativity and technical knowledge, while also requiring game artists to quickly iterate on ideas and produce a high volume of assets, often in the face of tight deadlines. What if artists had a paintbrush that acted less like a tool and more like an assistant? A machine learning model acting as such a paintbrush could reduce the amount of time necessary to create high-quality art without sacrificing artistic choices, perhaps even enhancing creativity.

Today, we present Chimera Painter, a trained machine learning (ML) model that automatically creates a fully fleshed out rendering from a user-supplied creature outline. Employed as a demo application, Chimera Painter adds features and textures to a creature outline segmented with body part labels, such as “wings” or “claws”, when the user clicks the “transform” button. Below is an example using the demo with one of the preset creature outlines.

Using an image imported to Chimera Painter or generated with the tools provided, an artist can iteratively construct or modify a creature outline and use the ML model to generate realistic looking surface textures. In this example, an artist (Lee Dotson) customizes one of the creature designs that comes pre-loaded in the Chimera Painter demo.

In this post, we describe some of the challenges in creating the ML model behind Chimera Painter and demonstrate how one might use the tool for the creation of video game-ready assets.

Prototyping for a New Type of Model
In developing an ML model to produce video-game ready creature images, we created a digital card game prototype around the concept of combining creatures into new hybrids that can then battle each other. In this game, a player would begin with cards of real-world animals (e.g., an axolotl or a whale) and could make them more powerful by combining them (making the dreaded Axolotl-Whale chimera). This provided a creative environment for demonstrating an image-generating model, as the number of possible chimeras necessitated a method for quickly designing large volumes of artistic assets that could be combined naturally, while still retaining identifiable visual characteristics of the original creatures.

Since our goal was to create high-quality creature card images guided by artist input, we experimented with generative adversarial networks (GANs), informed by artist feedback, to create creature images that would be appropriate for our fantasy card game prototype. GANs pair two convolutional neural networks against each other: a generator network to create new images and a discriminator network to determine if these images are samples from the training dataset (in this case, artist-created images) or not. We used a variant called a conditional GAN, where the generator takes a separate input to guide the image generation process. Interestingly, our approach was a strict departure from other GAN efforts, which typically focus on photorealism.

To train the GANs, we created a dataset of full color images with single-species creature outlines adapted from 3D creature models. The creature outlines characterized the shape and size of each creature, and provided a segmentation map that identified individual body parts. After model training, the model was tasked with generating multi-species chimeras, based on outlines provided by artists. The best performing model was then incorporated into Chimera Painter. Below we show some sample assets generated using the model, including single-species creatures, as well as the more complex multi-species chimeras.

Generated card art integrated into the card game prototype showing basic creatures (bottom row) and chimeras from multiple creatures, including an Antlion-Porcupine, Axolotl-Whale, and a Crab-Antion-Moth (top row). More info about the game itself is detailed in this Stadia Research presentation.

Learning to Generate Creatures with Structure
An issue with using GANs for generating creatures was the potential for loss of anatomical and spatial coherence when rendering subtle or low-contrast parts of images, despite these being of high perceptual importance to humans. Examples of this can include eyes, fingers, or even distinguishing between overlapping body parts with similar textures (see the affectionately named BoggleDog below).

GAN-generated image showing mismatched body parts.

Generating chimeras required a new non-photographic fantasy-styled dataset with unique characteristics, such as dramatic perspective, composition, and lighting. Existing repositories of illustrations were not appropriate to use as datasets for training an ML model, because they may be subject to licensing restrictions, have conflicting styles, or simply lack the variety needed for this task.

To solve this, we developed a new artist-led, semi-automated approach for creating an ML training dataset from 3D creature models, which allowed us to work at scale and rapidly iterate as needed. In this process, artists would create or obtain a set of 3D creature models, one for each creature type needed (such as hyenas or lions). Artists then produced two sets of textures that were overlaid on the 3D model using the Unreal Engine — one with the full color texture (left image, below) and the other with flat colors for each body part (e.g., head, ears, neck, etc), called a “segmentation map” (right image, below). This second set of body part segments was given to the model at training to ensure that the GAN learned about body part-specific structure, shapes, textures, and proportions for a variety of creatures.

Example dataset training image and its paired segmentation map.

The 3D creature models were all placed in a simple 3D scene, again using the Unreal Engine. A set of automated scripts would then take this 3D scene and interpolate between different poses, viewpoints, and zoom levels for each of the 3D creature models, creating the full color images and segmentation maps that formed the training dataset for the GAN. Using this approach, we generated 10,000+ image + segmentation map pairs per 3D creature model, saving the artists millions of hours of time compared to creating such data manually (at approximately 20 minutes per image).

Fine Tuning
The GAN had many different hyper-parameters that could be adjusted, leading to different qualities in the output images. In order to better understand which versions of the model were better than others, artists were provided samples for different creature types generated by these models and asked to cull them down to a few best examples. We gathered feedback about desired characteristics present in these examples, such as a feeling of depth, style with regard to creature textures, and realism of faces and eyes. This information was used both to train new versions of the model and, after the model had generated hundreds of thousands of creature images, to select the best image from each creature category (e.g., gazelle, lynx, gorilla, etc).

We tuned the GAN for this task by focusing on the perceptual loss. This loss function component (also used in Stadia’s Style Transfer ML) computes a difference between two images using extracted features from a separate convolutional neural network (CNN) that was previously trained on millions of photographs from the ImageNet dataset. The features are extracted from different layers of the CNN and a weight is applied to each, which affects their contribution to the final loss value. We discovered that these weights were critically important in determining what a final generated image would look like. Below are some examples from the GAN trained with different perceptual loss weights.

Dino-Bat Chimeras generated using varying perceptual loss weights.

Some of the variation in the images above is due to the fact that the dataset includes multiple textures for each creature (for example, a reddish or grayish version of the bat). However, ignoring the coloration, many differences are directly tied to changes in perceptual loss values. In particular, we found that certain values brought out sharper facial features (e.g., bottom right vs. top right) or “smooth” versus “patterned” (top right vs. bottom left) that made generated creatures feel more real.

Here are some creatures generated from the GAN trained with different perceptual loss weights, showing off a small sample of the outputs and poses that the model can handle.

Creatures generated using different models.
A generated chimera (Dino-Bat-Hyena, to be exact) created using the conditional GAN. Output from the GAN (left) and the post-processed / composited card (right).

Chimera Painter
The trained GAN is now available in the Chimera Painter demo, allowing artists to work iteratively with the model, rather than drawing dozens of similar creatures from scratch. An artist can select a starting point and then adjust the shape, type, or placement of creature parts, enabling rapid exploration and for the creation of a large volume of images. The demo also allows for uploading a creature outline created in an external program, like Photoshop. Simply download one of the preset creature outlines to get the colors needed for each creature part and use this as a template for drawing one outside of Chimera Painter, and then use the “Load’ button on the demo to use this outline to flesh out your creation.

It is our hope that these GAN models and the Chimera Painter demonstration tool might inspire others to think differently about their art pipeline. What can one create when using machine learning as a paintbrush?

Acknowledgments
This project is conducted in collaboration with many people. Thanks to Ryan Poplin, Lee Dotson, Trung Le, Monica Dinculescu, Marc Destefano, Aaron Cammarata, Maggie Oh, Richard Wu, Ji Hun Kim, Erin Hoffman-John, and Colin Boswell. Thanks to everyone who pitched in to give hours of art direction, technical feedback, and drawings of fantastic creatures.

Categories
Offsites

Mitigating Unfair Bias in ML Models with the MinDiff Framework

The responsible research and development of machine learning (ML) can play a pivotal role in helping to solve a wide variety of societal challenges. At Google, our research reflects our AI Principles, from helping to protect patients from medication errors and improving flood forecasting models, to presenting methods that tackle unfair bias in products, such as Google Translate, and providing resources for other researchers to do the same.

One broad category for applying ML responsibly is the task of classification — systems that sort data into labeled categories. At Google, such models are used throughout our products to enforce policies, ranging from the detection of hate speech to age-appropriate content filtering. While these classifiers serve vital functions, it is also essential that they are built in ways that minimize unfair biases for users.

Today, we are announcing the release of MinDiff, a new regularization technique available in the TF Model Remediation library for effectively and efficiently mitigating unfair biases when training ML models. In this post, we discuss the research behind this technique and explain how it addresses the practical constraints and requirements we’ve observed when incorporating it in Google’s products.

Unfair Biases in Classifiers
To illustrate how MinDiff can be used, consider an example of a product policy classifier that is tasked with identifying and removing text comments that could be considered toxic. One challenge is to make sure that the classifier is not unfairly biased against submissions from a particular group of users, which could result in incorrect removal of content from these groups.

The academic community has laid a solid theoretical foundation for ML fairness, offering a breadth of perspectives on what unfair bias means and on the tensions between different frameworks for evaluating fairness. One of the most common metrics is equality of opportunity, which, in our example, means measuring and seeking to minimize the difference in false positive rate (FPR) across groups. In the example above, this means that the classifier should not be more likely to incorrectly remove safe comments from one group than another. Similarly, the classifier’s false negative rate should be equal between groups. That is, the classifier should not miss toxic comments against one group more than it does for another.

When the end goal is to improve products, it’s important to be able to scale unfair bias mitigation to many models. However, this poses a number of challenges:

  • Sparse demographic data: The original work on equality of opportunity proposed a post-processing approach to the problem, which consisted of assigning each user group a different classifier threshold at serving time to offset biases of the model. However, in practice this is often not possible for many reasons, such as privacy policies. For example, demographics are often collected by users self-identifying and opting in, but while some users will choose to do this, others may choose to opt-out or delete data. Even for in-process solutions (i.e., methods that change how a model is trained) one needs to assume that most data will not have associated demographics, and thus needs to make efficient use of the few examples for which demographics are known.
  • Ease of Use: In order for any technique to be adopted broadly, it should be easy to incorporate into existing model architectures, and not be highly sensitive to hyperparameters. While an early approach to incorporating ML fairness principles into applications utilized adversarial learning, we found that it too frequently caused models to degenerate during training, which made it difficult for product teams to iterate and made new product teams wary.
  • Quality: The method for removing unfair biases should also reduce the overall classification performance (e.g., accuracy) as little as possible. Because any decrease in accuracy caused by the mitigation approach could result in the moderation model allowing more toxic comments, striking the right balance is crucial.

MinDiff Framework
We iteratively developed the MinDiff framework over the previous few years to meet these design requirements. Because demographic information is so rarely known, we utilize in-process approaches in which the model’s training objective is augmented with an objective specifically focused on removing biases. This new objective is then optimized over the small sample of data with known demographic information. To improve ease of use, we switched from adversarial training to a regularization framework, which penalizes statistical dependency between its predictions and demographic information for non-harmful examples. This encourages the model to equalize error rates across groups, e.g., classifying non-harmful examples as toxic.

There are several ways to encode this dependency between predictions and demographic information. Our initial MinDiff implementation minimized the correlation between the predictions and the demographic group, which essentially optimized for the average and variance of predictions to be equal across groups, even if the distributions still differ afterward. We have since improved MinDiff further by considering the maximum mean discrepancy (MMD) loss, which is closer to optimizing for the distribution of predictions to be independent of demographics. We have found that this approach is better able to both remove biases and maintain model accuracy.

MinDiff with MMD better closes the FPR gap with less decrease in accuracy
(on an academic benchmark dataset).

To date we have launched modeling improvements across several classifiers at Google that moderate content quality. We went through multiple iterations to develop a robust, responsible, and scalable approach, solving research challenges and enabling broad adoption.

Gaps in error rates of classifiers is an important set of unfair biases to address, but not the only one that arises in ML applications. For ML researchers and practitioners, we hope this work can further advance research toward addressing even broader classes of unfair biases and the development of approaches that can be used in practical applications. In addition, we hope that the release of the MinDiff library and the associated demos and documentation, along with the tools and experience shared here, can help practitioners improve their models and products.

Acknowledgements
This research effort on ML Fairness in classification was jointly led with Jilin Chen, Shuo Chen, Ed H. Chi, Tulsee Doshi, and Hai Qian. Further, this work was pursued in collaboration with Jonathan Bischof, Qiuwen Chen, Pierre Kreitmann, and Christine Luu. The MinDiff infrastructure was also developed in collaboration with Nick Blumm, James Chen, Thomas Greenspan, Christina Greer, Lichan Hong, Manasi Joshi, Maciej Kula, Summer Misherghi, Dan Nanas, Sean O’Keefe, Mahesh Sathiamoorthy, Catherina Xu, and Zhe Zhao. (All names are listed in alphabetical order of last names.)

Categories
Offsites

sparklyr 1.5: better dplyr interface, more sdf_* functions, and RDS-based serialization routines

We are thrilled to announce sparklyr 1.5 is now available on

CRAN
!

To install sparklyr 1.5 from CRAN, run

install.packages("sparklyr")

In this blog post, we will highlight the following aspects of
sparklyr 1.5:

Better dplyr interface

A large fraction of pull requests that went into the sparklyr
1.5 release were focused on making Spark dataframes work with
various dplyr verbs in the same way that R dataframes do. The full
list of dplyr-related bugs and feature requests that were resolved
in sparklyr 1.5 can be found in
here
.

In this section, we will showcase three new dplyr
functionalities that were shipped with sparklyr 1.5.

Stratified sampling

Stratified sampling on an R dataframe can be accomplished with a
combination of dplyr::group_by() followed by dplyr::sample_n() or
dplyr::sample_frac(), where the grouping variables specified in the
dplyr::group_by() step are the ones that define each stratum. For
instance, the following query will group mtcars by number of
cylinders and return a weighted random sample of size two from each
group, without replacement, and weighted by the mpg column:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
## # A tibble: 6 x 11 ## # Groups: cyl [3] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 5 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 6 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2

Starting from sparklyr 1.5, the same can also be done for Spark
dataframes with Spark 3.0 or above, e.g.,:

library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE, repartition = 3) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
# Source: spark<?> [?? x 11] # Groups: cyl mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2

or

mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_frac(size = 0.2, weight = mpg, replace = FALSE) %>% print()
## # Source: spark<?> [?? x 11] ## # Groups: cyl ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 ## 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 5 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 ## 6 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 7 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3

Row sums

The rowSums() functionality offered by dplyr is handy when one
needs to sum up a large number of columns within an R dataframe
that are impractical to be enumerated individually. For example,
here we have a six-column dataframe of random real numbers, where
the partial_sum column in the result contains the sum of columns b
through d within each row:

ncols <- 6 nums <- seq(ncols) %>% lapply(function(x) runif(5)) names(nums) <- letters[1:ncols] tbl <- tibble::as_tibble(nums) tbl %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # A tibble: 5 x 7 ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

Beginning with sparklyr 1.5, the same operation can be performed
with Spark dataframes:

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to(sc, tbl, overwrite = TRUE) sdf %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # Source: spark<?> [?? x 7] ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

As a bonus from implementing the rowSums feature for Spark
dataframes, sparklyr 1.5 now also offers limited support for the
column-subsetting operator on Spark dataframes. For example, all
code snippets below will return some subset of columns from the
dataframe named sdf:

# select columns `b` through `e` sdf[2:5]
# select columns `b` and `c` sdf[c("b", "c")]
# drop the first and third columns and return the rest sdf[c(-1, -3)]

Weighted-mean summarizer

Similar to the two dplyr functions mentioned above, the
weighted.mean() summarizer is another useful function that has
become part of the dplyr interface for Spark dataframes in sparklyr
1.5. One can see it in action by, for example, comparing the output
from the following

library(sparklyr) sc <- spark_connect(master = "local") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

with output from the equivalent operation on mtcars in R:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

both of them should evaluate to the following:

## cyl mpg_wm ## <dbl> <dbl> ## 1 4 25.9 ## 2 6 19.6 ## 3 8 14.8

New additions to the sdf_* family of functions

sparklyr provides a large number of convenience functions for
working with Spark dataframes, and all of them have names starting
with the sdf_ prefix.

In this section we will briefly mention four new additions and
show some example scenarios in which those functions are
useful.

sdf_expand_grid()

As the name suggests, sdf_expand_grid() is simply the Spark
equivalent of expand.grid(). Rather than running expand.grid() in R
and importing the resulting R dataframe to Spark, one can now run
sdf_expand_grid(), which accepts both R vectors and Spark
dataframes and supports hints for broadcast hash joins. The example
below shows sdf_expand_grid() creating a 100-by-100-by-10-by-10
grid in Spark over 1000 Spark partitions, with broadcast hash join
hints on variables with small cardinalities:

library(sparklyr) sc <- spark_connect(master = "local") grid_sdf <- sdf_expand_grid( sc, var1 = seq(100), var2 = seq(100), var3 = seq(10), var4 = seq(10), broadcast_vars = c(var3, var4), repartition = 1000 ) grid_sdf %>% sdf_nrow() %>% print()
## [1] 1e+06

sdf_partition_sizes()

As sparklyr user @sbottelli suggested here, one
thing that would be great to have in sparklyr is an efficient way
to query partition sizes of a Spark dataframe. In sparklyr 1.5,
sdf_partition_sizes() does exactly that:

library(sparklyr) sc <- spark_connect(master = "local") sdf_len(sc, 1000, repartition = 5) %>% sdf_partition_sizes() %>% print(row.names = FALSE)
## partition_index partition_size ## 0 200 ## 1 200 ## 2 200 ## 3 200 ## 4 200

sdf_unnest_longer() and sdf_unnest_wider()

sdf_unnest_longer() and sdf_unnest_wider() are the equivalents
of tidyr::unnest_longer() and tidyr::unnest_wider() for Spark
dataframes. sdf_unnest_longer() expands all elements in a struct
column into multiple rows, and sdf_unnest_wider() expands them into
multiple columns. As illustrated with an example dataframe
below,

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to( sc, tibble::tibble( id = seq(3), attribute = list( list(name = "Alice", grade = "A"), list(name = "Bob", grade = "B"), list(name = "Carol", grade = "C") ) ) )
sdf %>% sdf_unnest_longer(col = record, indices_to = "key", values_to = "value") %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id value key ## <int> <chr> <chr> ## 1 1 A grade ## 2 1 Alice name ## 3 2 B grade ## 4 2 Bob name ## 5 3 C grade ## 6 3 Carol name

whereas

sdf %>% sdf_unnest_wider(col = record) %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id grade name ## <int> <chr> <chr> ## 1 1 A Alice ## 2 2 B Bob ## 3 3 C Carol

RDS-based serialization routines

Some readers must be wondering why a brand new serialization
format would need to be implemented in sparklyr at all. Long story
short, the reason is that RDS serialization is a strictly better
replacement for its CSV predecessor. It possesses all desirable
attributes the CSV format has, while avoiding a number of
disadvantages that are common among text-based data formats.

In this section, we will briefly outline why sparklyr should
support at least one serialization format other than arrow,
deep-dive into issues with CSV-based serialization, and then show
how the new RDS-based serialization is free from those issues.

Why arrow is not for everyone?

To transfer data between Spark and R correctly and efficiently,
sparklyr must rely on some data serialization format that is
well-supported by both Spark and R. Unfortunately, not many
serialization formats satisfy this requirement, and among the ones
that do are text-based formats such as CSV and JSON, and binary
formats such as Apache Arrow, Protobuf, and as of recent, a small
subset of RDS version 2. Further complicating the matter is the
additional consideration that sparklyr should support at least one
serialization format whose implementation can be fully
self-contained within the sparklyr code base, i.e., such
serialization should not depend on any external R package or system
library, so that it can accommodate users who want to use sparklyr
but who do not necessarily have the required C++ compiler tool
chain and other system dependencies for setting up R packages such
as arrow
or protolite.
Prior to sparklyr 1.5, CSV-based serialization was the default
alternative to fallback to when users do not have the arrow package
installed or when the type of data being transported from R to
Spark is unsupported by the version of arrow available.

Why is the CSV format not ideal?

There are at least three reasons to believe CSV format is not
the best choice when it comes to exporting data from R to
Spark.

One reason is efficiency. For example, a double-precision
floating point number such as .Machine$double.eps needs to be
expressed as “2.22044604925031e-16” in CSV format in order to not
incur any loss of precision, thus taking up 20 bytes rather than 8
bytes.

But more important than efficiency are correctness concerns. In
a R dataframe, one can store both NA_real_ and NaN in a column of
floating point numbers. NA_real_ should ideally translate to null
within a Spark dataframe, whereas NaN should continue to be NaN
when transported from R to Spark. Unfortunately, NA_real_ in R
becomes indistinguishable from NaN once serialized in CSV format,
as evident from a quick demo shown below:

original_df <- data.frame(x = c(NA_real_, NaN)) original_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NaN TRUE
csv_file <- "/tmp/data.csv" write.csv(original_df, file = csv_file, row.names = FALSE) deserialized_df <- read.csv(csv_file) deserialized_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NA FALSE

Another correctness issue very much similar to the one above was
the fact that “NA” and NA within a string column of an R dataframe
become indistinguishable once serialized in CSV format, as
correctly pointed out in this Github
issue
by @caewok and
others.

RDS to the rescue!

RDS format is one of the most widely used binary formats for
serializing R objects. It is described in some detail in chapter 1,
section 8 of this
document
. Among advantages of the RDS format are efficiency and
accuracy: it has a reasonably efficient implementation in base R,
and supports all R data types.

Also worth noticing is the fact that when an R dataframe
containing only data types with sensible equivalents in Apache
Spark (e.g., RAWSXP, LGLSXP, CHARSXP, REALSXP, etc) is saved using
RDS version 2, (e.g., serialize(mtcars, connection = NULL, version
= 2L, xdr = TRUE)), only a tiny subset of the RDS format will be
involved in the serialization process, and implementing
deserialization routines in Scala capable of decoding such a
restricted subset of RDS constructs is in fact a reasonably simple
and straightforward task (as shown in
here
).

Last but not least, because RDS is a binary format, it allows
NA_character_, “NA”, NA_real_, and NaN to all be encoded in an
unambiguous manner, hence allowing sparklyr 1.5 to avoid all
correctness issues detailed above in non-arrow serialization use
cases.

Other benefits of RDS serialization

In addition to correctness guarantees, RDS format also offers
quite a few other advantages.

One advantage is of course performance: for example, importing a
non-trivially-sized dataset such as nycflights13::flights from R to
Spark using the RDS format in sparklyr 1.5 is roughly 40%-50%
faster compared to CSV-based serialization in sparklyr 1.4. The
current RDS-based implementation is still nowhere as fast as
arrow-based serialization though (arrow is about 3-4x faster), so
for performance-sensitive tasks involving heavy serialization,
arrow should still be the top choice.

Another advantage is that with RDS serialization, sparklyr can
import R dataframes containing raw columns directly into binary
columns in Spark. Thus, use cases such as the one below will work
in sparklyr 1.5

library(sparklyr) sc <- spark_connect(master = "local") tbl <- tibble::tibble( x = list(serialize("sparklyr", NULL), serialize(c(123456, 789), NULL)) ) sdf <- copy_to(sc, tbl)

While most sparklyr users probably won’t find this capability
of importing binary columns to Spark immediately useful in their
typical sparklyr::copy_to() or sparklyr::collect() usages, it does
play a crucial role in reducing serialization overheads in the
Spark-based foreach
parallel backend that was first introduced in sparklyr 1.2. This is
because Spark workers can directly fetch the serialized R closures
to be computed from a binary Spark column instead of extracting
those serialized bytes from intermediate representations such as
base64-encoded strings. Similarly, the R results from executing
worker closures will be directly available in RDS format which can
be efficiently deserialized in R, rather than being delivered in
other less efficient formats.

Acknowledgement

In chronological order, we would like to thank the following
contributors for making their pull requests part of sparklyr
1.5:

We would also like to express our gratitude towards numerous bug
reports and feature requests for sparklyr from a fantastic
open-source community.

Finally, the author of this blog post is indebted to @javierluraschi, @batpigandme, and @skeydan for their valuable
editorial inputs.

If you wish to learn more about sparklyr, check out sparklyr.ai, spark.rstudio.com, and some of the
previous release posts such as
sparklyr 1.4
and sparklyr
1.3
.

Thanks for reading!

Categories
Offsites

Dueling Q networks in TensorFlow 2

In this post, we’ll be covering Dueling Q networks for reinforcement learning in TensorFlow 2. This reinforcement learning architecture is an improvement on the Double Q architecture, which has been covered here. In this tutorial, I’ll introduce the Dueling Q network architecture, it’s advantages and how to build one in TensorFlow 2. We’ll be running the code on the Open AI gym‘s CartPole environment so that readers can train the network quickly and easily. In future posts, I’ll be showing results on Atari environments which are more complicated. For an introduction to reinforcement learning, check out this post and this post. All the code for this tutorial can be found on this site’s Github repo.


Eager to build deep learning systems in TensorFlow 2? Get the book here


A recap of Double Q learning

As discussed in detail in this post, vanilla deep Q learning has some problems. These problems can be boiled down to two main issues:

  1. The bias problem: vanilla deep Q networks tend to overestimate rewards in noisy environments, leading to non-optimal training outcomes
  2. The moving target problem: because the same network is responsible for both the choosing of actions and the evaluation of actions, this leads to training instability

With regards to (1) – say we have a state with two possible actions, each giving noisy rewards. Action a returns a random reward based on a normal distribution with a mean of 2 and a standard deviation of 1 – N(2, 1). Action b returns a random reward from a normal distribution of N(1, 4). On average, action a is the optimal action to take in this state – however, because of the argmax function in deep Q learning, action will tend to be favoured because of the higher standard deviation / higher random rewards.

For (2) – let’s consider another state, state 1, with three possible actions a, b, and c. Let’s say we know that b is the optimal action. However, when we first initialize the neural network, in state 1, action tends to be chosen. When we’re training our network, the loss function will drive the weights of the network towards choosing action b. However, next time we are in state 1, the parameters of the network have changed to such a degree that now action is chosen. Ideally, we would have liked the network to consistently chose action in state 1 until it was gradually trained to chose action b. But now the goal posts have shifted, and we are trying to move the network from to instead of to b – this gives rise to instability in training. This is the problem that arises when you have the same network both choosing actions and evaluating the worth of actions.

To overcome this problem , Double Q learning proposed the following way of determining the target Q value: $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$ Here $theta_t$ refers to the primary network parameters (weights) at time t, and $theta^-_t$ refers to something called the target network parameters at time t. This target network is a kind of delayed copy of the primary network. As can be observed, the optimal action in state t + 1 is chosen from the primary network ($theta_t$) but the evaluation or estimate of the Q value of this action is determined from the target network ($theta^-_t$).

This can be shown more clearly by the equations below: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ By doing this two things occur. First, different networks are used to chose the actions and evaluate the actions. This breaks the moving target problem mentioned earlier. Second, the primary network and the target network have essentially been trained on different samples from the memory bank of states and actions (the target network is “trained” on older samples than the primary network). Because of this, any bias due to environmental randomness should be “smoothed out”. As was shown in my previous post on Double Q learning, there is a significant improvement in using Double Q learning instead of vanilla deep Q learning. However, a further improvement can be made on the Double Q idea – the Dueling Q architecture, which will be covered in the next section.

Dueling Q introduction

The Dueling Q architecture trades on the idea that the evaluation of the Q function implicitely calculates two quantities:

  • V(s) – the value of being in state s
  • A(s, a) – the advantage of taking action in state s

These values, along with the Q function, Q(s, a), are very important to understand, so we will do a deep dive of these concepts here. Let’s first examine the generalised formula for the value function V(s): $$V^{pi}(s) = mathbb{E} left[ sum_{i=1}^T gamma^{i – 1}r_{i}right]$$ The formula above means that the value function at state s, operating under a policy $pi$, is the summation of future discounted rewards starting from state s. In other words, if an agent starts at s, it is the sum of all the rewards the agent collects operating under a given policy $pi$. The $mathbb{E}$ is the expectation operator.

Let’s consider a basic example. Let’s assume an agent is playing a game with a set number of turns. In the second-to-last turn, the agent is in state s. From this state, it has 3 possible actions, with a reward of 10, 50 and 100 respectively. Let’s say that the policy for this agent is a simple random selection. Because this is the last set of actions and rewards in the game, due to the game finishing next turn, there are no discounted future rewards. The value for this state and the random action policy is: $$V^{pi}(s) = mathbb{E} left[randomleft(10, 50, 100)right)right] = 53.333$$ Now, clearly this policy is not going to produce optimum outcomes. However, we know that for the optimum policy, the value of this state would be: $$V^*(s) = max (10, 50, 100) = 100$$ If you recall, from Q learning theory, the optimal action in this state is: $$a* = argmax Q(s_{t+1}, a)$$ and the optimal Q value from this action in this state would be: $$Q(s, a^*) = max (10, 50, 100) = 100$$ Therefore, under the optimal (deterministic) policy we have: $$Q(s,a^*) = V(s)$$ However, what if we aren’t operating under the optimal policy (yet)? Let’s return to the case where our policy is  simple random action selection. In such a case, the Q function at state s could be described as (remember there are no future discounted rewards, and V(s) = 53.333): $$Q(s, a) =  V(s) + (-43.33, -3.33,  46.67) = (10, 50, 100)$$ The term (-43.33, -3.33, 46.67) under such an analysis is called the Advantage function A(s, a). The Advantage function expresses the relative benefits of the various actions possible in state s.  The Q function can therefore be expressed as: $$Q(s, a) = V(s) + A(s, a)$$ Under the optimum policy we have $A(s, a^*) = 0$, $V(s) = 100$ and therefore: $$Q(s, a) = V(s) + A(s, a) = 100 + (-90, -50, 0) = (10, 50, 100)$$ Now the question becomes, why do we want to decompose the Q function in this way? Because there is a difference between the value of a particular state s and the actions proceeding from that state. Consider a game where, from a given state s*, all actions lead to the agent dying and ending the game. This is an inherently low value state to be in, and who cares about the actions which one can take in such a state? It is pointless for the learning algorithm to waste training resources trying to find the best actions to take. In such a state, the Q values should be based solely on the value function V, and this state should be avoided. The converse case also holds – some states are just inherently valuable to be in, regardless of the effects of subsequent actions.

Consider these images taken from the original Dueling Q paper – showing the value and advantage components of the Q value in the Atari game Enduro:

Dueling Q - Atari Enduro - value and advantage highlights

Atari Enduro – value and advantage highlights

In the Atari Enduro game, the goal of the agent is to pass as many cars as possible. “Running into” a car slows the agent’s car down and therefore reduces the number of cars which will be overtaken. In the images above, it can be observed that the value stream considers the road ahead and the score. However, the advantage stream, does not “pay attention” to anything much when there are no cars visible. It only begins to register when there are cars close by and an action is required to avoid them. This is a good outcome, as when no cars are in view, the network should not be trying to determine which actions to take as this is a waste of training resources. This is the benefit of splitting value and advantage functions.

Now, you could argue that, because the Q function inherently contains both the value and advantage functions anyway, the neural network should learn to separate out these components regardless. Indeed, it may do. However, this comes at a cost. If the ML engineer already knows that it is important to try and separate these values, why not build them into the architecture of the network and save the learning algorithm the hassle? That is essentially what the Dueling Q network architecture does. Consider the image below showing the original architecture:

Dueling Q architecture

Dueling Q architecture

First, notice that the first part of architecture is common, with CNN input filters and a common Flatten layer (for more on convolutional neural networks, see this tutorial). After the flatten layer, the network bifurcates – with separate densely connected layers. The first densely connected layer produces a single output corresponding to V(s). The second densely connected layer produces n outputs, where is the number of available actions – and each of these outputs is the expression of the advantage function. These value and advantage functions are then aggregated in the Aggregation layer to produce Q values estimations for each possible action in state s. These Q values can then be trained to approach the target Q values, generated via the Double Q mechanism i.e.: $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$ The idea is that, through these separate value and advantage streams, the network will learn to produce accurate estimates of the values and advantages, improving learning performance. What goes on in the aggregation layer? One might think we could just add the V(s) and A(s, a) values together like so: $$Q(s, a) = V(s) + A(s, a)$$ However, there is an issue here and it’s called the problem of identifiabilty. This problem in the current context can be stated as follows: given Q, there is no way to uniquely identify V or A. What does this mean? Say that the network is trying to learn some optimal Q value for action a. Given this Q value, can we uniquely learn a V(s) and A(s, a) value? Under the formulation above, the answer is no.

Let’s say the “true” value of being in state s is 50 i.e. V(s) = 50. Let’s also say the “true” advantage in state s for action is 10. This will give a Q value, Q(s, a) of 60 for this state and action. However, we can also arrive at the same Q value for a learned V(s) of, say, 0, and an advantage function A(s, a) = 60. Or alternatively, a learned V(s) of -1000 and an advantage A(s, a) of 1060. In other words, there is no way to guarantee the “true” values of V(s) and A(s, a) are being learned separately and uniquely from each other. The commonly used solution to this problem is to instead perform the following aggregation function: $$Q(s,a) = V(s) + A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ Here the advantage function value is normalized with respect to the mean of the advantage function values over all actions in state s.

In TensorFlow 2.0, we can create a common “head” network, consisting of introductory layers which act to process the images or other environmental / state inputs. Then, two separate streams are created using densely connected layers which learn the value and advantage estimates, respectively. These are then combined in a special aggregation layer which calculates the equation above to finally arrive at Q values. Once the network architecture is specified in accordance with the above description, the training proceeds in the same fashion as Double Q learning. The agent actions can be selected either directly from the output of the advantage function, or from the output Q values. Because the Q values differ from the advantage values only by the addition of the V(s) value (which is independent of the actions), the argmax-based selection of the best action will be the same regardless of whether it is extracted from the advantage or the Q values of the network.

In the next section, the implementation of a Dueling Q network in TensorFlow 2.0 will be demonstrated.

Dueling Q network in TensorFlow 2

In this section we will be building a Dueling Q network in TensorFlow 2. However, the code will be written so that both Double Q and Dueling Q networks will be able to be constructed with the simple change of a boolean identifier. The environment that the agent will train in is Open AI Gym’s CartPole environment. In this environment, the agent must learn to move the cart platform back and forth in order to stop a pole falling too far below the vertical axis. While Dueling Q was originally designed for processing images, with its multiple CNN layers at the beginning of the model, in this example we will be replacing the CNN layers with simple dense connected layers. Because training reinforcement learning agents using images only (i.e. Atari RL environments) takes a long time, in this introductory post, only a simple environment is used for training the model. Future posts will detail how to efficiently train in Atari RL environments. All the code for this tutorial can be found on this site’s Github repo.

First of all, we declare some constants that will be used in the model, and initiate the CartPole environment:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
EPSILON_MIN_ITER = 5000
DELAY_TRAINING = 300
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

The MAX_EPSILON and MIN_EPSILON variables define the maximum and minimum values of the epsilon-greedy variable which will determine how often random actions are chosen. Over the course of the training, the epsilon-greedy parameter will decay from MAX_EPSILON gradually to MIN_EPSILON. The EPSILON_MIN_ITER value specifies how many training steps it will take before the MIN_EPSILON value is obtained. The DELAY_TRAINING constant specifies how many iterations should occur, with the memory buffer being filled, before training of the network is undertaken. The GAMMA value is the future reward discount value used in the Q-target equation, and TAU is the merging rate of the weight values between the primary network and the target network as per the Double Q learning algorithm. Finally, RANDOM_REWARD_STD is the standard deviation of the rewards that introduces some stochastic behaviour into the otherwise deterministic CartPole environment.

After the definition of all these constants, the CartPole environment is created and the state size and number of actions are defined.

Model definition

The next step in the code is to create a Keras model inherited class which defines the Double or Dueling Q network:

class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.dense1 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.dense1(input)
        x = self.dense2(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

Let’s go through the above line by line. First, a number of parameters are passed to this model as part of its initialization – these include the size of the hidden layers of the advantage and value streams, the number of actions in the environment and finally a Boolean variable, dueling, to specify whether the network should be a standard Double Q network or a Dueling Q network. The first two model layers defined are simple Keras densely connected layers, dense1 and dense2. These layers have ReLU activations and use the He normal weight initialization. The next two layers defined adv_dense and adv_out pertain to the advantage stream of the network, provided we are discussing a Dueling Q network architecture. If in fact the network is to be a Double Q network (i.e. dueling == False), then these names are a bit misleading and will simply be a third densely connected layer followed by the output Q layer (adv_out). However, keeping with the Dueling Q terminology, the first dense layer associated with the advantage stream is simply another standard dense layer of size = hidden_size. The final layer in this stream, adv_out is a dense layer with only num_actions outputs – each of these outputs will learn to estimate the advantage of all the actions in the given state (A(s, a)).

If the network is specified to be a Dueling Q network (i.e. dueling == True), then the value stream is also created. Again, a standard densely connected layer of size = hidden_size is created (v_dense). Then a final, single node dense layer is created to output the single value estimation (V(s)) for the given state. These layers specify the advantage and value streams respectively. Now the aggregation layer is to be created. This aggregation layer is created by using two Keras layers – a Lambda layer and an Add layer. The Lambda layer allows the developer to specify some user-defined operation to perform on the inputs to the layer. In this case, we want the layer to calculate the following: $$A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ This is calculated easily by using the lambda x: x – tf.reduce_mean(x) expression in the Lambda layer. Finally, we need a simple Keras addition layer to add this mean-normalized advantage function to the value estimation.

This completes the explanation of the layer definitions in the model. The call method in this model definition then applies these various layers to the state inputs of the model. The following two lines execute the Dueling Q aggregation function:

norm_adv = self.lambda_layer(adv) 
combined = self.combine([v, norm_adv])

Note that first the mean-normalizing lambda function is applied to the output from the advantage stream. This normalized advantage is then added to the value stream output to produce the final Q values (combined). Now that the model class has been defined, it is time to instantiate two models – one for the primary network and the other for the target network:

primary_network = DQModel(30, num_actions, True)
target_network = DQModel(30, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

After the primary_network and target_network have been created, only the primary_network is compiled as only the primary network is actually trained using the optimization function. As per Double Q learning, the target network is instead moved slowly “towards” the primary network by the gradual merging of weight values. Initially however, the target network trainable weights are set to be equal to the primary network trainable variables, using the TensorFlow assign function.

Other functions

The next function to discuss is the target network updating which is performed during training. In Double Q network training, there are two options for transitioning the target network weights towards the primary network weights. The first is to perform a wholesale copy of the weights every N training steps. Alternatively, the weights can be moved towards the primary network gradually every training iteration as follows:

def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

As can be observed, the new target weight variables are a weighted average between the current weight values and the primary network weights – with the weighting factor equal to TAU. The next code snippet is the definition of the memory class:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(500000)

This class takes tuples of (state, action, reward, next state) values and appends them to a memory list, which is randomly sampled from when required during training. The next function defines the epsilon-greedy action selection policy:

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

If a random number sampled between the interval 0 and 1 falls below the current epsilon value, a random action is selected. Otherwise, the current state is passed to the primary model – from which the Q values for each action are returned. The action with the highest Q value, selected by the numpy argmax function, is returned.

The next function is the train function, where the training of the primary network takes place:

def train(primary_network, memory, target_network):
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    # extract the best action from the next state
    prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
    # get all the q values for the next state
    q_from_target = target_network(next_states)
    # add the discounted estimated reward from the selected action (prim_action_tp1)
    updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    # update the q target to train towards
    target_q[batch_idxs, actions] = updates
    # run a training batch
    loss = primary_network.train_on_batch(states, target_q)
    return loss

For a more detailed explanation of this function, see my Double Q tutorial. However, the basic operations that are performed are expressed in the following formulas: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ The best action from the next state, a*,  is selected from the primary network (weights = $theta_t$). However, the Q value for this action in the next state ($s_{t+1}$) is extracted from the target network (weights = $theta^-_t$). A Keras train_on_batch operation is performed by passing a batch of states and subsequent target Q values, and the loss is finally returned from this function.

The main Dueling Q training loop

The main training loop which trains our Dueling Q network is shown below:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DuelingQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
steps = 0
for i in range(num_episodes):
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    state = env.reset()
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, _, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        tot_reward += reward
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))

        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss

        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * 
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else 
                MIN_EPSILON
        steps += 1

        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', cnt, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            break

        state = next_state
        cnt += 1

Again, this training loop has been explained in detail in the Double Q tutorial. However, some salient points are worth highlighting. First, Double and Dueling Q networks are superior to vanilla Deep Q networks especially in the cases where there is some stochastic component to the environment. As the CartPole environment is deterministic, some stochasticity is added in the reward. Normally, every time step in the episode results in a reward of 1 i.e. the CartPole has survived another time step – good job. However, in this case I’ve added a reward which is sampled from a normal distribution with a mean of 1.0 but a standard deviation of RANDOM_REWARD_STD. This adds the requisite uncertainty which makes Double and Dueling Q networks clearly superior to Deep Q networks – see my Double Q tutorial for a demonstration of this.

Another point to highlight is that training of the primary (and by extension target) network until DELAY_TRAINING steps have been exceeded. Also, the epsilon value for the epsilon-greedy action selection policy doesn’t decay until these DELAY_TRAINING steps have been exceeded.

Dueling Q vs Double Q results

A comparison of the training progress with respect to the deterministic reward of the agent in the CartPole environment under Double Q and Dueling Q architectures can be observed in the figure below, with the x-axis being the number of episodes:

Dueling Q (light blue) vs Double Q (dark blue) rewards in the CartPole environment

Dueling Q (light blue) vs Double Q (dark blue) rewards in the CartPole environment

As can be observed, there is a slightly higher performance of the Double Q network with respect to the Dueling Q network. However, the performance difference is fairly marginal, and may be within the variation arising from the random weight initialization of the networks. There is also the issue of the Dueling Q network being slightly more complicated due to the additional value stream. As such, on a fairly simple environment like the CartPole environment, the benefits of Dueling Q over Double Q may not be realized. However, in more complex environments like Atari environments, it is likely that the Dueling Q architecture will be superior to Double Q (this is what the original Dueling Q paper has shown). Future posts will demonstrate the Dueling Q architecture in Atari environments.


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Dueling Q networks in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Introduction to ResNet in TensorFlow 2

In previous tutorials, I’ve explained convolutional neural networks (CNN) and shown how to code them. The convolutional layer has proven to be a great success in the area of image recognition and processing in machine learning. However, state of the art techniques don’t involve just a few CNN layers. Rather, they can be very deep, consisting of 10s to >100 numbers of layers. One of the most successful CNN architectures developed has been the ResNet architecture. It was first introduced in 2015 (see this paper) and won the ILSVRC 2015 image classification task. The winning ResNet consisted of a whopping 152 layers, and in order to successfully make a network that deep, a significant innovation in CNN architecture was developed for ResNet. This innovation will be discussed in this post, and an example ResNet architecture will be developed in TensorFlow 2 and compared to a standard architecture. Because of the training requirements for this task, I have developed the code in Google Colaboratory (which gives free GPU time – see my tutorial here), and the notebook can be found on this site’s Github repository.


Eager to build deep learning systems in TensorFlow 2? Get the book here


Introduction to the ResNet architecture

The degradation problem

The vanishing gradient problem was an initial barrier to making neural networks deeper and more powerful. However, as explained in this post, the problem has now largely been solved through the use of ReLU activations and batch normalization. Given this is true, and given enough computational power and data, we should be able to stack many CNN layers and dramatically increase classification accuracy, right? Well – to a degree. An early architecture, called the VGG-19 architecture, had 19 layers. However, this is a long way off the 152 layers of the version of ResNet that won the ILSVRC 2015 image classification task. The reason deeper networks were not successful prior to the ResNet architecture was due to something called the degradation problem. Note, this is not the vanishing gradient problem, but something else. It was observed that making the network deeper led to higher classification errors. One might think this is due to overfitting of the data – but not so fast, the degradation problem leads to higher training errors too! Consider the diagrams below from the original ResNet paper:

Illustration of degradation problem that ResNet solves

Illustration of degradation problem that ResNet solves

Note that the 56-layer network has higher test and training errors. Theoretically, this doesn’t make much sense. Let’s say the 20-layer network learns some mapping H(x) that gives a training error of 10%. If another 36 layers are added, we would expect that the error would at least not be any worse than 10%. Why? Well, the 36 extra layers, at worst, could just learn identity functions. In other words, the extra 36 layers could just learn to pass through the output from the first 20-layers of the network. This would give the same error of 10%. This doesn’t seem to happen though. It appears neural networks aren’t great at learning the identity function in deep architectures. Not only don’t they learn the identity function (and hence pass through the 20 layer error rate), they make things worse. Beyond a certain number of layers, they begin to degrade the performance of the network compared to shallower implementations. Here is where the ResNet architecture comes in.

The ResNet solution

The ResNet solution relies on making the identity function option explicit in the architecture, rather than relying on the network itself to learn the identity function where appropriate. It consists of building networks which consist of the following CNN blocks:

ResNet building block

ResNet building block from here

In the diagram above, the input tensor x enters the building block. This input then splits. On one path, the input is processed by two stacked convolutional layers (called a “weight layer” in the above). This path is the “standard” CNN processing part of the building block. The ResNet innovation is the “identity” path. Here, the input x is simply added to the output of the CNN component of the building block, F(x). The output from the block is then F(x) + x with a final ReLU activation applied at the end. This identity path in the ResNet building block allows the neural network to more easily pass through any abstractions learnt in previous layers. Alternatively, it can more easily build incremental abstractions on top of the abstractions learnt in the previous layers. What do I mean by this? The diagram below may help:

ResNet layers and abstractions

Layers and abstractions

Generally speaking, as CNN layers are added to a network, the network during training will learn lower level abstractions in the early layers (i.e lines, colours, corners, basic shapes etc.) and higher level abstractions in the later layers (groups of geometries, objects etc.). Let’s say that, when trying to classify an aircraft in an image, there are some mid-level abstractions which reliably signal that an aircraft is present. Say the shape of a jet engine near a wing (this is just an example). These abstractions might be able to be learnt in, say, 10 layers.  

However, if we add an additional 20 or more layers after these first 10 layers, these reliable signals may get degraded / obfuscated. The ResNet architecture gives the network a more explicit chance of muting further CNN abstractions on some filters by driving F(x) to zero, with the output of the block defaulting to its input x. Not only that, the ResNet architecture allows blocks to “tinker” more easily with the input. This is because the block only has to learn the incremental difference between the previous layer abstraction and the optimal output H(x). In other words, it has to learn F(x) = H(x) – x. This is a residual expression, hence the name ResNet. This, theoretically at least, should be easier to learn than the full expression H(x).

An (somewhat tortured) analogy might assist here. Say you are trying to draw the picture of a tree. Someone hands you a picture of a pencil outline of the main structure of the tree – the trunk, large branches, smaller branches etc. Now say you are somewhat proud, and you don’t want too much help in drawing the picture. So, you rub out parts of the pencil outline of the tree that you were handed. You then proceed to add some detail to the picture you were handed, but you have to redraw parts that you already rubbed out. This is kind of like the case of a standard non-ResNet network. Because layers seem to struggle to reproduce an identity function, at each subsequent layer they essentially erase or degrade some of the previous level abstractions and these need to be re-estimated (at least to an extent).

Alternatively, you, the artist, might not be too proud and you happily accept the pencil outline that you received. It is much easier to then add new details to what you have already been given. This is like what the ResNet blocks do – they take what they are give i.e. x and just make tweaks to it by adding F(x). This analogy isn’t perfect, but it should give you an idea of what is going on here, and how the ResNet blocks help the learning along.

A full 34-layer version of ResNet is (partially) illustrated below (from the original paper):

ResNet architecture

ResNet-34 architecture (partial)

The diagram above shows roughly the first half of the ResNet 34-layer architecture, along with the equivalent layers of the VGG-19 architecture and a “plain” version of the ResNet architecture. The “plain” version has the same CNN layers, but lacks the identity path previously presented in the ResNet building block. These identity paths can be seen looping around every second CNN layer on the right hand side of the ResNet (“residual”) architecture.

In the next section, I’m going to show you how to build a ResNet architecture in TensorFlow 2/Keras. In the example, we’ll compare both the “plain” and “residual” networks on the CIFAR-10 classification task. Note that for computational ease, I’ll only include 10 ResNet blocks.

Building ResNet in TensorFlow 2

As discussed previously, the code for this example can be found on this site’s Github repository. Importing the CIFAR-10 dataset can be performed easily by using the Keras datasets API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import datetime as dt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

We then perform some pre-processing of the training and test data. This pre-processing includes image renormalization (converting the data so it resides in the range [0,1]) and centrally cropping the image to 75% of it’s normal extents. Data augmentation is also performed by randomly flipping the image about the centre axis. This is performed using the TensorFlow Dataset API – more details on the code below can be found in this, this post and my book.

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
train_dataset = train_dataset.repeat()

valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
valid_dataset = valid_dataset.repeat()

In this example, to build the network, we’re going to use the Keras Functional API, in the TensorFlow 2 context. Here is what the ResNet model definition looks like:

inputs = keras.Input(shape=(24, 24, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(3)(x)

num_res_net_blocks = 10
for i in range(num_res_net_blocks):
    x = res_net_block(x, 64, 3)

x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation='softmax')(x)

res_net_model = keras.Model(inputs, outputs)

First, we specify the input dimensions to Keras. The raw CIFAR-10 images have a size of (32, 32, 3) – but because we are performing central cropping of 75%, the post-processed images are of size (24, 24, 3). Next, we create 2 standard CNN layers, with 32 and 64 filters respectively (for more on convolutional layers, see this post and my book). The filter window sizes are 3 x 3, in line with the original ResNet architectures. Next some max pooling is performed and then it is time to produce some ResNet building blocks. In this case, 10 ResNet blocks are created by calling the res_net_block() function:

def res_net_block(input_data, filters, conv_size):
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
  x = layers.BatchNormalization()(x)
  x = layers.Conv2D(filters, conv_size, activation=None, padding='same')(x)
  x = layers.BatchNormalization()(x)
  x = layers.Add()([x, input_data])
  x = layers.Activation('relu')(x)
  return x

The first few lines of this function are standard CNN layers with Batch Normalization, except the 2nd layer does not have an activation function (this is because one will be applied after the residual addition part of the block). After these two layers, the residual addition part, where the input data is added to the CNN output (F(x)), is executed. Here we can make use of the Keras Add layer, which simply adds two tensors together. Finally, a ReLU activation is applied to the result of this addition and the outcome is returned.

After the ResNet block loop is finished, some final layers are added. First, a final CNN layer is added, followed by a Global Average Pooling (GAP) layer (for more on GAP layers, see here). Finally, we have a couple of dense classification layers with a dropout layer in between. This model was trained over 30 epochs and then an alternative “plain” model was also created. This was created by taking the same architecture but replacing the res_net_block function with the following function:

def non_res_block(input_data, filters, conv_size):
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
  x = layers.BatchNormalization()(x)
  x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(x)
  x = layers.BatchNormalization()(x)
  return x

Note that this function is simply two standard CNN layers, with no residual components included. The training code is as follows:

callbacks = [
  # Write TensorBoard logs to `./logs` directory
  keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")), write_images=True),
]

res_net_model.compile(optimizer=keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
res_net_model.fit(train_dataset, epochs=30, steps_per_epoch=195,
          validation_data=valid_dataset,
          validation_steps=3, callbacks=callbacks)

ResNet training and validation results

The accuracy results of the training of these two models can be observed below:

ResNet vs "plain" training accuracy

ResNet (red) vs “plain” (pink) training accuracy

 

ResNet vs "plain" testing accuracy

ResNet (blue) vs “plain” (green) training accuracy

As can be observed there is around a 5-6% improvement in the training accuracy from a ResNet architecture compared to the “plain” non-ResNet architecture. I have run this comparison a number of times and the 5-6% gap is consistent across the runs. These results illustrate the power of the ResNet idea, even for a relatively shallow 10 layer ResNet architecture. As demonstrated in the original paper, this effect will be more pronounced in deeper networks. Note that this network is not very well optimized, and the accuracy could be improved by running for more iterations. However, it is enough to show the benefits of the ResNet architecture. In future posts, I’ll demonstrate other ResNet-based architectures which can achieve even better results.  


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Introduction to ResNet in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Double Q reinforcement learning in TensorFlow 2

In previous posts (here and here), deep Q reinforcement learning was introduced. In these posts, examples were presented where neural networks were used to train an agent to act within an environment to maximize rewards. The neural network was trained using something called Q-learning. However, deep Q learning (DQN) has a flaw – it can be unstable due to biased estimates of future rewards, and this slows learning. In this post, I’ll introduce Double Q learning which can solve this bias problem and produce better Q-learning outcomes. We’ll be running a Double Q network on a modified version of the Cartpole reinforcement learning environment. We’ll also be developing the network in TensorFlow 2 – at the time of writing, TensorFlow 2 is in beta and installation instructions can be found here. The code examined in this post can be found here.


Eager to build deep learning systems in TensorFlow 2? Get the book here


A recap of deep Q learning

As mentioned above, you can go here and here to review deep Q learning. However, a quick recap is in order. The goal of the neural network in deep Q learning is to learn the function $Q(s_t, a_t; theta_t)$. At a given time in the game / episode, the agent will be in a state $s_t$. This state is fed into the network, and various Q values will be returned for each of the possible actions $a_t$ from state $s_t$. The $theta_t$ refers to the parameters of the neural network (i.e. all the weight and bias values).

The agent chooses an action based on an epsilon-greedy policy $pi$. This policy is a combination of randomly selected actions combined with the output of the deep Q neural network – with the probability of a randomly selected action decreasing over the training time. When the deep Q network is used to select an action, it does so by taking the maximum Q value returned over all the actions, for state $s_t$. For example, if an agent is in state 1, and this state has 4 possible actions which the agent can perform, it will output 4 Q values. The action which has the highest Q value is the action which will be selected. This can be expressed as:

$$a = argmax Q(s_t, a; theta_t)$$

Where the argmax is performed over all the actions / output nodes of the neural network. That’s how actions are chosen in deep Q learning. How does training occur? It occurs by utilising the Q-learning / Bellman equation. The equation looks like this:

$$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$

How does this read? For a given action a from state $s_{t}$, we want to train the network to predict the following:

  • The immediate reward for taking this action $r_{t+1}$, plus
  • The discounted reward for the best possible action in the subsequent state ($s_{t+1}$)

If we are successful in training the network to predict these values, the agent will consistently chose the action which gives the best immediate reward ($r_{t+1}$) plus the discounted future rewards of future states $gamma max_{{a}}Q(s_{t+1}, a;theta_t)$. The $gamma$ term is the discount term, which places less value on future reward than present rewards (but usually only marginally).

In deep Q learning, the game is repeatedly played and the states, actions and rewards are stored in memory as a list of tuples or an array – ($s_t$, $a$, $r_t$, $s_{t+1}$). Then, for each training step, a random batch of these tuples is extracted from memory and the $Q_{target}(s_t, a_t)$ is calculated and compared to the value produced from the current network $Q(s_t, a_t)$ – the mean squared difference between these two values is used as the loss function to train the neural network.

That’s a fairly brief recap of deep Q learning – for a more extended treatment see here and here. The next section will explain the problems with standard deep Q learning.

The problem with deep Q learning

The problem of deep Q learning has to do with the way it sets the target values:

$$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$

Namely, the issue is with the $max$ value. This part of the equation is supposed to estimate the value of the rewards for future actions if action is taken from the current state $s_t$. That’s a bit of a mouthful, but just consider it as trying to estimate the optimal future rewards $r_future$ if action a is taken.

The problem is that in many environments, there is random noise. Therefore, as an agent explores an environment, it is not directly observing or $r_future$, but something like $r + epsilon$, where $epsilon$ is the noise. In such an environment, after repeated playing of the game, we would hope that the network would learn to make unbiased estimates of the expected value of the rewards – so E[r]. If it can do this, we are in a good spot – the network should pick out the best actions for current and future rewards, despite the presence of noise.

This is where the $max$ operation is a problem – it produces biased estimates of the future rewards, not the unbiased estimates we require for optimal results. An example will help explain this better. Consider the environment below. The agent starts in state A and at each state can move left or right. The states C, D and F are terminal states – the game ends once these points are reached. The r values are the rewards the agent receives when transitioning from state to state.  

Deep Q network bias illustration - Double Q network tutorial

Deep Q network bias illustration

All the rewards are deterministic except for the rewards when transitioning from states B to C and B to D. The rewards for these transitions are randomly drawn from a normal distribution with a mean of 1 and a standard deviation of 4.

We know the expected rewards, E[r] from taking either action (B to C or B to D) is 1 – however, there is a lot of noise associated with these rewards. Regardless, on average, the agent should ideally learn to always move to the left from A, towards E and finally F where always equals 2.

Let’s consider the $Q_{target}$ expression for these cases. Let’s set $gamma$ to be 0.95. The $Q_target$ expression to move to the left from is: $Q_{target} = 0 + 0.95 * max([0, 2]) = 1.9$. The two action options from E are to either move right (r = 0) or left (r = 2). The maximum of these is obviously 2, and hence we get the result 1.9.

What about in the opposite direction, moving right from A? In this case, it is $Q_{target} = 0 + 0.95 * max([N(1, 4), N(1, 4)])$. We can explore the long term value of this “moving right” action by using the following code snippet:

import numpy as np

Ra = np.zeros((10000,))
Rc = np.random.normal(1, 4, 10000)
Rd = np.random.normal(1, 4, 10000)

comb = np.vstack((Ra, Rc, Rd)).transpose()

max_result = np.max(comb, axis=1)

print(np.mean(Rc))
print(np.mean(Rd))
print(np.mean(max_result))

Here a 10,000 iteration trial is created of what the $max$ term will yield in the long term of running a deep Q agent in this environment. Ra is the reward for moving back to the left towards A (always zero, hence np.zeros()). Rc and Rd are both normal distributions, with mean 1 and standard deviation of 4. Combining all these options together and taking the maximum for each trial gives us what the trial-by-trial $max$ term will be (max_result). Finally, the expected values (i.e. the means) of each quantity are printed. As expected, the mean of Rc and Rd are approximately equal to 1 – the mean which we set for their distributions. However, the expected value / mean from the $max$ term is actually around 3!

You can see the problem here. Because the $max$ term is always taking the maximum value from the random draws of the rewards, it tends to be positively biased and does not give a true indication of the expected values of the rewards for a move in this direction (i.e. 1). As such, an agent using the deep Q learning methodology will not chose the optimal action from (i.e. move left) but will rather tend to move right!

Therefore, in noisy environments, it can be seen that deep Q learning will tend to overestimate rewards. Eventually, deep Q learning will converge to a reasonable solution, but it is potentially much slower than it needs to be. A further problem occurs in deep Q learning which can cause instability in the training process. Consider that in deep Q learning the same network both choses the best action and determines the value of choosing said actions. There is a feedback loop here which can exacerbate the previously mentioned reward overestimation problem, and further slow down the learning process. This is clearly not ideal, and this is why Double Q learning was developed.

An introduction to Double Q reinforcement learning

The paper that introduced Double Q learning initially proposed the creation of two separate networks which predicted $Q^A$ and $Q^B$ respectively. These networks were trained on the same environment / problem, but were each randomly updated. So, say, 50% of the time, $Q^A$ was updated based on a certain random set of training tuples, and 50% of the time $Q^B$ was updated on a different random set of training tuples. Importantly, the update or target equation for network A had an estimate of the future rewards from network B – not itself. This new approach does two things:

  1. The A and B networks are trained on different training samples – this acts to remove the overestimation bias, as, on average, if network A sees a high noisy reward for a certain action, it is likely that network B will see a lower reward – hence the noise effects will cancel
  2. There is a decoupling between the choice of the best action and the evaluation of the best action

The algorithm from the original paper is as follows:

Original Double Q algorithm

Original Double Q algorithm

As can be observed, first an action is chosen from either $Q^A(s_t,.)$ or $Q^B(s_t,.)$ and the rewards, next state, action etc. are stored in the memory. Then either UPDATE(A) or UPDATE(B) is chosen randomly. Next, for the state $s_{t+1}$ (or s’ in the above) the predicted Q value for all actions from this state are taken from network A or B, and the action with the highest predicted Q value is chosen, a*. Note that, within UPDATE(A), this action is chosen from the output of the $Q^A$ network.

Next, you’ll see something interesting. Consider the update equation for $Q^A$ above – I’ll represent it in more familiar, neural network based notation below:

$$Q^A_{target} = r_{t+1} + gamma Q^B(s_{t+1}, a*)$$

Notice that, while the best action a* from the next state ($s_{t+1}$) is chosen from network A, the discounted reward for taking that future action is extracted from network B. This removes any bias associated with the $argmax$ from network A, and also decouples the choice of actions from the evaluation of the value of such actions (i.e. breaks the feedback loop). This is the heart of the Double Q reinforcement learning.

The Double DQN network

The same author of the original Double Q algorithm shown above proposed an update of the algorithm in this paper. This updated algorithm can still legitimately be called a Double Q algorithm, but the author called it Double DQN (or DDQN) to disambiguate. The main difference in this algorithm is the removal of the randomized back-propagation based updating of two networks A and B. There are still two networks involved, but instead of training both of them, only a primary network is actually trained via back-propagation. The other network, often called the target network, is periodically copied from the primary network. The update operation for the primary network in the Double DQN network looks like the following:

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; theta_t); theta^-_t)$$

Alternatively, keeping in line with the previous representation:

$$a* = argmax Q(s_{t+1}, a; theta_t)$$

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*); theta^-_t)$$

Notice that, as per the previous algorithm, the action a* with the highest Q value from the next state ($s_{t+1}$) is extracted from the primary network, which has weights $theta_t$. This primary network is also often called the “online” network – it is the network from which action decisions are taken. However, notice that, when determining $Q_{target}$, the discounted Q value is taken from the target network with weights $theta^-_t$. Therefore, the actions for the agent to take are extracted from the online network, but the evaluation of the future rewards are taken from the target network. So far, this is similar to the UPDATE(A) step shown in the previous Double Q algorithm.

The difference in this algorithm is that the target network weights ($theta^-_t$) are not trained via back-propagation – rather they are periodically copied from the online network. This reduces the computational overhead of training two networks by back-propagation. This copying can either be a periodic “hard copy”, where the weights are copied from the online network to the target network with no modification, or a more frequent “soft copy” can occur, where the existing target weight values and the online network values are blended. In the example which will soon follow, soft copying will be performed every training iteration, under the following rule:

$$theta^- = theta^- (1-tau) + theta tau$$

With $tau$ being a small constant (i.e. 0.05).

This DDQN algorithm achieves both decoupling between the action choice and evaluation, and it has been shown to remove the bias of deep Q learning. In the next section, I’ll present a code walkthrough of a training algorithm which contains options for both standard deep Q networks and Double DQNs.

A Double Q network example in TensorFlow 2

In this example, I’ll present code which trains a double Q network on the Cartpole reinforcement learning environment. This environment is implemented in OpenAI gym, so you’ll need to have that package installed before attempting to run or replicate. The code for this example can be found on this site’s Github repo.

First, we declare some constants and create the environment:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
LAMBDA = 0.0005
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

Notice the epsilon greedy policy parameters (MIN_EPSILON, MAX_EPSILON, LAMBDA) which dictate how long the exploration period of the training should last. GAMMA is the discount rate of future rewards. The final constant RANDOM_REWARD_STD will be explained later in more detail.

It can be observed that the CartPole environment has a state size of 4, and the number of actions available are extracted directly from the environment (there are only 2 of them). Next the primary (or online) network and the target network are created using the Keras Sequential API:

primary_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

target_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')

The code above is fairly standard Keras model definitions, with dense layers and ReLU activations, and He normal initializations (for further information, see these posts: Keras, ReLU activations and initialization). Notice that only the primary network is compiled, as this is the only network which will be trained via the Adam optimizer.

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(50000)

Next a generic Memory class object is created. This holds all the ($s_t$, a, $r_t$, $s_{t+1}$) tuples which are stored during training, and includes functionality to extract random samples for training. In this example, we’ll be using a Memory instance with a maximum sample buffer of 50,000 rows.

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

The function above executes the epsilon greedy action policy. As explained in previous posts on deep Q learning, the epsilon value is slowly reduced and the action selection moves from the random selection of actions to actions selected from the primary network. A final training function needs to be reviewed, but first we’ll examine the main training loop:

num_episodes = 1000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DoubleQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = False
steps = 0
for i in range(num_episodes):
    state = env.reset()
    cnt = 0
    avg_loss = 0
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, reward, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))

        loss = train(primary_network, memory, target_network if double_q else None)
        avg_loss += loss

        state = next_state

        # exponentially decay the eps value
        steps += 1
        eps = MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * steps)

        if done:
            avg_loss /= cnt
            print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.3f}, eps: {eps:.3f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', cnt, step=i)
                tf.summary.scalar('avg loss', avg_loss, step=i)
            break

        cnt += 1

Starting from the num_episodes loop, we can observe that first the environment is reset, and the current state of the agent returned. A while True loop is then entered into, which is only exited when the environment returns the signal that the episode has been completed. The code will render the Cartpole environment if the relevant variable has been set to True.

The next line shows the action selection, where the primary network is fed into the previously examined choose_network function, along with the current state and the epsilon value. This action is then fed into the environment by calling the env.step() command. This command returns the next state that the agent has entered ($s_{t+1}$), the reward ($r_{t+1}$) and the done Boolean which signifies if the episode has been completed.

The Cartpole environment is completely deterministic, with no randomness involved except in the initialization of the environment. Because Double Q learning is superior to deep Q learning especially when there is randomness in the environment, the Cartpole environment has been externally transformed into a stochastic environment on the next line. Normally, the reward from the Cartpole environment is a deterministic value of 1.0 for every step the pole stays upright. Here, however, the reward is replaced with a sample from a normal distribution, with mean 1.0 and standard deviation equal to the constant RANDOM_REWARD_STD.

In the first pass – RANDOM_REWARD_STD is set to 0.0 to transform the environment back to a deterministic case, but this will be changed in the next example run.

After this, the memory is added to and the primary network is trained.

Notice that the target_network is only passed to the training function if the double_q variable is set to True. If double_q is set to False, the training function defaults to standard deep Q learning. Finally the state is updated, and if the environment has signalled the episode has ended, some logging is performed and the while loop is exited.

It is now time to review the train function, which is where most of the work takes place:

def train(primary_network, memory, target_network=None):
    if memory.num_samples < BATCH_SIZE * 3:
        return 0
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    if target_network is not None:
        # update target network parameters slowly from primary network
        for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
            t.assign(t * (1 - TAU) + e * TAU)
    return loss

The first line is a bypass of this function if the memory does not contain more than 3 x the batch size – this is to ensure no training of the primary network takes place until there is a reasonable amount of samples within the memory.

Next, a batch is extracted from the memory – this is a list of tuples. The individual state, actions and reward values are then extracted and converted to numpy arrays using Python list comprehensions. Note that the next_state values are set to zeros if the raw next_state values are None – this only happens when the episode has terminated.

Next the sampled states ($s_t$) are passed through the network – this returns the values $Q(s_t, a; theta_t)$. The next line extracts the Q values from the primary network for the next states ($s_{t+1}$). Next, we want to start constructing our target_q values ($Q_target$). These are the “labels” which will be supplied to the primary network to train towards.

Note that the target_q values are the same as the prim_qt ($Q(s_t, a; theta_t)$) values except for the index corresponding to the action chosen. So, for instance, let’s say a single sample of the prim_qt values are [0.5, -0.5] – but the action chosen from $s_t$ was 0. We only want to update the 0.5 value  while training, the remaining values in target_q remain equal to prim_qt (i.e. [update, -0.5]). Therefore, in the next line, we create target_q by simply converting prim_qt from a tensor into its numpy equivalent. This is basically a copy of the values from prim_qt t0 target_q. We convert to numpy also, as it is easier to deal with indexing in numpy than TensorFlow at this stage.

To affect these updates, we create a new variable updates. The first step is to set the update values to the sampled rewards – the $r_{t+1}$ values are the same regardless of whether we are performing deep Q learning or Double Q learning. In the following lines, these update values will be added to in order to capture the discounted future reward terms. The next line creates an array called valid_idxs. This array is to hold all those samples in the batch which don’t include a case where next_state is zero. When next_state is zero, this means that the episode has terminated. In those cases, only the first term of the equation below remains ($r_{t+1}$):

$$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*); theta^-_t)$$

Seeing as update already includes the first term, any further additions to update need to exclude these indexes.

The next line, batch_idxs, is simply a numpy arange which counts out the number of samples within the batch. This is included to ensure that the numpy indexing / broadcasting to follow works properly.

The next line switches depending on whether Double Q learning has been enabled or not. If target_network is None, then standard deep Q learning ensures. In such a case, the following term is calculated and added to updates (which already includes the reward term):

$$gamma max Q(s_{t+1}, a; theta)$$

Alternatively, if target_network is not None, then Double Q learning is performed. The first line:

prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)

calculates the following equation shown earlier:

$$a* = argmax Q(s_{t+1}, a; theta_t)$$

The next line extracts the Q values from the target network for state $s_{t+1}$ and assigns this to variable q_from_target. Finally, the update term has the following added to it:

$$gamma Q(s_{t+1}, a*); theta^-_t)$$

Notice, that the numpy indexing extracts from q_from_target all the valid batch samples, and within those samples, all the highest Q actions drawn from the primary network (i.e. a*).

Finally, the target_q values corresponding to the actions from state $s_t$ are updated with the update array.

Following this, the primary network is trained on this batch of data using the Keras train_on_batch. The last step in the function involves copying the primary or online network values into the target network. This can be varied so that this step only occurs every X amount of training steps (especially when one is doing a “hard copy”). However, as stated previously, in this example we’ll be doing a “soft copy” and therefore every training step involves the target network weights being moved slightly towards the primary network weights. As can be observed, for every trainable variable in both the primary and target networks, the target network trainable variables are assigned new values updated via the previously presented formula:

$$theta^- = theta^- (1-tau) + theta tau$$

That (rather lengthy) explanation concludes the discussion of how Double Q learning can be implemented in TensorFlow 2. Now it is time to examine the results of the training.

Double Q results for a deterministic case

In the first case, we are going to examine the deterministic training case when RANDOM_REWARD_STD is set to 0.0. The TensorBoard graph below shows the results:

 

Double Q deterministic case

Double Q deterministic case (blue – Double Q, red – deep Q)

As can be observed, in both the Double Q and deep Q training cases, the networks converge on “correctly” solving the Cartpole problem – with eventual consistent rewards of 180-200 per episode (a total reward of 200 is the maximum available per episode in the Cartpole environment). The Double Q case shows slightly better performance in reaching the “solved” state than the deep Q network implementation. This is likely due to better stability in decoupling the choice and evaluation of the actions, but it is not a conclusive result in this rather simple deterministic environment.

However, what happens when we increase the randomness by elevated RANDOM_REWARD_STD > 0?

Double Q results for a stochastic case

The results below show the case when RANDOM_REWARD_STD is increased to 1.0 – in this case, the rewards are drawn from a random normal distribution of mean 1.0 and standard deviation of 1.0:

 

Double Q stochastic case

Double Q stochastic case (blue – Double Q, red – deep Q)

As can be seen, in this case, the Double Q network significantly outperforms the deep Q training methodology. This demonstrates the effect of biasing in the deep Q training methodology, and the advantages of using Double Q learning in your reinforcement learning tasks.

I hope this post was helpful in increasing your understanding of both deep Q and Double Q reinforcement learning. Keep an eye out for future posts on reinforcement learning.


Eager to build deep learning systems in TensorFlow 2? Get the book here


 

 

The post Double Q reinforcement learning in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Transfer learning in TensorFlow 2 tutorial

In this post, I’m going to cover the very important deep learning concept called transfer learning. Transfer learning is the process whereby one uses neural network models trained in a related domain to accelerate the development of accurate models in your more specific domain of interest. For instance, a deep learning practitioner can use one of the state-of-the-art image classification models, already trained, as a starting point for their own, more specialized, image classification task. In this tutorial, I’ll be showing you how to perform transfer learning using an advanced, pre-trained image classification model – ResNet50 – to improve a more specific image classification task – the cats vs dogs classification problem. In particular, I’ll be showing you how to do this using TensorFlow 2. The code for this tutorial, in a Google Colaboratory notebook format, can be found on this site’s Github repository here. This code borrows some components from the official TensorFlow tutorial.


Eager to build deep learning systems? Get the book here


What are the benefits of transfer learning?

Transfer learning has many benefits, these are:

  1. It speeds up learning: For state of the art results in deep learning, one often needs to build very deep networks with many layers. In order to train such networks, one needs lots of data, computational power and time. These three things are often not readily available.
  2. It needs less data: As will be shown, transfer learning usually only adds a few extra layers to the pre-trained model, and the weights in the pre-trained model are generally fixed. Therefore, during the fine tuning of the model, only those few extra layers, or a small subset of the total number of layers, is subjected to training. This requires much less data to get good results.
  3. You can leverage the expert tuning of state-of-the-art models: As anyone who has been involved in building deep learning systems can tell you, it requires a lot of patience and tuning of the models to get the best results. By utilizing pre-trained, state-of-the-art models, you can skip a lot of this arduous work and rely on the efforts of experts in the field.

For these reasons, if you are performing some image recognition task, it may be worth using some of the pre-trained, state-of-the-art image classification models, like ResNet, DenseNet, InceptionNet and so on. How does one use these pre-trained models?

How to create a transfer learning model

To create a transfer learning model, all that is required is to take the pre-trained layers and “bolt on” your own network. This could be either at the beginning or end of the pre-trained model. Usually, one disables the pre-trained layer weights and only trains the “bolted on” layers which have been added. For image classification transfer learning, one usually takes the convolutional neural network (CNN) layers from the pre-trained model and adds one or more densely connected “classification” layers at the end (for more on convolutional neural networks, see this tutorial).  The pre-trained CNN layers act as feature extractors / maps, and the classification layer/s at the end can be “taught” to “interpret” these image features. The transfer learning model architecture that will be used in this example is shown below:  

 

Transfer learning TensorFlow 2 architecture

ResNet50 transfer learning architecture

The full ResNet50 model shown in the image above, in addition to a Global Average Pooling (GAP) layer, contains a 1000 node dense / fully connected layer which acts as a “classifier” of the 2048 (4 x 4) feature maps output from the ResNet CNN layers. For more on Global Average Pooling, see my tutorial. In this transfer learning task, we’ll be removing these last two layers (GAP and Dense layer) and replacing these with our own GAP and dense layer (in this example, we have a binary classification task – hence the output size is only 1). The GAP layer has no trainable parameters, but the dense layer obviously does – these will be the only parameters trained in this example. All of this is performed quite easily in TensorFlow 2, as will be shown in the next section.

Transfer learning in TensorFlow 2

In this example, we’ll be using the pre-trained ResNet50 model and transfer learning to perform the cats vs dogs image classification task. I’ll also train a smaller CNN from scratch to show the benefits of transfer learning. To access the image dataset, we’ll be using the tensorflow_datasets package which contains a number of common machine learning datasets. To load the data, the following commands can be run:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

A few things to note about the code snippet above. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. The variables cat_train, cat_valid and cat_test are TensorFlow Dataset objects – to learn more about these, check out my previous post. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

First, we’ll build a smaller CNN image classifier which will be trained from scratch.

A smaller CNN model

In the code below, a 3 x CNN layer head, a GAP layer and a final densely connected output layer is created. The Keras API, which is the encouraged approach for TensorFlow 2, is used in the model definition below. For more on Keras, see this and this tutorial.

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

average_pool = tf.keras.Sequential()
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

standard_model = tf.keras.Sequential([
    head, 
    average_pool
])

To train the model we run:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/standard_model', update_freq='batch')]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
               validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss function is ‘binary cross-entropy’, due to the fact that the cats vs dogs image classification task is a binary classification problem (i.e. 0 = cat, 1 = dog or vice-versa). Running the code above, after 7 epochs, gives a training accuracy of around 89% and a validation accuracy of around 85%. Next we’ll see how this compares to the transfer learning case.

ResNet50 transfer learning example

To download the ResNet50 model, you can utilize the tf.keras.applications object to download the ResNet50 model in Keras format with trained parameters. To do so, run the following code:

IMG_SHAPE = (IMAGE_SIZE, IMAGE_SIZE, 3)
res_net = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=IMG_SHAPE)

The weights argument ‘imagenet’ denotes that the weights to be used are those generated by being trained on the ImageNet dataset. The include_top argument states that we only want the CNN-feature maps part of the ResNet50 model – not its final GAP and dense connected layers. Finally, we need to specify what input shape we want the model being setup to receive. Next, we need to disable the training of the parameters within this Keras model. This is performed really easily:

res_net.trainable = False

Next we create a Global Average Pooling layer, along with a final densely connected output layer with sigmoid activation. Then the model is combined using the Keras sequential framework where Keras models can be chained together:

global_average_layer = layers.GlobalAveragePooling2D()
output_layer = layers.Dense(1, activation='sigmoid')
tl_model = tf.keras.Sequential([
  res_net,
  global_average_layer,
  output_layer
])

That’s all that’s required – TensorFlow 2 and Keras make many deep learning tasks quite easy. Running tl_model.summary() gives the following output:

Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712  
_________________________________________________________________
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2049      
=================================================================
Total params: 23,589,761
Trainable params: 2,049
Non-trainable params: 23,587,712
_________________________________________________________________

As can be observed, while the total number of parameters is large (i.e. 23 million) the number of trainable parameters, corresponding to the weights of the final output layer, is only 2,049. 

To train the model we run:

tl_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/transer_learning_model', update_freq='batch')]

tl_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
             validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Comparing the models

The graphs below from TensorBoard show the relative performance of the small CNN model trained from scratch and the ResNet50 transfer learning model:

Transfer learning TensorFlow 2 training accuracy comparison

Transfer learning training accuracy comparison (blue – ResNet50, pink – smaller CNN model)

Transfer learning TensorFlow 2 validation accuracy comparison

Transfer learning validation accuracy comparision (red – ResNet50 model, green – smaller CNN model)

The results above show that the ResNet50 model reaches higher levels of both training and validation accuracy much quicker than the smaller CNN model that was trained from scratch. This illustrates the benefit of using these powerful pre-trained models as a starting point for your more domain specific deep learning tasks. I hope this post has been a help and given you a good understanding of the benefits of transfer learning, and also how to implement it easily in TensorFlow 2.  


Eager to build deep learning systems? Get the book here

The post Transfer learning in TensorFlow 2 tutorial appeared first on Adventures in Machine Learning.

Categories
Offsites

An introduction to Global Average Pooling in convolutional neural networks

For those familiar with convolutional neural networks (if you’re not, check out this post), you will know that, for many architectures, the final set of layers are often of the fully connected variety. This is like bolting a standard neural network classifier onto the end of an image processor. The convolutional neural network starts with a series of convolutional (and, potentially, pooling) layers which create feature maps which represent different components of the input images. The fully connected layers at the end then “interpret” the output of these features maps and make category predictions. However, as with many things in the fast moving world of deep learning research, this practice is starting to fall by the wayside in favor of something called Global Average Pooling (GAP). In this post, I’ll introduce the benefits of Global Average Pooling and apply it on the Cats vs Dogs image classification task using TensorFlow 2. In the process, I’ll compare its performance to the standard fully connected layer paradigm. The code for this tutorial can be found in a Jupyter Notebook on this site’s Github repository, ready for use in Google Colaboratory.


Eager to build deep learning systems? Get the book here


Global Average Pooling

Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling (see here for more details). The diagram below shows how it is commonly used in a convolutional neural network:

Global Average Pooling in a CNN architecture

Global Average Pooling in a CNN architecture

As can be observed, the final layers consist simply of a Global Average Pooling layer and a final softmax output layer. As can be observed, in the architecture above, there are 64 averaging calculations corresponding to the 64, 7 x 7 channels at the output of the second convolutional layer. The GAP layer transforms the dimensions from (7, 7, 64) to (1, 1, 64) by performing the averaging across the 7 x 7 channel values. Global Average Pooling has the following advantages over the fully connected final layers paradigm:

  • The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output being flattened and fed into a 500 node dense layer yields 1.56 million weights which need to be trained. Removing these layers speeds up the training of your model.
  • The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
  • The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
  • Finally, the authors also argue that, due to the averaging operation over the feature maps, this makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is included / or activated in the feature map somewhere, it will still be “picked up” by the averaging operation.

To test out these ideas in practice, in the next section I’ll show you an example comparing the benefits of the Global Average Pooling with the historical paradigm. This example problem will be the Cats vs Dogs image classification task and I’ll be using TensorFlow 2 to build the models. At the time of writing, only TensorFlow 2 Alpha is available, and the reader can follow this link to find out how to install it.

Global Average Pooling with TensorFlow 2 and Cats vs Dogs

To download the Cats vs Dogs data for this example, you can use the following code:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

The code above utilizes the TensorFlow Datasets repository which allows you to import common machine learning datasets into TF Dataset objects.  For more on using Dataset objects in TensorFlow 2, check out this post. A few things to note. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

For more on TensorFlow datasets, see this post. Now it is time to build the model – in this example, we’ll be using the Keras API in TensorFlow 2. In this example, I’ll be using a common “head” model, which consists of layers of standard convolutional operations – convolution and max pooling, with batch normalization and ReLU activations:

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

Next, we need to add the “back-end” of the network to perform the classification.

Standard fully connected classifier results

In the first instance, I’ll show the results of a standard fully connected classifier, without dropout. Because, for this example, there are only two possible classes – “cat” or “dog” – the final output layer is a dense / fully connected layer with a single node and a sigmoid activation.

standard_classifier = tf.keras.Sequential()
standard_classifier.add(layers.Flatten())
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.Dense(1))
standard_classifier.add(layers.Activation('sigmoid'))

As can be observed, in this case, the output classification layers includes 2 x 100 node dense layers. To combine the head model and this standard classifier, the following commands can be run:

standard_model = tf.keras.Sequential([
    head, 
    standard_classifier
])

Finally, the model is compiled, a TensorBoard callback is created for visualization purposes, and the Keras fit command is executed:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=10, validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss used is binary crossentropy, due to the binary classes for this example. The training progress over 7 epochs can be seen in the figure below:

Standard classifier without average pooling

Standard classifier accuracy (red – training, blue – validation)

Standard classifier loss without average pooling

Standard classifier loss (red – training, blue – validation)

As can be observed, with a standard fully connected classifier back-end to the model (without dropout), the training accuracy reaches high values but it overfits with respect to the validation dataset. The validation dataset accuracy stagnates around 80% and the loss begins to increase – a sure sign of overfitting.

Global Average Pooling results

The next step is to test the results of the Global Average Pooling in TensorFlow 2. To build the GAP layer and associated model, the following code is added:

average_pool = tf.keras.Sequential()
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

pool_model = tf.keras.Sequential([
    head, 
    average_pool
])

The accuracy results for this model, along with the results of the standard fully connected classifier model, are shown below:

Global Average Pooling accuracy

Global average pooling accuracy vs standard fully connected classifier model (pink – GAP training, green – GAP validation, blue – FC classifier validation)

As can be observed from the graph above, the Global Average Pooling model has a higher validation accuracy by the 7th epoch than the fully connected model. The training accuracy is lower than the FC model, but this is clearly due to overfitting being reduced in the GAP model. A final comparison including the case of the FC model with a dropout layer inserted is shown below:

standard_classifier_with_do = tf.keras.Sequential()
standard_classifier_with_do.add(layers.Flatten())
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dropout(0.5))
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dense(1))
standard_classifier_with_do.add(layers.Activation('sigmoid'))
Global Average Pooling accuracy vs FC with dropout

Global average pooling validation accuracy vs FC classifier with and without dropout (green – GAP model, blue – FC model without DO, orange – FC model with DO)

As can be seen, of the three model options sharing the same convolutional front end, the GAP model has the best validation accuracy after 7 epochs of training (x – axis in the graph above is the number of batches). Dropout improves the validation accuracy of the FC model, but the GAP model is still narrowly out in front. Further tuning could be performed on the fully connected models and results may improve. However, one would expect Global Average Pooling to be at least equivalent to a FC model with dropout – even though it has hundreds of thousands of fewer parameters. I hope this short tutorial gives you a good understanding of Global Average Pooling and its benefits. You may want to consider it in the architecture of your next image classifier design.


Eager to build deep learning systems? Get the book here

The post An introduction to Global Average Pooling in convolutional neural networks appeared first on Adventures in Machine Learning.

Categories
Offsites

Metrics and summaries in TensorFlow 2

In this relatively short post, I’m going to show you how to deal with metrics and summaries in TensorFlow 2. Metrics, which can be used to monitor various important variables during the training of deep learning networks (such as accuracy or various losses), were somewhat unwieldy in TensorFlow 1.X. Thankfully in the new TensorFlow 2.0 they are much easier to use. Summary logging, for visualization of training in the TensorBoard interface, has also undergone some changes in TensorFlow 2 that I will be demonstrating. Please note – at time of writing, only the alpha version of TensorFlow 2 is available, but it is probably safe to assume that the syntax and forms demonstrated in this tutorial will remain the same in TensorFlow 2.0. To install the alpha version, use the following command:
pip install tensorflow==2.0.0-alpha0
In this tutorial, I’ll be using a generic MNIST Convolutional Neural Network example, but utilizing full TensorFlow 2 design paradigms. To learn more about CNNs, see this tutorial – to understand more about TensorFlow 2 paradigms, see this tutorial. All the code for this tutorial can be found as a Google Colaboratory file on my Github repository.

Eager to build deep learning systems? Get the book here

 

TensorFlow 2 metrics

Metrics in TensorFlow 2 can be found in the TensorFlow Keras distribution – tf.keras.metrics. Metrics, along with the rest of TensorFlow 2, are now computed in an Eager fashion. In TensorFlow 1.X, metrics were gathered and computed using the imperative declaration, tf.Session style. All that is required now is to declare the metrics as a Python variable, use the method update_state() to add a state to the metric, result() to summarize the metric, and finally reset_states() to reset all the states of the metric.  The code below shows a simple implementation of a Mean metric:
mean_metric = tf.keras.metrics.Mean()
mean_metric.update_state(2.0)
mean_metric.update_state(3.0)
mean_metric.update_state(4.0)
print(mean_metric.result().numpy())
This will print the average result -> 3.0. As can be observed, there is an internal memory for the metric, which can be appended to using update_state(). The Mean metric operation is executed when result() is called. Finally, to reset the memory of the metric, we can use reset_states() as follows:
mean_metric.reset_states()
print(mean_metric.result().numpy())
This will print the default response of an empty metric – 0.0.

TensorFlow 2 summaries

Metrics fit hand-in-glove with summaries in TensorFlow 2. In order to log summaries in TensorFlow 2, the developer uses the with Python context manager. First, one creates a summary_writer object like so:
summary_writer = tf.summary.create_file_writer('/log')
To log something to the summary writer, the developer must first enclose the “space” within your code which does the logging with a Python with statement. The logging looks like so:
with summary_writer.as_default():
  tf.summary.scalar('mean', mean_metric.result(), step=1)
The with context can surround the full training loop, or just the area of the code where you are storing the summaries. As can be observed, the logged scalar value is set by using the metric result() method. The step value needs to be provided to the summary – this allows TensorBoard to plot the variation of various values, images etc. between training steps. The step number can be tracked manually, but the easiest way is to use the iterations property of whatever optimizer you are using. This will be demonstrated in the example below.

TensorFlow 2 metrics and summaries – CNN example

In this example, I’ll show how to use metrics and summaries in the context of a CNN MNIST classification example. In this example, I’ll use a custom training loop, rather than a Keras fit loop. In the next section, I’ll show you how to implement custom metrics even within the Keras fit functionality. As usual for any machine learning task, the first step is to prepare the training and validation data. In this case, we’ll be using the prepackaged Keras MNIST dataset, then converting the numpy data arrays into a TensorFlow dataset (for more on TensorFlow datasets, see here and here). This looks like the following:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
BATCH_SIZE=64
# first the training set
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(BATCH_SIZE).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.expand_dims(x, -1) / 255.0, y))
train_dataset = train_dataset.repeat()
# now the validation set
valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.expand_dims(x, -1) / 255.0, y))
valid_dataset = valid_dataset.repeat()
In the lines above, some preprocessing is applied to the image data to normalize it (divide the pixel values by 255, make the tensors 4D for consumption into CNN layers). Next I define the CNN model, using the Keras sequential paradigm:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, 2, 1, activation='relu', input_shape=(28, 28, 1)))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(32, 2, 1, activation='relu'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10))
The model declaration above is all standard Keras – for more on the sequential model type of Keras, see here. Next, we create a custom training loop function in TensorFlow. It is now best practice to encapsulate core parts of your code in Python functions – this is so that the @tf.function decorator can be applied easily to the function. This signals to TensorFlow to perform Just In Time (JIT) compilation of the relevant code into a graph, which allows the performance benefits of a static graph as per TensorFlow 1.X. Otherwise, the code will execute eagerly, which is not a big deal, but if one is building production or performance dependent code it is better to decorate with @tf.function. Here’s the training loop and optimization/loss function definitions:
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
def train(ds_train, optimizer, loss_fn, model, num_batches, log_freq=10):
  avg_loss = tf.keras.metrics.Mean()
  avg_acc = tf.keras.metrics.SparseCategoricalAccuracy()
  batch_idx = 0
  for batch_idx, (images, labels) in enumerate(ds_train):
    images = tf.expand_dims(images, -1)
    with tf.GradientTape() as tape:
      logits = model(images)
      loss_value = loss_fn(labels, logits)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    avg_loss.update_state(loss_value)
    avg_acc.update_state(labels, logits)
    if batch_idx % log_freq == 0:
      print(f"Batch {batch_idx}, average loss is {avg_loss.result().numpy()}, average accuracy is {avg_acc.result().numpy()}")
      tf.summary.scalar('loss', avg_loss.result(), step=optimizer.iterations)
      tf.summary.scalar('acc', avg_acc.result(), step=optimizer.iterations)
      avg_loss.reset_states()
      avg_acc.reset_states()
    if batch_idx > num_batches:
      break
As can be observed, I have created two metrics for use in this training loop – avg_loss and avg_acc. These are Mean and SparseCategoricalAccuracy metrics, respectively. The Mean metric has been discussed previously. The SparseCategoricalAccuracy metric takes, as input, the training labels and logits (raw, unactivated outputs from your model). Because it is a sparse categorical accuracy measure, it can take the training labels in scalar integer form, rather than one-hot encoded label vectors. Calling result() on this metric will calculate the average accuracy of all the labels/logits pairs passed during the update_state() call – see line 15 above. Every log_freq number of batches, the results of the metrics are printed and also passed as summary scalars. After the metrics are logged in the summaries, their states are reset. You will notice that I have not provided a with context for these summaries – this is applied in the outer epoch loop is shown below:
num_epochs = 10
summary_writer = tf.summary.create_file_writer('./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))
for i in range(num_epochs):
  print(f"Epoch {i + 1} of {num_epochs}")
  with summary_writer.as_default():
    train(train_dataset, optimizer, loss_fn, model, 10000//BATCH_SIZE)
As can be observed, the summary_writer.as_default() is supplied as context to the whole train function. So far so good. However, this is utilizing a “manual” TensorFlow training loop, which is no longer the easiest way to train in TensorFlow 2, given the tight Keras integration. In the next example, I’ll show you how to include run of the mill metrics in the Keras API, but also custom metrics.

TensorFlow 2 Keras metrics and summaries

To include normal metrics such as the accuracy in Keras is straight-forward – one supplies a list of metrics to be logged in the compile statement like so:
metric_model.compile(optimizer=tf.optimizers.Adam(),
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
However, if one wishes to log more complicated or custom metrics, it becomes difficult to see how to set this up in Keras. One easy way of doing so is by creating a custom Keras layer whose sole purpose is to add a metric to the model / training. In the example below, I have created a custom layer which adds the standard deviation of the kernel weights as a metric:
class MetricLayer(tf.keras.layers.Layer):
  def __init__(self, layer_to_log):
    super(MetricLayer, self).__init__()
    self.layer_to_log = layer_to_log
    
  def call(self, input):
    self.add_metric(tf.keras.backend.std(self.layer_to_log.variables[0]),
                    name=f'std_of_{self.layer_to_log.name}_kernel',
                    aggregation='mean')
    return input
A few things to notice about the creation of the custom layer above. First, notice that the layer is defined as a Python class object which inherits from the keras.layers.Layer object. The only variable passed to the initialization of this custom class is the layer with the kernel weights which we wish to log. The call method tells Keras / TensorFlow what to do when the layer is called in a feed forward pass. In this case, the input is passed straight through to the output – it is, in essence, a dummy layer. However, you’ll notice within the call a metric is added. The value of the metric is the standard deviation of layer_to_log.variables[0]. For a CNN layer, the zero index [0] of the layer variables is the kernel weights. A name is provided to the metric for ease of viewing during training, and finally the aggregation method of the metric is specified – in this case, a ‘mean’ aggregation of the standard deviations. To include this layer, one can just add it as a sequential element in the Keras model. In the below I take the existing CNN model created in the previous example, and create a new model with the custom metric layer appended to the end:
metric_model = tf.keras.Sequential()
metric_model.add(model)
metric_model.add(MetricLayer(model.layers[0]))
As can be observed in the above, the first layer of the previous model is passed to the custom MetricLayer. Running the fit training method on this model will now generate both the SparseCategoricalAccuracy metric, along with the custom standard deviation from the first layer. To monitor in TensorBoard, one must also include the TensorBoard callback. All of this looks like the following:
metric_model.compile(optimizer=tf.optimizers.Adam(),
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

callbacks = [
  # Write TensorBoard logs to `./logs` directory
  tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")), update_freq='batch')
]

metric_model.fit(train_dataset, steps_per_epoch=10000//BATCH_SIZE, epochs=5,
                 validation_data=valid_dataset, validation_steps=5,
                 callbacks=callbacks)
The code above will perform the training and ensure all the metrics (including the metric added in the custom metric layer) are output to TensorBoard via the TensorBoard callback. This concludes my quick introduction to metrics and summaries in TensorFlow 2. Watch out for future posts and updates of existing posts as the transition to TensorFlow 2 develops.  

Eager to build deep learning systems? Get the book here

 

The post Metrics and summaries in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Python TensorFlow Tutorial – Build a Neural Network

Updated for TensorFlow 2

Google’s TensorFlow has been a hot topic in deep learning recently.  The open source software, designed to allow efficient computation of data flow graphs, is especially suited to deep learning tasks.  It is designed to be executed on single or multiple CPUs and GPUs, making it a good option for complex deep learning tasks.  In its most recent incarnation – version 1.0 – it can even be run on certain mobile operating systems.  This introductory tutorial to TensorFlow will give an overview of some of the basic concepts of TensorFlow in Python.  These will be a good stepping stone to building more complex deep learning networks, such as Convolution Neural Networks, natural language models, and Recurrent Neural Networks in the package.  We’ll be creating a simple three-layer neural network to classify the MNIST dataset.  This tutorial assumes that you are familiar with the basics of neural networks, which you can get up to scratch with in the neural networks tutorial if required.  To install TensorFlow, follow the instructions here. The code for this tutorial can be found in this site’s GitHub repository.  Once you’re done, you also might want to check out a higher level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

First, let’s have a look at the main ideas of TensorFlow.

1.0 TensorFlow graphs

TensorFlow is based on graph based computation – “what on earth is that?”, you might say.  It’s an alternative way of conceptualising mathematical calculations.  Consider the following expression $a = (b + c) * (c + 2)$.  We can break this function down into the following components:

begin{align}
d &= b + c \
e &= c + 2 \
a &= d * e
end{align}

Now we can represent these operations graphically as:

TensorFlow tutorial - simple computational graph

Simple computational graph

This may seem like a silly example – but notice a powerful idea in expressing the equation this way: two of the computations ($d=b+c$ and $e=c+2$) can be performed in parallel.  By splitting up these calculations across CPUs or GPUs, this can give us significant gains in computational times.  These gains are a must for big data applications and deep learning – especially for complicated neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).  The idea behind TensorFlow is to the ability to create these computational graphs in code and allow significant performance improvements via parallel operations and other efficiency gains.

We can look at a similar graph in TensorFlow below, which shows the computational graph of a three-layer neural network.

TensorFlow tutorial - data flow graph

TensorFlow data flow graph

The animated data flows between different nodes in the graph are tensors which are multi-dimensional data arrays.  For instance, the input data tensor may be 5000 x 64 x 1, which represents a 64 node input layer with 5000 training samples.  After the input layer, there is a hidden layer with rectified linear units as the activation function.  There is a final output layer (called a “logit layer” in the above graph) that uses cross-entropy as a cost/loss function.  At each point we see the relevant tensors flowing to the “Gradients” block which finally flows to the Stochastic Gradient Descent optimizer which performs the back-propagation and gradient descent.

Here we can see how computational graphs can be used to represent the calculations in neural networks, and this, of course, is what TensorFlow excels at.  Let’s see how to perform some basic mathematical operations in TensorFlow to get a feel for how it all works.

2.0 A Simple TensorFlow example

So how can we make TensorFlow perform the little example calculation shown above – $a = (b + c) * (c + 2)$? First, there is a need to introduce TensorFlow variables.  The code below shows how to declare these objects:

import tensorflow as tf
# create TensorFlow variables
const = tf.Variable(2.0, name="const")
b = tf.Variable(2.0, name='b')
c = tf.Variable(1.0, name='c')

As can be observed above, TensorFlow variables can be declared using the tf.Variable function.  The first argument is the value to be assigned to the variable. The second is an optional name string which can be used to label the constant/variable – this is handy for when you want to do visualizations.  TensorFlow will infer the type of the variable from the initialized value, but it can also be set explicitly using the optional dtype argument.  TensorFlow has many of its own types like tf.float32, tf.int32 etc.

The objects assigned to the Python variables are actually TensorFlow tensors. Thereafter, they act like normal Python objects – therefore, if you want to access the tensors you need to keep track of the Python variables. In previous versions of TensorFlow, there were global methods of accessing the tensors and operations based on their names. This is no longer the case.

To examine the tensors stored in the Python variables, simply call them as you would a normal Python variable. If we do this for the “const” variable, you will see the following output:

<tf.Variable ‘const:0′ shape=() dtype=float32, numpy=2.0>

This output gives you a few different pieces of information – first, is the name ‘const:0’ which has been assigned to the tensor. Next is the data type, in this case, a TensorFlow float 32 type. Finally, there is a “numpy” value. TensorFlow variables in TensorFlow 2 can be converted easily into numpy objects. Numpy stands for Numerical Python and is a crucial library for Python data science and machine learning. If you don’t know Numpy, what it is, and how to use it, check out this site. The command to access the numpy form of the tensor is simply .numpy() – the use of this method will be shown shortly.

Next, some calculation operations are created:

# now create some operations
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

Note that d and e are automatically converted to tensor values upon the execution of the operations. TensorFlow has a wealth of calculation operations available to perform all sorts of interactions between tensors, as you will discover as you progress through this book.  The purpose of the operations shown above are pretty obvious, and they instantiate the operations b + c, c + 2.0, and d * e. However, these operations are an unwieldy way of doing things in TensorFlow 2. The operations below are equivalent to those above:

d = b + c
e = c + 2
a = d * e

To access the value of variable a, one can use the .numpy() method as shown below:

print(f”Variable a is {a.numpy()}”)

The computational graph for this simple example can be visualized by using the TensorBoard functionality that comes packaged with TensorFlow. This is a great visualization feature and is explained more in this post. Here is what the graph looks like in TensorBoard:

TensorFlow tutorial - simple graph

Simple TensorFlow graph

The larger two vertices or nodes, b and c, correspond to the variables. The smaller nodes correspond to the operations, and the edges between the vertices are the scalar values emerging from the variables and operations.

The example above is a trivial example – what would this look like if there was an array of b values from which an array of equivalent a values would be calculated? TensorFlow variables can easily be instantiated using numpy variables, like the following:

b = tf.Variable(np.arange(0, 10), name='b')

Calling b shows the following:

<tf.Variable ‘b:0′ shape=(10,) dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

Note the numpy value of the tensor is an array. Because the numpy variable passed during the instantiation is a range of int32 values, we can’t add it directly to c as c is of float32 type. Therefore, the tf.cast operation, which changes the type of a tensor, first needs to be utilized like so:

d = tf.cast(b, tf.float32) + c

Running the rest of the previous operations, using the new b tensor, gives the following value for a:

Variable a is [ 3.  6.  9. 12. 15. 18. 21. 24. 27. 30.]

In numpy, the developer can directly access slices or individual indices of an array and change their values directly. Can the same be done in TensorFlow 2? Can individual indices and/or slices be accessed and changed? The answer is yes, but not quite as straight-forwardly as in numpy. For instance, if b was a simple numpy array, one could easily execute the following b[1] = 10 – this would change the value of the second element in the array to the integer 10.

b[1].assign(10)

This will then flow through to a like so:

Variable a is [ 3. 33.  9. 12. 15. 18. 21. 24. 27. 30.]

The developer could also run the following, to assign a slice of b values:

b[6:9].assign([10, 10, 10])

A new tensor can also be created by using the slice notation:

f = b[2:5]

The explanations and code above show you how to perform some basic tensor manipulations and operations. In the section below, an example will be presented where a neural network is created using the Eager paradigm in TensorFlow 2. It will show how to create a training loop, perform a feed-forward pass through a neural network and calculate and apply gradients to an optimization method.

3.0 A Neural Network Example

In this section, a simple three-layer neural network build in TensorFlow is demonstrated.  In following chapters more complicated neural network structures such as convolution neural networks and recurrent neural networks are covered.  For this example, though, it will be kept simple.

In this example, the MNIST dataset will be used that is packaged as part of the TensorFlow installation. This MNIST dataset is a set of 28×28 pixel grayscale images which represent hand-written digits.  It has 60,000 training rows, 10,000 testing rows, and 5,000 validation rows. It is a very common, basic, image classification dataset that is used in machine learning.

The data can be loaded by running the following:

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

As can be observed, the Keras MNIST data loader returns Python tuples corresponding to the training and test set respectively (Keras is another deep learning framework, now tightly integrated with TensorFlow, as mentioned earlier). The data sizes of the tuples defined above are:

  • x_train: (60,000 x 28 x 28)
  • y_train: (60,000)
  • x_test: (10,000 x 28 x 28)
  • y_test: (10,000)

The x data is the image information – 60,000 images of 28 x 28 pixels size in the training set. The images are grayscale (i.e black and white) with maximum values, specifying the intensity of whites, of 255. The x data will need to be scaled so that it resides between 0 and 1, as this improves training efficiency. The y data is the matching image labels – signifying what digit is displayed in the image. This will need to be transformed to “one-hot” format.

When using a standard, categorical cross-entropy loss function (this will be shown later), a one-hot format is required when training classification tasks, as the output layer of the neural network will have the same number of nodes as the total number of possible classification labels. The output node with the highest value is considered as a prediction for that corresponding label. For instance, in the MNIST task, there are 10 possible classification labels – 0 to 9. Therefore, there will be 10 output nodes in any neural network performing this classification task. If we have an example output vector of [0.01, 0.8, 0.25, 0.05, 0.10, 0.27, 0.55, 0.32, 0.11, 0.09], the maximum value is in the second position / output node, and therefore this corresponds to the digit “1”. To train the network to produce this sort of outcome when the digit “1” appears, the loss needs to be calculated according to the difference between the output of the network and a “one-hot” array of the label 1. This one-hot array looks like [0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

This conversion is easily performed in TensorFlow, as will be demonstrated shortly when the main training loop is covered.

One final thing that needs to be considered is how to extract the training data in batches of samples. The function below can handle this:

def get_batch(x_data, y_data, batch_size):
    idxs = np.random.randint(0, len(y_data), batch_size)
    return x_data[idxs,:,:], y_data[idxs]

As can be observed in the code above, the data to be batched i.e. the x and y data is passed to this function along with the batch size. The first line of the function generates a random vector of integers, with random values between 0 and the length of the data passed to the function. The number of random integers generated is equal to the batch size. The x and y data are then returned, but the return data is only for those random indices chosen. Note, that this is performed on numpy array objects – as will be shown shortly, the conversion from numpy arrays to tensor objects will be performed “on the fly” within the training loop.

There is also the requirement for a loss function and a feed-forward function, but these will be covered shortly.

# Python optimisation variables
epochs = 10
batch_size = 100

# normalize the input images by dividing by 255.0
x_train = x_train / 255.0
x_test = x_test / 255.0
# convert x_test to tensor to pass through model (train data will be converted to
# tensors on the fly)
x_test = tf.Variable(x_test)

First, the number of training epochs and the batch size are created – note these are simple Python variables, not TensorFlow variables. Next, the input training and test data, x_train and x_test, are scaled so that their values are between 0 and 1. Input data should always be scaled when training neural networks, as large, uncontrolled, inputs can heavily impact the training process. Finally, the test input data, x_test is converted into a tensor. The random batching process for the training data is most easily performed using numpy objects and functions. However, the test data will not be batched in this example, so the full test input data set x_test is converted into a tensor.

The next step is to setup the weight and bias variables for the three-layer neural network.  There are always L1 number of weights/bias tensors, where L is the number of layers.  These variables are defined in the code below:

# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random.normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random.normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random.normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random.normal([10]), name='b2')

The weight and bias variables are initialized using the tf.random.normal function – this function creates tensors of random numbers, drawn from a normal distribution. It allows the developer to specify things like the standard deviation of the distribution from which the random numbers are drawn.

Note the shape of the variables. The W1 variable is a [784, 300] tensor – the 784 nodes are the size of the input layer. This size comes from the flattening of the input images – if we have 28 rows and 28 columns of pixels, flattening these out gives us 1 row or column of 28 x 28 = 784 values.  The 300 in the declaration of W1 is the number of nodes in the hidden layer. The W2 variable is a [300, 10] tensor, connecting the 300-node hidden layer to the 10-node output layer. In each case, a name is given to the variable for later viewing in TensorBoard – the TensorFlow visualization package. The next step in the code is to create the computations that occur within the nodes of the network. If the reader recalls, the computations within the nodes of a neural network are of the following form:

$$z = Wx + b$$

$$h=f(z)$$

Where W is the weights matrix, x is the layer input vector, b is the bias and f is the activation function of the node. These calculations comprise the feed-forward pass of the input data through the neural network. To execute these calculations, a dedicated feed-forward function is created:

def nn_model(x_input, W1, b1, W2, b2):
    # flatten the input image from 28 x 28 to 784
    x_input = tf.reshape(x_input, (x_input.shape[0], -1))
    x = tf.add(tf.matmul(tf.cast(x_input, tf.float32), W1), b1)
    x = tf.nn.relu(x)
    logits = tf.add(tf.matmul(x, W2), b2)
    return logits

Examining the first line, the x_input data is reshaped from (batch_size, 28, 28) to (batch_size, 784) – in other words, the images are flattened out. On the next line, the input data is then converted to tf.float32 type using the TensorFlow cast function. This is important – the x­_input data comes in as tf.float64 type, and TensorFlow won’t perform a matrix multiplication operation (tf.matmul) between tensors of different data types. This re-typed input data is then matrix-multiplied by W1 using the TensorFlow matmul function (which stands for matrix multiplication). Then the bias b1 is added to this product. On the line after this, the ReLU activation function is applied to the output of this line of calculation. The ReLU function is usually the best activation function to use in deep learning – the reasons for this are discussed in this post.

The output of this calculation is then multiplied by the final set of weights W2, with the bias b2 added. The output of this calculation is titled logits. Note that no activation function has been applied to this output layer of nodes (yet). In machine/deep learning, the term “logits” refers to the un-activated output of a layer of nodes.

The reason no activation function has been applied to this layer is that there is a handy function in TensorFlow called tf.nn.softmax_cross_entropy_with_logits. This function does two things for the developer – it applies a softmax activation function to the logits, which transforms them into a quasi-probability (i.e. the sum of the output nodes is equal to 1). This is a common activation function to apply to an output layer in classification tasks. Next, it applies the cross-entropy loss function to the softmax activation output. The cross-entropy loss function is a commonly used loss in classification tasks. The theory behind it is quite interesting, but it won’t be covered in this book – a good summary can be found here. The code below applies this handy TensorFlow function, and in this example,  it has been nested in another function called loss_fn:

def loss_fn(logits, labels):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                                              logits=logits))
    return cross_entropy

The arguments to softmax_cross_entropy_with_logits are labels and logits. The logits argument is supplied from the outcome of the nn_model function. The usage of this function in the main training loop will be demonstrated shortly. The labels argument is supplied from the one-hot y values that are fed into loss_fn during the training process. The output of the softmax_cross_entropy_with_logits function will be the output of the cross-entropy loss value for each sample in the batch. To train the weights of the neural network, the average cross-entropy loss across the samples needs to be minimized as part of the optimization process. This is calculated by using the tf.reduce_mean function, which, unsurprisingly, calculates the mean of the tensor supplied to it.

The next step is to define an optimizer function. In many examples within this book, the versatile Adam optimizer will be used. The theory behind this optimizer is interesting, and is worth further examination (such as shown here) but won’t be covered in detail within this post. It is basically a gradient descent method, but with sophisticated averaging of the gradients to provide appropriate momentum to the learning. To define the optimizer, which will be used in the main training loop, the following code is run:

# setup the optimizer
optimizer = tf.keras.optimizers.Adam()

The Adam object can take a learning rate as input, but for the present purposes, the default value is used.

3.1 Training the network

Now that the appropriate functions, variables and optimizers have been created, it is time to define the overall training loop. The training loop is shown below:

total_batch = int(len(y_train) / batch_size)
for epoch in range(epochs):
    avg_loss = 0
    for i in range(total_batch):
        batch_x, batch_y = get_batch(x_train, y_train, batch_size=batch_size)
        # create tensors
        batch_x = tf.Variable(batch_x)
        batch_y = tf.Variable(batch_y)
        # create a one hot vector
        batch_y = tf.one_hot(batch_y, 10)
        with tf.GradientTape() as tape:
            logits = nn_model(batch_x, W1, b1, W2, b2)
            loss = loss_fn(logits, batch_y)
        gradients = tape.gradient(loss, [W1, b1, W2, b2])
        optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2]))
        avg_loss += loss / total_batch
    test_logits = nn_model(x_test, W1, b1, W2, b2)
    max_idxs = tf.argmax(test_logits, axis=1)
    test_acc = np.sum(max_idxs.numpy() == y_test) / len(y_test)
    print(f"Epoch: {epoch + 1}, loss={avg_loss:.3f}, test set      accuracy={test_acc*100:.3f}%")

print("nTraining complete!")

Stepping through the lines above, the first line is a calculation to determine the number of batches to run through in each training epoch – this will ensure that, on average, each training sample will be used once in the epoch.  After that, a loop for each training epoch is entered. An avg_cost variable is initialized to keep track of the average cross entropy cost/loss for each epoch. The next line is where randomised batches of samples are extracted (batch_x and batch_y) from the MNIST training dataset, using the get_batch() function that was created earlier.

Next, the batch_x and batch_y numpy variables are converted to tensor variables. After this, the label data stored in batch_y as simple integers (i.e. 2 for handwritten digit “2” and so on) needs to be converted to “one hot” format, as discussed previously. To do this, the tf.one_hot function can be utilized – the first argument to this function is the tensor you wish to convert, and the second argument is the number of distinct classes. This transforms the batch_y tensor from size (batch_size, 1) to (batch_size, 10).

The next line is important. Here the TensorFlow GradientTape API is introduced. In previous versions of TensorFlow a static graph of all the operations and variables was constructed. In this paradigm, the gradients that were required to be calculated could be determined by reading from the graph structure. However, in Eager mode, all tensor calculations are performed on the fly, and TensorFlow doesn’t know which variables and operations you are interested in calculating gradients for. The Gradient Tape API is the solution for this. Whatever variables and operations you wish to calculate gradients over you supply to the “with GradientTape() as tape:” context manager. In a neural network, this involves all the variables and operations involved in the feed-forward pass through your network, along with the evaluation of the loss function. Note that if you call a function within the gradient tape context, all the operations performed within that function (and any further nested functions), will be captured for gradient calculation as required.

As can be observed in the code above, the feed forward pass and the loss function evaluation are encapsulated in the functions which were explained earlier: nn_model and loss_fn. By executing these functions within the gradient tape context manager, TensorFlow knows to keep track of all the variables and operation outcomes to ensure they are ready for gradient computations. Following the function calls nn_model and loss_fn within the gradient tape context, we have the place where the gradients of the neural network are calculated.

Here, the gradient tape is accessed via its name (tape in this example) and the gradient function is called tape.gradient(). The first argument to this function is the dependent variable of the differentiation, and the second argument is the independent variable/s. In other words, if we were trying to calculate the derivative dy/dx, the first argument would be y and the second would be x for this function.  In the context of a neural network, we are trying to calculate dL/dw and dL/db where L is the loss, w represents the weights and b the weights of the bias connections. Therefore, in the code above, the reader can observe that the first argument is the loss output from loss_fn and the second argument is a list of all the weight and bias variables through-out the simple neural network.

The next line is where these gradients are zipped together with the weight and bias variables and passed to the optimizer to perform the gradient descent step. This is executed easily using the optimizer’s apply_gradients() function.

The line following this is the accumulation of the average loss within the epoch. This constitutes the inner-epoch training loop. In the outer epoch training loop, after each epoch of training, the accuracy of the model on the test set is evaluated.

To determine the accuracy, first the test set images are passed through the neural network model using nn_model. This returns the logits from the model (the un-activated outputs from the last layer). The “prediction” of the model is then calculated from these logits – whatever output node has the highest logits value, this constitutes the digit prediction of the model. To determine what the highest logit value is for each test image, we can use the tf.argmax() function. This function mimics the numpy argmax() function, which returns the index of the highest value in an array/tensor. The logits output from the model in this case will be of the following dimensions: (test_set_size, 10) – we want the argmax function to find the maximum in each of the “column” dimensions i.e. across the 10 output nodes. The “row” dimension corresponds to axis=0, and the column dimension corresponds to axis=1. Therefore, supplying the axis=1 argument to tf.argmax() function creates (test_set_size, 1) integer predictions.

In the following line, these max_idxs are converted to a numpy array (using .numpy()) and asserted to be equal to the test labels (also integers – you will recall that we did not convert the test labels to a one-hot format). Where the labels are equal, this will return a “true” value, which is equivalent to an integer of 1 in numpy, or alternatively a “false” / 0 value. By summing up the results of these assertions, we obtain the number of correct predictions. Dividing this by the total size of the test set, the test set accuracy is obtained.

Note: if some of these explanations aren’t immediately clear, it is a good idea to jump over to the code supplied for this chapter and running it within a standard Python development environment. Insert a breakpoint in the code that you want to examine more closely – you can then inspect all the tensor sizes, convert them to numpy arrays, apply operations on the fly and so on. This is all possible within TensorFlow 2 now that the default operating paradigm is Eager execution.

The epoch number, average loss and accuracy are then printed, so one can observe the progress of the training. The average loss should be decreasing on average after every epoch – if it is not, something is going wrong with the network, or the learning has stagnated. Therefore, it is an important variable to monitor. On running this code, something like the following output should be observed:

Epoch: 1, cost=0.317, test set accuracy=94.350%

Epoch: 2, cost=0.124, test set accuracy=95.940%

Epoch: 3, cost=0.085, test set accuracy=97.070%

Epoch: 4, cost=0.065, test set accuracy=97.570%

Epoch: 5, cost=0.052, test set accuracy=97.630%

Epoch: 6, cost=0.048, test set accuracy=97.620%

Epoch: 7, cost=0.037, test set accuracy=97.770%

Epoch: 8, cost=0.032, test set accuracy=97.630%

Epoch: 9, cost=0.027, test set accuracy=97.950%

Epoch: 10, cost=0.022, test set accuracy=98.000%

Training complete!

As can be observed, the loss declines monotonically, and the test set accuracy steadily increases. This shows that the model is training correctly. It is also possible to visualize the training progress using TensorBoard, as shown below:

TensorFlow tutorial - TensorBoard accuracy plot

TensorBoard plot of the increase in accuracy over 10 epochs

I hope this tutorial was instructive and helps get you going on the TensorFlow journey.  Just a reminder, you can check out the code for this post here.  I’ve also written an article that shows you how to build more complex neural networks such as convolution neural networks, recurrent neural networks, and Word2Vec natural language models in TensorFlow.  You also might want to check out a higher level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

Have fun!

The post Python TensorFlow Tutorial – Build a Neural Network appeared first on Adventures in Machine Learning.