Deep Q-Learning and TensorFlow Agents

Erkin Polat
10 min read · Apr 11, 2021

Reinforcement Learning

Before we start diving into the implementations, let’s go over some of the concepts.

Reinforcement learning is an area of machine learning where an agent learns a policy by interacting with its environment and maximizing the rewards it collects. The agent observes an input state (or input pattern), produces an output signal (in the form of an action or a control signal), and then receives a scalar "reward" or "reinforcement" feedback signal. This feedback is used to update the policy or behavior, and the whole process is repeated until the agent performs better (or we converge to an optimal policy). There are different reinforcement learning methods, such as SARSA, Q-Learning, Deep Q-Learning and Actor-Critic methods. There is also a method called Imitation Learning, where the agent learns by observing a human expert perform the task instead of exploring the environment by itself (check out this article by Luis Gonzalez, where he explains imitation learning and shows how to implement it in Unity).

In this article, we will go over deep Q-learning on a Gym environment using TensorFlow Agents, but you can also use the same framework for different methods and other environments.

Q-Learning

Q-Learning is a type of reinforcement learning where the agent operates in an environment with states, rewards and actions. It is a model-free method, meaning that the agent doesn't try to learn an underlying mathematical model or probability distribution of the environment. Instead, it tries to construct an optimal policy by directly interacting with the environment. In Q-learning the number of possible states and actions is finite, so if we are working with continuous states, we need to find a way to discretize the environment.

In Q-learning we assign a q-value to the quality of a particular action at a given state [Q(s, a)]. These q-values reflect our current estimate of the sum of future rewards if that action is taken.

We store the q-values in a q-table, which is a table of dimensions s x a (states x actions). By following these q-values, the agent tries to take an optimal path and collect the highest rewards. The table itself represents the agent's policy in the current environment.

As the agent explores the environment more and more, we need to update the q-values stored in the q-table. To do this we calculate the temporal difference, which represents how much a q-value should change after each action. The q-values are updated after every step.

The temporal difference equation looks like this:

TD(s_t, a_t) = r_t + gamma x max(Q(s_t+1, a)) - Q(s_t, a_t)

a_t -> action taken in the previous state s_t

r_t -> reward received for that action

gamma -> discount factor that determines how much we value future rewards

max(Q(s_t+1, a)) -> largest q-value available for any action in the current state

Q(s_t, a_t) -> current q-value for the action taken in the previous state

After finding the temporal difference, we update the q-values using the Bellman equation:

Qnew(s_t, a_t) = Qold(s_t, a_t) + learning_rate x TD(s_t, a_t)

With this set of instructions, you could build a q-learning agent.
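
To make the update rule concrete, here is a minimal sketch of a single tabular Q-learning step in Python. The environment interface (a step() method returning the next state, reward and a done flag), the table sizes and the hyperparameter values are assumptions for illustration only; they are not part of the TF-Agents code used later.

import numpy as np

n_states, n_actions = 16, 4      # assumed sizes for a small, discrete environment
learning_rate = 0.1
gamma = 0.99                     # discount factor
epsilon = 0.1                    # exploration rate

q_table = np.zeros((n_states, n_actions))  # the s x a q-table

def q_learning_step(env, state):
  # Epsilon-greedy action selection: mostly exploit the table, sometimes explore.
  if np.random.rand() < epsilon:
    action = np.random.randint(n_actions)
  else:
    action = np.argmax(q_table[state])

  next_state, reward, done = env.step(action)  # assumed environment API

  # Temporal difference: r_t + gamma x max(Q(s_t+1, a)) - Q(s_t, a_t)
  td = reward + gamma * np.max(q_table[next_state]) - q_table[state, action]

  # Bellman update: Q_new(s_t, a_t) = Q_old(s_t, a_t) + learning_rate x TD(s_t, a_t)
  q_table[state, action] += learning_rate * td
  return next_state, done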

Deep Q-Learning

Even though Q-learning is very useful for solving smaller, usually discrete problems, it isn't always suitable for more complicated ones. In that case we need a mechanism to approximate the q-values, so we usually replace the q-table with a neural network (read more about the universal approximation theorem here). We compute the rewards very similarly to Q-learning and use them to optimize the network's weights so that they direct behavior. The neural network's input layer represents the state space and its output layer represents the action space. After training, we expect the network to converge to the optimal policy and the output layer to predict the best action for each state (even though this doesn't always happen).

Using this strategy also allows us to train agents on much more complicated games. We can take certain parameters from the game and use them as inputs, but we can also use the whole game screen as input and let convolutional layers detect the locations of the objects, which allows the network to account for possibilities we could not have parametrized in advance. With this, reinforcement learning models can learn much more complicated tasks, such as self-driving cars or playing complicated 3-D video games.

Replay Buffers

Another nice trick we can use with neural networks is experience replay: instead of feeding only the very last transition into the network, we store transitions in a replay buffer and train on batches sampled from it. This makes it clearer which states lead to which states and how the reward should be distributed within trajectories. We can also feed in the same trajectories multiple times by resampling from the replay buffer, which makes training more stable, and in cases where collecting data is costly it lets us revisit similar states multiple times.
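
As a rough illustration of the idea (TF-Agents ships its own replay buffer, which we use later), a replay buffer can be as simple as a fixed-size collection we sample mini-batches from. The ReplayBuffer class, its transition format and the sizes below are assumptions for illustration, not part of the TF-Agents API.

import random
from collections import deque

class ReplayBuffer:
  """A minimal buffer of (state, action, reward, next_state, done) tuples."""

  def __init__(self, max_length=100000):
    self.buffer = deque(maxlen=max_length)  # oldest transitions are dropped once full

  def add(self, transition):
    self.buffer.append(transition)

  def sample(self, batch_size=64):
    # Uniformly resample past transitions; the same transition can be reused across batches.
    return random.sample(self.buffer, batch_size)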

AI Gym

Now, before we go through the implementation, let's mention Gym. Gym is a package by OpenAI that provides multiple games suitable for reinforcement learning projects. With every environment you can get states, actions and rewards, and train an agent based on those. In this example we will take a look at the CartPole game, where you need to balance a pole on a cart. There are two possible actions, moving the cart either to the left or to the right, and there are 4 state variables: the position of the cart, the velocity of the cart, the angle of the pole and the angular velocity of the pole. The game screen looks like this:

What the game looks like

For this implementation, we will use the adaptation provided by TensorFlow Agents instead of the original Gym game, but they are essentially the same. You can also try this with other games available in the library, or adapt it to other tasks, since the methods generalize well.

To see the full code we will be going through, check this link.

Implementation with TensorFlow Agents

To train our agent we will use the code put together by the TensorFlow team. To see the complete code you can look at the official guide. Here I will do my best to illustrate what happens at each stage and how it relates to the concepts mentioned above.

To start, you need to install the following dependencies and create a visual display for the visualizations:

sudo apt-get install -y xvfb ffmpeg
pip install -q 'imageio==2.4.0'
pip install -q pyvirtualdisplay
pip install -q tf-agents
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common
# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

Next we will define the hyperparameters. num_iterations is the total number of times the weights will be updated. In the original code this was 20000, but I noticed that for this game the model's performance can actually start to deteriorate after training for too long, so I cut it in half. initial_collect_steps is the number of steps we collect before starting to train the network; after that, we collect a single step per iteration to add to the replay buffer. You can increase collect_steps_per_iteration as well, which tends to improve performance since more states are visited overall, but keep in mind that it also increases training time, because at every iteration we need to play the game for longer.

We set the maximum length of the replay buffer to 100000. Technically, with these hyperparameters we never reach this maximum, but if you train for longer or collect more steps per iteration, the buffer might start using too much memory, and with this limit the oldest observations are replaced by new ones as they become available. We set the batch size to 64: this is how many samples we draw from the replay buffer at each training step, and it determines how much a single observation affects the weight updates. The learning rate is set to 0.001, which is a reasonable value in this setting; too large a learning rate gives an unstable network, and we want to converge gradually but consistently. This learning rate plays a role very similar to the one in the Bellman update above, but this time it is used to update the network's weights.

num_iterations = 10000 # @param {type:"integer"}

initial_collect_steps = 100 # @param {type:"integer"}
collect_steps_per_iteration = 1 # @param {type:"integer"}
replay_buffer_max_length = 100000 # @param {type:"integer"}

batch_size = 64 # @param {type:"integer"}
learning_rate = 1e-3 # @param {type:"number"}
log_interval = 200 # @param {type:"integer"}

num_eval_episodes = 10 # @param {type:"integer"}
eval_interval = 1000 # @param {type:"integer"}

Then you go ahead and set up the environment. You can also use the same model with other games.

env_name = 'CartPole-v0'
env = suite_gym.load(env_name)
env.reset()
PIL.Image.fromarray(env.render())

Some things to keep in mind about the environment: env.step() advances the game by one step, and env.time_step_spec() gives us information about the game states. You can call env.time_step_spec().observation to get the shape of the observations and env.time_step_spec().reward to get details about the rewards. You can also call env.action_spec() to see what the actions look like.
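
For example, you can print these specs and step the environment once to see what a TimeStep contains (the exact bounds and values shown will depend on your Gym and TF-Agents versions):

print('Observation Spec:')
print(env.time_step_spec().observation)   # shape (4,): cart position/velocity, pole angle/angular velocity

print('Reward Spec:')
print(env.time_step_spec().reward)        # scalar reward per step

print('Action Spec:')
print(env.action_spec())                  # discrete action: 0 = push left, 1 = push right

# Step the environment with a fixed action to see a TimeStep.
time_step = env.reset()
print('Initial time step:', time_step)
time_step = env.step(np.array(1, dtype=np.int32))
print('Next time step:', time_step)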

Then we instantiate two environments, one for training and one for evaluation, and wrap them so that they return tensors instead of NumPy arrays, which works better with TensorFlow:

train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

Then we initialize a neural network. The key point here is that the output layer has the size of the action space, and we can put a bunch of dense layers before it. Using this code you can also tweak the architecture.

#The desired sizes of each layer
fc_layer_params = (100, 50)
#Getting the action space for the output layer
action_tensor_spec = tensor_spec.from_spec(env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])

Next, we create an optimizer and an instance of our agent.

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

We create a random policy to be used for comparison and for collecting the initial data:

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())

We use the average return to evaluate a policy. The return is the sum of rewards collected while running a policy in the environment for an episode, and after several episodes we can compute an average. The following code computes the average return:

def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]
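
As a quick sanity check, you can use this helper to get a baseline for the untrained random policy before any learning happens (the exact number will vary from run to run):

# Baseline: average return of the random policy over a few evaluation episodes.
compute_avg_return(eval_env, random_policy, num_eval_episodes)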

Now we create a replay buffer as explained above. We can do this easily in TensorFlow Agents with a single call:

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)

We can then create functions that collect data from the environment and put it into the replay buffer. Initially, we run a random policy for 100 steps (initial_collect_steps) to fill the buffer.

def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)

Then we need a way to access the collected data. The TensorFlow guide does this by converting the replay buffer into a dataset and iterating over it.

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)

# Create an iterator so we can pull batches from the dataset during training.
iterator = iter(dataset)

The following code trains the agent. At every iteration, we collect new steps, get a new sample from the replay buffer, calculate the loss and then update our weights to minimize the loss.

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

After training for 10000 steps, this is what the average returns look like:

As you can see, the agent's average return improves as training progresses.
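
If you want to reproduce the plot yourself, a short matplotlib snippet like the following, using the returns list and the eval_interval defined above, is enough (this is a sketch; adjust labels and styling as you like):

# Plot the average return recorded before training and at every evaluation interval.
iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.show()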

This is how the agent performs after training:

And this is how a random agent performs for comparison:
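
These performance clips can be generated by running each policy in the evaluation environment and recording the rendered frames. Here is a sketch along the lines of the official TF-Agents tutorial, using the imageio package imported earlier (the function name and filenames are just placeholders):

def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  # Write an .mp4 of the policy acting in the evaluation environment.
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())

# Trained agent vs. the random baseline.
create_policy_eval_video(agent.policy, "trained-agent")
create_policy_eval_video(random_policy, "random-agent")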

References:

https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial
