Deep Learning Tech Blog

Reinforcement Learning Research

Kevin Mandich

2018.04.23

Motivation

Reinforcement learning (RL) is a field of machine learning experiencing rapid change. At Incubit, part of our time is spent keeping up to date with the latest research so that we can deliver the best possible AI solutions to our customers. RL applications are not yet as prevalent as other areas of machine learning. However, the skills acquired while solving these problems are not only invaluable to have as an AI engineer, but also overlap heavily with these other areas.

We chose to recreate some of the recent results produced for Atari games, both because it is fascinating and technically challenging, and because open-source simulations are available. The results shown are only for Space Invaders, but the agent trained on this game also performs well on other Atari games.

Deep Q Network

The agent used is based upon a deep Q network architecture. This approach utilizes several powerful methods which allow the agent to choose an action given a raw image:

Convolutional filtering to create a vector representation of features present in an image
Combination of consecutive frames
An intelligent replay mechanism which utilizes important memories to accelerate training
A value iteration method (Q-learning) to map states to future rewards
Separate determination of value and advantage within the network
Use of separate networks to choose an action and to generate target Q values for that action

Architecture

In general, a Q value is a function of the value at a specific state and the advantage to be gained by following each of the possible actions. To obtain a state representation, we have to reduce the dimensionality of the input data. Each input is a set of 4 frames of size 210 x 160 pixels with 3 color channels. With 256 possible values per color channel, this gives a state space of roughly 10^971002 distinct inputs. We make this problem tractable by using a 4-layer convolutional neural network, which reduces this input space to a vector of size 512, a much more manageable number.
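The size of this state space can be checked with a few lines of arithmetic. The sketch below assumes that each of the 3 color channels independently takes one of 256 values; the variable names are illustrative. Under that assumption the exponent comes out to roughly 971,000, in line with the figure quoted above.

```python
import math

# Input dimensions taken from the text: 4 stacked frames of 210 x 160 RGB pixels.
frames, height, width, channels = 4, 210, 160, 3
levels = 256  # assumed number of possible values per color channel

# Number of independent 8-bit values that make up one input state.
values_per_state = frames * height * width * channels

# The number of distinct inputs is levels ** values_per_state, which is far too
# large to print, so we work with its base-10 logarithm instead.
log10_states = values_per_state * math.log10(levels)

print("values per state:", values_per_state)
print("state space is about 10^%d" % math.floor(log10_states))
```

For comparison, the 512-element vector produced by the convolutional layers is almost 800 times smaller than the 403,200 raw values going in.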

Multiple frames are utilized because it is generally not possible to determine the velocity of moving objects from a single frame. See two consecutive frames below:

By running multiple frames through the network, we can obtain a state representation not only of the current frame, but also of the velocity and acceleration of objects within the frame.

Here is a representation of the networks used to generate actions from a batch of consecutive frames:

The output of the last convolutional layer is fully connected to a vector of size 512. This vector is split down the middle, and each resulting vector of size 256 is multiplied by a weight matrix to obtain the value and advantage functions. These are combined to obtain the final Q-value vector of length 6, one entry per action. The action taken by the agent corresponds to the maximum Q-value in this output vector.
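As a rough illustration of how the value and advantage streams might be combined, here is a minimal sketch in plain Python. It assumes the mean-subtracted combination commonly used in dueling networks, Q(s, a) = V(s) + A(s, a) - mean(A); the post does not state which combination was actually used, and the function names and numbers are made up for the example.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Assumption: Q(s, a) = V(s) + A(s, a) - mean(A(s, .)), the mean-subtracted
    form commonly used in dueling networks.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + a - mean_advantage for a in advantages]


def greedy_action(q_values):
    """Return the index of the largest Q value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Made-up outputs of the two streams for the 6 possible actions.
value = 2.0
advantages = [0.1, 0.4, -0.2, 0.9, 0.0, -0.6]

q_values = dueling_q_values(value, advantages)
action = greedy_action(q_values)
print(q_values, action)
```

Subtracting the mean advantage keeps the two streams identifiable: the average of the Q values equals the state value, and the advantages only express how each action compares to the others.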

A second network, the Target network, is introduced for training stability. At each training step, the Main network chooses the action, and the Target network generates the target Q value for that action. The Main network is trained at every training step by minimizing the loss between the generated and target Q values, and the Target network is updated with the weights of the Main network every 500 steps.
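The division of labor between the two networks can be sketched for a single transition as follows. This is an illustrative sketch only: the discount factor `gamma` is an assumed value that the post does not give, and the Q-value lists stand in for real network outputs.

```python
def double_dqn_target(reward, done, next_q_main, next_q_target, gamma=0.99):
    """Target Q value for one transition.

    next_q_main / next_q_target: Q values for the next state as produced by
    the Main and Target networks. gamma is an assumed discount factor.
    """
    if done:
        return reward  # no future reward after a terminal state
    # The Main network picks the action...
    best_action = max(range(len(next_q_main)), key=lambda i: next_q_main[i])
    # ...and the Target network supplies the value of that action.
    return reward + gamma * next_q_target[best_action]


# Made-up example: the Main network prefers action 1, so the Target network's
# estimate for action 1 is used, even though its own maximum is at action 2.
target = double_dqn_target(
    reward=1.0,
    done=False,
    next_q_main=[0.2, 1.5, 0.7],
    next_q_target=[0.3, 1.0, 2.0],
)
print(target)
```

Decoupling action selection from action evaluation in this way is intended to reduce the overestimation that occurs when a single network both picks and scores the best action.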

Experience Replay

An experience buffer along with a prioritized replay scheme was used to accelerate training. The buffer holds a maximum of 100,000 individual game frames, along with the action taken and the reward realized at each frame. During a network training step, a batch of replay examples is sampled from the buffer according to a priority value assigned to each entry.

This value is a function of the difference between the Q values predicted by the network at time t-1 and the Q values predicted at time t. This is a measure of how “surprising” the transition is to the network. More surprising transitions are more likely to be chosen during the network training step.
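A minimal sketch of priority-proportional sampling is shown below. The exponent `alpha` and the small constant `eps` are assumed values, not taken from the post, and the "surprise" numbers are made up; a real buffer would also store the frames, actions, and rewards alongside each priority.

```python
import random


def priorities_from_errors(errors, alpha=0.6, eps=1e-3):
    """Turn per-transition 'surprise' values into sampling priorities.

    alpha and eps are assumed values: alpha controls how strongly surprise
    skews sampling, and eps keeps zero-error transitions reachable.
    """
    return [(abs(e) + eps) ** alpha for e in errors]


def sample_indices(priorities, batch_size, rng=random):
    """Sample buffer indices (with replacement) proportionally to priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)


errors = [0.0, 0.1, 2.0, 0.5]  # made-up surprise values for 4 stored transitions
priorities = priorities_from_errors(errors)

rng = random.Random(0)  # seeded only to make the example repeatable
batch = sample_indices(priorities, batch_size=32, rng=rng)
print(batch)
```

With these numbers the third transition has the largest priority, so it should appear in sampled batches more often than the others, while the unsurprising first transition is still drawn occasionally.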

Exploration

Exploration was achieved by using a simple random strategy: choose a random action with probability ε (epsilon-greedy). The value of ε was reduced linearly from 1.0 to 0.025 over 100,000 frames. Maintaining this minimum exploration helps to prevent the agent from settling on a sub-optimal policy. For Space Invaders, it is also useful for learning correct behavior near the end of a stage, where there is often only one alien ship left and it moves quickly. In this scenario, it is useful to learn both correct and incorrect behaviors (e.g. shooting and killing the alien vs. hesitating and letting it touch down).
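The exploration schedule described above can be sketched in a few lines. The start value, end value, and decay length come from the text; the function names are illustrative, and holding the exploration rate at its minimum after the decay period is how we read "maintaining this minimum exploration".

```python
import random


def epsilon_at(frame, start=1.0, end=0.025, decay_frames=100_000):
    """Linearly anneal the exploration rate from start to end, then hold it."""
    fraction = min(max(frame, 0) / decay_frames, 1.0)
    return start + fraction * (end - start)


def choose_action(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


for frame in (0, 50_000, 100_000, 1_000_000):
    print(frame, epsilon_at(frame))
```

Early in training nearly every action is random, which fills the replay buffer with varied experience; after 100,000 frames about 1 action in 40 is still random.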

Training

Training steps were performed every 8 time steps on a batch of 32 frames sampled from the prioritized replay buffer. The Target network was fully updated with the weights from the Main network every 500 time steps. Training was arbitrarily performed for 100,000,000 frames, which took about 32 hours of runtime on a Titan X GPU.
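The cadence of training updates and Target network updates can be pictured as a simple loop. The sketch below is a skeleton under stated assumptions: `train_step` and `sync_target` are hypothetical callbacks standing in for the real gradient update and weight copy, and only the two intervals come from the text.

```python
def run_schedule(total_steps, train_step, sync_target,
                 train_every=8, sync_every=500):
    """Call train_step and sync_target at the intervals given in the text."""
    for step in range(1, total_steps + 1):
        # (the agent would act in the game and store the transition here)
        if step % train_every == 0:
            train_step(step)    # sample a batch of 32 and update the Main network
        if step % sync_every == 0:
            sync_target(step)   # copy the Main weights into the Target network


# Record when each event fires over a short run of 4,000 time steps.
trained_at, synced_at = [], []
run_schedule(4000, trained_at.append, synced_at.append)
print(len(trained_at), len(synced_at))
```

Over 4,000 time steps this gives 500 training updates and 8 Target network updates, so the Target network stays fixed for about 62 training updates at a time.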

Results

To get an idea of what the network is doing, we plotted the gameplay along with the following information at every frame:

Value
Advantage for each of the 6 actions
A running percentage of the actions taken by the model
A handful of the activations at each of the layers in the convolutional network

Here’s a video showing an agent, which was trained for 15 hours, playing a game of Space Invaders:

Here are a few key takeaways from the visualizations:

The value is a function of the number of enemies on the map, the missiles fired by the avatar, and the location of the avatar. If the avatar moves beneath a shield, for example, the value decreases.
Most actions constitute a move, a missile firing, or both. This means that the agent rarely spends any time simply waiting.
Oftentimes the advantage of a certain action is only slightly higher than others. In the screenshot above, for example, the highest advantage is given to Move Left & Fire Missile. However, the constituent actions are also considered advantageous to the agent.
The convolutional layers show glimpses into some of the features that the agent sees. Missiles being fired are visible as activated neurons, as are the enemy ships and the avatar itself.

This has been a fascinating and challenging project, and it has been delightful to catch a glimpse of what is happening inside the trained network. In the future we hope to continue training AI agents to play Atari games, as well as other benchmark simulations.

Other blog

2018.07.31

Deep Learning for Image Segmentation of Tomato Plants

Motivation

At Incubit we are always looking for new ways to apply AI and machine learning technology to interesting problems. A recent project has found us applying image segmentation to the problem of identifying parts of a tomato plant. Tomato pruning is a technique often used to increase the yield of tomato plants by removing small, non-tomato-blooming branches from the plant. Here we describe a method to determine the location of prunable tomato branches, as well as critical parts of the tomato plant which should not be touched, such as primary trunks, branches supporting tomatoes, and the tomatoes themselves.

Figure 1: Pruning non-critical branches from a tomato plant. Photo courtesy of gardeningknowhow.com.

In this project we apply image segmentation techniques to locate objects of interest. The goal of this analysis method is to locate different segments, or contiguous sets of pixels, within an image which denote some meaningful entity within the image. There are various ways to accomplish this using both computer vision and model-based approaches. We opted for the supervised model-based approach, in the hope of obtaining both higher accuracy and greater generalization to new images.

Obtaining labeled data

The first step in developing an image model is obtaining labeled training data. For this task we used Incubit’s AI Platform (IAP) to create segmentation labels. Figure 2 shows an example of how we annotated an image to show segments of four classes of interest: main trunk, sucker branch, tomato branch, and tomato.

Figure 2: On left – raw image. On right – annotated image. Annotations are: red = main trunk, blue = sucker branch, purple = tomato branch, yellow = tomato.

These annotations were stored as JSON files and used to create segmentation masks. Figure 3 shows an example of these masks, created from a crop of the above annotations, which were fed directly into the model as labels:

Figure 3: Masks created from the annotated data.
The white pixels in each mask act as the target for the segmentation of each respective class.

The model

We based our model architecture on SegNet, a well-known deep neural network which excels in image segmentation applications. Figure 4 shows a simplified overview of the architecture used.

A high-level summary of the architecture is this: the original image is passed through a number of encoding blocks, each of which consists of several convolutional layers, batch normalization, and ReLU activation layers, followed by a pooling layer. The reduced features are then passed through a series of upsampling layers. Loss is computed from the cross entropy between the sigmoid output of the final convolutional layer and the segmentation targets (labels).

Training

Training was performed with a constant learning rate of 0.00001 until there was no improvement in the test error rate for 10 consecutive epochs. A random parameter search yielded the following combination of optimal hyperparameters: 5 encoding and decoding blocks, 32 initial filters, a dropout rate of 0.25, and independent pooling indexes between the encoding and decoding layers.

Image augmentation was used both to increase the size of the training pool and to help the model generalize. A combination of flips, crops, random noise, Gaussian blur, fine and coarse dropout, perspective transformations, piecewise affines, rotations, shears, and elastic transformations from the imgaug library was used to this end.

Figure 5 shows an example of an annotated output frame produced by the trained model.

Figure 5: Annotated output showing different segmentation classes of a tomato plant.

Post-Processing

One of the expected outcomes of this project is the ability to automatically locate the origin and direction of branches growing from the main trunk. To do this, we can utilize the segmentation outputs for the trunk and branch classes.
We wrote an algorithm to detect branches which are attached to the main trunk and to extract this information. The steps are:

Use the connected components algorithm to identify individual branches and trunks within the image.
For each possible pair of branch and trunk segments which partially overlap (full overlaps represent non-connected branch/trunk pairs), record these as connected pairs.
For each connected pair, mark the base of the branch as the centroid of the overlapped region. This acts as the starting point of the branch direction vector.
Define the end point of the branch direction vector as the halfway point of the shortest line connecting the least-squares fit line of the branch segment and the centroid of the branch segment.

Figure 6 shows an example of a branch direction vector drawn from the base of the branch, at the trunk, to the midpoint of the branch:

Figure 6: Drawing a branch direction vector. Branch vectors, along with the segmented classes, are superimposed on the original raw image.

Result

Here is a video showing the results of this analysis on a tomato garden. Visible are the different class segments and the branch direction vectors.

We look forward to applying this technology to other interesting and novel use cases.
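The branch-base step of the post-processing algorithm above can be sketched as follows. This is a simplified illustration, not the code used in the project: each segment is assumed to be a set of (row, column) pixel coordinates already produced by a connected-components pass, and a "full overlap" is interpreted here as one segment lying entirely inside the other. The function names are hypothetical.

```python
def centroid(pixels):
    """Mean (row, col) of a non-empty collection of pixel coordinates."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def branch_base(branch_pixels, trunk_pixels):
    """Return the base point of a branch, or None if it is not attached.

    Assumption: a pair counts as connected only when the overlap is partial,
    i.e. non-empty but not covering either segment completely.
    """
    overlap = branch_pixels & trunk_pixels
    if not overlap or overlap == branch_pixels or overlap == trunk_pixels:
        return None
    return centroid(overlap)


# Toy example: a vertical trunk two pixels wide and a horizontal branch
# that starts inside the trunk and extends to the right.
trunk = {(r, c) for r in range(10) for c in (4, 5)}
branch = {(5, c) for c in range(5, 9)}

base = branch_base(branch, trunk)
print(base)
```

In the toy example the two segments share a single pixel, so that pixel is reported as the starting point of the branch direction vector.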