GSoC Summary

A look back at the summer...

The GSoC coding period is coming to an end and it is time to reflect on what has been achieved, what couldn't be done and what can still be done in the future. My project was aimed at enriching the model zoo of Flux.jl with additions in the form of GANs, reinforcement learning agents and an image-captioning network. The following models have been implemented as part of the coding period:

  1. CycleGAN
  2. pix2pix
  3. SRGAN
  4. Proximal Policy Optimization
  5. Trust Region Policy Optimization
  6. Deep Recurrent Q Learning
  7. Neural Image Captioning

These were exactly the models listed in the proposal as well. Let's go over them one by one, explaining the work and analysing what was achieved and what was lacking in each period.

The Generative Adversarial Networks

Clubbed under these are the models CycleGAN, pix2pix and Super-Resolution-GAN. These have been added to the model-zoo in the following PR: #157

CycleGAN and pix2pix are image-to-image translation networks that map between two image domains. pix2pix learns the translation in one direction at a time during training and works with paired labels. Some cool instances of it at work are colorising sketches or generating buildings from segmentation maps. CycleGAN learns a combined translation between both domains during training; the improvement here is that it does not need any paired labels. pix2pix has a generator and a discriminator and, apart from the normal adversarial loss, uses an L1 constraint to improve the quality of the images. CycleGAN has a generator and a discriminator for each of the domains. The complete tutorial was covered in a previous post and can be found here.
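
For reference, the objective pix2pix optimises (in the notation of the paper) combines the conditional adversarial loss with the L1 term, weighted by a factor λ:

\[
\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x,z))\big)\big]
\]
\[
G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G,D) + \lambda\, \mathbb{E}_{x,y,z}\big[\lVert y - G(x,z)\rVert_1\big]
\]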

Super-Resolution-GAN is another cool application of GANs. It basically takes a downgraded version of a high-quality image and learns to map it back to the high resolution. It uses an interesting Pixel-Shuffle layer which permutes a feature map along channels and resizes it to obtain an upsampled version of the output. As this layer was not a part of Flux, it was used manually. The generator extracts the features from the low-resolution image and uses the pixel-shuffle operation to obtain the high-resolution output.
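
For the curious, here is a minimal sketch of one way such an operation can be written in plain Julia (a hypothetical pixel_shuffle helper for illustration, not necessarily the code from the PR):

# Rearranges a (W, H, C*r^2, N) feature map into a (W*r, H*r, C, N) one,
# i.e. trades channels for spatial resolution (the "pixel shuffle" used in SRGAN/ESPCN).
function pixel_shuffle(x::AbstractArray{T,4}, r::Int) where T
    w, h, c, n = size(x)
    cout = c ÷ (r * r)
    y = reshape(x, w, h, r, r, cout, n)        # split the channel dimension into (r, r, cout)
    y = permutedims(y, (3, 1, 4, 2, 5, 6))     # interleave the r factors with the spatial dims
    return reshape(y, w * r, h * r, cout, n)
end

x = rand(Float32, 16, 16, 4 * 3, 1)            # 3 output channels, upscaling factor r = 2
size(pixel_shuffle(x, 2))                      # (32, 32, 3, 1)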

pix2pix_arch

The pix2pix architecture (Image taken from the paper)

cyclegan_arch

The CycleGAN architecture

srgan_arch

The SRGAN architecture (Image taken from the paper)

Sample Results

pix2pix_1
Sample output of `pix2pix` trained on `facades` dataset

pix2pix_2
Sample output of `pix2pix` trained on `facades` dataset

Challenges faced

During the initial phase of training, the networks were taking too long to train when compared with their PyTorch and TensorFlow counterparts. As discussed with my mentors, the suggested hack was to integrate #335 to get the speedup. But due to time constraints and a plethora of work left to do, the problem was circumvented by making the networks smaller. The smaller networks are known to give decent results in other frameworks, so the reduction could be justified. The final status of training was that the pix2pix model was overfitting, and manually tuning the hyperparameters wasn't that successful. So, all in all, the models haven't been trained to completion and the tuning of hyperparameters is still left.

What has been achieved

  • Implemented the models and sent a PR
  • Pixel-Shuffle layer [#112]

What hasn't been achieved

  • Training the models successfully

Future Work

  • Tune the hyper-parameters and try training on a small dataset first
  • Experiment with architectures from stable implementations in other libraries and use the hyperparameters therein

The Reinforcement Learning Baselines

As a part of my proposal, I had to implement Trust Region Policy Optimization, Proximal Policy Optimization and Deep Recurrent Q-Networks. This was the part that appealed to me the most, because the benefits were two-fold:

  • The model zoo already had quite a few computer-vision models, but not many diverse models in terms of reinforcement learning.
  • RL has shifted focus to new state-of-the-art policy gradient methods like TRPO and PPO, and adding them would attract many people to do RL in Julia.

Realising that a general API was now required for the policy gradient methods, which TRPO and PPO heavily share, I decided to create the package along the lines of openai-baselines.

The code is maintained in this repository. The code for the recurrent Q-network can be found in the drqn branch.

A PR regarding the same has been added here.

A Peek Into The API

Before talking about the models, let’s look at what the structure of the package is like.

In general, with the type of environments currently available in Gym.jl, the action space can either be Discrete or Continuous. We define a CategoricalPolicy structure to handle Discrete action spaces and a DiagonalGaussianPolicy to handle Continuous action spaces. The functions for entropy, KL divergence, etc. are a part of each policy and are defined in policy.jl in the common/ directory.

In order to create a policy, an environment wrapper instance is created first. It takes in just the environment name and accordingly obtains all the information, such as STATE_SIZE and ACTION_SIZE, from the environment. This is used as an abstraction for initializing the policy.

# Define environment wrapper instance
env_wrap = EnvWrap(ENV_NAME)

# Define policy 
# The policy can currently either be `Categorical` or `DiagonalGaussian`
policy = get_policy(env_wrap) # Returns the policy based on the environment type

a = action(policy,state) # Sample an action in the current state

# Get log-probability of current action
logp = log_prob(policy,states,actions)

# Extract out the neural network parameters for the policy
policy_params = get_policy_params(policy)
value_params = get_value_params(policy)

# I/O
save_policy(policy,"PATH_TO_SAVE")
policy = load_policy(env_wrap,"LOAD_PATH")

Similarly, other utilities such as KL divergence and entropy are defined in policy.jl.
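
For the DiagonalGaussianPolicy, these are presumably the standard closed-form expressions for a Gaussian with mean μ and diagonal covariance diag(σ²), reproduced here for reference:

\[
H\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\big) = \sum_i \tfrac{1}{2}\log\!\big(2\pi e\,\sigma_i^2\big)
\]
\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big)
= \sum_i \left[ \log\frac{\sigma_{2,i}}{\sigma_{1,i}} + \frac{\sigma_{1,i}^2 + (\mu_{1,i}-\mu_{2,i})^2}{2\,\sigma_{2,i}^2} - \frac{1}{2} \right]
\]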

This is done to abstract out the policy definition and allow the code to be easily extended in the future with newer environments and models. We also define a Buffer structure to log different variables and their values. For both TRPO and PPO, a file called rollout.jl handles all the code for running agents in parallel, collecting trajectories of experience and processing those trajectories to extract all the relevant information from them. With the basic structure covered, let us move on to the model implementations.

eb = Buffer() # Create an episode buffer
# This will be used to log different variables as time progresses
register(eb,"states")
register(eb,"actions")
register(eb,"rewards")
register(eb,"next_states")
register(eb,"dones")
register(eb,"returns")
register(eb,"advantages")
register(eb,"log_probs")
register(eb,"kl_params")

# Clear and reset all variables in buffer
clear(eb)

# Access the data stored for a particular key in the buffer 
eb.exp_dict["key"] = ... # You can directly modify this variable
# OR
add(eb,"rewards",1.0) # Assuming the value for the key is an array, this will push the value into it

As defined in rollout.jl:

function collect_and_process_rollouts(policy,episode_buffer::Buffer,num_steps::Int,stats_buffer::Buffer)
"""
This function collects rollouts across multiple parallel actors and processes them.
It populates the values of the different variables in the `episode_buffer`.

episode_buffer : Buffer to collect and store information along trajectories
num_steps : Number of timesteps to collect in a trajectory
stats_buffer : Buffer to log variables for analysis
"""
    # ... rollout collection and processing ...
end

TRPO tries to solve the problem of policy updates by not letting the policy change by a large margin at each step, while at the same time giving a theoretical guarantee on the improvement in the total expected return at each update. It does this by solving a constrained optimization problem, making approximations to the constraint and to a formulated surrogate loss function. The implementation involves some nifty details like Hessian-vector-product computation and optimization using conjugate gradients. It took quite a lot of time to digest these concepts, and after reading through stable pieces of code and gluing it all up, the model was built in pure Julia.
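
Concretely, the constrained problem TRPO solves (in the notation of the paper) is:

\[
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big) \right] \le \delta
\]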

PPO simplifies the objective of TRPO by allowing small violations of the KL-divergence constraint in exchange for a much simpler policy update. It is easier to code up and in practice does quite well in comparison to TRPO.
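
The commonly used clipped surrogate objective it maximises (from the PPO paper) is:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \big) \Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]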

With the above API in place, the ground-up implementation of PPO is as simple as:

# Clear episode buffer
clear(episode_buffer)

# Collect information along trajectories
collect_and_process_rollouts(policy,episode_buffer,EPISODE_LENGTH,stats_buffer)

# Extract mini-batch information
i = minibatch_idx # Stores the batch indices for a particular mini-batch
mb_states = episode_buffer.exp_dict["states"][:,i] 
mb_actions = episode_buffer.exp_dict["actions"][:,i] 
mb_advantages = episode_buffer.exp_dict["advantages"][:,i] 
mb_returns = episode_buffer.exp_dict["returns"][:,i] 
mb_log_probs = episode_buffer.exp_dict["log_probs"][:,i]
mb_kl_vars = episode_buffer.exp_dict["kl_params"][i]

# Update the policy and value function parameters
ppo_update(policy,mb_states,mb_actions,mb_advantages,mb_returns,mb_log_probs,mb_kl_vars)

# Save the policy
save_policy(policy,"PATH_TO_SAVE")

Deep-Recurrent-Q-Networks are an extension of the work done last year by Tejan Karmali [ @tejank10 ]. The idea is basically to add an LSTM layer to the Q-network so that it processes a sequence of timesteps of experience rather than just a single timestep. The advantage of these models is that they can handle Partially Observable Markov Decision Processes as well.
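
To make that concrete, here is a minimal sketch of what such a Q-network can look like in Flux (the layer sizes and variable names are hypothetical, not the code from the drqn branch, and the constructors assume the Flux API of the time):

using Flux

STATE_SIZE, ACTION_SIZE = 4, 2          # hypothetical environment dimensions

# The LSTM carries hidden state across timesteps, so the Q-values can depend
# on the history of observations, not just the last one.
q_network = Chain(
    Dense(STATE_SIZE, 64, relu),
    LSTM(64, 64),                       # hidden state persists between calls
    Dense(64, ACTION_SIZE))             # one Q-value per action

Flux.reset!(q_network)                  # reset the hidden state at the start of an episode
episode = [rand(Float32, STATE_SIZE) for _ in 1:10]   # dummy observation sequence
q_values = [q_network(s) for s in episode]            # feed observations one timestep at a time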

Alright, show me the results!

Cartpole Pendulum

What has been achieved

  • PPO and TRPO trained on Pendulum-v0 and CartPole-v0. [Note: TRPO wasn't trained on CartPole-v0, as nested AD is currently not supported for softmax, which is used for this environment in TRPO]
  • A package along the lines of openai-baselines for the model-zoo [#158]

What hasn't been achieved

  • Proper convergence of training on Deep-Recurrent-Q-Networks

Future work

With the Arcade Learning Environment being set up, it would be good to try out the models on those environments and obtain a few benchmarks with Flux.

The Image Captioning Network

Coming to the final leg of my proposal, the idea was to implement the paper Show and Tell.

The idea is pretty straightforward. Given an image, we extract features using a pre-trained CNN like VGG19 and cache them for efficiency. These features are used as the input to an LSTM, which then receives the subsequent caption words and learns to predict, sequence-wise, the next word in the caption. The training was done on the MSCOCO dataset. Overall, the results were average, and can be found in the Test.ipynb Jupyter notebook here.
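
To illustrate the structure, here is a minimal sketch of such a decoder in Flux (all sizes and names are hypothetical; this is not the actual training script):

using Flux

EMBED_SIZE, HIDDEN_SIZE, VOCAB_SIZE, FEATURE_SIZE = 256, 512, 10_000, 4096

embedding  = Dense(VOCAB_SIZE, EMBED_SIZE)        # word one-hot -> embedding
img_encode = Dense(FEATURE_SIZE, EMBED_SIZE)      # cached VGG19 features -> embedding space
lstm       = LSTM(EMBED_SIZE, HIDDEN_SIZE)
decode     = Dense(HIDDEN_SIZE, VOCAB_SIZE)       # hidden state -> next-word scores

function caption_logits(img_features, caption_onehots)
    Flux.reset!(lstm)
    lstm(img_encode(img_features))                          # step 0: condition the LSTM on the image
    [decode(lstm(embedding(w))) for w in caption_onehots]   # predict the next word at each step
end

features = rand(Float32, FEATURE_SIZE)                               # dummy cached CNN features
words    = [Flux.onehot(rand(1:VOCAB_SIZE), 1:VOCAB_SIZE) for _ in 1:5]  # dummy caption prefix
logits   = caption_logits(features, words)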

imgcap_arch

Neural Image Captioning Architecture (Image taken from the paper)

Let’s describe some images

Here is the output of the network on random images from the internet :

imgcap_1
A young man is walking his bicycle across the street

imgcap_2
A kitchen with a sink and a refrigerator

imgcap_3
A group of men with his cap next to a woman in a kitchen

What has been achieved

  • Model trained on 30000 captions of the MSCOCO dataset. The work is to be added to the model-zoo with reference to this PR.

Future work

  • Add attention to the captioning network
  • Train the network on a greater number of captions

The code and the results are available here.

Conclusion

Summarising the past three months, it was indeed a great learning experience. I got to learn by implementing actual code and training complex RL agents from scratch, something I had never done before. It was filled with reading papers, some advanced math and trying to put it all into code. The community was extremely helpful and supportive at all times, and I am really thankful to my mentors Dhairya Gandhi, Elliot Saba and Deepak Suresh, fellow GSoC/JSoC students, especially Manjunat Bhat, Ayush Kaushal and Tejan Karmali, and the entire Julia community for providing me with an invaluable experience over the summer. I look forward to contributing to the community in the times to come and to doing justice to my love for open-source.