End of phase two

A tale of extensive training...

The second phase of the coding period was filled with an air of eagerness to get results out. As mentioned in the previous post, Proximal Policy Optimization was complete; however, the model did not fit well to the Pendulum-v0 environment, mainly due to incorrect hyper-parameters. Beyond that, Trust Region Policy Optimization was yet to be implemented. As I started writing code for it, I realised that the code written so far was not as modular as it had seemed, so I set out to make it more generic while proceeding with TRPO. I looked at the basic structure of the OpenAI baselines and started work on RL-baselines.jl.

I talked to my mentor, and he wanted a demo on the Pendulum-v0 environment delivered from my side. To put the state of the work at that time into perspective, this is how the trained model performed.

pendulum-before

Clearly the output is not very stable and hardly does justice to an algorithm as powerful as Proximal Policy Optimization.
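
For context, the quantity PPO actually optimises is the clipped surrogate objective, and the clipping range ϵ is one of the hyper-parameters that needed tuning. The snippet below is a minimal, illustrative sketch of that objective in Julia; the function and variable names are my own and not the ones used in RL-baselines.jl.

```julia
using Statistics: mean

# Clipped surrogate objective from the PPO paper:
#   L(θ) = E[ min(r(θ) * A, clip(r(θ), 1 - ϵ, 1 + ϵ) * A) ]
# `ratio` is π_θ(a|s) / π_θ_old(a|s) and `advantage` is the advantage estimate.
function clipped_surrogate(ratio, advantage; ϵ = 0.2)
    unclipped = ratio .* advantage
    clipped   = clamp.(ratio, 1 - ϵ, 1 + ϵ) .* advantage
    -mean(min.(unclipped, clipped))   # negated so the optimiser can minimise it
end
```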

Having completed the code, I spent quite a bit of time tuning the hyper-parameters, and that did take some time! But eventually things fell into place, and I am happy that the effort finally paid off. It was a great learning experience, and my first experience of writing code from scratch and training a reinforcement learning model. Here is a glimpse of what the final output looks like.

pendulum-stable

It is fun to see that the learned policy swings the pendulum up by first building momentum with alternating swings and finally balances it in place.

Here is the final trained output on the CartPole-v0 environment.

cartpole-stable

Once these were done, I proceeded to read the paper on Deep Recurrent Q-Networks and write code for it. As of now, the code is complete; training and debugging of the model remain. The code has been pushed to the drqn branch of the RL-baselines.jl repository.
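
As a rough illustration of the idea behind DRQN, the fully connected torso of a DQN is combined with a recurrent layer so the Q-values can depend on a history of observations rather than only the current one. The sketch below uses Flux; the layer sizes and names are placeholders and not the actual code in the drqn branch.

```julia
using Flux

n_obs, n_actions = 4, 2   # illustrative dimensions, e.g. a CartPole-like task

# A recurrent Q-network: the LSTM hidden state carries information
# across timesteps, which is what distinguishes DRQN from a plain DQN.
qnet = Chain(
    Dense(n_obs, 64, relu),
    LSTM(64, 64),
    Dense(64, n_actions),   # one Q-value per action
)

Flux.reset!(qnet)           # reset the recurrent state at episode boundaries

# Q-values for a (random) sequence of observations, fed one step at a time
obs_sequence = [rand(Float32, n_obs) for _ in 1:8]
qvalues = [qnet(o) for o in obs_sequence]
```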

Coming to the status of pix2pix.jl, the problem of long training times was circumvented by reducing the model size. I began training on datasets of different sizes, but the model kept overfitting. I added random jittering as described by the paper's authors, but it turns out that all of the model's learning comes solely from the L1 loss term. I am now trying to overfit the model using only the adversarial loss, which is not working well yet and is what I am investigating.
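
To make that debugging step concrete, here is a hedged sketch of how the pix2pix generator loss combines the adversarial term with the λ-weighted L1 term; setting λ to zero gives the "adversarial loss only" setting I am currently trying to overfit with. The function names and the discriminator/generator signatures are illustrative, not the actual pix2pix.jl API.

```julia
using Flux: sigmoid
using Statistics: mean

# G maps an input image x to a generated image; D scores an (input, output)
# pair. λ weights the L1 reconstruction term (the paper uses λ = 100).
function generator_loss(D, G, x, y; λ = 100f0)
    ŷ   = G(x)
    adv = mean(-log.(sigmoid.(D(x, ŷ)) .+ 1f-8))  # adversarial (GAN) term
    l1  = mean(abs.(ŷ .- y))                      # L1 reconstruction term
    adv + λ * l1
end

# With λ = 0 only the adversarial signal drives the generator, which isolates
# whether the GAN part of the loss is learning anything on its own.
adversarial_only(D, G, x, y) = generator_loss(D, G, x, y; λ = 0f0)
```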