World Models is a framework described by David Ha and Jürgen Schmidhuber: https://arxiv.org/abs/1803.10122. The framework aims to train an AI agent that can perform well in virtual gaming environments. World Models consists of three main components: Vision (V), Model (M), and Controller (C).

As part of my MSc Artificial Intelligence dissertation at the University of Edinburgh, I implemented World Models from the ground up in Chainer. My implementation was picked up by Chainer and tweeted:

A reimplementation of our paper from scratch using Chainer! Unlike other reimplementation attempts, he also reproduced the generative environment part, using much less compute. He also describes clearly Mixture Density Networks, CMA-ES, and implemented those from scratch as well. https://t.co/EzfrdH7cKL

According to OpenAI, Evolution Strategies are a scalable alternative to Reinforcement Learning. Where Reinforcement Learning is a guess and check on the actions, Evolution Strategies are a guess and check on the model parameters themselves. A “population” of “mutations” to seed parameters is created, all mutated parameters are checked for fitness, and the seed is adjusted towards the mean of the fittest mutations. CMA-ES is a particular evolution strategy in which the covariance matrix of the sampling distribution is adapted, casting a wider or narrower net for the mutations as the search for the solution progresses.
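The guess-and-check loop described above can be sketched in a few lines of plain Python. This is a simplified evolution strategy with a fixed, isotropic noise scale, not CMA-ES (there is no covariance adaptation). The function name `evolve`, its parameters, and the bowl-shaped `fitness` in the usage part are illustrative assumptions.

```python
import random


def evolve(fitness, seed_params, population_size=50, elite_size=10,
           sigma=0.5, generations=100, rng=None):
    """Simplified evolution strategy: mutate, score, move toward the fittest.

    Hypothetical helper for illustration. sigma stays fixed, so this is
    NOT CMA-ES, which would also adapt a covariance matrix.
    """
    rng = rng or random.Random(0)
    mean = list(seed_params)
    for _ in range(generations):
        # "Guess": create a population of mutations around the current seed.
        population = [[m + rng.gauss(0.0, sigma) for m in mean]
                      for _ in range(population_size)]
        # "Check": rank every mutation by fitness (higher is better).
        population.sort(key=fitness, reverse=True)
        elite = population[:elite_size]
        # Adjust the seed towards the mean of the fittest mutations.
        mean = [sum(ind[i] for ind in elite) / elite_size
                for i in range(len(mean))]
    return mean


# Usage: maximize a simple bowl-shaped fitness with its peak at (3, -1).
target = (3.0, -1.0)


def fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = evolve(fitness, seed_params=[0.0, 0.0])
```

Because the noise scale never shrinks, the returned mean hovers near the peak rather than landing on it exactly. Adapting the sampling distribution is the part that CMA-ES adds.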

To demonstrate, here is a toy problem. Consider a shifted Schaffer function with a global minimum (solution) at f(x=10,y=10):

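The fitness function F() for CMA-ES can be treated as the negative squared error between the Schaffer function evaluated at the solution being tested and at the actual solution: $F(s) = -\left(f(s) - f(s^{*})\right)^{2}$, where $s$ is the candidate being tested and $s^{*} = (10, 10)$ is the actual solution.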
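As a rough sketch of such a fitness function, the code below assumes the Schaffer N.2 variant, shifted so that its global minimum of 0 sits at (10, 10). Which Schaffer variant the toy problem uses is not stated here, so the formula and the names `shifted_schaffer` and `fitness` should be read as assumptions.

```python
import math

SOLUTION = (10.0, 10.0)  # location of the global minimum, from the text


def shifted_schaffer(x, y, shift=SOLUTION):
    """Schaffer N.2 (an assumed variant), shifted so the minimum is at `shift`."""
    xs, ys = x - shift[0], y - shift[1]
    numerator = math.sin(xs ** 2 - ys ** 2) ** 2 - 0.5
    denominator = (1.0 + 0.001 * (xs ** 2 + ys ** 2)) ** 2
    return 0.5 + numerator / denominator


def fitness(candidate):
    """Negative squared error between the candidate's value and the optimum's."""
    error = shifted_schaffer(*candidate) - shifted_schaffer(*SOLUTION)
    return -(error ** 2)
```

Under this assumption the fitness is 0 at (10, 10) and negative everywhere else, so maximizing it drives the search toward the solution.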

Therefore, the task for CMA-ES is to find the solution (x=10, y=10). Given the right population size and the right settings for CMA-ES, it eventually converges to a solution. A visualization of CMA-ES as it evolves a population over generations can be seen below.

The animation below depicts how CMA-ES creates populations of parameters that are tested against the fitness function. The blue dot represents the solution, the red dots the entire population being tested, and the green dot the mean of the population as it evolves, which eventually settles on the solution. The “net” the algorithm casts (the covariance matrix), from which the population is sampled, is adapted according to the fitness score as the population moves further from or closer to the solution.

There are excellent blog posts and guides available on Mixture Density Networks, so I will not try to replicate the effort. This post provides a quick summary, and implementation code in the Chainer deep learning framework.

In summary, Mixture Density Networks are Neural Networks that output the parameters of a Mixture Model, such as a Gaussian Mixture Model, instead of the desired output itself. The Mixture Model is then sampled from to get the final output. This is particularly useful when, given a certain input, there could be multiple outputs, each with some probability.

The outputs from the neural network, in the case of Gaussian Mixture Models, include a set of probabilities (coefficients) $\alpha$, a set of means $\mu$, and a set of standard deviations $\sigma$. For example, if the output is y given an x, and you choose to have 3 Gaussian mixtures, the output of your neural network would be: $\alpha_1, \alpha_2, \alpha_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3$. The $\alpha$ sum to 1 and represent the probability of each mixture being the likely candidate, and the $\mu$ and $\sigma$ represent the distribution of y within the given mixture and can be used to sample y.
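To make the sampling step concrete, here is a small sketch in plain Python that draws y from a 3-component Gaussian mixture, given mixture probabilities, means, and standard deviations that have already been computed. The function name `sample_from_mixture` and all numeric values are made up for illustration.

```python
import random


def sample_from_mixture(alphas, mus, sigmas, rng=random):
    """Draw one y: pick a component by its probability, then sample its Gaussian."""
    k = rng.choices(range(len(alphas)), weights=alphas)[0]
    return rng.gauss(mus[k], sigmas[k])


# Illustrative parameters for 3 mixtures (made-up values, not trained outputs).
alphas = [0.2, 0.5, 0.3]      # mixture probabilities, sum to 1
mus = [0.0, 0.5, 1.0]         # mean of each Gaussian
sigmas = [0.05, 0.05, 0.05]   # standard deviation of each Gaussian

rng = random.Random(0)
samples = [sample_from_mixture(alphas, mus, sigmas, rng) for _ in range(1000)]
```

Sampling many times for the same input spreads the outputs across all three modes in proportion to the mixture probabilities, instead of collapsing them into a single value.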

As a toy example, consider the dataset of (x, y) coordinates represented in the graph below:

The blue dots represent the desired y value given an x value. So at x=0.25, y could be {0, 0.5, 1} (roughly). In this case, training a neural network using Mean Squared Error to output y directly will cause the network to learn the average of the y values given x as inputs, as shown below (red dots represent y output from the neural network):
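The averaging effect follows directly from the loss: for a fixed x, the single prediction that minimizes Mean Squared Error over several valid targets is their mean. A quick check in plain Python, using the rough targets {0, 0.5, 1} from the example at x=0.25 (the helper name `mse` and the candidate grid are illustrative):

```python
targets = [0.0, 0.5, 1.0]  # the (rough) valid y values at x = 0.25


def mse(prediction, ys):
    """Mean squared error of one constant prediction against all valid targets."""
    return sum((prediction - y) ** 2 for y in ys) / len(ys)


# Scan candidate predictions between 0 and 1 and keep the one with lowest MSE.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, targets))
mean_target = sum(targets) / len(targets)
```

Here `best` coincides with `mean_target`: a network trained this way can only emit one value per x, so the three possible outputs collapse into their average.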

For this type of problem, a Mixture Density Network is perfectly suited. If trained properly, and sampled from enough times given all x values in the dataset, the Mixture Density Network produces the following y outputs. It better learns the distribution of the data:

The output size of the neural network is simply the number of dimensions in your output (1 in this example), times the number of desired mixtures, times 3 (the coefficient, mean, and standard deviation). In Chainer, a Linear layer can be used to output these numbers.

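The loss function is essentially the negative log of the Gaussian equation multiplied by the softmax’d coefficients: $-\ln\left(\sum_{k} \alpha_k \, \mathcal{N}(y \mid \mu_k, \sigma_k)\right)$, where $\alpha_k$ are the softmax’d coefficients (so $\sum_{k} \alpha_k = 1$) and $\mathcal{N}(y \mid \mu_k, \sigma_k)$ is the Gaussian density of mixture $k$ [1]. This can be represented in Chainer easily as follows: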
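As a framework-free sketch of this loss (plain Python rather than Chainer, for a single 1-D target), the code below computes the negative log-likelihood of y under the predicted mixture. It assumes the raw network output is laid out as [coefficient logits, means, log standard deviations] and that the standard deviations are obtained with an exponential. That layout, the exponential, and the name `mdn_loss` are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax so the coefficients sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(y, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mdn_loss(raw_output, y, num_mixtures):
    """Negative log-likelihood of y under the predicted Gaussian mixture."""
    k = num_mixtures
    # Assumed layout: [k coefficient logits, k means, k log standard deviations].
    alphas = softmax(raw_output[:k])
    mus = raw_output[k:2 * k]
    sigmas = [math.exp(s) for s in raw_output[2 * k:3 * k]]
    likelihood = sum(a * gaussian_pdf(y, m, s)
                     for a, m, s in zip(alphas, mus, sigmas))
    return -math.log(likelihood + 1e-12)  # small epsilon guards against log(0)


# 1 output dimension * 3 mixtures * 3 parameters = 9 raw outputs.
raw = [0.0, 0.0, 0.0,  0.0, 0.5, 1.0,  -3.0, -3.0, -3.0]
loss_near_mode = mdn_loss(raw, y=0.5, num_mixtures=3)
loss_off_mode = mdn_loss(raw, y=0.25, num_mixtures=3)
```

A target that sits on one of the mixture means gets a lower loss than one that falls between them. For batched training, a log-sum-exp formulation of the same quantity is usually more numerically stable than taking the log of a sum directly.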

As a 4-month research project for a course at the University of Edinburgh, my group decided to use Generative Adversarial Nets to generate new paintings using the Painter by Numbers dataset. Additionally, we conditioned the paintings, allowing the generation of paintings with certain attributes (such as gender in the case of paintings of portraits). The project turned out to be a big success, and won 2nd place out of 124 other projects in a competition hosted by IBM!

The deep convolutional GAN architecture we chose as the basis of our model enforces Lipschitz continuity by constraining the spectral norm of the weights of each layer in the discriminator, as proposed by Miyato et al. (2018). This technique, called SN-GAN by the authors, stabilizes the training of the discriminator, allowing the generator to learn the data distribution more effectively.
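The core of spectral normalization is to estimate a weight matrix's largest singular value with power iteration and divide the weights by it. The sketch below shows that idea in plain Python on a small 2-D matrix. It is not the project's training code, and the helper names are hypothetical. It also runs many iterations from scratch so that it works standalone, whereas Miyato et al. run a single iteration per training step and carry the vector `u` over between steps.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def spectral_norm(W, n_iters=50, rng=None):
    """Estimate the largest singular value of W via power iteration."""
    rng = rng or random.Random(0)
    Wt = [list(col) for col in zip(*W)]  # transpose of W
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in W])
    for _ in range(n_iters):
        v = l2_normalize(matvec(Wt, u))  # right singular vector estimate
        u = l2_normalize(matvec(W, v))   # left singular vector estimate
    return sum(a * b for a, b in zip(u, matvec(W, v)))


def spectrally_normalize(W, n_iters=50):
    """Rescale W so that its spectral norm is approximately 1."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]


# Usage on a matrix whose largest singular value is known to be 3.
W = [[3.0, 0.0], [0.0, 1.0]]
W_sn = spectrally_normalize(W)
```

For convolutional layers the kernel is typically reshaped into a 2-D matrix before the same procedure is applied.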

We built upon the SN-GAN architecture to add conditioning, which involved introducing a one-hot encoded y vector of supervised labels to the generator and discriminator. The y vector provides some information about the image, such as gender in the case of portrait paintings, so that the generator is conditioned on the label and, at test time, generates the class of image specified. Our contribution was a novel model which we called SN-conditional-GAN or SNcGAN, which is illustrated below.
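As a minimal illustration of what a one-hot y vector looks like on the generator side, the sketch below appends it to a latent noise vector. Concatenation is one common way to condition a GAN and is an assumption here. How SNcGAN actually injects y into the generator and discriminator is described in the paper, not in this summary. The names `one_hot` and `conditioned_input`, the latent size, and the two-class setup are hypothetical.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def conditioned_input(z, label, num_classes):
    """Append the one-hot label y to the noise vector z (assumed scheme)."""
    return list(z) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # small latent size for illustration
g_in = conditioned_input(z, label=1, num_classes=2)
```

At test time, changing only `label` while keeping `z` fixed asks the generator for the same sample rendered as a different class.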

More details can be found in our paper linked above.

After the publication of the paper, we additionally trained an SNcGAN model on the CelebA dataset, allowing the generation of conditioned celebrity images.