## Mixture Density Networks in Chainer

There are excellent blog posts and guides available on Mixture Density Networks, so I will not try to replicate the effort. This post provides a quick summary, and implementation code in the Chainer deep learning framework.

In summary, Mixture Density Networks are Neural Networks that output the parameters of a Mixture Model, such as a Gaussian Mixture Model, instead of the desired output itself. The Mixture Model is then sampled from to get the final output. This is particularly useful when given a certain input, there could be multiple outputs based on some probability.

The outputs from the neural network, in the case of Gaussian Mixture Models, include a set of probabilities $\alpha$ (coefficients), set of means $\mu$, and set of standard deviations $\sigma$. For example, if output is y given an x, and you choose to have 3 Gaussian mixtures, the output of your neural network would be: $\alpha_1, \alpha_2, \alpha_3, \mu_1, \mu_2, \mu_2, \sigma_1, \sigma_2, \sigma_3$. The $\alpha$ sum to 1 and represent the probability of each mixture being the likely candidate, and the $\mu$ and $\sigma$ represent the distribution of y within the given mixture and can be used to sample y.

Take as a toy example consider the dataset of (x, y) coordinates represented in the graph below:

The blue dots represent the desired y value given an x value. So at x=0.25, y could be {0, 0.5, 1} (roughly). In this case, training a neural network using Mean Squared Error to output y directly will cause the network to learn the average of the y values given x as inputs, as shown below (red dots represent y output from the neural network):

For this type of problem, a Mixture Density Network is perfectly suited. If trained properly, and sampled from enough times given all x values in the dataset, the Mixture Density Network produces the following y outputs. It better learns the distribution of the data:

The output of the neural network is simply the number of dimensions in your output (1 in this example), times the number of desired mixtures, times 3 (the coefficient, mean, and standard distribution). In Chainer, a Linear layer can be used to output these numbers.

The loss function is essentially the negative log of the Gaussian equation multiplied by the softmax’d coefficients: $-\ln\{\alpha\frac{1}{\sqrt{2\pi}\sigma}\exp{-\frac{(y-\mu)^2}{2\sigma^2}}\}$, where $\sum\alpha=1$ [1]. This can be represented in Chainer easily as follows:

alpha = F.softmax(alpha)
density = F.sum(
alpha *
(1 / (np.sqrt(2 * np.pi) * F.sqrt(var))) *
F.exp(-0.5 * F.square(y - mu) / var)
, axis=1)
nll = -F.sum(F.log(density))


Full implementation of a Mixture Density Network in Chainer, for the toy problem shown above, can be found at: https://github.com/AdeelMufti/WorldModels/blob/master/toy/mdn.py

[1] Christopher M. Bishop, Mixture Density Networks (1994)