There are excellent blog posts and guides available on Mixture Density Networks, so I will not try to replicate the effort. This post provides a quick summary, and implementation code in the Chainer deep learning framework.

In summary, Mixture Density Networks are Neural Networks that output the parameters of a Mixture Model, such as a Gaussian Mixture Model, instead of the desired output itself. The Mixture Model is then sampled from to get the final output. This is particularly useful when, for a given input, there are multiple plausible outputs, each occurring with some probability.

The outputs from the neural network, in the case of Gaussian Mixture Models, include a set of probabilities (coefficients) $\alpha$, a set of means $\mu$, and a set of standard deviations $\sigma$. For example, if the output is **y** given an **x**, and you choose to have 3 Gaussian mixtures, the output of your neural network would be: $\alpha_1, \alpha_2, \alpha_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3$. The $\alpha_k$ sum to 1 and represent the probability of each mixture being the likely candidate, and $\mu_k$ and $\sigma_k$ represent the distribution of **y** within mixture $k$, from which **y** can be sampled.
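To make this concrete, here is a NumPy sketch of sampling a single **y** from a 3-mixture model: pick a mixture according to the coefficients, then draw from that mixture's Gaussian. The parameter values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDN outputs for one input x, with 3 mixtures
alpha = np.array([0.2, 0.5, 0.3])     # mixture coefficients, sum to 1
mu = np.array([0.0, 0.5, 1.0])        # mean of y under each mixture
sigma = np.array([0.05, 0.05, 0.05])  # standard deviation of each mixture

# Pick a mixture index according to the coefficients...
k = rng.choice(3, p=alpha)
# ...then sample y from that mixture's Gaussian
y = rng.normal(mu[k], sigma[k])
```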

As a toy example, consider the dataset of (x, y) coordinates represented in the graph below:

The blue dots represent the desired **y** value given an **x** value. So at **x**=0.25, **y** could be {0, 0.5, 1} (roughly). In this case, training a neural network using Mean Squared Error to output **y** directly will cause the network to learn the average of the **y** values given **x** as inputs, as shown below (red dots represent **y** output from the neural network):

For this type of problem, a Mixture Density Network is perfectly suited. If trained properly, and sampled from enough times for every **x** value in the dataset, the Mixture Density Network produces the following **y** outputs, which capture the distribution of the data much better:

The number of outputs of the neural network is simply the number of dimensions of your output (1 in this example), times the number of desired mixtures, times 3 (one coefficient, mean, and standard deviation per mixture). In Chainer, a Linear layer can be used to output these numbers.
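As a sketch of that bookkeeping (not the linked repository's exact code), for a 1-dimensional output with 3 mixtures the Linear layer would have 1 × 3 × 3 = 9 output units, which are then split into the three parameter groups. NumPy stands in for the layer's output here; the softmax for the coefficients and the `exp` used to keep $\sigma$ positive are my choices of activation.

```python
import numpy as np

dims, mixtures = 1, 3
n_out = dims * mixtures * 3  # 9 units: coefficients, means, std devs

# Simulated output of a Linear(n_hidden, n_out) layer for a batch of 2
z = np.random.default_rng(1).normal(size=(2, n_out))

# Split into the three parameter groups, one column per mixture
alpha_hat, mu, sigma_hat = np.split(z, 3, axis=1)

# Coefficients are softmax'd so they sum to 1; sigma must be positive
alpha = np.exp(alpha_hat) / np.exp(alpha_hat).sum(axis=1, keepdims=True)
sigma = np.exp(sigma_hat)
```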

The loss function is essentially the negative log of the Gaussian density weighted by the softmax'd coefficients: $\mathcal{L} = -\ln\left(\sum_{k=1}^{K} \alpha_k\, \mathcal{N}(y \mid \mu_k, \sigma_k^2)\right)$, where $\mathcal{N}(y \mid \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(y - \mu_k)^2}{2\sigma_k^2}\right)$ [1]. This can be represented in Chainer easily as follows:

```python
import numpy as np
import chainer.functions as F

alpha = F.softmax(alpha)  # normalize the mixture coefficients
density = F.sum(
    alpha * (1 / (np.sqrt(2 * np.pi) * F.sqrt(var)))
    * F.exp(-0.5 * F.square(y - mu) / var), axis=1)
nll = -F.sum(F.log(density))
```
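For readers without Chainer installed, the same negative log-likelihood can be written in plain NumPy. Variable names mirror the snippet above (`var` is $\sigma^2$); the function name and shapes are my own for illustration.

```python
import numpy as np

def mdn_nll(alpha_hat, mu, var, y):
    """Negative log-likelihood of y under a Gaussian mixture.

    alpha_hat: unnormalized coefficients, shape (batch, K)
    mu, var:   per-mixture means and variances, shape (batch, K)
    y:         targets, shape (batch, 1), broadcast against the mixtures
    """
    # Softmax over the mixture dimension
    alpha = np.exp(alpha_hat) / np.exp(alpha_hat).sum(axis=1, keepdims=True)
    # Coefficient-weighted sum of Gaussian densities for each sample
    density = np.sum(
        alpha * (1 / np.sqrt(2 * np.pi * var))
        * np.exp(-0.5 * np.square(y - mu) / var), axis=1)
    return -np.sum(np.log(density))
```

A quick sanity check: with a single mixture, $\mu = y$ and $\sigma^2 = 1$, the density is $1/\sqrt{2\pi}$, so the loss is $\tfrac{1}{2}\ln(2\pi)$ per sample.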

Full implementation of a Mixture Density Network in Chainer, for the toy problem shown above, can be found at: https://github.com/AdeelMufti/WorldModels/blob/master/toy/mdn.py

[1] Christopher M. Bishop, *Mixture Density Networks* (1994)