There are excellent blog posts and guides available on Mixture Density Networks, so I will not try to replicate that effort here. This post provides a quick summary and implementation code in the Chainer deep learning framework.
In summary, Mixture Density Networks are Neural Networks that output the parameters of a Mixture Model, such as a Gaussian Mixture Model, instead of the desired output itself. The Mixture Model is then sampled from to get the final output. This is particularly useful when, for a given input, there could be multiple valid outputs, each with some probability.
The outputs from the neural network, in the case of Gaussian Mixture Models, include a set of probabilities (coefficients) $\alpha$, a set of means $\mu$, and a set of standard deviations $\sigma$. For example, if the output is $y$ given an $x$, and you choose to have 3 Gaussian mixtures, the output of your neural network would be: $[\alpha_1, \alpha_2, \alpha_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3]$. The $\alpha_k$ sum to 1 and represent the probability of each mixture being the likely candidate, and the $\mu_k$ and $\sigma_k$ represent the distribution of $y$ within the given mixture and can be used to sample $y$.
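To illustrate, once the coefficients have been pushed through a softmax so they sum to 1, sampling $y$ is a two-step process: pick a mixture according to $\alpha$, then draw from that mixture's Gaussian. Here is a minimal sketch in NumPy, with made-up parameter values purely for illustration:

import numpy as np

alpha = np.array([0.2, 0.5, 0.3])     # mixture probabilities (sum to 1)
mu = np.array([0.0, 0.5, 1.0])        # per-mixture means
sigma = np.array([0.05, 0.05, 0.05])  # per-mixture standard deviations

k = np.random.choice(len(alpha), p=alpha)  # pick a mixture index
y = np.random.normal(mu[k], sigma[k])      # sample y from that mixture's Gaussian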
As a toy example, consider the dataset of (x, y) coordinates represented in the graph below:

The blue dots represent the desired y values given an x value. So at x=0.25, y could be {0, 0.5, 1} (roughly). In this case, training a neural network with Mean Squared Error to output y directly will cause the network to learn the average of the y values for each x, as shown below (red dots represent the y output from the neural network):

A Mixture Density Network is perfectly suited to this type of problem. If trained properly, and sampled from enough times for all x values in the dataset, the Mixture Density Network produces the following y outputs, better capturing the distribution of the data:

The size of the neural network's output is simply the number of dimensions in your output (1 in this example), times the number of desired mixtures, times 3 (one coefficient, mean, and standard deviation per mixture). In Chainer, a Linear layer can be used to output these numbers.
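For instance, here is a minimal model sketch. The layer sizes, the tanh nonlinearity, and the trick of predicting a log-variance and exponentiating it to keep the variance positive are assumptions made for this example, not necessarily what the linked implementation does:

import chainer
import chainer.functions as F
import chainer.links as L

class MDN(chainer.Chain):
    def __init__(self, n_hidden=24, n_mixtures=3):
        super(MDN, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(1, n_hidden)
            # 3 * n_mixtures outputs: coefficient, mean, log-variance per mixture
            self.l2 = L.Linear(n_hidden, 3 * n_mixtures)

    def __call__(self, x):
        h = F.tanh(self.l1(x))
        out = self.l2(h)
        alpha, mu, log_var = F.split_axis(out, 3, axis=1)
        var = F.exp(log_var)  # exponentiate so the variance is always positive
        return alpha, mu, var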
The loss function is essentially the negative log of the Gaussian equation multiplied by the softmax’d coefficients: $\mathcal{L} = -\log\left(\sum_{k=1}^{K} \alpha_k \, \mathcal{N}(y \mid \mu_k, \sigma_k^2)\right)$, where $\mathcal{N}(y \mid \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\!\left(-\frac{(y - \mu_k)^2}{2\sigma_k^2}\right)$ and the $\alpha_k$ are the softmax’d coefficients [1]. This can be represented in Chainer easily as follows:
import numpy as np
import chainer.functions as F

# alpha, mu, var come from the network; y is the target
alpha = F.softmax(alpha)  # normalize the raw coefficients into probabilities
density = F.sum(
    alpha * (1 / (np.sqrt(2 * np.pi) * F.sqrt(var)))
    * F.exp(-0.5 * F.square(y - mu) / var),
    axis=1)
nll = -F.sum(F.log(density))  # negative log-likelihood over the batch
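To make the flow concrete, here is a hedged sketch of how the loss above might be wired into a single training step, reusing the illustrative MDN chain sketched earlier. The batch variables x_batch and y_batch and the choice of the Adam optimizer are assumptions for this example, not taken from the linked implementation:

import numpy as np
import chainer.functions as F
from chainer import optimizers

def gmm_loss(alpha, mu, var, y):
    # same computation as the snippet above, wrapped as a function
    alpha = F.softmax(alpha)
    density = F.sum(
        alpha * (1 / (np.sqrt(2 * np.pi) * F.sqrt(var)))
        * F.exp(-0.5 * F.square(y - mu) / var),
        axis=1)
    return -F.sum(F.log(density))

model = MDN()                    # the illustrative chain from the sketch above
optimizer = optimizers.Adam()
optimizer.setup(model)

alpha, mu, var = model(x_batch)  # x_batch: float32 array of shape (N, 1)
loss = gmm_loss(alpha, mu, var, y_batch)  # y_batch: shape (N, 1), broadcast against mu
model.cleargrads()
loss.backward()
optimizer.update()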
A full implementation of a Mixture Density Network in Chainer, for the toy problem shown above, can be found at: https://github.com/AdeelMufti/WorldModels/blob/master/toy/mdn.py
[1] Christopher M. Bishop, Mixture Density Networks (1994)