There are excellent blog posts and guides available on Mixture Density Networks, so I will not try to replicate the effort. This post provides a quick summary, and implementation code in the Chainer deep learning framework.

In summary, Mixture Density Networks are Neural Networks that output the parameters of a Mixture Model, such as a Gaussian Mixture Model, instead of the desired output itself. The Mixture Model is then sampled from to get the final output. This is particularly useful when, given a certain input, there could be multiple valid outputs, each occurring with some probability.

The outputs from the neural network, in the case of Gaussian Mixture Models, include a set of probabilities [latex]\alpha[/latex] (coefficients), a set of means [latex]\mu[/latex], and a set of standard deviations [latex]\sigma[/latex]. For example, if the output is **y** given an input **x**, and you choose to have 3 Gaussian mixtures, the output of your neural network would be: [latex]\alpha_1, \alpha_2, \alpha_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3[/latex]. The [latex]\alpha[/latex] sum to 1 and represent the probability of each mixture being the likely candidate, and the [latex]\mu[/latex] and [latex]\sigma[/latex] represent the distribution of **y** within the given mixture and can be used to sample **y**.
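Splitting the network's raw output vector into these three parameter groups can be sketched as follows (a minimal NumPy sketch with illustrative values; the raw numbers and the choice to exponentiate [latex]\sigma[/latex] to keep it positive are assumptions, one common parameterization among several):

```python
import numpy as np

# Hypothetical raw output of the network's final layer for one sample,
# with K = 3 mixtures and a 1-D target: 3 coefficients, 3 means, 3 std devs.
K = 3
raw = np.array([0.2, -1.0, 0.5,   # unnormalized coefficients
                0.0, 0.5, 1.0,    # means
                -0.3, 0.1, 0.4])  # unconstrained std devs (illustrative)

alpha, mu, sigma = raw[:K], raw[K:2*K], raw[2*K:]

# Softmax the coefficients so they sum to 1, and exponentiate sigma so it
# stays positive.
alpha = np.exp(alpha) / np.sum(np.exp(alpha))
sigma = np.exp(sigma)
```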

As a toy example, consider the dataset of (x, y) coordinates shown in the graph below:

The blue dots represent the desired **y** value given an **x** value. So at **x**=0.25, **y** could be {0, 0.5, 1} (roughly). In this case, training a neural network using Mean Squared Error to output **y** directly will cause the network to learn the average of the **y** values for a given **x**, as shown below (red dots represent **y** output from the neural network):
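This averaging behavior can be checked numerically: among all constant predictions for a fixed **x**, the one minimizing MSE is the mean of the valid **y** values (a NumPy sketch using illustrative values read off the plot):

```python
import numpy as np

# At (roughly) x = 0.25 in the toy dataset, y could be 0, 0.5, or 1.
ys = np.array([0.0, 0.5, 1.0])

# Grid-search constant predictions: the MSE-optimal one is the mean of ys.
candidates = np.linspace(-1, 2, 3001)
mses = ((ys[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best = candidates[np.argmin(mses)]
```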

For this type of problem, a Mixture Density Network is perfectly suited. If trained properly, and sampled from enough times given all x values in the dataset, the Mixture Density Network produces the following **y** outputs, which better capture the distribution of the data:
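The sampling step works in two stages: first draw a mixture index according to the coefficients [latex]\alpha[/latex], then sample from that component's Gaussian. A minimal NumPy sketch (the function name and example values are illustrative, not from the repository):

```python
import numpy as np

def sample_mdn(alpha, mu, sigma, rng=np.random):
    # Pick one mixture component with probability given by its coefficient.
    k = rng.choice(len(alpha), p=alpha)
    # Sample y from that component's Gaussian.
    return rng.normal(mu[k], sigma[k])

alpha = np.array([0.7, 0.2, 0.1])  # must sum to 1 (post-softmax)
mu = np.array([0.0, 0.5, 1.0])
sigma = np.array([0.05, 0.05, 0.05])
y = sample_mdn(alpha, mu, sigma)
```

Repeating the draw for every x in the dataset reproduces the multi-valued structure of the data rather than its average.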

The output of the neural network is simply the number of dimensions in your output (1 in this example), times the number of desired mixtures, times 3 (the coefficient, mean, and standard deviation). In Chainer, a Linear layer can be used to output these numbers.
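Concretely, for the toy problem's 1-D output with 3 mixtures, the count works out as:

```python
# Number of network outputs for an MDN:
#   output_dims * n_mixtures * 3
# (one coefficient, one mean, one standard deviation
#  per mixture, per output dimension).
output_dims = 1   # the toy example's 1-D y
n_mixtures = 3
n_outputs = output_dims * n_mixtures * 3  # 9, matching the list above
```

In Chainer this would be the output size of the final `Linear` link.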

The loss function is essentially the negative log of the mixture density: the sum, over mixtures, of each Gaussian weighted by its softmax'd coefficient: [latex]-\ln\left\{\sum_{k=1}^{K}\alpha_k\frac{1}{\sqrt{2\pi}\sigma_k}\exp\left(-\frac{(y-\mu_k)^2}{2\sigma_k^2}\right)\right\}[/latex], where [latex]\sum_{k=1}^{K}\alpha_k=1[/latex] [1]. This can be represented in Chainer easily as follows:

```python
import numpy as np
import chainer.functions as F

# alpha, mu, var are the network's outputs for a batch
# (var is the variance, sigma^2); y is the target,
# broadcast against each mixture's parameters.
alpha = F.softmax(alpha)
density = F.sum(
    alpha * (1 / (np.sqrt(2 * np.pi) * F.sqrt(var))) *
    F.exp(-0.5 * F.square(y - mu) / var),
    axis=1)
nll = -F.sum(F.log(density))
```
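The same computation can be checked in plain NumPy for a single scalar **y** against K mixtures (a sketch of the math only; the function name and test values are illustrative):

```python
import numpy as np

def mdn_nll(alpha_logits, mu, sigma, y):
    # Softmax the coefficients so they sum to 1.
    alpha = np.exp(alpha_logits - alpha_logits.max())
    alpha /= alpha.sum()
    var = sigma ** 2
    # Weighted sum of Gaussian densities, one per mixture.
    density = np.sum(alpha * (1.0 / (np.sqrt(2 * np.pi) * sigma))
                     * np.exp(-0.5 * (y - mu) ** 2 / var))
    return -np.log(density)

# With y sitting exactly on one mixture's mean, the density is high
# and the negative log-likelihood is correspondingly low.
nll = mdn_nll(np.zeros(3), np.array([0.0, 0.5, 1.0]),
              np.array([0.1, 0.1, 0.1]), 0.5)
```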

Full implementation of a Mixture Density Network in Chainer, for the toy problem shown above, can be found at: https://github.com/AdeelMufti/WorldModels/blob/master/toy/mdn.py

[1] Christopher M. Bishop, *Mixture Density Networks* (1994)