Calculating running estimate of mean and standard deviation in Python

Say you have a stream of means and standard deviations for a random variable x that you want to combine. So essentially you’re combining two groups of means and standard deviations, G_{x,1} = \mu_{x,1}, \sigma_{x,1} and
G_{x,2} = \mu_{x,2}, \sigma_{x,2} .

If you have access to the random variable x‘s value coming in as a stream, you can collect the values for some n_{x,1} number of values and calculate the mean and standard deviation to form a group G_{x,1}, and then combine it with the mean and standard deviation of the next group G_{x,2} consisting of the next n_{x,2} values of x to form: G_{x,1:2} = \mu_{x,1:2}, \sigma_{x,1:2}

The formulas for the combined means and standard deviations are:

n_{x,1:2}=n_{x,1}+n_{x,2}
\mu_{x,1:2}=\frac{(n_{x,1}\mu_{x,1}) + (n_{x,2}\mu_{x,2})}{ n_{x,1:2} }
\sigma_{x,1:2}=\sqrt{\frac{(n_{x,1}-1)\sigma_{x,1}^{2} + (n_{x,2}-1)\sigma_{x,2}^{2} + n_{x,1}(\mu_{x,1} - \mu_{x,1:2})^{2} + n_{x,2}(\mu_{x,2} - \mu_{x,1:2})^{2} }{n_{x,1:2}-1}}

Note that this is the Bessel corrected standard deviation calculation according to https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation, which I found leads to a better estimate.

In Python code, this is what it looks like:

import numpy as np
np.random.seed(31337)

def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2):
    n_x_1_2 = n_x_1 + n_x_2
    mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2
    std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * (
                            std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2)
                            + n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2))
                        / (n_x_1_2 - 1))
    return mean_x_1_2, std_x_1_2, n_x_1_2

total_mean_x = None
total_std_x = None
total_n_x = 0

all_x = None # For getting the actual mean and std for comparison with the running estimate

for i in range(10):
    x = np.random.randint(0, 100, np.random.randint(1, 100))
    if all_x is None:
        all_x = x
    else:
        all_x = np.concatenate((all_x,x),axis=0)
    mean_x = x.mean()
    std_x = x.std()
    n_x = x.shape[0]
    if total_mean_x is None and total_std_x is None:
        total_mean_x = mean_x
        total_std_x = std_x
        total_n_x = n_x
    else:
        total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x,
                                                                mean_x, std_x, n_x)

print(total_mean_x, total_std_x, total_n_x)
print(all_x.mean(), all_x.std(), all_x.shape[0])

If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).

Leave a Reply