Calculating running estimate of mean and standard deviation in Python

Say you have a stream of means and standard deviations for a random variable x that you want to combine. So essentially you’re combining two groups of means and standard deviations, $G_{x,1} = \mu_{x,1}, \sigma_{x,1}$ and
$G_{x,2} = \mu_{x,2}, \sigma_{x,2}$ .

If you have access to the random variable x‘s value coming in as a stream, you can collect the values for some $n_{x,1}$ number of values and calculate the mean and standard deviation to form a group $G_{x,1}$, and then combine it with the mean and standard deviation of the next group $G_{x,2}$ consisting of the next $n_{x,2}$ values of x to form: $G_{x,1:2} = \mu_{x,1:2}, \sigma_{x,1:2}$

The formulas for the combined means and standard deviations are:

$n_{x,1:2}=n_{x,1}+n_{x,2}$
$\mu_{x,1:2}=\frac{(n_{x,1}\mu_{x,1}) + (n_{x,2}\mu_{x,2})}{ n_{x,1:2} }$
$\sigma_{x,1:2}=\sqrt{\frac{(n_{x,1}-1)\sigma_{x,1}^{2} + (n_{x,2}-1)\sigma_{x,2}^{2} + n_{x,1}(\mu_{x,1} - \mu_{x,1:2})^{2} + n_{x,2}(\mu_{x,2} - \mu_{x,1:2})^{2} }{n_{x,1:2}-1}}$

Note that this is the Bessel corrected standard deviation calculation according to https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation, which I found leads to a better estimate.

In Python code, this is what it looks like:

import numpy as np
np.random.seed(31337)

def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2):
n_x_1_2 = n_x_1 + n_x_2
mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2
std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * (
std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2)
+ n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2))
/ (n_x_1_2 - 1))
return mean_x_1_2, std_x_1_2, n_x_1_2

total_mean_x = None
total_std_x = None
total_n_x = 0

all_x = None # For getting the actual mean and std for comparison with the running estimate

for i in range(10):
x = np.random.randint(0, 100, np.random.randint(1, 100))
if all_x is None:
all_x = x
else:
all_x = np.concatenate((all_x,x),axis=0)
mean_x = x.mean()
std_x = x.std()
n_x = x.shape[0]
if total_mean_x is None and total_std_x is None:
total_mean_x = mean_x
total_std_x = std_x
total_n_x = n_x
else:
total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x,
mean_x, std_x, n_x)

print(total_mean_x, total_std_x, total_n_x)
print(all_x.mean(), all_x.std(), all_x.shape[0])


If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).