Say you have a stream of means and standard deviations for a random variable **x** that you want to combine. So essentially you’re combining two groups of means and standard deviations, and

.

If you have access to the random variable **x**‘s value coming in as a stream, you can collect the values for some number of values and calculate the mean and standard deviation to form a group , and then combine it with the mean and standard deviation of the next group consisting of the next values of **x** to form:

The formulas for the combined means and standard deviations are:

Note that this is the Bessel corrected standard deviation calculation according to https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation, which I found leads to a better estimate.

In Python code, this is what it looks like:

import numpy as np
np.random.seed(31337)
def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2):
n_x_1_2 = n_x_1 + n_x_2
mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2
std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * (
std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2)
+ n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2))
/ (n_x_1_2 - 1))
return mean_x_1_2, std_x_1_2, n_x_1_2
total_mean_x = None
total_std_x = None
total_n_x = 0
all_x = None # For getting the actual mean and std for comparison with the running estimate
for i in range(10):
x = np.random.randint(0, 100, np.random.randint(1, 100))
if all_x is None:
all_x = x
else:
all_x = np.concatenate((all_x,x),axis=0)
mean_x = x.mean()
std_x = x.std()
n_x = x.shape[0]
if total_mean_x is None and total_std_x is None:
total_mean_x = mean_x
total_std_x = std_x
total_n_x = n_x
else:
total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x,
mean_x, std_x, n_x)
print(total_mean_x, total_std_x, total_n_x)
print(all_x.mean(), all_x.std(), all_x.shape[0])

If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).

### Like this:

Like Loading...