Say you have a stream of means and standard deviations for a random variable x that you want to combine. So essentially you’re combining two groups of means and standard deviations, and
.
If you have access to the random variable x‘s value coming in as a stream, you can collect the values for some number of values and calculate the mean and standard deviation to form a group , and then combine it with the mean and standard deviation of the next group consisting of the next values of x to form:
The formulas for the combined means and standard deviations are:
Note that this is the Bessel corrected standard deviation calculation according to https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation, which I found leads to a better estimate.
In Python code, this is what it looks like:
import numpy as np np.random.seed(31337) def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2): n_x_1_2 = n_x_1 + n_x_2 mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2 std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * ( std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2) + n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2)) / (n_x_1_2 - 1)) return mean_x_1_2, std_x_1_2, n_x_1_2 total_mean_x = None total_std_x = None total_n_x = 0 all_x = None # For getting the actual mean and std for comparison with the running estimate for i in range(10): x = np.random.randint(0, 100, np.random.randint(1, 100)) if all_x is None: all_x = x else: all_x = np.concatenate((all_x,x),axis=0) mean_x = x.mean() std_x = x.std() n_x = x.shape[0] if total_mean_x is None and total_std_x is None: total_mean_x = mean_x total_std_x = std_x total_n_x = n_x else: total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x, mean_x, std_x, n_x) print(total_mean_x, total_std_x, total_n_x) print(all_x.mean(), all_x.std(), all_x.shape[0])
If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).