Say you have a stream of means and standard deviations for a random variable **x** that you want to combine. So essentially you’re combining two groups of means and standard deviations, [latex]G_{x,1} = \mu_{x,1}, \sigma_{x,1}[/latex] and

[latex]G_{x,2} = \mu_{x,2}, \sigma_{x,2}[/latex] .

If you have access to the random variable **x**‘s value coming in as a stream, you can collect the values for some [latex]n_{x,1}[/latex] number of values and calculate the mean and standard deviation to form a group [latex]G_{x,1}[/latex], and then combine it with the mean and standard deviation of the next group [latex]G_{x,2}[/latex] consisting of the next [latex]n_{x,2}[/latex] values of **x** to form: [latex]G_{x,1:2} = \mu_{x,1:2}, \sigma_{x,1:2}[/latex]

The formulas for the combined means and standard deviations are:

[latex]n_{x,1:2}=n_{x,1}+n_{x,2}[/latex]

[latex]\mu_{x,1:2}=\frac{(n_{x,1}\mu_{x,1}) + (n_{x,2}\mu_{x,2})}{ n_{x,1:2} }[/latex]

[latex]\sigma_{x,1:2}=\sqrt{\frac{(n_{x,1}-1)\sigma_{x,1}^{2} + (n_{x,2}-1)\sigma_{x,2}^{2} + n_{x,1}(\mu_{x,1} – \mu_{x,1:2})^{2} + n_{x,2}(\mu_{x,2} – \mu_{x,1:2})^{2} }{n_{x,1:2}-1}}[/latex]

Note that this is the Bessel corrected standard deviation calculation according to https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation, which I found leads to a better estimate.

In Python code, this is what it looks like:

import numpy as np np.random.seed(31337) def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2): n_x_1_2 = n_x_1 + n_x_2 mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2 std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * ( std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2) + n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2)) / (n_x_1_2 - 1)) return mean_x_1_2, std_x_1_2, n_x_1_2 total_mean_x = None total_std_x = None total_n_x = 0 all_x = None # For getting the actual mean and std for comparison with the running estimate for i in range(10): x = np.random.randint(0, 100, np.random.randint(1, 100)) if all_x is None: all_x = x else: all_x = np.concatenate((all_x,x),axis=0) mean_x = x.mean() std_x = x.std() n_x = x.shape[0] if total_mean_x is None and total_std_x is None: total_mean_x = mean_x total_std_x = std_x total_n_x = n_x else: total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x, mean_x, std_x, n_x) print(total_mean_x, total_std_x, total_n_x) print(all_x.mean(), all_x.std(), all_x.shape[0])

If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).