Inspecting gradients in Chainer

Chainer is my choice of framework when it comes to implementing Neural Networks. It makes working with and trouble shooting deep learning easy.

Printing out the gradients during back propagation to inspect their values is sometimes useful in deep learning, to see if your gradients are as expected and aren’t either exploding (numbers too large) or vanishing (numbers too small). Fortunately, this is easy to do in Chainer.

Chainer provides access to the parameters in your model, and for each parameter, you can check the gradient during the back propagation step, stored in the optimizer (such as SGD or Adam). To access these, you can extend to additionally output the gradients, by defining your own StandardUpdater like so:

class CustomStandardUpdater(
    def __init__(self, train_iter, optimizer, device):
        super(CustomStandardUpdater, self).__init__(
            train_iter, optimizer, device=device)

    def update_core(self):
        super(CustomStandardUpdater, self).update_core()
        optimizer = self.get_optimizer('main')
        for name, param in
            print(name, param.grad)

In lines 9-10 you can see the parameters (weights) of your neural network being accessed through the optimizer, and for each parameter, the name and gradient is being output. This StandardUpdater can be attached to your training module as follows:

model = MyChainerModel()
optimizer = chainer.optimizers.Adam()
train_iter = chainer.iterators.SerialIterator(train_dataset, batch_size=32, shuffle=True)
updater = CustomStandardUpdater(train_iter, optimizer, gpu)
trainer = training.Trainer(updater, stop_trigger=(100, 'epoch'))

Calculating running estimate of mean and standard deviation in Python

Say you have a stream of means and standard deviations for a random variable x that you want to combine. So essentially you’re combining two groups of means and standard deviations, G_{x,1} = \mu_{x,1}, \sigma_{x,1} and
G_{x,2} = \mu_{x,2}, \sigma_{x,2} .

If you have access to the random variable x‘s value coming in as a stream, you can collect the values for some n_{x,1} number of values and calculate the mean and standard deviation to form a group G_{x,1}, and then combine it with the mean and standard deviation of the next group G_{x,2} consisting of the next n_{x,2} values of x to form: G_{x,1:2} = \mu_{x,1:2}, \sigma_{x,1:2}

The formulas for the combined means and standard deviations are:

\mu_{x,1:2}=\frac{(n_{x,1}\mu_{x,1}) + (n_{x,2}\mu_{x,2})}{ n_{x,1:2} }
\sigma_{x,1:2}=\sqrt{\frac{(n_{x,1}-1)\sigma_{x,1}^{2} + (n_{x,2}-1)\sigma_{x,2}^{2} + n_{x,1}(\mu_{x,1} - \mu_{x,1:2})^{2} + n_{x,2}(\mu_{x,2} - \mu_{x,1:2})^{2} }{n_{x,1:2}-1}}

Note that this is the Bessel corrected standard deviation calculation according to, which I found leads to a better estimate.

In Python code, this is what it looks like:

import numpy as np

def combine_mean_std(mean_x_1, std_x_1, n_x_1, mean_x_2, std_x_2, n_x_2):
    n_x_1_2 = n_x_1 + n_x_2
    mean_x_1_2 = (mean_x_1 * n_x_1 + mean_x_2 * n_x_2) / n_x_1_2
    std_x_1_2 = np.sqrt(((n_x_1 - 1) * (std_x_1 ** 2) + (n_x_2 - 1) * (
                            std_x_2 ** 2) + n_x_1 * ((mean_x_1_2 - mean_x_1) ** 2)
                            + n_x_2 * ((mean_x_1_2 - mean_x_2) ** 2))
                        / (n_x_1_2 - 1))
    return mean_x_1_2, std_x_1_2, n_x_1_2

total_mean_x = None
total_std_x = None
total_n_x = 0

all_x = None # For getting the actual mean and std for comparison with the running estimate

for i in range(10):
    x = np.random.randint(0, 100, np.random.randint(1, 100))
    if all_x is None:
        all_x = x
        all_x = np.concatenate((all_x,x),axis=0)
    mean_x = x.mean()
    std_x = x.std()
    n_x = x.shape[0]
    if total_mean_x is None and total_std_x is None:
        total_mean_x = mean_x
        total_std_x = std_x
        total_n_x = n_x
        total_mean_x, total_std_x, total_n_x = combine_mean_std(total_mean_x, total_std_x, total_n_x,
                                                                mean_x, std_x, n_x)

print(total_mean_x, total_std_x, total_n_x)
print(all_x.mean(), all_x.std(), all_x.shape[0])

If you run the code above and inspect the values printed at the end, you’ll note that the running estimate in total_mean_x and total_std_x are almost exactly the same as the actual mean and std output by literally collecting all x values and calculating the two values (but which may not be possible or feasible in your task).