Loss and Gradients not correct when last batch has different size #20241

Open

bermeitinger-b opened this issue Sep 9, 2024 · 1 comment

@bermeitinger-b

Thank you for providing the gradient accumulation feature. It has helped me a lot in developing a custom optimizer that needs accumulated gradients.

However, when looking through the code, especially the normalization line, I have some questions about whether the implementation is correct:

(g + acc_g) / steps for g, acc_g in zip(grads, acc_grads)

This assumes that all batches are of equal size, because each batch contributes equally to the accumulated gradients. Typically, the last batch in an epoch has a different size (the remainder). By normalizing over the full number of accumulation steps, the smaller last batch is weighted the same as the full batches, so each of its samples effectively gets more weight than the samples in the earlier batches. This may be negligible for very large datasets with many batches and a last batch of roughly the same size, but the mean loss and gradients are still not correct.

This also applies to the loss computation: the loss of the last batch has a relatively higher impact.
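
To make the weighting effect concrete, here is a tiny numeric sketch (made-up numbers, independent of the notebook linked below) comparing the true per-sample mean with the mean of per-batch means:

```python
import numpy as np

# Made-up per-sample losses: two full batches of 4 samples plus a
# remainder batch containing a single sample.
losses = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 9.0])
batches = [losses[0:4], losses[4:8], losses[8:9]]

true_mean = losses.mean()                                   # 2.78 (mean over all samples)
mean_of_batch_means = np.mean([b.mean() for b in batches])  # 4.33

# The single-sample remainder batch is weighted like a full batch,
# pulling the result away from the true per-sample mean.
print(true_mean, mean_of_batch_means)
```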

Here is a notebook showing this behavior:

https://colab.research.google.com/drive/1_Ih_xNjW4Ofv-vQr3v6xRZ0PuVGwxvS0?usp=sharing

@fchollet
Member

fchollet commented Sep 9, 2024

This is true for the non-accumulated case as well: each gradient update step is weighted the same, independently of the batch size.

It's not clear to me whether this should be changed -- this is the expected behavior when doing gradient descent, I would think.

If you want to change that in a custom training loop, I think the best way to achieve it would be to keep track of the batch size and multiply each batch's gradients by the batch size (divided by some constant factor, probably equal to the average batch size, to avoid having to change the learning rate by a lot). This can be done in a custom train_step().
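
A rough sketch of what that could look like, assuming the TensorFlow backend (the class name and the nominal batch size of 32 are made up for illustration):

```python
import tensorflow as tf
import keras

NOMINAL_BATCH_SIZE = 32  # assumption: roughly the average batch size


class BatchSizeWeightedModel(keras.Model):
    def train_step(self, data):
        x, y = data
        # Scale factor: actual batch size relative to the nominal one.
        scale = tf.cast(tf.shape(x)[0], "float32") / NOMINAL_BATCH_SIZE

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)

        grads = tape.gradient(loss, self.trainable_variables)
        # Down-weight the gradients of a smaller remainder batch.
        grads = [g * scale if g is not None else g for g in grads]
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))

        for metric in self.metrics:
            if metric.name == "loss":
                metric.update_state(loss)
            else:
                metric.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
```

Dividing by the nominal batch size keeps the scale factor close to 1 for full batches, so the learning rate does not need to be retuned.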

Another alternative, which would be preferable most of the time, is to bypass the issue by always using the same batch size. This is easy to achieve in tf.data, for instance by passing drop_remainder=True when batching the data, or by using repeat() to make the dataset repeat indefinitely.
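
For instance (dummy data and shapes, made up for illustration; 1000 samples with batch size 32 would otherwise leave a remainder batch of 8):

```python
import numpy as np
import tensorflow as tf

# Dummy data for illustration only.
features = np.random.rand(1000, 16).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Option 1: drop the smaller remainder batch at the end of each epoch.
fixed_size = dataset.batch(32, drop_remainder=True)

# Option 2: repeat the dataset indefinitely so every batch is full;
# pass steps_per_epoch to model.fit() to define epoch boundaries.
repeated = dataset.repeat().batch(32)
```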

@sachinprasadhs added the type:support and stat:awaiting response from contributor labels Sep 10, 2024