Loss and Gradients not correct when last batch has different size #20241
Labels
stat:awaiting response from contributor
type:support
Thank you for providing the gradient accumulation feature. It has helped me a lot in developing a custom optimizer that needs accumulated gradients.
However, when looking through the code, especially the normalization line, I have questions about whether the implementation is correct:
keras/keras/src/optimizers/base_optimizer.py, line 394 (commit efec341)
This normalization assumes that all batches are of equal size, so that each contributes equally to the accumulated gradients. Typically, however, the last batch in an epoch is smaller (the remainder). By dividing by the total number of batches, the gradients of the last batch receive more weight per sample than those of the earlier batches. The error may become negligible for very large datasets with many batches and a last batch of roughly the same size, but the resulting mean loss/gradients are still not correct.
The same applies to the loss computation: the loss of the last batch has a disproportionately high impact.
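To make the effect concrete, here is a minimal numeric sketch (with made-up per-sample losses, not from the notebook) showing that averaging per-batch means over the number of batches over-weights a smaller final batch, while a size-weighted average recovers the true mean:

```python
# Hypothetical per-sample losses: 10 samples, batch_size = 4,
# so the epoch splits into batches of sizes 4, 4, and 2.
per_sample_losses = [1.0] * 8 + [5.0] * 2
batches = [per_sample_losses[0:4], per_sample_losses[4:8], per_sample_losses[8:10]]

# True mean over all samples: (8 * 1.0 + 2 * 5.0) / 10 = 1.8
true_mean = sum(per_sample_losses) / len(per_sample_losses)

# Per-batch means, as produced by a mean reduction inside each batch.
batch_means = [sum(b) / len(b) for b in batches]

# Naive average over the number of batches (what dividing the
# accumulated value by the batch count does): (1.0 + 1.0 + 5.0) / 3
naive_mean = sum(batch_means) / len(batches)

# Weighting each batch mean by its batch size recovers the true mean.
weighted_mean = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(
    per_sample_losses
)

print(true_mean)      # 1.8
print(naive_mean)     # ~2.33, over-weights the small last batch
print(weighted_mean)  # 1.8
```

The same arithmetic applies to accumulated gradients, since each batch's gradient is itself a per-sample mean over that batch.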
Here is a notebook showing this behavior:
https://colab.research.google.com/drive/1_Ih_xNjW4Ofv-vQr3v6xRZ0PuVGwxvS0?usp=sharing