Loss and Gradients not correct when last batch has different size #20241

Open

bermeitinger-b opened this issue Sep 9, 2024 · 1 comment

@bermeitinger-b

Thank you for providing the gradient accumulation feature. It has helped me a lot in developing a custom optimizer that needs accumulated gradients.

However, when looking through the code, especially the normalization line, I have some questions about whether the implementation is correct:

(g + acc_g) / steps for g, acc_g in zip(grads, acc_grads)

This assumes that all batches are of equal size, because each batch contributes equally to the accumulated gradients. Typically, the last batch in an epoch has a different size (the remainder). By normalizing over the full number of accumulation steps, the smaller last batch is weighted the same as the full batches, so each of its samples effectively gets more weight than the samples in the earlier batches. This may be negligible for very large datasets with many batches and a last batch of roughly the same size, but the mean loss and gradients are still not correct.

This also applies to the loss computation: the loss of the last batch has a relatively higher impact.
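
To make the weighting effect concrete, here is a tiny numeric sketch (made-up numbers, independent of the notebook linked below) comparing the true per-sample mean with the mean of per-batch means:

```python
import numpy as np

# Made-up per-sample losses: two full batches of 4 samples plus a
# remainder batch containing a single sample.
losses = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 9.0])
batches = [losses[0:4], losses[4:8], losses[8:9]]

true_mean = losses.mean()                                   # 2.78 (mean over all samples)
mean_of_batch_means = np.mean([b.mean() for b in batches])  # 4.33

# The single-sample remainder batch is weighted like a full batch,
# pulling the result away from the true per-sample mean.
print(true_mean, mean_of_batch_means)
```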

Here is a notebook showing this behavior:

https://colab.research.google.com/drive/1_Ih_xNjW4Ofv-vQr3v6xRZ0PuVGwxvS0?usp=sharing

@fchollet
Member

fchollet commented Sep 9, 2024

This is true for the non-accumulated case as well: each gradient update step is weighted the same, independently of the batch size.

It's not clear to me whether this should be changed -- this is the expected behavior when doing gradient descent, I would think.

If you want to change that in a custom training loop, I think the best way to achieve it would be to keep track of the batch size and multiply each batch's gradients by the batch size (divided by some constant factor, probably equal to the average batch size, to avoid having to change the learning rate by a lot). This can be done in a custom train_step().
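
A rough sketch of what that could look like, assuming the TensorFlow backend (the class name and the nominal batch size of 32 are made up for illustration):

```python
import tensorflow as tf
import keras

NOMINAL_BATCH_SIZE = 32  # assumption: roughly the average batch size


class BatchSizeWeightedModel(keras.Model):
    def train_step(self, data):
        x, y = data
        # Scale factor: actual batch size relative to the nominal one.
        scale = tf.cast(tf.shape(x)[0], "float32") / NOMINAL_BATCH_SIZE

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)

        grads = tape.gradient(loss, self.trainable_variables)
        # Down-weight the gradients of a smaller remainder batch.
        grads = [g * scale if g is not None else g for g in grads]
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))

        for metric in self.metrics:
            if metric.name == "loss":
                metric.update_state(loss)
            else:
                metric.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
```

Dividing by the nominal batch size keeps the scale factor close to 1 for full batches, so the learning rate does not need to be retuned.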

Another alternative, which would be preferable most of the time, is to bypass the issue by always using the same batch size. This is easy to achieve in tf.data, for instance by passing drop_remainder=True when batching the data, or by using repeat() to make the dataset repeat indefinitely.
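
For instance (dummy data and shapes, made up for illustration; 1000 samples with batch size 32 would otherwise leave a remainder batch of 8):

```python
import numpy as np
import tensorflow as tf

# Dummy data for illustration only.
features = np.random.rand(1000, 16).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Option 1: drop the smaller remainder batch at the end of each epoch.
fixed_size = dataset.batch(32, drop_remainder=True)

# Option 2: repeat the dataset indefinitely so every batch is full;
# pass steps_per_epoch to model.fit() to define epoch boundaries.
repeated = dataset.repeat().batch(32)
```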

@sachinprasadhs added the type:support and stat:awaiting response from contributor labels Sep 10, 2024