DATA_PARALLEL does not work with FusedEmbeddingBagCollection #2209

Open
imh966 opened this issue Jul 4, 2024 · 1 comment
imh966 commented Jul 4, 2024

I am trying to apply DATA_PARALLEL sharding to my small embedding tables. It works with EmbeddingBagCollection, but with FusedEmbeddingBagCollection it fails on the second backward step with the error below:

[rank4]: Traceback (most recent call last):
[rank4]: File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank4]: return _run_code(code, main_globals, None,
[rank4]: File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
[rank4]: exec(code, run_globals)
[rank4]: File "/workdir/gen_rec/rec/app/main.py", line 41, in
[rank4]: main()
[rank4]: File "/workdir/gen_rec/rec/app/main.py", line 24, in main
[rank4]: train(args)
[rank4]: File "/workdir/gen_rec/rec/app/task.py", line 17, in train
[rank4]: trainer.train()
[rank4]: File "/workdir/gen_rec/rec/training/trainer.py", line 278, in train
[rank4]: loss.backward()
[rank4]: File "/usr/local/lib/python3.9/site-packages/torch/_tensor.py", line 525, in backward
[rank4]: torch.autograd.backward(
[rank4]: File "/usr/local/lib/python3.9/site-packages/torch/autograd/init.py", line 267, in backward
[rank4]: _engine_run_backward(
[rank4]: File "/usr/local/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank4]: RuntimeError: Detected at least one parameter gradient is not the expected DDP bucket view with gradient_as_bucket_view=True. This may happen (for example) if multiple allreduce hooks were registered onto the same parameter. If you hit this error, please file an issue with a minimal repro.

Moreover, I've read the code of the FUSED compute kernel and the DENSE (DATA_PARALLEL) compute kernel. The FUSED kernel contains optimizer-related code, while the DENSE kernel has none, so it seems DATA_PARALLEL is incompatible with FusedEmbeddingBagCollection. That might also be what the error's hint about multiple hooks on the same parameter is pointing at: the fused optimizer handles the gradient of the data-parallel table itself, while DDP with gradient_as_bucket_view=True also expects to manage that parameter's gradient.

I am not sure what is really happening. Could anyone help me address this problem?
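For reference, here is a minimal sketch of my setup (table/feature names and sizes are placeholders, and distributed init, the data pipeline, and the training loop are elided; the same code with a plain EmbeddingBagCollection and an external optimizer trains fine):

```python
import torch
import torch.distributed as dist
import torchrec
from torchrec.distributed.model_parallel import (
    DistributedModelParallel,
    get_default_sharders,
)
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType
from torchrec.modules.fused_embedding_modules import FusedEmbeddingBagCollection

# dist.init_process_group(...) and device setup elided.

# A small placeholder table that the planner should place as DATA_PARALLEL.
tables = [
    torchrec.EmbeddingBagConfig(
        name="small_table",
        embedding_dim=16,
        num_embeddings=100,
        feature_names=["f1"],
    )
]

# FusedEmbeddingBagCollection fuses the optimizer update into the backward pass.
fused_ebc = FusedEmbeddingBagCollection(
    tables=tables,
    optimizer_type=torch.optim.SGD,
    optimizer_kwargs={"lr": 0.02},
    device=torch.device("meta"),
)

# Pin the small table to DATA_PARALLEL sharding.
constraints = {
    "small_table": ParameterConstraints(
        sharding_types=[ShardingType.DATA_PARALLEL.value],
    )
}
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=dist.get_world_size(), compute_device="cuda"),
    constraints=constraints,
)
plan = planner.collective_plan(
    fused_ebc, get_default_sharders(), dist.GroupMember.WORLD
)

model = DistributedModelParallel(fused_ebc, plan=plan)
# Training loop elided: the first loss.backward() succeeds, the second one
# raises the DDP bucket-view RuntimeError shown above.
```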

@JacoCheung

One note: the Fused* modules are going to be deprecated.
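The replacement is a plain EmbeddingBagCollection with the optimizer fused via apply_optimizer_in_backward, roughly like the sketch below (the exact import path has moved between torchrec/torch versions, so treat this as an approximation):

```python
import torch
import torchrec
# In newer stacks this helper lives under torch.distributed.optim instead.
from torchrec.optim.apply_optimizer_in_backward import apply_optimizer_in_backward

# Same kind of table config as in the repro above.
tables = [
    torchrec.EmbeddingBagConfig(
        name="small_table",
        embedding_dim=16,
        num_embeddings=100,
        feature_names=["f1"],
    )
]

ebc = torchrec.EmbeddingBagCollection(tables=tables, device=torch.device("meta"))

# Fuse the sparse optimizer into the backward pass, instead of wrapping the
# tables in FusedEmbeddingBagCollection.
apply_optimizer_in_backward(torch.optim.SGD, ebc.parameters(), {"lr": 0.02})

# Then shard with DistributedModelParallel as usual.
```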
