Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patches for remote sync waiting and log_local checkpointing #290

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

jeffreywpli
Copy link
Collaborator

@jeffreywpli jeffreywpli commented Jul 4, 2024

Three main changes (originally motivated by making it easier to chain together consecutive calls to open_lm.main)

  1. Allow for user to specify their own custom NCCL timeout. The default is 10 minutes which may not be enough to run final remote sync in all scenarios (causing a timeout because all other processes hit a barrier).

  2. Remove unnecessary waits during the initial "test call" of remote sync and final sync by hard-passing in 0 instead of args.sync_every into remote_sync_with_expon_backoff. These are the only instances in main where remote_sync_with_expon_backoff is called directly (as opposed to in a subprocess) and therefore causes whole script to hang when sync_every is not 0.

  3. args.log_local perhaps does not work as originally intended, since many logging calls are still gated behind an if is_master(args) instead of if is_master(args, local=args.log_local).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant