
Allow mixing for pretokenized data. #230

Open
wants to merge 7 commits into
base: main


GeorgiosSmyrnis
Collaborator

This enables mixing of pretokenized data with the tokenize_shuffle.py script. It is enabled by the --pretok_tars flag, which assumes that the tarfiles passed to the script already contain tokenized data.
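For illustration, here is a minimal sketch of how such a flag could switch between tokenizing raw text and passing stored tokens through. Only the --pretok_tars flag name comes from this PR. Treating it as a boolean switch, the JSON encoding of stored tokens, and the helpers `build_parser`, `toy_tokenize`, and `to_tokens` are all assumptions for the example, not the actual code in tokenize_shuffle.py.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flag name taken from the PR description; store_true is an assumption.
    parser.add_argument("--pretok_tars", action="store_true")
    return parser


def toy_tokenize(text: str) -> list:
    # Hypothetical stand-in tokenizer: one integer per character.
    return [ord(c) for c in text]


def to_tokens(payload: bytes, pretokenized: bool) -> list:
    """Return token ids for one tar member's payload."""
    if pretokenized:
        # Assumed storage format: a JSON list of token ids.
        return json.loads(payload.decode("utf-8"))
    return toy_tokenize(payload.decode("utf-8"))
```

With the flag set, each payload is decoded and used as-is; without it, the payload is treated as raw text and tokenized.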

@GeorgiosSmyrnis
Collaborator Author

This now also fixes a rare issue where the dataset produced by tokenize_shuffle.py could become broken due to duplicate file names within a tarfile. This could only happen when the same sequence of tokens was tokenized more than once. Files within a tarfile are now named using the simple format {shard_index}_{iterator}.
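As a rough sketch of the naming scheme, the snippet below writes one shard to an in-memory tarfile and names each member from its shard index and position, so identical token sequences still get distinct names. The helpers `member_name`, `write_shard`, and `list_names`, the `.json` extension, and the JSON payload are assumptions made for the example and may differ from what the script actually does.

```python
import io
import json
import tarfile


def member_name(shard_index: int, iterator: int) -> str:
    # Name depends only on position, never on the token content.
    return f"{shard_index}_{iterator}"


def write_shard(shard_index: int, sequences: list) -> bytes:
    """Write token sequences to an in-memory tar with unique member names."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i, tokens in enumerate(sequences):
            payload = json.dumps(tokens).encode("utf-8")
            # The ".json" extension is an assumption for this sketch.
            info = tarfile.TarInfo(name=member_name(shard_index, i) + ".json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(data: bytes) -> list:
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        return tar.getnames()
```

For example, `list_names(write_shard(3, [[1, 2], [1, 2], [5]]))` yields three distinct names even though the first two sequences are identical.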
