
Train on responses only does not work with TinyLlama-chat #1015

Open
akhlakm opened this issue Sep 11, 2024 · 2 comments
Labels: currently fixing, URGENT BUG

Comments


akhlakm commented Sep 11, 2024

The following error occurs while using train_on_responses_only on the unsloth/tinyllama-chat-bnb-4bit model.

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
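
From the failing list comprehension, this looks like it happens whenever the longest common substring across the stringified token-id lists starts with ", ": the split then yields a leading empty string, and int('') raises. A minimal sketch of the suspected failure mode (the token ids below are invented for illustration):

# Sketch of the suspected failure mode in _find_common_token_ids
# (token ids are invented for illustration):
all_input_ids = [[1, 55, 99], [2, 55, 99]]       # two tokenizations that differ only in the first id
strings = [str(x + [0]) for x in all_input_ids]  # ['[1, 55, 99, 0]', '[2, 55, 99, 0]']
substring = ", 55, 99, 0]"                       # their longest common substring starts with ", "
parts = substring.split(", ")[:-1]               # ['', '55', '99'] -- note the leading ''
ints = [int(x) for x in parts]                   # ValueError: invalid literal for int() with base 10: ''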

Link to the test notebook: https://colab.research.google.com/gist/akhlakm/c7c40b0c29d112f2544168be42d3410b/llama-3-1-8b-conversational-unsloth-2x-faster-finetuning.ipynb
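
For reference, the train_on_responses_only call has this shape; the marker strings below are an assumption, taken from TinyLlama-chat's Zephyr-style template, where each turn starts with <|user|> or <|assistant|> followed by a newline:

# Call shape used with TinyLlama-chat; the instruction/response markers are
# assumed from its Zephyr-style chat template, not verified against the notebook.
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|user|>\n",
    response_part    = "<|assistant|>\n",
)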

Also, when the chat template defined in the tokenizer_config.json file is used together with train_on_responses_only, I get the following error:

trainer_stats = trainer.train()
                    ^^^^^^^^^^^^^^^
  File "<string>", line 145, in train
  File "<string>", line 320, in _fast_inner_training_loop
  File "/home/user/unsloth_env/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 806, in torch_call
    batch = pad_without_fast_tokenizer_warning(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3560, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 778, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
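
A possible workaround for this second error (an untested sketch): the default collator pads input_ids but leaves the labels that train_on_responses_only adds as ragged lists, so swapping in a collator that also pads labels, such as transformers' DataCollatorForSeq2Seq, may avoid it:

# Untested workaround sketch: use a collator that pads `labels` as well as
# `input_ids` before calling train() on the already-built trainer.
from transformers import DataCollatorForSeq2Seq

trainer.data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)
trainer_stats = trainer.train()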
@LostRuins

I am getting the same error.

ValueError                                Traceback (most recent call last)
<ipython-input-11-5017548030e3> in <cell line: 259>()
    257 # optionally train only on resps
    258 from unsloth.chat_templates import train_on_responses_only
--> 259 trainer = train_on_responses_only(
    260     trainer,
    261     instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in train_on_responses_only(trainer, instruction_part, response_part)
   1754 
   1755     # Get most common tokens since tokenizers can tokenize stuff differently!
-> 1756     Q_must, Q_left, Q_right = _find_common_token_ids(instruction_part, tokenizer)
   1757     A_must, A_left, A_right = _find_common_token_ids(response_part,    tokenizer)
   1758 

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in _find_common_token_ids(component, tokenizer)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714     substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715     substring = substring.split(", ")[:-1]
-> 1716     substring = [int(x) for x in substring]
   1717 
   1718     # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
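
The traceback truncates the call; for context, the full invocation follows the standard Llama-3 pattern, with the response_part here being an assumed completion:

# Presumed full call (the traceback above cuts it off); the response_part is
# the usual Llama-3 assistant header, included as an assumption.
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part    = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)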

danielhanchen added the currently fixing and URGENT BUG labels on Sep 14, 2024
@danielhanchen (Contributor)

Oh whoops ok just saw your other issue @LostRuins as well - will definitely investigate this - sorry about this!
