The following error occurs when using `train_on_responses_only` with the `unsloth/tinyllama-chat-bnb-4bit` model:
```python
/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714 substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715 substring = substring.split(", ")[:-1]
-> 1716 substring = [int(x) for x in substring]
   1717
   1718 # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
```
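A minimal sketch of the failing pattern in `_find_common_token_ids` (the value `", 450"` is illustrative, not taken from the real tokenizer): if the longest common substring of the stringified token-id lists begins mid-separator, splitting on `", "` leaves an empty fragment, and `int('')` raises exactly the `ValueError` shown above.

```python
# Illustrative value, not from the real tokenizer: a common substring
# that begins mid-separator, e.g. ", 450".
substring = ", 450"
parts = substring.split(", ")[:-1]   # [''] -- an empty fragment survives
try:
    ids = [int(x) for x in parts]
except ValueError as e:
    print(e)   # invalid literal for int() with base 10: ''
```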
Additionally, when the chat template defined in the `tokenizer_config.json` file is used, `train_on_responses_only` produces the following error:
```python
trainer_stats = trainer.train()
                ^^^^^^^^^^^^^^^
  File "<string>", line 145, in train
  File "<string>", line 320, in _fast_inner_training_loop
  File "/home/user/unsloth_env/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 806, in torch_call
    batch = pad_without_fast_tokenizer_warning(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3560, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/user/unsloth_env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 778, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```
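The padding failure above suggests that, after masking, each example's `labels` feature is nested rather than a flat list of token ids. A minimal sketch of what the collator can batch, with made-up ids and a hypothetical `pad_labels` helper (neither is part of transformers):

```python
# Hypothetical illustration: a rectangular batch tensor can only be
# built from flat int lists padded to one length. A nested entry such
# as [[101, 42], [7]] is the "excessive nesting" the ValueError means.
def pad_labels(features, pad_id=-100):
    """Right-pad each flat `labels` list to the longest in the batch."""
    longest = max(len(f["labels"]) for f in features)
    return [f["labels"] + [pad_id] * (longest - len(f["labels"]))
            for f in features]

features = [{"labels": [101, 42, 102]}, {"labels": [101, 7]}]
print(pad_labels(features))  # [[101, 42, 102], [101, 7, -100]]
```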
Full traceback of the first error:

```python
ValueError                                Traceback (most recent call last)
<ipython-input-11-5017548030e3> in <cell line: 259>()
    257 # optionally train only on resps
    258 from unsloth.chat_templates import train_on_responses_only
--> 259 trainer = train_on_responses_only(
    260     trainer,
    261     instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in train_on_responses_only(trainer, instruction_part, response_part)
   1754
   1755 # Get most common tokens since tokenizers can tokenize stuff differently!
-> 1756 Q_must, Q_left, Q_right = _find_common_token_ids(instruction_part, tokenizer)
   1757 A_must, A_left, A_right = _find_common_token_ids(response_part, tokenizer)
   1758

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in _find_common_token_ids(component, tokenizer)
   1714 substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715 substring = substring.split(", ")[:-1]
-> 1716 substring = [int(x) for x in substring]
   1717
   1718 # Also get rest of tokenized string

/usr/local/lib/python3.10/dist-packages/unsloth/chat_templates.py in <listcomp>(.0)
   1714 substring = _longest_common_substring([str(x + [0]) for x in all_input_ids])
   1715 substring = substring.split(", ")[:-1]
-> 1716 substring = [int(x) for x in substring]
   1717
   1718 # Also get rest of tokenized string

ValueError: invalid literal for int() with base 10: ''
```
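The failing lines could be guarded against empty fragments. A hypothetical defensive variant of lines 1715-1716 (an assumption, not the actual upstream fix; it also drops any fragment that is not a plain digit string, such as a leading bracket):

```python
# Hypothetical guard, not the actual unsloth fix: skip fragments that
# are not plain digit strings before calling int(), so a substring
# that begins mid-separator (e.g. ", 450, 29") no longer raises.
def parse_token_ids(substring: str) -> list[int]:
    return [int(x) for x in substring.split(", ")[:-1]
            if x.strip().isdigit()]

print(parse_token_ids(", 450, 29"))  # [450] -- no ValueError
```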
Link to the test notebook: https://colab.research.google.com/gist/akhlakm/c7c40b0c29d112f2544168be42d3410b/llama-3-1-8b-conversational-unsloth-2x-faster-finetuning.ipynb