Higgs Dataset - ValueError on download_and_prepare() #5428

Open
zwouter opened this issue May 27, 2024 · 1 comment
Labels
bug Something isn't working

Comments

zwouter commented May 27, 2024

Short description
The Higgs dataset cannot be prepared: download_and_prepare() raises a ValueError, probably because a row contains unexpected missing values.

Environment information

  • Operating System: Windows 11

  • Python version: 3.11.1

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.9.4

  • tensorflow/tf-nightly version: tensorflow 2.16.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes.

Reproduction instructions

import tensorflow_datasets as tfds

ds_builder = tfds.builder('higgs')
ds_builder.download_and_prepare()

Logs

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\WSUIDGEE\tensorflow_datasets\higgs\2.0.0...
Extraction completed...: 0 file [00:00, ? file/s]████████████████████████████████████████| 1/1 [00:00<00:00, 157.03 url/s] 
Dl Size...: 100%|█████████████████████████████████████████████| 2816407858/2816407858 [00:00<00:00, 300620199629.49 MiB/s] 
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 96.44 url/s] 
Generating splits...:   0%|                                                                    | 0/1 [00:00<?, ? splits/s] 
Traceback (most recent call last):
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 105, in <module>
    evaluate_configuration(
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 87, in evaluate_configuration
    ds = Dataset(dataset)
         ^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 17, in __init__
    trains_ds, vals_ds, test_ds = self.__load_dataset(dataset_name, k_folds)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 46, in __load_dataset
    ds_builder.download_and_prepare()
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\logging\__init__.py", line 168, in __call__
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 691, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 1584, in _download_and_prepare
    future = split_builder.submit_split_generation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 341, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 417, in _build_from_generator
    utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 415, in _build_from_generator
    example = self._features.encode_example(example)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 243, in encode_example
    utils.reraise(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 241, in encode_example
    example[k] = feature.encode_example(example_value)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\tensor_feature.py", line 175, in encode_example
    example_data = np.array(example_data, dtype=np_dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Failed to encode example:
{'class_label': '1.000000000000000000e+00', 'lepton_pT': '3.647371232509613037e-01', 'lepton_eta': '1.489144206047058105e+00', 'lepton_phi': '3.394368290901184082e-01', 'missing_energy_magnitude': '1.493860602378845215e+00', 'missing_energy_phi': '-1.723330497741699219e+00', 'jet_1_pt': '7.524616718292236328e-01', 'jet_1_eta': '-2.802605032920837402e-01', 'jet_1_phi': '-4.207125604152679443e-01', 'jet_1_b-tag': '2.173076152801513672e+00', 'jet_2_pt': '', 'jet_2_eta': None, 'jet_2_phi': None, 'jet_2_b-tag': None, 'jet_3_pt': None, 'jet_3_eta': None, 'jet_3_phi': None, 'jet_3_b-tag': None, 'jet_4_pt': None, 'jet_4_eta': None, 'jet_4_phi': None, 'jet_4_b-tag': None, 'm_jj': None, 'm_jjj': None, 'm_lv': None, 'm_jlv': None, 'm_bb': None, 'm_wbb': None, 'm_wwbb': None}
In <Tensor> with name "jet_2_pt":
could not convert string to float: ''
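
The failing example has an empty string for jet_2_pt and None for every later column, which is the pattern Python's csv.DictReader produces for a row that ends early. The sketch below reproduces that pattern and the resulting ValueError outside of tfds. It assumes the builder parses the CSV with csv.DictReader, which this issue does not confirm; the four-column header and the truncated row are made up for illustration.

```python
import csv
import io

import numpy as np

# Hypothetical shortened header; the real dataset has 29 columns.
fieldnames = ["class_label", "jet_1_pt", "jet_2_pt", "jet_2_eta"]

# Hypothetical row that stops right after a trailing comma.
truncated_row = "1.0,0.75,\n"

row = next(csv.DictReader(io.StringIO(truncated_row), fieldnames=fieldnames))
# The field after the last comma is '', and fields with no data at all are None.
print(row)

try:
    # Same conversion as in tensor_feature.py from the traceback above.
    np.array(row["jet_2_pt"], dtype=np.float32)
except ValueError as err:
    message = str(err)
    print(message)
```

If this is what happens here, the row was cut short in the local file rather than missing values in the published dataset, so checking the downloaded archive for truncation may be worthwhile.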

Expected behavior
I expect the dataset to be downloaded and prepared such that I can quickly load it in the future.

Additional context
I am new to using tfds, but other datasets (e.g. MNIST, CIFAR10) work as intended.
The dataset is not supposed to have missing values, according to https://archive.ics.uci.edu/dataset/280/higgs

@zwouter zwouter added the bug Something isn't working label May 27, 2024
copybara-service bot pushed a commit that referenced this issue May 31, 2024
Fixes issue #5428.

PiperOrigin-RevId: 639031084

marcenacp (Collaborator) commented Jun 3, 2024

Could this be a Windows-specific issue? I can't reproduce it locally: download_and_prepare completes successfully for this dataset on my machine. If the problem persists, you could also try filtering out rows with missing values (example).

If you find a fix for Windows, please feel free to open a PR :) Thanks!
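
As a rough illustration of the filtering idea, the sketch below skips any CSV row that has an empty or absent field before it would be converted to floats. The helper name iter_complete_rows and the sample data are hypothetical, and this is not the code from the referenced fix; it only shows one way such a filter could look.

```python
import csv
import io


def iter_complete_rows(lines, fieldnames):
    """Yield only rows in which every named field is present and non-empty."""
    for row in csv.DictReader(lines, fieldnames=fieldnames):
        # DictReader uses None for fields missing from a short row
        # and '' for a field that is present but empty.
        if all(value not in (None, "") for value in row.values()):
            yield row


# Hypothetical sample: the second row ends early and is skipped.
sample = io.StringIO("1.0,0.5,0.2\n0.0,0.7,\n")
rows = list(iter_complete_rows(sample, ["class_label", "jet_1_pt", "jet_2_pt"]))
print(rows)
```

Silently dropping rows hides a truncated or corrupted download, so logging how many rows were skipped would be a sensible addition in real use.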
