python tensorflow in nvidia-enabled tumbleweed and fedora distroboxes unable to talk to GPU #230

Open
alexispurslane opened this issue Apr 12, 2024 · 11 comments

Comments

@alexispurslane

Symptoms

Whether I create an ephemeral Fedora (Rawhide or 39) distrobox with --nvidia, or use the Tumbleweed distrobox I created from a distrobox-assemble with nvidia=true, and whether I install into a Python venv with pip install tensorflow[and-cuda] or system-wide with pip install --break-system-packages tensorflow[and-cuda], a fresh install gives me this output when I try to use tensorflow with my GPU:

$ python3
Python 3.12.2 (main, Feb 21 2024, 00:00:00) [GCC 14.0.1 20240217 (Red Hat 14.0.1-0)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-04-12 14:29:54.089365: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-12 14:29:54.126693: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-12 14:29:54.786606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> tf.config.list_logical_devices()
2024-04-12 14:29:57.714446: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-12 14:29:57.714953: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]

Steps to reproduce

  1. Create a distrobox with nvidia enablement, either tumbleweed or fedora (and probably others)
  2. Install tensorflow with cuda
  3. Run import tensorflow as tf; tf.config.list_logical_devices()
  4. Observe results
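
For convenience, step 3 can be wrapped in a small script. This is only a sketch: the helper name list_tf_devices is made up for illustration, and it relies on nothing beyond tf.config.list_logical_devices() as used above. It returns None when tensorflow is not installed, so the two failure cases can be told apart.

```python
import importlib.util


def list_tf_devices():
    """Return the device names TensorFlow can see, or None if TensorFlow is absent."""
    if importlib.util.find_spec("tensorflow") is None:
        return None  # tensorflow is not installed in this environment
    import tensorflow as tf  # imported lazily: the import is slow and noisy

    return [device.name for device in tf.config.list_logical_devices()]


if __name__ == "__main__":
    devices = list_tf_devices()
    if devices is None:
        print("tensorflow is not installed")
    elif any("GPU" in name for name in devices):
        print("GPU visible:", devices)
    else:
        print("CPU only:", devices)
```

On the affected containers this should report "CPU only", matching the output above.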
@m2Giles
Member

m2Giles commented Apr 12, 2024

Can you make sure that nvidia-smi can run inside the distrobox?

Next, make sure that the NVIDIA Container Toolkit file is on disk. We have a systemd oneshot, whose name starts with ublue-, that makes sure it is written.

@alexispurslane
Author

Can you make sure that nvidia-smi can run inside the distrobox?

It can, in both. I double checked that.

Next, make sure that the NVIDIA Container Toolkit file is on disk. We have a systemd oneshot, whose name starts with ublue-, that makes sure it is written.

The systemd oneshot is there. I ran sudo systemctl start on it, and it reports having done some things and deactivated successfully. Let me check whether that changes anything.

@alexispurslane
Author

No, rerunning the nvidia toolkit systemd service didn't change anything. nvidia-smi runs, but tensorflow says it's missing libraries.

@alexispurslane
Author

@m2Giles using the official tensorflow container doesn't work either:

$ distrobox ephemeral --nvidia --image tensorflow/tensorflow:latest
Creating 'distrobox-1RfsfYcHQM' using image tensorflow/tensorflow:latest	 [ OK ]
Distrobox 'distrobox-1RfsfYcHQM' successfully created.
To enter, run:

distrobox enter distrobox-1RfsfYcHQM

Starting container...                   	 [ OK ]
Executing pre-init hooks...             	 [ OK ]
Installing basic packages...            	 [ OK ]
Setting up devpts mounts...             	 [ OK ]
Setting up read-only mounts...          	 [ OK ]
Setting up read-write mounts...         	 [ OK ]
Setting up host's sockets integration...	 [ OK ]
Setting up host's nvidia integration... 	 [ OK ]
Integrating host's themes, icons, fonts...	 [ OK ]
Setting up package manager exceptions...	 [ OK ]
Setting up package manager hooks...     	 [ OK ]
Setting up dpkg exceptions...           	 [ OK ]
Setting up apt hooks...                 	 [ OK ]
Setting up distrobox profile...         	 [ OK ]
Setting up sudo...                      	 [ OK ]
Setting up user groups...               	 [ OK ]
Setting up kerberos integration...      	 [ OK ]
Setting up user's group list...         	 [ OK ]
Setting up existing user...             	 [ OK ]
Setting up user home...                 	 [ OK ]
Ensuring user's access...               	 [ OK ]
Executing init hooks...                 	 [ OK ]

Container Setup Complete!

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

/sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Permission denied
📦[alexispurslane@distrobox-1RfsfYcHQM ~]$ python3
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-04-13 09:00:19.559240: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-13 09:00:19.732742: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_logical_devices()
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
>>> 

@alexispurslane
Author

I've deleted and recreated the distrobox several times now and it still doesn't work.

@alexispurslane
Author

It looks like the problem might be because the libcudnn and libcudart libraries aren't installed? I don't remember ever needing them in order for tensorflow to work though. NVCC is also missing.
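
A quick way to check which of these libraries the dynamic loader can actually open is to try loading them with ctypes. This is a rough sketch, not an authoritative test: the sonames in CANDIDATES are my assumption for a CUDA 12 / cuDNN 8 stack and may need adjusting, and the check only covers the loader's default search path plus LD_LIBRARY_PATH, so libraries shipped inside pip wheels will show as missing unless their directories are on that path.

```python
import ctypes


def can_dlopen(names):
    """Map each shared-library name to True if the dynamic loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the loader cannot find or open the library
            results[name] = False
    return results


# Assumed sonames for a CUDA 12 / cuDNN 8 stack; adjust to your versions.
CANDIDATES = ["libcuda.so.1", "libcudart.so.12", "libcudnn.so.8"]

if __name__ == "__main__":
    for name, ok in can_dlopen(CANDIDATES).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```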

@alexispurslane
Author

Still no luck whatsoever with this. Going to temporarily rebase from my derivative image to silverblue-nvidia to see if that's the cause.

@alexispurslane
Author

Update! When I tested the official tensorflow docker image, I was actually using the wrong image! With tensorflow/tensorflow:latest-gpu (I'd forgotten that last -gpu part!) tensorflow GPU detection actually works:

📦[alexispurslane@distrobox-zdOEmebIu3 ~]$ nvidia-smi
Sat Apr 20 17:25:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8             15W /  115W |       4MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
📦[alexispurslane@distrobox-zdOEmebIu3 ~]$ python3
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-04-20 17:25:28.934919: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-20 17:25:29.123949: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_logical_devices('GPU')
...
2024-04-20 17:25:32.286955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6272 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3070 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
>>> 

So now I just need to investigate why installing tensorflow on tumbleweed, ubuntu, or fedora (39 or rawhide) manually doesn't work, but the official docker image does! And in the meantime I can just run things in the official docker image, and that's... acceptable.

@alexispurslane
Author

I recreated my fedora container again and am also getting a slightly different and more enlightening error from tf this time:

2024-04-20 17:39:54.074067: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

although:

$ whereis libcuda.so
libcuda.so: /usr/lib64/libcuda.so
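
Since whereis only shows that a file exists, a further check is to load the driver library and call cuInit, the CUDA driver API's initialisation function, which returns 0 (CUDA_SUCCESS) when the driver is usable. A minimal sketch, assuming the soname libcuda.so.1 (whereis above only showed libcuda.so); the helper name cuda_driver_init is made up for illustration.

```python
import ctypes


def cuda_driver_init(libname="libcuda.so.1"):
    """Try to initialise the CUDA driver API.

    Returns None if the library cannot be loaded, otherwise the integer
    CUresult returned by cuInit (0 means CUDA_SUCCESS).
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None  # the loader could not open the driver library
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    code = cuda_driver_init()
    if code is None:
        print("libcuda could not be loaded")
    else:
        print("cuInit returned", code)
```

If this returns 0 while tensorflow still reports no CUDA drivers, the driver itself is probably fine and the problem is more likely in the CUDA runtime libraries that tensorflow loads.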

@alexispurslane
Author

alexispurslane commented Apr 21, 2024

It looks like the problem might be that the latest version of tensorflow (2.16.1) doesn't actually support the latest CUDA version (12.4), since the people here and here seem to be having a similar problem and the table here indicates CUDA 12.4 isn't officially supported, and the official container and my old tumbleweed container had CUDA 12.3 while the new ones have 12.4.
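
To compare versions inside a given container, one option is to list the NVIDIA wheels pip installed and ask tensorflow which CUDA version it was built against. This is a sketch under assumptions: it assumes the CUDA wheels pulled in by tensorflow[and-cuda] have distribution names starting with nvidia-, and it uses tf.sysconfig.get_build_info(), whose cuda_version key may be absent on CPU-only builds. Note also that, as far as I know, the "CUDA Version" in the nvidia-smi header is the highest version the driver supports, not necessarily the toolkit version installed in the container.

```python
import importlib.util
from importlib import metadata


def installed_nvidia_wheels():
    """Return {distribution name: version} for pip packages named 'nvidia-*'."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name", "") or ""
        if name.lower().startswith("nvidia-"):
            found[name] = dist.version
    return found


def tf_build_cuda_version():
    """Return the CUDA version tensorflow was built against, or None if unknown."""
    if importlib.util.find_spec("tensorflow") is None:
        return None  # tensorflow is not installed
    import tensorflow as tf

    return tf.sysconfig.get_build_info().get("cuda_version")


if __name__ == "__main__":
    print("tensorflow built against CUDA:", tf_build_cuda_version())
    for name, version in sorted(installed_nvidia_wheels().items()):
        print(f"{name}=={version}")
```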

@sgkouzias

sgkouzias commented Apr 22, 2024

It looks like the problem might be that the latest version of tensorflow (2.16.1) doesn't actually support the latest CUDA version (12.4), since the people here and here seem to be having a similar problem and the table here indicates CUDA 12.4 isn't officially supported, and the official container and my old tumbleweed container had CUDA 12.3 while the new ones have 12.4.

@alexispurslane from the table you linked, it is clear that TensorFlow version 2.16.1 is compatible with CUDA 12.3 (and not with version 12.4). However, it turns out that when you pip install tensorflow[and-cuda], all the compatible libraries required to use the GPU are installed along with TensorFlow! The issue is that you have to add their paths to your environment yourself. For example, if you ran pip install tensorflow[and-cuda] in an activated virtual environment with python=3.11, then you should manually:

  • export LD_LIBRARY_PATH=~/anaconda3/lib/python3.11/site-packages/nvidia/cudnn/lib/:$LD_LIBRARY_PATH
  • export PATH=~/anaconda3/lib/python3.11/site-packages/nvidia/cuda_nvcc/bin/:$PATH
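
The paths above are specific to one Anaconda install. As a rough sketch, the library directories can be discovered instead of hard-coded. It assumes the wheels keep the site-packages/nvidia/<component>/lib layout seen in the paths above; the helper name nvidia_lib_dirs is made up for illustration.

```python
import sysconfig
from pathlib import Path


def nvidia_lib_dirs(site_packages=None):
    """Return the sorted nvidia/*/lib directories under a site-packages directory."""
    # Default to the active interpreter's platform-specific site-packages.
    root = Path(site_packages or sysconfig.get_paths()["platlib"])
    return sorted(str(p) for p in root.glob("nvidia/*/lib") if p.is_dir())


if __name__ == "__main__":
    dirs = nvidia_lib_dirs()
    if dirs:
        joined = ":".join(dirs)
        # Print a line that can be pasted into the shell or a venv activate script.
        print(f'export LD_LIBRARY_PATH="{joined}${{LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}}"')
    else:
        print("# no nvidia/*/lib directories found")
```

Run it with the virtual environment's own python so that the right site-packages directory is searched, then paste the printed export line into the shell before starting python.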

Thus, it seems practically impossible for someone with a CUDA-enabled GPU to run deep learning experiments locally with TensorFlow version 2.16.1 and use their GPU without manually performing some extra steps that are not (as of today) included in the official TensorFlow installation documentation for Linux users with GPUs, at least as a temporary fix! That's why I submitted a pull request in good faith and for the sake of all users, as TensorFlow is "An Open Source Machine Learning Framework for Everyone".

tensorflow/docs#2299 (comment)
