Why do my TensorFlow scripts freeze forever when run as Slurm jobs?

I run into this when using the Slurm workload manager ( http://slurm.schedmd.com/ ). When I submit Python scripts that use TensorFlow, the job sometimes freezes forever, and the only output is the error log attached below. It seems that TensorFlow cannot find the CUDA libraries, but I am running scripts that do not require a GPU, so I am confused why this would be a problem at all. Why does the installed CUDA stack cause a problem if I don't need it?

The only useful information I got from the slurm-<job_id>.out output file was the following:

    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
    I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
    E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) """
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

I always thought that TensorFlow does not require a GPU, so I assume the last message about no GPU devices being available is not the real cause of the problem (correct me if I am wrong).

I also don't understand why the CUDA libraries are needed at all. I am not trying to run my jobs on the GPU; why does TensorFlow need a CUDA library if my jobs run entirely on the CPU?
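
For concreteness, here is a minimal sketch of the kind of CPU-only script involved (a hypothetical stand-in, not my actual script), written against the TensorFlow 0.x/1.x graph API that matches the log messages above; every op is pinned to /cpu:0 and nothing requests a GPU:

    # Hypothetical stand-in for the kind of CPU-only script being submitted
    # (not the actual script). Uses the TF 0.x/1.x graph API that matches
    # the log messages above and pins every op to the CPU.
    import tensorflow as tf

    with tf.device('/cpu:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
        c = tf.matmul(a, b)

    # log_device_placement=True prints which device each op actually runs on,
    # confirming that everything stays on the CPU.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))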


I also tried logging into the node directly and running the TensorFlow script there, and I did not get an explicit error:

    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
    I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

whereas I was expecting the same error output that appears in the batch job log:

    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
    I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
    E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) """
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

I also opened an issue in the official TensorFlow repository on GitHub:

https://github.com/tensorflow/tensorflow/issues/3632

2 answers

There seems to be a bug in how TensorFlow interacts with Slurm when the job is submitted as a batch job.

For now, I work around it by running the job with srun on Slurm.

In your case, the log also shows that you installed the GPU build of TensorFlow and are running it on a machine that has no GPU, which is what produces the additional CUDA error messages.
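
If the GPU build has to stay installed, one general way to keep TensorFlow strictly on the CPU is to hide the CUDA devices before TensorFlow initializes and to ask the session for zero GPU devices. This is a sketch, not part of the original answer, and it assumes the TensorFlow 0.x/1.x API seen in the logs:

    # A general way to keep the GPU build of TensorFlow strictly on the CPU
    # (a sketch, not something from the answer above).
    import os

    # Hide all CUDA devices from this process. This must happen before
    # TensorFlow initializes CUDA, so set it before the import.
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    import tensorflow as tf

    # Also tell the session not to create any GPU devices.
    config = tf.ConfigProto(device_count={'GPU': 0})
    with tf.Session(config=config) as sess:
        print(sess.run(tf.constant('CPU-only session is working')))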


I had a similar problem, and I was able to narrow it down to the job freezing when writing the model to the Lustre file system. Still waiting for a real solution.
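
For reference, the write that hung was an ordinary checkpoint save onto the shared file system; a minimal hypothetical sketch (the path, variable, and TF 0.x/1.x API are assumptions, not the actual code):

    # Hypothetical sketch of the kind of checkpoint write that hung when the
    # target path was on the shared Lustre mount (path, variable name, and
    # TF 0.x/1.x API are assumptions, not the actual code).
    import tensorflow as tf

    w = tf.Variable(tf.zeros([10, 10]), name='w')
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        # The freeze occurred during a save like this one when the
        # destination directory lived on the shared file system.
        saver.save(sess, '/path/to/lustre/scratch/model.ckpt')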

