FAQ

Error Messages

I am getting the following error message when executing the program in evaluation mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/v/vdumont/hyppo/bin/hpo.py", line 35, in <module>
    main()
  File "/global/homes/v/vdumont/hyppo/bin/hpo.py", line 32, in main
    eval(args.operation)(config)
  File "/global/homes/v/vdumont/hyppo/hyppo/evaluation.py", line 24, in evaluation
    samples = make_samples(**config['prms'])
TypeError: make_samples() missing 1 required positional argument: 'nevals'

Solution

This error appears when nevals is missing from the prms section of the configuration file, typically because it was placed in another section. Make sure the nevals parameter is defined within the prms section.

Note

This parameter was placed within the hpo section in early versions of the software. It was moved to the prms section starting with version 0.1.0 to better accommodate newly built functions.
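
A quick pre-flight check can catch a misplaced nevals before a job is submitted. The following is a minimal sketch, assuming the configuration file has already been loaded into a nested dictionary; the helper name check_nevals is hypothetical and is not part of the software.

```python
def check_nevals(config):
    """Return a list of problems with the placement of nevals (empty if fine)."""
    problems = []
    prms = config.get("prms") or {}
    hpo = config.get("hpo") or {}
    if "nevals" not in prms:
        problems.append("nevals is missing from the prms section")
    if "nevals" in hpo:
        problems.append("nevals found in the hpo section; move it to prms")
    return problems


# Illustrative dictionaries only; real configuration files contain more keys.
old_style = {"hpo": {"nevals": 10}, "prms": {}}   # layout used before 0.1.0
new_style = {"hpo": {}, "prms": {"nevals": 10}}   # layout expected since 0.1.0

print(check_nevals(old_style))
print(check_nevals(new_style))
```

An empty list means nevals is where make_samples expects to find it.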

I am getting the following error message when executing the program in evaluation mode. What should I do?

2021-08-20 16:41:04.516502: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory

Solution

This can happen if the Anaconda environment was created while another environment (such as the base environment) was active. Make sure you deactivate all environments before creating and activating a new one.

I am getting the following error message when executing the program in evaluation mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 35, in <module>
    main()
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 32, in main
    eval('hpo_uq.'+args.operation)(config)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/evaluation.py", line 40, in evaluation
    samples = make_samples(**config['prms'])
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/sampling.py", line 41, in make_samples
    samples = numpy.loadtxt(record)
  File "/usr/common/software/pytorch/1.7.1/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1092, in loadtxt
    first_line = next(fh)
  File "/usr/common/software/pytorch/1.7.1/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Solution

Double-check that you are pointing to the correct samples.txt in the configuration file (the record entry under the prms section). The file must be a plain-text file that numpy.loadtxt can read.
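
A byte 0x80 at position 0 is, for instance, how Python pickle files (protocol 2 and later) begin, so one plausible cause is that record points to a binary file rather than to samples.txt. The sketch below shows one way to test a path before launching a run; the helper name looks_like_text and the file names are hypothetical.

```python
import codecs
import os
import pickle
import tempfile


def looks_like_text(path, nbytes=1024):
    """Return True if the first bytes of the file decode as UTF-8 text."""
    with open(path, "rb") as fh:
        head = fh.read(nbytes)
    # An incremental decoder tolerates a multi-byte character cut at the end.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # A plain-text sample file of the kind numpy.loadtxt can read.
    text_path = os.path.join(tmp, "samples.txt")
    with open(text_path, "w", encoding="utf-8") as fh:
        fh.write("1 2 3\n4 5 6\n")
    # A binary (pickle) file, which starts with byte 0x80.
    binary_path = os.path.join(tmp, "samples.pkl")
    with open(binary_path, "wb") as fh:
        pickle.dump([[1, 2, 3]], fh)
    text_ok = looks_like_text(text_path)
    binary_ok = looks_like_text(binary_path)

print(text_ok, binary_ok)
```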

I am getting the following error message when executing the program in evaluation mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 35, in <module>
    main()
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
    config = hpo_uq.load_config(**vars(args))
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/config.py", line 34, in load_config
    config['dist'] = {**config['dist'],**get_workers(**config)}
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/__init__.py", line 9, in get_workers
    return backend_from_library(dist,**model)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/__init__.py", line 17, in backend_from_library
    return module.init_workers(**dist)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/tensorflow.py", line 45, in init_workers
    return init_workers_gpu()
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/tensorflow.py", line 20, in init_workers_gpu
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
IndexError: list index out of range

Solution

This error indicates that no GPU was found for the current process even though the program is trying to use GPUs. Check that the GPU version of TensorFlow is being used.

Note

Double-check that the cgpu module has been loaded as well (module load cgpu). If cgpu is not loaded, the following messages will likely appear:

sbatch: error: No architecture specified, cannot estimate job costs.
sbatch: error: Batch job submission failed: Unspecified error
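
The IndexError above comes from indexing the list of GPUs with the local rank. A guard of the following kind would turn it into a more descriptive message. This is only a sketch: select_gpu is a hypothetical helper, not a function of the software, and plain strings stand in for device objects so that the example runs without TensorFlow.

```python
def select_gpu(gpus, local_rank):
    """Return the device assigned to this rank, or raise a descriptive error."""
    if not gpus:
        raise RuntimeError(
            "No GPUs are visible; check the TensorFlow build and loaded modules."
        )
    if local_rank >= len(gpus):
        raise RuntimeError(
            f"Local rank {local_rank} has no matching GPU; "
            f"only {len(gpus)} visible."
        )
    return gpus[local_rank]


# Two visible devices, second rank: returns the second entry.
chosen = select_gpu(["GPU:0", "GPU:1"], 1)
print(chosen)

# No visible devices: the guard reports the problem instead of an IndexError.
try:
    select_gpu([], 0)
except RuntimeError as err:
    message = str(err)
    print(message)
```

In an actual run, the list would typically come from tf.config.list_physical_devices('GPU') and the rank from hvd.local_rank(), as in the traceback.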

I am getting the following error message when executing the program in surrogate mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
    main()
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
    eval('hpo_uq.'+args.operation)(config)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/__init__.py", line 23, in surrogate
    opti = extract_evals(opti)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/extract.py", line 55, in extract_evals
    assert len(opti.samples)>0, 'No samples found for surrogate modeling. Abort.'
AssertionError: No samples found for surrogate modeling. Abort.

Solution

When running surrogate modeling, make sure that the configuration file is pointing to a folder containing the complete log files from running evaluations.

I am getting the following error message when executing the program in surrogate mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
    main()
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
    eval('hpo_uq.'+args.operation)(config)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/__init__.py", line 26, in surrogate
    opti = eval(opti.surrogate)(opti)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/rbf_opt.py", line 156, in rbf
    if opti.config['uq']['uq_on']==True:
KeyError: 'uq'

Solution

When running RBF surrogate modeling, the uq section must be included in the configuration file. If UQ is not desired, its options can be set to False:

uq:
  uq_on      : False
  uq_hpo     : False
  uq_weights : [0.5, 0.5]
  data_noise : 0.0
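
If many configuration files need to be checked, the missing section can also be detected or filled in programmatically. The sketch below assumes the configuration is available as a nested dictionary and reuses the values listed above as defaults; with_uq_defaults is a hypothetical helper, not part of the software.

```python
def with_uq_defaults(config):
    """Return a copy of config whose uq section has every expected key."""
    defaults = {
        "uq_on": False,
        "uq_hpo": False,
        "uq_weights": [0.5, 0.5],
        "data_noise": 0.0,
    }
    merged = dict(config)
    # Values already present in the uq section take precedence over defaults.
    merged["uq"] = {**defaults, **(config.get("uq") or {})}
    return merged


no_uq = with_uq_defaults({"prms": {"nevals": 10}})
partial_uq = with_uq_defaults({"uq": {"uq_on": True}})

print(no_uq["uq"]["uq_on"])
print(partial_uq["uq"])
```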

I am getting the following error message when executing the program in evaluation mode. What should I do?

sbatch: error: Batch job submission failed: Node count specification invalid

Solution

This message occurs if the chosen number of SLURM steps and SLURM tasks exceeds the available resources. Adjust nsteps and ntasks in the configuration file. Make sure to regenerate the SLURM script once the configuration file has been modified.

I am getting the following error message when executing the program in evaluation mode. What should I do?

Traceback (most recent call last):
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
    main()
  File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
    eval('hpo_uq.'+args.operation)(config)
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/evaluation.py", line 41, in evaluation
    samples = make_samples(**config['prms'])
  File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/sampling.py", line 69, in make_samples
    samples = numpy.zeros((nevals,len(names)),dtype=int)
ValueError: negative dimensions are not allowed

Solution

Make sure that the upper bound of each parameter is greater than its lower bound. Also check that every parameter is assigned a value in each of the lists mult, xlow, and xup, i.e., the length of each list should equal the total number of parameters.
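
Both conditions can be verified before submitting a job. The following sketch assumes that the parameter names and the three lists have been read from the configuration file into Python lists; check_bounds and the example values are hypothetical.

```python
def check_bounds(names, mult, xlow, xup):
    """Return a list of problems found in the parameter bounds (empty if fine)."""
    problems = []
    # Every list must provide exactly one value per parameter.
    for label, values in (("mult", mult), ("xlow", xlow), ("xup", xup)):
        if len(values) != len(names):
            problems.append(
                f"{label} has {len(values)} entries, expected {len(names)}"
            )
    # Each upper bound must be strictly greater than its lower bound.
    for name, low, up in zip(names, xlow, xup):
        if up <= low:
            problems.append(
                f"{name}: upper bound {up} is not greater than lower bound {low}"
            )
    return problems


names = ["layers", "nodes", "epochs"]
good = check_bounds(names, mult=[1, 10, 5], xlow=[1, 1, 1], xup=[5, 10, 20])
bad = check_bounds(names, mult=[1, 10], xlow=[1, 8, 1], xup=[5, 2, 20])

print(good)
print(bad)
```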

I am getting the following error message when executing the program in evaluation mode. What should I do?

2021-08-20 20:56:49.378624: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8';
dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/esslurm/lib64:/opt/cray/job/2.2.4-7.0.1.1_3.50__g36b56f4.ari/lib64:/opt/intel/compilers_and_libraries_2019.3.199/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2019.3.199/linux/mkl/lib/intel64:/usr/common/software/darshan/3.2.1/lib

Solution

This error message may be displayed if the wrong ML library (PyTorch or TensorFlow) has been loaded, or if both PyTorch and TensorFlow are loaded at the same time.