FAQ
Error Messages
I am getting the following error message when executing the program in evaluation mode. What should I do?
Traceback (most recent call last):
File "/global/homes/v/vdumont/hyppo/bin/hpo.py", line 35, in <module>
main()
File "/global/homes/v/vdumont/hyppo/bin/hpo.py", line 32, in main
eval(args.operation)(config)
File "/global/homes/v/vdumont/hyppo/hyppo/evaluation.py", line 24, in evaluation
samples = make_samples(**config['prms'])
TypeError: make_samples() missing 1 required positional argument: 'nevals'
Solution
This error message may be displayed if the nevals information in the configuration file is placed in the wrong section. Make sure the nevals parameter is placed within the prms section in the configuration file.
Note
This parameter used to be placed within the hpo section in early versions of the software. However, it was moved to the prms section starting with version 0.1.0 to better accommodate newly built functions.
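The placement of nevals can be checked before launching an evaluation. The following is a minimal sketch, assuming the configuration file has already been loaded into a nested dictionary with top-level prms and hpo sections. The helper name check_nevals_placement and the example values are hypothetical and are not part of the package.

```python
def check_nevals_placement(config):
    """Return a short message describing where `nevals` was found."""
    prms = config.get("prms") or {}
    hpo = config.get("hpo") or {}
    if "nevals" in prms:
        return "ok"
    if "nevals" in hpo:
        # Layout used by early versions of the software.
        return "move nevals from the hpo section to the prms section"
    return "nevals is missing; add it to the prms section"


# Hypothetical configuration written for an early version of the software.
old_style = {"hpo": {"nevals": 10}, "prms": {"names": ["layers"]}}
print(check_nevals_placement(old_style))
```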
I am getting the following error message when executing the program in evaluation mode. What should I do?
2021-08-20 16:41:04.516502: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
Solution
This can happen if the Anaconda environment was created while another environment (for example, the base environment) was still active. Make sure you deactivate all environments before creating and activating a new one.
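Whether an environment is still active can be checked from Python before creating a new one. This is a rough sketch, assuming conda exports the CONDA_SHLVL (nesting level) and CONDA_DEFAULT_ENV (active environment name) variables, as recent conda versions do. The function name active_conda_envs is made up for illustration.

```python
import os


def active_conda_envs(environ=os.environ):
    """Return (nesting level, active environment name) from conda's variables."""
    # CONDA_SHLVL is unset or "0" when no environment is active.
    level = int(environ.get("CONDA_SHLVL", "0") or 0)
    name = environ.get("CONDA_DEFAULT_ENV", "")
    return level, name


level, name = active_conda_envs()
if level > 0:
    print(f"conda environment '{name}' is still active (nesting level {level})")
else:
    print("no conda environment is active")
```

A nesting level above zero means at least one environment still needs to be deactivated.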
I am getting the following error message when executing the program in evaluation mode. What should I do?
Traceback (most recent call last):
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 35, in <module>
main()
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 32, in main
eval('hpo_uq.'+args.operation)(config)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/evaluation.py", line 40, in evaluation
samples = make_samples(**config['prms'])
File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/sampling.py", line 41, in make_samples
samples = numpy.loadtxt(record)
File "/usr/common/software/pytorch/1.7.1/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1092, in loadtxt
first_line = next(fh)
File "/usr/common/software/pytorch/1.7.1/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Solution
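Double-check that the record entry in the prms section of the configuration file points to the correct samples.txt file. The invalid start byte reported in the traceback suggests that the file being read is not a plain-text file.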
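A file can be tested for plain-text content before it is passed to the program. The sketch below is only an illustration: it reads the first bytes of a file and reports whether they decode as UTF-8, which is the same decoding step that fails in the traceback. The helper name looks_like_text and the throwaway file names are assumptions made for this example.

```python
import codecs
import os
import tempfile


def looks_like_text(path, nbytes=1024):
    """Return True if the first bytes of the file decode as UTF-8 text."""
    with open(path, "rb") as fh:
        head = fh.read(nbytes)
    # An incremental decoder tolerates a multi-byte character cut at the end.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Demonstration with two throwaway files (names are illustrative only).
tmpdir = tempfile.mkdtemp()
text_file = os.path.join(tmpdir, "samples.txt")
binary_file = os.path.join(tmpdir, "samples.bin")
with open(text_file, "w") as fh:
    fh.write("1 2 3\n4 5 6\n")
with open(binary_file, "wb") as fh:
    fh.write(b"\x80\x04\x95")  # starts with byte 0x80, as in the traceback

print(looks_like_text(text_file))
print(looks_like_text(binary_file))
```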
I am getting the following error message when executing the program in evaluation mode. What should I do?
Traceback (most recent call last):
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 35, in <module>
main()
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
config = hpo_uq.load_config(**vars(args))
File "/global/homes/a/agt17/hpo_uq/hpo_uq/config.py", line 34, in load_config
config['dist'] = {**config['dist'],**get_workers(**config)}
File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/__init__.py", line 9, in get_workers
return backend_from_library(dist,**model)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/__init__.py", line 17, in backend_from_library
return module.init_workers(**dist)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/tensorflow.py", line 45, in init_workers
return init_workers_gpu()
File "/global/homes/a/agt17/hpo_uq/hpo_uq/distribution/tensorflow.py", line 20, in init_workers_gpu
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
IndexError: list index out of range
Solution
This error indicates that no GPU was found for the current process. If you are trying to use GPUs, check that the GPU version of TensorFlow is being used.
Note
Double-check that the cgpu module has been loaded as well (module load cgpu). If cgpu is not loaded, the following messages will likely appear:
sbatch: error: No architecture specified, cannot estimate job costs.
sbatch: error: Batch job submission failed: Unspecified error
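The IndexError in the traceback above comes from indexing the list of GPUs with the local rank. A guard of the following kind turns that into a more descriptive message. This is a sketch under stated assumptions, not the package's implementation: select_local_gpu is a hypothetical helper, and plain strings stand in for TensorFlow device objects so that the example runs without TensorFlow or Horovod installed.

```python
def select_local_gpu(gpus, local_rank):
    """Return the GPU assigned to this rank, or raise a descriptive error."""
    if not gpus:
        raise RuntimeError(
            "No GPUs are visible; check that the GPU build of TensorFlow "
            "is being used and that the job runs on a GPU node."
        )
    if local_rank >= len(gpus):
        raise RuntimeError(
            f"Local rank {local_rank} has no matching GPU "
            f"(only {len(gpus)} visible)."
        )
    return gpus[local_rank]


# With TensorFlow and Horovod available, the inputs would come from:
#   gpus = tf.config.list_physical_devices("GPU")
#   local_rank = hvd.local_rank()
print(select_local_gpu(["GPU:0", "GPU:1"], 1))
```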
I am getting the following error message when executing the program in surrogate mode. What should I do?
Traceback (most recent call last):
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
main()
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
eval('hpo_uq.'+args.operation)(config)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/__init__.py", line 23, in surrogate
opti = extract_evals(opti)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/extract.py", line 55, in extract_evals
assert len(opti.samples)>0, 'No samples found for surrogate modeling. Abort.'
AssertionError: No samples found for surrogate modeling. Abort.
Solution
When running surrogate modeling, make sure that the configuration file is pointing to a folder containing the complete log files from running evaluations.
I am getting the following error message when executing the program in surrogate mode. What should I do?
Traceback (most recent call last):
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
main()
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
eval('hpo_uq.'+args.operation)(config)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/__init__.py", line 26, in surrogate
opti = eval(opti.surrogate)(opti)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/surrogate/rbf_opt.py", line 156, in rbf
if opti.config['uq']['uq_on']==True:
KeyError: 'uq'
Solution
When running RBF surrogate modeling, the uq section must be included in the configuration file. If UQ is not desired, its options can be set to False:
uq:
  uq_on : False
  uq_hpo : False
  uq_weights : [0.5, 0.5]
  data_noise : 0.0
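The same defaults can be supplied programmatically when a configuration lacks the section. The snippet below is a minimal sketch, assuming the configuration is available as a Python dictionary. The names UQ_DISABLED and ensure_uq_section are hypothetical, and the values are copied from the configuration fragment above.

```python
# Values copied from the configuration fragment shown above.
UQ_DISABLED = {
    "uq_on": False,
    "uq_hpo": False,
    "uq_weights": [0.5, 0.5],
    "data_noise": 0.0,
}


def ensure_uq_section(config):
    """Add a disabled uq section when the configuration does not define one."""
    if not config.get("uq"):
        defaults = dict(UQ_DISABLED)
        # Copy the list so later edits do not modify the shared defaults.
        defaults["uq_weights"] = list(UQ_DISABLED["uq_weights"])
        config["uq"] = defaults
    return config


config = ensure_uq_section({"prms": {"nevals": 10}})
print(config["uq"]["uq_on"])
```

An existing uq section is left untouched, so user-provided settings are preserved.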
I am getting the following error message when executing the program in evaluation mode. What should I do?
sbatch: error: Batch job submission failed: Node count specification invalid
Solution
This message occurs if the chosen number of SLURM steps and SLURM tasks exceeds the available resources. Adjust nsteps and ntasks in the configuration file, and make sure to regenerate the SLURM script once the configuration file has been modified.
I am getting the following error message when executing the program in evaluation mode. What should I do?
Traceback (most recent call last):
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 33, in <module>
main()
File "/global/homes/a/agt17/hpo_uq/bin/hpo.py", line 30, in main
eval('hpo_uq.'+args.operation)(config)
File "/global/homes/a/agt17/hpo_uq/hpo_uq/evaluation.py", line 41, in evaluation
samples = make_samples(**config['prms'])
File "/global/homes/a/agt17/hpo_uq/hpo_uq/utils/sampling.py", line 69, in make_samples
samples = numpy.zeros((nevals,len(names)),dtype=int)
ValueError: negative dimensions are not allowed
Solution
Make sure that the upper bound of each parameter is greater than its lower bound. Also check that every parameter is assigned a value in each of the lists mult, xlow and xup, i.e., the length of each list should be the same as the total number of parameters.
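Both conditions can be verified with a few lines of Python before submitting a job. This is a sketch only: check_bounds is a hypothetical helper, the parameter names and values are invented, and it assumes the four lists have been read from the prms section of the configuration file.

```python
def check_bounds(names, mult, xlow, xup):
    """Return a list of problems found in the parameter bound lists."""
    problems = []
    # Every list needs exactly one entry per parameter.
    for label, values in (("mult", mult), ("xlow", xlow), ("xup", xup)):
        if len(values) != len(names):
            problems.append(
                f"{label} has {len(values)} entries, expected {len(names)}"
            )
    # Each upper bound must be strictly greater than its lower bound.
    for name, low, up in zip(names, xlow, xup):
        if up <= low:
            problems.append(
                f"{name}: upper bound {up} is not greater than lower bound {low}"
            )
    return problems


# Invented example: the bounds of the second parameter are swapped.
for problem in check_bounds(["layers", "nodes"], [1, 1], [1, 64], [5, 8]):
    print(problem)
```

An empty list means that no problem was detected by these two checks.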
I am getting the following error message when executing the program in evaluation mode. What should I do?
2021-08-20 20:56:49.378624: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8';
dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/esslurm/lib64:/opt/cray/job/2.2.4-7.0.1.1_3.50__g36b56f4.ari/lib64:/opt/intel/compilers_and_libraries_2019.3.199/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2019.3.199/linux/mkl/lib/intel64:/usr/common/software/darshan/3.2.1/lib
Solution
This error message may be displayed if the incorrect ML library has been loaded (PyTorch or TensorFlow), or if both PyTorch and TensorFlow are loaded.
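A quick way to see which of the two libraries the current Python environment can import is shown below. This is only an approximation of which software modules are loaded, and the function name installed_libraries is made up for this example.

```python
import importlib.util


def installed_libraries(candidates=("torch", "tensorflow")):
    """Return the candidate packages that can be imported in this environment."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


found = installed_libraries()
if len(found) > 1:
    print("Both libraries are available:", ", ".join(found))
elif found:
    print("Only", found[0], "is available")
else:
    print("Neither PyTorch nor TensorFlow is available")
```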