Working on Cori ¶
Doing HPO on data-intensive scientific applications can require a lot of computational power. In order for our software to work efficiently on big datasets and complex neural network architectures, we made it compatible with large-scale computing environments such as the Cori supercomputer, which is hosted by the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. In this section, we show how to connect to Cori and run the software there.
Login to cluster ¶
Via JupyterLab ¶
The most common way active Cori users access the machine nowadays is through the JupyterHub platform. One can access the platform directly from a web browser by clicking on this link, which will redirect the user to the following login window:

Once logged in, one can select which type of node to use by clicking the start button under the desired choice. If one is not sure what to select, the most common approach is to use a Shared CPU node:

Finally, once the requested login nodes are selected, the following interface will be displayed. From there, the user can browse through the files, start a terminal or a Jupyter notebook, etc.

Note
We personally recommend this option as it gives a much better interactive experience with the supercomputer (e.g., the ability to work with Jupyter notebooks, browse through files, and easily upload/download files).
Via SSH ¶
Another very easy way to connect to the Cori supercomputer is via Secure Shell Protocol (SSH), which can be done through the command line as follows:
ssh username@cori.nersc.gov
This will prompt a message asking you to type your password along with the One-Time Password (OTP). For more information about the required Multi-Factor Authentication, check out this link.
Software Setup ¶
If this is your first time using the HYPPO software, you should follow the instructions given here to install the package in your home directory.
Note
Make sure you open a new terminal for the change in the Python path to take effect and the package to be properly imported.
Cori Modules ¶
In order to facilitate the work done by scientists on Cori and to avoid the need to frequently install new software, a number of modules were created which can be loaded and unloaded straightforwardly using the module load and module unload commands. For the HYPPO software, only one of two modules is needed to run the program: either tensorflow/intel-2.2.0-py37 or pytorch/1.7.1. Which module needs to be loaded depends on the Python machine learning library that will be used, i.e., TensorFlow or PyTorch. For instance, if you decide to do training with the PyTorch library, you can load the relevant module as follows:
module load pytorch/1.7.1
Warning
From now on, you need to make sure you have either the tensorflow/intel-2.2.0-py37 or the pytorch/1.7.1 module loaded. The list of currently loaded modules can be displayed by typing module list in the command line.
The following versions of either TensorFlow or PyTorch must be loaded based on CPU or GPU usage:
CPU or GPU? | Package | Version
---|---|---
CPU | TensorFlow | tensorflow/intel-2.2.0-py37
CPU | PyTorch | pytorch/1.7.1
GPU | TensorFlow | tensorflow/2.4.0-gpu
GPU | PyTorch | pytorch/1.9.0-gpu
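The mapping in the table above can be encoded in a small helper, for example to build the right module load command programmatically. This is a hypothetical sketch (the helper and its names are not part of HYPPO); only the module names come from the table:

```python
# Hypothetical helper mapping (library, device) pairs to the Cori
# module names listed in the table above.
MODULES = {
    ("tensorflow", "cpu"): "tensorflow/intel-2.2.0-py37",
    ("pytorch", "cpu"): "pytorch/1.7.1",
    ("tensorflow", "gpu"): "tensorflow/2.4.0-gpu",
    ("pytorch", "gpu"): "pytorch/1.9.0-gpu",
}

def module_load_command(library: str, device: str) -> str:
    """Return the `module load` command for the requested combination."""
    key = (library.lower(), device.lower())
    if key not in MODULES:
        raise ValueError(f"unsupported combination: {key}")
    return f"module load {MODULES[key]}"

print(module_load_command("PyTorch", "CPU"))  # module load pytorch/1.7.1
```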
Danger
Do not load more than one of the versions shown above. Loading both a TensorFlow and a PyTorch module will result in a fatal error (see the FAQ section on this topic for more details).
Python Libraries ¶
While most of the library dependencies are already included in the Python environment that is loaded by either of these modules, other libraries such as deap, SALib, pyDOE and plotly need to be installed separately. Instead of setting up a new environment, one can use the pip command to install each package locally for each module by executing the following command in the command line:
pip install -r requirements.txt
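Before launching a job, one can check that these extra dependencies are visible to the currently loaded Python environment by probing them with importlib. This is a minimal sanity-check sketch (the `missing` helper is illustrative, not part of HYPPO); the package list matches the libraries named above:

```python
import importlib.util

def missing(packages):
    """Return the subset of packages that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Extra dependencies required by HYPPO, per the list above.
extras = ["deap", "SALib", "pyDOE", "plotly"]
not_found = missing(extras)
if not_found:
    print("Missing packages:", ", ".join(not_found))
```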
SLURM Script ¶
Now that you have the software along with all its library dependencies properly installed on the Cori machine, you can start sending jobs to the cluster. To make this hassle-free, the user can use the slurm option of the hyppo main executable to automatically generate a SLURM script based on the information provided within the input configuration file. For instance, using the example configuration file, one can execute the following command from the home directory:
hyppo slurm hyppo/config/example.yaml
The above command will create a script.sh file with the following content:
#!/bin/bash
#SBATCH --account m0001
#SBATCH --nodes 1
#SBATCH --qos regular
#SBATCH --time 2
#SBATCH --constraint haswell
#SBATCH --job-name hpo
#SBATCH --error %x-%j.out
mkdir -p logs/
module load pytorch/1.7.1
srun -n 32 -c 1 $HOME/hyppo/bin/hyppo evaluation example.yaml
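Internally, generating such a script amounts to filling a template from the configuration values. The sketch below illustrates that idea; the function and field names are hypothetical, and the real hyppo slurm implementation may differ:

```python
# Hypothetical sketch of turning configuration values into a SLURM
# script like the one above; field names are illustrative only.
def make_slurm_script(cfg):
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account {cfg['account']}",
        f"#SBATCH --nodes {cfg['nodes']}",
        f"#SBATCH --qos {cfg['qos']}",
        f"#SBATCH --time {cfg['time']}",
        f"#SBATCH --constraint {cfg['constraint']}",
        f"#SBATCH --job-name {cfg['job_name']}",
        "#SBATCH --error %x-%j.out",
        "mkdir -p logs/",
        f"module load {cfg['module']}",
        f"srun -n {cfg['ntasks']} -c 1 $HOME/hyppo/bin/hyppo evaluation {cfg['config']}",
    ]
    return "\n".join(lines) + "\n"

cfg = {"account": "m0001", "nodes": 1, "qos": "regular", "time": 2,
       "constraint": "haswell", "job_name": "hpo", "module": "pytorch/1.7.1",
       "ntasks": 32, "config": "example.yaml"}
print(make_slurm_script(cfg))
```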
The job described in the newly created bash script can then be sent to the cluster using the sbatch command as follows:
sbatch script.sh
If the job is successfully submitted, a message similar to Submitted batch job 43081050 will be displayed. You can then review your job queue using the sqs command, which will display status information about your pending (PD) or running (R) jobs:
JOBID ST USER NAME NODES TIME_LIMIT TIME SUBMIT_TIME QOS START_TIME FEATURES NODELIST(REASON)
43081050 PD vdumont hpo 1 2:00 0:00 2021-06-03T00:24:31 regular_1 N/A haswell (Priority)
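If you want to poll a job's state from a script, the sqs output is whitespace-delimited and can be parsed into a dictionary per job. This is a sketch assuming the column layout shown above (the `parse_sqs` helper is illustrative, not a HYPPO or NERSC utility):

```python
# Parse `sqs`-style output (header line + one line per job) into a
# list of dicts keyed by column name; assumes the layout shown above.
def parse_sqs(output):
    header, *rows = output.strip().splitlines()
    columns = header.split()
    return [dict(zip(columns, row.split())) for row in rows]

sample = """JOBID ST USER NAME NODES TIME_LIMIT TIME SUBMIT_TIME QOS START_TIME FEATURES NODELIST(REASON)
43081050 PD vdumont hpo 1 2:00 0:00 2021-06-03T00:24:31 regular_1 N/A haswell (Priority)"""

jobs = parse_sqs(sample)
print(jobs[0]["JOBID"], jobs[0]["ST"])  # 43081050 PD  (PD = pending, R = running)
```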
Once completed, the log file for each of the 32 processes used in this job can be found under the newly created logs directory.
Using Cori GPUs ¶
Please see the following section for more information on running HYPPO on Cori GPUs.
Jupyter Notebook ¶
Warning
When starting a Jupyter notebook with a tensorflow-v2.2.0-cpu kernel, HYPPO will fail to import (even though it does import correctly if you load the module on the command line and import the package through IPython). To work around this issue, one can simply append the path to the software to the Python path at the beginning of the notebook as follows:
import sys,os
sys.path.append(os.path.expandvars('$HOME/hyppo'))
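After appending the path, it is worth confirming that the install location is actually on the interpreter's search path before importing the package. A small sanity check, assuming $HOME/hyppo is the install location used throughout this page:

```python
import os
import sys

# Append the HYPPO install location (as done above), avoiding
# duplicate entries, then verify it is on the search path.
hyppo_path = os.path.expandvars('$HOME/hyppo')
if hyppo_path not in sys.path:
    sys.path.append(hyppo_path)
print(hyppo_path in sys.path)  # True
```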