Working on Cori

Running HPO on data-intensive scientific applications can require a lot of computational power. In order for our software to work efficiently on big datasets and complex neural network architectures, we made it compatible with large-scale computing environments such as the Cori supercomputer, which is hosted by the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. In this section, we show how to connect to Cori and run the software there.

Login to cluster

Via JupyterLab

The most common way for active Cori users to access the machine nowadays is through the JupyterHub platform. One can access the platform directly from a web browser by clicking on this link, which redirects the user to the following login window:

../_images/jupyterlab_login.png

Once logged in, one can select which type of node to use by clicking the start button under the desired choice. If one is not sure what to select, the most common choice is the Shared CPU node:

../_images/hub_control_panel.png

Finally, once the requested login nodes are selected, the following interface is displayed. From there, the user can browse through the files, start a terminal, or open a Jupyter notebook.

../_images/jupyterhub.png

Note

We personally recommend this option as it gives a much better interactive experience with the supercomputer (e.g. the ability to work with Jupyter notebooks, browse through files, and easily upload or download files).

Via SSH

Another very easy way to connect to the Cori supercomputer is via the Secure Shell (SSH) protocol, which can be done from the command line as follows:

ssh username@cori.nersc.gov

This will display a prompt asking you to type your password along with the One-Time Password (OTP). For more information about the required Multi-Factor Authentication, check out this link.

Software Setup

If this is your first time using the HYPPO software, you should follow the instructions given here to install the package in your home directory.

Note

Make sure you open a new terminal for the change in the Python path to take effect and the package to be properly imported.

Cori Modules

To facilitate the work done by scientists on Cori and to avoid the need to install new software frequently, a number of modules are provided which can be loaded and unloaded using the module load and module unload commands. For the HYPPO software, only one of two modules is needed to run the program, either tensorflow/intel-2.2.0-py37 or pytorch/1.7.1. Which module needs to be loaded depends on the Python machine learning library that will be used, i.e. TensorFlow or PyTorch. For instance, if you decide to do training with the PyTorch library, you can load the relevant module as follows:

module load pytorch/1.7.1

Warning

From now on, you need to make sure you have either the tensorflow/intel-2.2.0-py37 or the pytorch/1.7.1 module loaded. The list of currently loaded modules can be displayed by typing module list in the command line.

The following versions of either TensorFlow or PyTorch must be loaded based on CPU or GPU usage:

CPU or GPU?  Package     Version
CPU          TensorFlow  tensorflow/intel-2.2.0-py37
CPU          PyTorch     pytorch/1.7.1
GPU          TensorFlow  tensorflow/2.4.0-gpu
GPU          PyTorch     pytorch/1.9.0-gpu

Danger

Do not load more than one of the modules shown above. Loading both TensorFlow and PyTorch will result in a fatal error (see FAQ section on this topic for more details).
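Since loading both frameworks is fatal, it can help to check the environment before launching a job. The following is a minimal sketch, not part of HYPPO: it assumes the module system exports the colon-separated LOADEDMODULES environment variable (as Environment Modules and Lmod do), and the helper names loaded_frameworks and check_single_framework are hypothetical.

```python
import os

# Name prefixes of the two framework modules (assumption: all TensorFlow and
# PyTorch modules on the system follow the "tensorflow/..." / "pytorch/..." scheme).
FRAMEWORK_PREFIXES = ("tensorflow/", "pytorch/")


def loaded_frameworks(loadedmodules=None):
    """Return the loaded modules whose names start with a framework prefix.

    `loadedmodules` is a colon-separated string; by default it is read from
    the LOADEDMODULES environment variable (assumed to be set by `module`).
    """
    if loadedmodules is None:
        loadedmodules = os.environ.get("LOADEDMODULES", "")
    names = [name for name in loadedmodules.split(":") if name]
    return [name for name in names if name.startswith(FRAMEWORK_PREFIXES)]


def check_single_framework(loadedmodules=None):
    """Return the single framework module, or raise if zero or several are loaded."""
    found = loaded_frameworks(loadedmodules)
    if len(found) != 1:
        raise RuntimeError(
            "expected exactly one TensorFlow or PyTorch module, found: %s" % found
        )
    return found[0]
```

Calling check_single_framework() at the top of a driver script would stop early with a readable message instead of failing later during import.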

Python Libraries

While most of the library dependencies are already included in the Python environment that is loaded by either of these modules, other libraries such as deap, SALib, pyDOE and plotly need to be installed separately. Instead of setting up a new environment, one can use pip to install the packages locally for each module by executing the following command from the command line:

pip install -r requirements.txt
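To confirm that the installation worked for the currently loaded module, one can check that the extra libraries are visible to the interpreter. The sketch below is an illustration only; it assumes the import names match the package names listed above, and missing_packages is a hypothetical helper rather than a HYPPO function.

```python
import importlib.util

# Import names of the extra dependencies (assumed identical to the package names).
EXTRA_PACKAGES = ("deap", "SALib", "pyDOE", "plotly")


def missing_packages(names=EXTRA_PACKAGES):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing packages:", missing if missing else "none")
```

Because packages are installed per module, it is worth rerunning this check after switching between the TensorFlow and PyTorch modules.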

SLURM Script

Now that you have the software along with all its library dependencies properly installed on the Cori machine, you can start sending jobs to the cluster. To make this hassle-free, the user can use the slurm option of the hyppo main executable to automatically generate a SLURM script based on the information provided within the input configuration file. For instance, using the example configuration file, one can execute the following command from the home directory:

hyppo slurm hyppo/config/example.yaml

The above command will create a script.sh with the following content:

#!/bin/bash
#SBATCH --account m0001
#SBATCH --nodes 1
#SBATCH --qos regular
#SBATCH --time 2
#SBATCH --constraint haswell
#SBATCH --job-name hpo
#SBATCH --error %x-%j.out
mkdir -p logs/
module load pytorch/1.7.1
srun -n 32 -c 1 $HOME/hyppo/bin/hyppo evaluation example.yaml
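To illustrate how a set of job settings maps onto the #SBATCH directives in such a script, here is a minimal sketch. It is not HYPPO's actual generator: the function render_slurm_script and the layout of the settings dictionary are assumptions made for this example.

```python
def render_slurm_script(settings, commands):
    """Build a batch script: one #SBATCH line per setting, then the commands.

    `settings` maps long option names (without the leading "--") to values;
    `commands` is a list of shell lines appended after the directives.
    """
    lines = ["#!/bin/bash"]
    for option, value in settings.items():
        lines.append("#SBATCH --%s %s" % (option, value))
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values copied from the generated script shown above.
    settings = {
        "account": "m0001",
        "nodes": 1,
        "qos": "regular",
        "time": 2,
        "constraint": "haswell",
        "job-name": "hpo",
        "error": "%x-%j.out",
    }
    commands = [
        "module load pytorch/1.7.1",
        "srun -n 32 -c 1 $HOME/hyppo/bin/hyppo evaluation example.yaml",
    ]
    print(render_slurm_script(settings, commands), end="")
```

Keeping the directives in a dictionary makes it easy to change a single value, such as the time limit or the constraint, without editing the script by hand.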

The job described in the above newly created bash script can then be sent to the cluster using the sbatch command as follows:

sbatch script.sh

If the job is successfully submitted, a message similar to Submitted batch job 43081050 will be displayed. You can then review your job queue using the sqs command, which displays status information about your pending (PD) or running (R) jobs:

JOBID            ST USER      NAME          NODES TIME_LIMIT       TIME  SUBMIT_TIME          QOS             START_TIME           FEATURES       NODELIST(REASON
43081050         PD vdumont   hpo           1           2:00       0:00  2021-06-03T00:24:31  regular_1       N/A                  haswell        (Priority)

Once the job has completed, the log file for each of the 32 processes used in this job can be found under the newly created logs directory.
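When submissions are scripted, the job ID printed by sbatch is often needed later, for example to match the job against the queue listing. The sketch below shows one way to capture it, assuming the confirmation message has the form quoted above; parse_job_id and submit are hypothetical helpers, and submit only works on a machine where sbatch is available.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job ID found in: %r" % sbatch_output)
    return int(match.group(1))


def submit(script_path="script.sh"):
    """Run `sbatch` on the script and return the job ID (requires SLURM)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```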

Using Cori GPUs

Please see the following section for more information on running HYPPO on Cori GPUs.

Jupyter Notebook

Warning

When starting a Jupyter notebook with a tensorflow-v2.2.0-cpu kernel, HYPPO will fail to import (even though it does import correctly if you load the module on the command line and import the package through IPython). To work around this issue, one can simply append the path to the software to the Python path at the beginning of the notebook as follows:

import os
import sys

# Make the HYPPO checkout in the home directory importable from the notebook.
sys.path.append(os.path.expandvars('$HOME/hyppo'))