MPI for HYPPO

One of the biggest challenges in hyperparameter optimization is its computational cost. To find the optimal region of the hyperparameter space, a large number of hyperparameter sets may need to be evaluated, and while our surrogate modeling approach helps reduce the number of iterations, a single evaluation can still take a long time depending on the problem and the size of the dataset. The HYPPO software is designed to run in massively distributed setups across multiple CPU or GPU nodes. In this section, we describe how initial evaluations and individual trainings can be executed in parallel.

Nested Parallelization

Definitions:

  • SLURM job: the whole SBATCH script submitted to the scheduler.

  • SLURM step: each srun instance invoked within the SLURM job (equivalent to one HYPPO parallel task).

  • SLURM task: what is usually known as an MPI rank. A task/rank can have multiple CPUs/GPUs, but in the context of HYPPO every rank has a single CPU or GPU processor associated with it.

[Figure: diagram of the nested parallelization layout across SLURM job, steps, and tasks]
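The terminology above can be illustrated with a minimal job script; the resource counts and the training script name are hypothetical, not HYPPO defaults:

```shell
#!/bin/bash
# The whole SBATCH script is one SLURM *job*.
#SBATCH --nodes=1
#SBATCH --ntasks=4          # four SLURM *tasks* (MPI ranks)
#SBATCH --gpus-per-task=1   # in HYPPO, one GPU (or CPU) per rank

# Each srun invocation is one SLURM *step*
# (equivalent to one HYPPO parallel task).
srun python train.py        # hypothetical training script
```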

Nested parallelization can be used to execute multiple distributed-training evaluations in parallel. The user can specify how many CPUs or GPUs to use for each individual training.
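A minimal sketch of this nesting in an SBATCH script is shown below, where two training evaluations run concurrently and each is distributed across two GPUs; the script name and the --hps flag are assumptions for illustration:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4          # 4 ranks in total
#SBATCH --gpus-per-task=1   # one GPU per rank

# Two SLURM steps run in parallel (backgrounded with &); each step
# performs one distributed training across 2 tasks, i.e. 2 GPUs.
srun --ntasks=2 python train.py --hps set1 &
srun --ntasks=2 python train.py --hps set2 &
wait   # block until both evaluations finish
```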

GPU limitations

If the total number of GPUs requested is larger than the number of GPUs available in a single node, only single-GPU training can be performed.
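This constraint can be sketched as a simple check; the numbers below are assumptions for illustration (e.g., 8 GPUs per node and 16 GPUs requested in total), not values read from HYPPO:

```shell
# Illustrative check of the single-node limit described above.
GPUS_PER_NODE=8   # assumed GPUs available on one node
TOTAL_GPUS=16     # assumed total GPUs requested

if [ "$TOTAL_GPUS" -gt "$GPUS_PER_NODE" ]; then
    # The request spans multiple nodes: only single-GPU
    # training can be performed.
    GPUS_PER_TRAINING=1
else
    GPUS_PER_TRAINING=$TOTAL_GPUS
fi
echo "GPUs per training: $GPUS_PER_TRAINING"
```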

Using Cori GPUs