=======================
Summit (Decommissioned)
=======================

Summit_ was an IBM AC922 system located at the Oak Ridge Leadership Computing
Facility (OLCF). Each of the approximately 4,600 compute nodes on Summit contained two
IBM POWER9 processors and six NVIDIA Volta V100 accelerators.

Summit featured three tiers of nodes: login, launch, and compute nodes.

Users on login nodes submit batch runs to the launch nodes.
Batch scripts and interactive sessions run on the launch nodes. Only the launch
nodes can submit MPI runs to the compute nodes via ``jsrun``.

These docs are maintained to guide libEnsemble's usage on three-tier systems similar to Summit.

Special note on resource sets and Executor submit options
---------------------------------------------------------

When using the portable MPI run configuration options (e.g., ``num_nodes``) with the
:doc:`MPIExecutor<../executor/mpi_executor>` ``submit`` function, it is important
to note that, due to the `resource sets`_ used on Summit, the options refer to
resource sets as follows:

- num_procs (int, optional) – The total number of resource sets for this run.

- num_nodes (int, optional) – The number of nodes on which to submit the run.

- procs_per_node (int, optional) – The number of resource sets per node.

It is recommended that the user define a resource set as the minimal configuration
of CPU cores/processes and GPUs. These settings can be added to the ``extra_args``
option of the *submit* function. Alternatively, the portable options can be ignored
and everything expressed in ``extra_args``.

For example, the following *jsrun* line would run three resource sets,
each having one core (with one process) and one GPU, along with some extra options::

    jsrun -n 3 -a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"

Expressing this line via the ``submit`` function may look
something like the following::

    exctr = Executor.executor
    task = exctr.submit(app_name="mycode",
                        num_procs=3,
                        extra_args='-a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
                        app_args="-i input")

This would be equivalent to::

    exctr = Executor.executor
    task = exctr.submit(app_name="mycode",
                        extra_args='-n 3 -a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
                        app_args="-i input")

The libEnsemble resource manager works out the resources available to each worker,
but unlike on some other systems, ``jsrun`` on Summit dynamically schedules runs to
available slots across and within nodes. It can also queue tasks. This allows
variable-size runs to be handled easily on Summit. If oversubscription of the
``jsrun`` system is desired, then libEnsemble's resource manager can be disabled
in the calling script via::

    libE_specs["disable_resource_manager"] = True

In the above example, the task being submitted used three GPUs, half of the six
available on a Summit node; thus two such tasks (from different workers) could be
allocated to each node if they were running at the same time.

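The packing arithmetic above can be sketched directly (plain Python, using the GPU
count from the system description)::

    # Summit compute nodes each had six V100 GPUs
    gpus_per_node = 6

    # The example task uses three resource sets with one GPU each
    gpus_per_task = 3

    # Whole tasks that fit on one node at the same time
    tasks_per_node = gpus_per_node // gpus_per_task
    print(tasks_per_node)  # 2
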
Job Submission
--------------

Summit used LSF_ for job management and submission. For libEnsemble, the most
important command is ``bsub`` for submitting batch scripts from the login nodes
to execute on the launch nodes.

It is recommended to run libEnsemble on the launch nodes (assuming workers are
submitting MPI applications) using the ``local`` communications mode (multiprocessing).

Interactive Runs
^^^^^^^^^^^^^^^^

You can run interactively with ``bsub`` by specifying the ``-Is`` flag,
similarly to the following::

    $ bsub -W 30 -P [project] -nnodes 8 -Is

This will place you on a launch node.

.. note::
    You will need to reactivate your conda virtual environment.

Batch Runs
^^^^^^^^^^

Batch scripts specify run settings using ``#BSUB`` statements. The following
simple example depicts configuring and launching libEnsemble on a launch node with
multiprocessing. This script also assumes the user is using the ``parse_args()``
convenience function from libEnsemble's :doc:`tools module<../utilities>`.

.. code-block:: bash

    #!/bin/bash -x
    #BSUB -P <project code>
    #BSUB -J libe_mproc
    #BSUB -W 60
    #BSUB -nnodes 128
    #BSUB -alloc_flags "smt1"

    # --- Prepare Python ---

    # Load conda module and gcc.
    module load python
    module load gcc

    # Name of conda environment
    export CONDA_ENV_NAME=my_env

    # Activate conda environment
    export PYTHONNOUSERSITE=1
    source activate $CONDA_ENV_NAME

    # --- Prepare libEnsemble ---

    # Name of calling script
    export EXE=calling_script.py

    # Communication method
    export COMMS="--comms local"

    # Number of workers
    export NWORKERS="--nworkers 128"

    hash -r  # Check no commands hashed (pip/python...)

    # Launch libEnsemble
    python $EXE $COMMS $NWORKERS > out.txt 2>&1

With this saved as ``myscript.sh``, allocating, configuring, and queueing
libEnsemble on Summit is achieved by running::

    $ bsub myscript.sh

Example submission scripts are also given in the :doc:`examples<example_scripts>`.

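The batch script above passes ``--comms local --nworkers 128`` to the calling
script, where the ``parse_args()`` convenience function consumes them. A minimal
sketch of how the top of the calling script might read these flags (the import is
deferred into a function so the sketch stands alone outside a libEnsemble install):

.. code-block:: python

    def read_run_options():
        """Sketch: consume the --comms/--nworkers flags set in the batch script."""
        from libensemble.tools import parse_args  # deferred import for this sketch

        nworkers, is_manager, libE_specs, _ = parse_args()
        return nworkers, is_manager, libE_specs
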
Launching User Applications from libEnsemble Workers
----------------------------------------------------

Only the launch nodes can submit MPI runs to the compute nodes via ``jsrun``.
This can be done in user simulator functions directly. However, it is highly
recommended that the :doc:`Executor<../executor/ex_index>` interface
be used inside the simulator or generator, because it provides a portable interface
with many advantages, including automatic resource detection, launch-failure
resilience, and ease of use.
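
As a sketch of this pattern, a simulator function might retrieve the registered
Executor and submit the user application. The app name, arguments, and polling
interval below are placeholders, and the libensemble imports are deferred into the
function body so the sketch can be read on its own:

.. code-block:: python

    def sim_f(H, persis_info, sim_specs, libE_info):
        """Sketch: launch the registered app from a worker and wait for it."""
        import time

        # Deferred import so this sketch stands alone outside a libEnsemble run
        from libensemble.executors.executor import Executor

        exctr = Executor.executor  # The executor registered in the calling script
        task = exctr.submit(
            app_name="mycode",  # Placeholder app name, as in the examples above
            num_procs=3,        # Three resource sets on Summit
            extra_args='-a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
            app_args="-i input",
        )
        while not task.finished:  # Poll until jsrun reports the run complete
            time.sleep(1)
            task.poll()
        return None, persis_info  # A real sim func builds and returns an output array
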

.. _conda: https://conda.io/en/latest/
.. _LSF: https://www.olcf.ornl.gov/wp-content/uploads/2018/12/summit_workshop_fuson.pdf
.. _mpi4py: https://mpi4py.readthedocs.io/en/stable/