Merged
3 changes: 0 additions & 3 deletions docs/nitpicky
@@ -42,15 +42,12 @@ py:class pathlib._local.Path
# Internal paths that are verified importable but Sphinx can't find
py:class libensemble.resources.platforms.Aurora
py:class libensemble.resources.platforms.GenericROCm
py:class libensemble.resources.platforms.Crusher
py:class libensemble.resources.platforms.Frontier
py:class libensemble.resources.platforms.Perlmutter
py:class libensemble.resources.platforms.PerlmutterCPU
py:class libensemble.resources.platforms.PerlmutterGPU
py:class libensemble.resources.platforms.Polaris
py:class libensemble.resources.platforms.Spock
py:class libensemble.resources.platforms.Summit
py:class libensemble.resources.platforms.Sunspot
py:class libensemble.resources.rset_resources.RSetResources
py:class libensemble.resources.env_resources.EnvResources
py:class libensemble.resources.resources.Resources
73 changes: 12 additions & 61 deletions docs/platforms/bebop.rst
Expand Up @@ -2,7 +2,7 @@
Bebop
=====

Bebop_ is a Cray CS400 cluster with Intel Broadwell and Knights Landing compute
Bebop_ is a Cray CS400 cluster with Intel Broadwell compute
nodes available in the Laboratory Computing Resources
Center (LCRC) at Argonne National
Laboratory.
@@ -52,24 +52,24 @@ for installing libEnsemble.
Job Submission
--------------

Bebop uses Slurm_ for job submission and management. The two commands you'll
likely use the most to run jobs are ``srun`` and ``sbatch`` for running
interactively and batch, respectively.

libEnsemble node-worker affinity is especially flexible on Bebop. By adjusting
``srun`` runtime options_ users may assign multiple libEnsemble workers to each
allocated node (oversubscription) or assign multiple nodes per worker.
Bebop uses PBS for job submission and management.

Interactive Runs
^^^^^^^^^^^^^^^^

You can allocate four Knights Landing nodes for thirty minutes through the following::
You can allocate four Broadwell nodes for thirty minutes through the following::

qsub -I -A <project_id> -l select=4:mpiprocs=4 -l walltime=30:00

salloc -N 4 -p knl -A [username OR project] -t 00:30:00
Once in the interactive session, you may need to reload your modules::

With your nodes allocated, queue your job to start with four MPI ranks::
cd $PBS_O_WORKDIR
module load anaconda3 gcc openmpi aocl
conda activate bebop_libe_env

srun -n 4 python calling.py
Now run your script with four workers (one for the generator and three for simulations)::

python my_libe_script.py --comms local --nworkers 4

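For non-interactive runs, the same setup can be wrapped in a PBS batch script. The following is a minimal sketch, not a maintained script: the job name, project ID, and calling-script name are placeholders, and the select line mirrors the interactive request above (see ``bebop_submit_pbs_distrib.sh`` in the :doc:`examples<example_scripts>` for a maintained version):

```shell
#!/bin/bash
#PBS -N libe_job
#PBS -A <project_id>
#PBS -l select=4:mpiprocs=4
#PBS -l walltime=00:30:00
#PBS -o libe_job.out
#PBS -e libe_job.err

# Run from the submission directory and recreate the environment
cd $PBS_O_WORKDIR
module load anaconda3 gcc openmpi aocl
conda activate bebop_libe_env

# One manager plus four workers on the lead node
python my_libe_script.py --comms local --nworkers 4
```

Submit with ``qsub myscript.sh``.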
``mpirun`` should also work. This line launches libEnsemble with a manager and
**three** workers to one allocated compute node, with three nodes available for
@@ -83,57 +83,10 @@ be initiated with ``libE_specs["dedicated_mode"]=True``
and not oversubscribing, specify one more MPI process than the number of
allocated nodes. The manager and first worker run together on a node.

If you would like to interact directly with the compute nodes via a shell,
the following starts a bash session on a Knights Landing node
for thirty minutes::

srun --pty -A [username OR project] -p knl -t 00:30:00 /bin/bash

.. note::
You will need to reactivate your conda virtual environment and reload your
modules! Configuring this routine to occur automatically is recommended.

Batch Runs
^^^^^^^^^^

Batch scripts specify run settings using ``#SBATCH`` statements. A simple example
for a libEnsemble use case running in :doc:`distributed<platforms_index>` MPI
mode on Broadwell nodes resembles the following:

.. code-block:: bash
:linenos:

#!/bin/bash
#SBATCH -J myjob
#SBATCH -N 4
#SBATCH -p bdwall
#SBATCH -A myproject
#SBATCH -o myjob.out
#SBATCH -e myjob.error
#SBATCH -t 00:15:00

# These four lines construct a machinefile for the executor and slurm
srun hostname | sort -u > node_list
head -n 1 node_list > machinefile.$SLURM_JOBID
cat node_list >> machinefile.$SLURM_JOBID
export SLURM_HOSTFILE=machinefile.$SLURM_JOBID

srun --ntasks 5 python calling_script.py

With this saved as ``myscript.sh``, allocating, configuring, and running libEnsemble
on Bebop is achieved by running ::

sbatch myscript.sh

Example submission scripts for running on Bebop in distributed and centralized mode
are also given in the :doc:`examples<example_scripts>`.

Debugging Strategies
--------------------

View the status of your submitted jobs with ``squeue``, and cancel jobs with
``scancel <Job ID>``.

Additional Information
----------------------

@@ -144,5 +97,3 @@ See the LCRC Bebop docs here_ for more information about Bebop.
.. _conda: https://conda.io/en/latest/
.. _here: https://docs.lcrc.anl.gov/bebop/running-jobs-bebop/
.. _mpi4py: https://mpi4py.readthedocs.io/en/stable/
.. _options: https://slurm.schedmd.com/srun.html
.. _Slurm: https://slurm.schedmd.com/
5 changes: 3 additions & 2 deletions docs/platforms/example_scripts.rst
@@ -2,7 +2,7 @@ Example Scheduler Submission Scripts
====================================

Below are example submission scripts used to configure and launch libEnsemble
on a variety of high-powered systems. See :doc:`here<platforms_index>` for more
on a variety of high-powered systems. See :ref:`here<platform-index>` for more
information about the respective systems and configuration.

General examples
@@ -73,7 +73,7 @@ System Examples
:caption: /examples/libE_submission_scripts/bebop_submit_pbs_distrib.sh
:language: bash

.. dropdown:: Summit - On Launch Nodes with Multiprocessing
.. dropdown:: Summit (Decommissioned) - On Launch Nodes with Multiprocessing

.. literalinclude:: ../../examples/libE_submission_scripts/summit_submit_mproc.sh
:caption: /examples/libE_submission_scripts/summit_submit_mproc.sh
@@ -84,3 +84,4 @@ System Examples
.. literalinclude:: ../../examples/libE_submission_scripts/cobalt_submit_mproc.sh
:caption: /examples/libE_submission_scripts/cobalt_submit_mproc.sh
:language: bash

1 change: 0 additions & 1 deletion docs/platforms/platforms_index.rst
@@ -215,7 +215,6 @@ libEnsemble on specific HPC systems.
improv
perlmutter
polaris
spock_crusher
summit
srun
example_scripts
82 changes: 0 additions & 82 deletions docs/platforms/spock_crusher.rst

This file was deleted.

34 changes: 15 additions & 19 deletions docs/platforms/summit.rst
@@ -1,17 +1,20 @@
======
Summit
======
=======================
Summit (Decommissioned)
=======================

Summit_ is an IBM AC922 system located at the Oak Ridge Leadership Computing
Facility (OLCF). Each of the approximately 4,600 compute nodes on Summit contains two
Summit_ was an IBM AC922 system located at the Oak Ridge Leadership Computing
Facility (OLCF). Each of the approximately 4,600 compute nodes on Summit contained two
IBM POWER9 processors and six NVIDIA Volta V100 accelerators.

Summit features three tiers of nodes: login, launch, and compute nodes.
Summit featured three tiers of nodes: login, launch, and compute nodes.

Users on login nodes submit batch runs to the launch nodes.
Batch scripts and interactive sessions run on the launch nodes. Only the launch
nodes can submit MPI runs to the compute nodes via ``jsrun``.

These docs are maintained to guide libEnsemble's usage on three-tier systems and/or
``jsrun`` systems similar to Summit.
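As a concrete illustration, a worker on a ``jsrun`` system might launch a user application as follows (the application name ``my_app`` is a placeholder):

```shell
# Four resource sets, each with one MPI task, one core, and one GPU
jsrun -n 4 -a 1 -c 1 -g 1 ./my_app
```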

Configuring Python
------------------

@@ -57,13 +60,13 @@ Or, you can install via ``conda``:

See :doc:`here<../advanced_installation>` for more information on advanced options
for installing libEnsemble.

Special note on resource sets and Executor submit options
---------------------------------------------------------

When using the portable MPI run configuration options (e.g., num_nodes) to the
:doc:`MPIExecutor<../executor/mpi_executor>` ``submit`` function, it is important
to note that, due to the `resource sets`_ used on Summit, the options refer to
to note that, due to the resource sets used on Summit, the options refer to
resource sets as follows:

- num_procs (int, optional) – The total number of resource sets for this run.
@@ -114,7 +117,7 @@ available on a Summit node, and thus two such tasks may be allocated to each node.
Job Submission
--------------

Summit uses LSF_ for job management and submission. For libEnsemble, the most
Summit used LSF_ for job management and submission. For libEnsemble, the most
important command is ``bsub`` for submitting batch scripts from the login nodes
to execute on the launch nodes.

@@ -191,20 +194,13 @@ Launching User Applications from libEnsemble Workers
----------------------------------------------------

Only the launch nodes can submit MPI runs to the compute nodes via ``jsrun``.
This can be accomplished in user ``sim_f`` functions directly. However, it is highly
This can be accomplished in user simulator functions directly. However, it is highly
recommended that the :doc:`Executor<../executor/ex_index>` interface
be used inside the ``sim_f`` or ``gen_f``, because this provides a portable interface
be used inside the simulator or generator, because this provides a portable interface
with many advantages including automatic resource detection, portability,
launch failure resilience, and ease of use.

Additional Information
----------------------

See the OLCF guides_ for more information about Summit.

.. _conda: https://conda.io/en/latest/
.. _guides: https://docs.olcf.ornl.gov/systems/summit_user_guide.html
.. _LSF: https://www.olcf.ornl.gov/wp-content/uploads/2018/12/summit_workshop_fuson.pdf
.. _mpi4py: https://mpi4py.readthedocs.io/en/stable/
.. _resource sets: https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun
.. _Summit: https://docs.olcf.ornl.gov/systems/summit_user_guide.html
.. _Summit: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
2 changes: 1 addition & 1 deletion docs/resource_manager/resource_detection.rst
@@ -32,7 +32,7 @@ Resource detection can be disabled by setting
configuration options on the Executor submit line.

This will usually work sufficiently on
systems that have application-level scheduling and queuing (e.g., ``jsrun`` on Summit).
systems that have application-level scheduling and queuing (e.g., ``jsrun``).
However, on many cluster and multi-node systems, if the built-in resource
manager is disabled, then runs without a hostlist or machinefile supplied may be
undesirably scheduled to the same nodes.
48 changes: 0 additions & 48 deletions examples/libE_submission_scripts/cobalt_submit_mproc.sh

This file was deleted.
