LINUX & HPC : Advanced Large Scale Computing at a Glance !: February 2022

MPI (Message Passing Interface) is a library specification for parallel computing that lets processes exchange data across the nodes of a cluster. Today, MPI is the most widely used communication standard in high performance computing (HPC).

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/


At the time of writing (February 2022), the latest Open MPI release series is 4.1.

Download Open MPI from https://www.open-mpi.org/software/ompi/v4.1/

For example: https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz


NOTE: NVIDIA Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X uses the 'hcoll' library for collective communication. 'hcoll' is enabled by default in HPC-X on Azure HPC VMs and can be controlled at runtime with the parameter -mca coll_hcoll_enable 1 (use 0 to disable it).

How to install UCX:

Unified Communication X (UCX) is a framework of communication APIs for HPC. It is optimized for MPI communication over InfiniBand and works with many MPI implementations such as Open MPI and MPICH.

wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
tar -xvf ucx-1.4.0.tar.gz
cd ucx-1.4.0
./configure --prefix=<ucx-install-path>
make -j 8 && make install
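After the install step, it can help to confirm that the UCX tools actually landed under the chosen prefix. The sketch below is only an illustration: UCX_PREFIX and its default /opt/ucx are hypothetical stand-ins for <ucx-install-path>, and the check relies on ucx_info -v, the UCX utility that prints the library version and build configuration.

```shell
# UCX_PREFIX is a placeholder for the directory passed to --prefix above
# (the default /opt/ucx is an assumption, not a path from this post).
UCX_PREFIX="${UCX_PREFIX-/opt/ucx}"

check_ucx() {
    # $1: UCX install prefix
    if [ -x "$1/bin/ucx_info" ]; then
        "$1/bin/ucx_info" -v    # print UCX version and build configuration
    else
        echo "ucx_info not found under $1/bin"
        return 1
    fi
}

check_ucx "$UCX_PREFIX" || true
```

If the message "ucx_info not found" appears, the prefix given to configure and the one checked here most likely differ.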

Optimizing MPI collectives and hierarchical communication algorithms (HCOLL):

MPI collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used across scientific parallel applications and have a significant impact on overall application performance. Refer to the HPC-X and HCOLL configuration parameters to optimize collective communication performance.

As an example, if you suspect your tightly coupled MPI application is doing an excessive amount of collective communication, you can try enabling hierarchical collectives (HCOLL). To enable those features, use the following parameters.

-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=<MLX device>:<Port>
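When switching HCOLL on and off repeatedly, a small helper that prints the matching mpirun flags keeps the command lines consistent. This is a minimal sketch: hcoll_flags is a hypothetical helper name, and the device string mlx5_0:1 used in the usage comment is only an assumed example of the <MLX device>:<Port> form.

```shell
hcoll_flags() {
    # $1: "on" or "off"
    # $2: optional <MLX device>:<Port> value for HCOLL_MAIN_IB (only used with "on")
    case "$1" in
        on)
            printf '%s' "--mca coll_hcoll_enable 1"
            if [ -n "${2-}" ]; then
                printf ' %s' "-x HCOLL_MAIN_IB=$2"
            fi
            printf '\n'
            ;;
        off)
            printf '%s\n' "--mca coll_hcoll_enable 0"
            ;;
        *)
            echo "usage: hcoll_flags on|off [device:port]" >&2
            return 2
            ;;
    esac
}

# Usage (the unquoted substitution is intentional so the flags split into words):
#   mpirun $(hcoll_flags on mlx5_0:1) ./my_app
hcoll_flags on
hcoll_flags off
```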

HCOLL features:

Scalable infrastructure: designed and implemented with current and emerging “extreme-scale” systems in mind
    Scalable communicator creation, memory consumption, runtime interface
    Asynchronous execution
    Blocking and non-blocking collective routines
Easily integrated into other packages
    Successfully integrated into OMPI – “hcoll” component in “coll” framework
    Successfully integrated in Mellanox OSHMEM
    Experimental integration in MPICH
Host level hierarchy awareness
    Socket groups, UMA groups
Exposes Mellanox and InfiniBand specific capabilities


How to build OpenMPI with HCOLL

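Install UCX as described above, then configure and build Open MPI as shown below. The configure line does not name an HCOLL location; if configure does not detect HCOLL on your system, Open MPI's --with-hcoll=<hcoll-install-path> option can point it at the install directory (the environment settings later in this post assume /opt/mellanox/hcoll).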

Steps:

./configure \
    --with-lsf=/LSF_HOME/10.1/ \
    --with-lsf-libdir=/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/ \
    --disable-man-pages \
    --enable-mca-no-build=btl-uct \
    --enable-mpi1-compatibility \
    --prefix $MY_HOME/openmpi-4.1.1/install \
    --with-ucx=/ucx-install_dir \
    CPPFLAGS=-I/ompi/opal/mca/hwloc/hwloc201/hwloc/include \
    --cache-file=/dev/null \
    --srcdir=. \
    --disable-option-checking
make
make install
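Once the build is installed, one way to confirm that the hcoll component was actually compiled in is to look for it in the ompi_info listing. The sketch below assumes that ompi_info prints one line per component in the form "MCA coll: <name> (...)"; has_coll_component is a hypothetical helper name.

```shell
has_coll_component() {
    # $1: component name; reads `ompi_info` output on stdin.
    # Assumes component lines look like "MCA coll: hcoll (MCA v..., ...)".
    grep -q "MCA coll: $1 "
}

if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_coll_component hcoll; then
        echo "hcoll component found"
    else
        echo "hcoll component NOT found; check the configure output"
    fi
else
    echo "ompi_info is not on PATH; add the Open MPI install bin directory first"
fi
```

If the component is missing, the HCOLL-related --mca options are not expected to have any effect, so this check is worth doing before any A/B testing.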

---------------------------Set Test Environment------------------------------------------------

export PATH=$MY_HOME/openmpi-4.1.1/install/bin:$PATH
export LD_LIBRARY_PATH=$MY_HOME/openmpi-4.1.1/install/lib:/opt/mellanox/hcoll/lib:/opt/mellanox/sharp/lib:$LD_LIBRARY_PATH
export OPAL_PREFIX=$MY_HOME/openmpi-4.1.1/install

NOTE: It may be necessary to pass LD_LIBRARY_PATH to mpirun explicitly, as described in step 3 of the next section.
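A mistyped entry in LD_LIBRARY_PATH fails silently: the loader simply skips directories that do not exist. The following sketch lists every entry of a colon-separated path list that is not a directory; missing_dirs is a hypothetical helper name, not part of Open MPI.

```shell
missing_dirs() {
    # $1: colon-separated directory list.
    # Prints each non-empty entry that does not exist as a directory.
    old_ifs=$IFS
    IFS=:
    for d in $1; do
        [ -z "$d" ] && continue
        [ -d "$d" ] || printf '%s\n' "$d"
    done
    IFS=$old_ifs
}

# Report missing entries in the current environment (prints nothing if all exist).
missing_dirs "${LD_LIBRARY_PATH-}"
```

Any line printed here points at a path worth rechecking before blaming the MPI library for a load failure.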

-------------- How to run an MPI test case without HCOLL --------------------------------------

1) Use these --mca options to disable HCOLL:

--mca coll_hcoll_enable 0
--mca coll_hcoll_priority 0

2) Add --mca coll_base_verbose 10 to get more details on which collective components are selected

3) Add -x LD_LIBRARY_PATH so that mpirun exports the library path to the launched processes, as shown below

-----------------------------Execute Testcase ----------------------------------

Testcase source: https://github.com/jeffhammond/BigMPI/tree/master/test

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allreduce_uniform_count

--------------------------------------------------------------------------

INT_MAX : 2147483647
UINT_MAX : 4294967295
SIZE_MAX : 18446744073709551615
----------------------:-----------------------------------------
: Count x Datatype size = Total Bytes
TEST_UNIFORM_COUNT : 2147483647
V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB
V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB
V_SIZE_INT : 2147483647 x 4 = 8.0 GB
----------------------:-----------------------------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank 2: PASSED
Rank 3: PASSED
Rank 0: PASSED
Rank 1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823
Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank 0: PASSED
Rank 2: PASSED
Rank 3: PASSED
Rank 1: PASSED
---------------------
Results from MPI_Iallreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank 2: PASSED
Rank 0: PASSED
Rank 3: PASSED
Rank 1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823
Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers
---------------------
Results from MPI_Iallreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank 2: PASSED
Rank 0: PASSED
Rank 3: PASSED
Rank 1: PASSED
[smpici@host01 BigCount]$
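The byte totals in the output above are simply count x datatype size, and they exceed INT_MAX, which is the point of a big-count test. The arithmetic can be reproduced with a short sketch; it assumes 64-bit shell arithmetic and that the test reports binary gigabytes (2^30 bytes), which matches the figures shown. The function names are hypothetical.

```shell
# Total payload in bytes for <count> elements of <size> bytes each.
payload_bytes() {
    echo $(( $1 * $2 ))
}

# Convert bytes to binary gigabytes (2^30 bytes), one decimal place.
to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

count=2147483647          # INT_MAX, the TEST_UNIFORM_COUNT value in the log
bytes=$(payload_bytes "$count" 4)
echo "int x $count = $bytes bytes or $(to_gib "$bytes") GB"
```

The same helpers give 17179869168 bytes (16.0 GB) for 1073741823 elements of the 16-byte double _Complex type, matching the adjusted-count lines in the log.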

=====================Example of a data integrity issue (DI issue)====

The test cases include end-to-end data integrity (DI) checks to detect data corruption. Any DI issue that is observed is treated as critical (a high-priority, high-severity defect).

DI issue with HCOLL: the following run shows an example.

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs --mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 test_allgatherv_uniform_count

Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank 2: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)
Rank 0: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)
Rank 3: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)
Rank 1: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)

---------------Let's run the same test case without HCOLL-------------------------------------------

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs --mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allgatherv_uniform_count

Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank 0: PASSED
Rank 2: PASSED
Rank 3: PASSED
Rank 1: PASSED

Results from MPI_Iallgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank 3: PASSED
Rank 2: PASSED
Rank 0: PASSED
Rank 1: PASSED
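When comparing runs like the two above, a pass/fail summary is easier to scan than the full log. The sketch below counts the per-rank result lines in captured test output and returns a non-zero status if any rank reported an error; summarize_ranks is a hypothetical helper, and the patterns assume the "Rank N: PASSED" and "Rank N: ERROR" line format shown in this post.

```shell
summarize_ranks() {
    # Reads test output on stdin, prints "passed=<n> failed=<n>",
    # and returns non-zero when any rank reported an ERROR line.
    awk '
        /^Rank [0-9]+: PASSED/ { passed++ }
        /^Rank [0-9]+: ERROR/  { failed++ }
        END {
            printf "passed=%d failed=%d\n", passed, failed
            exit (failed > 0)
        }'
}

# Example with two lines in the format shown above.
printf '%s\n' \
    "Rank 2: ERROR: DI in 805306368 of 2147483644 slots ( 37.5 % wrong)" \
    "Rank 0: PASSED" | summarize_ranks || echo "data integrity errors detected"
```

In practice the mpirun output would be piped into summarize_ranks, for example from a saved log file.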

=========================

How to enable and disable HCOLL to isolate a Mellanox defect, using --mca coll_base_verbose 10 to get more debug information

CASE 1: Enable HCOLL and run bigcount with allreduce

[smpici@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --timeout 500 --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

--------------------------------------------------------------------------

Total Memory Avail. : 567 GB
Percent memory to use : 6 %
Tolerate diff. : 10 GB
Max memory to use : 34 GB
----------------------:-----------------------------------------
INT_MAX : 2147483647
UINT_MAX : 4294967295
SIZE_MAX : 18446744073709551615
----------------------:-----------------------------------------
: Count x Datatype size = Total Bytes
TEST_UNIFORM_COUNT : 2147483647
V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB
V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB
V_SIZE_INT : 2147483647 x 4 = 8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank 2: PASSED
Rank 3: PASSED
Rank 0: PASSED
Rank 1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823
Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

Timeout: 500 seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
[smpici@myhostn01 collective-big-count]$

CASE 2: Disable HCOLL and run bigcount with allreduce

[user1@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 --mca coll_base_verbose 10 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

[myhostn01:1984671] coll:base:comm_select: component disqualified: self (priority -1 < 0)
[myhostn01:1984671] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[myhostn01:1984671] coll:base:comm_select: component not available: sm
[myhostn01:1984671] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component not available: sync
[myhostn01:1984671] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component available: tuned, priority: 30
[myhostn01:1984671] coll:base:comm_select: component not available: hcoll

----------------------:-----------------------------------------
Total Memory Avail. : 567 GB
Percent memory to use : 6 %
Tolerate diff. : 10 GB
Max memory to use : 34 GB
----------------------:-----------------------------------------
INT_MAX : 2147483647
UINT_MAX : 4294967295
SIZE_MAX : 18446744073709551615
----------------------:-----------------------------------------
: Count x Datatype size = Total Bytes
TEST_UNIFORM_COUNT : 2147483647
V_SIZE_DOUBLE_COMPLEX : 2147483647 x 16 = 32.0 GB
V_SIZE_DOUBLE : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT_COMPLEX : 2147483647 x 8 = 16.0 GB
V_SIZE_FLOAT : 2147483647 x 4 = 8.0 GB
V_SIZE_INT : 2147483647 x 4 = 8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank 2: PASSED
Rank 0: PASSED
Rank 3: PASSED
Rank 1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x 50.0% = 1073741823
Root : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Peer : payload 34359738336 32.0 GB = 16 dt x 1073741823 count x 2 peers x 1.0 inflation
Total : payload 34359738336 32.0 GB = 32.0 GB root + 32.0 GB x 0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank 0: PASSED
Rank 3: PASSED
Rank 2: PASSED
Rank 1: PASSED
[user1@myhostn01 collective-big-count]$
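The two cases differ only in the HCOLL settings, so the comparison can be scripted. The sketch below is a dry run by default: LAUNCH, HOSTS, TEST_BIN, and run_case are hypothetical names introduced here, the host list repeats the one used above, and the mpirun options are the ones already shown in this post.

```shell
# LAUNCH defaults to "echo", which turns every mpirun call into a dry run.
# Run with LAUNCH= (empty) to execute the commands for real.
LAUNCH="${LAUNCH-echo}"
# Placeholders: adjust the host list and test binary path for your cluster.
HOSTS="${HOSTS-myhost01:1,myhost02:1,myhost03:1,myhost04:1}"
TEST_BIN="${TEST_BIN-./test_allreduce_uniform_count}"

run_case() {
    # $1: label, $2: coll_hcoll_enable value, $3: coll_hcoll_priority value
    echo "== $1 =="
    $LAUNCH mpirun --timeout 500 --np 4 --npernode 1 -host "$HOSTS" \
        --mca coll_base_verbose 10 \
        --mca coll_hcoll_enable "$2" --mca coll_hcoll_priority "$3" \
        -x LD_LIBRARY_PATH "$TEST_BIN"
}

run_case "CASE 1: HCOLL enabled" 1 98
run_case "CASE 2: HCOLL disabled" 0 0
```

Keeping --timeout on both cases means a hang like the one in CASE 1 ends the run instead of blocking the comparison, and a failure that disappears in the second case points at the HCOLL path rather than at the application.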

=================================================================

This post briefly shows features for optimal collective communication performance and highlights general recommendations. Real application performance depends on your application characteristics, runtime configuration, transport protocols, processes-per-node (ppn) configuration, and so on.

Reference:
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf
https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi


Monday, February 14, 2022

Open MPI with hierarchical collectives (HCOLL) Algorithms
