Monday, February 14, 2022

Open MPI with hierarchical collectives (HCOLL) Algorithms

MPI, an acronym for Message Passing Interface, is a library specification for parallel computing architectures, which allows for communication of information between various nodes and clusters. Today, MPI is the most common protocol used in high performance computing (HPC).

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

Source: https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/

At the time of writing, the latest Open MPI release series is version 4.1.

Download Open MPI from https://www.open-mpi.org/software/ompi/v4.1/

Example: https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz


NOTE: NVIDIA Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X uses the 'hcoll' library for collective communication. On Azure HPC VMs, 'hcoll' is enabled by default in HPC-X and can be controlled at runtime with the parameter [-mca coll_hcoll_enable 1].

How to install UCX:

Unified Communication X (UCX) is a framework of communication APIs for HPC. It is optimized for MPI communication over InfiniBand and works with many MPI implementations such as OpenMPI and MPICH.

  • wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
  • tar -xvf ucx-1.4.0.tar.gz
  • cd ucx-1.4.0
  • ./configure --prefix=<ucx-install-path> 
  • make -j 8 && make install
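After make install, it can help to confirm that the install prefix really contains a usable UCX before pointing the Open MPI build at it. The sketch below is a hypothetical helper (check_ucx_prefix is my own name, not part of UCX). It only assumes the usual install layout, with ucx_info under bin/ and the UCP header under include/ucp/api/.

```shell
# Hypothetical helper: check that a UCX install prefix looks complete.
# Assumption: the usual UCX layout (bin/ucx_info, include/ucp/api/ucp.h).
check_ucx_prefix() {
    local prefix="$1"
    local missing=0
    if [ ! -x "$prefix/bin/ucx_info" ]; then
        echo "missing: $prefix/bin/ucx_info" >&2
        missing=1
    fi
    if [ ! -f "$prefix/include/ucp/api/ucp.h" ]; then
        echo "missing: $prefix/include/ucp/api/ucp.h" >&2
        missing=1
    fi
    # Return 0 only when both files were found.
    return "$missing"
}
```

A typical call would be check_ucx_prefix <ucx-install-path>, followed by <ucx-install-path>/bin/ucx_info -v to print the installed UCX version.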

Optimizing MPI collectives and hierarchical communication algorithms (HCOLL):

MPI collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used across scientific parallel applications and have a significant impact on overall application performance. The configuration parameters below can be used to tune collective communication performance with HPC-X and the HCOLL library.

As an example, if you suspect your tightly coupled MPI application is doing an excessive amount of collective communication, you can try enabling hierarchical collectives (HCOLL). To enable those features, use the following parameters.


-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=<MLX device>:<Port>
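The <MLX device>:<Port> placeholder has to be filled in per cluster. As a minimal sketch, the hypothetical helper below (hcoll_enable_opts is my own name) builds the option string from a device name and a port number and rejects obviously wrong input. On Linux, the InfiniBand device names can usually be listed with ls /sys/class/infiniband.

```shell
# Hypothetical helper: build the mpirun options that enable HCOLL on a
# given InfiniBand device and port.
hcoll_enable_opts() {
    local device="$1" port="$2"
    if [ -z "$device" ] || [ -z "$port" ]; then
        echo "usage: hcoll_enable_opts <MLX device> <Port>" >&2
        return 1
    fi
    # The port must consist of digits only.
    case "$port" in
        *[!0-9]*) echo "port must be a number: $port" >&2; return 1 ;;
    esac
    printf '%s\n' "-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=${device}:${port}"
}
```

For example, mpirun $(hcoll_enable_opts mlx5_0 1) ./my_app would expand to the parameters shown above. Both mlx5_0 and ./my_app are placeholders here, so substitute the device and binary from your own system.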

HCOLL:

Scalable infrastructure: Designed and implemented with current and emerging “extreme-scale” systems in mind

  • Scalable communicator creation, memory consumption, runtime interface
  • Asynchronous execution
  • Blocking and non-blocking collective routines
  • Easily integrated into other packages
      • Successfully integrated into OMPI – “hcoll” component in “coll” framework
      • Successfully integrated in Mellanox OSHMEM
      • Experimental integration in MPICH
  • Host level hierarchy awareness
      • Socket groups, UMA groups
  • Exposes Mellanox and InfiniBand specific capabilities

How to build OpenMPI with HCOLL

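Install UCX as described above, then configure and build Open MPI as shown below. If configure does not pick up HCOLL automatically, its location can be passed with --with-hcoll=<hcoll-install-path> (for example /opt/mellanox/hcoll, the location assumed in the environment setup that follows).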

Steps:

  1. ./configure --with-lsf=/LSF_HOME/10.1/ --with-lsf-libdir=/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/ --disable-man-pages --enable-mca-no-build=btl-uct --enable-mpi1-compatibility  --prefix $MY_HOME/openmpi-4.1.1/install --with-ucx=/ucx-install_dir CPPFLAGS=-I/ompi/opal/mca/hwloc/hwloc201/hwloc/include --cache-file=/dev/null --srcdir=. --disable-option-checking
  2. make 
  3. make install
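Once the build is installed, it is worth checking that the hcoll component actually made it into the "coll" framework, since configure skips components whose dependencies it cannot find. The sketch below is a hypothetical helper (has_coll_component is my own name) that reads ompi_info output on stdin. It assumes the "MCA coll: <name> (...)" line format printed by Open MPI 4.x.

```shell
# Hypothetical helper: read `ompi_info` output on stdin and succeed if the
# named component of the "coll" framework is listed.
# Assumption: lines look like "MCA coll: <name> (MCA v..., ...)".
has_coll_component() {
    local name="$1"
    grep -Eq "MCA coll: ${name}( |\$)"
}
```

A possible use is $MY_HOME/openmpi-4.1.1/install/bin/ompi_info | has_coll_component hcoll && echo "hcoll is built in".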

---------------------------Set Test Environment------------------------------------------------

  1.  export PATH=$MY_HOME/openmpi-4.1.1/install/bin:$PATH
  2.  export LD_LIBRARY_PATH=$MY_HOME/openmpi-4.1.1/install/lib:/opt/mellanox/hcoll/lib:/opt/mellanox/sharp/lib:$LD_LIBRARY_PATH
  3.  export OPAL_PREFIX=$MY_HOME/openmpi-4.1.1/install
NOTE: It may be necessary to pass LD_LIBRARY_PATH to mpirun explicitly with -x LD_LIBRARY_PATH, as in the mpirun commands below.
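A mistyped or missing directory in LD_LIBRARY_PATH is easy to overlook and usually only shows up later as a library load failure. As a small sanity check, the hypothetical helper below (missing_path_entries is my own name) prints every entry of a colon-separated path list that is not an existing directory.

```shell
# Hypothetical helper: print each entry of a colon-separated path list
# that does not exist as a directory. Empty entries are ignored.
missing_path_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        [ -n "$entry" ] || continue
        [ -d "$entry" ] || printf '%s\n' "$entry"
    done
}
```

Calling missing_path_entries "$LD_LIBRARY_PATH" after the exports above should print nothing when all the directories are in place.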

--------------  How to run an MPI test case without HCOLL--------------------------------------

1) Use these --mca options to disable HCOLL:

--mca coll_hcoll_enable 0 

--mca coll_hcoll_priority 0 

2) Add --mca coll_base_verbose 10 to get more details about which collective components are selected.

3) Add -x LD_LIBRARY_PATH so that mpirun exports the library path to the launched processes, as in the command below.
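When switching HCOLL on and off repeatedly, it is convenient to keep the two option sets in one place. The sketch below is a hypothetical helper (hcoll_mode_opts is my own name). The priority values 98 and 0 are simply the ones used in the runs in this post, not recommended defaults.

```shell
# Hypothetical helper: print the --mca options for running with HCOLL
# "on" or "off". An optional second argument "verbose" appends
# --mca coll_base_verbose 10. Priorities 98/0 are the values used in this post.
hcoll_mode_opts() {
    local opts
    case "${1:-}" in
        on)  opts="--mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98" ;;
        off) opts="--mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0" ;;
        *)   echo "usage: hcoll_mode_opts on|off [verbose]" >&2; return 1 ;;
    esac
    if [ "${2:-}" = "verbose" ]; then
        opts="$opts --mca coll_base_verbose 10"
    fi
    printf '%s\n' "$opts"
}
```

For instance, mpirun $(hcoll_mode_opts off verbose) -x LD_LIBRARY_PATH <testcase> runs the test case without HCOLL and with verbose component selection output.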


-----------------------------Execute Testcase ----------------------------------

Testcase source:  https://github.com/jeffhammond/BigMPI/tree/master/test

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allreduce_uniform_count

--------------------------------------------------------------------------

INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  3: PASSED
Rank  0: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED
---------------------
Results from MPI_Iallreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Iallreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
[smpici@host01 BigCount]$
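The byte totals in the output above follow directly from count x datatype size, and they are the reason these tests are interesting: the totals are far larger than INT_MAX. As a quick worked check, the two small functions below (names are my own) reproduce the arithmetic, assuming 64-bit shell arithmetic and 1 GB = 1024^3 bytes as the test output appears to use.

```shell
# Reproduce the "Count x Datatype size = Total Bytes" arithmetic.
# Assumption: the shell has 64-bit integer arithmetic (bash does).
payload_bytes() {
    local count="$1" dt_size="$2"
    echo $(( count * dt_size ))
}

# Convert bytes to GB (1 GB = 1024^3 bytes), with one decimal place.
bytes_to_gb() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}
```

For the first allreduce, payload_bytes 2147483647 4 gives 8589934588, and bytes_to_gb 8589934588 gives 8.0, matching the "int x 2147483647 = 8589934588 or 8.0 GB" line.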

=====================Example of a data integrity (DI) issue====

The test cases run end-to-end data integrity checks to detect data corruption. Any DI issue observed is critical (a high-priority, high-severity defect).

Here is an example of a DI issue that appears when HCOLL is enabled.

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 test_allgatherv_uniform_count 


Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  2: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  0: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  3: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  1: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
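When such a run is saved to a log file, the DI lines can be condensed into one record per rank for a defect report. The sketch below is a hypothetical helper (summarize_di_errors is my own name) and assumes the "ERROR: DI in <wrong> of <total> slots" line format shown above. It recomputes the percentage from the two counts rather than trusting the printed value.

```shell
# Hypothetical helper: read BigCount output on stdin and, for each
# "ERROR: DI" line, print "<rank> <wrong> <total> <percent>".
summarize_di_errors() {
    awk '/ERROR: DI in/ {
        rank = $2; sub(/:$/, "", rank)          # "2:" -> "2"
        for (i = 1; i <= NF; i++) {
            if ($i == "in") wrong = $(i + 1)    # number after "in"
            if ($i == "of") total = $(i + 1)    # number after "of"
        }
        printf "%s %s %s %.1f\n", rank, wrong, total, 100 * wrong / total
    }'
}
```

Feeding it the output above (for example summarize_di_errors < run.log, where run.log is a placeholder file name) yields one line per failing rank, such as "2 805306368 2147483644 37.5".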


---------------Let's run the same test case without HCOLL-------------------------------------------


$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allgatherv_uniform_count   

Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED

Results from MPI_Iallgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  3: PASSED
Rank  2: PASSED
Rank  0: PASSED
Rank  1: PASSED

=========================

How to enable and disable HCOLL to isolate a Mellanox (HCOLL) defect, using --mca coll_base_verbose 10 to get more debug information

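CASE 1: Enable HCOLL and run BigCount with allreduce. The second MPI_Allreduce (double _Complex) does not complete within the 500-second timeout, so the job is aborted.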

[smpici@myhostn01 collective-big-count]$  /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --timeout 500 --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

--------------------------------------------------------------------------

Total Memory Avail.   :  567 GB
Percent memory to use :    6 %
Tolerate diff.        :   10 GB
Max memory to use     :   34 GB
----------------------:-----------------------------------------
INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  3: PASSED
Rank  0: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 500 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
[smpici@myhostn01 collective-big-count]$
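When comparing many runs with HCOLL on and off, it helps to reduce each saved log to a single verdict. The sketch below is a hypothetical helper (classify_run is my own name) that relies only on messages that appear in the outputs in this post. The timeout check comes first because a run that times out can still contain PASSED lines from earlier tests, as in Case 1.

```shell
# Hypothetical helper: classify a saved mpirun log (read from stdin) as
# TIMEOUT, DI_ERROR, PASSED, or UNKNOWN, based on messages shown in this post.
classify_run() {
    local log
    log="$(cat)"
    case "$log" in
        *"time limit for job execution has been reached"*) echo "TIMEOUT" ;;
        *"ERROR: DI"*) echo "DI_ERROR" ;;
        *": PASSED"*)  echo "PASSED" ;;
        *)             echo "UNKNOWN" ;;
    esac
}
```

For example, classify_run < hcoll_on.log and classify_run < hcoll_off.log (both file names are placeholders) give a quick side-by-side of the two configurations.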

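CASE 2: Disable HCOLL and run BigCount with allreduce. With HCOLL disabled, all ranks pass; the verbose output reports the tuned component as available and hcoll as not available.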

[user1@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 --mca coll_base_verbose 10 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

[myhostn01:1984671] coll:base:comm_select: component disqualified: self (priority -1 < 0)
[myhostn01:1984671] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[myhostn01:1984671] coll:base:comm_select: component not available: sm
[myhostn01:1984671] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component not available: sync
[myhostn01:1984671] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component available: tuned, priority: 30
[myhostn01:1984671] coll:base:comm_select: component not available: hcoll

----------------------:-----------------------------------------
Total Memory Avail.   :  567 GB
Percent memory to use :    6 %
Tolerate diff.        :   10 GB
Max memory to use     :   34 GB
----------------------:-----------------------------------------
INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  0: PASSED
Rank  3: PASSED
Rank  2: PASSED
Rank  1: PASSED
[user1@myhostn01 collective-big-count]$
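The coll_base_verbose lines in Case 2 are the quickest way to confirm which collective components were actually in play. As a minimal sketch, the hypothetical helper below (available_coll_components is my own name) extracts the "component available" lines from a saved log, assuming the line format shown in the output above.

```shell
# Hypothetical helper: read mpirun output produced with
# --mca coll_base_verbose 10 on stdin and print "<component> <priority>"
# for each "component available" line, without duplicates.
available_coll_components() {
    sed -n 's/.*coll:base:comm_select: component available: \([^,]*\), priority: \([0-9]*\).*/\1 \2/p' | sort -u
}
```

On the Case 2 output this prints "tuned 30". If hcoll were enabled and usable, one would expect it to appear in this list as well.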

=================================================================

This post briefly showed the HCOLL features and parameters for tuning collective communication performance, along with some general recommendations for testing them. Real application performance depends on your application characteristics, runtime configuration, transport protocols, processes per node (ppn) configuration, and so on.


Reference:
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf
https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi