Monday, February 14, 2022

Open MPI with Hierarchical Collectives (HCOLL) Algorithms

MPI (Message Passing Interface) is a library specification for parallel computing that enables communication between processes running across the nodes of a cluster. Today, MPI is the most common communication standard used in high performance computing (HPC).

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

Source: https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/

At the time of writing, the latest Open MPI release series is 4.1.

Download Open MPI from https://www.open-mpi.org/software/ompi/v4.1/

Example: https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz


NOTE: NVIDIA Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X uses the 'hcoll' library for collective communication; 'hcoll' is enabled by default in HPC-X on Azure HPC VMs and can be controlled at runtime with the parameter -mca coll_hcoll_enable 1.

How to install UCX:

Unified Communication X (UCX) is a framework of communication APIs for HPC. It is optimized for MPI communication over InfiniBand and works with many MPI implementations, such as Open MPI and MPICH.

  • wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
  • tar -xvf ucx-1.4.0.tar.gz
  • cd ucx-1.4.0
  • ./configure --prefix=<ucx-install-path> 
  • make -j 8 && make install
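After "make install", it can be worth sanity-checking the build with the ucx_info utility that ships with UCX. A minimal sketch, assuming UCX_PREFIX points at the --prefix value used above (the default path here is a placeholder):

```shell
# Sanity-check a UCX installation. UCX_PREFIX is a placeholder for the
# --prefix value passed to ./configure above.
UCX_PREFIX=${UCX_PREFIX:-$HOME/ucx-install}

if [ -x "$UCX_PREFIX/bin/ucx_info" ]; then
    "$UCX_PREFIX/bin/ucx_info" -v   # prints the UCX version and build configuration
    "$UCX_PREFIX/bin/ucx_info" -d   # lists the transports/devices UCX detected
else
    echo "ucx_info not found under $UCX_PREFIX/bin"
fi
```

Running ucx_info -d is a quick way to confirm UCX can see the InfiniBand devices before building Open MPI against it.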

Optimizing MPI collectives and hierarchical communication algorithms (HCOLL):

MPI collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used across scientific parallel applications and have a significant impact on overall application performance. The following configuration parameters can be used to tune collective communication performance with HPC-X and the HCOLL library.

As an example, if you suspect your tightly coupled MPI application is doing an excessive amount of collective communication, you can try enabling hierarchical collectives (HCOLL). To enable HCOLL, use the following parameters.


-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=<MLX device>:<Port>
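As an illustration, the snippet below composes an mpirun command line with HCOLL either enabled or disabled. The device name mlx5_0 and port 1 are assumptions (check your actual HCA name with ibstat), my_mpi_app is a hypothetical application binary, and the script only prints the command by default (DRY_RUN=1) rather than executing it:

```shell
# Sketch: compose an mpirun command with HCOLL toggled on or off.
# MLX_DEV/MLX_PORT are assumed values -- verify with ibstat on your system.
HCOLL=${HCOLL:-1}
MLX_DEV=${MLX_DEV:-mlx5_0}
MLX_PORT=${MLX_PORT:-1}

if [ "$HCOLL" = "1" ]; then
    COLL_OPTS="-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=${MLX_DEV}:${MLX_PORT}"
else
    COLL_OPTS="-mca coll_hcoll_enable 0 -mca coll_hcoll_priority 0"
fi

# my_mpi_app is a hypothetical binary; substitute your own application.
CMD="mpirun --np 4 --npernode 1 $COLL_OPTS -x LD_LIBRARY_PATH ./my_mpi_app"

# Print the command by default; set DRY_RUN=0 to actually launch it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"
else
    $CMD
fi
```

Keeping the toggle in one place makes it easy to run the same test case with and without HCOLL, as done in the comparison runs later in this post.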

HCOLL:

Scalable infrastructure: designed and implemented with current and emerging “extreme-scale” systems in mind

  • Scalable communicator creation, memory consumption, and runtime interface
  • Asynchronous execution
  • Blocking and non-blocking collective routines

Easily integrated into other packages

  • Successfully integrated into Open MPI – the “hcoll” component in the “coll” framework
  • Successfully integrated into Mellanox OSHMEM
  • Experimental integration in MPICH

Host-level hierarchy awareness

  • Socket groups, UMA groups

Exposes Mellanox- and InfiniBand-specific capabilities

How to build Open MPI with HCOLL

Install UCX as described above, then configure and build Open MPI as shown below.

Steps:

  1. ./configure --with-lsf=/LSF_HOME/10.1/ --with-lsf-libdir=/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/ --disable-man-pages --enable-mca-no-build=btl-uct --enable-mpi1-compatibility --prefix=$MY_HOME/openmpi-4.1.1/install --with-ucx=/ucx-install_dir CPPFLAGS=-I/ompi/opal/mca/hwloc/hwloc201/hwloc/include --cache-file=/dev/null --srcdir=. --disable-option-checking
  2. make 
  3. make install
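Once installed, ompi_info can confirm that the UCX and hcoll components were actually built in. Note that, depending on your system, you may also need to point configure at the HCOLL installation explicitly (for example --with-hcoll=/opt/mellanox/hcoll). A minimal check, assuming the install prefix from the configure step:

```shell
# Sketch: confirm the new Open MPI build picked up UCX and HCOLL support.
# OMPI_BIN is derived from the --prefix used in the configure step above.
OMPI_BIN=${OMPI_BIN:-$MY_HOME/openmpi-4.1.1/install/bin}

if [ -x "$OMPI_BIN/ompi_info" ]; then
    "$OMPI_BIN/ompi_info" | grep -i ucx     # UCX-backed components
    "$OMPI_BIN/ompi_info" | grep -i hcoll   # expect an "MCA coll: hcoll" line if found
else
    echo "ompi_info not found under $OMPI_BIN"
fi
```

If the hcoll line is missing, Open MPI will silently fall back to its built-in collective components, so this check can save a confusing debugging session later.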

---------------------------Set Test Environment------------------------------------------------

  1.  export PATH=$MY_HOME/openmpi-4.1.1/install/bin:$PATH
  2.  export LD_LIBRARY_PATH=$MY_HOME/openmpi-4.1.1/install/lib:/opt/mellanox/hcoll/lib:/opt/mellanox/sharp/lib:$LD_LIBRARY_PATH
  3.  export OPAL_PREFIX=$MY_HOME/openmpi-4.1.1/install
NOTE: It may be necessary to explicitly pass LD_LIBRARY_PATH to mpirun (with -x LD_LIBRARY_PATH) so that remote ranks pick up the path set in step (2).

--------------  How to run mpi testcase without HCOLL--------------------------------------

1) Use these --mca options to disable HCOLL:

--mca coll_hcoll_enable 0 

--mca coll_hcoll_priority 0 

2) Add --mca coll_base_verbose 10 to get more detail on collective component selection

3) Add -x LD_LIBRARY_PATH so that remote ranks pick up the proper library path, as shown below


-----------------------------Execute Testcase ----------------------------------

Testcase source:  https://github.com/jeffhammond/BigMPI/tree/master/test

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allreduce_uniform_count

--------------------------------------------------------------------------

INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  3: PASSED
Rank  0: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED
---------------------
Results from MPI_Iallreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Iallreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
[smpici@host01 BigCount]$

===================== Example of a data integrity (DI) issue =====================

The test suite performs end-to-end data integrity checks to detect data corruption. Any DI issue observed is critical (a high-priority, high-severity defect).

Let's look at an example of a DI issue seen with HCOLL enabled.

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 test_allgatherv_uniform_count 


Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  2: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  0: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  3: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  1: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)


--------------- Let's run the same testcase without HCOLL -------------------------------------------


$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allgatherv_uniform_count   

Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED

Results from MPI_Iallgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  3: PASSED
Rank  2: PASSED
Rank  0: PASSED
Rank  1: PASSED

This post briefly showed the features available for optimal collective communication performance and highlighted general recommendations. Real application performance depends on your application's characteristics, runtime configuration, transport protocols, processes-per-node (ppn) configuration, and so on.


Reference:
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf
https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi



Friday, January 28, 2022

HPC Clusters in a Multi-Cloud Environment

High performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. One of the best-known types of HPC solutions is the supercomputer, which contains thousands of compute nodes that work together to complete one or more tasks; this is called parallel processing. HPC solutions have three main components: compute, network, and storage. To build a high performance computing architecture, compute servers are networked together into a cluster, software programs and algorithms are run simultaneously on the servers in the cluster, and the cluster is networked to data storage to capture the output. Together, these components operate seamlessly to complete a diverse set of tasks.

To operate at maximum performance, each component must keep pace with the others. For example, the storage component must be able to feed and ingest data to and from the compute servers as quickly as it is processed. Likewise, the networking components must be able to support the high-speed transportation of data between compute servers and the data storage. If one component cannot keep up with the rest, the performance of the entire HPC infrastructure suffers.

Containers give HPC the portability that hybrid cloud demands. Containers are ready-to-execute packages of software. Container technology provides hardware abstraction, wherein the container is not tightly coupled with the server. Abstraction between the hardware and software stacks provides ease of access, ease of use, and the agility that bare metal environments lack.


Software containers and Kubernetes are important tools for building, deploying, running, and managing modern enterprise applications at scale, and for delivering enterprise software faster and more reliably to the end user while using resources more efficiently and reducing costs. High performance computing (HPC) has recently been moving closer to the enterprise and can therefore benefit from an HPC container and Kubernetes ecosystem, with new requirements to quickly allocate and deallocate computational resources to HPC workloads so that compute capacity no longer needs to be planned in advance. The HPC community is picking up the concept and applying it to batch jobs and interactive applications.

In a multi-cloud environment, an enterprise utilizes multiple public cloud services, most often from different cloud providers. For example, an organization might host its web front-end application on AWS and host its Exchange servers on Microsoft Azure. Since all cloud providers are not created equal, organizations adopt a multi-cloud strategy to deliver best of breed IT services, to prevent lock-in to a single cloud provider, or to take advantages of cloud arbitrage and choose providers for specific services based on which provider is offering the lowest price at that time. Although it is similar to a hybrid cloud, multi-cloud specifically indicates more than one public cloud provider service and need not include a private cloud component at all. Enterprises adopt a multi-cloud strategy so as not to ‘keep all their eggs in a single basket’, for geographic or regulatory governance demands, for business continuity, or to take advantage of features specific to a particular provider.


Multi-cloud is the use of multiple cloud computing and storage services in a single network architecture. This refers to the distribution of cloud assets, software, applications, and more across several cloud environments. With a typical multi-cloud architecture utilizing two or more public clouds as well as private clouds, a multi-cloud environment aims to eliminate the reliance on any single cloud provider or instance.

Multi-cloud is the use of two or more cloud computing services from any number of different cloud vendors. A multi-cloud environment could be all-private, all-public or a combination of both. Companies use multi-cloud environments to distribute computing resources and minimize the risk of downtime and data loss. They can also increase the computing power and storage available to a business. Innovations in the cloud in recent years have resulted in a move from single-user private clouds to multi-tenant public clouds and hybrid clouds — a heterogeneous environment that leverages different infrastructure environments like the private and public cloud.

A multi-cloud platform combines the best services that each platform offers. This allows companies to customize an infrastructure that is specific to their business goals. A multi-cloud architecture also lowers risk: if one web service host fails, a business can continue to operate with other platforms in a multi-cloud environment rather than storing all data in one place. Examples of public cloud providers include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

Hybrid cloud: A hybrid cloud architecture is a mix of on-premises, private, and public cloud services with orchestration between the cloud platforms. Hybrid cloud management involves unique entities that are managed as one across all environments. Hybrid cloud architecture allows an enterprise to move data and applications between private and public environments based on business and compliance requirements. For example, customer data can live in a private environment, but heavy processing can be sent to the public cloud without customer data ever leaving the private environment. Hybrid cloud computing allows instant transfer of information between environments, allowing enterprises to experience the benefits of both.


Hybrid cloud architecture works well for the following industries:

• Finance: Financial firms are able to significantly reduce their space requirements in a hybrid cloud architecture when trade orders are placed on a private cloud and trade analytics live on a public cloud.

• Healthcare: When hospitals send patient data to insurance providers, hybrid cloud computing ensures HIPAA compliance.

• Legal: Hybrid cloud security allows encrypted data to live off-site in a public cloud while connected to a law firm’s private cloud. This protects original documents from the threat of theft or loss by natural disaster.

• Retail: Hybrid cloud computing helps companies process resource-intensive sales data and analytics.

The hybrid cloud strategy can be applied to move workloads dynamically to the most appropriate IT environment based on cost, performance, and security: utilize on-premises resources for existing workloads, use public or hosted clouds for new workloads, and run internal business systems and data on premises while customer-facing systems run on infrastructure-as-a-service (IaaS), public, or hosted clouds.

Reference:

https://www.hpcwire.com/2019/09/19/kubernetes-containers-and-hpc
https://www.hpcwire.com/2020/03/19/kubernetes-and-hpc-applications-in-hybrid-cloud-environments-part-ii
https://www.hpcwire.com/2021/09/02/kubernetes-based-hpc-clusters-in-azure-and-google-cloud-multi-cloud-environment
https://www-stage.avinetworks.com/
https://www.vmware.com/topics/glossary/content/hybrid-cloud-vs-multi-cloud