Friday, December 27, 2019

Distributed computing with the Message Passing Interface (MPI)

One thing is certain: the explosion of data creation in our society will continue for as far ahead as anyone can forecast. In response, there is an insatiable demand for more advanced high performance computing to make this data useful.

The IT industry has been pushing to new levels of high-end computing performance; this is the dawn of the exascale era of computing. Recent announcements from the US Department of Energy for exascale computers represent the starting point for a new generation of computing advances. This is critical for the advancement of any number of use cases, such as understanding the interactions underlying the science of weather, sub-atomic structures, genomics, physics, rapidly emerging artificial intelligence applications, and other important scientific fields. The doubling time of supercomputer performance slowed to roughly every 2.3 years between 2009 and 2019, due to several factors including the slowdown of Moore's Law and technical constraints such as the end of Dennard scaling. Pushing the bleeding edge of performance and efficiency will require new architectures and computing paradigms. There is a good chance that 5 nanometer process technology could come to market in 2020 or 2021 thanks to advances in semiconductor engineering.

The rapidly increasing number of cores in modern microprocessors is pushing current high performance computing (HPC) systems into the exascale era. The hybrid nature of these systems, with distributed memory across nodes and shared memory with non-uniform memory access within each node, poses a challenge. The Message Passing Interface (MPI) is a standardized message-passing library interface specification: an abstract description of how messages can be exchanged between different processes. In other words, MPI is a portable message-passing standard developed for distributed and parallel computing. Because it is a standard, MPI has multiple implementations. It provides a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory. MPI routines can be called from C, C++, and Fortran, and, through additional bindings, from languages such as C#, Java, and Python. The advantages of MPI over older message-passing libraries are portability (MPI has been implemented for almost every distributed memory architecture) and speed (each implementation is in principle optimized for the hardware on which it runs). MPI is the dominant communications protocol used in high performance computing today for writing message-passing programs. The MPI-4.0 standard is under development.
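Because MPI is a standard rather than a single library, the same source compiles against any conforming implementation (MPICH, Open MPI, and so on). Below is a minimal sketch in C showing only initialization, rank and size queries, and finalization; compile it with an MPI wrapper compiler such as mpicc and launch it with mpirun.

/* Minimal MPI program: every process reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);                       /* start the MPI runtime */

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   /* this process's rank */

    printf("Hello from rank %d of %d\n", world_rank, world_size);

    MPI_Finalize();                               /* shut down the MPI runtime */
    return 0;
}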

MPI implementations and their derivatives:

A number of groups work on MPI implementations. The two principal ones are Open MPI, an open-source implementation, and MPICH. MPICH serves as the foundation for the vast majority of MPI derivatives, including IBM MPI (for Blue Gene), Intel MPI, Cray MPI, Microsoft MPI, Myricom MPI, OSU MVAPICH/MVAPICH2, and many others; MPICH and its derivatives form the most widely used implementations of MPI in the world, as shown here. On the other hand, IBM Spectrum MPI and Mellanox HPC-X are MPI implementations based on Open MPI. Similarly, bullx MPI is built around Open MPI, enhanced by Bull with optimized collective communication.

Open MPI was formed by merging FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI, and is found in many TOP500 supercomputers.

MPI offers a standard API and is portable: you can use the same source code on all platforms without modification, and it is relatively trivial to switch an application between different MPI implementations. Most MPI implementations use sockets for TCP-based communication, and odds are good that any given MPI implementation will be better optimized and provide faster message passing than a home-grown application using sockets directly. In addition, should you ever get the chance to run your code on a cluster that has InfiniBand, the MPI layer will abstract away those code changes. This is not a trivial advantage: coding an application to use OFED (or another InfiniBand Verbs implementation) directly is very difficult. Most MPI implementations also include small test applications that can be used to verify the correctness of the networking setup independently of your application, which is a major advantage when it comes time to debug. Finally, the MPI standard includes the PMPI profiling interface, which also lets you easily add checksums or other data verification to all messages.
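As an illustration of the PMPI interface, the sketch below intercepts MPI_Send to count the bytes an application sends and then forwards the call to the real implementation through PMPI_Send. The counter and its handling are purely illustrative; a real profiler would also wrap other calls (for example MPI_Finalize, to report its statistics).

/* Hedged sketch of a PMPI wrapper: link it ahead of the MPI library. */
#include <mpi.h>

static long long bytes_sent = 0;   /* illustrative counter; not thread-safe */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);          /* bytes per element */
    bytes_sent += (long long)count * type_size;   /* accumulate payload size */

    /* hand the call off to the underlying MPI implementation */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}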

The standard Message Passing Interface (MPI) has two-sided (point-to-point) and collective communication models. In these models, both sender and receiver have to participate in data exchange operations explicitly, which requires synchronization between the processes. Communications can be of two types:
  •     Point-to-point: two processes in the same communicator communicate with each other.
  •     Collective: all the processes in a communicator communicate together.

One-sided communication is a newer model that allows communication to happen in a highly asynchronous way by defining windows of memory that every process can write to and/or read from. It revolves around the idea of Remote Memory Access (RMA). Traditional point-to-point or collective communications basically work in two steps: first the data is transferred from the originating process(es) to the destination(s); then the sending and receiving processes are synchronised in some way (be it blocking synchronisation or by calling MPI_Wait). RMA allows us to decouple these two steps. One of the biggest implications of this is the possibility to define shared memory that will be used by many processes (cf. MPI_Win_allocate_shared). Although shared memory might seem out of scope for MPI, which was initially designed for distributed memory, it makes sense to include such functionality to let processes that share the same NUMA node cooperate, for instance. All of these functionalities are grouped under the name of "one-sided communications", since they imply that only one process is needed to store or load information in a remote or shared memory buffer.
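A minimal sketch of RMA in C, assuming at least two processes: rank 0 writes a value directly into a window exposed by rank 1, and rank 1 never posts a matching receive.

/* One-sided communication: rank 0 puts an integer into rank 1's window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = 0;                       /* memory exposed through the window */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open an access epoch on all ranks */
    if (rank == 0) {
        int value = 42;
        /* write one int into rank 1's window at displacement 0 */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);               /* close the epoch; the put is complete */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}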
 



In two-sided communication, memory is private to each process. When the sender calls MPI_Send and the receiver calls MPI_Recv, data in the sender's memory is copied to a buffer and then sent over the network, where it is copied into the receiver's memory. One drawback of this approach is that the sender has to wait for the receiver to be ready before the data can be sent, which may delay the transfer, as shown here.
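For comparison, a minimal two-sided exchange looks like the sketch below (again assuming at least two processes): both sides must make a matching call before the transfer completes.

/* Two-sided communication: rank 0 sends one integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* matching receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}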


A simplified diagram of MPI two-sided send/receive: the sender calls MPI_Send but has to wait until the receiver calls MPI_Recv before data can be sent. To overcome this drawback, the MPI-2 standard introduced Remote Memory Access (RMA), also called one-sided communication because it requires only one process to transfer data. One-sided communication decouples data transfer from system synchronization. The MPI-3.0 standard revised and extended the one-sided communication interface, adding new functionality to improve the performance of MPI-2 RMA.

Collective Data Movement:

MPI_BCAST, MPI_GATHER, and MPI_SCATTER are collective data movement routines in which all processes interact with a distinguished root process. Consider, for example, the communication performed in a finite difference program with three processes: each column represents a processor, and each figure shows data movement in a single phase. The five phases illustrated are (1) broadcast, (2) scatter, (3) nearest-neighbour exchange, (4) reduction, and (5) gather. The program uses the following calls; a short sketch in C follows the list.

  1. MPI_BCAST to broadcast the problem size parameter (size) from process 0 to all np processes;
  2. MPI_SCATTER to distribute an input array (work) from process 0 to other processes, so that each process receives size/np elements; 
  3. MPI_SEND and MPI_RECV for exchange of data (a single floating-point number) with neighbours;
  4. MPI_ALLREDUCE to determine the maximum of a set of localerr values computed at the different processes and to distribute this maximum value to each process; and
  5. MPI_GATHER to accumulate an output array at process 0. 
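Below is a condensed sketch of phases 1, 2, 4, and 5; the nearest-neighbour exchange (phase 3) uses the point-to-point pattern shown earlier. The variable names are illustrative, and the problem size is assumed to divide evenly by the number of processes.

/* Sketch of the collective calls used in the finite difference example. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* 1. broadcast the problem size from process 0 to all np processes */
    int size = 1024;
    MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* 2. scatter the input array so each process receives size/np elements
       (in a real program, process 0 would fill 'work' with actual input) */
    double *work = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
    double *chunk = malloc((size / np) * sizeof(double));
    MPI_Scatter(work, size / np, MPI_DOUBLE,
                chunk, size / np, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* 4. find the maximum local error and give the result to every process */
    double localerr = 0.001 * rank, maxerr;
    MPI_Allreduce(&localerr, &maxerr, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    /* 5. gather the per-process chunks into an output array on process 0 */
    double *result = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
    MPI_Gather(chunk, size / np, MPI_DOUBLE,
               result, size / np, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(chunk); free(work); free(result);
    MPI_Finalize();
    return 0;
}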

Many common MPI benchmarks are based primarily on point-to-point communication, providing the best opportunities for analyzing the performance impact of the MCA on real applications. Open MPI implements MPI point-to-point functions on top of the Point-to-Point Management Layer (PML) and Point-to-Point Transport Layer (PTL) frameworks. The PML fragments messages, schedules fragments across PTLs, and handles incoming message matching. The PTL provides an interface between the PML and the underlying network devices.




where:
 Point-to-Point Management Layer (PML)
 Point-to-Point Transport Layer (PTL)
 Byte Transfer Layer (BTL)


Open MPI is a large project containing many different sub-systems and a relatively large code base. It has three main sections of code, listed here:
  •     OMPI: The MPI API and supporting logic
  •     ORTE: The Open Run-Time Environment (support for different back-end run-time systems)
  •     OPAL: The Open Portable Access Layer (utility and "glue" code used by OMPI and ORTE)
There are strict abstraction barriers in the code between these sections. That is, they are compiled into three separate libraries: libmpi, liborte, and libopal with a strict dependency order: OMPI depends on ORTE and OPAL, and ORTE depends on OPAL.

The message passing interface (MPI) is one of the most popular parallel programming models for distributed memory systems. As the number of cores per node has increased, programmers have increasingly combined MPI with shared memory parallel programming interfaces, such as the OpenMP programming model. This hybrid of distributed-memory and shared-memory parallel programming idioms has helped programmers perform efficient internode communication while effectively exploiting advances in node-level architectures, including multicore and many-core processors. Version 3.0 of the MPI standard adds a new interprocess shared memory extension (MPI SHM), which is now supported by many MPI distributions. The MPI SHM extension enables programmers to create regions of shared memory that are directly accessible by MPI processes within the same shared memory domain. In contrast with hybrid approaches, MPI SHM offers an incremental approach to managing memory resources within a node, where data structures can be individually moved into shared segments to reduce the memory footprint and improve the communication efficiency of MPI programs. Halo exchange is a prototypical neighborhood exchange communication pattern. In such patterns, the adjacency of communication partners often results in communication with processes in the same node, making them good candidates for acceleration through MPI SHM. By applying MPI SHM to this common communication pattern, direct data sharing can be used instead of communication, resulting in significant performance gains.
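A sketch of the MPI SHM idea: MPI_Comm_split_type builds a communicator of the processes that share a node, and MPI_Win_allocate_shared exposes a shared segment that neighbours can read with plain loads. Real halo-exchange code would typically use passive-target synchronization (MPI_Win_lock_all plus MPI_Win_sync); the fence below is used only to keep the sketch short.

/* MPI-3 shared memory (MPI SHM): direct load access to a neighbour's data. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* split COMM_WORLD into one communicator per shared-memory node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* each process contributes one double to a node-wide shared window */
    double *my_segment;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &my_segment, &win);
    *my_segment = (double)node_rank;

    MPI_Win_fence(0, win);               /* make all stores visible */

    /* read the left neighbour's segment directly through a pointer */
    if (node_rank > 0) {
        MPI_Aint seg_size;
        int disp_unit;
        double *left;
        MPI_Win_shared_query(win, node_rank - 1, &seg_size, &disp_unit, &left);
        printf("rank %d sees neighbour value %.1f\n", node_rank, *left);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}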


Open MPI includes an implementation of OpenSHMEM. OpenSHMEM is a PGAS (partitioned global address space) API for single-sided asynchronous scalable communications in HPC applications. An OpenSHMEM program is SPMD (single program, multiple data) in style. The SHMEM processes, called processing elements or PEs, all start at the same time and they all run the same program. Usually the PEs perform computation on their own subdomains of the larger problem and periodically communicate with other PEs to exchange information on which the next computation phase depends. OpenSHMEM is particularly advantageous for applications at extreme scales with many small put/get operations and/or irregular communication patterns across compute nodes, since it offloads communication operations to the hardware whenever possible. One-sided operations are non-blocking and asynchronous, allowing the program to continue its execution along with the data transfer.
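A minimal OpenSHMEM sketch in the same spirit: each PE puts its own ID into a symmetric variable on its right neighbour, with no receive call on the target side (the variable name is illustrative).

/* OpenSHMEM: one-sided put into a neighbour's symmetric variable. */
#include <shmem.h>
#include <stdio.h>

/* symmetric data object: exists at the same logical address on every PE */
static int from_left = -1;

int main(void)
{
    shmem_init();

    int me = shmem_my_pe();               /* this PE's number */
    int npes = shmem_n_pes();             /* total number of PEs */

    int right = (me + 1) % npes;
    shmem_int_p(&from_left, me, right);   /* one-sided put of a single int */

    shmem_barrier_all();                  /* wait until all puts are complete */
    printf("PE %d received %d from its left neighbour\n", me, from_left);

    shmem_finalize();
    return 0;
}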

IBM Spectrum® MPI is an implementation based on Open MPI, so its basic architecture and functionality are similar. It uses the same basic code structure as Open MPI and is made up of the OMPI, ORTE, and OPAL sections discussed above. IBM Spectrum MPI is a high-performance, production-quality MPI implementation that accelerates application performance in distributed computing environments. It provides a familiar, portable interface based on the open-source MPI, and it goes beyond Open MPI by adding unique features of its own, such as advanced CPU affinity features, dynamic selection of interface libraries, superior workload manager integration, and better performance. IBM Spectrum MPI supports a broad range of industry-standard platforms, interconnects, and operating systems, helping to ensure that parallel applications can run almost anywhere. IBM Spectrum MPI Version 10.2 delivers an improved, RDMA-capable Parallel Active Messaging Interface (PAMI) using Mellanox OFED on both POWER8® and POWER9™ systems in Little Endian mode. It also offers an improved collective MPI library that supports the seamless use of GPU memory buffers for the application developer. The library provides advanced logic to select the fastest of many implementations for each MPI collective operation, as shown below.


As high-performance computing (HPC) bends to the needs of "big data" applications, speed remains essential. But it's not only a question of how quickly one can compute problems, but how quickly one can program the complex applications that do so. High performance computing is no longer limited to those who own supercomputers. HPC's democratization has been driven particularly by cloud computing, which has given scientists access to supercomputing-like features at the cost of a few dollars per hour. Interest in HPC in the cloud has been growing over the past few years. The cloud offers applications a range of benefits, including elasticity, small startup and maintenance costs, and economies of scale. Yet, compared to traditional HPC systems such as supercomputers, some of the cloud's primary benefits for HPC arise from its virtualization flexibility. In contrast to supercomputers' strictly preserved system software, the cloud lets scientists build their own virtual machines and configure them to suit their needs and preferences. In general, the cloud is still considered an addition to traditional supercomputers, a bursting solution for cases in which internal resources are overused, especially for small-scale experiments, testing, and initial research. Clouds are convenient for embarrassingly parallel applications (those that do not communicate very much among partitions), which can scale even on the commodity interconnects common to contemporary clouds. This is the beauty of supercomputer engineering: demand drives innovation, and the exascale era is just the next milestone on the never-ending HPC journey.

Reference:
https://stackoverflow.com/questions/153616/mpi-or-sockets
https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.3/admin/smpi02_running_apps.html
https://hpc.llnl.gov/sites/default/files/MPI-SpectrumUserGuide.pdf 
https://computing.llnl.gov/tutorials/mpi/
http://www.cs.nuim.ie/~dkelly/CS402-06/Message%20Passing%20Interface.htm 
https://www.sciencedirect.com/topics/computer-science/message-passing-interface 
https://www.nextplatform.com/2020/02/13/going-beyond-exascale-computing/?_lrsc=5244db38-a9d0-4d40-9c04-2d8c2ecf4755 
https://dl.acm.org/doi/pdf/10.1145/2966884.2966909