Thursday, February 23, 2023

MPI (Message Passing Interface) - behind the scenes

Parallel computing is accomplished by splitting large, complex tasks across multiple processors. To organize and orchestrate parallel processing, a program must decompose the problem at hand and allow the processors to communicate with each other when necessary while performing their work. This introduces new overhead: synchronization and the communication itself.

Parallel computers can be roughly classified as Distributed Memory (DM) or Shared Memory (SM). In Distributed Memory (DM), each processor has its own memory, and processors are connected through a network over which they exchange data; the network therefore limits DM performance and scalability. In Shared Memory (SM), each processor can access all of the memory, and work can be distributed over the processors in several ways: automatic distribution of loop iterations (autoparallelization), explicit distribution of work via compiler directives, or function calls to threading libraries. If the synchronization and communication overhead is not accounted for, it can create issues such as bottlenecks in the parallel computer design and load imbalances.

MPI is an API for message passing between entities with separate memory spaces - processes. The standard doesn't care where those processes run: it could be on networked computers (clusters), on a big shared-memory machine, or on any other architecture that provides the same semantics (e.g. IBM Blue Gene).
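To see why message passing is needed at all, here is a small sketch (not MPI itself) showing that two processes really do have separate memory: a variable changed in a child process is invisible to the parent unless a message is explicitly passed back. The names (`worker`, `counter`) are illustrative.

```python
from multiprocessing import Process, Queue

counter = 0  # lives in this process's address space

def worker(q):
    global counter
    counter += 1          # modifies the CHILD's copy only
    q.put(counter)        # explicit message passing is the only way back

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()
    print(counter)   # still 0: the child's increment never reached us
    print(q.get())   # 1: the value arrived via the message queue
```

This is exactly the situation MPI is designed for: no shared variables, only messages.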

OpenMPI is a widely used message passing interface (MPI) library for parallel computing. It provides an abstraction layer that allows application developers to write parallel code without worrying about the underlying hardware details. However, OpenMPI also provides support for hardware-specific optimizations, including those for network adapters. For example, it supports the use of high-speed interconnects such as InfiniBand and RoCE, and it can take advantage of hardware offload capabilities such as Remote Direct Memory Access (RDMA).

UCX (Unified Communication X) is another middleware library for communication in distributed systems. It is designed to be highly scalable and to support a wide range of hardware platforms, including network adapters. UCX provides a portable API that allows applications to take advantage of hardware-specific features of network adapters, such as RDMA and network offloading. UCX can also integrate with other system-level libraries such as OpenMPI and hwloc to optimize performance on specific hardware configurations.

Hwloc (Hardware Locality) is a library for topology discovery and affinity management in parallel computing. It provides a portable API for discovering the hierarchical structure of the underlying hardware, including network adapters, and it allows applications to optimize performance by binding threads and processes to specific hardware resources. Hwloc can be used in conjunction with OpenMPI and UCX to optimize communication and data movement on high-performance computing systems.

TCP/IP is a family of networking protocols. IP is the lower-level protocol that's responsible for getting packets of data from place to place across the Internet. TCP sits on top of IP and adds virtual circuit/connection semantics. With IP alone you can only send and receive independent packets of data that are not organized into a stream or connection. It's possible to use virtually any physical transport mechanism to move IP packets around. For local networks it's usually Ethernet, but you can use anything. There's even an RFC specifying a way to send IP packets by carrier pigeon.

Sockets is a semi-standard API for accessing the networking features of the operating system. Your program can call various functions named socket, bind, listen, connect, etc., to send/receive data, connect to other computers, and listen for connections from other computers. You can theoretically use any family of networking protocols through the sockets API--the protocol family is a parameter that you pass in--but these days you pretty much always specify TCP/IP. (The other option that's in common use is local Unix sockets.)
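As a concrete illustration of the sockets API described above, here is a minimal loopback client/server in Python exercising `socket`, `bind`, `listen`, `accept`, and `connect`. The echo-and-uppercase behavior is just an example, not anything MPI-specific.

```python
import socket
import threading

def server(ready, port_holder):
    # socket/bind/listen/accept: the passive (accepting) side
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # TCP over IPv4
    srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    data = conn.recv(1024)
    conn.sendall(data.upper())          # echo back, upper-cased
    conn.close()
    srv.close()

ready, port_holder = threading.Event(), []
t = threading.Thread(target=server, args=(ready, port_holder))
t.start()
ready.wait()

# connect/send/recv: the active (connecting) side
cli = socket.create_connection(("127.0.0.1", port_holder[0]))
cli.sendall(b"ping")
reply = cli.recv(1024)
cli.close()
t.join()
print(reply)  # b'PING'
```

Note how much bookkeeping even this trivial exchange takes - which is the point of the next paragraph: libraries like MPI hide all of it.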

When you want to write a parallel application, you should probably not be looking at TCP/IP or sockets, as those are much lower level than you want. You'll probably want to look at something like MPI or any of the PGAS languages like UPC, Co-array Fortran, Global Arrays, Chapel, etc. They're going to be far easier to use than essentially writing your own networking layer.

When you use one of these higher level libraries,  you get lots of nice abstractions like collective operations, remote memory access, and other features that make it easier to just write your parallel code instead of dealing with all of the OS stuff underneath. It also makes your code portable between different machines/architectures.

MPI is free to use any available communication path(s) for MPI messages in the new communicator; the socket is only used for the initial handshaking. 

A common problem is that of two processes each opening a connection to the other. The socket code assumes that sockets are bidirectional, so only one socket is needed per pair of connected processes, not one socket for each member of the pair. The states and state machine should therefore be refactored into a clear set of VC connection states and connection states.

There are three related objects used during a connection event: the connection itself (a structure specific to the communication method - sockets in the case of this note), the virtual connection (VC), and the process group to which the virtual connection belongs.

If a socket connection between two processes is established, there are always two sides: the connecting side and the accepting side. The connecting side sends an active message to the accepting side, which first accepts the connection. However, if both processes try to connect to each other (a head-to-head situation), each process has both a connecting and an accepting connection. In this situation, one of the connections is refused/discarded while the other is established. This is decided on the accepting side.

State machines for establishing a connection:

Connect side:

The connecting side tries to establish a connection by sending an active message to the remote side. If the connection is accepted, the pg_id is sent to the remote side. The process then waits until the connection is finally accepted or refused; for this decision, the remote side requires the pg_id. Based on the answer from the remote side (ack = yes or ack = no), the connection is established or closed.
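The connecting side's behavior can be sketched as a tiny state machine. The state and event names here are illustrative (not identifiers from MPICH or any real MPI implementation); they just mirror the steps described above.

```python
from enum import Enum, auto

class ConnState(Enum):
    CONNECTING = auto()   # active connect issued to the remote side
    WAIT_ACK   = auto()   # pg_id sent, waiting for the remote decision
    CONNECTED  = auto()
    CLOSED     = auto()

def connect_side_step(state, event):
    """One transition of the connecting side's state machine (illustrative)."""
    transitions = {
        (ConnState.CONNECTING, "tcp_accepted"): ConnState.WAIT_ACK,  # pg_id goes out here
        (ConnState.WAIT_ACK, "ack_yes"): ConnState.CONNECTED,        # remote accepted
        (ConnState.WAIT_ACK, "ack_no"): ConnState.CLOSED,            # remote refused
    }
    return transitions[(state, event)]
```

Walking the happy path: CONNECTING -> (tcp_accepted) -> WAIT_ACK -> (ack_yes) -> CONNECTED.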

Accept side:

The accept side receives a connection request on the listening socket. It first accepts the connection and allocates the required structures (conn, socket). Then the connection waits for the pg_id of the remote side in order to assign the socket connection to a VC. The decision whether a connection is accepted or refused is based on the following steps:

  1. The VC has no active connection (vc->conn == NULL): the new connection is accepted.
  2. The VC has an active connection:
     - If my_pg_id < remote_pg_id: accept, and discard the other connection.
     - If my_pg_id > remote_pg_id: refuse.
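The acceptance decision above can be condensed into a small function. The names and return values are illustrative, not taken from any actual MPI source.

```python
def accept_decision(vc_has_active_conn, my_pg_id, remote_pg_id):
    """Decide whether the accepting side keeps an incoming connection.

    Mirrors the steps above: no active connection -> accept; otherwise the
    pg_id comparison breaks the head-to-head tie deterministically, so both
    sides of a simultaneous connect agree on which socket survives.
    """
    if not vc_has_active_conn:          # vc->conn == NULL: no conflict
        return "accept"
    if my_pg_id < remote_pg_id:         # keep incoming, drop our own attempt
        return "accept_and_discard_own"
    return "refuse"                     # the other side will keep its socket
```

Because both processes evaluate the same comparison with the arguments swapped, exactly one of the two head-to-head connections survives.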

The answer is sent to the remote node.

Data Types Required by the MPI Standard:
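The standard defines predefined datatypes that mirror the host language's types. A partial mapping for C (the MPI standard defines many more, including Fortran and fixed-width variants) can be written as a simple table:

```python
# A partial mapping of predefined MPI datatypes to the C types they mirror.
MPI_TO_C = {
    "MPI_CHAR":        "char",
    "MPI_SHORT":       "signed short int",
    "MPI_INT":         "signed int",
    "MPI_LONG":        "signed long int",
    "MPI_UNSIGNED":    "unsigned int",
    "MPI_FLOAT":       "float",
    "MPI_DOUBLE":      "double",
    "MPI_LONG_DOUBLE": "long double",
    "MPI_BYTE":        "(untyped byte)",
    "MPI_PACKED":      "(packed data)",
}
```

Passing the datatype explicitly in every send/receive lets the implementation convert representations between heterogeneous machines.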


MPI point-to-point communication sends messages between two different MPI processes: one process performs a send operation while the other performs a matching receive.
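The send/matching-receive pairing can be simulated with two threads and a queue standing in for the transport; the tag-plus-payload tuple loosely mimics how MPI matches messages by tag. This is a sketch of the semantics, not real MPI code.

```python
import queue
import threading

to_rank1 = queue.Queue()   # stands in for the transport from rank 0 to rank 1

def rank0():
    to_rank1.put(("tag42", [1, 2, 3]))   # "send": payload plus a matching tag

results = []

def rank1():
    tag, payload = to_rank1.get()        # "receive": blocks until data arrives
    results.append((tag, payload))

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t1.start(); t0.start()                   # receiver may start first: it just waits
t0.join(); t1.join()
print(results)  # [('tag42', [1, 2, 3])]
```

As in real MPI, the order the two sides start in doesn't matter: a receive posted before the send simply blocks until the message arrives.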

MPI collectives: MPI provides a set of routines for communication patterns that involve all the processes of a certain communicator, so-called collectives. The two main advantages of using collectives are:

1) Less programming effort. 

2) Performance optimization, as the implementations are usually efficient, especially if optimized for specific architectures.

For collective communication, significant gains in performance can be achieved by implementing topology- and performance-aware collectives.

Three common blocking collectives are Barrier(), Bcast(), and Reduce(). Other frequently used collectives include:

  1. Allreduce(). Combination of reduction and a broadcast so that the output is available for all processes.
  2. Scatter(). Split a block of data available in a root process and send different fragments to each process.
  3. Gather(). Send data from different processes and aggregate it in a root process.
  4. Allgather(). Similar to Gather() but the output is aggregated in buffers of all the processes.
  5. Alltoall(). All processes scatter data to all processes.
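The semantics of a few of these collectives can be captured as plain functions over a list of per-rank buffers (a simulation of the data movement, not an implementation of any real algorithm such as recursive doubling):

```python
def scatter(root_data, nprocs):
    """Scatter(): split the root's buffer into equal fragments, one per rank."""
    chunk = len(root_data) // nprocs
    return [root_data[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

def gather(per_rank):
    """Gather(): concatenate each rank's contribution at the root."""
    return [x for part in per_rank for x in part]

def allreduce(per_rank_values, op=sum):
    """Allreduce(): reduce across ranks; the result lands on every rank."""
    result = op(per_rank_values)
    return [result] * len(per_rank_values)

# Semantics check with 4 simulated ranks:
print(scatter(list(range(8)), 4))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(allreduce([1, 2, 3, 4]))      # [10, 10, 10, 10]
```

Allgather() is then gather followed by replication to every rank, which is exactly why Allreduce() is often described as a Reduce() plus a Bcast().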

