High-Performance Computing (HPC) is omnipresent in today’s society. For example, every time you watch Netflix, the recommendation algorithm leverages remote HPC resources to offer you personalized suggestions. HPC refers to the ability to carry out large-scale computations to solve complex problems that either need to process a lot of data or require a lot of computing power. As a rule of thumb, any computing system that doesn’t fit on a desk can be described as an HPC system.
HPC systems are, in essence, networks of processors. The key principle of HPC lies in the possibility of running massively parallel code to benefit from a large acceleration in runtime; a typical large HPC system today offers on the order of 100,000 cores. Most HPC applications are complex tasks that require the processors to exchange intermediate results. HPC systems therefore need very fast memories and low-latency, high-bandwidth interconnects (>100 Gb/s) between the processors, as well as between the processors and their associated memories.
We can differentiate two types of HPC systems: homogeneous machines and hybrid ones. Homogeneous machines have only CPUs, while hybrids have both GPUs and CPUs; tasks mostly run on the GPUs while the CPUs oversee the computation.
Hybrid machines offer more computing power, since GPUs can handle millions of threads simultaneously, and they are also more energy efficient: GPUs have faster memories, require less data transfer (the most energy-intensive part of the machine), and can exchange data directly with other GPUs.
Several companies are already building technologies at this crossroads of HPC and quantum computing, for example:
- Pasqal’s neutral-atom quantum computer, highly scalable and energy efficient;
- LightOn’s Optical Processing Unit, a special-purpose light-based AI chip suited to tasks such as Natural Language Processing;
- ORCA Computing’s fiber-based photonic systems for simulation and fault-tolerant quantum computing;
- Quandela’s photonic qubit sources, which will fuel the next generation of photonic devices;
- QuBit Pharmaceuticals’ software suites, which leverage HPC and quantum computing resources to accelerate drug discovery;
- Multiverse Computing’s solutions, which use disruptive mathematics to solve finance’s most complex problems on a range of classical and quantum technologies.
The reign and modern challenges of the Message Passing Interface (MPI)
“All good, but why do you guys doing numerical linear algebra and parallel computing always use the Message Passing Interface to communicate between the processors?”
MPI was born about 25 years ago and has since been, undoubtedly, the “King” of HPC. What were the characteristics of MPI that made it the de-facto language of HPC?
MPI-3 added several interfaces that enable more powerful communication scheduling, for example nonblocking collective operations and neighborhood collective operations.
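To make this concrete, here is a minimal sketch (not from the original article) of an MPI-3 nonblocking collective: each rank starts an `MPI_Iallreduce`, overlaps it with independent local work, and only waits when the result is actually needed.

```c
/* Minimal sketch of an MPI-3 nonblocking collective (MPI_Iallreduce):
   start the reduction, overlap it with independent work, then wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0;
    MPI_Request req;

    /* Start the collective without blocking. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here, hiding latency ... */

    /* Block only when the reduced value is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0) printf("sum of ranks = %g\n", global);

    MPI_Finalize();
    return 0;
}
```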
Much of the big data community moved from single nodes to parallel and distributed computing to process larger amounts of data using relatively short-lived programs and scripts. Thus, programmer productivity played only a minor role in MPI/HPC, while it was one of the major requirements for big data analytics. While MPI codes are often orders of magnitude faster than many big data codes, they also take much longer to develop; in HPC, that is most often a good trade-off.
MPI I/O was introduced nearly two decades ago to improve the handling of large datasets in parallel settings. It is used successfully in many large applications and I/O libraries such as HDF5.
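To give a flavor of the interface, here is a hedged, minimal sketch of collective MPI I/O: every rank writes its own block of one shared file at a rank-dependent offset (the file name and block size are illustrative).

```c
/* Minimal sketch of parallel MPI I/O: each rank writes its own
   block of one shared file at a rank-dependent offset.
   The file name "out.dat" and the block size are illustrative. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;
    double buf[1024];
    for (int i = 0; i < N; i++) buf[i] = rank + 0.001 * i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: all ranks participate, each at its own offset. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```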
MPI predates the era when the use of accelerators became commonplace. However, when these accelerators are used in distributed-memory settings such as computer clusters, MPI is the common way to program them. The current model, often called MPI+X (e.g., MPI+CUDA), combines traditional MPI with accelerator programming models (e.g., CUDA, OpenACC, OpenMP) in a simple way. In this model, MPI communication is performed by the CPU. Yet, this can be inefficient and inconvenient, and recently we have proposed a programming model called distributed CUDA (dCUDA) to perform communication from within a CUDA compute kernel. This allows the powerful GPU warp scheduler to be used for communication latency hiding. In general, integrating accelerators and communication functions is an interesting research topic.
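The staging pattern of the baseline MPI+X model can be sketched as follows (a hedged, minimal example of MPI+CUDA, not the dCUDA model itself): the GPU computes, but the data travels through host memory so that the CPU can perform the MPI call. Kernel, buffer sizes, and build line are illustrative.

```c
/* Sketch of the MPI+X (here MPI+CUDA) pattern: the GPU computes,
   but communication is staged through the CPU via host memory.
   Build line is illustrative, e.g.: nvcc -ccbin mpicxx mpix.cu -o mpix */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(double *x, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;
    double *d_x, *h_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));
    h_x = (double *)malloc(n * sizeof(double));

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0);  /* compute on the GPU */
    cudaDeviceSynchronize();

    /* Stage the data through host memory so the CPU can communicate it. */
    cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_x, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    free(h_x);
    cudaFree(d_x);
    MPI_Finalize();
    return 0;
}
```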
Programming at the transport layer, where every exchange of data has to be implemented with lovingly hand-crafted sends and receives or gets and puts, is an incredibly awkward fit for numerical application developers, who want to think in terms of distributed arrays, data frames, trees, or hash tables.
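To see why this feels low-level, consider a hedged sketch of the classic hand-crafted pattern: a one-dimensional halo exchange in which every rank must manually pair up its sends and receives (array size, tags, and neighbor logic are illustrative).

```c
/* Sketch of a hand-crafted 1-D halo exchange: each rank swaps its
   boundary values with left and right neighbors via MPI_Sendrecv.
   The developer manages buffers, neighbors, and tags by hand. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local array of 10 cells with one ghost cell at each end. */
    double u[10 + 2];
    for (int i = 1; i <= 10; i++) u[i] = rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first real cell left, receive my right ghost cell. */
    MPI_Sendrecv(&u[1],  1, MPI_DOUBLE, left,  0,
                 &u[11], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my last real cell right, receive my left ghost cell. */
    MPI_Sendrecv(&u[10], 1, MPI_DOUBLE, right, 1,
                 &u[0],  1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```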
“Everyone uses MPI” has made it nearly impossible for even made-in-HPC-land tools like Chapel or UPC to make any headway, much less for quite different systems like Spark or Flink, meaning that HPC users are largely stuck with an API which was a big improvement over anything else available 25 years ago.
Chapel is a modern programming language designed for productive parallel computing at scale. Chapel's design and implementation have been undertaken with portability in mind, permitting Chapel to run on multicore desktops and laptops, commodity clusters, and the cloud, in addition to the high-end supercomputers for which it was originally created.