Saturday, July 6, 2019

IBM's Summit & Sierra - The most powerful computers on the planet

Supercomputing is the Formula One of computing. It’s where companies test bleeding-edge technology at an unprecedented scale. Supercomputers are generally used for research purposes, including tasks such as the virtual testing of nuclear bombs, trying to understand how the universe was formed, forecasting climate change and aerodynamic modeling for aircraft. 

The U.S. now holds the top two spots in the world’s supercomputer rankings with a pair of IBM-built systems: Summit at Oak Ridge National Laboratory and Sierra at Lawrence Livermore National Laboratory, ranked #1 and #2 respectively. They are helping researchers model supernovas, pioneer new materials and explore cancer, genetics and the environment, using technologies available to all businesses.

Summit, an IBM-built supercomputer running at the Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL), is designed for artificial intelligence workloads in areas such as high-energy physics and materials discovery. The lab claims it can perform more than 3 billion-billion calculations per second in some cases. Summit captured the number one spot with a performance of 122.3 petaflops on High Performance Linpack (HPL), the benchmark used to rank the TOP500 list. It has 4,356 nodes, each equipped with two 22-core Power9 CPUs and six NVIDIA Tesla V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network.

Sierra is jointly operated by the DOE’s National Nuclear Security Administration and Lawrence Livermore National Laboratory. It entered the TOP500 list in the number three spot, delivering 71.6 petaflops on HPL, before moving up to second place. Built by IBM, Sierra’s architecture is quite similar to that of Summit, with each of its 4,320 nodes powered by two Power9 CPUs plus four NVIDIA Tesla V100 GPUs and using the same Mellanox EDR InfiniBand as the system interconnect.

Both machines are powered by a combination of IBM’s Power9 central processing units and Nvidia Corp.’s V100 graphics processing units. They’re enormous too, made up of numerous rows of refrigerator-sized computer cabinets. Summit boasts 2.4 million processor cores in total, while Sierra has 1.6 million.
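To put the node counts in perspective, here is a small back-of-the-envelope sketch that multiplies out the per-node figures quoted above. It only uses numbers already given in this post; the names `SYSTEMS` and `totals` are made up for illustration, and the "HPL teraflops per node" value is a simple average, not an official specification.

```python
# Figures quoted in the text above; nothing here is measured independently.
SYSTEMS = {
    "Summit": {"nodes": 4356, "cpus_per_node": 2, "gpus_per_node": 6, "hpl_petaflops": 122.3},
    "Sierra": {"nodes": 4320, "cpus_per_node": 2, "gpus_per_node": 4, "hpl_petaflops": 71.6},
}


def totals(spec):
    """Return (total CPUs, total GPUs, average HPL teraflops per node)."""
    cpus = spec["nodes"] * spec["cpus_per_node"]
    gpus = spec["nodes"] * spec["gpus_per_node"]
    # 1 petaflop = 1000 teraflops; divide evenly across the nodes.
    tflops_per_node = spec["hpl_petaflops"] * 1000 / spec["nodes"]
    return cpus, gpus, tflops_per_node


if __name__ == "__main__":
    for name, spec in SYSTEMS.items():
        cpus, gpus, per_node = totals(spec)
        print(f"{name}: {cpus} CPUs, {gpus} GPUs, {per_node:.1f} HPL TFLOPS per node")
```

By this arithmetic Summit has 26,136 GPUs against Sierra's 17,280, which goes a long way toward explaining the gap in their HPL results.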

NOTE: The next "world’s fastest" supercomputer will be built by AMD and Cray for the U.S. government. Frontier is expected to go online in 2021 with more than 1.5 exaflops of processing power.

General Purpose Computing on Graphics Processing Units [GPGPU]

GPGPU is the utilization of a GPU (graphics processing unit), which would typically only handle computer graphics, to assist in performing tasks that are traditionally handled solely by the CPU (central processing unit). GPGPU allows information to be transferred in both directions, from CPU to GPU and GPU to CPU. Such bidirectional processing can hugely improve efficiency in a wide variety of tasks related to images and video. If the application you use supports OpenCL or CUDA, you will normally see huge performance boosts when using hardware that supports the relevant GPGPU framework.

NVIDIA was an early and aggressive advocate of leveraging graphics processors for other massively parallel processing tasks (often referred to as general-purpose computing on graphics processing units, or GPGPU). Since then, GPGPU has been embraced in the HPC (high-performance computing) server space, and NVIDIA is the dominant supplier of GPUs for HPC. AMD is addressing this market with its ROCm (Radeon Open Compute Platform) initiative. AMD is behind, but it is trying to catch up.

How do OpenCL and CUDA fit into the equation? OpenCL is currently the leading open-standard GPGPU framework, while CUDA is the leading proprietary one. Nvidia cards support OpenCL as well as CUDA, although they are generally not quite as efficient as AMD GPUs when it comes to OpenCL computation.

CUDA and OpenCL offer two different interfaces for programming GPUs. OpenCL is an open standard that can be used to program CPUs, GPUs, and other devices from different vendors, while CUDA is specific to NVIDIA GPUs. Although OpenCL promises a portable language for GPU programming, capable of targeting very dissimilar parallel processing devices, its generality may entail a performance penalty. CUDA can be used in two different ways: (1) via the runtime API, which provides a C-like set of routines and extensions, and (2) via the driver API, which provides lower-level control over the hardware but requires more code and programming effort. Both OpenCL and CUDA call a piece of code that runs on the GPU a kernel. Unlike a CUDA kernel, an OpenCL kernel can be compiled at runtime, which adds to an OpenCL program's running time. On the other hand, this just-in-time compilation may allow the compiler to generate code that makes better use of the target GPU.
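To make the idea of a kernel a little more concrete, here is a minimal CPU-only sketch in Python. It is not CUDA or OpenCL code and touches no GPU; it only imitates the usual pattern of copying data to the device, running one small function per logical thread over a grid of blocks, and copying the result back. All names (`saxpy_kernel`, `launch`, `saxpy`) are invented for this illustration.

```python
def saxpy_kernel(global_id, a, x, y, out):
    """Body run once per logical thread: out[i] = a * x[i] + y[i]."""
    if global_id < len(x):  # guard: the grid may be larger than the data
        out[global_id] = a * x[global_id] + y[global_id]


def launch(kernel, num_blocks, threads_per_block, *args):
    """Emulate a 1-D launch by visiting every (block, thread) pair in turn."""
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            global_id = block * threads_per_block + thread
            kernel(global_id, *args)


def saxpy(a, x, y, threads_per_block=4):
    # "Host to device": separate copies stand in for buffers in GPU memory.
    dev_x, dev_y = list(x), list(y)
    dev_out = [0.0] * len(x)
    # Round up so every element is covered by some thread.
    num_blocks = (len(x) + threads_per_block - 1) // threads_per_block
    launch(saxpy_kernel, num_blocks, threads_per_block, a, dev_x, dev_y, dev_out)
    # "Device to host": copy the result back.
    return list(dev_out)


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
```

On a real GPU the two loops in `launch` disappear: the hardware runs the threads in parallel, which is where the speedup comes from. The bounds check inside the kernel is there because the number of launched threads is rounded up to a whole number of blocks.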

To compete with CUDA, AMD has shifted from OpenCL to its ROCm platform. AMD is also developing a thin "HIP" compatibility layer that compiles to either CUDA or ROCm. AMD's hipBLAS, hipSPARSE, and hipDNN all translate to the cu- or roc- equivalents, depending on the hardware target; hipBLAS, for example, links to either cuBLAS or rocBLAS. HIP tooling also converts existing CUDA code so that, once the translation succeeds, the same software can run on both NVIDIA and AMD hardware. On the hardware side, AMD's Radeon VII now looks competitive with, for example, Nvidia's 2080 Ti.

Radeon Open Compute Platform (ROCm) :
ROCm is a universal platform for GPU-accelerated computing. A modular design lets any hardware vendor build drivers that support the ROCm stack. ROCm also integrates multiple programming languages and makes it easy to add support for other languages. 

The Department of Energy announced that Frontier, its forthcoming supercomputer due in 2021, will have AMD Radeon Instinct GPUs. The $600 million award marks the first system announcement to come out of the second CORAL (Collaboration of Oak Ridge, Argonne and Livermore) procurement process (CORAL-2). Poised to deliver “greater than 1.5 exaflops of HPC and AI processing performance,” Frontier (ORNL-5) will be based on Cray’s new Shasta architecture and Slingshot interconnect and will feature future-generation AMD Epyc CPUs and Radeon Instinct GPUs. It seems there will soon be growing pressure for cross-platform (Nvidia/AMD) programming models in the HPC space.

On the software side, Cray will work with AMD to enhance the system's software tools for optimized GPU scaling, with extensions for the Radeon Open Compute Platform (ROCm). These enhancements will leverage low-level integration of AMD's ROCm RDMA technology with Cray Slingshot, so that the Slingshot NIC can read and write data directly in GPU memory for higher application performance.

Exploring AMD’s Ambitious ROCm Initiative :
AMD released the ROCm hardware-accelerated parallel computing environment and has since continued to refine its vision for an open-source, multi-platform, high-performance computing environment. Over the past two years, ROCm developers have contributed many new features and components to the ROCm open software platform. Now, the release of GPUs based on 7nm Vega technology adds another important ingredient to the mix, enabling a second generation of high-performance applications that benefit from ROCm’s acceleration features and “write it once” programming paradigm.

ROCm also provides tools for porting vendor-specific CUDA code into a vendor-neutral ROCm format, which makes the massive body of source code written for CUDA available to AMD hardware and other hardware environments.

Lower in the stack, ROCm provides the Heterogeneous Computing Platform, a Linux driver, and a runtime stack optimized for “HPC and ultra-scale class computing.” ROCm’s modular design means the programming stack is easily ported to other environments.

At the heart of the ROCm platform is the Heterogeneous Compute Compiler (HCC). The open source HCC is based on the LLVM compiler with the Clang C++ front end. HCC supports several versions of standard C++, including C++11, C++14, and some C++17 features. HCC also supports GPU-based acceleration and other parallel programming features, giving programmers access to the advanced capabilities of AMD GPUs in the same way that the proprietary NVCC CUDA compiler provides access to NVIDIA hardware.

    Important features include the following:
  • Multi-GPU coarse-grain shared virtual memory
  • Process concurrency and preemption
  • Large memory allocations
  • HSA signals and atomics
  • User-mode queues and DMA
  • Standardized loader and code-object format
  • Dynamic and offline-compilation support
  • Peer-to-peer multi-GPU operation with RDMA support
  • Profiler trace and event-collection API
  • Systems-management API and tools

    Solid Compilation Foundation and Language Support

  • LLVM compiler foundation
  • HCC C++ and HIP for application portability
  • GCN assembler and disassembler

How does HIP work?

The image below shows the flow: CUDA code is converted to HIP, and the HIP code is then compiled for NVIDIA GPUs with NVCC and for AMD GPUs with AMD's new C++ compiler, HCC.
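The conversion step is largely a matter of renaming CUDA runtime calls to their HIP counterparts. The toy Python sketch below illustrates the idea with a small, hand-picked subset of names; it is not AMD's actual conversion tool, which covers far more of the API and handles many cases a simple rename cannot. The function names `hipify` and `compiler_for` are hypothetical, chosen only for this example.

```python
import re

# A small, hand-picked subset of CUDA runtime names and their HIP counterparts.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Word boundaries keep us from rewriting part of a longer, unrelated identifier.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def hipify(source):
    """Rename known CUDA identifiers to HIP ones; unknown names are left alone."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


def compiler_for(vendor):
    """Pick the back-end compiler named in the text for each GPU vendor."""
    return {"nvidia": "nvcc", "amd": "hcc"}[vendor.lower()]


if __name__ == "__main__":
    cuda_src = "cudaMalloc(&d, n); cudaMemcpy(d, h, n, cudaMemcpyHostToDevice); cudaFree(d);"
    print(hipify(cuda_src))
    print(compiler_for("AMD"))
```

Because the HIP names mirror the CUDA ones so closely, the translated source still reads like the original, and the same file can then be handed to either compiler depending on the target hardware.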

AMD recently announced its next-gen Navi-based Radeon RX 5700 and 5700 XT graphics cards. If you’re an AMD fan hoping that this will be the moment in history when the company finally pulls ahead of Nvidia with a high-end video card, as it may be doing against Intel with desktop CPUs, this isn’t that moment. Despite the new Navi architecture, which AMD says offers 1.25x the performance per clock and 1.5x the performance per watt of the previous architecture, these cards aren’t even as high-end as AMD’s existing (and complicated) 13.8 TFLOP Radeon VII GPU. At up to 9.75 TFLOPs for the 5700 XT and 7.95 TFLOPs for the 5700, and with 8GB of GDDR6 memory instead of 16GB of HBM2, the 5700-series isn’t a world-beater.
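For readers wondering where TFLOP figures like these come from, peak single-precision throughput is usually estimated as 2 x shader count x clock speed, counting one fused multiply-add (two floating-point operations) per shader per clock. The sketch below applies that rule of thumb. Note that the stream-processor counts and peak clocks are not from this post; they are assumptions taken from the cards' published specifications, and the function name `peak_tflops` is made up for the example.

```python
def peak_tflops(stream_processors, peak_clock_mhz):
    """Peak FP32 estimate: one fused multiply-add (2 FLOPs) per shader per clock."""
    return 2 * stream_processors * peak_clock_mhz / 1_000_000


# ASSUMED specs (stream processors, peak clock in MHz); not stated in this post.
CARDS = {
    "Radeon RX 5700 XT": (2560, 1905),
    "Radeon RX 5700": (2304, 1725),
    "Radeon VII": (3840, 1800),
}

if __name__ == "__main__":
    for name, (shaders, clock_mhz) in CARDS.items():
        print(f"{name}: {peak_tflops(shaders, clock_mhz):.2f} TFLOPS")
```

These are theoretical peaks only; real workloads rarely sustain them, which is one reason benchmark results matter more than the headline number.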


Intel announced that its first "discrete" graphics chip (GPU) is coming in 2020. By "discrete GPU", the company means a graphics chip on its own, an entirely separate component that isn't integrated into a processor chip (CPU). Typically, Intel GPUs are integrated with its CPUs. The GPU Intel plans to release in 2020 will be designed for enterprise applications like machine learning, as well as consumer-level applications that benefit from the dedicated power of discrete GPUs.

We eagerly await doing the price/performance comparisons across these enterprise GPU compute engines.

Reference :