Sunday, September 14, 2014

IBM - Blue Gene Q Supercomputer

Blue Gene is an IBM project aimed at designing supercomputers that reach operating speeds in the petaFLOPS (PFLOPS) range with low power consumption. The project produced three generations of supercomputers: Blue Gene/L, Blue Gene/P, and Blue Gene/Q.

Blue Gene systems have often led the TOP500 and Green500 rankings of the most powerful and most power-efficient supercomputers, respectively. Blue Gene systems have also consistently scored top positions in the Graph500 list.

The third supercomputer design in the Blue Gene series, Blue Gene/Q, has a peak performance of 20 petaflops and reaches about 17 petaflops on the LINPACK benchmark. A United States supercomputer took the top spot on the TOP500 list in June 2012: Sequoia, the IBM Blue Gene/Q system installed at the Department of Energy's Lawrence Livermore National Laboratory, achieved 16.32 petaflop/s running the LINPACK benchmark on 1,572,864 cores. Sequoia was the first system to be built using more than one million cores.

Sequoia is primarily water cooled and consists of 96 racks; 98,304 compute nodes; 1.6 million cores; and 1.6 petabytes of memory. Relative to the peak speeds of these systems, Sequoia is roughly 90 times more power efficient than Purple and about eight times more power efficient than BG/L. Sequoia is dedicated to NNSA's Advanced Simulation and Computing (ASC) program.

Blue Gene/Q Hardware Overview:
The Blue Gene/Q compute chip has 18 cores. The 64-bit PowerPC A2 processor cores are 4-way simultaneously multithreaded and run at 1.6 GHz. Each core has a SIMD quad-vector double-precision floating-point unit (IBM QPX). Sixteen cores are used for computing, and a 17th handles operating system assist functions such as interrupts, asynchronous I/O, MPI pacing, and RAS. The 18th core is a redundant spare, included to increase manufacturing yield; the spared-out core is shut down in functional operation.

The processor cores are linked by a crossbar switch to a 32 MB eDRAM L2 cache, operating at half core speed. The L2 cache is multi-versioned, supporting transactional memory and speculative execution, and has hardware support for atomic operations. L2 cache misses are handled by two built-in DDR3 memory controllers running at 1.33 GHz. The chip also integrates logic for chip-to-chip communications in a 5D torus configuration, with 2 GB/s chip-to-chip links. The Blue Gene/Q chip is manufactured on IBM's copper SOI process at 45 nm. It delivers a peak performance of 204.8 GFLOPS at 1.6 GHz, drawing about 55 watts. The chip measures about 19×19 mm (359.5 mm²) and comprises 1.47 billion transistors. The chip is mounted on a compute card along with 16 GB of DDR3 DRAM (i.e., 1 GB for each user processor core). A Q32 compute drawer holds 32 compute cards, each water cooled.
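The 204.8 GFLOPS peak figure can be reproduced from the numbers above. The following is a minimal sketch, assuming each of the 16 user cores issues one 4-wide QPX fused multiply-add per cycle and that a fused multiply-add counts as two floating-point operations; the function and constant names are illustrative only.

```python
# Peak double-precision rate of one Blue Gene/Q compute chip.
USER_CORES = 16      # cores available to applications (from the text)
CLOCK_GHZ = 1.6      # core clock (from the text)
SIMD_WIDTH = 4       # QPX is a quad-vector (4-wide) double-precision unit
FLOPS_PER_FMA = 2    # assumption: a fused multiply-add counts as 2 flops


def chip_peak_gflops(cores=USER_CORES, clock_ghz=CLOCK_GHZ,
                     simd_width=SIMD_WIDTH, flops_per_fma=FLOPS_PER_FMA):
    """Return peak GFLOPS, assuming one SIMD FMA per core per cycle."""
    return cores * clock_ghz * simd_width * flops_per_fma


peak = chip_peak_gflops()
print(f"peak per chip: {peak:.1f} GFLOPS")
print(f"flops per core per cycle: {SIMD_WIDTH * FLOPS_PER_FMA}")
```

Under these assumptions each core performs 8 floating-point operations per cycle, which matches the quoted chip peak. The 17th and 18th cores are left out because they do not run application code.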

A "midplane" (crate) of 16 compute drawers has a total of 512 compute nodes, electrically interconnected in a 5D torus configuration (4x4x4x4x2). Beyond the midplane level, all connections are optical. Racks have two midplanes, thus 32 compute drawers, for a total of 1024 compute nodes, 16,384 user cores and 16 TB RAM. Separate I/O drawers, placed at the top of a rack or in a separate rack, are air cooled and contain 8 compute cards and 8 PCIe expansion slots for InfiniBand or 10 Gigabit Ethernet networking.
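The packaging numbers can be cross-checked with a few lines of arithmetic. The sketch below multiplies out the hierarchy (cards, drawers, midplanes, racks) and lists the one-hop neighbors of a node in a 4x4x4x4x2 torus with wraparound. The helper torus_neighbors is hypothetical and only illustrates the topology; it does not describe how the hardware or system software addresses nodes.

```python
from math import prod

TORUS_SHAPE = (4, 4, 4, 4, 2)   # midplane torus dimensions (from the text)
CARDS_PER_DRAWER = 32
DRAWERS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
USER_CORES_PER_NODE = 16
GB_PER_NODE = 16
SEQUOIA_RACKS = 96

nodes_per_midplane = CARDS_PER_DRAWER * DRAWERS_PER_MIDPLANE
nodes_per_rack = nodes_per_midplane * MIDPLANES_PER_RACK
sequoia_nodes = SEQUOIA_RACKS * nodes_per_rack


def torus_neighbors(coord, shape=TORUS_SHAPE):
    """Return the distinct nodes one hop away from coord, with wraparound."""
    neighbors = set()
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[dim] = (coord[dim] + step) % size
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))  # only matters for a dimension of size 1
    return sorted(neighbors)


print("nodes per midplane:", nodes_per_midplane, "torus size:", prod(TORUS_SHAPE))
print("nodes per rack:", nodes_per_rack)
print("user cores per rack:", nodes_per_rack * USER_CORES_PER_NODE)
print("TB per rack:", nodes_per_rack * GB_PER_NODE // 1024)
print("Sequoia nodes:", sequoia_nodes)
print("Sequoia user cores:", sequoia_nodes * USER_CORES_PER_NODE)
print("distinct neighbors of origin:", len(torus_neighbors((0, 0, 0, 0, 0))))
```

In the dimension of size 2, stepping forward and stepping backward reach the same node, so a node in a single midplane has 9 distinct neighbors rather than 10. The rack totals agree with the Sequoia figures quoted earlier: 96 racks give 98,304 nodes and 1,572,864 user cores.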

Blue Gene Q Architecture

Compiler Invocation on Blue Gene/Q
To run code on the compute nodes you must compile and link it using a "cross-compiler" on the front-end (login node). A cross-compiler produces an executable that will run on the compute nodes.

If you instead use a native compiler, the resulting executable will run on the front-end but not on the compute nodes. Also, the native compilers can compile and link only serial code, because there are no MPICH libraries on the front-end.

Currently you should run your applications on the compute nodes only. The native compilers can be used for compiling and linking the serial portion of your parallel application.

The locations and names of the "unwrapped" cross-compilers appear below. For each one there is a description of the corresponding MPICH wrapper cross-compiler, which is just a wrapper script that makes the cross-compiler a bit easier to invoke.

If there is a thread-safe version of a compiler, its invocation name has an _r suffix; for example, bgxlc_r is the thread-safe version of the unwrapped IBM C cross-compiler bgxlc.

Simple Example: Compile, Link, Run

Step 1: Compile and link the file hello.c, which contains a C language MPI program:
                 mpixlc_r -o helloc hello.c
Step 2: Submit the job interactively:
                 runjob --block BLOCKID --exe /home/spb/helloc -p 16 --np 2048 --env-all --cwd /home/spb > job.output 2>&1
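Before submitting, it can help to check that the block is large enough for the requested job. The sketch below assumes that -p gives the number of MPI ranks per node and --np the total number of ranks; the helper names are hypothetical and are not part of runjob.

```python
GB_PER_NODE = 16   # memory on each compute card (from the hardware overview)


def nodes_needed(total_ranks, ranks_per_node):
    """Return how many compute nodes a job with total_ranks ranks occupies."""
    if total_ranks < 1 or ranks_per_node < 1:
        raise ValueError("rank counts must be positive")
    return -(-total_ranks // ranks_per_node)   # ceiling division


def memory_per_rank_gb(ranks_per_node, gb_per_node=GB_PER_NODE):
    """Return the share of node memory available to each rank, in GB."""
    return gb_per_node / ranks_per_node


# Values taken from the runjob command above: -p 16 --np 2048
print("nodes needed:", nodes_needed(2048, 16))
print("GB per rank:", memory_per_rank_gb(16))
```

Under these assumptions the example job occupies 128 compute nodes, so the block named by BLOCKID must contain at least that many, and each rank has about 1 GB of memory.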

Record-breaking science applications have been run on the BG/Q, including the first to cross 10 petaflops of sustained performance. The cosmology simulation framework HACC achieved almost 14 petaflops with a 3.6 trillion particle benchmark run, while the Cardioid code, which models the electrophysiology of the human heart, achieved nearly 12 petaflops with a near real-time simulation, both on Sequoia.

Examples of applications running on Blue Gene

A typical supercomputer consumes large amounts of electrical power, almost all of which is dissipated as heat. Packing thousands of processors together inevitably produces a high heat density that needs to be dealt with. In the Blue Gene system, IBM deliberately used low-power processors to limit heat density and hot water cooling to achieve energy efficiency. The energy efficiency of computer systems is generally measured in terms of "FLOPS per Watt". In November 2010 IBM's Blue Gene/Q reached 1684 MFLOPS/Watt. In June 2011 the top 2 spots on the Green500 list were occupied by Blue Gene machines, with the leader achieving 2097 MFLOPS/W.
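The "FLOPS per Watt" metric is a simple ratio, sketched below using only figures quoted in this post. The function names are illustrative. Note that the chip-level number uses peak performance and chip power only, so it is not directly comparable to Green500 results, which are measured on whole systems running LINPACK.

```python
def mflops_per_watt(flops, watts):
    """Convert a FLOPS rate and a power draw into MFLOPS per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return flops / 1e6 / watts


def megawatts_for(target_flops, efficiency_mflops_per_watt):
    """Return the power in MW needed to sustain target_flops at a given efficiency."""
    watts = target_flops / (efficiency_mflops_per_watt * 1e6)
    return watts / 1e6


# Chip-level peak: 204.8 GFLOPS at about 55 W (from the hardware overview)
print(f"chip peak: {mflops_per_watt(204.8e9, 55):.0f} MFLOPS/W")

# Improvement between the two Green500 figures quoted above
print(f"Green500 gain: {2097 / 1684:.2f}x")

# Power needed for one sustained petaflop at 2097 MFLOPS/W
print(f"1 PFLOPS at 2097 MFLOPS/W: {megawatts_for(1e15, 2097):.2f} MW")
```

At 2097 MFLOPS/W, one sustained petaflop costs a little under half a megawatt, which gives a sense of why efficiency matters at this scale.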