Friday, June 10, 2022

Frontier supercomputer powered by AMD is the fastest and first exascale machine

Exascale computing is the next milestone in the development of supercomputers. Able to process information much faster than today’s most powerful supercomputers, exascale computers will give scientists a new tool for addressing some of the biggest challenges facing our world, from climate change to understanding cancer to designing new kinds of materials. 

One way scientists measure computer performance is in floating point operations per second (FLOPS). These involve simple arithmetic, such as addition and multiplication. Because performance figures in FLOPS carry so many zeros, researchers instead use prefixes like giga, tera, and exa, where "exa" means 18 zeros. That means an exascale computer can perform more than 1,000,000,000,000,000,000 FLOPS, or 1 exaFLOP. DOE is deploying the United States' first exascale computers: Frontier at ORNL, Aurora at Argonne National Laboratory, and El Capitan at Lawrence Livermore National Laboratory.
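As a quick sanity check of the prefixes, assuming standard SI powers of ten:

```python
# SI prefixes used for measuring FLOPS; "exa" denotes 10**18,
# i.e. a 1 followed by 18 zeros.
prefixes = {"giga": 10**9, "tera": 10**12, "peta": 10**15, "exa": 10**18}

one_exaflop = prefixes["exa"]
print(one_exaflop)                                # 1000000000000000000
print(one_exaflop == 1_000_000_000_000_000_000)   # True
print(one_exaflop // prefixes["peta"])            # 1000 petaFLOPS per exaFLOP
```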

Exascale supercomputers will allow scientists to create more realistic Earth system and climate models. They will help researchers understand the nanoscience behind new materials. Exascale computers will help us build future fusion power plants. They will power new studies of the universe, from particle physics to the formation of stars. And these computers will help ensure the safety and security of the United States by supporting tasks such as the maintenance of the US nuclear deterrent.

For decades, performance maximization has been the chief concern of both hardware architects and software developers. With the end of performance scaling through increasing CPU clock frequencies (the breakdown of Dennard scaling), industry transitioned from single-core to multi-core and many-core architectures. As a result, hardware acceleration and the use of co-processors alongside CPUs have become a popular way to gain a performance boost while keeping the power budget low. This includes new hardware customized for particular application domains, such as the Tensor Processing Unit (TPU), Vision Processing Unit (VPU), and Neural Processing Unit (NPU), as well as modifications to existing platforms such as Intel Xeon Phi co-processors, general-purpose GPUs, and Field-Programmable Gate Arrays (FPGAs). Such accelerators, together with the main processors and memory, constitute a heterogeneous system. This heterogeneity, however, poses unprecedented difficulties for the performance and energy optimization of modern heterogeneous HPC platforms.

The focus on maximizing HPC performance in FLOPS has led supercomputers to consume enormous amounts of energy, both for the electronics themselves and for cooling. As a consequence, current HPC systems already draw megawatts of power, and energy efficiency has become a design concern on par with performance in ICT. For example, Summit, one of the world's most powerful supercomputers, consumes around 13 megawatts of power, roughly equivalent to the power draw of over 10,000 households. Because of such high power consumption, future HPC systems are highly likely to be power constrained. The DOE, for example, aims to deploy exascale supercomputers capable of performing a million trillion (10¹⁸) floating-point operations per second within a power envelope of 20-30 megawatts: the initial target was a double-precision exaflop of compute capability for 20 megawatts of power, and a later target was 2 exaflops for 29 megawatts when running at full power. Taking these factors into consideration, HPE Cray designed the AMD-powered Frontier supercomputer for growing accelerated computing needs under tight power constraints.

The Frontier supercomputer, built at the Department of Energy's Oak Ridge National Laboratory in Tennessee, has become the world's first known supercomputer to demonstrate a performance of 1.1 exaFLOPS (1.1 quintillion floating point operations per second). Frontier's exascale performance is enabled by some of the world's most advanced technology from HPE and AMD.

Frontier, powered by AMD, is the first exascale machine, meaning it can process more than a quintillion calculations per second; its HPL score is 1.102 exaflop/s. Based on the latest HPE Cray EX235a architecture and equipped with AMD EPYC 64C 2GHz processors, the system has 8,730,112 total cores and a power efficiency rating of 52.23 gigaflops/watt. It relies on HPE's Slingshot interconnect for data transfer.

Exascale is the next level of computing performance. By solving calculations five times faster than today's top supercomputers, exceeding a quintillion (10¹⁸) calculations per second, exascale systems will enable scientists to develop new technologies for energy, medicine, and materials. The Oak Ridge Leadership Computing Facility is home to one of America's first exascale systems, Frontier, which will help guide researchers to new discoveries at exascale.

It's based on HPE Cray's new EX architecture and Slingshot interconnect with optimized 3rd Gen AMD EPYC™ CPUs for HPC and AI, and AMD Instinct™ MI250X accelerators. It delivers Linpack (double precision floating point, FP64) compute performance of 1.1 EFLOPS (exaFLOPS).

Source

The Frontier test and development system (TDS) secured first place in the Green500 list, delivering 62.68 gigaflops/watt power efficiency from a single cabinet of optimized 3rd Gen AMD EPYC processors and AMD Instinct MI250X accelerators. It could lead to breakthroughs in medicine, astronomy, and more.


The HPE/AMD system delivers 1.102 Linpack exaflops of computing power in a 21.1-megawatt power envelope, an efficiency of 52.23 gigaflops per watt. Frontier uses only about 29 megawatts at its very peak: during the benchmark run it sustained 1.1 exaflops, and its theoretical peak is 2 exaflops.
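The efficiency figure can be checked from the two reported numbers (a back-of-the-envelope sketch, not an official calculation):

```python
# Reported HPL result and power draw during the run.
hpl_flops = 1.102e18     # 1.102 exaflops sustained on Linpack
power_watts = 21.1e6     # 21.1 megawatts

# FLOPS per watt, expressed in gigaflops/watt.
gflops_per_watt = hpl_flops / power_watts / 1e9
print(f"{gflops_per_watt:.2f} gigaflops/watt")   # 52.23, matching the reported figure
```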

Source


Node diagram:

Source

Frontier is an HPE Cray EX system with 74 cabinets comprising 9,408 nodes. Each node has one CPU and four GPUs: the GPUs are AMD Instinct MI250Xs, and the CPU is an AMD EPYC. It is all wired together with the high-speed Cray interconnect, called Slingshot, and the system is water-cooled. There have recently been good efforts toward using computational fluid dynamics to model the water flow in the cooling system. These are heavily instrumented machines whose liquid cooling adjusts dynamically to the workload: sensors monitor temperatures down to the individual components on the individual node boards, so cooling levels can be adjusted up and down to keep the system at a safe temperature. The single-cabinet run was estimated to deliver over 60 gigaflops per watt.

Frontier sits in the datacenter that formerly housed the Titan supercomputer; that system was removed and the datacenter refurbished, since the new machine needed more power and more cooling. They brought in 40 megawatts of power to the datacenter and have 40 megawatts of cooling available, while Frontier really only uses about 29 megawatts of that at its very peak. This supercomputer is also a little quieter than Summit because it is 100 percent liquid-cooled, with no fans and no rear doors exchanging heat with the room; the remaining fan noise comes from the storage systems, which are also from HPE and are air-cooled.

At OLCF (the Oak Ridge Leadership Computing Facility), they have the Center for Accelerated Application Readiness, known as CAAR. It is the OLCF's vehicle for application readiness: the group supports eight apps for the OLCF and 12 apps for the Exascale Computing Project. Frontier is OLCF-5; the next system will be OLCF-6.

The result was confirmed in a benchmarking test called High-Performance Linpack (HPL). As impressive as that sounds, the ultimate limits of Frontier are even more staggering: the supercomputer is theoretically capable of a peak performance of 2 quintillion calculations per second. Among all these massively powerful supercomputers, only Frontier has achieved true exascale performance, at least where it counts, according to TOP500.

Some of the most exciting prospects are artificial intelligence workloads. Research teams plan to use the system to develop better treatments for different diseases and to improve the efficacy of treatments; these systems are capable of digesting incredible amounts of data, drawing inferences across thousands of laboratory or pathology reports that no human being ever could. ORNL still operates Summit, a previous TOP500 number-one system built by IBM and Nvidia, and it remains highly utilized. It will run for at least a year, overlapping with Frontier, to make sure Frontier is up and stable and to give people time to transition their data and applications over to the new system.

With the Linpack exaflops milestone achieved by the Frontier supercomputer at Oak Ridge National Laboratory, the United States is turning its attention to the next crop of exascale machines, some 5-10x more performant than Frontier. At least one such system is being planned for the 2025-2030 timeframe, and the DOE is soliciting input from the vendor community to inform the design and procurement process. The goal is systems that can solve scientific problems 5 to 10 times faster than the current state-of-the-art systems, or solve more complex problems, such as those with more physics or requirements for higher fidelity.

These future systems will include associated networks and data hierarchies. A capable software stack will meet the requirements of a broad spectrum of applications and workloads, including large-scale computational science campaigns in modeling and simulation, machine intelligence, and integrated data analysis. The DOE expects these systems to operate within a power envelope of 20-60 MW, and they must be sufficiently resilient to hardware and software failures to minimize requirements for user intervention. This could include the successor to Frontier (aka OLCF-6), the successor to Aurora (aka ALCF-5), the successor to Crossroads (aka ATS-5), the successor to El Capitan (aka ATS-6), as well as a future NERSC system (possibly NERSC-11). Note that of the "predecessor systems," only Frontier has been installed so far.

A key thrust of the DOE supercomputing strategy is the creation of an Advanced Computing Ecosystem (ACE) that enables "integration with other DOE facilities, including light source, data, materials science, and advanced manufacturing." The next generation of supercomputers will need to be capable of integration into an ACE environment that supports automated workflows, combining one or more of these facilities to reduce the time from experiment and observation to scientific insight.
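A rough arithmetic sketch (my own derivation from the numbers above, not from the DOE solicitation) of what power efficiency a 5-10x-Frontier system would need inside a 20-60 MW envelope:

```python
frontier_hpl = 1.102e18   # Frontier's Linpack result, in FLOPS

# Hypothetical next-generation systems at 5x and 10x Frontier,
# at the two ends of the stated 20-60 MW power envelope.
for speedup in (5, 10):
    flops = speedup * frontier_hpl
    for megawatts in (20, 60):
        gfw = flops / (megawatts * 1e6) / 1e9
        print(f"{speedup}x Frontier in {megawatts} MW needs ~{gfw:.0f} gigaflops/watt")
```

For comparison, Frontier itself runs at about 52 gigaflops/watt, so even the easiest case (5x in 60 MW, roughly 92 gigaflops/watt) implies a significant efficiency improvement.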

The original CORAL contract called for three pre-exascale systems (~100-200 petaflops each) with at least two different architectures to manage risk. Only two systems, Summit at Oak Ridge and Sierra at Livermore, were completed in the intended timeframe, using nearly the same heterogeneous IBM-Nvidia architecture. CORAL-2 took a similar tack, calling for two or three exascale-class systems with at least two distinct architectures. The program is procuring two systems, Frontier and El Capitan, both based on a similar heterogeneous HPE AMD+AMD architecture. The redefined Aurora, which is based on the heterogeneous HPE Intel+Intel architecture, becomes the "architecturally diverse" third system (although it technically still belongs to the first CORAL contract).

-----

Reference:

https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf

https://www.researchgate.net/figure/Power-consumption-by-top-10-supercomputers-over-time-based-on-the-results-published_fig2_350176475

https://www.youtube.com/watch?v=HvJGsF4t2Tc

Monday, February 14, 2022

Open MPI with hierarchical collectives (HCOLL) Algorithms

MPI, an acronym for Message Passing Interface, is a library specification for parallel computing that allows information to be communicated between processes running across nodes and clusters. Today, MPI is the most common communication protocol used in high performance computing (HPC).

The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/
source

Open MPI is developed in a true open source fashion by a consortium of research, academic, and industry partners. The latest Open MPI release series at the time of writing is version 4.1.

Download Open MPI from https://www.open-mpi.org/software/ompi/v4.1/

Example https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz 

source
source

NOTE: NVIDIA Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communications libraries. HPC-X uses the 'hcoll' library for collective communication; 'hcoll' is enabled by default in HPC-X on Azure HPC VMs and can be controlled at runtime using the parameter [-mca coll_hcoll_enable 1].

How to install UCX:

Unified Communication X (UCX) is a framework of communication APIs for HPC. It is optimized for MPI communication over InfiniBand and works with many MPI implementations such as OpenMPI and MPICH.

  • wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
  • tar -xvf ucx-1.4.0.tar.gz
  • cd ucx-1.4.0
  • ./configure --prefix=<ucx-install-path> 
  • make -j 8 && make install

Optimizing MPI collectives and hierarchical communication algorithms (HCOLL):

MPI collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used across scientific parallel applications and have a significant impact on overall application performance. Refer to the configuration parameters below to optimize collective communication performance using HPC-X and the HCOLL collective communication library.

As an example, if you suspect your tightly coupled MPI application is doing an excessive amount of collective communication, you can try enabling hierarchical collectives (HCOLL). To enable those features, use the following parameters.


-mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=<MLX device>:<Port>
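For intuition about why collective algorithms matter, here is a toy, single-process simulation of a recursive-doubling allreduce, one classic logarithmic algorithm. This is an illustrative sketch, not HCOLL's actual implementation; the function name and structure are invented for the example.

```python
def allreduce_recursive_doubling(values, op=lambda a, b: a + b):
    """Simulate a recursive-doubling allreduce over p = len(values) ranks.

    In each of the log2(p) rounds, rank r exchanges its partial result
    with partner r XOR 2**k and combines the two. After the last round,
    every rank holds the full reduction. This sketch requires p to be a
    power of two.
    """
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "power-of-two rank count only"
    bufs = list(values)
    k = 1
    while k < p:
        # All exchanges in a round happen "simultaneously".
        bufs = [op(bufs[r], bufs[r ^ k]) for r in range(p)]
        k *= 2
    return bufs

# Four simulated ranks contributing 1, 2, 3, 4: every rank ends with 10.
print(allreduce_recursive_doubling([1, 2, 3, 4]))   # [10, 10, 10, 10]
```

The point of such algorithms is that each rank sends O(log p) messages instead of the O(p) a naive gather-and-broadcast would need; HCOLL additionally exploits the node and socket hierarchy and InfiniBand-specific capabilities.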

HCOLL:

Scalable infrastructure: Designed and implemented with current and emerging “extreme-scale” systems in mind

  • Scalable communicator creation, memory consumption, runtime interface
  • Asynchronous execution
  • Blocking and non-blocking collective routines
  • Easily integrated into other packages
  • Successfully integrated into OMPI – “hcoll” component in “coll” framework
  • Successfully integrated in Mellanox OSHMEM
  • Experimental integration in MPICH
  • Host level hierarchy awareness
  • Socket groups, UMA groups
  • Exposes Mellanox and InfiniBand specific capabilities
source

How to build Open MPI with HCOLL

Install UCX as described above, then configure and build Open MPI with HCOLL support as shown below.

Steps:

  1. ./configure --with-lsf=/LSF_HOME/10.1/ --with-lsf-libdir=/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/ --disable-man-pages --enable-mca-no-build=btl-uct --enable-mpi1-compatibility  --prefix $MY_HOME/openmpi-4.1.1/install --with-ucx=/ucx-install_dir CPPFLAGS=-I/ompi/opal/mca/hwloc/hwloc201/hwloc/include --cache-file=/dev/null --srcdir=. --disable-option-checking
  2. make 
  3. make install

---------------------------Set Test Environment------------------------------------------------

  1.  export PATH=$MY_HOME/openmpi-4.1.1/install/bin:$PATH
  2.  export LD_LIBRARY_PATH=$MY_HOME/openmpi-4.1.1/install/lib:/opt/mellanox/hcoll/lib:/opt/mellanox/sharp/lib:$LD_LIBRARY_PATH
  3.  export OPAL_PREFIX=$MY_HOME/openmpi-4.1.1/install
NOTE: It may be necessary to explicitly pass LD_LIBRARY_PATH  as mentioned in (3)

--------------  How to run mpi testcase without HCOLL--------------------------------------

1) Use these --mca options to disable HCOLL

--mca coll_hcoll_enable 0 

--mca coll_hcoll_priority 0 

2) Add --mca coll_base_verbose 10  to get more details 

3) Add -x LD_LIBRARY_PATH so the ranks pick up the proper library path, as shown below


-----------------------------Execute Testcase ----------------------------------

Testcase source:  https://github.com/jeffhammond/BigMPI/tree/master/test

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allreduce_uniform_count

--------------------------------------------------------------------------

INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  3: PASSED
Rank  0: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED
---------------------
Results from MPI_Iallreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Iallreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
[smpici@host01 BigCount]$
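The counts and sizes printed above follow directly from INT_MAX and the datatype sizes; a quick check of the arithmetic:

```python
INT_MAX = 2**31 - 1                       # 2147483647, the test's base count

# MPI_Allreduce(int): INT_MAX elements x 4 bytes each = 8.0 GB
assert INT_MAX * 4 == 8589934588
print(round(INT_MAX * 4 / 2**30, 1))      # 8.0

# "Adjust count to fit in memory": halve the count for 16-byte double _Complex
half_count = INT_MAX // 2                 # 1073741823
assert half_count * 16 == 17179869168     # the 16.0 GB figure
# Per-rank allreduce payload: 16 bytes x count x 2 peers = 32.0 GB
assert 16 * half_count * 2 == 34359738336
```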

=====================Example of a data integrity (DI) issue====

There are end-to-end data integrity checks to detect data corruption. Any DI issue observed is critical (a high-priority, high-severity defect).

Let's look at an example of a DI issue with HCOLL enabled.

$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 test_allgatherv_uniform_count 


Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  2: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  0: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  3: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)
Rank  1: ERROR: DI in      805306368 of     2147483644 slots (  37.5 % wrong)


---------------Let's run the same testcase without HCOLL-------------------------------------------


$MY_HOME/openmpi-4.1.1/install/bin/mpirun --np 4 --npernode 1 --host host01,host02,host03,host04 -x LD_LIBRARY_PATH -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x HCOLL_RCACHE=^ucs  --mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 test_allgatherv_uniform_count   

Results from MPI_Allgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  0: PASSED
Rank  2: PASSED
Rank  3: PASSED
Rank  1: PASSED

Results from MPI_Iallgatherv(double _Complex x 2147483644 = 34359738304 or 32.0 GB): Mode: PACKED MPI_IN_PLACE
Rank  3: PASSED
Rank  2: PASSED
Rank  0: PASSED
Rank  1: PASSED

=========================

How to enable and disable HCOLL to isolate a Mellanox defect, using --mca coll_base_verbose 10 to get more debug information

CASE 1: Enable HCOLL  and run bigcount with allreduce

[smpici@myhostn01 collective-big-count]$  /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --timeout 500 --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 1 --mca coll_hcoll_priority 98 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

--------------------------------------------------------------------------

Total Memory Avail.   :  567 GB
Percent memory to use :    6 %
Tolerate diff.        :   10 GB
Max memory to use     :   34 GB
----------------------:-----------------------------------------
INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  3: PASSED
Rank  0: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 500 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
[smpici@myhostn01 collective-big-count]$

CASE 2: Disable HCOLL  and run bigcount with allreduce

[user1@myhostn01 collective-big-count]$ /nfs_smpi_ci/sachin/ompi_4-0.x/openmpi-4.0.7/install/bin/mpirun --np 4 --npernode 1 -host myhost01:1,myhost02:1,myhost03:1,myhost04:1 --mca coll_base_verbose 10 -mca coll_hcoll_np 0 -mca coll_hcoll_enable 0 --mca coll_hcoll_priority 0 -x BIGCOUNT_MEMORY_PERCENT=6 -x BIGCOUNT_MEMORY_DIFF=10 -x BIGCOUNT_ENABLE_NONBLOCKING="0" -x HCOLL_RCACHE=^ucs /nfs_smpi_ci/sachin/bigmpi_ompi/ibm-tests/tests/big-mpi/BigCountUpstream/ompi-tests-public/collective-big-count/test_allreduce_uniform_count

[myhostn01:1984671] coll:base:comm_select: component disqualified: self (priority -1 < 0)
[myhostn01:1984671] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[myhostn01:1984671] coll:base:comm_select: component not available: sm
[myhostn01:1984671] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component not available: sync
[myhostn01:1984671] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[myhostn01:1984671] coll:base:comm_select: component available: tuned, priority: 30
[myhostn01:1984671] coll:base:comm_select: component not available: hcoll

----------------------:-----------------------------------------
Total Memory Avail.   :  567 GB
Percent memory to use :    6 %
Tolerate diff.        :   10 GB
Max memory to use     :   34 GB
----------------------:-----------------------------------------
INT_MAX               :           2147483647
UINT_MAX              :           4294967295
SIZE_MAX              : 18446744073709551615
----------------------:-----------------------------------------
                      : Count x Datatype size      = Total Bytes
TEST_UNIFORM_COUNT    :           2147483647
V_SIZE_DOUBLE_COMPLEX :           2147483647 x  16 =    32.0 GB
V_SIZE_DOUBLE         :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT_COMPLEX  :           2147483647 x   8 =    16.0 GB
V_SIZE_FLOAT          :           2147483647 x   4 =     8.0 GB
V_SIZE_INT            :           2147483647 x   4 =     8.0 GB
----------------------:-----------------------------------------
---------------------
Results from MPI_Allreduce(int x 2147483647 = 8589934588 or 8.0 GB):
Rank  2: PASSED
Rank  0: PASSED
Rank  3: PASSED
Rank  1: PASSED
--------------------- Adjust count to fit in memory: 2147483647 x  50.0% = 1073741823
Root  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Peer  : payload    34359738336  32.0 GB =  16 dt x 1073741823 count x   2 peers x   1.0 inflation
Total : payload    34359738336  32.0 GB =  32.0 GB root +  32.0 GB x   0 local peers
---------------------
Results from MPI_Allreduce(double _Complex x 1073741823 = 17179869168 or 16.0 GB):
Rank  0: PASSED
Rank  3: PASSED
Rank  2: PASSED
Rank  1: PASSED
[user1@myhostn01 collective-big-count]$

=================================================================

This post briefly covered features for optimal collective communication performance and highlighted general recommendations. Real application performance depends on your application characteristics, runtime configuration, transport protocols, processes-per-node (ppn) configuration, and so on.


Reference:
http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/bureddy-mug-18.pdf
https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/setup-mpi



Friday, January 28, 2022

HPC Clusters in a Multi-Cloud Environment

High performance computing (HPC) is the ability to process data and perform complex calculations at high speeds. One of the best-known types of HPC solutions is the supercomputer, which contains thousands of compute nodes that work together to complete one or more tasks; this is called parallel processing. HPC solutions have three main components: compute, network, and storage. To build a high performance computing architecture, compute servers are networked together into a cluster. Software programs and algorithms run simultaneously on the servers in the cluster, and the cluster is networked to the data storage to capture the output. Together, these components operate seamlessly to complete a diverse set of tasks.

To operate at maximum performance, each component must keep pace with the others. For example, the storage component must be able to feed and ingest data to and from the compute servers as quickly as it is processed. Likewise, the networking components must be able to support the high-speed transportation of data between compute servers and the data storage. If one component cannot keep up with the rest, the performance of the entire HPC infrastructure suffers.

Containers give HPC the portability that hybrid cloud demands. Containers are ready-to-execute packages of software. Container technology provides hardware abstraction, wherein the container is not tightly coupled with the server. Abstraction between the hardware and software stacks provides ease of access, ease of use, and the agility that bare metal environments lack.

Source
Source

Software containers and Kubernetes are important tools for building, deploying, running, and managing modern enterprise applications at scale, delivering enterprise software faster and more reliably to the end user while using resources more efficiently and reducing costs. High performance computing (HPC) has recently been moving closer to the enterprise and can therefore benefit from an HPC container and Kubernetes ecosystem, with new requirements to quickly allocate and deallocate computational resources to HPC workloads so that compute capacity no longer needs to be planned in advance. The HPC community is picking up the concept and applying it to batch jobs and interactive applications.

In a multi-cloud environment, an enterprise utilizes multiple public cloud services, most often from different cloud providers. For example, an organization might host its web front-end application on AWS and host its Exchange servers on Microsoft Azure. Since all cloud providers are not created equal, organizations adopt a multi-cloud strategy to deliver best-of-breed IT services, to prevent lock-in to a single cloud provider, or to take advantage of cloud arbitrage by choosing providers for specific services based on which offers the lowest price at the time. Although it is similar to a hybrid cloud, multi-cloud specifically indicates more than one public cloud provider service and need not include a private cloud component at all. Enterprises adopt a multi-cloud strategy so as not to keep all their eggs in a single basket, to meet geographic or regulatory governance demands, for business continuity, or to take advantage of features specific to a particular provider.

source
source

Multi-cloud is the use of multiple cloud computing and storage services in a single network architecture. This refers to the distribution of cloud assets, software, applications, and more across several cloud environments. With a typical multi-cloud architecture utilizing two or more public clouds as well as private clouds, a multi-cloud environment aims to eliminate the reliance on any single cloud provider or instance.

Multi-cloud is the use of two or more cloud computing services from any number of different cloud vendors. A multi-cloud environment could be all-private, all-public or a combination of both. Companies use multi-cloud environments to distribute computing resources and minimize the risk of downtime and data loss. They can also increase the computing power and storage available to a business. Innovations in the cloud in recent years have resulted in a move from single-user private clouds to multi-tenant public clouds and hybrid clouds — a heterogeneous environment that leverages different infrastructure environments like the private and public cloud.

A multi-cloud platform combines the best services that each platform offers, allowing companies to customize an infrastructure specific to their business goals. A multi-cloud architecture also lowers risk: if one web service host fails, a business can continue to operate on the other platforms in the environment, rather than having stored all its data in one place. Examples of public cloud providers include AWS, Microsoft Azure, and Google Cloud.

Hybrid cloud

A hybrid cloud architecture is a mix of on-premises, private, and public cloud services with orchestration between the cloud platforms. Hybrid cloud management involves distinct environments that are managed as one. Hybrid cloud architecture allows an enterprise to move data and applications between private and public environments based on business and compliance requirements. For example, customer data can live in a private environment, while heavy processing is sent to the public cloud without the customer data ever leaving the private environment. Hybrid cloud computing allows instant transfer of information between environments, allowing enterprises to experience the benefits of both.


Hybrid cloud architecture works well for the following industries:

• Finance: Financial firms are able to significantly reduce their space requirements in a hybrid cloud architecture when trade orders are placed on a private cloud and trade analytics live on a public cloud.

• Healthcare: When hospitals send patient data to insurance providers, hybrid cloud computing ensures HIPAA compliance.

• Legal: Hybrid cloud security allows encrypted data to live off-site in a public cloud while connected to a law firm’s private cloud. This protects original documents from the threat of theft or loss in a natural disaster.

• Retail: Hybrid cloud computing helps companies process resource-intensive sales data and analytics.

The hybrid cloud strategy can be applied to move workloads dynamically to the most appropriate IT environment based on cost, performance, and security: utilize on-premises resources for existing workloads and use public or hosted clouds for new workloads, or run internal business systems and data on premises while customer-facing systems run on infrastructure as a service (IaaS), public, or hosted clouds.

Reference:

https://www.hpcwire.com/2019/09/19/kubernetes-containers-and-hpc
https://www.hpcwire.com/2020/03/19/kubernetes-and-hpc-applications-in-hybrid-cloud-environments-part-ii
https://www.hpcwire.com/2021/09/02/kubernetes-based-hpc-clusters-in-azure-and-google-cloud-multi-cloud-environment
https://www-stage.avinetworks.com/
https://www.vmware.com/topics/glossary/content/hybrid-cloud-vs-multi-cloud

Sunday, August 22, 2021

Spectrum LSF 10.1 Installation and Applying Patch | FP | interim FIX on Linux Platform

IBM Spectrum LSF (LSF, originally Platform Load Sharing Facility) is a workload management platform and job scheduler for distributed high-performance computing (HPC) from IBM. Platform Computing was acquired by IBM in January 2012, and the product is now called IBM® Spectrum LSF.

IBM® Spectrum LSF is a complete workload management solution for demanding HPC environments that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies.

LSF cluster

  • A cluster is a group of computers (hosts) running LSF that work together as a single unit, combining computing power, workload, and resources. A cluster provides a single-system image for a network of computing resources. Hosts can be grouped into a cluster in a number of ways: 1) all the hosts in a single administrative group, or 2) all the hosts on a subnetwork.
  • A job is a unit of work running in the LSF system: a command or set of commands submitted to LSF for execution. LSF schedules, controls, and tracks the job according to configured policies.
  • A queue is a cluster-wide container for jobs. All jobs wait in queues until they are scheduled and dispatched to hosts.
  • Resources are the objects in your cluster that are available to run work.

Spectrum LSF 10.1 base installation and applying FP/PTF/FIX

Plan your installation and install a new production IBM Spectrum LSF cluster on UNIX or Linux hosts. The following diagram illustrates an example directory structure after the LSF installation is complete.

[Diagram: example directory structure after LSF installation]

Plan your installation to determine the required parameters for the install.config file.

a )  lsf10.1_lsfinstall.tar.Z

The standard installer package. Use this package in a heterogeneous cluster with a mix of systems other than x86-64. Requires approximately 1 GB free space.

b)  lsf10.1_lsfinstall_linux_x86_64.tar.Z 

      lsf10.1_lsfinstall_linux_ppc64le.tar.Z

Use the corresponding smaller installer package in a homogeneous x86-64 or ppc64le cluster.

------------------------

Get the LSF distribution packages for all host types you need and put them in the same directory as the extracted LSF installer script, then copy the packages to the LSF_TARDIR path specified in step 3.

For example:

For a Linux 2.6 kernel with glibc 2.3 on x86_64, the distribution package is lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z.

For a Linux 3.10 kernel with glibc 2.17 on ppc64le, the distribution package is lsf10.1_lnx310-lib217-ppc64le.tar.Z.
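The package names encode the kernel and glibc levels (for example, lnx310-lib217 corresponds to kernel 3.10 and glibc 2.17). A quick way to check what a host is actually running before picking a package, using standard tools:

```shell
uname -m                   # architecture, e.g. x86_64 or ppc64le
uname -r                   # kernel version, e.g. 3.10.0-1160.el7.x86_64
ldd --version | head -n 1  # glibc version, e.g. "ldd (GNU libc) 2.17"
```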

------------------------

LSF uses entitlement files to determine which feature set is enabled or disabled based on the edition of the product. Copy the entitlement configuration file to the LSF_ENTITLEMENT_FILE path specified in step 3.

The following LSF entitlement configuration files are available for each edition:

LSF Standard Edition  ===>  lsf_std_entitlement.dat

LSF Express Edition   ===>  lsf_exp_entitlement.dat

LSF Advanced Edition  ===>  lsf_adv_entitlement.dat

-------------------------

Step 1 : Get the LSF installer script package that you selected and extract it.

# zcat lsf10.1_lsfinstall_linux_x86_64.tar.Z | tar xvf -

Step 2 : Go to the extracted directory:

 cd lsf10.1_lsfinstall

Step 3 : Configure install.config as per the plan

 cat install.config
  LSF_TOP="/nfs_shared_dir/LSF_HOME"
  LSF_ADMINS="lsfadmin"
  LSF_CLUSTER_NAME="x86-64_cluster2"
  LSF_MASTER_LIST="myhost1"
  LSF_TARDIR="/nfs_shared_dir/conf_lsf/lsf_distrib/"
  LSF_ENTITLEMENT_FILE="/nfs_shared_dir/conf_lsf/lsf_std_entitlement.dat"
  LSF_ADD_SERVERS="myhost1 myhost2 myhost3 myhost4 myhost5 myhost6 myhost7 myhost8"

  ENABLE_DYNAMIC_HOSTS="Y"

Step 4:  Start LSF 10.1 base installation 

          ./lsfinstall -f install.config

Logging installation sequence in /root/LSF_new/lsf10.1_lsfinstall/Install.log
International Program License Agreement
Part 1 - General Terms
BY DOWNLOADING, INSTALLING, COPYING, ACCESSING, CLICKING
 "ACCEPT" BUTTON, OR OTHERWISE USING THE PROGRAM,
LICENSEE AGREES TO THE TERMS OF THIS AGREEMENT. IF YOU ARE
ACCEPTING THESE TERMS ON BEHALF OF LICENSEE, YOU REPRESENT
AND WARRANT THAT YOU HAVE FULL AUTHORITY TO BIND LICENSEE
TO THESE TERMS. IF YOU DO NOT AGREE TO THESE TERMS
* DO NOT DOWNLOAD, INSTALL, COPY, ACCESS, CLICK ON AN
"ACCEPT" BUTTON, OR USE THE PROGRAM; AND
* PROMPTLY RETURN THE UNUSED MEDIA, DOCUMENTATION, AND
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
Checking the LSF TOP directory /nfs_shared_dir/LSF_HOME ...
... Done checking the LSF TOP directory /nfs_shared_dir/LSF_HOME ...
You are installing IBM Spectrum LSF - 10.1 Standard Edition
Searching LSF 10.1 distribution tar files in /nfs_shared_dir/conf_lsf/lsf_distrib Please wait ...
  1) linux3.10-glibc2.17-x86_64
Press 1 or Enter to install this host type: 1
Installing linux3.10-glibc2.17-x86_64 ...
Please wait, extracting lsf10.1_lnx310-lib217-x86_64 may take up to a few minutes ...
lsfinstall is done.
After installation, remember to bring your cluster up to date by applying the latest updates and bug fixes.

NOTE: You can also install LSF as a non-root user. The procedure is similar, with one extra prompt asking whether this is a multi-node cluster (yes/no).

Step 5 :  This step is required only if the installation was done by root.

 chown -R lsfadmin:lsfadmin $LSF_TOP

Step 6 :  Check the binary files:

cd $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin

Step 7 : By default, only root can start the LSF daemons, and no other user can submit jobs to your cluster. To make the cluster available to other users, you must manually change the ownership of the lsadmin and badmin binary files to root and set the file permission mode to -rwsr-xr-x (4755) so that the setuid bit for the owner is set.

 chown root lsadmin
 chown root badmin
 chmod 4755 lsadmin
 chmod 4755 badmin
 ls -alsrt lsadmin
 ls -alsrt badmin

chown root  $LSF_SERVERDIR/eauth  

chmod u+s $LSF_SERVERDIR/eauth 

OR 

          ./hostsetup --top="$LSF_TOP" --setuid

Step 8 : Configure  /etc/lsf.sudoers 

[root@myhost1]# cat /etc/lsf.sudoers
LSF_STARTUP_USERS="lsfadmin"
LSF_STARTUP_PATH="/nfs_shared_dir/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/etc"
LSF_EAUTH_KEY="testKey1"

NOTE: The lsf.sudoers file is not installed by default; it is located in /etc. It is used to set the LSF_EAUTH_KEY parameter, which configures a key for eauth to encrypt and decrypt user authentication data. Every host in the cluster should have this file, and in a multi-cluster setup LSF_EAUTH_KEY must be configured in /etc/lsf.sudoers on each side of the multi-cluster.

Step 9 : Check $LSF_SERVERDIR/eauth and copy lsf.sudoers to all hosts in the cluster:

 ls  $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/etc/


scp /etc/lsf.sudoers myhost02:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost03:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost04:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost05:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost06:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost07:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost08:/etc/lsf.sudoers

Step 10 : Start LSF as lsfadmin and check the base installation using the lsid command.

Step 11 : Check binary type with  lsid -V

$ lsid -V
IBM Spectrum LSF 10.1.0.0 build 403338, May 27 2016
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

binary type: linux3.10-glibc2.17-x86_64

NOTE:  Download required FP and interim fixes from https://www.ibm.com/support/fixcentral/ 

Step 12 : Before applying FP12 and the interim patches, bring down the LSF daemons. Use the following commands to shut down the original LSF daemons:

 badmin hshutdown all
 lsadmin resshutdown all
 lsadmin limshutdown all

Deactivate all queues to make sure that no new jobs can be dispatched during the upgrade:

badmin qinact all 

Step 13: Then, become root to apply FP12 and the interim patches.

Source the LSF environment:   .   $LSF_TOP/conf/profile.lsf

.   /nfs_shared_dir/LSF_HOME/conf/profile.lsf

Step 14: Apply FP12 on the LSF base installation. The patchinstall script is available in the $LSF_TOP/10.1/install directory.

         # cd $LSF_TOP/10.1/install

It is recommended to check the patch before installing it:

$ ./patchinstall -c

 ./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z

[root@myhost7 install]# ./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z
Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log
Checking the LSF installation directory /nfs_shared_dir/LSF_HOME ...
Done checking the LSF installation directory /nfs_shared_dir/LSF_HOME.
Checking the patch history directory ...
Done checking the patch history directory /nfs_shared_dir/LSF_HOME/patch.
Checking the backup directory ...
Done checking the backup directory /nfs_shared_dir/LSF_HOME/patch/backup.
Installing package "/root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z"...
Checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z ...
Done checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.
.
.
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600488".
Done installing /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.

Step 15: Apply  interim fix1

./patchinstall /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z

Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log 
Installing package "/root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z"...
Checking the package definition for /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z ...
Are you sure you want to update your cluster with this patch? (y/n) [y] y
Backing up existing files ...
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600505".
Done installing /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z.
Exiting...
Step 16: Apply interim fix2

 ./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z

[root@myhost7 install]# ./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z
Installing package "/root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z"...
Checking the package definition for /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z ...
Backing up existing files ...
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600625".
Done installing /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z.
Exiting...
 
Step 17: As the root user, set the setuid bit for the new bctrld command:

  cd $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin
  chown root bctrld
  chmod 4755 bctrld
 

Step 18 :  Check the lsf.shared file for the multi-cluster setup:

Begin Cluster
ClusterName      Servers
CLUSTER1         (cloudhost)
CLUSTER2         (myhost1)
CLUSTER3         (remotehost2)
End Cluster

Step 19 : Switch back to user lsfadmin. Use the following commands to start LSF using the newer daemons:

  lsadmin limstartup all
  lsadmin resstartup all
  badmin hstartup all

Use the following command to reactivate all LSF queues after upgrading: badmin qact all

Step 20 : Modify the configuration files as required (add queues, clusters, etc.), then run badmin reconfig or lsadmin reconfig as explained in the LSF configuration section below. Restart LSF as the "lsfadmin" user.
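For example, adding a queue means appending a Begin Queue/End Queue section to the lsb.queues file, typically under $LSF_TOP/conf/lsbatch/<cluster_name>/configdir. A minimal sketch (the queue name, priority, and description here are made-up values):

```
Begin Queue
QUEUE_NAME   = test_q
PRIORITY     = 30
DESCRIPTION  = example queue for new workloads
End Queue
```

After saving the file, badmin ckconfig validates the change and badmin reconfig makes mbatchd pick it up.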

$ lsid
IBM Spectrum LSF Standard 10.1.0.12, Jun 10 2021
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is CLUSTER2
My master name is myhost1
$ lsclusters -w
CLUSTER_NAME        STATUS   MASTER_HOST             ADMIN    HOSTS  SERVERS
CLUSTER1            ok       cloudhost            lsfadmin        7        7
CLUSTER2            ok       myhost1              lsfadmin        8        8
CLUSTER3            ok       remotehost2          lsfadmin        8        8
$ bhosts
HOST_NAME        STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
myhost1          ok           -      20      0      0      0      0      0
myhost2          ok           -      20      0      0      0      0      0
myhost3          ok           -      19      0      0      0      0      0
myhost4          ok           -      44      4      4      0      0      0
myhost5          ok           -      44      4      4      0      0      0
myhost6          ok           -      20      0      0      0      0      0
myhost7          ok           -      20      0      0      0      0      0
myhost8          ok           -      19      0      0      0      0      0
The Spectrum LSF cluster installation and FP12 upgrade completed successfully, as shown in the details above.

You must run hostsetup as root to use the --boot="y" option, which modifies the system scripts to automatically start and stop the LSF daemons at system startup or shutdown. The default is --boot="n".

1. Log on to each LSF server host as root. Start with the LSF master host.

2. Run hostsetup on each LSF server host. For example:

# cd $LSF_TOP/10.1/install

# ./hostsetup --top="$LSF_TOP" --boot="y"

NOTE: For more details on hostsetup usage, enter hostsetup -h.

In a multi-cluster environment, reinstalling the master cluster can show status=disc in the output of the bclusters command.


[smpici@c656f7n06 ~]$ bclusters
[Job Forwarding Information ]

LOCAL_QUEUE     JOB_FLOW   REMOTE CLUSTER    STATUS
Queue1          send       CLUSTER1          disc
Queue2          send       CLUSTER2          disc
Queue3          send       CLUSTER3          disc

where status=disc means communication between the two clusters is not established. The disc status might occur because no jobs are waiting to be dispatched, or because the remote master cannot be located.

A possible solution is to clean up all the LSF daemons on all clusters. Note: lsfshutdown leaves some of the daemons running on the master node, so you need to manually kill all remaining LSF daemons on all master nodes.
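A rough cleanup sketch; the daemon process names below (base daemons lim, res, pim, pem and batch daemons sbatchd, mbatchd, mbschd) are assumptions to adjust for your installation, and it must be run with sufficient privileges on each master node:

```shell
# Kill any LSF daemons that lsfshutdown left behind on this node.
for d in lim res pim pem sbatchd mbatchd mbschd; do
    if pkill -x "$d" 2>/dev/null; then
        echo "killed $d"
    else
        echo "no $d running"
    fi
done
```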

Later,  bclusters should show the status as shown below:

[smpici@c656f7n06 ~]$ bclusters
[Job Forwarding Information ]

LOCAL_QUEUE     JOB_FLOW   REMOTE CLUSTER    STATUS
Queue1          send       CLUSTER1          ok
Queue2          send       CLUSTER2          ok
Queue3          send       CLUSTER3          ok

 

Singularity is an OS-level virtualization (containerization) solution designed for high-performance computing cluster environments. It allows a user on an HPC resource to run an application using a different operating system than the one provided by the cluster; for example, the application may require Ubuntu while the cluster OS is CentOS. If you use LSF_EAUTH_KEY in a container-based environment, you may hit an eauth setuid issue. The LSF client invokes "eauth -c" inside the container as the job owner; eauth is a setuid program, so it can read the lsf.sudoers file to get the key. If eauth loses the setuid permission, it cannot read lsf.sudoers and falls back to the default key to encrypt the user information. When the request reaches a server, the server calls "eauth -s", which runs as root on the host; it reads the configured key, tries to decrypt the user information with it, and fails. In other words, only the default key works in a Singularity environment.

This can be resolved by disabling LSF_AUTH_QUERY_COMMANDS in the configuration file as shown below. Since Fix Pack 12, LSF has included LSF_AUTH_QUERY_COMMANDS in lsf.conf, with a default value of Y; it adds extra user authentication for batch query commands. By default, each cluster has its own default eauth key. If a user runs a query such as "bhosts remote_cluster", the local key is used to encrypt the user data, but the command talks to the remote daemon directly; the remote daemon uses its own key to decrypt the data and fails. Job operation information is exchanged through the mbatchd-to-mbatchd communication channel, which does not go through this kind of authentication. Logically, a user should only query the local mbatchd to see a job's status (the remote job status is sent back to the submission cluster).

Modified files:

1) Add LSF_AUTH_QUERY_COMMANDS=N to the `lsf.conf` file
2) Remove LSF_EAUTH_KEY="testKey1" from `/etc/lsf.sudoers`

========================================================================

Step 1 : Check cluster names 

[sachinpb@cluster1_master sachinpb]$ lsclusters  -w
CLUSTER_NAME        STATUS   MASTER_HOST       ADMIN    HOSTS  SERVERS
cluster1   ok    cluster1_master              sachinpb        5        5
cluster2   ok    cluster2_master              sachinpb        8        8
[sachinpb@cluster1_master sachinpb]$ 

Step 2:  Submit job to remote cluster (Job forwarding queue=Forwarding_queue) 

[sachinpb@cluster1_master sachinpb]$  bsub -n 10 -R "span[ptile=2]" -q Forwarding_queue sleep 1000
Job <35298> is submitted to queue <Forwarding_queue>.
[sachinpb@cluster1_master sachinpb]$

Step 3 : Check status of job from submission cluster:

[sachinpb@cluster1_master sachinpb]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
35298   sachinpb  RUN   x86_c656f8 cluster1_master   cluster2_master sleep 1000 Oct 11 03:30
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
                                             cluster2_master@cluster2
[sachinpb@cluster1_master sachinpb]$ bjobs -o 'forward_cluster' 35298
FORWARD_CLUSTER
cluster2
[sachinpb@cluster1_master sachinpb]$ bjobs -o 'dstjobid' 35298
DSTJOBID
8589
[sachinpb@cluster1_master sachinpb]$

Step 4 : Get the list of compute nodes on the remote cluster by issuing the bjobs -m command from the submission cluster as shown below:
[sachinpb@cluster1_master sachinpb]$ bjobs -m cluster2 -o 'EXEC_HOST' 8589
EXEC_HOST
computeNode04:computeNode04:computeNode06:computeNode06:computeNode05:computeNode05:computeNode03:computeNode03:computeNode07:computeNode07
[sachinpb@cluster1_master sachinpb]$
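The EXEC_HOST field contains one entry per allocated slot, separated by colons. A small sketch to reduce the example output above to the unique host names:

```shell
# Colon-separated slot list from the bjobs -o 'EXEC_HOST' output above
hosts="computeNode04:computeNode04:computeNode06:computeNode06:computeNode05:computeNode05:computeNode03:computeNode03:computeNode07:computeNode07"
echo "$hosts" | tr ':' '\n' | sort -u
# prints computeNode03 through computeNode07, one per line
```

The same pipeline can be fed directly from the query, e.g. bjobs -m cluster2 -o 'EXEC_HOST' 8589 | tail -n +2 | tr ':' '\n' | sort -u.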

======================= LSF configuration section ===========================

After you change any configuration file, use the lsadmin reconfig and badmin reconfig commands to reconfigure your cluster. Log on to the host as root or as the LSF administrator (in our case, "lsfadmin").

Run lsadmin reconfig to restart the LIM and check for configuration errors. If no errors are found, you are prompted either to restart the lim daemon on management host candidates only, or to confirm that you want to restart the lim daemon on all hosts. If unrecoverable errors are found, reconfiguration is canceled. Run the badmin reconfig command to reconfigure the mbatchd daemon and check for configuration errors.

  • lsadmin reconfig to reconfigure the lim daemon
  • badmin reconfig to reconfigure the mbatchd daemon without restarting
  • badmin mbdrestart to restart the mbatchd daemon
  • bctrld restart sbd to restart the sbatchd daemon

More details about the cluster reconfiguration commands:

https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=cluster-commands-reconfigure-your

How to resolve some known eauth-related issues, where commands like bhosts, bjobs, etc. fail with the error "User permission denied".
Problem/solution 1: 
Example 1: 
[smpici@host1 ~]$ bhosts
User permission denied
Example 2:
mpirun --timeout 30 hello_world
Jan 11 02:42:52 2022 1221079 3 10.1 lsb_pjob_send_requests: lsb_pjob_getAckReturn failed on host <host1>, lsberrno <0>
[host1:1221079] [[64821,0],0] ORTE_ERROR_LOG: The specified application failed to start in file ../../../../../../opensrc/ompi/orte/mca/plm/lsf/plm_lsf_module.c at line 347
--------------------------------------------------------------------------
The LSF process starter (lsb_launch) failed to start the daemons on
the nodes in the allocation.
Returned : -1
lsberrno : (282) Failed while executing tasks
This may mean that one or more of the nodes in the LSF allocation is
not setup properly.
If you see this, check the clocks on the nodes. If the clocks differ, configure chrony on all nodes as shown below:

systemctl enable chronyd.service
systemctl stop chronyd.service
systemctl start chronyd.service
systemctl status chronyd.service

The root cause of the problem was that the system clocks of the compute nodes and the launch nodes were out of sync. After the clocks were synchronized, the LSF commands worked. In the worst case, if the hosts cannot be time-synchronized, configure LSF_EAUTH_TIMEOUT=0 in lsf.conf on each side of the multi-cluster.
Problem/solution 2:

When a multi-cluster job-forwarding cluster shows the unavail state, check that the port numbers defined in lsf.conf are the same on both clusters; mismatched ports cause the unavail state.

[lsfadmin@ conf]$ lsclusters -w
CLUSTER_NAME          STATUS   MASTER_HOST               ADMIN       HOSTS  SERVERS
cluster1               ok       c1_master              lsfadmin       40       40
cluster2              unavail   unknown                 unknown        -  
Default port numbers :
[root@c1_master ~]# cat $LSF_HOME/conf/lsf.conf | grep PORT
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882
LSB_QUERY_PORT=6891

Useful tip: how long jobs remain visible after they finish (DONE or EXIT).

CLEAN_PERIOD_DONE=seconds

Controls the amount of time during which successfully finished jobs are kept in mbatchd core memory. This applies to DONE and PDONE (post-job execution processing) jobs.

If CLEAN_PERIOD_DONE is not defined, the clean period for DONE jobs is defined by CLEAN_PERIOD in lsb.params. If CLEAN_PERIOD_DONE is defined, its value must be less than CLEAN_PERIOD; otherwise it is ignored and a warning message appears.
 
# bparams -a | grep CLEAN_PERIOD
        CLEAN_PERIOD = 3600
        CLEAN_PERIOD_DONE = not configured
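As an illustration (the values here are made up, not recommendations), keeping finished jobs for two hours while purging DONE jobs after one hour would look like this in lsb.params; note that CLEAN_PERIOD_DONE must stay below CLEAN_PERIOD:

```
Begin Parameters
CLEAN_PERIOD      = 7200   # finished jobs kept in mbatchd memory for 2 hours
CLEAN_PERIOD_DONE = 3600   # DONE/PDONE jobs purged after 1 hour
End Parameters
```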
Problem/solution 3:
Regarding the job-forwarding setup between clusters over public/private IP addresses: if you cannot ssh from the cluster1 master (f2n01-10.x.x.x) to the cluster2 master on its public IP (9.x.x.x), do you need to open those ports with firewall rules? From an LSF point of view, the ports for the lim and mbd need to be opened up. Then issue the lsclusters and bclusters commands; the status reported should be ok in both, and you should try this from both clusters.
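Before touching firewall rules, it can help to confirm which ports are actually unreachable. A sketch using bash's /dev/tcp; the host name is hypothetical and the port numbers are the LSF defaults from lsf.conf (LSF_LIM_PORT=7869, LSB_MBD_PORT=6881), so substitute your own values:

```shell
MASTER=c1_master        # hypothetical remote master host
for p in 7869 6881; do  # LSF_LIM_PORT and LSB_MBD_PORT
    if timeout 2 bash -c "exec 3<>/dev/tcp/$MASTER/$p" 2>/dev/null; then
        echo "port $p reachable"
    else
        echo "port $p unreachable"
    fi
done
```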
Problem/solution 4: 
Test working of blaunch with LSF job submission. The blaunch command works only under LSF. 
It can be used only to launch tasks on remote hosts that are part of a job allocation. 
It cannot be used as a stand-alone command. The call to blaunch is made under the bsub environment.
You cannot run blaunch directly from the command line.
blaunch is not to be used outside of the job execution environment provided by bsub.
Most MPI implementations and many distributed applications use the rsh and ssh commands as their task launching
mechanism. The blaunch command provides a drop-in replacement for the rsh and ssh commands as a transparent
method for launching parallel applications within LSF.The following are some examples of blaunch usage:
Submit a parallel job:
bsub -n 4 blaunch myjob
Submit a job to an application profile:
bsub -n 4 -app pjob blaunch myjob
-----------------Example -----------------------
[sachinpb@xyz]$ cat show_ulimits.c
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
int main()
{
  struct rlimit old_lim;
  if(getrlimit(RLIMIT_CORE, &old_lim) == 0)
      printf("current limits -> soft limit= %ld \t"
           " hard limit= %ld \n", old_lim.rlim_cur, old_lim.rlim_max);
  else
      fprintf(stderr, "%s\n", strerror(errno));
//  abort();
  return 0;
}
[sachinpb@xyz]$
-------------------------------------------------

[submission_node]$ gcc -o show_ulimits show_ulimits.c
[submission_node]$ ls
generate_core.c  show_ulimits  show_ulimits.c  test_limit
[submission_node]$ ./show_ulimits
current limits -> soft limit= 0          hard limit= 0
[submission_node]$

--------------------------------------------------------


[submission_node]$ bsub -o /shared-dir/sachin_test_lsf_out_%J -n 4 -R "span[ptile=2]" -q x86_test_q -m "HOSTA HOSTB" blaunch /shared_dir/core_file_test/show_limits
Job <2511> is submitted to queue <x86_test_q>.
[submission_node]$ bjobs 2511
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2511    sachinpb  DONE  x86_test_q bsub_host HOSTA     show_limits Oct 29 02:17
                                             HOSTA
                                             HOSTB
                                             HOSTB
[submission_node]$


[submission_node]$ cat /nfs_smpi_ci/sachin_test_lsf_out_2511
Sender: LSF System <sachinpb@HOSTA>
Subject: Job 2511: <blaunch /nfs_smpi_ci/core_file_test/test_limit> in cluster <x86-64_pok-cluster2> Done

Job <blaunch /nfs_smpi_ci/core_file_test/test_limit> was submitted from host <bsub_host> by user <sachinpb> in cluster <x86-64_pok-cluster2> at Thu Oct 29 02:17:45 2020
Job was executed on host(s) <2*HOSTA>, in queue <x86_test_q>, as user <sachinpb> in cluster <x86-64_pok-cluster2> at Thu Oct 29 02:23:52 2020
                            <2*HOSTB>

Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
blaunch /nfs_smpi_ci/core_file_test/test_limit
------------------------------------------------------------

Successfully completed.

The output (if any) follows:

current limits -> soft limit= -1         hard limit= -1
current limits -> soft limit= -1         hard limit= -1
current limits -> soft limit= -1         hard limit= -1
current limits -> soft limit= -1         hard limit= -1
[submission_node]$

-----------------------------------------------------------------------------------
Problem/solution 5: 
When job submission fails with "User permission denied",
please check the setuid bits as shown below:
[smpici@c690f2n01 big-mpi]$ bsub
bsub> sleep 100
bsub> User permission denied. Job not submitted.
$LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/bin 
chown root lsadmin
chown root badmin
chmod 4755 lsadmin
chmod 4755 badmin

chown root bctrld
chmod 4755 bctrld 

$LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/etc
chown root eauth  
chmod u+s eauth 

Just run this script to avoid the manual configuration:
./hostsetup --top="$LSF_HOME" --setuid
--------------------------------------------------
References:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=migrate-install-unix-linux
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=iul-if-you-install-lsf-as-non-root-user