Thursday, December 26, 2019

Distributed computing with Message Passing Interface (MPI)

One thing is certain: the explosion of data creation in our society will continue for as far ahead as anyone can forecast. In response, there is insatiable demand for more advanced high performance computing to make this data useful.

The IT industry has been pushing to new levels of high-end computing performance; this is the dawn of the exascale era of computing. Recent announcements from the US Department of Energy for exascale computers represent the starting point for a new generation of computing advances. This is critical for the advancement of any number of use cases, such as understanding the interactions underlying weather, sub-atomic structures, genomics, physics, rapidly emerging artificial intelligence applications, and other important scientific fields. The pace of supercomputer performance doubling slowed to roughly every 2.3 years between 2009 and 2019, due to several factors including the slowdown of Moore’s Law and technical constraints such as the end of Dennard scaling. Pushing the bleeding edge of performance and efficiency will require new architectures and computing paradigms. There is a good chance that 5-nanometer technology could come to market later this year or in 2021, thanks to advances in semiconductor engineering.

The rapidly increasing number of cores in modern microprocessors is pushing current high performance computing (HPC) systems into the exascale era. The hybrid nature of these systems – distributed memory across nodes and shared memory with non-uniform memory access within each node – poses a challenge. The Message Passing Interface (MPI) is a standardized, portable message-passing library interface specification developed for distributed and parallel computing: an abstract description of how messages can be exchanged between different processes. Because MPI is a standard, it has multiple implementations. It provides a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory, and lets users call a set of routines from C, C++, Fortran, C#, Java, or Python. The advantages of MPI over older message-passing libraries are portability (MPI has been implemented for almost every distributed memory architecture) and speed (each implementation is in principle optimized for the hardware on which it runs). MPI is the dominant communications protocol used in high performance computing today for writing message-passing programs. The MPI-4.0 standard is under development.

MPI Implementations and their derivatives:

A number of groups work on MPI implementations. The two principal ones are Open MPI, an open-source implementation, and MPICH. MPICH serves as the foundation for the vast majority of MPI derivatives, including IBM MPI (for Blue Gene), Intel MPI, Cray MPI, Microsoft MPI, Myricom MPI, OSU MVAPICH/MVAPICH2, and many others; MPICH and its derivatives form the most widely used implementations of MPI in the world. On the other side, IBM Spectrum MPI and Mellanox HPC-X are based on Open MPI. Similarly, bullx MPI is built around Open MPI, enhanced by Bull with optimized collective communication.

Open MPI was formed by merging FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI, and is found in many TOP500 supercomputers.

MPI offers a standard API and is portable: the same source code can be used on all platforms without modification, and it is relatively trivial to switch an application between different MPI implementations. Most MPI implementations use sockets for TCP-based communication; odds are good that any given MPI implementation will be better optimized and provide faster message passing than a home-grown application using sockets directly. In addition, should you ever get the chance to run your code on a cluster with InfiniBand, the MPI layer will abstract away the necessary code changes. This is not a trivial advantage – coding an application to use an OFED (or other InfiniBand Verbs) implementation directly is very difficult. Most MPI implementations also include small test applications that can be used to verify the correctness of the networking setup independently of your application, which is a major advantage when it comes time to debug. The MPI standard includes the PMPI profiling interface, which also allows you to easily add checksums or other data verification to every message.

The standard Message Passing Interface (MPI) has two-sided (point-to-point) and collective communication models. In these models, both sender and receiver have to participate in data exchange operations explicitly, which requires synchronization between the processes. Communications can be of two types:
  •     Point-to-Point : Two processes in the same communicator are going to communicate.
  •     Collective : All the processes in a communicator are going to communicate together.

One-sided communications are a newer type that allows communication to happen in a highly asynchronous way by defining windows of memory that every process can write to and read from. All of these revolve around the idea of Remote Memory Access (RMA). Traditional point-to-point or collective communications basically work in two steps: first the data is transferred from the originating process(es) to the destination(s); then the sending and receiving processes are synchronized in some way (be it blocking synchronization or by calling MPI_Wait). RMA allows us to decouple these two steps. One of the biggest implications of this is the possibility to define shared memory that will be used by many processes (cf. MPI_Win_allocate_shared). Although shared memory might seem out of scope for MPI, which was initially designed for distributed memory, it makes sense to include such functionality to support processes sharing the same NUMA node, for instance. These functionalities are grouped under the name of "one-sided communications", since only one process needs to act to store or load information in a shared-memory buffer.
 



In two-sided communication, memory is private to each process. When the sender calls the MPI_Send operation and the receiver calls the MPI_Recv operation, data in the sender's memory is copied to a buffer and sent over the network, where it is copied into the receiver's memory. One drawback of this approach is that the sender has to wait for the receiver to be ready to receive the data before it can send it, which may delay the transfer.


A simplified diagram of MPI two-sided send/receive: the sender calls MPI_Send but has to wait until the receiver calls MPI_Recv before data can be sent. To overcome this drawback, the MPI-2 standard introduced Remote Memory Access (RMA), also called one-sided communication because it requires only one process to transfer data. One-sided communication decouples data transfer from synchronization. The MPI 3.0 standard revised and extended one-sided communication, adding new functionality to improve the performance of MPI-2 RMA.

Collective Data Movement:

MPI_BCAST, MPI_GATHER, and MPI_SCATTER are collective data movement routines in which all processes interact with a distinguished root process. Consider, for example, the communication performed in a finite difference program running on three processes. Each column represents a processor; each figure shows data movement in a single phase. The five phases illustrated are (1) broadcast, (2) scatter, (3) nearest-neighbour exchange, (4) reduction, and (5) gather.

  1. MPI_BCAST to broadcast the problem size parameter (size) from process 0 to all np processes;
  2. MPI_SCATTER to distribute an input array (work) from process 0 to other processes, so that each process receives size/np elements; 
  3. MPI_SEND and MPI_RECV for exchange of data (a single floating-point number) with neighbours;
  4. MPI_ALLREDUCE to determine the maximum of a set of localerr values computed at the different processes and to distribute this maximum value to each process; and
  5. MPI_GATHER to accumulate an output array at process 0. 

Many common MPI benchmarks are based primarily on point-to-point communication, which makes them good vehicles for analyzing the performance impact of Open MPI's Modular Component Architecture (MCA) on real applications. Open MPI implements the MPI point-to-point functions on top of the Point-to-point Management Layer (PML) and Point-to-point Transport Layer (PTL) frameworks. The PML fragments messages, schedules fragments across PTLs, and handles incoming message matching; the PTL provides an interface between the PML and the underlying network devices.




where:
  •     PML: Point-to-point Management Layer
  •     PTL: Point-to-point Transport Layer
  •     BTL: Byte Transfer Layer


Open MPI is a large project containing many different sub-systems and a relatively large code base. It has three sections of code, listed here:
  •     OMPI: The MPI API and supporting logic
  •     ORTE: The Open Run-Time Environment (support for different back-end run-time systems)
  •     OPAL: The Open Portable Access Layer (utility and "glue" code used by OMPI and ORTE)
There are strict abstraction barriers in the code between these sections. That is, they are compiled into three separate libraries: libmpi, liborte, and libopal with a strict dependency order: OMPI depends on ORTE and OPAL, and ORTE depends on OPAL.

The Message Passing Interface (MPI) is one of the most popular parallel programming models for distributed memory systems. As the number of cores per node has increased, programmers have increasingly combined MPI with shared-memory parallel programming interfaces, such as the OpenMP programming model. This hybrid of distributed-memory and shared-memory idioms helps programmers perform efficient internode communication while effectively utilizing advances in node-level architectures, including multicore and many-core processors. Version 3.0 of the MPI standard adds a new interprocess shared memory extension (MPI SHM), now supported by many MPI distributions. The MPI SHM extension enables programmers to create regions of shared memory that are directly accessible by MPI processes within the same shared memory domain. In contrast with hybrid approaches, MPI SHM offers an incremental approach to managing memory resources within a node: data structures can be individually moved into shared segments to reduce the memory footprint and improve the communication efficiency of MPI programs. Halo exchange is a prototypical neighborhood exchange communication pattern. In such patterns, the adjacency of communication partners often results in communication with processes in the same node, making them good candidates for acceleration through MPI SHM. Applied to this common communication pattern, direct data sharing can be used instead of communication, yielding significant performance gains.


Open MPI includes an implementation of OpenSHMEM. OpenSHMEM is a PGAS (partitioned global address space) API for single-sided asynchronous scalable communications in HPC applications. An OpenSHMEM program is SPMD (single program, multiple data) in style. The SHMEM processes, called processing elements or PEs, all start at the same time and they all run the same program. Usually the PEs perform computation on their own subdomains of the larger problem and periodically communicate with other PEs to exchange information on which the next computation phase depends. OpenSHMEM is particularly advantageous for applications at extreme scales with many small put/get operations and/or irregular communication patterns across compute nodes, since it offloads communication operations to the hardware whenever possible. One-sided operations are non-blocking and asynchronous, allowing the program to continue its execution along with the data transfer.

IBM Spectrum® MPI is a high-performance, production-quality implementation of the Message Passing Interface (MPI) that accelerates application performance in distributed computing environments. Because it is built on Open MPI, its basic architecture and functionality are similar: it uses the same basic code structure, made up of the OMPI, ORTE, and OPAL sections discussed above, and provides a familiar, portable interface based on the open-source MPI. It goes beyond Open MPI, however, adding unique features of its own such as advanced CPU affinity features, dynamic selection of interface libraries, superior workload manager integration, and better performance. IBM Spectrum MPI supports a broad range of industry-standard platforms, interconnects, and operating systems, helping to ensure that parallel applications can run almost anywhere. IBM Spectrum MPI Version 10.2 delivers an improved, RDMA-capable Parallel Active Messaging Interface (PAMI) using Mellanox OFED on both POWER8® and POWER9™ systems in Little Endian mode. It also offers an improved collective MPI library that supports the seamless use of GPU memory buffers for the application developer; the library provides advanced logic to select the fastest of many algorithm implementations for each MPI collective operation.


As high-performance computing (HPC) bends to the needs of "big data" applications, speed remains essential. But it is not only a question of how quickly one can compute problems; it is also how quickly one can program the complex applications that do so. High performance computing is no longer limited to those who own supercomputers. HPC's democratization has been driven particularly by cloud computing, which has given scientists access to supercomputing-like features at the cost of a few dollars per hour. Interest in HPC in the cloud has been growing over the past few years. The cloud offers applications a range of benefits, including elasticity, small startup and maintenance costs, and economies of scale. Yet, compared to traditional HPC systems such as supercomputers, some of the cloud's primary benefits for HPC arise from its virtualization flexibility: in contrast to supercomputers' strictly preserved system software, the cloud lets scientists build their own virtual machines and configure them to suit their needs and preferences. In general, the cloud is still considered an addition to traditional supercomputers – a bursting solution for cases in which internal resources are overused, especially for small-scale experiments, testing, and initial research. Clouds are convenient for embarrassingly parallel applications (those that do not communicate much among partitions), which can scale even on the commodity interconnects common to contemporary clouds. This is the beauty of supercomputer engineering – demand driving innovation – and the exascale era is just the next milestone on the never-ending HPC journey.

Reference:
https://stackoverflow.com/questions/153616/mpi-or-sockets
https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.3/admin/smpi02_running_apps.html
https://hpc.llnl.gov/sites/default/files/MPI-SpectrumUserGuide.pdf 
https://computing.llnl.gov/tutorials/mpi/
http://www.cs.nuim.ie/~dkelly/CS402-06/Message%20Passing%20Interface.htm 
https://www.sciencedirect.com/topics/computer-science/message-passing-interface 
https://www.nextplatform.com/2020/02/13/going-beyond-exascale-computing/?_lrsc=5244db38-a9d0-4d40-9c04-2d8c2ecf4755

Friday, October 18, 2019

How to use SystemTap - Who killed my process?




In computing, SystemTap (stap) is a scripting language and tool for dynamically instrumenting running production Linux kernel-based operating systems. System administrators can use SystemTap to extract, filter and summarize data in order to enable diagnosis of complex performance or functional problems.

SystemTap consists of free and open-source software and includes contributions from Red Hat, IBM, Intel, Hitachi, Oracle, and other community members.


Installation: yum install systemtap systemtap-runtime
 


To determine which process is sending the signal to application/process, it is necessary to trace the signals through the Linux kernel. 

Script 1: An example script that monitors SIGKILL and SIGTERM sent to the myApp_mtt process

cat my-systemtap_SIGKILL_SIGTERM.stp
--------------------------------------------------------------------- 
#! /usr/bin/env stap
#
# This systemtap script monitors for SIGKILL and SIGTERM signals sent to
# a process named "myApp_mtt", and shows the process tree of the process
# that tried to kill "myApp_mtt".
#

probe signal.send {
  if ((sig_name == "SIGKILL" || sig_name == "SIGTERM") && pid_name == "myApp_mtt") {
    printf("%10d   %-34s   %-10s   %5d   %-7s   %s pid: %d, tid:%d uid:%d ppid:%d\n",
             gettimeofday_s(), tz_ctime(gettimeofday_s()), pid_name, sig_pid, sig_name, execname(), pid(), tid(), uid(), ppid());

    cur_proc = task_current();
    parent_pid = task_pid(task_parent (cur_proc));

    while (parent_pid != 0) {
        printf ("%s (%d),%d,%d -> ", task_execname(cur_proc), task_pid(cur_proc), task_uid(cur_proc),task_gid (cur_proc));
        cur_proc = task_parent(cur_proc);
        parent_pid = task_pid(task_parent (cur_proc));
    }
  }
}

probe begin {
  printf ("\nSACHIN P B: Investigating a murder mystery of Mr. myApp_mtt\n");
  printf("systemtap script started at: %s\n\n", tz_ctime(gettimeofday_s()));
  printf("%50s%-18s\n",
    "",  "Signaled Process");
  printf("%-10s   %-34s   %-10s   %5s   %-7s   %s\n",
    "Epoch", "Time of Signal", "Name", "PID", "Signal", "Signaling Process Name");
  printf("---------------------------------------------------------------");
  printf("---------------------------------------------------------------");
  printf("\n");
}

probe end {
  printf("\n");
}
----------------------------------------------------------
Script 2:  Sample Shell script to send signals SIGTERM/SIGKILL

cat I_am_killer-007.sh
#!/bin/bash
echo "I am going to kill Mr.myApp_mtt sooner.....wait and watch"
sleep 20
pkill -SIGTERM myApp_mtt     ----> CASE 1
pkill -SIGKILL myApp_mtt     ----> CASE 2
echo "Done !!!.......Catch me if you can !"
-----------------------------------------------------
CASE 1: Test SIGTERM
Step 1: Let's start systemtap as shown below:
[root@myhostname sachin]# stap my-systemtap_SIGKILL_SIGTERM.stp
SACHIN P B: Investigating a murder mystery of Mr. myApp_mtt
systemtap script started at: Thu Oct 17 18:34:54 2019 EDT

                                                  Signaled Process
Epoch        Time of Signal                       Name           PID   Signal    Signaling Process Name
------------------------------------------------------------------------------------------------------------------------------
(systemtap waits here and prints a log line when SIGTERM or SIGKILL is caught)

+++++++++++++++++++++++++++++++++++
Step 2: Let's start our application myApp_mtt
[root@myhostname sachin]# ./myApp_mtt &
[1] 114583
[root@myhostname sachin]#

[root@myhostname sachin]#  ps -ef | grep myApp_mtt | grep -v grep
root     114583  80054  0 19:04 pts/8    00:00:00 ./myApp_mtt
[root@myhostname sachin]#
++++++++++++++++++++++++++++++++++
Step 3: Let's kill this application by sending SIGTERM

[root@myhostname sachin]# ./I_am_killer-007.sh
I am going to kill Mr.myApp_mtt sooner.....wait and watch
+++++++++++++++++++++++++++++++++++++
Step 4: Verify the PID/PPID of the process that sends SIGTERM
[root@myhostname sachin]#  ps -ef | grep I_am_killer-007.sh | grep -v grep
root     122566  79450  0 19:05 pts/7    00:00:00 /bin/bash ./I_am_killer-007.sh
[root@myhostname sachin]#
+++++++++++++++++++++++++++++++++++++
Step 5: Check for completion:
[root@myhostname sachin]# ./I_am_killer-007.sh
I am going to kill Mr.myApp_mtt sooner.....wait and watch
Done !!!.......Catch me if you can !
[root@myhostname sachin]#
[root@myhostname sachin]#  ps -ef | grep I_am_killer-007.sh | grep -v grep
root     122566  79450  0 19:05 pts/7    00:00:00 /bin/bash ./I_am_killer-007.sh
[root@myhostname sachin]#
[1]+  Terminated              ./myApp_mtt
[root@myhostname sachin]#
++++++++++++++++++++++++++++++++++++++
Step 6: Check the systemtap logs - the logged ppid should match the PID of the killer's parent process.
[root@myhostname sachin]# stap my-systemtap_SIGKILL_SIGTERM.stp
SACHIN P B: Investigating a murder mystery of Mr. myApp_mtt
systemtap script started at: Thu Oct 17 19:03:53 2019 EDT

                                                  Signaled Process
Epoch        Time of Signal                       Name           PID   Signal    Signaling Process Name
------------------------------------------------------------------------------------------------------------------------------
1571353566   Thu Oct 17 19:06:06 2019 EDT         myApp_mtt       114583   SIGTERM   pkill pid: 124080, tid:124080 uid:0 ppid:122566
pkill (124080),0,0 -> I_am_killer-007 (122566),0,0 -> bash (79450),0,0 -> su (79449),0,0 -> sudo (78656),0,0 -> bash (78202),560045,100 -> sshd (78200),560045,100 -> sshd (77624),0,0 -> sshd (11405),0,0 ->



+++++++++++++++++++++++++++++++++++++++++++
CASE 2: Test SIGKILL
Step 1: Now change the script to send the SIGKILL signal to myApp_mtt.
[root@myhostname sachin]# cat I_am_killer-007.sh
#!/bin/bash
echo "I am going to kill Mr.myApp_mtt sooner.....wait and watch"
sleep 20
pkill -SIGKILL myApp_mtt
echo "Done !!!.......Catch me if you can !"
++++++++++++++++++++++++++++++++++++++++++++

Step 2: Verify the PID/PPID of the process that sends SIGKILL
[root@myhostname sachin]#  ./myApp_mtt &
[2] 151008
[root@myhostname sachin]# ps -ef | grep myApp_mtt | grep -v grep
root     151008 150421  0 03:07 pts/45   00:00:00 ./myApp_mtt
[root@myhostname sachin]#  ps -ef | grep I_am_killer-007.sh | grep -v grep
root     151027 150627  0 03:07 pts/4    00:00:00 /bin/bash ./I_am_killer-007.sh
[root@myhostname sachin]#
[root@myhostname sachin]# ./I_am_killer-007.sh
I am going to kill Mr.myApp_mtt sooner.....wait and watch
Done !!!.......Catch me if you can !
[root@myhostname sachin]#
[1]   Killed                  ./myApp_mtt
[root@myhostname sachin]#
+++++++++++++++++++++++++++++++++++++++++++
Step 3: Check the systemtap logs for the SIGKILL signal to identify the process that killed myApp_mtt
[root@myhostname sachin]# stap my-systemtap_SIGKILL_SIGTERM.stp
SACHIN P B: Investigating a murder mystery of Mr. myApp_mtt
systemtap script started at: Fri Oct 18 03:07:34 2019 EDT

                                                  Signaled Process
Epoch        Time of Signal                       Name           PID   Signal    Signaling Process Name
------------------------------------------------------------------------------------------------------------------------------
1571382496   Fri Oct 18 03:08:16 2019 EDT         myApp_mtt       151008   SIGKILL   pkill pid: 151049, tid:151049 uid:0 ppid:151027
pkill (151049),0,0 -> I_am_killer-007 (151027),0,0 -> bash (150627),0,0 -> su (150626),0,0 -> sudo (150623),0,0 -> bash (150589),560045,100 -> sshd (150588),560045,100 -> sshd (150582),0,0 -> sshd (11405),0,0 ->


Conclusion: We caught the killer (I_am_killer-007) that sent the signal (SIGTERM/SIGKILL) to our process/application.

++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reference:
1) https://sourceware.org/systemtap/SystemTap_Beginners_Guide/
2) https://www.thegeekdiary.com/how-to-find-which-process-is-killing-myApp_mtt-with-sigkill-or-sigterm-on-linux/
3) https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html-single/systemtap_language_reference/index
4) http://epic-alfa.kavli.tudelft.nl/share/doc/systemtap-client-2.7/examples/network/connect_stat.stp

Thursday, August 15, 2019

Spectrum LSF GPU enhancements & Enabling GPU features

The IBM Spectrum LSF Suites portfolio redefines cluster virtualization and workload management by providing a tightly integrated solution for demanding, mission-critical HPC environments that can increase both user productivity and hardware utilization while decreasing system management costs. The heterogeneous, highly scalable and available architecture provides support for traditional high-performance computing and high throughput workloads, as well as for big data, cognitive, GPU machine learning, and containerized workloads. Clients worldwide are using technical computing environments supported by LSF to run hundreds of genomic workloads, including Burrows-Wheeler Aligner (BWA), SAMtools, Picard, GATK, Isaac, CASAVA, and other frequently used pipelines for genomic analysis.

IBM Spectrum LSF provides support for heterogeneous computing environments, including NVIDIA GPUs. With the ability to detect, monitor and schedule GPU enabled workloads to the appropriate resources, IBM Spectrum LSF enables users to easily take advantage of the benefits provided by GPUs.  

Solution highlights include: 
  •     Enforcement of GPU allocations via cgroups
  •     Exclusive allocation and round robin shared mode allocation
  •     CPU-GPU affinity
  •     Boost control
  •     Power management
  •     Multi-Process Server (MPS) support
  •     NVIDIA Pascal and DCGM support 
The order of GPU conditions when allocating the GPUs are as follows:
  •     The largest GPU compute capability (gpu_factor value).
  •     GPUs with direct NVLink connections.
  •     GPUs with the same model, including the GPU total memory size.
  •     The largest available GPU memory.
  •     The number of concurrent jobs on the same GPU.
  •     The current GPU mode.

Configurations:

1) GPU auto-configuration
Enabling GPU detection for LSF is now available with automatic configuration. To enable automatic GPU configuration, set LSF_GPU_AUTOCONFIG=Y in the lsf.conf file. LSF_GPU_AUTOCONFIG controls whether LSF enables use of GPU resources automatically. If set to Y, LSF automatically configures built-in GPU resources and automatically detects GPUs. If set to N, manual configuration of GPU resources is required to use GPU features in LSF. Whether LSF_GPU_AUTOCONFIG is set to Y or N, LSF always collects GPU metrics from hosts.
When enabled, the lsload -gpu, lsload -gpuload, and lshosts -gpu commands show host-based or GPU-based resource metrics for monitoring.

2) The LSB_GPU_NEW_SYNTAX=extend parameter must be defined in the lsf.conf file to enable the -gpu option and GPU_REQ parameter syntax.

3) Other configurations :

  • To configure GPU resource requirements for an application profile, specify the GPU_REQ parameter in the lsb.applications file, e.g. GPU_REQ="gpu_req"
  • To configure GPU resource requirements for a queue, specify the GPU_REQ parameter in the lsb.queues file, e.g. GPU_REQ="gpu_req"
  • To configure default GPU resource requirements for the cluster, specify the LSB_GPU_REQ parameter in the lsf.conf file, e.g. LSB_GPU_REQ="gpu_req"
---------------------------------------------------------------------------------------------
Configuration change required on clusters : LSF_HOME/conf/lsf.conf

#To enable "-gpu"
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSB_GPU_REQ="num=4:mode=shared:j_exclusive=yes"
--------------------------------------------------------------------------------------------
Specify additional GPU resource requirements
LSF now lets you specify additional GPU resource requirements to further refine the GPU resources that are allocated to your jobs. The existing bsub -gpu command option, the LSB_GPU_REQ parameter in the lsf.conf file, and the GPU_REQ parameter in the lsb.queues and lsb.applications files now accept additional GPU options to make the following requests:
  •     The gmodel option requests GPUs with a specific brand name, model number, or total GPU memory.
  •     The gtile option specifies the number of GPUs to use per socket.
  •     The gmem option reserves the specified amount of memory on each GPU that the job requires.
  •     The nvlink option requests GPUs with NVLink connections.
You can also use these options in the bsub -R command option or RES_REQ parameter in the lsb.queues and lsb.applications files for complex GPU resource requirements, such as for compound or alternative resource requirements. Use the gtile option in the span[] string and the other options (gmodel, gmem, and nvlink) in the rusage[] string as constraints on the ngpus_physical resource.
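Putting these options together, two hedged submission sketches (the executable name ./gpu_app and the specific option values are hypothetical placeholders; the option syntax follows the descriptions above):

```shell
# Request 2 GPUs in shared mode on a specific model with NVLink connections
bsub -gpu "num=2:mode=shared:gmodel=TeslaV100_SXM2:nvlink=yes" ./gpu_app

# Complex form: reserve per-GPU memory via rusage[] and tile GPUs per socket via span[]
bsub -R "rusage[ngpus_physical=2:gmem=4G] span[gtile=1]" ./gpu_app
```

Note how gtile lives in the span[] string while gmodel, gmem, and nvlink act as constraints on ngpus_physical in the rusage[] string, as described above.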

Monitor GPU resources with lsload command
Options within the lsload command show the host-based and GPU-based GPU information for a cluster. The lsload -l command does not show GPU metrics. GPU metrics can be viewed using the lsload -gpu command, lsload -gpuload command, and lshosts -gpu command.

lsload -gpu

[root@powerNode2 ~]# lsload -gpu
HOST_NAME       status  ngpus  gpu_shared_avg_mut  gpu_shared_avg_ut  ngpus_physical
powerNode1           ok      4                  0%                 0%               4
powerNode2           ok      4                  0%                 0%               4
powerNode3           ok      4                  0%                 0%               4
powerNode4           ok      4                  0%                 0%               4
powerNode5           ok      4                  0%                 0%               4
[root@powerNode2 ~]#


lsload -gpuload
[root@powerNode2 ~]# lsload -gpuload
HOST_NAME       gpuid   gpu_model   gpu_mode  gpu_temp   gpu_ecc  gpu_ut  gpu_mut gpu_mtotal gpu_mused   gpu_pstate   gpu_status   gpu_error
powerNode1 0 TeslaV100_S        0.0       33C       0.0      0%       0%      15.7G        0M            0           ok           -
                    1 TeslaV100_S        0.0       36C       0.0      0%       0%      15.7G        0M            0            ok           -
                    2 TeslaV100_S        0.0       33C       0.0      0%       0%      15.7G        0M            0            ok           -
                    3 TeslaV100_S        0.0       36C       0.0      0%       0%      15.7G        0M            0            ok           -
powerNode2 0 TeslaP100_S        0.0       37C       0.0      0%       0%      15.8G        0M            0           ok           -
                    1 TeslaP100_S        0.0       32C       0.0      0%       0%      15.8G        0M            0           ok           -
                    2 TeslaP100_S        0.0       36C       0.0      0%       0%      15.8G        0M            0           ok           -
                    3 TeslaP100_S        0.0       31C       0.0      0%       0%      15.8G        0M            0           ok           -
powerNode3 0 TeslaP100_S        0.0       33C       0.0      0%       0%      15.8G        0M            0           ok           -
                    1 TeslaP100_S        0.0       32C       0.0      0%       0%      15.8G        0M            0           ok           -
                    2 TeslaP100_S        0.0       35C       0.0      0%       0%      15.8G        0M            0           ok           -
                    3 TeslaP100_S        0.0       37C       0.0      0%       0%      15.8G        0M            0           ok           -
powerNode4 0 TeslaV100_S        0.0       35C       0.0      0%       0%      15.7G        0M            0           ok           -
                    1 TeslaV100_S        0.0       35C       0.0      0%       0%      15.7G        0M            0           ok           -
                    2 TeslaV100_S        0.0       32C       0.0      0%       0%      15.7G        0M            0           ok           -
                    3 TeslaV100_S        0.0       36C       0.0      0%       0%      15.7G        0M            0           ok           -
powerNode5 0 TeslaP100_S        0.0       31C       0.0      0%       0%      15.8G        0M            0           ok           -
                    1 TeslaP100_S        0.0       32C       0.0      0%       0%      15.8G        0M            0           ok           -
                    2 TeslaP100_S        0.0       34C       0.0      0%       0%      15.8G        0M            0           ok           -
                    3 TeslaP100_S        0.0       36C       0.0      0%       0%      15.8G        0M            0           ok           -
[root@powerNode2 ~]#


The -gpu option for bhosts shows GPU usage and job counts for each host.

[root@powerNode2 ~]# bhosts -gpu
HOST_NAME              ID           MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV
powerNode1               0 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        1 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        2 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        3 TeslaP100_SXM2_        0M        0M      0      0      0      0
powerNode2              0 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        1 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        2 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        3 TeslaP100_SXM2_        0M        0M      0      0      0      0
powerNode3              0 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        1 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        2 TeslaP100_SXM2_        0M        0M      0      0      0      0
                        3 TeslaP100_SXM2_        0M        0M      0      0      0      0
powerNode4              0 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        1 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        2 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        3 TeslaV100_SXM2_        0M        0M      0      0      0      0
powerNode5              0 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        1 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        2 TeslaV100_SXM2_        0M        0M      0      0      0      0
                        3 TeslaV100_SXM2_        0M        0M      0      0      0      0
[root@powerNode2 ~]# 

The -gpu option for lshosts shows the GPU topology information for the cluster: the GPU model, driver version, GPU compute-capability factor, and the NUMA node each GPU is attached to.

[root@powerNode2 ~]# lshosts -gpu
HOST_NAME   gpu_id       gpu_model   gpu_driver   gpu_factor      numa_id
powerNode1       0 TeslaP100_SXM2_       418.67          6.0            0
                 1 TeslaP100_SXM2_       418.67          6.0            0
                 2 TeslaP100_SXM2_       418.67          6.0            1
                 3 TeslaP100_SXM2_       418.67          6.0            1
powerNode2       0 TeslaP100_SXM2_       418.67          6.0            0
                 1 TeslaP100_SXM2_       418.67          6.0            0
                 2 TeslaP100_SXM2_       418.67          6.0            1
                 3 TeslaP100_SXM2_       418.67          6.0            1
powerNode3       0 TeslaP100_SXM2_       418.67          6.0            0
                 1 TeslaP100_SXM2_       418.67          6.0            0
                 2 TeslaP100_SXM2_       418.67          6.0            1
                 3 TeslaP100_SXM2_       418.67          6.0            1
powerNode4       0 TeslaV100_SXM2_       418.67          7.0            0
                 1 TeslaV100_SXM2_       418.67          7.0            0
                 2 TeslaV100_SXM2_       418.67          7.0            8
                 3 TeslaV100_SXM2_       418.67          7.0            8
powerNode5       0 TeslaV100_SXM2_       418.67          7.0            0
                 1 TeslaV100_SXM2_       418.67          7.0            0
                 2 TeslaV100_SXM2_       418.67          7.0            8
                 3 TeslaV100_SXM2_       418.67          7.0            8
[root@powerNode2 ~]# 
Job Submission:
1) Submit a normal job
[sachinpb@powerNode2 ~]$ bsub -q ibm_q -R "select[type==ppc]" sleep 200
Job <24807> is submitted to queue <ibm_q>.
[sachinpb@powerNode2 ~]$

2) Submit a job with GPU requirements:
[sachinpb@powerNode2 ~]$  bsub -q ibm_q -gpu "num=1" -R "select[type==ppc]" sleep 200
Job <24808> is submitted to queue <ibm_q>.
[sachinpb@powerNode2 ~]$

3) List jobs
[sachinpb@powerNode2 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME     SUBMIT_TIME
24807   sachinpb  RUN   ibm_q    powerNode2   powerNode6 sleep 200     Aug  1 05:34
24808   sachinpb  RUN   ibm_q    powerNode2   powerNode2 sleep 200     Aug  1 05:34
[sachinpb@powerNode2 ~]$

Job <24807> was submitted without the "-gpu" option, so it was dispatched to a non-GPU node (powerNode6). Job <24808>, which requested a GPU, ran on powerNode2, one of the hosts with four GPUs listed in the lshosts -gpu output above.
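The "num=1" requirement is the simplest form of the -gpu resource string. Based on the LSF documentation, the option also accepts sub-options such as mode, gmem, and j_exclusive; the sketch below (reusing the ibm_q queue from the examples above) is illustrative, so verify the sub-option names against your installed LSF version:

```shell
# Request 2 GPUs in exclusive-process mode, each with at least 8 GB of
# GPU memory, reserved exclusively for this job (sub-option names per
# the LSF 10.1 documentation -- check them against your version).
bsub -q ibm_q -gpu "num=2:mode=exclusive_process:gmem=8G:j_exclusive=yes" \
     -R "select[type==ppc]" sleep 200
```

Requesting exclusive mode and a GPU memory minimum keeps jobs from landing on GPUs that are already busy or too small for the workload.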

4) Submit a job with GPU requirements to another cluster (x86-cluster2), where the clusters are configured in job-forwarding mode:

[sachinpb@powerNode2 ~]$ lsclusters
CLUSTER_NAME     STATUS   MASTER_HOST      ADMIN      HOSTS  SERVERS
power_cluster1   ok       powerNode2       lsfadmin       5        5
x86-64_cluster2  ok       x86-masterNode   lsfadmin       8        8
[sachinpb@powerNode2 ~]$


[sachinpb@powerNode2 ~]$ bsub -q x86_q -gpu "num=1" -R "select[type==X86_64]" sleep 200
Job <46447> is submitted to queue <x86_q>.
[sachinpb@powerNode2 ~]$
[sachinpb@powerNode2 ~]$ bjobs
JOBID   USER    STAT  QUEUE        FROM_HOST   EXEC_HOST                    JOB_NAME   SUBMIT_TIME
46447   sachinpb  RUN   x86_q           powerNode2   x86_intelbox@x86-cluster2    sleep 200      Feb  9 00:55 
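To confirm that job forwarding between the clusters is working, two commands are useful (a sketch based on the LSF documentation; exact output columns vary by LSF version):

```shell
# Show the status of the MultiCluster connections between the local
# and remote clusters (send/receive queue pairings).
bclusters

# Show full details for a forwarded job, including the remote cluster
# and execution host it was forwarded to.
bjobs -l 46447
```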


I hope this blog helped in understanding how to enable GPU support in IBM Spectrum LSF and how to submit GPU jobs.
NOTE: GPU-enabled workloads are supported from IBM Spectrum LSF Version 10.1 Fix Pack 6 onwards. LSF hosts must run RHEL 7 or higher to support LSF_GPU_AUTOCONFIG.
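For reference, automatic GPU detection is typically enabled in lsf.conf roughly as follows; this is a sketch based on the LSF documentation, so confirm the parameter names and values against your installed version:

```shell
# lsf.conf (on the LSF hosts)
LSF_GPU_AUTOCONFIG=Y        # let LIM detect GPUs and build GPU resources
LSB_GPU_NEW_SYNTAX=extend   # enable the extended bsub -gpu option syntax
```

With these set, the gpu_* columns shown by lshosts -gpu and bhosts -gpu are populated automatically, without hand-written resource definitions.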

References:
https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_gpu/chap_submit_monitor_gpu_jobs.html