IBM® Spectrum LSF is a complete workload management solution for demanding HPC environments that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies.
LSF cluster
- A cluster is a group of computers (hosts) running LSF that work together as a single unit, combining computing power, workload, and resources. A cluster provides a single-system image for a network of computing resources. Hosts can be grouped into a cluster in a number of ways; for example, a cluster can contain 1) all the hosts in a single administrative group, or 2) all the hosts on a subnetwork.
- A job is a unit of work running in the LSF system: a command or set of commands submitted to LSF for execution. LSF schedules, controls, and tracks the job according to configured policies.
- A queue is a cluster-wide container for jobs. All jobs wait in queues until they are scheduled and dispatched to hosts.
- Resources are the objects in your cluster that are available to run work.
Spectrum LSF 10.1 base installation and applying FP/PTF/Fix
Plan your installation and install a new production IBM Spectrum LSF cluster on UNIX or Linux hosts. The following diagram illustrates an example directory structure after the LSF installation is complete.
[Diagram: example LSF directory structure after installation]
Plan your installation to determine the required parameters for the install.config file.
a) lsf10.1_lsfinstall.tar.Z
The standard installer package. Use this package in a heterogeneous cluster with a mix of system types, including systems other than x86-64. Requires approximately 1 GB of free space.
b) lsf10.1_lsfinstall_linux_x86_64.tar.Z
lsf10.1_lsfinstall_linux_ppc64le.tar.Z
Use the appropriate smaller installer package in a homogeneous x86-64 or ppc64le cluster.
------------------------
Get the LSF distribution packages for all host types you need and put them in the same directory as the extracted LSF installer script. Copy the packages to the LSF_TARDIR path specified in Step 3.
For example:
For a Linux 2.6 kernel with glibc 2.3, the distribution package is lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z.
For a Linux 3.10 kernel with glibc 2.17, the distribution package is lsf10.1_lnx310-lib217-ppc64le.tar.Z.
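To decide which distribution package a host needs, you can read its kernel and glibc versions directly. This is a small helper sketch (not part of LSF); the parsing assumes a GNU-style `ldd` banner that ends with the version number:

```shell
# Report the kernel and glibc versions that determine the matching
# LSF distribution package for this host.
kernel_ver=$(uname -r | cut -d. -f1-2)
glibc_ver=$(ldd --version | head -n1 | grep -oE '[0-9]+\.[0-9]+$')
echo "kernel=${kernel_ver} glibc=${glibc_ver}"
```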
------------------------
LSF uses entitlement files to determine which feature set is enabled or disabled based on the edition of the product. Copy the entitlement configuration file to the LSF_ENTITLEMENT_FILE path specified in Step 3.
The following LSF entitlement configuration files are available for each edition:
LSF Standard Edition ===> lsf_std_entitlement.dat
LSF Express Edition ===> lsf_exp_entitlement.dat
LSF Advanced Edition ===> lsf_adv_entitlement.dat
-------------------------
Step 1 : Get the LSF installer script package that you selected and extract it.
# zcat lsf10.1_lsfinstall_linux_x86_64.tar.Z | tar xvf -
Step 2 : Go to the extracted directory:
cd lsf10.1_lsfinstall
Step 3 : Configure install.config as per the plan
cat install.config
LSF_TOP="/nfs_shared_dir/LSF_HOME"
LSF_ADMINS="lsfadmin"
LSF_CLUSTER_NAME="x86-64_cluster2"
LSF_MASTER_LIST="myhost1"
LSF_TARDIR="/nfs_shared_dir/conf_lsf/lsf_distrib/"
LSF_ENTITLEMENT_FILE="/nfs_shared_dir/conf_lsf/lsf_std_entitlement.dat"
LSF_ADD_SERVERS="myhost1 myhost2 myhost3 myhost4 myhost5 myhost6 myhost7 myhost8"
ENABLE_DYNAMIC_HOSTS="Y"
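Before running lsfinstall, a quick shell check can confirm that the mandatory parameters are present in install.config. This is an illustrative sketch: the temp file stands in for the real config, and the required-key list here is only the minimal set used in this example:

```shell
#!/bin/sh
# Sketch: sanity-check an install.config before running lsfinstall.
# The sample config written here mirrors the example above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
LSF_TOP="/nfs_shared_dir/LSF_HOME"
LSF_ADMINS="lsfadmin"
LSF_CLUSTER_NAME="x86-64_cluster2"
LSF_MASTER_LIST="myhost1"
EOF
# Fail fast if any required key is missing.
for key in LSF_TOP LSF_ADMINS LSF_CLUSTER_NAME LSF_MASTER_LIST; do
    grep -q "^${key}=" "$cfg" || { echo "missing required parameter: $key"; exit 1; }
done
echo "install.config looks complete"
rm -f "$cfg"
```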
Step 4: Start LSF 10.1 base installation
./lsfinstall -f install.config
Logging installation sequence in /root/LSF_new/lsf10.1_lsfinstall/Install.log
International Program License Agreement
Part 1 - General Terms
BY DOWNLOADING, INSTALLING, COPYING, ACCESSING, CLICKING ON AN
"ACCEPT" BUTTON, OR OTHERWISE USING THE PROGRAM,
LICENSEE AGREES TO THE TERMS OF THIS AGREEMENT. IF YOU ARE
ACCEPTING THESE TERMS ON BEHALF OF LICENSEE, YOU REPRESENT
AND WARRANT THAT YOU HAVE FULL AUTHORITY TO BIND LICENSEE
TO THESE TERMS. IF YOU DO NOT AGREE TO THESE TERMS
* DO NOT DOWNLOAD, INSTALL, COPY, ACCESS, CLICK ON AN
"ACCEPT" BUTTON, OR USE THE PROGRAM; AND
* PROMPTLY RETURN THE UNUSED MEDIA, DOCUMENTATION, AND
Press Enter to continue viewing the license agreement, or
enter "1" to accept the agreement, "2" to decline it, "3"
to print it, "4" to read non-IBM terms, or "99" to go back
to the previous screen.
1
Checking the LSF TOP directory /nfs_shared_dir/LSF_HOME ...
... Done checking the LSF TOP directory /nfs_shared_dir/LSF_HOME ...
You are installing IBM Spectrum LSF - 10.1 Standard Edition
Searching LSF 10.1 distribution tar files in /nfs_shared_dir/conf_lsf/lsf_distrib Please wait ...
1) linux3.10-glibc2.17-x86_64
Press 1 or Enter to install this host type: 1
Installing linux3.10-glibc2.17-x86_64 ...
Please wait, extracting lsf10.1_lnx310-lib217-x86_64 may take up to a few minutes ...
lsfinstall is done.
After installation, remember to bring your cluster up to date by applying the latest updates and bug fixes.
NOTE: You can install LSF as a non-root user. The procedure is similar, but with one extra prompt asking whether this is a multi-node cluster (yes/no).
Step 5 : Change ownership of the installation directory to the LSF administrator
chown -R lsfadmin:lsfadmin $LSF_TOP
Step 6 : Check the binary files
cd $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin
Step 7 : By default, only root can start the LSF daemons, although any user can submit jobs to your cluster. To make the cluster available to other users, you must manually change the ownership of the lsadmin and badmin binary files to root, and set the file permission mode to -rwsr-xr-x (4755) so that the setuid bit is set for the owner.
chown root lsadmin
chown root badmin
chmod 4755 lsadmin
chmod 4755 badmin
ls -alsrt lsadmin
ls -alsrt badmin
chown root $LSF_SERVERDIR/eauth
chmod u+s $LSF_SERVERDIR/eauth
OR
./hostsetup --top="$LSF_TOP" --setuid
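Either way, it is worth confirming that the mode actually took effect. The sketch below uses a temporary file so it can run anywhere; on the cluster, point stat at the lsadmin and badmin binaries instead:

```shell
# Sketch: verify that a file carries the expected 4755 setuid mode.
# Demonstrated on a temp file; substitute lsadmin/badmin on a real host.
f=$(mktemp)
chmod 4755 "$f"
mode=$(stat -c '%a' "$f")   # GNU stat prints the octal mode incl. setuid bit
if [ "$mode" = "4755" ]; then
    echo "setuid mode OK"
else
    echo "unexpected mode: $mode"
fi
rm -f "$f"
```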
Step 8 : Configure /etc/lsf.sudoers
[root@myhost1]# cat /etc/lsf.sudoers
LSF_STARTUP_USERS="lsfadmin"
LSF_STARTUP_PATH="/nfs_shared_dir/LSF_HOME/10.1/linux3.10-glibc2.17-x86_64/etc"
LSF_EAUTH_KEY="testKey1"
NOTE: The lsf.sudoers file is not installed by default; it is located in /etc. It is used to set the LSF_EAUTH_KEY parameter, which configures a key for eauth to encrypt and decrypt user authentication data. Every host in the cluster should have this file. In a multi-cluster setup, configure LSF_EAUTH_KEY in /etc/lsf.sudoers on each side of the multi-cluster.
Step 9 : Check $LSF_SERVERDIR/eauth and copy lsf.sudoers to all hosts in the cluster
ls $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/etc/
scp /etc/lsf.sudoers myhost02:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost03:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost04:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost05:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost06:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost07:/etc/lsf.sudoers
scp /etc/lsf.sudoers myhost08:/etc/lsf.sudoers
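The repeated scp commands above can be collapsed into a loop. As written this is a dry run (echo only) over the example host names; uncomment the scp line to actually copy:

```shell
# Dry-run loop over the example hosts; swap echo for scp on a real cluster.
for h in myhost02 myhost03 myhost04 myhost05 myhost06 myhost07 myhost08; do
    echo "would copy /etc/lsf.sudoers to ${h}"
    # scp /etc/lsf.sudoers "${h}:/etc/lsf.sudoers"
done
```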
Step 10 : Start LSF as lsfadmin and check base Installation using lsid command.
Step 11 : Check binary type with lsid -V
$ lsid -V
IBM Spectrum LSF 10.1.0.0 build 403338, May 27 2016
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
binary type: linux3.10-glibc2.17-x86_64
NOTE: Download required FP and interim fixes from https://www.ibm.com/support/fixcentral/
Step 12 : Before applying FP12 and interim patches, shut down the LSF daemons. Use the following commands to shut down the original LSF daemons:
badmin hshutdown all
lsadmin resshutdown all
lsadmin limshutdown all
Deactivate all queues to make sure that no new jobs can be dispatched during the upgrade:
badmin qinact all
Step 13 : Then become root to apply FP12 and the interim patches.
Set the LSF environment: . $LSF_TOP/conf/profile.lsf
. /nfs_shared_dir/LSF_HOME/conf/profile.lsf
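A quick sanity check that the environment was actually set after sourcing profile.lsf. This is a sketch: the fallback values below are stand-ins so the block also runs outside a real cluster (on a real host, profile.lsf sets these variables itself):

```shell
# Verify that the standard LSF environment variables are set.
# The :- fallbacks are placeholders for running this sketch standalone.
LSF_ENVDIR=${LSF_ENVDIR:-/nfs_shared_dir/LSF_HOME/conf}
LSF_SERVERDIR=${LSF_SERVERDIR:-/nfs_shared_dir/LSF_HOME/10.1/linux3.10-glibc2.17-x86_64/etc}
for v in LSF_ENVDIR LSF_SERVERDIR; do
    eval "val=\$$v"
    if [ -n "$val" ]; then
        echo "$v=$val"
    else
        echo "$v is unset: did you source profile.lsf?"
    fi
done
```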
Step 14 : Apply FP12 on the LSF base installation. The patchinstall script is available in the $LSF_TOP/10.1/install directory.
# cd $LSF_TOP/10.1/install
It is recommended to check a patch before installing it:
$ ./patchinstall -c
./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z
[root@myhost7 install]# ./patchinstall /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z
Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log
Checking the LSF installation directory /nfs_shared_dir/LSF_HOME ...
Done checking the LSF installation directory /nfs_shared_dir/LSF_HOME.
Checking the patch history directory ...
Done checking the patch history directory /nfs_shared_dir/LSF_HOME/patch.
Checking the backup directory ...
Done checking the backup directory /nfs_shared_dir/LSF_HOME/patch/backup.
Installing package "/root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z"...
Checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z ...
Done checking the package definition for /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.
.
.
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600488".
Done installing /root/PTF12_x86_2versions/lsf10.1_lnx310-lib217-x86_64-600488.tar.Z.
Step 15: Apply interim fix1
./patchinstall /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z
Logging patch installation sequence in /nfs_shared_dir/LSF_HOME/10.1/install/patch.log
Installing package "/root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z"...
Checking the package definition for /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z ...
Are you sure you want to update your cluster with this patch? (y/n) [y] y
Backing up existing files ...
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600505".
Done installing /root/LSF_patch1/lsf10.1_lnx310-lib217-x86_64-600505.tar.Z.
Exiting...
Step 16 : Apply interim fix2
./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z
[root@myhost7 install]# ./patchinstall /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z
Installing package "/root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z"...
Checking the package definition for /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z ...
Backing up existing files ...
Finished backing up files to "/nfs_shared_dir/LSF_HOME/patch/backup/LSF_linux3.10-glibc2.17-x86_64_600625".
Done installing /root/LSF_patch2/lsf10.1_lnx310-lib217-x86_64-600625.tar.Z.
Exiting...
Step 17 : As the root user, set the setuid bit for the new bctrld command
cd $LSF_TOP/10.1/linux3.10-glibc2.17-x86_64/bin
chown root bctrld
chmod 4755 bctrld
Step 18 : Check lsf.shared file for multi cluster setup.
Begin Cluster
ClusterName   Servers
CLUSTER1      (cloudhost)
CLUSTER2      (myhost1)
CLUSTER3      (remotehost2)
End Cluster
Step 19 : Switch back to the lsfadmin user. Use the following commands to start LSF with the new daemons:
lsadmin limstartup all
lsadmin resstartup all
badmin hstartup all
Use the following command to reactivate all LSF queues after upgrading:
badmin qact all
Step 20 : Modify the configuration files as required (add queues, clusters, and so on), then run badmin reconfig or lsadmin reconfig as explained in the LSF configuration section below. Restart LSF as the "lsfadmin" user.
$ lsid
IBM Spectrum LSF Standard 10.1.0.12, Jun 10 2021
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is CLUSTER2
My master name is myhost1
$ lsclusters -w
CLUSTER_NAME STATUS MASTER_HOST ADMIN HOSTS SERVERS
CLUSTER1 ok cloudhost lsfadmin 7 7
CLUSTER2 ok myhost1 lsfadmin 8 8
CLUSTER3 ok remotehost2 lsfadmin 8 8
$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
myhost1 ok - 20 0 0 0 0 0
myhost2 ok - 20 0 0 0 0 0
myhost3 ok - 19 0 0 0 0 0
myhost4 ok - 44 4 4 0 0 0
myhost5 ok - 44 4 4 0 0 0
myhost6 ok - 20 0 0 0 0 0
myhost7 ok - 20 0 0 0 0 0
myhost8 ok - 19 0 0 0 0 0
The Spectrum LSF cluster installation and FP12 upgrade completed successfully, as shown in the details above.
You must run hostsetup as root with the --boot="y" option to modify the system scripts so that the LSF daemons start and stop automatically at system startup and shutdown. The default is --boot="n".
1. Log on to each LSF server host as root. Start with the LSF master host.
2. Run hostsetup on each LSF server host. For example:
# cd $LSF_TOP/10.1/install
# ./hostsetup --top="$LSF_TOP" --boot="y"
NOTE: For more details on hostsetup usage, enter hostsetup -h.
In a multi-cluster environment, reinstalling the master cluster can cause the bclusters command to show status=disc.
[smpici@c656f7n06 ~]$ bclusters
[Job Forwarding Information ]
LOCAL_QUEUE JOB_FLOW REMOTE CLUSTER STATUS
Queue1 send CLUSTER1 disc
Queue2 send CLUSTER2 disc
Queue3 send CLUSTER3 disc
where status=disc means that communication between the two clusters is not established. The disc status might occur because no jobs are waiting to be dispatched, or because the remote master cannot be located.
A possible solution is to clean up all the LSF daemons on all clusters. Note: lsfshutdown leaves some daemons running on the master node, so you need to manually kill all remaining LSF daemons on all master nodes.
Later, bclusters should show the status as shown below:
[smpici@c656f7n06 ~]$ bclusters
[Job Forwarding Information ]
LOCAL_QUEUE JOB_FLOW REMOTE CLUSTER STATUS
Queue1 send CLUSTER1 ok
Queue2 send CLUSTER2 ok
Queue3 send CLUSTER3 ok
======================= LSF configuration section ===========================
After you change any configuration file, use the lsadmin reconfig and badmin reconfig commands to reconfigure your cluster. Log on to the host as root or as the LSF administrator (in our case, "lsfadmin").
Run lsadmin reconfig to restart the LIM and check for configuration errors. If no errors are found, you are prompted either to restart the lim daemon on management host candidates only, or to confirm that you want to restart the lim daemon on all hosts. If unrecoverable errors are found, reconfiguration is canceled. Run the badmin reconfig command to reconfigure the mbatchd daemon and check for configuration errors.
- lsadmin reconfig to reconfigure the lim daemon
- badmin reconfig to reconfigure the mbatchd daemon without restarting
- badmin mbdrestart to restart the mbatchd daemon
- bctrld restart sbd to restart the sbatchd daemon
More details about the cluster reconfiguration commands are shown in the table below:
[Table: cluster reconfiguration commands]
How to resolve some known eauth-related issues, where commands like bhosts, bjobs, and so on fail with the error "User permission denied".
Example 1:
[smpici@host1 ~]$ bhosts
User permission denied
Example 2:
mpirun --timeout 30 hello_world
Jan 11 02:42:52 2022 1221079 3 10.1 lsb_pjob_send_requests: lsb_pjob_getAckReturn failed on host <host1>, lsberrno <0>
[host1:1221079] [[64821,0],0] ORTE_ERROR_LOG: The specified application failed to start in file ../../../../../../opensrc/ompi/orte/mca/plm/lsf/plm_lsf_module.c at line 347
--------------------------------------------------------------------------
The LSF process starter (lsb_launch) failed to start the daemons on the nodes in the allocation.
Returned : -1
lsberrno : (282) Failed while executing tasks
This may mean that one or more of the nodes in the LSF allocation is not setup properly.
In this case, check the clocks on the nodes. If the clocks differ, configure chrony on all nodes as shown below.
systemctl enable chronyd.service
systemctl stop chronyd.service
systemctl start chronyd.service
systemctl status chronyd.service
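You can also estimate the skew yourself before or after setting up chrony. This is a sketch: the ssh line is commented out and myhost2 is a placeholder, so the block runs locally; on a real cluster, uncomment it to compare against a remote host:

```shell
# Sketch: estimate clock skew against another host. Large skew can trigger
# eauth failures like the "User permission denied" errors above.
local_now=$(date +%s)
# remote_now=$(ssh myhost2 'date +%s')   # placeholder remote host
remote_now=$local_now                    # stand-in so the sketch runs locally
skew=$(( local_now - remote_now ))
if [ "${skew#-}" -le 5 ]; then
    echo "clocks agree within 5 seconds"
else
    echo "clock skew detected: ${skew}s"
fi
```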
References: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=migrate-install-unix-linux
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=iul-if-you-install-lsf-as-non-root-user