Tuesday, January 24, 2017

Spectrum LSF: Accounting Logs for Resource Usage Statistics





The bacct command displays a summary of accounting statistics for all finished jobs (with a DONE or EXIT status) submitted by the user who invoked the command, on all hosts, projects, and queues in the LSF system. By default it reads the current LSF accounting log file: LSB_SHAREDIR/cluster_name/logdir/lsb.acct.

The lsb.acct file is the batch job log file of LSF. The master batch daemon
mbatchd generates a record for each job completion or failure and appends it
to lsb.acct.

The file is located in LSB_SHAREDIR/cluster_name/logdir. If it does not exist yet, bacct reports an error:

[root@c61f4s20 conf]# bacct
/lsf_home/work/CI_cluster1/logdir/lsb.acct: No such file or directory
[root@c61f4s20 conf]#

NOTE: If the file is missing, create a file named "lsb.acct" in LSB_SHAREDIR/cluster_name/logdir
and restart the LSF daemons so that the change takes effect.
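
Before running bacct, it can help to confirm where the accounting log is expected to be. The following is a minimal Python sketch, not part of LSF: the helper name acct_log_path is hypothetical, and the values /lsf_home/work and CI_cluster1 are taken from the error message above.

```python
import os
from pathlib import PurePosixPath

def acct_log_path(sharedir, cluster_name):
    """Build the expected lsb.acct location (hypothetical helper, not an LSF API)."""
    return PurePosixPath(sharedir) / cluster_name / "logdir" / "lsb.acct"

# Values assumed from the bacct error message in this post.
path = acct_log_path("/lsf_home/work", "CI_cluster1")
print(path)

# Reports whether the file is present on the machine running the script.
print(os.path.isfile(str(path)))
```

If the second line printed is False on the LSF master host, bacct is likely to fail with the "No such file or directory" error shown above.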
----------------------------------------
[lsfadmin@localhost logdir]$ pwd
/lsf_home/work/CI_cluster1/logdir
[lsfadmin@localhost logdir]$ ls -alsrt lsb.acct
44 -rw-r--r-- 1 lsfadmin lsfadmin 41182 Jan 25 01:32 lsb.acct
[lsfadmin@localhost logdir]$
----------------------------------------


[lsfadmin@localhost logdir]$ head -2 lsb.acct
"JOB_FINISH" "10.1" 1484136455 106 1002 33554438 1 1484136355 0 0 1484136355 "lsfadmin" "normal" "" "" "" "c61f4s20" "/lsf_home/conf" "" "" "" "1484136355.106" 1 "c712f6n07" 1 "c712f6n07" 64 250.0 "" "sleep 100" 0.505916 0.029839 9088 0 -1 0 0 931 0 0 0 256 -1 0 0 0 18 7 -1 "" "default" 0 1 "" "" 0 12288 352256 "" "" "" "" 0 "" 0 "" -1 "/lsfadmin" "" "" "" -1 "" "" 5136 "" 1484136355 "" "" 2 1032 "0" 1033 "0" 0 -1 0 12288 "select[type == any] order[r15s:pg] " "" -1 "" -1 0 "" 0 0 "" 100 "/lsf_home/conf" 0 "" 0.000000 0.00 0.00 0.00 0.00 1 "c712f6n07" -1 0 0 0
"JOB_FINISH" "10.1" 1484136493 107 1002 33554438 1 1484136393 0 0 1484136393 "lsfadmin" "normal" "" "" "" "c61f4s20" "/lsf_home/conf" "" "" "" "1484136393.107" 1 "localhost" 1 "localhost" 64 250.0 "" "sleep 100" 0.516205 0.018902 9088 0 -1 0 0 932 0 0 0 256 -1 0 0 0 18 7 -1 "" "default" 0 1 "" "" 0 12288 352256 "" "" "" "" 0 "" 0 "" -1 "/lsfadmin" "" "" "" -1 "" "" 5136 "" 1484136393 "" "" 2 1032 "0" 1033 "0" 0 -1 0 12288 "select[type == any] order[r15s:pg] " "" -1 "" -1 0 "" 0 0 "" 100 "/lsf_home/conf" 0 "" 0.000000 0.00 0.00 0.00 0.00 1 "localhost" -1 0 0 0
[lsfadmin@localhost logdir]$

-------------------------------
1) Submit a job:

[lsfadmin@localhost ~]$ bsub
bsub> sleep 400
bsub> Job <232> is submitted to default queue <normal>.

2) Check the job status:

[lsfadmin@localhost ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
232     lsfadmi RUN   normal     localhost   c712f6n07   sleep 400  Jan 25 01:25

3) Verify the statistics for jobs submitted by the user lsfadmin by using the bacct command:
[lsfadmin@localhost ~]$ bacct

Accounting information about jobs that are:
  - submitted by users lsfadmin,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:      69      Total number of exited jobs:    26
 Total CPU time consumed:     146.0      Average CPU time consumed:     1.5
 Maximum CPU time of a job:     5.0      Minimum CPU time of a job:     0.0
 Total wait time in queues: 1181375.0
 Average wait time in queue:12435.5
 Maximum wait time in queue:681365.0      Minimum wait time in queue:    0.0
 Average turnaround time:     12738 (seconds/job)
 Maximum turnaround time:    681365      Minimum turnaround time:         2
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.02      Minimum hog factor of a job:  0.00
 Average expansion factor of a job:  12435.08 ( turnaround time / run time )
 Maximum expansion factor of a job:  681365.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:     28773      Average Run time consumed:     302
 Maximum Run time of a job:    1000      Minimum Run time of a job:       0
 Total throughput:             0.29 (jobs/hour)  during  330.25 hours
 Beginning time:       Jan 11 07:07      Ending time:          Jan 25 01:22

[lsfadmin@localhost ~]$
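
The averages and the throughput in the SUMMARY can be cross-checked from the totals. The sketch below is a worked example in Python using only figures copied from the output above; the year 2017 is an assumption based on the date of this post.

```python
from datetime import datetime

# Figures copied from the bacct SUMMARY above.
done_jobs, exited_jobs = 69, 26
total_cpu = 146.0        # seconds
total_wait = 1181375.0   # seconds
total_run = 28773        # seconds

# Reporting period; the year is assumed, bacct prints only month and day.
span = datetime(2017, 1, 25, 1, 22) - datetime(2017, 1, 11, 7, 7)
hours = span.total_seconds() / 3600

jobs = done_jobs + exited_jobs
avg_cpu = total_cpu / jobs
avg_wait = total_wait / jobs
avg_run = total_run / jobs
throughput = jobs / hours  # jobs per hour

print(f"jobs={jobs} hours={hours:.2f} avg_cpu={avg_cpu:.1f} "
      f"avg_wait={avg_wait:.1f} avg_run={avg_run:.2f} throughput={throughput:.2f}")
```

The results agree with the reported 330.25 hours, 1.5 average CPU time, 12435.5 average wait time, and 0.29 jobs/hour. The computed average run time is 302.87, while bacct prints 302, which suggests the value is truncated in the report; that is an inference from these numbers, not documented behavior.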

NOTE: Custom statistics that bacct does not report, but that are of interest to individual system administrators, can be generated by processing the lsb.acct file directly with awk or perl.
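
As an illustration of such custom processing, here is a minimal sketch in Python (used here instead of awk or perl). The field positions are inferred from the sample JOB_FINISH records shown earlier and should be checked against the lsb.acct reference for your LSF version. The function names are hypothetical, shlex only approximates how LSF quotes string fields, and the sample records are truncated after the queue name for brevity.

```python
import shlex
from collections import defaultdict

# Field positions inferred from the sample JOB_FINISH records (assumption).
EVENT_TYPE, EVENT_TIME, JOB_ID = 0, 2, 3
SUBMIT_TIME, START_TIME, USER_NAME, QUEUE = 7, 10, 11, 12

def parse_job_finish(line):
    """Return selected fields of one JOB_FINISH record, or None for other lines."""
    fields = shlex.split(line)  # splits on spaces, honoring double quotes
    if len(fields) <= QUEUE or fields[EVENT_TYPE] != "JOB_FINISH":
        return None
    event_time = int(fields[EVENT_TIME])
    submit_time = int(fields[SUBMIT_TIME])
    start_time = int(fields[START_TIME])
    started = start_time > 0  # assumed: 0 means the job never started
    return {
        "job_id": int(fields[JOB_ID]),
        "user": fields[USER_NAME],
        "queue": fields[QUEUE],
        "wait": start_time - submit_time if started else None,
        "run": event_time - start_time if started else 0,
    }

def run_time_per_queue(lines):
    """Sum the run time in seconds of finished jobs, grouped by queue."""
    totals = defaultdict(int)
    for line in lines:
        job = parse_job_finish(line)
        if job is not None:
            totals[job["queue"]] += job["run"]
    return dict(totals)

# First 13 fields of the two records shown by "head -2 lsb.acct" above.
sample = [
    '"JOB_FINISH" "10.1" 1484136455 106 1002 33554438 1 1484136355 0 0 1484136355 "lsfadmin" "normal"',
    '"JOB_FINISH" "10.1" 1484136493 107 1002 33554438 1 1484136393 0 0 1484136393 "lsfadmin" "normal"',
]
print(parse_job_finish(sample[0]))
print(run_time_per_queue(sample))
```

To process a real log, pass an open file instead of the sample list, for example run_time_per_queue(open(path)) with path pointing at lsb.acct. Each of the two sample jobs ran "sleep 100", and the computed run time per job is 100 seconds.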

References:
1) http://www.ibm.com/support/knowledgecenter/SSJSMF_9.1.2/lsf/workbookhelp/wb_usecase_lsf.html
2) https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_command_ref/bacct.1.html
3) https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.2/lsf_admin/job_exit_info_view_logged.html
4) http://www-01.ibm.com/support/docview.wss?uid=isg3T1013490