HPC Batching System

Batch processing for "loosely coupled" compute clusters, e.g. LSF, PBS/TORQUE, SGE.
The traditional supercomputer is now about as rare as dinosaurs; even supercomputing centers run a batch submission system such as Grid Engine or SLURM.
"Cluster" tends to refer to a set of highly similar systems set up with the intention of using them as a single system. "Grid" tends to cover a wider variety of systems (eg workstation queues).

Note that instead of using a batch system, a number of tools exist to carry out the embarrassingly parallel task of running the same program many times over, in parallel, against a large number of input files:
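For instance, GNU parallel or xargs can farm out one run per input file across local cores (a minimal sketch; ./myprog and the input/ path are placeholders):

ls input/*.txt | xargs -n 1 -P 8 ./myprog	# run ./myprog once per file, up to 8 at a time
parallel -j 8 ./myprog {} ::: input/*.txt	# same idea using GNU parallel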

Batch System 101

Think of old-day mainframes: a user submits a job, it runs non-interactively, and the output is collected at the end.
The sys admin has to worry about scheduling, fair sharing, etc. Things the OS used to do transparently are now exposed to the sys admin, who may need to tweak them. Deja vu? Time to dig out that Operating Systems book from college :)

DanT Grid Engine for Dummies

HPC batch systems are increasingly adopting containers. For a feature comparison table of Shifter vs Singularity, and a list of relevant articles, see Docker vs Singularity vs Shifter vs Univa Grid Engine Container Edition
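For example, if Singularity is installed on the execution hosts, a containerized tool can be run inside an ordinary batch job (a sketch only; the image path and command are made-up placeholders):

qsub -cwd -b y singularity exec /shared/images/mytool.sif mytool --in input.file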

Command Summary

Below is a quick comparison of commands between a couple of popular batch/queue management systems. A "Rosetta Stone" of Batch/Queue systems on Unix/Linux :)
Task			LSF			PBS/Torque		SGE
--------------------	--------------------	--------------------	--------------------
Job submission		bsub			qsub			qsub
Job monitor		bjobs			qstat			qstat
Job removal		bkill			qdel			qdel

Job control		bstop			qhold			qhold
			bresume			qrls			qrls
			bpeek

Check nodes		bhosts			pbsnodes -a		qhost / qstat -f
			lshosts
			lsload

List avail queues	bqueues			qstat -Q		qconf -sql
Queue status		bqueues			qstat -q		qstat -g c
Queue behaviour		bparams

User accounting		bacct						qacct





SGE/UGE

SGE = Sun Grid Engine.
OGE = Open Grid Engine. And NOT Oracle GE !!
UGE = Univa Grid Engine.
GE = Grid Engine.

User Howto

Find out what queues are available:
qconf -sql 

Submit a job to a specific queue:
qsub sge-script.sh -q 10GBe.q		# eg run only in a queue named for nodes with 10 gigE NIC

qlogin -q interactive.q			# an interactive queue often has a more relaxed load-average requirement than default.q or all.q on the same host,
					# but limited slots, to avoid heavy processing in it,
					# while still granting a prompt for lightweight testing even when the cluster is fully busy


qsub -R y -pe smp 8 ./sge-script.sh     # -R y : Reservation yes... avoids starvation in a busy cluster
                                        # -pe smp 8 : request the smp Parallel Environment (PE) with 8 slots (cores)






qrsh


### examples of qsub with request for extra resources ###

# Find out what PE (parallel environment) names exist, eg smp, mpi, orte (for OpenMPI, the most widely used MPI) 
qconf -spl

# Describe detail of a pe env: 
qconf -sp mpi
qconf -sp smp


# Submit a job requesting a certain number of cores
qsub sge-script.sh -pe smp 8		# this requests 8 cores from a machine providing the smp parallel env.

qsub sge-script.sh -q default.q@compute-4-1	# submit to a specific, named node/host.    also see -l h=...


# an example that takes a lot of parameters
# -l h_rt specifies a hard run-time limit for the job (so the scheduler knows the ceiling time of the run)
# -l m_mem_free specifies the free memory needed on the node where the job can run.
#               it seems to be memory per core, so specifying 8GB with -pe smp 8 would need a system with 64GB free RAM
# -pe smp 8  specifies the symmetric multiprocessing Parallel Environment, 8 cores
# -binding linear:8  allows multiple threads of the process to use multiple cores (some UGE or cgroup configs seem to default to binding to a single core even when pe smp 8 is specified).
#    incidentally, load average counts runnable jobs.  so even if 6 threads are bound to a single core, 
#    load avg may be at 6 while most other cores sit idle, as visible in htop.
qsub -M me@example.com -m abe -cwd -V -r y -b y -l h_rt=86400,m_mem_free=6G -pe smp 8 -binding linear:8 -q default.q -N jobTitle  binaryCmd -a -b -c input.file 






# Find out what host groups are avail, eg hosts may be grouped by cpu cores or memory amount.
qconf -shgrpl

# Find out which specific nodes are in a given host group:
qconf -shgrp @hex

qsub sge-script.sh -q default.q@@hex	# @@groupname submits to a host group; the group can be defined eg by number of cores.  not recommended.


# Change the queue a job is in, thereby altering where subsequent tasks run (eg in an array job)
qalter jobid -q default.q@@hex


Basic User Commands


qsub sge-script.sh	# submit job (must be a script)
			# std out and err will be captured to files named after JOBid.
			# default dir is $HOME, unless -cwd is used.
			
qsub -cwd -o ./sge.out -e ./sge.err  myscript.sh
			# -cwd = use current working dir (eg when $HOME not usable on some node)
			# -o and -e will append to files if they already exist.

qsub	-S /bin/bash	# tell SGE/qsub to parse the script file using the specified shell
	-q queue-name	# submit the job to the named queue.
			# eg default@@10gb  would select the default queue, on hosts with the 10gb (complex) resource available
	-r y		# rerun job (yes/no).  ie if the job is aborted (eg by a node crash), rerun it.
	-R y		# use Reservation. useful in a busy cluster when requesting > 1 core.  
			# holds slots as they free up until all requested cores become avail.  
			# less likely to get starvation, but less efficient.  

	-N job-name	# specify a name for the job, visible in qstat, output filename, etc.  not too important.

    -pe smp 4		# request 4 cpu cores for the job 
    -binding linear:4	# when core binding defaults to on, this allows the smp process to use the desired number of cores
    -l ...		# specify resources shown in qconf -sc
    -l h_rt=3600	# hard run-time limit of 3600 seconds (after that the process gets a kill -9)
    -l m_mem_free=4GB	# ask for a node with 4GB of free memory.
			# if smp is also requested, the memory request is per cpu core

	

qsub 	-t 1-5		# create an array job with task ids ranging from 1 to 5.
			# note: all qsub parameters can be placed in the script 
			# eg, place  #$ -t 1-5  in the script rather than specifying it on the qsub command line 


qstat 	
	-j		# show overall scheduler info
	-j JOBID	# status of a job.  There is a brief entry about possible errors.
	-u USERNAME	# list of all pending jobs for the specified user
	-u \*		# list all jobs from all users
	-ext		# extended overview list, show cpu/mem usage
	-g c		# queues and their load summary
	-pri		# priority details of queues
	-urg		# urgency  details
	-explain A	# show reason for Alarm state (also E(rror), c(onfig), a(larm) states)
	-s r		# list running jobs
	-s p		# list pending jobs

	
Reduce the priority of a job:
qalter JOBID -p -500
	# -ve values DEcrease priority, lowest is -1023
	# +ve values INcrease priority, only admin can do this.  max is 1024
	# It really only affects pending jobs (eg those in qw state).  It won't preempt running jobs.
qmod -u USERNAME -p -100
	# DEcrease job priority of all jobs for the specified user 
	# (those that have been qsub'ed, not future jobs)




qdel JOBID		# remove the specified job
qdel -f JOBID		# forcefully delete a job, for those that sometimes get stuck in "dr" state where qdel returns "job already in deletion"
			# would likely need to go to the node hosting the job and remove $execd_spool_dir/.../active_jobs/JOBID 
			# else may see lots of complaints about zombie jobs in the qmaster messages file



Job STATE abbreviations

They are in the output of qstat
	E	= Error
	qw	= queued, waiting
	r	= running
	dr	= deletion requested while the job was running (eg, host no longer responding)
	s	= suspended
	t 	= transferring (job being delivered to the exec node).  Jobs that exceeded the h_rt threshold while sge_execd had problems contacting the scheduler have also been seen stuck in this state.
	ts	= ??
	h	= hold	(eg qhold, or a job dependency tree placing it on hold)


qstat -f  shows host info.  The last column (6) is state; when there is a problem it shows things like au or E.
au state means alarm, unreachable.  Typically the host is transiently down or sge_execd is not running.  
Cleared automatically when SGE can talk to the node again.
E is a hard Error state; it will not go away even with a reboot (it must be cleared manually).  

qmod -cq  default.q@HOSTNAME 	# clear Error state, do this after sys admin knows problem has been addressed.
qmod -e   default.q@HOSTNAME	# enable node, ie, remove the disabled state.  should go away automatically if reboot is clean.


Here is a quick way to list hosts in au and/or E state:
qstat -f | grep au$								# will get most of the nodes with alarm state, but misses adu, auE...
qstat -f | awk '$6 ~ /[a-zA-Z]/ {print $0}'                                     # some host have error state in column 6, display only such host
qstat -f | awk '$6 ~ /[a-zA-Z]/  &&  $1 ~ /default.q@compute/  {print $0}'      # add an additional test for nodes in a specific queue

alias qchk="qstat -f | awk '\$6 ~ /[a-zA-Z]/ {print \$0}'"				# this escape seq works for alias
alias qchk="qstat -f | awk '\$6 ~ /[a-zA-Z]/  &&  \$1 ~ /default.q@/  {print \$0}'"	# this works too
(o state for a host: the host has been removed from @allhosts and is no longer allocatable in any queue)

Basic+ User Commands

qmon			# GUI mon 


qsub -q default.q@compute-4-1 $SGE_ROOT/examples/jobs/sleeper.sh 120
			# run job on a specific host/node, using SGE example script that sleeps for 120 sec.
			 

qsub -l h=compute-5-16 $SGE_ROOT/examples/jobs/sleeper.sh 120 	
			# run job on a specific host/node, alt format
qsub -l h=compute-[567]-* $SGE_ROOT/examples/jobs/sleeper.sh 120 	
			# run on a set of nodes.  Not really recommended; using a resource or hostgroup would be better.


qconf 
	-sc		# show complex eg license, load_avg
	-ss		# see which machine is listed as submit host
	-sel		# show execution hosts
	-sql 		# see list of queues   (but more useful to do qstat -g c)



qhost			# see list of hosts (those that can run jobs)
			# 	their arch, state, etc info also
      -F		# Full details
qhost -F -h compute-7-1 # Full detail of a specific node, include topology, slot limit, etc.

qstat	-f		# show host state  (contrast with qhost)
			# column 6 contains state info.  Abbreviations used (not readily found in the man page):
			# upper case = permanent error state

			a  Load threshold alarm
			o  Orphaned
			A  Suspend threshold alarm
			C  Suspended by calendar
			D  Disabled by calendar

			c  Configuration ambiguous
			d  Disabled
			s  Suspended
			u  Unknown
			E  Error

qstat -f | grep d$	# disabled nodes, though they sometimes appear as adu
qstat -f | awk '$6~/[cdsuE]/ && $3!~/^[0]/'
			# nodes with a problematic state and jobs running (indicated in column 3) 

Sample submit script

#!/bin/bash

#$ -S /bin/bash
#$ -cwd
#$ -N sge-job-name
#$ -e ./testdir/sge.err.txt
#$ -m beas
#$ -M tin@grumpyxmas.com

### #$ -q short_jobs.q

### -m defines when email is sent: (b)egin, (e)nd, (a)bort, (s)uspend
### -M tin@grumpyxmas.com works after specifying -m  (w/o -m, no mail is sent at all)
### if -M is not specified, then email will be sent to username@mycluster.local
### such an address will work if the mailer supports it 
### (eg postfix's recipient_canonical has a mapping from
### @mycluster.local to @mycompany.com)


## bash commands for SGE script:

echo "hostname is:"
hostname

This is a good ref for sge task array job: http://talby.rcs.manchester.ac.uk/~ri/_linux_and_hpc_lib/sge_array.html

#!/bin/bash

#$ -t 1-100

#  -t will create a task array, in this eg, element 1 to 100, inclusive
#  The following example extracts one line out of a text file, the line number given by the SGE task array element number,
#  and passes that single line as input to a program to process:

INFILE=`awk "NR==$SGE_TASK_ID" my_file_list.text`
      # instead of awk, can use sed:   sed -n "${SGE_TASK_ID}p" my_file_list.text
./myprog < $INFILE

#
# these variables are usable inside the task array job:
# $SGE_TASK_ID
# $SGE_TASK_FIRST
# $SGE_TASK_LAST
# $SGE_TASK_STEPSIZE, eg in -t 1-100:20, which produces 5 tasks (1, 21, 41, 61, 81), 20 apart
#
# JOB_ID
# JOB_NAME
# SGE_STDOUT_PATH
#
For other variables available inside an SGE script, see the qsub man page
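A minimal sketch of a script that just prints the variables listed above (the task range is arbitrary):

#!/bin/bash
#$ -t 1-10
echo "job $JOB_ID ($JOB_NAME) task $SGE_TASK_ID of $SGE_TASK_FIRST-$SGE_TASK_LAST step $SGE_TASK_STEPSIZE"
echo "stdout is being captured to $SGE_STDOUT_PATH"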

#!/bin/bash

#$ -V	# inherit user's shell env variables (eg PATH, LD_LIBRARY_PATH, etc)
#$ -j y	# yes to combine std out and std err

Admin Commands


qalter ...		# modify a job
qalter -clear ...	# reset all elements of the job to their initial default status... 
qmod   -cj JOBID	# clear an error state that is holding a job from progressing
qhold      JOBID	# place a job on hold (a user hold; note -hold_jid is for job dependencies, not the same thing)
			# for a qw job, hold prevents it from moving to the r state
			# for a running job, probably equivalent to kill -STOP on the process: frees up cpu, but not memory or licenses.
qrls   -hn JOBID 	# release a job that is on hold (ie resume)

qacct  -j  JOBID	# accounting info for jobs
qacct -j		# see all jobs that SGE processed, all historical data
qacct -o USERNAME	# see summary of CPU time user's job has taken


qrsh -l mem_total=32G -l arch=x86 -q interactive		# obtain an interactive shell on a node with the specified characteristics

qconf			# SGE config
	-s...		# show subcommand set
	-a...		# add  subcommand set
	-d...		# delete
	-m...		# modify

qconf -sq [QueueName]	# show how the queue is defined (s_rt, h_rt are soft/hard time limit on how long jobs can run).
qconf -sc		# show complexes
qconf -spl		# show pe
qconf -sel		# list nodes in the grid
qconf -se HOSTNAME	# see node detail
qconf -sconf		# SGE conf
qconf -ssconf		# scheduler conf

qconf -shgrpl		# list all groups
qconf -shgrp @allhosts	# list nodes in the named group "@allhosts"
      -shgrp_tree grp	# list nodes one per line

qconf -su site-local	# see who is in the group "site-local"
qconf -au site-local sn	# add user sn to the group called "site-local"

qconf -sq highpri.q	# see characteristics of the high priority queue, eg which group is allowed to submit jobs to it.
			# hipri 	adds urgency of 1000
			# express pri 	adds urgency of 2000
			# each urgency of 1000 adds 0.1 to the priority value shown in qstat.
			# a default job is 0.5, a hipri job would be 0.6


qconf -spl		# show PE env 

qconf -sconf	# show global config 
qconf -ssconf	# show scheduler config (eg load sharing weight, etc)

Enable fair sharing (default in GE 6.0 Enterprise):
qconf -mconf 	# global GE config
	enforce_user		auto
	auto_user_fshare	100
qconf -msconf	# scheduler config
	weight_tickets_functional 10000

	# fair share only schedules jobs within their 0.x priority level
	# each urgency of 1000 adds 0.1 to the priority value, 
	# thus a high priority job will not be affected by fair share among default jobs.
	# any 0.6xx level job will run to completion before any 0.5xx job is scheduled to run.

	# eg use: declare a project priority, so any job submitted with that project code gets higher priority and is not penalized under fair share for previous runs...
	

qconf -tsm	# show scheduler info with an explanation of historical usage weight 
		# very useful in share-tree scheduling 
		# (which reduces priority based on a user's historical, half-life-decayed usage of the cluster)
		# straightforward functional share ignores history (less explaining to users)
		# It triggers a one-time dump at the scheduler's next run
		# the log should be stored in    $sge_root/$cell/common/schedd_runlog  


qping ...

tentakel "ping -c 1 192.168.1.1" 	
	# run ping against the head node IP on all nodes (tentakel is set up by ROCKS)


Disabling an Execution Host

Method 1
If you're running 6.1 or better, here's the best way.
  1. Create a new hostgroup called @disabled (qconf -ahgrp @disabled).
  2. Create a new resource quota set via "qconf -arqs limit hosts @disabled to slots=0" (check with qconf -srqs).
  3. Now, to disable a host, just add it to the host group ( qconf -aattr hostgroup hostlist compute-1-16 @disabled ).
  4. To reenable the host, remove it from the host group ( qconf -dattr hostgroup hostlist compute-1-16 @disabled ).
  5. Alternatively, can update disabled hosts by doing qconf -mhgrp @disabled, placing NONE in the hostlist when there are no more nodes in this disabled hostgroup.
This process will stop new jobs from being scheduled to the machine and allow the currently running jobs to complete.  (A consolidated command sequence is sketched after the rqs example below.)
  1. To see which nodes are in this disabled host group: qconf -shgrp @disabled
  2. The disabled hosts rqs can be checked with qconf -srqs and should look like
      {
       name         disabled_hosts
       description  "Disabled Hosts via slots=0"
       enabled      TRUE
       limit        hosts {@disabled} to slots=0
      }
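
The same steps as a copy-paste command sequence (compute-1-16 is just an example hostname):
qconf -ahgrp @disabled                                   # one-time: create the (empty) host group
qconf -arqs disabled_hosts                               # one-time: in the editor, add the rule  limit hosts {@disabled} to slots=0
qconf -aattr hostgroup hostlist compute-1-16 @disabled   # drain a host: no new jobs will be scheduled to it
qconf -dattr hostgroup hostlist compute-1-16 @disabled   # un-drain it once maintenance is done
qconf -shgrp @disabled                                   # check which hosts are currently drained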
      
Ref: http://superuser.com/questions/218072/how-do-i-temporarily-take-a-node-out-from-sge-sun-grid-engine
Method 2
qmod -d \*@worker-1-16 		# disable node worker-1-16 from accepting new jobs in any/all queues.
qmod -d all.q@worker-1-16 	# disable node worker-1-16 from accepting new jobs in the all.q queue.
qmod -d       worker-1-16 	# disable node worker-1-16: currently running jobs finish, no new jobs can be assigned.
qstat -j 			# can check whether a node is disabled from scheduler consideration
qstat -f 			# show host status.  
qstat -f | grep d$   		# look for disabled nodes (after a reboot, even one crashed by out-of-memory, sge/uge disables the node)
qmod -e  all.q@worker-1-16	# re-enable node
qmod -cq default.q@worker-5-4	# clear the "E" state of the node.
Method 3
This will remove the host right away, probably killing any jobs running on it. Thus, use it only when the host is idle. But it is the surest way to remove the node.
qconf -de compute-1-16		# -de = disable execution host (kill running jobs!)
qconf -ae compute-1-16		# re-add it when maintenance is done.

Pausing queue

Pausing the queue means all existing jobs will continue to run, but jobs qsub'ed after the queue is disabled will just queue up until it is re-enabled. ref: http://www.rocksclusters.org/roll-documentation/sge/4.2.1/managing-queues.html
qmod -d default.q		# this actually places every node serving the queue in "d" state.  check with qstat -f
qmod -e default.q

Admin howto


See all exec nodes in the cluster:
qconf -sel

Add submit host:
qconf -as hostname
	# note that reverse DNS lookup of the execution and submit hosts must resolve and match
	# workstations can be submit hosts; they will not be running sge_execd.  
	# a submit host doesn't need special software, but does need qsub in order to submit. 
	# though often, the sge client package is installed
	# /etc/profile.d/settings.*sh will be added for configuration.  minimally:
	#   SGE_CLUSTER_NAME=sge1
	#   SGE_QMASTER_PORT=6444
	#   optional?:
	#   SGE_CELL=default
	#   SGE_EXECD_PORT=6445
	#   SGE_ROOT=/cm/shared/apps/sge/current
	# the submit host needs to know who the master is,  
	# ie /etc/hosts needs an entry for the cluster head node's external IP



Add an execution host to the system, modeled after an existing node:
qconf -ae TemplateHostName		
# TemplateHostName is an existing node which the new addition will be modeled after (inheriting its complex config, etc)
# the command will open a vi window. change the name to the new hostname, edit complexes etc as needed.  save and exit.

qconf -Ae  node-config.txt
# -Ae means add an execution host according to the spec in a file
# the format of the file can be a cut-n-paste of what the vi window shows when doing qconf -ae

example of node-config.txt ::
hostname              node-9-16.cm.cluster
load_scaling          NONE
complex_values        m_mem_free=98291.000000M,m_mem_free_n0=49152.000000M, \
                      m_mem_free_n1=49139.093750M,ngpus=0,slots=40
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
## above is cut-n-paste of qconf -ae node-1-1 + edit
## add as qconf -Ae node-config.txt
A way to script adding a bunch of nodes at the same time: create a shell script with the following and run it.  (Fix slots to match the number of cpu cores you have; memory should be fixed up by UGE.)
for X in $(seq 1 16 ) ; do

	echo "
	hostname              compute-2-$X
	load_scaling          NONE
	complex_values        m_mem_free=98291.000000M,m_mem_free_n0=49152.000000M, \
	                      m_mem_free_n1=49139.093750M,ngpus=0,slots=28
	user_lists            NONE
	xuser_lists           NONE
	projects              NONE
	xprojects             NONE
	usage_scaling         NONE
	report_variables      NONE " > compute-2-$X.qc

qconf -Ae compute-2-$X.qc
rm compute-2-$X.qc

done

qhost -F gpu | grep -B1 ngpus=[1-9]		# list host with at least 1 GPU installed


qconf -as hostname.fqdn
# -as = add submit host
# note that reverse DNS lookup of the execution and submit hosts must resolve and match


# remove a workstation from workstation.q
# (needed before the execution host can be removed)
qconf -mhgrp @wkstn
# (opens in a vi env; edit as desired, removing whatever host should no longer be in the group)

Remove an execution host (I think this will delete/kill any running jobs):
qconf -de hostname


# groups are members of a queue
# are hosts members of a group only?  or also of a queue directly?


# node doesn't want to run any jobs.  Check ExecD:
# log in to the node that isn't running jobs.
# don't do this on the head node!!
/etc/init.d/sgeexecd.clustername stop
# if it doesn't stop, kill the process
/etc/init.d/sgeexecd.clustername start


# slots are defined in the queue, and also for each exec host.
#
# typically slots are defined based on cpu cores; hosts are added to groups named after them, eg quad, hex
qconf -mq default.q 
# in vi, edit the desired default slots, or the slots available on a given machine explicitly.
# eg
slots                 8,[node-4-9.cm.cluster=24], \
                        [node-4-10.cm.cluster=20],[node-4-11.cm.cluster=20], \
                        [node-1-3.cm.cluster=12]

qconf -se node-4-10	# see slots and other complex that the named execution host has

Cgroups

cgroups_params controls the cgroups integration feature. It needs to be set via qconf -mconf. Available in UGE 8.1.7 and later; the config line may need to be added by hand if the system was hot-upgraded.
  cgroups_params             cgroup_path=/cgroup cpuset=false m_mem_free_hard=true

Places where (CPU) slots are defined

(Not sure of order of precedence yet)

jsv

Job Submission Verifier.
A script that runs to check whether a submitted job passes all site requirements; it may also add flags to the job.
The system-wide JSV is specified in
   $SGE_ROOT/default/common/sge_request  eg
   /cm/shared/apps/sge/univa/default/common/sge_request
and the JSV repository is typically at a centralized location, eg
   /cm/shared/apps/sge/current/util/resources/jsv/

Users can also specify a JSV on the command line, but unlike most other SGE/UGE flags, JSV options are additive, 
ie, they will ALL be executed.  Ditto for a JSV specified in ~/.sge_request.
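
A sketch of how a JSV gets attached to submissions (the script path is a made-up placeholder; -jsv is the relevant switch):
qsub -jsv /path/to/site_jsv.sh sge-script.sh	# client-side JSV for a single submission
# per-user default: add a line such as    -jsv /path/to/site_jsv.sh    to ~/.sge_request
# site-wide default: the same line in $SGE_ROOT/default/common/sge_request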

Troubleshooting

  • "cannot run in PE "smp" because it only offers 0 slots" in qstat -j JOBID output.
    Assuming there are indeed nodes with number of cpu cores requested by -pe smp NN, then this is because such node is not available, so the job is queued.
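
    A quick way to check current slot availability (commands covered earlier on this page):
    qstat -g c		# per-queue summary of used vs available slots
    qhost		# per-host cpu/slot/load overview
    qstat -j JOBID	# the "cannot run in PE ..." lines list the per-queue reasons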

    Various Notes...

    PBS / TORQUE

    PBS = Portable Batch System.
    The original open source PBS became OpenPBS. OpenPBS is quite old now. TORQUE is the newer development adapted from the original OpenPBS, and Cluster Resources seems to be the main site putting all the glue together for it.
  • OpenPBS.org - points to Altair, who still provides the original code for download. Last version = 2.3.16 on 2001/12/06. Some patches seem to be available up till 2004 from a site hosted by Argonne National Lab.
  • PBS Pro - Commercial version of PBS, supported by Altair (http://www.pbsgridworks.com)
  • Torque (Open Source) manual from Cluster Resources (WiKi).

    Two halves of the brain:
  • Scheduler: reads its config only when it starts up. Default = pbs_sched; can be replaced with Maui or Moab for smarter scheduling.
  • Server: a live process; changes made thru qmgr are effective immediately.
    Both pbs_mom and pbs_server must be started by root.
    
    qsub command-script.sh		# run script, stdout and stderr automatically saved as script.oXX and script.eXX
    
    qmgr -c 'print server'		# Print Server config, including pbs version.
    qmgr -c 'p s'			# Most commands can be abbreviated to their first letter.
    
    qalter				# alter batch job, eg move job from one queue to another.
    
    

    Creating New Queue

    Run qmgr command as root:
    qmgr -c "create queue test_queue_tin queue_type=execution"
    qmgr -c "set    queue test_queue_tin started = True"
    qmgr -c "set    queue test_queue_tin enabled = True"
    qmgr -c "set    queue test_queue_tin max_running  = 2" 
    qmgr -c "set    queue test_queue_tin max_user_run = 1" 
    qmgr -c "set    queue test_queue_tin resources_default.nice = 2" 
    

    Integration with MPI

    Integration with MPICH is somewhat documented. They recommend mpiexec (the PBS version at http://www.osc.edu/~pw/mpiexec/index.php#Too_many_mpiexecs ). This allows PBS to have better control. Note that there are "too many mpiexecs"; this one is an add-on to the standard MPICH distribution.
    OpenMPI is touted as the best thing to use. OpenMPI automatically communicates with Torque/PBS Pro and retrieves the host and np info, so inside the job script processes are launched with a plain mpirun (no machinefile needed): http://www.open-mpi.org/faq/?category=tm
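
    A minimal sketch of a Torque job script relying on that TM integration (program name and resource counts are placeholders):
    #!/bin/bash
    #PBS -N mpi_test
    #PBS -l nodes=2:ppn=8,walltime=01:00:00
    cd $PBS_O_WORKDIR
    mpirun ./my_mpi_prog	# with TM support, mpirun gets the host list and np from Torque; no -np or machinefile needed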
    
    

    Deleting a Queue

    qmgr -c "delete queue test_queue_tin"
    

    Misc

    Prologue/Epilogue script - Run before/after a job (for all queues?)

    LSF

    LSF User Guide (older version 4, but still usable, nice html format :) ) [local cache of Platform Computing doc]
    add pdf for quick ref guides ...
    
    
    
    
    
    
    
    bjobs -u all            # list all user jobs
    -w -u USERNAME  # wide format, w/o truncating fields.
    -l  JOBID       # long listing of a given job
     
    
    
    
    bhist -l -n 4           # list job history, going back 4 log files (if logs are available!)
    bhist -l         # show details of the job history,
    		# including how much time was spent in state
    		# Pending, UserSuspend, SysSuspend, etc
    
    bjobs -lp               # list all pending jobs and the reason why they are pending (what they are waiting for).
    
    bpeek            # take a peek at the current output of your job.
    		# LSF essentially captures the whole stdout of the job
    		# and displays it.  The whole buffer is displayed each time,
    		# until the job is dead
    
    
    
    
    
    bsub "cmd option"       # submit job to be run as batch process
    		# LSF will execute the command in a select remote host (LSF client).  
    		# It will duplicate the env, including
    		# the current working dir, env var, etc
    		# Obviously, depends on unanimous NFS setup on all host
    
    		# anyway to reduce priority w/o creating new queues?
    
    
    bstop 		# send STOP signal to the job
    		# LSF essentially does kill -STOP on the dispatched
    		# process and all of its children
    		# ps -efl will show "T" for these processes.
    bstop 0			# stop all jobs owned by the current user
    
    bresume 		# send CONTinue signal to the job
    bresume 0 		# resume all jobs
    
    brequeue 	# Re-Queue a job.  LSF essentially kills the job
    		# and re-submits it, which will likely run on a different host.
    		# Thus, it really depends on the process being able to restart gracefully,
    		# eg, setiathome will restart by reading state saved in xml files.
    
    
    lsload			# show loads
    bqueues			# list all queues
    lsinfo
    bhosts
    bparams
    
    lshosts			# display LSF "server nodes" and clients
    lshosts -l | egrep '(ENABLED|HOST_NAME)'	
    		# check which hosts have checked out a license
    		# the master node can be a proxy that doles out licenses to clients
    		# all licenses expire at midnight.
    					
    
    
    
    
    ENV setup

    Admin/Accounting

    
    bacct 			# display stats on finished LSF jobs.
    -u username
    
    bacct -u all -C 3/1,3/31		# display aggregate job stats for all users for the month of March.
    					# -C for Completed (and exited) jobs.
    bacct -u all -S 2007/1/1,2007/3/31	# similar to above, but for Submitted jobs, with the year explicitly given.
    
    

    Assistant

    .bashrc function, to count host and cpu cores.
    qhostTot() { qhost | sed 's/G//g' | awk '/^compute/ {h+=1; c+=$3; m+=$8} END {print "host="h " core=" c " RAM=" m "G"}' ; }
    
    alias:
    alias qchk='qstat -f | awk '\''$6 ~ /[a-zA-Z]/  &&  $1 ~ /default.q@/  {print $0}'\'''
    



    Links

    See also: Ref:

    SGE:

    [Doc URL]
    tiny.cc/uge
    http://tin6150.github.io/psg/lsf.html New home page. Hosted at dropbox since June 2011, up to date as of Dec 2016 and beyond...
    http://grumpyxmas.medianewsonline.com/lsf.html Alt home, last updated Dec 2012
    http://tin6150.s3-website-us-west-1.amazonaws.com/lsf.html (won't likely be coming after all)
    ftp://sn.is-a-geek.com/psg/lsf.html My home "server". Up sporadically.
    (cc) Tin Ho. See main page for copyright info.



    "ting"
    "ting"
    tin6150 sn50