Job Queuing in a Linux Cluster
Written by Bowon Lee
Last updated: March 10, 2006
Introduction
A job queuing system is necessary to make effective use of all the
resources available in a cluster. This page briefly describes one such
system, the Sun Grid Engine (SGE).
Basic Procedure
The basic functions of a job queuing system are to
- Monitor the status of the available CPUs
- Take jobs from users and put them in a queue
- Send jobs in the queue to idle CPUs
- Manage the jobs in the queue
When jobs remain in the queue and all the CPUs are busy, the job
queuing system monitors the CPUs until one of them becomes idle and
then sends the queued jobs to the idle CPUs in the order in which the
jobs were received.
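The first-come, first-served behavior described above can be illustrated with a toy simulation. This is only a sketch of the idea, not how SGE is implemented; the job names, the CPU count, and the dispatch function are all made up for illustration.

```shell
# Toy simulation of first-come, first-served dispatch (not SGE itself).
# Job names, the CPU count, and dispatch() are hypothetical.

queue="jobA jobB jobC jobD jobE"   # jobs in the order they were received
idle_cpus=2                        # CPUs that are idle right now

# Send queued jobs to idle CPUs, oldest job first, until either runs out.
dispatch() {
    while [ -n "$queue" ] && [ "$idle_cpus" -gt 0 ]; do
        job=${queue%% *}                  # job at the head of the queue
        echo "dispatch $job"
        case $queue in
            *" "*) queue=${queue#* } ;;   # drop the head, keep the rest
            *)     queue="" ;;            # that was the last job
        esac
        idle_cpus=$((idle_cpus - 1))
    done
}

dispatch                 # two CPUs are idle: jobA and jobB start
idle_cpus=1              # later, one CPU becomes idle again
dispatch                 # jobC starts; jobD and jobE keep waiting
echo "waiting: $queue"
```

Jobs that find no idle CPU simply stay in the queue, in their original order, until the next CPU is free.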
Useful commands
- qstat
This command displays the current status of the queue.
Useful options are
-f          : Display the status of the queue in more detail
-u {userID} : Display the status of jobs owned by the user userID
-j {jobID}  : Display the status of the job with job ID jobID
For more detailed usage, please type 'man qstat'.
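As a quick illustration, the three options above might be used as follows. This sketch assumes SGE's qstat is on the PATH; the job ID 12345 is hypothetical, and the sge_found variable exists only so the snippet can report what it did.

```shell
# Assumes SGE's qstat is installed; 12345 is a hypothetical job ID.
if command -v qstat >/dev/null 2>&1; then
    qstat -f                 # detailed status of the queue
    qstat -u "$(id -un)"     # only jobs owned by the current user
    qstat -j 12345           # status of one particular job
    sge_found=yes
else
    echo "qstat not found: run this on a node where SGE is installed"
    sge_found=no
fi
```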
- qsub
This command submits a batch job to the job queuing system.
Note that qsub takes a script that runs the job, not the job's
command itself.
Options can either be given on the command line or written inside the
script being submitted.
Useful options are
-cwd        : Run the script in the current working directory
-S {shell}  : Specify the shell to be used
An example script, test.sh, is shown below.
#!/bin/bash
# This is a test script
pwd
hostname
date
We can test this script by running it on the master node.
$ ./test.sh
We should see output such as
/home/myuserid/test/
machine.domain.uiuc.edu
Fri Mar 10 13:37:49 CST 2006
We expect similar output when this script runs as a batch job on a
slave node.
We can submit this job by typing
$ qsub -cwd -S /bin/bash test.sh
As stated earlier, these options can also be written in the script, on lines beginning with '#$':
#!/bin/bash
#$ -S /bin/bash
# This is a test script
#$ -cwd
pwd
hostname
date
Then we can submit the job by typing
$ qsub test.sh
Since the job is submitted through the job queuing system, its output does
not appear on the terminal of the master node. Instead, for a job with job ID
jobID, SGE generates two text files, 'test.sh.o{jobID}' and 'test.sh.e{jobID}',
which store standard output and standard error respectively. Examine these
files to make sure that the job completed successfully.
For more detailed usage, please type 'man qsub'.
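Checking the two output files can be scripted. The sketch below assumes the job was submitted with -cwd, so that the files are in the current directory; check_job_output is a hypothetical helper, not part of SGE. An empty error file is only a hint that the job succeeded, so the standard output is printed for inspection as well.

```shell
# check_job_output SCRIPT JOBID   (hypothetical helper, not part of SGE)
# Looks for SCRIPT.oJOBID and SCRIPT.eJOBID in the current directory.
# Returns 0 if both exist and the error file is empty, 1 if the job
# wrote to standard error, 2 if the files are not there (yet).
check_job_output() {
    out="$1.o$2"
    err="$1.e$2"
    if [ ! -f "$out" ] || [ ! -f "$err" ]; then
        echo "missing output files (job still queued or running?)"
        return 2
    fi
    if [ -s "$err" ]; then
        echo "job $2 wrote to standard error:"
        cat "$err"
        return 1
    fi
    echo "job $2 left standard error empty; standard output:"
    cat "$out"
}

# Hypothetical usage for 'qsub test.sh' that was given job ID 12345:
# check_job_output test.sh 12345
```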
- qdel
This command deletes jobs from the queue.
Useful arguments are
-u {userID} : Kill all jobs owned by the user userID
{jobID}     : Kill the job with job ID jobID
For more detailed usage, please type 'man qdel'.
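Because qdel is destructive, it can help to preview the command before running it. The wrapper below is a sketch: delete_jobs and DRY_RUN are hypothetical names, the job ID 12345 is made up, and the job ID is passed to qdel directly as an argument.

```shell
# Preview (or run) a qdel command. DRY_RUN defaults to 1 so that pasting
# this sketch does not delete anything by accident.
DRY_RUN=${DRY_RUN:-1}

delete_jobs() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: qdel $*"
    else
        qdel "$@"
    fi
}

delete_jobs 12345              # one job, by (hypothetical) job ID
delete_jobs -u "$(id -un)"     # every job owned by the current user
```

Setting DRY_RUN=0 before calling delete_jobs makes it run qdel for real.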
Comments
In most cases, the slave nodes are connected through a network switch in a
private network, so they cannot connect directly to the outside network.
If your current working directory is mounted from a machine outside the
cluster, the slave nodes cannot access its files. Make sure that the working
directory is physically located on the master node so that the slave nodes
can access those files.
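One way to check where the working directory lives is to look at the device reported by df on the master node. The sketch below is a heuristic only: backing_device and looks_network_mounted are hypothetical helpers, and the test simply treats NFS-style "host:/path" and SMB-style "//server/share" device names as network mounts.

```shell
# Print the device that backs a directory (default: current directory).
backing_device() {
    df -P "${1:-.}" | awk 'NR == 2 { print $1 }'
}

# Heuristic: network file systems usually show up as "host:/export/path"
# (NFS) or "//server/share" (SMB/CIFS) rather than a local device.
looks_network_mounted() {
    case $1 in
        *:/*) return 0 ;;
        //*)  return 0 ;;
        *)    return 1 ;;
    esac
}

dev=$(backing_device .)
if looks_network_mounted "$dev"; then
    echo "warning: $PWD is on $dev; slave nodes may not see these files"
else
    echo "$PWD is on local device $dev"
fi
```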
A tutorial about running the HTK commands in parallel using the job queuing
system can be found at
HTK_parallel
Feel free to send feedback to bowonlee@uiuc.edu