- Job Queuing in a Linux Cluster

      Written by Bowon Lee
      Last updated: March 10, 2006


  • Introduction
    1. A job queuing system is needed to make effective use of all the resources in a cluster. Of the available job queuing systems, this page briefly describes the Sun Grid Engine (SGE).

  • Basic Procedure
    1. The basic functions of a job queuing system are to

      1. Monitor the status of the available CPUs
      2. Accept jobs from users and put them in a queue
      3. Send jobs from the queue to any idle CPUs
      4. Manage the jobs in the queue

      When jobs remain in the queue and all the CPUs are busy, the job queuing system monitors the CPUs until one or more become idle, then sends the waiting jobs to those CPUs in the order in which the jobs were received.
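
      The first-come, first-served dispatch described above can be sketched as a toy shell model. This only illustrates the ordering and is not how SGE is implemented; the job and CPU names are made up for the example.

```shell
#!/bin/sh
# Toy model of first-in, first-out dispatch (illustration only, not SGE itself).

set -- jobA jobB jobC          # waiting jobs, oldest first (hypothetical names)
dispatched=""

for cpu in cpu1 cpu2; do       # CPUs that are currently idle (hypothetical names)
    [ "$#" -gt 0 ] || break    # stop when no jobs are waiting
    echo "dispatch $1 to $cpu"
    dispatched="$dispatched $1:$cpu"
    shift                      # remove the oldest job from the queue
done

waiting="$*"                   # jobs left over once every idle CPU is taken
echo "still waiting: $waiting"
```

      With three jobs and two idle CPUs, jobA and jobB are dispatched and jobC stays in the queue until a CPU becomes idle.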

  • Useful commands
    1. qstat
      This command displays the current status of the queue.
      Useful options are

        -f : Display the status of the queue in more detail
        -u {userID} : Display the status of jobs owned by the user userID
        -j {jobID} : Display the status of the job with job ID jobID

      For more detailed usage, type 'man qstat'.
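
      The -j option can also be used in a script to wait for a job to finish. The following is a minimal sketch, assuming that 'qstat -j {jobID}' exits with a non-zero status once the job has left the queue (please check this on your own installation). The function name wait_for_job and the QSTAT variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# QSTAT can be overridden for a dry run; by default the real qstat is used.
QSTAT=${QSTAT:-qstat}

# wait_for_job JOBID [SECONDS]
# Poll every SECONDS (default 10) until "qstat -j JOBID" stops succeeding.
# Assumption: qstat -j fails once the job is no longer known to the queue.
# Note that the loop also ends at once if qstat is not installed.
wait_for_job() {
    while "$QSTAT" -j "$1" >/dev/null 2>&1; do
        sleep "${2:-10}"
    done
    echo "job $1 is no longer in the queue"
}
```

      For example, 'wait_for_job 231 30' would check every 30 seconds on a job whose ID is 231.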

    2. qsub
      This command submits a batch job to the job queuing system.
      Note that qsub submits a script that runs the job, not the job's commands themselves.
      Options can either be given on the command line or be written in the script being submitted.
      Useful options are

        -cwd : Execute the script from the current working directory
        -S {shell} : Specify the shell to be used

      An example script, test.sh, is shown here.

        #!/bin/bash
        # This is a test script

        pwd
        hostname
        date

      We can test this script by running it on the master node (make it executable first with 'chmod +x test.sh').

        $ ./test.sh

      We should see output such as

        /home/myuserid/test
        machine.domain.uiuc.edu
        Fri Mar 10 13:37:49 CST 2006

      We expect to see similar results when we submit this script as a batch job to the slave nodes.
      We can submit the job by typing

        $ qsub -cwd -S /bin/bash test.sh

      As stated earlier, these options can instead be written in the script itself, on lines that begin with '#$':

        #!/bin/bash
        #$ -S /bin/bash
        # This is a test script

        #$ -cwd
        pwd
        hostname
        date

      Then we can submit a job by typing

        $ qsub test.sh

      Because the job runs through the job queuing system, its output does not appear on the terminal of the master node. For a submitted job with job ID jobID, SGE generates two text files, 'test.sh.o{jobID}' and 'test.sh.e{jobID}', which store the job's standard output and standard error respectively. Examine these files to make sure that the job completed successfully.

      For more detailed usage, please type 'man qsub'.
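
      To locate the two output files from a script, the job ID can be taken from the message that qsub prints. The following is a minimal sketch, assuming that the message has the form 'Your job {jobID} ("test.sh") has been submitted'; the exact wording may differ between SGE versions. The function names job_id_from_message and output_files are hypothetical helpers written for this example.

```shell
#!/bin/sh
# job_id_from_message MESSAGE
# Print the number that follows "Your job" in a qsub confirmation message.
# Assumption: the message starts with "Your job <number> ".
job_id_from_message() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# output_files SCRIPT JOBID
# Print the standard output and standard error file names for the job.
output_files() {
    echo "$1.o$2"
    echo "$1.e$2"
}

# Typical use on the master node (needs SGE, so it is left commented out):
#   msg=$(qsub -cwd -S /bin/bash test.sh)
#   id=$(job_id_from_message "$msg")
#   cat $(output_files test.sh "$id")
```

      With -cwd, the two files are written to the directory from which the job was submitted.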

    3. qdel
      This command deletes jobs from the queue.
      Useful arguments are

        -u {userID} : Delete all jobs owned by the user userID
        {jobID} : Delete the job with job ID jobID (the job ID is given directly, without an option flag)

      For more detailed usage, type 'man qdel'.

  • Comments
    1. In most cases, the slave nodes are connected through a network switch in a private network, so they cannot connect directly to the outside network. If your current working directory is mounted over the network from a machine outside the cluster, the slave nodes cannot access its files. Make sure that the working directory is physically located on the master node so that the slave nodes can access those files.
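
      One rough way to check this on the master node before submitting is to look at where the current directory is mounted from. The following is a sketch only: it assumes that a network mount shows a source of the form 'host:/path' (NFS style) or '//host/share' (SMB style) in the output of 'df -P', which does not cover every kind of network file system. The function names fs_source and looks_remote are hypothetical.

```shell
#!/bin/sh
# fs_source DIR
# Print the device or remote export that DIR is mounted from.
fs_source() {
    df -P "${1:-.}" | awk 'NR == 2 { print $1 }'
}

# looks_remote SOURCE
# Succeed if SOURCE looks like host:/path or //host/share (partial check).
looks_remote() {
    case "$1" in
        *:/*|//*) return 0 ;;
        *)        return 1 ;;
    esac
}

src=$(fs_source .)
if looks_remote "$src"; then
    echo "warning: $(pwd) is mounted over the network from $src"
else
    echo "$(pwd) appears to be on local storage ($src)"
fi
```

      If the warning appears, copy the files to a directory that is physically located on the master node and submit the job from there.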

      A tutorial about running the HTK commands in parallel using the job queuing system can be found at HTK_parallel

      Feel free to send feedback to bowonlee@uiuc.edu

    Created by Bowon Lee March 10, 2006