MIT Information Systems   Longjobs -- Frequently Asked Questions
For full Longjobs documentation, see: Overview | Job Scripts | Running Jobs | Checking Job Status


What kind of machines are in the longjobs pool?

2005-2006 Longjobs Execution Machines

Platform  Operating System  Model          Memory (MB)  Disk (GB)  CPU                Speed (GHz)  Quantity
Sun       Solaris           Sun Fire V210  2048         36         2 UltraSPARC-IIIi  1.0          6
PC        Linux             IBM X335       1024         36         Intel Xeon         3.2          6


Can I test my job without waiting in the queue or using up quota?

Yes. We have a program which simulates the longjobs environment on your local workstation so that you can test scripts without actually submitting them; you don't have to wait in a queue or use up quota. We recommend this in particular if you are new to longjobs or are running a different type of job for the first time.

To use the test program:
  1. First prepare a version of your script to run a short job, in one of the following ways:
  2. Run the testjob program on the script prepared above.

    Basic syntax:
       athena% testjob script_name
           
    example:
       athena% add longjobs
       athena% testjob foo       
       [Ignoring -a script directive]
       Note that any locker dependencies will not be tested.  You
       have attached the following lockers; please ensure that
       your script and/or dotfiles attach or add them as needed:
    
       matlab
       longjobs
       infoagents
       29.123
    
       Executing foo...
       Process exited with status 0
       The standard output stream is in foo.out
       The standard error stream is in foo.err
       


Note on lockers: testjob runs on your local machine and will have access to any lockers you attached or added before running it, unlike the actual longjobs system which will have access to lockers only if your script or dotfiles attach/add them. As shown above, testjob will show you a list of the lockers which are attached when it runs, as a reminder to you to check your job's locker dependencies.

See the Testing section on the Job Scripts page for more details on how testjob deals with output, qsub options, etc.
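
One way to keep a short test version of a script is to let a single variable control how much work the job does. The following is only a sketch in Bourne shell: the STEPS variable and the crunch_step function are hypothetical names standing in for your own workload, and the default of 5 is an arbitrary small value for a test run.

```shell
#!/bin/sh
# Hypothetical job script: STEPS controls how much work is done.
# Keep the default small in the copy you give to testjob, and raise it
# in the copy you actually submit.
STEPS="${STEPS:-5}"

# Placeholder for the real computation (assumption: one unit of work per call).
crunch_step() {
    echo "step $1 done"
}

# Run the requested number of steps in order.
run_job() {
    i=1
    while [ "$i" -le "$STEPS" ]; do
        crunch_step "$i"
        i=$((i + 1))
    done
}

run_job
```

With a script like this, the test copy and the submitted copy differ in one line only, which makes it less likely that the tested script and the real one drift apart.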

Can I configure things so that I don't have to retype the account, queue name, etc. every time I submit a job?



Yes, there are two ways you can do this:

1: Using the QSUB environment variable
Set the QSUB environment variable to a single string containing all of the options you want to use. For example:
       setenv QSUB "-a 29.123 -q sun-long"
       
will specify 29.123 as the account and sun-long as the queue. With this set, all you would have to type to submit a job would be:
       athena% qsub script_name
       
(You may add such a setenv command to your ~/.environment file for use during future logins, or type it at the command line for use in the current session only.)


2: Using directives in your job script
At the beginning of your job script, add one line for each qsub option, using the following syntax (PBS stands for Portable Batch System, the software on which longjobs is based):
       #PBS -flag option
       
For example:
       #PBS -a 29.123
       #PBS -l walltime=10:00:00
       
will specify 29.123 as the account and set a time limit of 10 hours. Directives must be placed at the beginning of the job script, before any commands; a directive that follows a command will be ignored.
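
Because a misplaced directive is ignored silently, it can help to check a script before submitting it. The helper below is a minimal sketch, not part of longjobs: the function name late_directives is made up for illustration, and it assumes that blank lines and ordinary comment lines (including the #! line) do not count as commands.

```shell
#!/bin/sh
# Print every "#PBS" line that appears after the first command in a job script.
# Assumption: blank lines and comment lines are not treated as commands.
late_directives() {
    awk '
        /^#PBS/        { if (seen_command) print FILENAME ":" FNR ": " $0; next }
        /^[ \t]*(#|$)/ { next }              # comment or blank line
                       { seen_command = 1 }  # anything else is a command
    ' "$1"
}
```

Running late_directives script_name prints nothing when every directive comes before the first command; otherwise it lists the file name, line number, and text of each directive that would be ignored.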

Can the stderr/stdout files be flushed to my account while the job is running (i.e., don't wait until the job is done to copy them over from the server)?

Yes. Use the qsub -k option at submit time to select this for either file or for both. Note, however, that the files are flushed to your home directory, not to the submit directory. If the system cannot open the file(s) for writing when it attempts to start the job, it will abort the job. If the job starts successfully but later writes fail (e.g., due to insufficient quota), it is up to the running program to handle the errors. Few programs test for errors writing to stdout/stderr, so such output would most likely be lost. (With the default setting, by contrast, the system will mail you the file if it is unable to write it back to your submit directory at the end of the job.)

-k e
    As the job runs, writes the stderr file ~/jobname.ejobid only; at job end, writes jobname.ojobid to the submit dir.
-k o
    As the job runs, writes the stdout file ~/jobname.ojobid only; at job end, writes jobname.ejobid to the submit dir.
-k eo
    As the job runs, writes both ~/jobname.ejobid and ~/jobname.ojobid.
-k n
    (default) Flushes neither file as the job runs; at job end, writes both to the submit dir.
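
The four settings can be summarized as a small lookup. This is an illustrative sketch only: output_paths is a hypothetical helper, not a longjobs command, and it takes jobname to be the first 15 characters of the script name.

```shell
#!/bin/sh
# Print where the stderr and stdout files end up for a given -k setting.
# Arguments: script name, numeric job id, -k value (e, o, eo, or n),
# and the submit directory.
output_paths() {
    script=$1
    jobid=$2
    keep=$3
    submit_dir=$4
    # jobname is the first 15 characters of the script name
    jobname=$(printf '%s' "$script" | cut -c1-15)
    err_dir=$submit_dir
    out_dir=$submit_dir
    case $keep in
        e)  err_dir=$HOME ;;
        o)  out_dir=$HOME ;;
        eo) err_dir=$HOME; out_dir=$HOME ;;
        n)  ;;
        *)  echo "unknown -k value: $keep" >&2; return 1 ;;
    esac
    echo "stderr: $err_dir/$jobname.e$jobid"
    echo "stdout: $out_dir/$jobname.o$jobid"
}
```

For example, output_paths crunch_data 1171 e /mit/jqpublic/work reports the stderr file under your home directory and the stdout file under the submit directory (the directory name here is made up).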

Can I set up dependencies between multiple jobs, e.g., don't run one job until another job has finished?

Yes, you can use the qsub -W depend=dependency_list option to define dependencies between jobs. There are a number of ways to do this, as explained on the qsub man page. We will give one simple example here.

Suppose we have split our task into two job scripts, called part1 and part2, and want to submit both jobs at the same time, but have the system wait until part1 finishes before it starts part2:
  1. Submit the first job and note its jobid, e.g.:
             athena% qsub -a long-29.123 -q sun-medium part1
             [Creating a renewable Kerberos ticket for use by the job]
             Password for jqpublic@ATHENA.MIT.EDU: 
             [Forwarding renewable Kerberos 5 TGT...]
             1171.hydrogen.mit.edu
           
  2. Submit the second job with -W depend=afterok:jobid-to-wait-for, e.g.:
     
             athena% qsub -a long-29.123 -q sun-medium -W depend=afterok:1171 part2
             [Creating a renewable Kerberos ticket for use by the job]
             Password for jqpublic@ATHENA.MIT.EDU: 
             [Forwarding renewable Kerberos 5 TGT...]
             1175.hydrogen.mit.edu
           
In this case, if you run qstat, you will see that the second job is in the Hold state. You can check a job's dependencies by viewing its complete status with qstat -f, e.g.:
  athena% qstat -f 1175 | grep depend
  depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu
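
If you chain jobs often, the two steps can be scripted. The following Bourne shell sketch is built on assumptions: the function names are made up, it assumes the job id is the last line of qsub's standard output that begins with digits followed by a dot, and it assumes the Kerberos password prompt is written to your terminal rather than to standard output.

```shell
#!/bin/sh
# Pull the numeric job id out of qsub's output, e.g. "1171.hydrogen.mit.edu".
# Assumption: the id is on the last line that starts with digits and a dot.
jobid_from_output() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)\..*/\1/p' | tail -n 1
}

# Submit part1, then submit part2 so that it waits for part1 to finish
# successfully. Account and queue are the ones used in the example above.
submit_chain() {
    first=$(qsub -a long-29.123 -q sun-medium part1 | jobid_from_output)
    [ -n "$first" ] || { echo "could not read job id" >&2; return 1; }
    qsub -a long-29.123 -q sun-medium -W "depend=afterok:$first" part2
}
```

Only jobid_from_output can be tried without the batch system; submit_chain needs qsub on your path and will still ask for your password unless you forward an existing ticket.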



My job quit unexpectedly, what's wrong?

  1. Start by looking at the contents of the standard output and error files from the job. These will be in the directory from which you submitted the job, named jobname.ojobid and jobname.ejobid respectively (jobname is the first 15 characters of the script name, or STDIN; jobid is the number assigned to the job when you submitted it). If there are multiple files like this and you don't recall the jobid, you can see which are most recent with ls -ltr.

    Some sample errors, and things to check:

    my_data: File exists.
        If your script does something like crunch_data > my_data, you should check that the file does not already exist from a previous run (you can have your script delete it first, or instead construct a unique filename using PBS_JOBID; see the notes on environment variables on the Job Scripts page).

    crunch_data: Command not found
        If crunch_data is a command from a locker, does your script add that locker? (See the notes on lockers and path and testing on the Job Scripts page.)
        Does the command exist for the platform where your job ran? (E.g., if it is Sun-only, make sure you submit to one of the sun- queues, not the any- queue.)

    Cannot access input.m: No such file or directory
        Does the file input.m exist in the job's working directory when the script runs? (See the notes on working directory and testing on the Job Scripts page.)

    Cannot open display
    Cannot open /dev/tty
        Recall that jobs execute without access to an X display or controlling terminal, so any program requiring such access will fail. Check if the program has an option for running in batch mode.


  2. Try testing your script as described on the Job Scripts page.
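
For the "File exists" case, a unique output name can be built from PBS_JOBID. This is a minimal Bourne shell sketch: unique_name is a hypothetical helper, and the fallback suffix "test" is an assumption for runs where PBS_JOBID is not set.

```shell
#!/bin/sh
# Build an output file name that is unique per job.
# PBS_JOBID is set by the batch system; "test" is a fallback (an assumption)
# for runs where the variable is unset.
unique_name() {
    echo "$1.${PBS_JOBID:-test}"
}

# In a job script, instead of:   crunch_data > my_data
# one could write:               crunch_data > "$(unique_name my_data)"
```

Each run then writes to its own file, so a leftover file from a previous job can no longer block the redirection.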

    Why does it say "unauthorized" when I try to submit?

    If you see a message like this:
    qsub: Unauthorized Request 
    
    it can mean several things:

    My job hasn't finished yet, is something wrong?

    When viewing queue status, keep in mind that predicting wait time may be difficult. Unlike a simple print queue, this is not a strict first-in-first-out system. Viewing the queues should give you some idea of system demand, but scheduling order and wait time will depend on several factors (see the Checking Job Status page for more on this).

    How many jobs can I run simultaneously?

    There is an overall system limit, which may change according to usage levels and the time of the academic year. There are other related limits (such as a per-queue user limit) which may be added or modified as needed. If a job is not running because it would exceed such a limit, you will see a message indicating this if you run qstat -l, e.g.:
      Not Running: User has reached server running job limit.
    
    At this writing, the only limit is the system-wide maximum, set at 3. You can always check this with:
      athena% qmgr -c 'list server max_user_run'
      Server hydrogen.mit.edu
            max_user_run = 3
    
    Keep in mind that other scheduling factors may influence how long your jobs wait in the queue.

    How can I submit multiple jobs without retyping my password each time?

    By default, qsub uses your password to create a renewable Kerberos 5 ticket-granting ticket (TGT) for use by the job. If you instead use its -tf option to forward an existing ticket, you can create a single new TGT and use it for every job you plan to submit, as follows:
    1. Start a new shell.
    2. Change the KRB5CCNAME and KRBTKFILE environment variable settings, to protect your existing ticket files. Set the variables to point at unique file names on the local disk, e.g.:
            setenv KRB5CCNAME /tmp/krb5cc_p$$
            setenv KRBTKFILE /tmp/krb4cc_p$$
      
    3. Use kinit -r renewable_lifetime to create a new renewable TGT. (See the kinit(1) man page). The maximum renewable lifetime is currently 7 days. For example:
            athena% kinit -r 7d
            Password for jqpublic@ATHENA.MIT.EDU: 
      
    4. Use qsub -tf to submit each job using the newly created ticket; you will not have to enter your password.
    5. Run kdestroy to destroy the ticket files created in step 3 (IMPORTANT).
    6. Exit the shell (to revert the environment variable settings changed in step 2).
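
The six steps above can be sketched as one function. This is a Bourne shell translation offered under assumptions, not an official longjobs tool: the steps themselves use csh's setenv, the names run and submit_all are made up, and the sketch assumes the account and queue come from the QSUB variable or from directives in each script. With DRY_RUN=1 the commands are printed rather than executed.

```shell
#!/bin/sh
# With DRY_RUN=1, print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Submit every script named on the command line with one password prompt.
submit_all() {
    (   # steps 1-2: a subshell with its own ticket file locations
        KRB5CCNAME=/tmp/krb5cc_p$$
        KRBTKFILE=/tmp/krb4cc_p$$
        export KRB5CCNAME KRBTKFILE
        run kinit -r 7d || exit 1      # step 3: one password prompt
        for script in "$@"; do
            run qsub -tf "$script"     # step 4: forwards the new ticket
        done
        run kdestroy                   # step 5: destroy the new ticket files
    )                                  # step 6: leaving the subshell restores the environment
}
```

Running the function in a subshell plays the role of steps 1 and 6: the changed variables never reach your login shell, so your existing ticket files are left alone.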

    Why does qstat give a usage error?

    There is a command name conflict with the qstat program in the outland locker. (The outland qstat command displays the status of Quake servers, and gives an error when run without arguments.) To make sure you run the longjobs qstat, use add -f longjobs when attaching the locker, run longjob -jobs, or invoke /mit/longjobs/bin/qstat explicitly.

    Who can I contact for help?

    Please provide as much information as possible about what commands you were using, any error messages you received, the Job ID, etc.

    If your project is time-critical: Note that the Athena machines in the Building 37 cluster (37-318, 37-332) are available for long unattended jobs on weekends. From 5:00 PM Fridays until 8:00 AM Mondays, Cluster staff will not log off users from unattended machines.

    We will post any system status messages at the top of the main page; please check there first if you are encountering problems. (Occasionally larger network problems affect longjobs; you may also want to check for notes on http://nic.mit.edu/3down, the general IS Services Status page.)
    Last modified: Wed Sep 21 17:23:43 EDT 2005