Longjobs FAQ

Longjobs -- Frequently Asked Questions

For full Longjobs documentation, see: Overview | Job Scripts | Running Jobs | Checking Job Status

What kind of machines are in the longjobs pool?
Can I test my job without waiting in the queue or using up quota?
Can I configure things so that I don't have to retype the account, queue name, etc. every time I submit a job?
Can the stderr/stdout files be flushed to my account while the job is running (i.e., don't wait until the job is done to copy them over from the server)?
Can I set up dependencies between multiple jobs, e.g., don't run one job until another job has finished?
My job quit unexpectedly, what's wrong?
Why does it say "unauthorized" when I try to submit?
My job hasn't finished yet, is something wrong?
How many jobs can I run simultaneously?
How can I submit multiple jobs without retyping my password each time?
Why does qstat give a usage error?
Who can I contact for help?

What kind of machines are in the longjobs pool?

**2005-2006 Longjobs Execution Machines**

| Platform | Operating System | Model | Memory (MB) | Disk (GB) | CPU | Speed (GHz) | Quantity |
|---|---|---|---|---|---|---|---|
| Sun | Solaris | Sun Fire V210 | 2048 | 36 | 2 UltraSPARC-IIIi | 1.0 | 6 |
| PC | Linux | IBM X335 | 1024 | 36 | Intel Xeon | 3.2 | 6 |

Can I test my job without waiting in the queue or using up quota?

Yes. We have a program which simulates the longjobs environment on your local workstation so that you can test scripts without actually submitting them; you don't have to wait in a queue or use up quota. We recommend this in particular if you are new to longjobs or are running a different type of job for the first time.

To use the test program:

First prepare a version of your script to run a short job, in one of the following ways:
- Comment out the long-running commands so that they don't actually execute; instead, you might use `which` to check that an executable is available on the path, and `echo` to write an output file, e.g.:
```
       cd $PBS_O_WORKDIR
       rm -f my_data
       add 29.123
       # crunch_data > my_data
       which crunch_data
       echo "test" > my_data
       
```
- If possible, run the real commands, but use trivial parameters so that the job completes within a few minutes, e.g.:
```
       cd $PBS_O_WORKDIR
       add matlab
       # matlab -tty < input.m
       matlab -tty < tiny_input.m
       
```
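
The `which` check in the first approach can be extended to several commands at once. The following is a minimal sketch, assuming the job script is written for a POSIX-compatible shell (`sh`); `check_commands` is a hypothetical helper name, not part of longjobs, and `crunch_data` is the placeholder program from the example above.

```shell
# check_commands (hypothetical helper): report whether each named command
# can be found by the shell. Returns non-zero if any of them is missing.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "ok: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

In a test version of a script, you might call `check_commands crunch_data` after the add lines and before writing the dummy output file.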

Run the testjob program on the script prepared above.

Basic syntax:

   athena% testjob script_name

example:

   athena% add longjobs
   athena% testjob foo       
   [Ignoring -a script directive]
   Note that any locker dependencies will not be tested.  You
   have attached the following lockers; please ensure that
   your script and/or dotfiles attach or add them as needed:

   matlab
   longjobs
   infoagents
   29.123

   Executing foo...
   Process exited with status 0
   The standard output stream is in foo.out
   The standard error stream is in foo.err

Note on lockers: testjob runs on your local machine and will have access to any lockers you attached or added before running it, unlike the actual longjobs system which will have access to lockers only if your script or dotfiles attach/add them. As shown above, testjob will show you a list of the lockers which are attached when it runs, as a reminder to you to check your job's locker dependencies.

See the Testing section on the Job Scripts page for more details on how testjob deals with output, qsub options, etc.

Can I configure things so that I don't have to retype the account, queue name, etc. every time I submit a job?

Yes, there are two ways you can do this:

1: Using the QSUB environment variable

Set the QSUB environment variable to a single string containing all of the options you want to use. For example:

       setenv QSUB "-a 29.123 -q sun-long"

will specify 29.123 as the account and sun-long as the queue. With this set, all you would have to type to submit a job would be:

       athena% qsub script_name

(You may add such a setenv command to your ~/.environment file for use during future logins, or type it at the command line for use in the current session only.)

2: Using directives in your job script

At the beginning of your job script, add a single line for each qsub option, using the following syntax (PBS stands for Portable Batch System, the software on which longjobs is based):

       #PBS -flag option

For example:

       #PBS -a 29.123
       #PBS -l walltime=10:00:00

will specify 29.123 as the account and set a time limit of 10 hours. Note that directives must be placed at the beginning of the job script, before any commands (if there are any commands in the script before a directive, the directive will be ignored).
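
The placement rule can be illustrated with a small sketch. This is not the parser that longjobs uses; it only mimics the behavior described above, under the assumption that other comment lines and empty lines before the first command do not end the directive block. `list_directives` is a hypothetical name.

```shell
# list_directives FILE (hypothetical helper): print the "#PBS" lines that
# appear before the first command in FILE.
list_directives() {
    while IFS= read -r line; do
        case "$line" in
            '#PBS'*) printf '%s\n' "$line" ;;  # a directive: report it
            '#'*|'') ;;                        # other comments and empty lines are skipped
            *) break ;;                        # first command: anything after it is ignored
        esac
    done < "$1"
}
```

Running it on a script whose third line is a command would list only the directives on the first two lines, which is a quick way to spot a directive that has been placed too late.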

Notes:

Command line options take precedence over script directives, and script directives take precedence over the QSUB environment variable.
For details on qsub options, see running qsub, or the qsub man page.
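
The precedence order in the first note can be written out as a short sketch, assuming a POSIX-compatible shell. `effective_option` is a hypothetical helper for illustration only; it takes the value given for one option on the command line, in a script directive, and in the QSUB variable (an empty string meaning "not given") and prints the one that would win.

```shell
# effective_option CMDLINE DIRECTIVE ENVVALUE (hypothetical helper):
# print the value that takes effect under the precedence described above.
effective_option() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # command line wins
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the script directive
    else
        printf '%s\n' "$3"      # then the QSUB environment variable
    fi
}
```

For example, `effective_option "" sun-long sun-medium` prints `sun-long`, because a directive overrides the QSUB setting when nothing is given on the command line.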

Can the stderr/stdout files be flushed to my account while the job is running (i.e., don't wait until the job is done to copy them over from the server)?

Yes. Use the qsub -k option at submit time to request this for either or both of the files. Note, however, that the file(s) will be written to your home directory, not the submit directory. If the system cannot open the file(s) for writing when it tries to start the job, it will abort the job. If the job starts successfully but later writes fail (e.g., due to insufficient quota), it is up to the running program to handle the errors; since few programs test for errors writing to stdout/stderr, such output would most likely be lost. (With the default setting, by contrast, the system will mail you the file if for some reason it is unable to write it back to your submit directory at the end of the job.)

-k e: as the job runs, writes the stderr file ~/jobname.ejobid only; at job end, writes jobname.ojobid to the submit dir.
-k o: as the job runs, writes the stdout file ~/jobname.ojobid only; at job end, writes jobname.ejobid to the submit dir.
-k eo: as the job runs, writes both ~/jobname.ejobid and ~/jobname.ojobid.
-k n: (default) flushes neither file as the job runs; at job end, writes both to the submit dir.
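
The four cases can be summarized in a short sketch, assuming a POSIX-compatible shell. `output_locations` is a hypothetical helper, not a longjobs command; it simply prints where each file would end up according to the list above.

```shell
# output_locations KEEP JOBNAME JOBID SUBMITDIR (hypothetical helper):
# print the final location of the stdout and stderr files for a -k value.
output_locations() {
    keep=$1
    jobname=$2
    jobid=$3
    submitdir=$4
    # Default (-k n): both files are written to the submit dir at job end.
    out="$submitdir/$jobname.o$jobid"
    err="$submitdir/$jobname.e$jobid"
    case "$keep" in
        n)  ;;
        e)  err="$HOME/$jobname.e$jobid" ;;   # stderr flushed to the home directory
        o)  out="$HOME/$jobname.o$jobid" ;;   # stdout flushed to the home directory
        eo) out="$HOME/$jobname.o$jobid"
            err="$HOME/$jobname.e$jobid" ;;
        *)  echo "unknown -k value: $keep" >&2
            return 1 ;;
    esac
    printf 'stdout: %s\nstderr: %s\n' "$out" "$err"
}
```

For instance, `output_locations e foo 1171 /mit/jqpublic/work` reports the stdout file in the submit directory and the stderr file in the home directory (the directory name here is made up for the example).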

Can I set up dependencies between multiple jobs, e.g., don't run one job until another job has finished?

Yes, you can use the qsub -W depend=dependency_list option to define dependencies between jobs. There are a number of ways to do this, as explained on the qsub man page. We will give one simple example here.

Suppose we have split our task into two job scripts, called part1 and part2, and want to submit both jobs at the same time, but have the system wait until part1 finishes before it starts part2:

Submit the first job and note its jobid, e.g.:

         athena% qsub -a long-29.123 -q sun-medium part1
         [Creating a renewable Kerberos ticket for use by the job]
         Password for jqpublic@ATHENA.MIT.EDU: 
         [Forwarding renewable Kerberos 5 TGT...]
         1171.hydrogen.mit.edu

Submit the second job with -W depend=afterok:jobid-to-wait-for, e.g.:

 
         athena% qsub -a long-29.123 -q sun-medium -W depend=afterok:1171 part2
         [Creating a renewable Kerberos ticket for use by the job]
         Password for jqpublic@ATHENA.MIT.EDU: 
         [Forwarding renewable Kerberos 5 TGT...]
         1175.hydrogen.mit.edu

In this case, if you run qstat, you will see the second job is in the Hold state. You can check a job's dependencies by viewing the complete status with qstat -f, e.g.

  athena% qstat -f 1175 | grep depend
  depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu
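
If you chain jobs often, the option string can be built from the job id that qsub prints. The following is a minimal sketch, assuming a POSIX-compatible shell; `numeric_jobid` and `afterok_option` are hypothetical helper names, and the commented usage lines assume that the job id is the last line qsub writes to standard output, which you should verify before relying on it.

```shell
# numeric_jobid FULLID (hypothetical helper): strip the server suffix,
# e.g. 1171.hydrogen.mit.edu -> 1171
numeric_jobid() {
    printf '%s\n' "${1%%.*}"
}

# afterok_option FULLID (hypothetical helper): print the qsub option that
# makes a job wait for FULLID to finish successfully.
afterok_option() {
    printf '%s\n' "-W depend=afterok:$(numeric_jobid "$1")"
}

# Possible usage (not executed here; left unquoted on purpose so that the
# option splits into two words):
#   first=$(qsub -a long-29.123 -q sun-medium part1 | tail -n 1)
#   qsub -a long-29.123 -q sun-medium $(afterok_option "$first") part2
```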

Notes:

The dependency conditions with the suffix "ok" such as afterok are only satisfied if the job exits without errors (there are corresponding conditions ending in "any" which apply regardless of the exit status). If a job on which you have an "afterok" dependency does not exit successfully, the server deletes the second job, and sends you mail.

If you try to set up a job dependency on a jobid that does not exist (including a job that has already completed), you will not see an error on the command line, but the job will be rejected and you will receive a zephyr/email message like this:

  Authentic Personal message at 12:01:00 on Wed Jan  9 2002 from HYDROGEN.MIT.EDU
  From: Longjobs Master Server 
  PBS Job Id: 1173.hydrogen.mit.edu
  Job Name:   part3
  Aborted by PBS Server 
  Dependency request for job rejected by 1180.hydrogen.mit.edu
  Unknown Job Id

To change a dependency after you have submitted a job, use qalter -W depend=type:argument jobid; to remove the dependency completely, use qalter -W depend=type jobid (i.e., omit :argument). For example, above we had:

       
    athena% qstat -f 1175 | grep depend
    depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu

To make 1175 wait for 1172 instead of 1171:

	      athena% qalter -W depend=afterok:1172 1175
	      athena% qstat -f 1175 | grep depend
	      depend = afterok:1172.hydrogen.mit.edu@hydrogen.mit.edu

To clear the dependency:

	      athena% qalter -W depend=afterok 1175
	      athena% qstat -f 1175 | grep depend
	      athena%

My job quit unexpectedly, what's wrong?

Start by looking at the contents of the standard output and error files from the job. These will be in the directory from which you submitted the job, named jobname.ojobid and jobname.ejobid respectively (jobname is the first 15 characters of the script name, or STDIN, and jobid is the number assigned to the job when you submitted it). If there are multiple files like this and you don't recall the jobid, you can see which are most recent with ls -ltr.

Some sample errors, and things to check:

- `my_data: File exists.`: If your script does something like `crunch_data > my_data`, you should check that the file does not already exist from a previous run (you can have your script delete it first, or instead construct a unique filename using PBS_JOBID; see the notes on environment variables on the Job Scripts page).
- `crunch_data: Command not found`: If `crunch_data` is a command from a locker, does your script add that locker? (See the notes on lockers and path and testing on the Job Scripts page.) Does the command exist for the platform where your job ran? (E.g., if it is Sun-only, make sure you submit to one of the sun- queues, not the any- queue.)
- `Cannot access input.m: No such file or directory`: Does the file `input.m` exist in the job's working directory when the script runs? (See the notes on working directory and testing on the Job Scripts page.)
- `Cannot open /dev/tty` or `Cannot open display`: Recall that jobs execute without access to an X display or controlling terminal, so any program requiring such access will fail. Check if the program has an option for running in batch mode.

Try testing your script as described on the Job Scripts page.
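
One way to avoid the `File exists` error is to build the output filename from PBS_JOBID, as suggested above. The following is a minimal sketch, assuming a POSIX-compatible shell; `unique_output_name` is a hypothetical helper, and the `nojobid` fallback is an assumption for runs in which PBS_JOBID happens not to be set.

```shell
# unique_output_name BASE (hypothetical helper): append the job id to BASE
# so that output from different jobs never collides.
unique_output_name() {
    printf '%s.%s\n' "$1" "${PBS_JOBID:-nojobid}"
}

# Possible usage inside a job script (not executed here):
#   crunch_data > "$(unique_output_name my_data)"
```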

Why does it say "unauthorized" when I try to submit?

If you see a message like this:
```
qsub: Unauthorized Request 
```
it can mean several things:
- You specified an invalid account, e.g. one you have not been authorized to use, or one which has expired (type qusage for a list of all accounts which you can use, and check the Expires column). If you are not listed, or if your subscription has expired, you'll need to apply for or modify your subscription.
- You specified a restricted queue, which the account you specified is not authorized to use. To view the access setting on a queue, look at the groups field in the qstat -l -q queue-name listing. For example:
```
     athena% qstat -l -q 29.123-res
     Queue             Limit  Run  Que  State
     ----------------  -----  ---  ---  -----
     29.123-res        27:00    1    3  Restricted
           groups = 29.123
     
```
  means that only members of the group 29.123 may use 29.123-res.

My job hasn't finished yet, is something wrong?

When viewing queue status, keep in mind that:
- Most queues feed to several machines, but the number of machines assigned to a particular queue may vary (e.g., due to course reservations).
- Each machine runs only one job at a time, but a given machine may serve more than one queue.

Predicting wait time may be difficult. Unlike a simple print queue, this is not a strict first-in-first-out system; viewing the queues should give you some idea of system demand but scheduling order and wait time will depend on several factors (see the Checking Job Status page for more on this).

- To see detailed status of your own jobs, type:
```
   athena% qstat -l jobID
```
  or
```
   athena% qstat -l -u username
```
  Below the main line for each job, you will see an additional comment, for example:
```
     
  Job ID      Username  Queue         Jobname     Limit  State  Elapsed
  ------      --------  -----         -------     -----  -----  -------
  385         jqpublic  sun-medium    STDIN       06:00  Run    00:03
     Job started on Sun Feb 04 at 09:39
  386         jqpublic  sun-medium    matbg       00:10  Que    --
     Not Running: Not enough of the right type of nodes are available
  387         jqpublic  sun-long      STDIN       27:00  Que    --
     Not Running: No node can provide job's requested resources
```
  - Job 385 is currently running; comment shows when it started.
  - The comment for job 386 means that it is waiting normally, i.e. all machines serving that queue are busy with other jobs. (You would also see this message if the machines are offline for some reason; see notes below on checking machine status.)
  - The comment for job 387 means that there are currently no machines configured to handle that particular queue, or additional resources you specified yourself. (Please check the queue status with qstat -q and check for any general service status notes.)
- To view all jobs in a particular queue, use: qstat queue
  For example:
```
athena% qstat sun-long
Job ID      Username  Queue         Jobname     Limit  State  Elapsed
------      --------  -----         -------     -----  -----  -------
240         hmprof    sun-long      STDIN       27:00  Hold   --
379         jqpublic  sun-long      foo         27:00  Run    22:03
390         llta      sun-long      bar2        27:00  Run    13:37
403         llta      sun-long      bar3        15:00  Que    --
```
- To view a summary of all queues use: qstat -q
  For example:
```
athena% qstat -q
Queue             Limit  Run  Que  State
----------------  -----  ---  ---  -----
any-medium        06:00    0    0
any-long          27:00    0    0
linux-medium      06:00    0    0
linux-long        27:00    1    0
sun-medium        06:00    1    4
sun-long          27:00    0    1
29.123-res        27:00    1    3  Restricted
                         ---  ---
                           3    8
```
- To check the status of the machines serving the queues (e.g. if you suspect some sort of outage), type:
```
   athena% pbsnodes -l
```
  If there is a problem with any of the machines, you will see something like this:
```
   athena% pbsnodes -l
   terbium.mit.edu      offline, down
   holmium.mit.edu      offline
```
  Otherwise, you won't get any output.
  
  If multiple machines are listed as offline or down and there is no status notice on the main page to explain why, please notify Athena Consulting.
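
To get a quick sense of overall demand, the qstat -q summary can be totalled with awk. The following is a minimal sketch, assuming the column layout shown in the qstat -q example above (queue name, limit as hh:mm, running count, queued count); `queue_totals` is a hypothetical helper name, and the layout on the live system may differ.

```shell
# queue_totals (hypothetical helper): read a "qstat -q" listing on standard
# input and print the total number of running and queued jobs.
queue_totals() {
    # Only lines whose second field looks like a time limit (hh:mm) are
    # queue rows; the header, separator and totals lines are skipped.
    awk '$2 ~ /^[0-9]+:[0-9]+$/ { run += $3; que += $4 }
         END { printf "running=%d queued=%d\n", run, que }'
}

# Possible usage (not executed here):
#   qstat -q | queue_totals
```

Applied to the sample listing above, this prints `running=3 queued=8`, matching the totals at the bottom of that listing.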

How many jobs can I run simultaneously?

There is an overall system limit, which may change according to usage levels and the time of the academic year. There are other related limits (such as a per-queue user limit) which may be added or modified as needed. If a job is not running because it would exceed such a limit, you will see a message indicating this if you run qstat -l, e.g.:
```
  Not Running: User has reached server running job limit.
```
At this writing, the only limit is the system-wide maximum, set at 3. You can always check this with:
```
  athena% qmgr -c 'list server max_user_run'
  Server hydrogen.mit.edu
        max_user_run = 3
```
Keep in mind that other scheduling factors may influence how long your jobs wait in the queue.
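
If a script needs the current limit as a number, it can be extracted from the qmgr output. The following is a minimal sketch, assuming the output format shown above; `parse_max_user_run` is a hypothetical helper name.

```shell
# parse_max_user_run (hypothetical helper): read the output of
# "qmgr -c 'list server max_user_run'" on standard input and print the value.
parse_max_user_run() {
    # Split each line at "=" (plus any following spaces) and print what follows.
    awk -F'= *' '/max_user_run/ { print $2 }'
}

# Possible usage (not executed here):
#   limit=$(qmgr -c 'list server max_user_run' | parse_max_user_run)
```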

How can I submit multiple jobs without retyping my password each time?

By default, qsub uses your password to create a renewable Kerberos 5 ticket-granting ticket (TGT) for use by the job. By using its -tf option to forward an existing ticket, though, you can create a single new TGT to be shared by all the jobs you plan to submit, as follows:
1. Start a new shell.
2. Change the KRB5CCNAME and KRBTKFILE environment variable settings, to protect your existing ticket files. Set the variables to point at unique file names on the local disk, e.g.:
```
      setenv KRB5CCNAME /tmp/krb5cc_p$$
      setenv KRBTKFILE /tmp/krb4cc_p$$
```
3. Use kinit -r renewable_lifetime to create a new renewable TGT. (See the kinit(1) man page). The maximum renewable lifetime is currently 7 days. For example:
```
      athena% kinit -r 7d
      Password for jqpublic@ATHENA.MIT.EDU: 
```
4. Use qsub -tf to submit each job using the newly created ticket; you will not have to enter your password.
5. Run kdestroy to destroy the ticket files created in step 3 (IMPORTANT).
6. Exit the shell (to revert the environment variable settings changed in step 2).
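
The steps above can be sketched as a single shell function. This is only an illustration, assuming a POSIX-compatible shell rather than the csh syntax (`setenv`) used in step 2; `run`, `submit_all` and the `DRY_RUN` switch are hypothetical names introduced here. It also assumes that the account and queue are supplied through the QSUB variable or script directives.

```shell
# run (hypothetical helper): execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# submit_all SCRIPT... (hypothetical helper): submit several job scripts
# with one password prompt.
submit_all() {
    (
        # The subshell plays the role of the new shell in steps 1 and 6:
        # the variable changes below disappear when it exits.
        KRB5CCNAME=/tmp/krb5cc_p$$          # step 2
        KRBTKFILE=/tmp/krb4cc_p$$
        export KRB5CCNAME KRBTKFILE
        run kinit -r 7d || exit 1           # step 3: prompts for the password once
        for script in "$@"; do
            run qsub -tf "$script"          # step 4
        done
        run kdestroy                        # step 5 (IMPORTANT)
    )
}
```

Setting `DRY_RUN=1` before calling `submit_all part1 part2` prints the commands without running them, which is a safe way to check the sequence first.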

Why does qstat give a usage error?

There is a command name conflict with the qstat program in the outland locker (the outland qstat command displays the status of Quake servers, and gives an error when run without arguments). To make sure you run the longjobs qstat, use add -f longjobs when attaching the locker, run longjob -jobs, or invoke /mit/longjobs/bin/qstat explicitly.
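
To see which qstat your shell would run, you can classify the path it resolves to. The following is a minimal sketch, assuming a POSIX-compatible shell; `qstat_source` is a hypothetical helper, and the `/mit/outland/` prefix is an assumption based on the usual /mit/lockername layout.

```shell
# qstat_source PATH (hypothetical helper): say which locker a qstat path
# appears to come from.
qstat_source() {
    case "$1" in
        /mit/longjobs/*) echo longjobs ;;
        /mit/outland/*)  echo outland ;;   # assumed location of the outland locker
        *)               echo unknown ;;
    esac
}

# Possible usage (not executed here):
#   qstat_source "$(command -v qstat)"
```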

Who can I contact for help?
- Contact Athena Consulting by:
  - Typing olc at the athena% command prompt
  - Calling x3-4435
  - Using the Web Form
  - Stopping by the office in the Student Center in W20-021B (map to building W20)
Please provide as much information as possible about what commands you were using, any error messages you received, the Job ID, etc.

If your project is time-critical: Note that the Athena machines in the building 37 cluster (37-318, 37-332) are available for long unattended jobs on weekends. From 5:00 PM Fridays until 8:00 AM Mondays, Cluster staff will not log off users from unattended machines.

We will post any system status messages at the top of the main page; please check there first if you are encountering problems. (Occasionally larger network problems affect longjobs; you may also want to check for notes on http://nic.mit.edu/3down, the general IS Services Status page.)
Last modified: Wed Sep 21 17:23:43 EDT 2005