Longjobs -- Frequently Asked Questions

Platform | Operating System | Model | Memory (MB) | Disk (GB) | CPU | Speed (GHz) | Quantity |
---|---|---|---|---|---|---|---|
Sun | Solaris | Sun Fire V210 | 2048 | 36 | 2 UltraSPARC-IIIi | 1.0 | 6 |
PC | Linux | IBM X335 | 1024 | 36 | Intel Xeon | 3.2 | 6 |
Use which to check that an executable is available on the path, and echo to write an output file, e.g.:

    cd $PBS_O_WORKDIR
    rm -f my_data
    add 29.123
    # crunch_data > my_data
    which crunch_data
    echo "test" > my_data

    cd $PBS_O_WORKDIR
    add matlab
    # matlab -tty < input.m
    matlab -tty < tiny_input.m

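The same quick checks can be wrapped in a small helper. This is only a sketch, and it assumes a Bourne-style shell (sh) rather than the csh syntax used elsewhere on this page; check_job_prereqs is a hypothetical name, not something provided by longjobs.

```shell
#!/bin/sh
# Hypothetical helper (not part of longjobs): report whether each named
# command is on the PATH, then confirm that the output file can be written.
# Usage: check_job_prereqs OUTPUT_FILE COMMAND...
check_job_prereqs() {
    outfile=$1
    shift
    status=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            status=1
        fi
    done
    # Write a short marker, as the echo "test" line in a test script does.
    if { echo "test" > "$outfile"; } 2>/dev/null; then
        echo "writable: $outfile"
    else
        echo "not writable: $outfile"
        status=1
    fi
    return $status
}
```

In a test version of a job script this might be called as check_job_prereqs my_data crunch_data. A non-zero return status means at least one check failed.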
Run the testjob program on the script prepared above.

    athena% testjob script_name

Example:

    athena% add longjobs
    athena% testjob foo
    [Ignoring -a script directive]
    Note that any locker dependencies will not be tested.
    You have attached the following lockers; please ensure that
    your script and/or dotfiles attach or add them as needed:
    matlab longjobs infoagents 29.123
    Executing foo...
    Process exited with status 0
    The standard output stream is in foo.out
    The standard error stream is in foo.err

testjob runs on your local machine and will have access to any lockers you attached or added before running it, unlike the actual longjobs system, which will have access to lockers only if your script or dotfiles attach/add them. As shown above, testjob will show you a list of the lockers which are attached when it runs, as a reminder to you to check your job's locker dependencies.
testjob deals with output, qsub options, etc.
1: Using the QSUB environment variable

Set the QSUB environment variable to a single string containing all of the options you want to use. For example:

    setenv QSUB "-a 29.123 -q sun-long"

will specify 29.123 as the account and sun-long as the queue. With this set, all you would have to type to submit a job would be:

    athena% qsub script_name

(You may add such a setenv command to your ~/.environment file for use during future logins, or type it at the command line for use in the current session only.)

2: Using directives in your job script

At the beginning of your job script, add a single line for each qsub option, with the following syntax (PBS stands for Portable Batch System, the software on which longjobs is based):

    #PBS -flag option

For example:

    #PBS -a 29.123
    #PBS -l walltime=10:00:00

will specify 29.123 as the account and set a time limit of 10 hours. Note that directives must be placed at the beginning of the job script, before any commands (if there are any commands in the script before a directive, the directive will be ignored).

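To see which directives in a script sit before the first command, a rough check along the following lines may help. It is a sketch that only approximates how the batch system scans a script, written for a Bourne-style shell with awk; list_pbs_directives is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the "#PBS" directives that appear before the
# first command in a job script. Other comment lines and blank lines are
# skipped; scanning stops at the first remaining line, so any directive
# placed after a command is not printed.
list_pbs_directives() {
    awk '
        /^#PBS[ \t]/               { print; next }  # directive in the header
        /^[ \t]*#/ || /^[ \t]*$/   { next }         # comment or blank line
                                   { exit }         # first command: stop
    ' "$1"
}
```

Running list_pbs_directives script_name before submitting shows the directives that this approximation would treat as effective; a directive missing from the output is probably placed after a command.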
QSUB environment variable.

qsub man page.
Use the qsub -k option to specify this at submit time, for either one or both of the files. Note, however, that it will flush the file(s) to your homedir, not the submit directory. When attempting to start the job, if the system is unable to open the file(s) for writing for any reason, it will abort the job. If the job starts successfully but there are subsequent write errors (e.g., due to insufficient quota), it is up to whatever program is running to handle them. Since few programs test for errors writing to stdout/stderr, such output would most likely be lost (as opposed to the default setting, where the system will mail you the file if for some reason it was unable to write it back to your submit dir at the end of the job). The possible values are:

    -k e     standard error only
    -k o     standard output only
    -k eo    both standard error and standard output
    -k n     neither

Use the qsub -W depend=dependency_list option to define dependencies between jobs. There are a number of ways to do this, as explained on the qsub man page. We will give one simple example here.

Suppose you have two job scripts, part1 and part2, and want to submit both jobs at the same time, but have the system wait until part1 finishes before it starts part2:

    athena% qsub -a long-29.123 -q sun-medium part1
    [Creating a renewable Kerberos ticket for use by the job]
    Password for jqpublic@ATHENA.MIT.EDU:
    [Forwarding renewable Kerberos 5 TGT...]
    1171.hydrogen.mit.edu

Then submit part2 with -W depend=afterok:jobid-to-wait-for, e.g.:

    athena% qsub -a long-29.123 -q sun-medium -W depend=afterok:1171 part2
    [Creating a renewable Kerberos ticket for use by the job]
    Password for jqpublic@ATHENA.MIT.EDU:
    [Forwarding renewable Kerberos 5 TGT...]
    1175.hydrogen.mit.edu

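When the two submissions are scripted rather than typed, the job ID printed for the first job has to be turned into the dependency argument for the second. The sketch below shows one way to do that in a Bourne-style shell. The helper names are hypothetical, and it assumes that qsub prints the new job ID as the last line of its standard output, as in the transcripts above.

```shell
#!/bin/sh
# Hypothetical helpers for chaining two jobs.

# Reduce a full job ID such as "1171.hydrogen.mit.edu" to "1171".
job_number() {
    echo "${1%%.*}"
}

# Build the argument for qsub -W from a job ID.
afterok_arg() {
    echo "depend=afterok:$(job_number "$1")"
}

# Intended use (not run here, since it needs the longjobs system):
#   first=$(qsub -a long-29.123 -q sun-medium part1 | tail -n 1)
#   qsub -a long-29.123 -q sun-medium -W "$(afterok_arg "$first")" part2
```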
If you run qstat, you will see the second job is in the Hold state. You can check a job's dependencies by viewing the complete status with qstat -f, e.g.:

    athena% qstat -f 1175 | grep depend
    depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu

Dependencies such as afterok are only satisfied if the job exits without errors (there are corresponding conditions ending in "any" which apply regardless of the exit status). If a job on which you have an "afterok" dependency does not exit successfully, the server deletes the second job and sends you mail.

    Authentic Personal message at 12:01:00 on Wed Jan 9 2002
    from HYDROGEN.MIT.EDU
    From: Longjobs Master Server
    PBS Job Id: 1173.hydrogen.mit.edu
    Job Name: part3
    Aborted by PBS Server
    Dependency request for job rejected by 1180.hydrogen.mit.edu
    Unknown Job Id

To change a dependency, use qalter -W depend=type:argument jobid; to remove the dependency completely, use qalter -W depend=type jobid (i.e., omit :argument). For example, above we had:

    athena% qstat -f 1175 | grep depend
    depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu

To make job 1175 depend on job 1172 instead:

    athena% qalter -W depend=afterok:1172 1175
    athena% qstat -f 1175 | grep depend
    depend = afterok:1172.hydrogen.mit.edu@hydrogen.mit.edu

To remove the dependency completely:

    athena% qalter -W depend=afterok 1175
    athena% qstat -f 1175 | grep depend
    athena%

jobname.ojobid and jobname.ejobid respectively (jobname is the first 15 characters of the script name, or STDIN, and jobid is the number assigned to the job when you submitted it). If there are multiple files like this and you don't recall the jobid, you can see which are most recent with ls -ltr.
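As a worked example of the naming rule, the following sketch prints the two file names to expect for a given script name and job number. It assumes a Bourne-style shell and the usual PBS convention that .o holds standard output and .e holds standard error; output_file_names is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the expected output file names for a job.
# Usage: output_file_names SCRIPT_NAME JOB_NUMBER
output_file_names() {
    # Keep only the first 15 characters of the script name.
    jobname=$(printf '%.15s' "$1")
    echo "$jobname.o$2"    # standard output
    echo "$jobname.e$2"    # standard error
}
```

For a script named part1 submitted as job 1171, this prints part1.o1171 and part1.e1171.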
Error message | What to check |
---|---|
my_data: File exists. | If your script does something like crunch_data > my_data, you should check that the file does not already exist from a previous run (you can have your script delete it first, or instead construct a unique filename using PBS_JOBID; see the notes on environment variables on the Job Scripts page). |
crunch_data: Command not found | If crunch_data is a command from a locker, does your script add that locker? (See the notes on lockers and path and testing on the Job Scripts page.) Does the command exist for the platform where your job ran? (E.g., if it is Sun-only, make sure you submit to one of the sun- queues, not the any- queue.) |
Cannot access input.m: No such file or directory | Does the file input.m exist in the job's working directory when the script runs? (See the notes on working directory and testing on the Job Scripts page.) |
Cannot open display or Cannot open /dev/tty | Recall that jobs execute without access to an X display or controlling terminal, so any program requiring such access will fail. Check if the program has an option for running in batch mode. |

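For the "File exists" case, one way to build a unique filename from PBS_JOBID is sketched below for a Bourne-style shell. The helper name and the "test" fallback used outside the batch system are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append the job ID to a base file name so that each
# run writes to its own file. PBS_JOBID is set by the batch system inside a
# job; "test" is an assumed fallback for runs outside the batch system.
unique_output_name() {
    echo "$1.${PBS_JOBID:-test}"
}

# Intended use in a job script:
#   crunch_data > "$(unique_output_name my_data)"
```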
If qsub gives the error

    qsub: Unauthorized Request

it can mean several things:

(Run qusage for a list of all accounts which you can use, and check the Expires column.) If you are not listed, or if your subscription has expired, you'll need to apply for or modify your subscription.

The queue may be restricted to certain groups; check the groups field in the qstat -l -q queue-name listing. For example:

    athena% qstat -l -q 29.123-res
    Queue            Limit Run Que State
    ---------------- ----- --- --- -----
    29.123-res       27:00   1   3 Restricted
        groups = 29.123

means that only members of the group 29.123 may use 29.123-res.


    athena% qstat -l jobID

or

    athena% qstat -l -u username

Below the main line for each job, you will see an additional comment, for example:

    Job ID Username Queue      Jobname Limit State Elapsed
    ------ -------- -----      ------- ----- ----- -------
    385    jqpublic sun-medium STDIN   06:00 Run   00:03
           Job started on Sun Feb 04 at 09:39
    386    jqpublic sun-medium matbg   00:10 Que   --
           Not Running: Not enough of the right type of nodes are available
    387    jqpublic sun-long   STDIN   27:00 Que   --
           Not Running: No node can provide job's requested resources

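If you have many jobs, the comment lines can be pulled out of the listing with a small filter. The sketch below is for a Bourne-style shell with awk, and not_running_reasons is a hypothetical name. It assumes the layout shown above: a job line whose first field is the numeric job ID, followed by a comment line.

```shell
#!/bin/sh
# Hypothetical filter: read a "qstat -l" listing on standard input and print
# "jobid: reason" for each job whose comment contains "Not Running:".
not_running_reasons() {
    awk '
        # A job line: remember its ID for the comment that follows.
        $1 ~ /^[0-9]+$/ { id = $1; next }
        # A comment line: strip everything up to the reason and print it.
        /Not Running:/  { sub(/^.*Not Running: */, ""); print id ": " $0 }
    '
}

# Intended use (needs the longjobs system):
#   qstat -l -u username | not_running_reasons
```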
qstat -q and check for any general service status notes.)
qstat queue

    athena% qstat sun-long
    Job ID Username Queue    Jobname Limit State Elapsed
    ------ -------- -----    ------- ----- ----- -------
    240    hmprof   sun-long STDIN   27:00 Hold  --
    379    jqpublic sun-long foo     27:00 Run   22:03
    390    llta     sun-long bar2    27:00 Run   13:37
    403    llta     sun-long bar3    15:00 Que   --

qstat -q

    athena% qstat -q
    Queue            Limit Run Que State
    ---------------- ----- --- --- -----
    any-medium       06:00   0   0
    any-long         27:00   0   0
    linux-medium     06:00   0   0
    linux-long       27:00   1   0
    sun-medium       06:00   1   4
    sun-long         27:00   0   1
    29.123-res       27:00   1   3 Restricted
                           --- ---
                             3   8


    athena% pbsnodes -l

If there is a problem with any of the machines, you will see something like this:

    athena% pbsnodes -l
    terbium.mit.edu      offline, down
    holmium.mit.edu      offline

Otherwise, you won't get any output.

qstat -l, e.g.:

    Not Running: User has reached server running job limit.

At this writing, the only limit is the system-wide maximum, set at 3. You can always check this with:

    athena% qmgr -c 'list server max_user_run'
    Server hydrogen.mit.edu
        max_user_run = 3

Keep in mind that other scheduling factors may influence how long your jobs wait in the queue.

qsub uses your password to create a renewable Kerberos 5 ticket-granting ticket (TGT) for use by the job. By using its -tf option to forward an existing ticket, though, you can create one new TGT to be used by each job you plan to submit, as follows:

Change your KRB5CCNAME and KRBTKFILE environment variable settings, to protect your existing ticket files. Set the variables to point at unique file names on the local disk, e.g.:

    setenv KRB5CCNAME /tmp/krb5cc_p$$
    setenv KRBTKFILE /tmp/krb4cc_p$$

Use kinit -r renewable_lifetime to create a new renewable TGT. (See the kinit(1) man page.) The maximum renewable lifetime is currently 7 days. For example:

    athena% kinit -r 7d
    Password for jqpublic@ATHENA.MIT.EDU:

Use qsub -tf to submit each job using the newly created ticket; you will not have to enter your password.

Use kdestroy to destroy the ticket files created in step 3 (IMPORTANT).

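The whole sequence can be sketched as a single function. The version below is written for a Bourne-style shell, so it uses export where the csh examples above use setenv. The names run, DRY_RUN and submit_with_one_ticket are hypothetical. By default it only prints the commands it would run, so that the sequence can be checked before it is used for real.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) each command is printed
# instead of executed; set DRY_RUN=0 to run the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Submit every script named on the command line using one new ticket.
# The body runs in a subshell, so the changed Kerberos variables do not
# affect the shell that called it.
submit_with_one_ticket() (
    KRB5CCNAME=/tmp/krb5cc_p$$
    KRBTKFILE=/tmp/krb4cc_p$$
    export KRB5CCNAME KRBTKFILE

    run kinit -r 7d || exit 1      # asks for the password once
    for script in "$@"; do
        run qsub -tf "$script"     # forwards the ticket created above
    done
    run kdestroy                   # destroy the new ticket files
)
```

For example, submit_with_one_ticket part1 part2 prints one kinit line, two qsub -tf lines and one kdestroy line. Any other qsub options are assumed to come from the QSUB environment variable or from directives in the scripts.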
Why does qstat give a usage error?

You are probably running the qstat program in the outland locker. (The outland qstat command displays the status of Quake servers, and gives an error when run without arguments.) To make sure you run the longjobs qstat, use add -f longjobs when attaching the locker, run longjob -jobs, or invoke /mit/longjobs/bin/qstat explicitly.
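A quick way to confirm which qstat your shell will run is to look at its path. The sketch below is for a Bourne-style shell and uses a hypothetical helper, is_longjobs_qstat. It assumes that the longjobs version always lives somewhere under /mit/longjobs/.

```shell
#!/bin/sh
# Hypothetical check: succeed if the given path is under the longjobs locker.
is_longjobs_qstat() {
    case "$1" in
        /mit/longjobs/*) return 0 ;;
        *)               return 1 ;;
    esac
}

# Intended use:
#   is_longjobs_qstat "$(command -v qstat)" ||
#       echo "qstat does not come from the longjobs locker"
```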