Longjobs Job Scripts

Athena Longjobs -- Using Job Scripts
On this page: Basic Template | MATLAB | MSI/Cerius2 | Script Details | Testing Scripts

A job script essentially consists of the sequence of commands you would type at the Athena prompt to run your job by hand -- the differences are that the programs you run must all be non-interactive, and that you may want to add a few commands to set things like the working directory or output paths. More complete information is in the Script Details section below.

Basic template

  ######################################################################## 
  # set job's working directory, if you want it to be other than ~/ 
  # (this example uses the submit directory, as explained below)
  cd $PBS_O_WORKDIR

  # remove or clean up any output from a previous run
  rm -f results

  # add or attach any lockers you will need aside from your homedir
  add foo bar

  # run commands, invoke other scripts, etc.
  tweedledee
  tweedledum > results
  ########################################################################

Basic notes on output, lockers, and cwd:

Whenever referring to a locker other than your homedir, make sure your script has attached that locker first. The system attaches your homedir for you, but doesn't automatically add or attach any other lockers unless you have such commands in your dotfiles (see the dotfiles section below for more information on how dotfiles are handled).
The system uses an environment variable PBS_O_WORKDIR to store the directory from which you submit the job; the first command above is a convenient way to make this be the job's working directory.
By default, standard output and error will be saved to your submit directory in two files, jobname.ojobid and jobname.ejobid; if this fails for any reason, the system will instead email them to you. (This email fallback does not apply to files your job creates directly.)
If your script does not cd or otherwise provide a different path, the system will attempt to save any files created by your job to your home directory. In either case, the system does not provide a fallback if the write fails (and you will receive an error message only if the program attempting to create the file provides one).
Some common reasons a write would fail:
- locker is over quota
- path is to a locker which isn't attached
- path is to a locker where you don't have write permission
Note that your job can make use of local disk space (e.g. in /tmp) on the execution machine, but your job must include the necessary commands for copying anything you want to save back to a locker.
Jobs may be rerun by the service in case of certain server-failure situations. In this case, the standard output and error streams are simply recreated, but there may be issues with files your job writes to directly. If it is not feasible to construct your job to handle the possible existence of data from a previous run, you should make the job non-rerunnable by submitting with the -rn option (in qsub, a script directive, or the QSUB environment variable).
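
One way to make a job tolerate a rerun is to discard leftovers from an interrupted run and give the output its final name only once the command has finished. The lines below are a minimal sketch of that idea, not a longjobs feature; the file names are hypothetical, and the echo merely stands in for your real command.

```shell
# Hypothetical file names; "echo" stands in for the real computation.
# Discard anything a previous (possibly interrupted) run left behind.
rm -f results.partial

# Write to a temporary name, and rename only if the command succeeded,
# so that a file called "results" always holds a complete run.
echo "stand-in for real output" > results.partial && mv -f results.partial results
```

If a rerun is not acceptable at all, submitting with -rn remains the simpler choice.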

See also the sections below for more Script Details and tips on Testing Scripts.

MATLAB

Preparing the M-file:

For plots, note that print -djpeg and print -dtiff will not work (they require an X display); as a workaround you can either print to a different format and convert later (see imread/imwrite), or have your M-file do calculations only, and generate plots from the output data later.
If you are running the job on an SGI using MATLAB 5 (i.e., if you are specifying "matlab -ver 5.3.1" rather than using the current default), you must include the quit command at the end of your M-file. Otherwise, MATLAB 5 on SGIs goes into an endless loop of "Missing variable or function" complaints, apparently misinterpreting end-of-file. The quit is not necessary on Suns, but it is harmless there (unless you run the M-file interactively, in which case it will close MATLAB). Alternatively, you can make the quit conditional, e.g. if strcmp(computer, 'SGI') quit; end

Calling MATLAB in the job script:

Syntax for running MATLAB in non-interactive mode:
matlab -tty < infile.m > outfile
If you would rather have the results as standard output (which longjobs will save to jobname.ojobid by default):
matlab -tty < infile.m

Sample job script:

  #############################################################
  cd $PBS_O_WORKDIR
  rm -f results
  add matlab
  matlab -tty < myfile.m > results
  #############################################################

MSI/Cerius2MS

Basic syntax for running Cerius2 in non-interactive mode:
cerius2ms -n infile
or
cerius2ms -n -o outfile infile
if you would rather have standard output saved to outfile (in the first case, longjobs will save it to jobname.ojobid).
Some application modules (for example, those on the QUANTUM cards) are actually interfaces to programs independent of Cerius2, which the main program may launch in a detached state. For such a process to complete in the batch environment, you may need to launch it explicitly after the Cerius2 command; consult your instructor for help if the templates here don't cover your case.

CASTEP template:

  #############################################################
  cd $PBS_O_WORKDIR

  #clear previous job status, if any
  rm -f ~/.MSIcastepstat

  add molsim
  cerius2ms -n setup.log
  castep-wrap
  #############################################################

The castep-wrap command does the final processing on the intermediate files generated by the CASTEP/CREATE_FILES command in the Cerius2 input file.

Script Details

A job script may consist of:

script directives to specify submission options (lines beginning with #PBS, as explained in the Script Directives section below)
comments (lines beginning with # which are not directives) and blank lines
executable commands

Defaults and Caveats

shell and #!

The script contents are given as standard input to the shell (tcsh unless you have changed your login shell). Note that a #! line in the script will be ignored; if you want to use a script with a different interpreter, it is best to keep it in a separate file and invoke that file from within the job script.

lockers and path

The system attaches your home directory before starting your job, so the usual /mit/username path is available; it does not attach or add other lockers unless you have such commands in your dotfiles (see below), or include them in your job script. If you have an appropriate binary directory set up under your homedir (i.e., following the Athena locker organization conventions described in the lockers man page), it will be added to the path automatically, assuming you are using the default Athena shell; see the dotfiles section below for more details.

If your job will write output files directly (rather than using stdout), make sure that you save these files before the end of the job to an attached locker with sufficient quota, and where you have write access. The service will attempt to rerun your job in certain server-failure cases; this may cause problems with files your job writes to directly (as opposed to standard output and error streams, which the service will simply recreate). If it is not feasible to construct your job to handle the possible existence of data from a previous run, you should make the job non-rerunnable by submitting with the -rn option (in qsub, a script directive, or the QSUB environment variable).

Note that you may use the local disk on the execution machine for intermediate processing (e.g., you may want to have your job generate large data files in /tmp, then compress them before saving the final output in a locker).
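
As a rough sketch of that pattern: the fragment below works in a scratch directory under /tmp, compresses the result, and copies it to the home directory, removing the scratch copy only if the save succeeded. All of the names (the scratch directory, the data file, and ~/my_jobs) are hypothetical, and the echo stands in for whatever program generates your data.

```shell
# Hypothetical names: the scratch directory, the data file, and the
# destination ~/my_jobs. "echo" stands in for the real program.
mkdir -p /tmp/my_job_scratch
cd /tmp/my_job_scratch

# Generate the (potentially large) intermediate file on local disk.
echo "stand-in for large intermediate data" > big_output.dat

# Compress it before saving.
gzip -f big_output.dat

# Save to the home directory, which the system has already attached.
# The scratch copy is removed only if the copy succeeded, since the
# system provides no fallback when a write to a locker fails.
mkdir -p ~/my_jobs
cp big_output.dat.gz ~/my_jobs/ && cd ~/my_jobs && rm -rf /tmp/my_job_scratch
```

To save into a locker other than your homedir, the script would also need to add or attach that locker first.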

working directory

By default, jobs start in the user's home directory. Since it is common to want the job's working directory to be the same as the cwd from which you submit the job, the system sets an environment variable PBS_O_WORKDIR to this value. For example, if you run the qsub command from the directory ~/my_jobs/29.123 and your script contains the lines:

     cd $PBS_O_WORKDIR
     ./my_program > my_data

it will look for an executable ~/my_jobs/29.123/my_program (rather than the default ~/my_program) and will create the output file ~/my_jobs/29.123/my_data (rather than the default ~/my_data).

standard output and error

By default, the system will create one file each for the job's standard output and error streams; at the end of the job it attempts to save these back to the directory from which the job was submitted (if it fails, it will instead email them to you). The filenames are constructed from the job name and ID number generated by the system, for uniqueness. For example, if the script foo is submitted from ~/my_jobs, it generates respective error and output files:

     ~/my_jobs/foo.e483
     ~/my_jobs/foo.o483

dotfiles

The system runs your shell as a non-login shell, which means that:

~/.login is not sourced
~/.cshrc is sourced
~/.environment and ~/.cshrc.mine are sourced if they exist

assuming you are using the Athena-default ~/.cshrc, which uses the system-wide startup files (for more information, see the Athena Dotfiles publication). If you are using a different shell (in particular bash), you may need to source its associated dotfiles in your script.

Note that the job may fail if you have included commands which attempt to set terminal characteristics in one of the sourced dotfiles. Any such command should be skipped by adding a test for the environment variable PBS_ENVIRONMENT, for example:

     setenv PRINTER kiwi
     setenv LPROPT "-h -z"
     if ( ! $?PBS_ENVIRONMENT ) then
        # commands that set terminal characteristics go here
     endif

job environment variables

The system sets up several environment variables for the job which may be useful in scripts, including:

PBS_O_WORKDIR = Absolute path of directory from which the job was submitted.

PBS_ENVIRONMENT = PBS_BATCH. Useful for conditionalizing on a longjobs session vs. regular login, e.g.:

        if ( $?PBS_ENVIRONMENT ) then
            cd $PBS_O_WORKDIR
            # other longjobs-specific commands go here
        endif
        # commands common to longjobs and login sessions go here

PBS_JOBID = Full JobID assigned by the system, e.g. 488.longm.mit.edu. Can be useful for constructing unique filenames. Note that the full JobID includes a hostname (for the master server), which may be omitted when used in most longjobs commands.
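
For instance, a job might append the JobID to its output file name so that successive runs never overwrite each other. This is only a sketch: the file name my_data is hypothetical, the echo stands in for your real command, and it assumes the job runs on the actual system (testjob does not set PBS_JOBID).

```shell
# Hypothetical file name; "echo" stands in for the real program.
# On the actual system PBS_JOBID looks like 488.longm.mit.edu, so this
# writes to a file such as my_data.488.longm.mit.edu.
echo "stand-in for real output" > my_data.$PBS_JOBID
```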

For more details, see the qsub man page, DESCRIPTION section.

Script Directives

The job script may begin with directives to specify qsub options using this syntax:

#PBS -flag option

For example:

    #PBS -a 29.123
    #PBS -l walltime=10:00:00

will specify 29.123 as the account and set a time limit of 10 hours. (For details on qsub options, see the Running Jobs page.)

Notes:

Directives must be placed at the beginning of the job script, before commands (if there are any commands in the script before a directive, the directive will be ignored).
Command line options take precedence over script directives, and script directives take precedence over the QSUB environment variable.
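
Putting these rules together, a script with directives might look like the sketch below. The job name short_test and the file name results are hypothetical, the echo stands in for your real command, and -N is the jobname option that appears in the option list under Testing Your Scripts.

```shell
#PBS -a 29.123
#PBS -N short_test
#PBS -l walltime=00:10:00

# Commands start here; a #PBS line placed below this point would be ignored.
rm -f results
echo "stand-in for the real command" > results
```

Because command-line options take precedence, submitting this script with a different -l walltime value on the qsub command line would override the 10-minute limit given in the directive.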

Testing Your Scripts

If you are new to the system or are running a job with a different application than usual, it's a good idea to test a short version of your script to make sure you haven't overlooked something that might keep it from running as expected. The testjob program allows you to do this without actually submitting it to the system (it simulates the longjobs environment on your local workstation so you can test your script without having to wait in a queue or use up quota).

To test your job script:

Prepare a version of your script to run a short job
Run the short job (either using the testjob utility, or on the actual system)

Preparing the script to run a short job

There are two ways you can do this:

Comment out the long-running commands, so that they don't actually execute; instead, you might want to use which to check that an executable is available on the path, and echo to write an output file, e.g.:
```
       cd $PBS_O_WORKDIR
       rm -f my_data
       add 29.123
       # crunch_data > my_data
       which crunch_data
       echo "test" > my_data
```

If possible, run the real commands, but use trivial parameters so that it completes within a few minutes, e.g.

       cd $PBS_O_WORKDIR
       add matlab
       # matlab -tty < input.m
       matlab -tty < tiny_input.m

Running the short job with testjob

Basic syntax:

   athena% testjob script_name

example:

   athena% add longjobs
   athena% testjob foo       
   [Ignoring -a script directive]
   Note that any locker dependencies will not be tested.  You
   have attached the following lockers; please ensure that
   your script and/or dotfiles attach or add them as needed:

   matlab
   longjobs
   infoagents
   29.123

   Executing foo...
   Process exited with status 0
   The standard output stream is in foo.out
   The standard error stream is in foo.err

Notes on testjob (for more information, see the testjob man page)
qsub options
It accepts the following subset of qsub options from the command-line, script directives, and QSUB environment variable (in that order of precedence, command-line being highest):
```
	[-C directive_prefix] [-e error_file] [-j oe|eo|n] [-N jobname]
	[-o output_file] [-S shell] [-tn] [-v variable_list] [-V]
```
If you specify other qsub options via QSUB or script directives, it will simply ignore them (with a message as in the above example).
lockers
It shows you a list of lockers you have attached, as a reminder to make sure that your script or dotfiles add/attach these as needed for your job. Recall that the longjobs system attaches your home directory before starting your job, but it does not attach or add other lockers unless you include such commands in your job script or have them in dotfiles (see the defaults section above for more details). You may use testjob -n script_name to suppress this output.
standard error and output
The files where each stream is saved differ slightly from those you would get from the actual system: they are named script_name.err and script_name.out respectively, but without a JobID number. If you run the program again in the same directory with the same script file without deleting the error/output files, the streams will be appended to them.
working directory
As on the actual system, jobs start in the user's home directory by default; the standard error and output files are saved to the directory from which you run the testjob command.
authentication
Unlike the actual system, it does not prompt you for a password to create a renewable, forwardable Kerberos ticket for the job (it just uses your current tickets and tokens, since the intent is just to test a short job).
job environment variables
Environment variables available to a job running on the actual system such as PBS_O_WORKDIR, PBS_ENVIRONMENT, and PBS_JOBNAME will also be available via testjob; others such as PBS_JOBID will not. For complete details, see the testjob man page.

Running the short job on the actual system

If the system is not busy, you may prefer to test your script by actually submitting a short version of it. Note that if you submit with the longjob -submit command, you will be prompted for a time limit. If you submit with qsub you can specify a short limit with the option:

  -l walltime=hh:mm:ss

For example, for a 10-minute limit use the following:

  athena% qsub -l walltime=00:10:00 ...

(The system will interpret a single value as seconds and xx:yy as minutes:seconds; to avoid confusion it's best to specify all three time fields.)


Last modified: Thu, Jan 2 2003