This document presents an overview of the Longjobs Test service, with an emphasis on network and security aspects. The software used in the Longjobs Test service is based on PBS, the Portable Batch System (see http://www.openpbs.org). PBS supports our desired model of a master server dispatching jobs to a set of execution (slave) servers as they become available. There are three server components in a PBS system -- the master server program (referred to simply as the server), the scheduler, and the execution server, called MOM. The scheduler is provided as a separate program to promote modularity and flexibility; it is expected that sites will modify or even replace it, and it is possible to run the scheduler and server on different machines. No PBS daemon is needed to run any client programs; all requests are exchanged directly with the master server. In our implementation, the master server and scheduler will run on one master host, and execution servers will run on a pool of slave hosts.

I. Protocols

PBS employs several different protocols for communicating between its various components. As distributed, all communication is IP-based; authentication and authorization are done via standard UNIX mechanisms -- using the IP address for host-based access control, and privileged port numbers from trusted hosts for user-based control. For the latter, the unmodified PBS client programs invoke a setuid program to authenticate a newly opened connection; much of the focus of our work has been on changing this model to one using Kerberos-based authentication and authorization, as well as adding the server functionality for managing tickets and tokens.

The main protocol uses a standard message format for "requests" (commands and queries) and replies, as the basis of communication between client programs and the PBS server, between the server and MOM (the execution server), and between two master servers (used in a request forwarding scheme we are unlikely to employ). Communication is done via a TCP connection to a known port. Requests are structured as a message header, body, and extension. The fixed-format header identifies the request type and user. The message body format varies by request type. The extension is optionally used for additional string data. Internally, these structures contain both string and integer data. They are encoded by the sender into a system-independent string format, and decoded back to internal format by the receiver, using an encoding scheme called Data is Strings (DIS). Similarly, replies are structured as a fixed header, including return and auxiliary codes, and a per-type variable body. Most requests consist of a single message; the Queue Job request involves several messages (sub-requests). The protocol supports aborting a multi-message request, and recovery from a broken network connection. [See the PBS "External Reference Specification" for more information, at https://web.mit.edu/longjobs/pbs/doc/pbs_extref.ps]

For our test set-up, we have extended the protocol to support a new request type, for doing Kerberos 5 mutual authentication. When a connection to the server is established, no requests will be permitted until the Kerberos authentication request is completed successfully; the client's Kerberos principal will be saved as part of the connection data, and be available when authorizing subsequent requests made on that connection. This exchange replaces the traditional authentication and authorization based on IP address, privileged port, and user name (see below).
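As an illustration of this authentication step, the sketch below shows how a client might perform Kerberos 5 mutual authentication on a newly opened TCP connection to the master server, using the standard MIT Kerberos sendauth interface. This is a minimal sketch, not the actual request encoding used by our modified PBS: the function name and the application version string are hypothetical, and the target service principal (daemon/host) is an assumption based on the "daemon" service name mentioned under Authorization below.

    /* Sketch only: authenticate a new connection to the master server with
     * Kerberos 5 mutual authentication before sending any batch requests.
     * The version string "longjobs-pbs-1.0" is a placeholder. */
    #include <krb5.h>

    static krb5_error_code authenticate_connection(int sock, const char *master_host)
    {
        krb5_context ctx;
        krb5_auth_context auth_ctx = NULL;
        krb5_principal server = NULL;
        krb5_ccache cc = NULL;
        krb5_error *err_ret = NULL;
        krb5_ap_rep_enc_part *rep = NULL;
        krb5_error_code ret;

        if ((ret = krb5_init_context(&ctx)))
            return ret;
        if ((ret = krb5_cc_default(ctx, &cc)))
            goto out;
        /* Build the server's service principal, e.g. daemon/host@ATHENA.MIT.EDU */
        if ((ret = krb5_sname_to_principal(ctx, master_host, "daemon",
                                           KRB5_NT_SRV_HST, &server)))
            goto out;
        /* AP_OPTS_MUTUAL_REQUIRED makes the server prove its identity as well. */
        ret = krb5_sendauth(ctx, &auth_ctx, (krb5_pointer)&sock,
                            "longjobs-pbs-1.0", NULL, server,
                            AP_OPTS_MUTUAL_REQUIRED, NULL, NULL, cc,
                            &err_ret, &rep, NULL);
        if (rep)
            krb5_free_ap_rep_enc_part(ctx, rep);
        if (err_ret)
            krb5_free_error(ctx, err_ret);
    out:
        if (auth_ctx)
            krb5_auth_con_free(ctx, auth_ctx);
        if (server)
            krb5_free_principal(ctx, server);
        if (cc)
            krb5_cc_close(ctx, cc);
        krb5_free_context(ctx);
        return ret;
    }

On success, the principal obtained by the server's corresponding recvauth call is what gets saved with the connection and consulted for all later requests on it.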
We also extended the Queue Job request to support transmitting optional job credentials (a TGT). There were placeholders in the code for handling generic credentials, which made this somewhat easier. We also extended the DIS library to provide for encryption, using Kerberos 5; this is currently used only in certain request types, most notably when transmitting a TGT.

In addition to the request protocol, there are several other protocols employed between the various server components, which we have not Kerberized. The PBS server connects to the scheduler periodically to tell it to perform a scheduling pass. The connection is kept open while the scheduler makes requests, as a privileged client, to run jobs and obtain server information. Since we will run the scheduler and server on the same host, we have added a UNIX domain (local) socket for this connection, and disabled the scheduler's TCP listening socket. The scheduler connects periodically to each MOM server, using the "Resource Monitor" protocol to query the execution machine's status (e.g. load average). The original protocol also contains commands with which privileged clients can direct MOM to perform various configuration tasks; we have disabled the privileged commands, while still allowing the resource queries. There is also a UDP-based protocol, referred to as RPP, used mostly by MOM to communicate with other MOMs when executing a multiple-node job. Since we have no plans to support such jobs, we have disabled this inter-MOM portion of the protocol. However, the master server makes use of this same protocol to "ping" MOMs periodically, to see if they are still alive; this portion of the protocol remains in place. Finally, the stock PBS includes support for an "interactive" request, i.e. one where the MOM server connects back to the submitting client program in order to give the user interactive control over the job. We have removed this capability.

II. Authorization

Where the stock PBS uses the host IP address, privileged client port, or user@host when authorizing an access level for a particular request, we will use the Kerberos principal in the access lists. Thus, unprivileged requests will be restricted to users in the ATHENA.MIT.EDU Kerberos realm; privileged users (administrators) must have their full principal added to the master server's manager or operator access list. For server-to-server requests, the code will ensure that the caller's service principal matches that of a known slave host, or the master. (Thus, both the server and MOM acquire, and periodically renew, a Kerberos 5 TGT to be used when initiating connections; for the test, we will use the "daemon" service name).

Instead of using user@host, the Kerberos principal will be considered the owner of a submitted job. Any subsequent requests against an existing job, e.g. modifying or canceling the job, must originate from that same principal (or an administrative principal). The user name for the job will be the first component of the principal name. When starting a job, instead of requiring local passwd accounts and using rhosts-style verification, the execution server will get the user's passwd entry from Hesiod, and add it to the slave's /etc/passwd for the duration of the job.

A user must have a valid Longjobs account in order to submit a job, and supply the account name at submit time. Such an account will be created based on the registration, performed by an administrator, of a group or individual, for a number of job hours, valid up to an expiration date. For a group, the system will get the member users from Moira, and track the usage of each user. (We also have the capability of restricting access to certain queues by user or group, which will be used to reserve execution machines). For the test, usage will be tracked, per user and per account, in a database maintained on the master server machine.
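The fragment below sketches the kind of access check described above; it is illustrative only, and the function names, realm constant, and ACL representation are assumptions rather than the actual PBS identifiers. An unprivileged request is allowed for any principal in the local realm; a privileged request requires an exact match against the manager/operator list.

    /* Illustrative only: authorize a request based on the Kerberos principal
     * saved when the connection was authenticated. */
    #include <string.h>

    #define LOCAL_REALM "@ATHENA.MIT.EDU"

    /* Unprivileged requests: any principal in the local realm. */
    static int authorize_user(const char *principal)
    {
        const char *realm = strrchr(principal, '@');
        return realm != NULL && strcmp(realm, LOCAL_REALM) == 0;
    }

    /* Privileged requests: the full principal must appear in the
     * manager/operator access list. */
    static int authorize_manager(const char *principal,
                                 const char **acl, int nacl)
    {
        int i;
        for (i = 0; i < nacl; i++)
            if (strcmp(principal, acl[i]) == 0)
                return 1;
        return 0;
    }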
III. Ticket Management for a Queued Job

In order for a job to run with the user's normal Kerberos credentials, the submit program, by default, will acquire a long-lived, renewable Kerberos 5 ticket-granting ticket, which will be encrypted and forwarded to the master, and stored there along with other job data. The default and maximum renewable lifetime of the TGT will be one week; the user can optionally specify a shorter lifetime. The master server program will manage these user tickets, ensuring that they are renewed as needed until the job has completed, by scheduling an internal task well before the ticket end time. If the master finds that the ticket is nearing the end of its maximum renewable lifetime, it will attempt to notify the user via email and/or Zephyr. The user can run a special client command to create a new ticket for the waiting job. Files containing tickets will be excluded from the master's system backup; they will be included in a periodic save of the spool area to local disk.

The user also has the option at submit time of forwarding his/her existing TGT, or of forwarding no credentials at all. In the latter case, the job will run without tickets or tokens, and so will only be able to use world-accessible files. (We could eventually offer the option of running with "default" AFS tokens, though we have not implemented that for this test).

When a job is dispatched to a slave, the TGT will be forwarded to that slave, in much the same way as when it was forwarded originally. For the job execution, a "shepherd" process is created which forks, execs, and waits for the user's shell, while managing the user's tickets and tokens; the functionality we have added for this shepherd process is described below. When the slave notifies the server that the job has completed, the TGT will be destroyed. Note that the only time the user's password is used is when creating the renewable ticket, on the client; the password is never transmitted to any server.
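As a rough illustration of the submit-time step described above, the sketch below acquires a renewable TGT with a one-week renewable lifetime using the standard MIT Kerberos initial-credentials interface. This is a minimal sketch under stated assumptions: the function name and error handling are illustrative, and the real submit client also encrypts and forwards the resulting credentials, which is not shown.

    /* Sketch only: acquire the long-lived, renewable TGT that the submit
     * program forwards to the master. */
    #include <string.h>
    #include <krb5.h>

    #define RENEW_LIFE_SECONDS (7 * 24 * 60 * 60)   /* one week */

    static krb5_error_code get_renewable_tgt(krb5_context ctx,
                                             const char *user, krb5_ccache cc)
    {
        krb5_get_init_creds_opt *opt = NULL;
        krb5_principal client = NULL;
        krb5_creds creds;
        krb5_error_code ret;

        memset(&creds, 0, sizeof(creds));
        if ((ret = krb5_parse_name(ctx, user, &client)))
            return ret;
        if ((ret = krb5_get_init_creds_opt_alloc(ctx, &opt)))
            goto out;
        /* Request a renewable lifetime of one week (the test maximum). */
        krb5_get_init_creds_opt_set_renew_life(opt, RENEW_LIFE_SECONDS);

        /* The password is prompted for and used locally only; it is never
         * sent to any Longjobs server. */
        ret = krb5_get_init_creds_password(ctx, &creds, client, NULL,
                                           krb5_prompter_posix, NULL, 0,
                                           NULL, opt);
        if (ret)
            goto out;
        if ((ret = krb5_cc_initialize(ctx, cc, client)) == 0)
            ret = krb5_cc_store_cred(ctx, cc, &creds);
        krb5_free_cred_contents(ctx, &creds);
    out:
        if (opt)
            krb5_get_init_creds_opt_free(ctx, opt);
        if (client)
            krb5_free_principal(ctx, client);
        return ret;
    }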
IV. Tickets/Tokens in Job Execution

After successfully receiving all of the job data, the slave server proceeds as follows:

1. The slave server creates the user account; the user's home directory is attached, but without authentication.

2. The server forks a child process which will become the job "shepherd"; this child sets a new process authentication group, and performs all subsequent job set-up.

3. The shepherd caches the TGT in a file in /tmp, owned by the user, converts it to a Kerberos 4 TGT which is also cached in /tmp, and sets the KRB5CCNAME and KRBTKFILE variables in the job environment accordingly.

4. The shepherd authenticates for the user's home directory, and creates the job's standard output and error files, owned by the user. (Normally these will be created in the server spool area, and copied to the user at the end of the job, but there is an option to write these directly to the user's home directory). The user's script (stored in the spool area) is opened as the job's standard input.

5. The shepherd forks another child process, which becomes the "real" job; it sets its uid and gid to those of the user, and execs the user's shell. The shell is not a login shell, and there is no controlling terminal.

6. The shepherd sets its effective uid and gid to those of the user and waits for the job process to exit, waking periodically to renew tickets and tokens. The renewal process is as follows:

   a) First it renews the Kerberos 5 TGT, then iterates over other credentials in the Kerberos 5 cache, renewing and storing them in a new cache file which is moved into place.

   b) It then converts the new Kerberos 5 TGT to a Kerberos 4 TGT, and, much like the Kerberos 5 case, renews all other tickets in the Kerberos 4 cache.

   c) Next it re-authenticates for any AFS cell for which there are current tokens. For the test, only mit.edu AFS cells are supported.

   d) It re-authenticates for all attached filesystems; this takes care of any NFS filesystems, as well as any AFS cells for which there were no tokens prior to the renewal. (It must set its effective uid to root for this step, so that it can acquire the attach table lock; it reverts back to the user's uid immediately afterward).

   e) Finally, it sets the next renew time for one hour prior to the earliest expiration time of any of the new tickets and tokens, making sure that this interval is at least one minute, and no more than 10 hours.

   Note that there is no facility for replacing the executing job's TGT on the slave, either from the server or from a client.

7. When the user process exits, the shepherd collects it, and tries to copy the standard output and error files back to the user's directory. If it cannot, it will attempt to send the files via email, subject to a maximum size (currently set at 256 KB). The shepherd deletes the files when it has successfully disposed of them. (If the shepherd is killed before disposing of the files, their presence will be detected during the server cleanup, which will also attempt to mail them to the user).

8. When the job completes, the user account is reverted, ticket caches are destroyed, and a cleanup script is invoked (see below).

V. Integrity Checking

Jobs will always have a maximum walltime limit set and strictly enforced; the MOM server will kill any job which exceeds that limit. At the end of the job, MOM will run an "epilogue" script, which will perform cleanup operations similar to the Athena reactivate script, including reverting the passwd file, killing any remaining user processes, cleaning temp directories, detaching filesystems, etc. In addition, we have modified the inter-server end-of-job protocol, so that the master directs the slave to perform a system integrity check before another job may be dispatched to that slave. If the integrity check fails, the master will take the slave out of service. The verification is adapted from integrity checks performed on other Athena servers. Process accounting will be done on the slaves. The master will maintain PBS job accounting records, as well as the Longjobs quota database.
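For concreteness, a minimal sketch of the walltime enforcement mentioned above follows: the elapsed wall-clock time of the job is compared against its limit, and the job's process group is killed once the limit is exceeded. The function and parameter names are hypothetical; the real MOM resource-limit handling is more involved.

    /* Sketch only: enforce a job's walltime limit by killing its process
     * group once the elapsed wall-clock time exceeds the limit. */
    #include <sys/types.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    /* Returns 1 if the job was killed, 0 if it is still within its limit. */
    static int enforce_walltime(pid_t job_pgid, time_t start_time,
                                long walltime_limit_seconds)
    {
        time_t now = time(NULL);

        if (now - start_time <= walltime_limit_seconds)
            return 0;
        /* Terminate the whole job session, then follow up with SIGKILL. */
        killpg(job_pgid, SIGTERM);
        sleep(5);
        killpg(job_pgid, SIGKILL);
        return 1;
    }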