This document presents an overview of the Longjobs Test service, with an emphasis on network and security aspects. The software used in the Longjobs Test service is based on PBS, the Portable Batch System (see http://www.openpbs.org). PBS supports our desired model of a master server dispatching jobs to a set of execution (slave) servers as they become available. There are three server components in a PBS system -- the master server program (referred to simply as the server), the scheduler, and the execution server, called MOM. The scheduler is provided as a separate program to promote modularity and flexibility; it is expected that sites will modify or even replace it, and it is possible to run the scheduler and server on different machines. No PBS daemon is needed to run any client programs; all requests are exchanged directly with the master server. In our implementation, the master server and scheduler will run on one master host, and execution servers will run on a pool of slave hosts.

I. Protocols

PBS employs several different protocols for communicating between its various components. As distributed, all communication is IP-based; authentication and authorization are done via standard UNIX mechanisms -- using the IP address for host-based access control, and privileged port numbers from trusted hosts for user-based control. For the latter, the unmodified PBS client programs invoke a setuid program to authenticate a newly opened connection; much of the focus of our work has been on changing this model to one using Kerberos-based authentication and authorization, as well as adding the server functionality for managing tickets and tokens.

The main protocol uses a standard message format for "requests" (commands and queries) and replies, as the basis of communication between client programs and the PBS server, between the server and MOM (the execution server), and between two master servers (used in a request forwarding scheme we are unlikely to employ). Communication is done via a TCP connection to a known port. Requests are structured as a message header, body, and extension. The fixed-format header identifies the request type and user. The message body format varies by request type. The extension is optionally used for additional string data. Internally, these structures contain both string and integer data. They are encoded by the sender into a system-independent string format, and decoded back to internal format by the receiver, using an encoding scheme called Data is Strings (DIS). Similarly, replies are structured as a fixed header, including return and auxiliary codes, and a per-type variable body. Most requests consist of a single message; the Queue Job request involves several messages (sub-requests). The protocol supports aborting a multi-message request, and recovery from a broken network connection. [See the PBS "External Reference Specification" for more information, at https://web.mit.edu/longjobs/pbs/doc/pbs_extref.ps]

For our test set-up, we have extended the protocol to support a new request type, for doing Kerberos 5 mutual authentication. When a connection to the server is established, no requests will be permitted until the Kerberos authentication request is completed successfully; the client's Kerberos principal will be saved as part of the connection data, and be available when authorizing subsequent requests made on that connection. This exchange replaces the traditional authentication and authorization based on IP address, privileged port, and user name (see below).
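As an illustration of this authentication step, the sketch below shows how a client might perform Kerberos 5 mutual authentication on a newly opened TCP connection to the master server, using the standard MIT Kerberos sendauth interface. This is a minimal sketch, not the actual request encoding used by our modified PBS: the function name and the application version string are hypothetical, and the target service principal (daemon/host) is an assumption based on the "daemon" service name mentioned under Authorization below.

    /* Sketch only: authenticate a new connection to the master server with
     * Kerberos 5 mutual authentication before sending any batch requests.
     * The version string "longjobs-pbs-1.0" is a placeholder. */
    #include <krb5.h>

    static krb5_error_code authenticate_connection(int sock, const char *master_host)
    {
        krb5_context ctx;
        krb5_auth_context auth_ctx = NULL;
        krb5_principal server = NULL;
        krb5_ccache cc = NULL;
        krb5_error *err_ret = NULL;
        krb5_ap_rep_enc_part *rep = NULL;
        krb5_error_code ret;

        if ((ret = krb5_init_context(&ctx)))
            return ret;
        if ((ret = krb5_cc_default(ctx, &cc)))
            goto out;
        /* Build the server's service principal, e.g. daemon/host@ATHENA.MIT.EDU */
        if ((ret = krb5_sname_to_principal(ctx, master_host, "daemon",
                                           KRB5_NT_SRV_HST, &server)))
            goto out;
        /* AP_OPTS_MUTUAL_REQUIRED makes the server prove its identity as well. */
        ret = krb5_sendauth(ctx, &auth_ctx, (krb5_pointer)&sock,
                            "longjobs-pbs-1.0", NULL, server,
                            AP_OPTS_MUTUAL_REQUIRED, NULL, NULL, cc,
                            &err_ret, &rep, NULL);
        if (rep)
            krb5_free_ap_rep_enc_part(ctx, rep);
        if (err_ret)
            krb5_free_error(ctx, err_ret);
    out:
        if (auth_ctx)
            krb5_auth_con_free(ctx, auth_ctx);
        if (server)
            krb5_free_principal(ctx, server);
        if (cc)
            krb5_cc_close(ctx, cc);
        krb5_free_context(ctx);
        return ret;
    }

On success, the principal obtained by the server's corresponding recvauth call is what gets saved with the connection and consulted for all later requests on it.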
We also extended the Queue Job request to support transmitting optional job credentials (a TGT). There were placeholders in the code for handling generic credentials, which made this somewhat easier. We also extended the DIS library to provide for encryption, using Kerberos 5; this is currently used only in certain request types, most notably when transmitting a TGT.

In addition to the request protocol, there are several other protocols employed between the various server components, which we have not Kerberized. The PBS server connects to the scheduler periodically to tell it to perform a scheduling pass. The connection is kept open while the scheduler makes requests, as a privileged client, to run jobs and obtain server information. Since we will run the scheduler and server on the same host, we have added a UNIX domain (local) socket for this connection, and disabled the scheduler's TCP listening socket. The scheduler connects periodically to each MOM server, using the "Resource Monitor" protocol to query the execution machine's status (e.g. load average). The original protocol also contains commands with which privileged clients can direct MOM to perform various configuration tasks; we have disabled the privileged commands, while still allowing the resource queries. There is also a UDP-based protocol, referred to as RPP, used mostly by MOM to communicate with other MOMs when executing a multiple-node job. Since we have no plans to support such jobs, we have disabled this inter-MOM portion of the protocol. However, the master server makes use of this same protocol to "ping" MOMs periodically, to see if they are still alive; this portion of the protocol remains in place. Finally, the stock PBS includes support for an "interactive" request, i.e. one where the MOM server connects back to the submitting client program in order to give the user interactive control over the job. We have removed this capability.

II. Authorization

Where the stock PBS uses the host IP address, privileged client port, or user@host when authorizing an access level for a particular request, we will use the Kerberos principal in the access lists. Thus, unprivileged requests will be restricted to users in the ATHENA.MIT.EDU Kerberos realm; privileged users (administrators) must have their full principal added to the master server's manager or operator access list. For server-to-server requests, the code will ensure that the caller's service principal matches that of a known slave host, or the master. (Thus, both the server and MOM acquire, and periodically renew, a Kerberos 5 TGT to be used when initiating connections; for the test, we will use the "daemon" service name).

Instead of using user@host, the Kerberos principal will be considered the owner of a submitted job. Any subsequent requests against an existing job, e.g. modifying or canceling the job, must originate from that same principal (or an administrative principal). The user name for the job will be the first component of the principal name. When starting a job, instead of requiring local passwd accounts and using rhosts-style verification, the execution server will get the user's passwd entry from Hesiod, and add it to the slave's /etc/passwd for the duration of the job.

A user must have a valid Longjobs account in order to submit a job, and supply the account name at submit time. Such an account will be created based on the registration, performed by an administrator, of a group or individual, for a number of job hours, valid up to an expiration date. For a group, the system will get the member users from Moira, and track the usage of each user. (We also have the capability of restricting access to certain queues by user or group, which will be used to reserve execution machines). For the test, usage will be tracked, per user and per account, in a database maintained on the master server machine.
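The fragment below sketches the kind of access check described above; it is illustrative only, and the function names, realm constant, and ACL representation are assumptions rather than the actual PBS identifiers. An unprivileged request is allowed for any principal in the local realm; a privileged request requires an exact match against the manager/operator list.

    /* Illustrative only: authorize a request based on the Kerberos principal
     * saved when the connection was authenticated. */
    #include <string.h>

    #define LOCAL_REALM "@ATHENA.MIT.EDU"

    /* Unprivileged requests: any principal in the local realm. */
    static int authorize_user(const char *principal)
    {
        const char *realm = strrchr(principal, '@');
        return realm != NULL && strcmp(realm, LOCAL_REALM) == 0;
    }

    /* Privileged requests: the full principal must appear in the
     * manager/operator access list. */
    static int authorize_manager(const char *principal,
                                 const char **acl, int nacl)
    {
        int i;
        for (i = 0; i < nacl; i++)
            if (strcmp(principal, acl[i]) == 0)
                return 1;
        return 0;
    }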
III. Ticket Management for a Queued Job

In order for a job to run with the user's normal Kerberos credentials, the submit program, by default, will acquire a long-lived, renewable Kerberos 5 ticket-granting ticket, which will be encrypted and forwarded to the master, and stored there along with other job data. The default and maximum renewable lifetime of the TGT will be one week; the user can optionally specify a shorter lifetime. The master server program will manage these user tickets, ensuring that they are renewed as needed until the job has completed, by scheduling an internal task well before the ticket end time. If the master finds that the ticket is nearing the end of its maximum renewable lifetime, it will attempt to notify the user via email and/or Zephyr. The user can run a special client command to create a new ticket for the waiting job. Files containing tickets will be excluded from the master's system backup; they will be included in a periodic save of the spool area to local disk.

The user also has the option at submit time of forwarding his/her existing TGT, or of forwarding no credentials at all. In the latter case, the job will run without tickets or tokens, and so will only be able to use world-accessible files. (We could eventually offer the option of running with "default" AFS tokens, though we have not implemented that for this test).

When a job is dispatched to a slave, the TGT will be forwarded to that slave, in much the same way as when it was forwarded originally. For the job execution, a "shepherd" process is created which forks, execs, and waits for the user's shell, while managing the user's tickets and tokens; the functionality we have added for this shepherd process is described below. When the slave notifies the server that the job has completed, the TGT will be destroyed. Note that the only time the user's password is used is when creating the renewable ticket, on the client; the password is never transmitted to any server.
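As a rough illustration of the submit-time step described above, the sketch below acquires a renewable TGT with a one-week renewable lifetime using the standard MIT Kerberos initial-credentials interface. This is a minimal sketch under stated assumptions: the function name and error handling are illustrative, and the real submit client also encrypts and forwards the resulting credentials, which is not shown.

    /* Sketch only: acquire the long-lived, renewable TGT that the submit
     * program forwards to the master. */
    #include <string.h>
    #include <krb5.h>

    #define RENEW_LIFE_SECONDS (7 * 24 * 60 * 60)   /* one week */

    static krb5_error_code get_renewable_tgt(krb5_context ctx,
                                             const char *user, krb5_ccache cc)
    {
        krb5_get_init_creds_opt *opt = NULL;
        krb5_principal client = NULL;
        krb5_creds creds;
        krb5_error_code ret;

        memset(&creds, 0, sizeof(creds));
        if ((ret = krb5_parse_name(ctx, user, &client)))
            return ret;
        if ((ret = krb5_get_init_creds_opt_alloc(ctx, &opt)))
            goto out;
        /* Request a renewable lifetime of one week (the test maximum). */
        krb5_get_init_creds_opt_set_renew_life(opt, RENEW_LIFE_SECONDS);

        /* The password is prompted for and used locally only; it is never
         * sent to any Longjobs server. */
        ret = krb5_get_init_creds_password(ctx, &creds, client, NULL,
                                           krb5_prompter_posix, NULL, 0,
                                           NULL, opt);
        if (ret)
            goto out;
        if ((ret = krb5_cc_initialize(ctx, cc, client)) == 0)
            ret = krb5_cc_store_cred(ctx, cc, &creds);
        krb5_free_cred_contents(ctx, &creds);
    out:
        if (opt)
            krb5_get_init_creds_opt_free(ctx, opt);
        if (client)
            krb5_free_principal(ctx, client);
        return ret;
    }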
IV. Tickets/Tokens in Job Execution

After successfully receiving all of the job data, the slave server proceeds as follows:

1. The slave server creates the user account; the user's home directory is attached, but without authentication.

2. The server forks a child process which will become the job "shepherd"; this child sets a new process authentication group, and performs all subsequent job set-up.

3. The shepherd caches the TGT in a file in /tmp, owned by the user, converts it to a Kerberos 4 TGT which is also cached in /tmp, and sets the KRB5CCNAME and KRBTKFILE variables in the job environment accordingly.

4. The shepherd authenticates for the user's home directory, and creates the job's standard output and error files, owned by the user. (Normally these will be created in the server spool area, and copied to the user at the end of the job, but there is an option to write these directly to the user's home directory). The user's script (stored in the spool area) is opened as the job's standard input.

5. The shepherd forks another child process, which becomes the "real" job; it sets its uid and gid to those of the user, and execs the user's shell. The shell is not a login shell, and there is no controlling terminal.

6. The shepherd sets its effective uid and gid to those of the user and waits for the job process to exit, waking periodically to renew tickets and tokens. The renewal process is as follows:

   a) First it renews the Kerberos 5 TGT, then iterates over other credentials in the Kerberos 5 cache, renewing and storing them in a new cache file which is moved into place.

   b) It then converts the new Kerberos 5 TGT to a Kerberos 4 TGT, and, much like the Kerberos 5 case, renews all other tickets in the Kerberos 4 cache.

   c) Next it re-authenticates for any AFS cell for which there are current tokens. For the test, only mit.edu AFS cells are supported.

   d) It re-authenticates for all attached filesystems; this takes care of any NFS filesystems, as well as any AFS cells for which there were no tokens prior to the renewal. (It must set its effective uid to root for this step, so that it can acquire the attach table lock; it reverts back to the user's uid immediately afterward).

   e) Finally, it sets the next renew time for one hour prior to the earliest expiration time of any of the new tickets and tokens, making sure that this interval is at least one minute, and no more than 10 hours.

   Note that there is no facility for replacing the executing job's TGT on the slave, either from the server or from a client.

7. When the user process exits, the shepherd collects it, and tries to copy the standard output and error files back to the user's directory. If it cannot, it will attempt to send the files via email, subject to a maximum size (currently set at 256 KB). The shepherd deletes the files when it has successfully disposed of them. (If the shepherd is killed before disposing of the files, their presence will be detected during the server cleanup, which will also attempt to mail them to the user).

8. When the job completes, the user account is reverted, ticket caches are destroyed, and a cleanup script is invoked (see below).

V. Integrity Checking

Jobs will always have a maximum walltime limit set and strictly enforced; the MOM server will kill any job which exceeds that limit. At the end of the job, MOM will run an "epilogue" script, which will perform cleanup operations similar to the Athena reactivate script, including reverting the passwd file, killing any remaining user processes, cleaning temp directories, detaching filesystems, etc. In addition, we have modified the inter-server end-of-job protocol, so that the master directs the slave to perform a system integrity check before another job may be dispatched to that slave. If the integrity check fails, the master will take the slave out of service. The verification is adapted from integrity checks performed on other Athena servers. Process accounting will be done on the slaves. The master will maintain PBS job accounting records, as well as the Longjobs quota database.
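For concreteness, a minimal sketch of the walltime enforcement mentioned above follows: the elapsed wall-clock time of the job is compared against its limit, and the job's process group is killed once the limit is exceeded. The function and parameter names are hypothetical; the real MOM resource-limit handling is more involved.

    /* Sketch only: enforce a job's walltime limit by killing its process
     * group once the elapsed wall-clock time exceeds the limit. */
    #include <sys/types.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    /* Returns 1 if the job was killed, 0 if it is still within its limit. */
    static int enforce_walltime(pid_t job_pgid, time_t start_time,
                                long walltime_limit_seconds)
    {
        time_t now = time(NULL);

        if (now - start_time <= walltime_limit_seconds)
            return 0;
        /* Terminate the whole job session, then follow up with SIGKILL. */
        killpg(job_pgid, SIGTERM);
        sleep(5);
        killpg(job_pgid, SIGKILL);
        return 1;
    }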