Longjobs Discovery Report
Project Information
Sponsor
Vijay Kumar, Assistant Provost and Director, Academic Computing
Discovery Team
- Bob Basch, Athena UNIX Platform Team
- Abby Fox, Faculty Liaison, Academic Computing
- Ted McCabe, Athena Server Operations
Consulting Assistance
- Mitch Berger, Student Employee, Information Systems
- Bill Cattey, Team Leader, Athena UNIX Platform Team
- Phil Long, Sr. Strategist, Academic Computing
- Miki Lusztig, Athena UNIX Platform Team
- Cana McCoy, Athena Server Operations
- Tim McGovern, Sr. Project Manager, I/T Discovery
- Anne Salemme, Network Operations
- Naomi Schmidt, Team Leader, Academic Computing Support Team, and
Manager, Educational Planning and Support (Emeritus)
- Garry Zacheiss, Athena Server Operations
URL
http://web.mit.edu/longjobs/
Email
longjobs-dev@mit.edu
Executive Summary
Rationale
The purpose of this project was to examine the need to execute
long-running jobs in the Athena environment, and, if possible, to
design a feasible service to satisfy that need. Under current rules
of use, our customers have no reliable way to run procedures of long
duration on an unattended Athena workstation. This has led to much
frustration in the user community, and frequent abuse of the rules.
Recommendations
- Develop and support a centralized network batch service for Athena.
- Use PBS (Portable Batch System) software as the basis of the
service, modifying it to be compatible with the Athena environment.
- Keep the initial scale relatively small; do not open to the general
community.
- Employ a group-based quota system to regulate use of the service,
and a reservation mechanism to ensure sufficient resources for classes
during specific time periods.
- Defer consideration of a billing component.
Conceptual Design
We proposed a service with the following characteristics:
- Transparent; compatible with the Athena interactive experience
- Dedicated execution machines, comparable to those in Athena clusters,
would run one job at a time
- Access controlled via registration
- Submit jobs from any Athena workstation
- Jobs queued to a master server, which dispatches jobs to execution
machines as they become available
- Jobs subject to a strict limit of elapsed time
- Kerberos-based authentication and authorization; use of renewable,
forwardable tickets
- AFS support
- Usage limited by a group-based quota system
Status
- Developed a prototype of the proposed service.
- Conducted a successful test of the prototype with selected faculty
and students.
Next Steps
- Continue testing through the fall 2001 term.
- Analyze test results.
- Complete development based on existing prototype.
- Consider needed enhancements.
Business Case
Currently, our customers have no reliable way to execute long-running
procedures, even those requiring no user interaction, without physically
remaining at a workstation console. The Athena Rules of Use state
that workstations cannot be left unattended for longer than 20
minutes, and that such an unattended session can be logged out by
another user. The inability to run such jobs has resulted in a steady
demand for a "long job" service, by faculty who assign work requiring
long-running computations, and by individual students, both graduate
and undergraduate, as part of their academic work. The lack of such
a service has inevitably led to frustration and abuse, and
necessitated our spending more resources on enforcement than is
desirable.
In order to characterize the specific need further, we outlined the
design of a service which, as conceived, would be compatible with the
Athena environment, and prepared a questionnaire to solicit feedback
on it from interested faculty and representative students. The
results demonstrated that our envisioned solution would meet a
preponderance of the expressed need, and provided a basis for
quantifying that need.
We concluded that we should design a service which will give users
the ability to run non-interactive, unattended jobs within the Athena
environment, by providing a pool of dedicated, centrally-managed
Athena server machines on which users can execute such non-interactive
jobs remotely, in as regulated, ordered, and predictable a manner as
possible.
The Proposed Solution
[The following is adapted from the Proposed test of Longjobs service
document.]
The Goal
The intent is to provide a system that is as "transparent", and
compatible with the Athena interactive computing experience, as
possible; a user should easily be able to submit a job as a script of
the same shell commands that would be entered during a normal
interactive login session. The execution machines would be comparable
to normal Athena workstations, with the same facilities available,
except for those facilities that are only meaningful in an
interactive session (e.g. a display).
General Architecture
Access to the service would be granted through a registration process
administered by the appropriate support group (see
Registration, Reservations, and Support). Usage would be limited by a
group-based quota system, maintained on the master; a quota of job
hours would be granted at registration time.
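As an illustration of the group-based quota scheme described above, the following sketch tracks a quota of job hours per group; the names (QuotaDB, grant, charge) are hypothetical and are not taken from the prototype:

```python
# Illustrative sketch of a group-based quota of job hours.
# All names here are hypothetical, not from the actual prototype.

class QuotaExceeded(Exception):
    pass

class QuotaDB:
    def __init__(self):
        # group name -> remaining job hours
        self._remaining = {}

    def grant(self, group, hours):
        """Grant job hours to a group at registration time."""
        self._remaining[group] = self._remaining.get(group, 0.0) + hours

    def charge(self, group, hours):
        """Debit a completed job's elapsed hours; refuse overdrafts."""
        left = self._remaining.get(group, 0.0)
        if hours > left:
            raise QuotaExceeded(f"{group} has only {left:.2f} h left")
        self._remaining[group] = left - hours
        return self._remaining[group]

db = QuotaDB()
db.grant("6.170-students", 100.0)   # quota granted at registration
left = db.charge("6.170-students", 2.5)  # job ran for 2.5 hours
print(left)  # 97.5
```

In the actual service the master would maintain this state, debiting a group's balance as each of its members' jobs completes.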
A registered user could submit a job from any Athena workstation,
specifying the type of Athena machine they want the job to run on, and
other desired job parameters. The job would be directed immediately
to a central "master" server, which would dispatch jobs to "slave"
execution machines, as they become available. An execution slave
would only run one job at a time, so a job would remain queued at the
master until a suitable slave machine became available. Authentication
and authorization would be based on the user's Kerberos principal.
Queues and Scheduling
All submitted jobs will enter precisely one of several queues; the
queues will distinguish jobs based on machine type and time limit,
and will have access control lists. We recommend having queues with
at least two different time limits, for relatively short or long jobs.
Scheduling of all jobs will be based on a modified first-in, first-out
scheme. A slave machine will serve one or more queues (e.g. the
"short" or "long" queue, or both) and may be migrated between queues
to satisfy varying load.
Users could query the server to obtain job status, and cancel jobs
that are no longer wanted. Users would optionally be notified, via
e-mail and/or Zephyr, when a job begins and/or ends.
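The dispatch scheme above can be sketched as follows; this is only an illustration of the modified first-in, first-out behavior (per-queue FIFO order, one job per slave, slaves serving one or more queues), not code from the prototype or from PBS:

```python
# Sketch of FIFO dispatch from per-type queues to one-job-at-a-time
# slaves. Illustrative only; names and structures are hypothetical.
from collections import deque

queues = {"short": deque(), "long": deque()}
# Each slave serves one or more queues and runs at most one job.
slaves = [
    {"name": "slave1", "serves": {"short"}, "busy": False},
    {"name": "slave2", "serves": {"short", "long"}, "busy": False},
]

def submit(queue_name, job):
    queues[queue_name].append(job)   # strict first-in, first-out order

def dispatch():
    """Assign queued jobs to idle slaves that serve the job's queue."""
    started = []
    for slave in slaves:
        if slave["busy"]:
            continue
        for qname in ("short", "long"):   # simple fixed priority order
            if qname in slave["serves"] and queues[qname]:
                job = queues[qname].popleft()
                slave["busy"] = True
                started.append((job, slave["name"]))
                break
    return started

submit("short", "jobA")
submit("long", "jobB")
started = dispatch()
print(started)  # [('jobA', 'slave1'), ('jobB', 'slave2')]
```

A job submitted to the "long" queue simply remains queued until an idle slave serving that queue exists, which is the behavior described above.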
Job Execution
User accounts would be added, and home directories attached, on the
execution machine only for the duration of the job. Jobs would be
subject to a strict limit of elapsed time; any job exceeding that
limit would be killed. Since a slave machine will only run one job at
a time, the user is assured of having no contention for CPU cycles or
other resources while the job is running, so that the time needed to
complete the job will be as predictable as possible. At the end of
each job, the slave will perform a cleanup and check procedure, to
ensure system integrity. Standard output and error files would be
written to the user's directory, or emailed to the user.
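The strict elapsed-time limit can be sketched as follows; here a subprocess timeout stands in for whatever enforcement mechanism the service actually uses, and the function name is hypothetical:

```python
# Sketch of the strict elapsed-time limit: run the job and kill it if
# it exceeds its walltime. Illustrative only, not prototype code.
import subprocess
import sys

def run_with_limit(cmd, walltime_seconds):
    """Run cmd; return (status, killed), killed=True if the limit hit."""
    try:
        proc = subprocess.run(cmd, timeout=walltime_seconds)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired,
        # mirroring the slave killing a job that exceeds its limit.
        return None, True

# A "job" that would run for 5 seconds, limited to 1 second:
status, killed = run_with_limit(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    walltime_seconds=1)
print(killed)  # True
```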
Tickets/Tokens
In order for the job to run with Kerberos credentials, as is required
for most authentication and authorization purposes in the Athena
environment, the user would optionally be able to acquire a long-lived,
renewable Kerberos 5 ticket-granting ticket, which would be forwarded
as part of the job. The master and slave servers would manage these
tickets, renewing them as needed. In addition, the execution server
would use this ticket to acquire Kerberos 4 tickets, and AFS tokens,
the latter being most critical for users to be able to access their
home and other directories from within the job context. Users who did
not wish to forward a long-lived ticket could optionally choose to
forward an existing short-lived ticket, or to run the job without any
tickets or tokens. A renewable ticket could be maintained up to the
maximum life permitted by our Kerberos configuration (currently one
week).
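The renewal bookkeeping the servers would perform can be illustrated with a small calculation: renew shortly before each ticket lifetime expires, but never schedule a renewal that would extend past the maximum renewable life. The function and margin below are hypothetical, not our actual Kerberos configuration logic:

```python
# Sketch of scheduling renewals of a renewable TGT. Hypothetical names;
# real lifetimes come from the site's Kerberos configuration.
def renewal_times(lifetime_h, renewable_life_h, margin_h=1):
    """Hours (from issue time) at which the ticket should be renewed."""
    times, t = [], 0
    while t + lifetime_h < renewable_life_h:
        t += lifetime_h - margin_h   # renew a bit before expiry
        times.append(t)
    return times

# A 10-hour ticket with a 24-hour renewable life needs two renewals:
print(renewal_times(lifetime_h=10, renewable_life_h=24))  # [9, 18]
```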
Implementation
We recommend using the Portable Batch System (PBS) as a basis for
our system. PBS was developed at NASA Ames Research Center, and is
now supported by Veridian Corporation. The open-source version, OpenPBS, is freely
available; a Professional version,
PBS Pro, is also available at very low cost, including source, to
educational sites. Our prototype is based on the open-source version
of PBS.
Athena Development will make the necessary enhancements and additions
to PBS, including the support for Kerberos authentication and
authorization, the management of renewable tickets and tokens,
compatibility with the Athena login system, Zephyr support, and
integration with a group-based quota system. The server machines will
be administered by Athena Server Operations, and registrations by
the appropriate support group (see below).
Registration, Reservations, and Support
The Faculty Liaisons will handle registrations, reservations, and
support for course-based use of the service (through faculty and
TAs). Athena Accounts and Consulting would handle registrations and
support for other uses, according to whatever policies are developed
for such use of the service.
To ensure that sufficient resources are available for their classes,
instructors would be able to request a reservation from the Faculty
Liaisons for a particular number and type of machines during a
specific time period. This would be implemented via a separate queue
mapping to an appropriate subset of machines in the pool, ACLd to the
group for that class. We would require advance notice for such
requests (in order to ensure sufficient machines are available, to
transfer general queues to non-reserved machines, etc.). Similarly,
reservations would probably need to be staggered to build in some free
time for handling transitions.
Registrations would include an automatic expiration date, which would
typically be the end of the current semester.
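The staggering constraint on reservations noted above can be sketched as a simple overlap check with a transition gap; the representation and names below are hypothetical:

```python
# Sketch of the staggered-reservation constraint: two reservations on
# shared machines must be separated by a transition gap. Hypothetical.
def conflicts(res_a, res_b, gap_h=1):
    """res = (start_h, end_h, machines). True if the reservations touch
    a shared machine without the required transition gap between them."""
    if not (res_a[2] & res_b[2]):
        return False                      # disjoint machine sets: fine
    a_start, a_end = res_a[0], res_a[1]
    b_start, b_end = res_b[0], res_b[1]
    return a_start < b_end + gap_h and b_start < a_end + gap_h

r1 = (9, 12, {"slave1", "slave2"})
r2 = (12, 15, {"slave2"})   # back-to-back on slave2: too tight
r3 = (14, 16, {"slave1"})   # shares slave1, but leaves a 2-hour gap
print(conflicts(r1, r2))  # True
print(conflicts(r1, r3))  # False
```

In practice the Faculty Liaisons would apply such a check when accepting reservation requests, which is why advance notice is required.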
Accounting
The system will produce accounting records for each job, containing
user and account names, as well as other pertinent information.
Initially, we will use these to track utilization; eventually, these
records could be input to a billing system, if such a component were
desired or needed to fund expansion of the service.
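As an illustration of how the per-job accounting records could be used to track utilization, the sketch below sums job hours per user; the field names are illustrative, not the actual record format:

```python
# Sketch of summarizing per-job accounting records into utilization
# per user. Field names are illustrative, not the real record format.
from collections import defaultdict

records = [
    {"user": "alice", "account": "6.170", "queue": "short", "hours": 0.5},
    {"user": "bob",   "account": "6.170", "queue": "long",  "hours": 6.0},
    {"user": "alice", "account": "8.02",  "queue": "long",  "hours": 3.5},
]

usage = defaultdict(float)
for rec in records:
    usage[rec["user"]] += rec["hours"]

print(dict(usage))  # {'alice': 4.0, 'bob': 6.0}
```

The same records, keyed by account rather than user, could later feed a billing system if one were added.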
Machines
We recommend that the machines in the slave pool be comparable to
the workstations found in Athena clusters. If feasible, all platforms
in the current Athena release should be represented in the slave pool.
We recommend a two-year renewal cycle for the slaves, in order to
remain comparable with the latest hardware found in the clusters.
Some number of the older slaves could remain in use, perhaps serving
special queues for jobs where top performance is not essential.
Design Issues
We identified the following issues concerning our proposed design:
- While predictability was an important factor in our design decisions,
there is no guarantee that a user's job will run before a certain
time. However, unacceptable delays should not be a chronic problem.
The registration and quota system will serve to regulate the
service, and prevent over-subscription. As with most other
services, users need to consider contention for the service when
planning their work.
- There is also no guarantee that any external resources the job might
require, such as locker software, or a software license based on the
number of active users, will be available when the job executes. We
believe that this issue does not need to be addressed in the
initial implementation; if experience warrants, we should consider
the feasibility of ways for the service to keep jobs queued until
the required resource is available.
- Jobs in this environment will be completely non-interactive; there
will be no X display or controlling terminal. Programs which require
such an interactive component, even for a brief initialization stage,
will not be supported. In gathering data on user needs, we discovered
that the ability to run interactive procedures would seldom, if ever,
be required; we found that any software typically used in long-running
computations tends to be capable of running in a non-interactive mode.
We proposed providing a test facility, by which users could test that a
program runs successfully in the non-interactive longjobs environment.
We addressed this in the prototype by providing a program that tests
a job script, by executing the script in a local environment which
simulates the environment in which the script would run if submitted
as a longjob. We consider documentation of the non-interactive
capabilities of the software available on Athena to be an important
part of the support effort.
- Even if the user acquires a long-lived renewable Kerberos ticket
for use by the job, there is no guarantee that the ticket will not
expire while the job is waiting to execute. We addressed this problem
in the prototype by implementing a program by which the user can
create a new ticket-granting ticket for an already-queued job, and
having the server notify the user when a job's ticket is nearing
expiration.
- Using long-lived Kerberos tickets would be a serious problem if the
system were ever compromised; currently there is no way for a stolen
ticket to be revoked, even if reported. Cautious users may
prefer to submit jobs without a ticket; however, the job would then
not be able to access the user's AFS directories (unless the
directory was world-accessible). A way to alleviate this problem
would be to have an option by which the execution server could
acquire default AFS tokens for the job. One such possibility would
be to use a special AFS group for the execution servers; the user
would have to add this group to the directory's ACL, thus also
making the directory accessible to other users of the service.
Another possibility would be to create a per-user group to which the
longjobs server had administrative rights; this is more secure than
the first option, but raises significant overhead and support cost
issues.
- Due to the special security considerations of this system (particularly
with respect to managing long-lived tickets), the execution servers
will be configured to run whatever system management procedures
are deemed necessary to ensure system integrity (similar to what is
done now on Athena dial-up machines). Since this could involve a
prohibitive amount of new development work for a particular
platform, it is conceivable we may choose not to provide any
execution servers for that platform. The goal, however, would be to
have available all platforms which are part of the current Athena
release, if feasible.
- Based on the perceived need, the intent is to provide execution
machines which are essentially equivalent to workstations in the
current Athena deployment. Our proposed service will not provide
users with additional computational capacities, beyond what is
available on a typical Athena workstation; in particular, we do
not propose supporting any of the following at this time:
- "Supercomputer" machines, or machines with much more memory or disk
space than might be found on a machine in an Athena cluster.
- "Parallel" execution, i.e. a job requiring
multiple execution machines.
- Executing jobs on, or submitting jobs from, non-Athena platforms.
However, there should be nothing inherent in the system which would
preclude its being extended to support any of the above in the
future (though other considerations might make such support infeasible).
- During the initial design discussions, we did not consider the
issue of providing temporary disk space to be critical; our
expectation was that users would copy any needed job input/output
from/to their lockers, just as they would during interactive
sessions. However, testing revealed that there would be a need for
temporary space, to handle situations where, in an interactive
session, users would use local devices (e.g. Zip drive) to copy large
data files in and/or out. A short term solution is to create
temporary lockers or volumes manually; during testing, this was done
by the FLs for one class (following the existing procedure for providing
extra student coursework space via volumes mounted in the course locker).
If users outside of classes request extra space, that would create an added
support burden (perhaps to be handled by Athena Accounts) and would
require clear policy guidelines. Long term,
we should consider implementing an automated system to create and
clean up temporary disk space in AFS, perhaps using a separate
AFS cell in which the master's Kerberos principal would have
administrative control. (In PBS, this issue is dealt with via a
"stage-in/stage-out" facility, in which the user specifies files
to be copied in to, or out of, the execution server at the beginning
or end of job; we feel this is not terribly useful in our
environment, where the real issue is having the files on-line in
the first place).
- We do not envision supporting "very long jobs", i.e. those requiring
many days or weeks to complete. The maximum time limit for any
queue will probably be on the order of one or two days. Even if a
procedure requires more execution time, it is generally possible, and
advisable, to break it up into smaller pieces.
Eventually, it may make sense to extend the maximum time limit for
jobs running in special queues on older slaves. In any case, the
choice of maximum time limit is technically constrained by the
maximum lifetime of renewable Kerberos tickets.
- A controversial issue was whether to require a billing component
from the outset, in order to be able to recover the costs of meeting
greater demand. We recommended deferring this question because:
- Many feel that it is inappropriate to charge for a service
meeting a basic academic computing need.
- We expect that the budget would cover the cost of running a
service at the anticipated initial scale. Planning for future
expansion or rationing in case of demand exceeding
resources would presumably be folded into the existing Academic
Computing rounds, faculty proposals, and renewal processes.
In addition, given the lack of a general billing infrastructure
for handling
academic computing services, adding a billing component could
incur a significant cost in staff resources.
- Because of security requirements, users will not be able to run
jobs as root. We see no need to support submitting or executing
jobs except as the user's null-instance Kerberos principal, using
the normal user ID as given by Hesiod.
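On the "very long jobs" point above, the advice to break a long procedure into smaller pieces can be illustrated with a trivial calculation: divide the total work into segments that each fit within a queue's time limit, to be submitted in sequence. This is purely illustrative:

```python
# Sketch of breaking a long procedure into pieces that each fit a
# queue's time limit. Purely illustrative; names are hypothetical.
def split_into_segments(total_hours, limit_hours):
    """Return per-segment durations, each within the queue time limit."""
    segments = []
    remaining = total_hours
    while remaining > 0:
        chunk = min(remaining, limit_hours)
        segments.append(chunk)
        remaining -= chunk
    return segments

# A 100-hour computation against a 48-hour queue limit:
print(split_into_segments(100, 48))  # [48, 48, 4]
```

Each segment could be submitted as its own job, with intermediate state saved to the user's locker between segments.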
Status
Following the report of our initial recommendations, at the request
of the Sponsor, we proposed a feasibility test of the service. Upon
approval, we completed the necessary development of the prototype,
and launched the test for selected courses and individual students
in the Spring 2001 term.
Five SGI O2 and five Sun Ultra 5 machines were purchased for the
test slave pool, plus another Ultra 5 to serve as the master.
At the end of the term, we reported the results of the test. While
the results were generally favorable, we would have preferred to
have had greater load, and to have tested the reservation system.
We recommended continuing the test in the Fall 2001 term, and successfully
launched the test in September 2001, after having ported the existing
software to Athena 9.0 and added a Moira DCM feed for group membership
queries.
The test experience also helped to identify and clarify potentially
needed future development work, which could be the basis for
subsequent Delivery tasks.
Resources
Costs
- Hardware
For the test, $36K was spent on the purchase of test machines:
6 Sun Ultra 5 (1 master, 5 slaves) $10,388
5 SGI O2 $25,980
---------------------------------------------
total $36,368
The cost of additional hardware will be determined at the appropriate
time. However, we recommend that the actual service roll-out include
at least 5 Linux slaves, if at all feasible; also, we recommend that
no additional SGI machines be purchased, given the expected short
lifetime of that platform in the Athena environment. Finally, we
may want to consider purchasing additional hardware (e.g. a storage
system) to address master server redundancy issues.
- Staff
The following staff time was spent in preparing for the test:
- Athena Development: 6 FTE months
- Athena Server Operations: 1.5 FTE months
- Support: 1 FTE month
We estimate the following staff resources would be required for
ongoing maintenance:
- Development: 10-20% FTE
- Operations: 1/3 FTE
- Support: 15-25% FTE
Future Work
In the course of implementing and deploying the service prototype, we
identified potential
future development tasks. Generally, these can
be categorized as follows:
- Completing and polishing the existing implementation, e.g. where
short cuts may have been taken in the interest of faster deployment
of the prototype. This work should largely be done before an actual
service roll-out.
- Adding features which were not considered essential initially,
e.g. temporary disk space. Such tasks might be done before an
initial roll-out, if feasible, or deferred as possible future
enhancements, based on customer feedback.
- Upgrading or replacing system components, e.g. OpenPBS and/or
the quota database, in order to support a larger-scale service, or
more advanced capabilities. It would not be required to complete
these tasks for the expected initial scale.
Whether and when to do any of these tasks would depend on specific
Delivery scope decisions. (See our notes on the issues involved in
these tasks.)