Longjobs Discovery Report
Project Information
Sponsor
Vijay Kumar, Assistant Provost and Director, Academic Computing
Discovery Team
- Bob Basch, Athena UNIX Platform Team
- Abby Fox, Faculty Liaison, Academic Computing
- Ted McCabe, Athena Server Operations
Consulting Assistance
- Mitch Berger, Student Employee, Information Systems
- Bill Cattey, Team Leader, Athena UNIX Platform Team
- Phil Long, Sr. Strategist, Academic Computing
- Miki Lusztig, Athena UNIX Platform Team
- Cana McCoy, Athena Server Operations
- Tim McGovern, Sr. Project Manager, I/T Discovery
- Anne Salemme, Network Operations
- Naomi Schmidt, Team Leader, Academic Computing Support Team, and
Manager, Educational Planning and Support (Emeritus)
- Garry Zacheiss, Athena Server Operations
URL
http://web.mit.edu/longjobs/
Email
longjobs-dev@mit.edu
Executive Summary
Rationale
The purpose of this project was to examine the need to execute
long-running jobs in the Athena environment, and, if possible, to
design a feasible service to satisfy that need. Under current rules
of use, our customers have no reliable way to run procedures of long
duration on an unattended Athena workstation. This has led to much
frustration in the user community, and frequent abuse of the rules.
Recommendations
- Develop and support a centralized network batch service for Athena.
- Use PBS (Portable Batch System) software as the basis of the
service, modifying it to be compatible with the Athena environment.
- Keep the initial scale relatively small; do not open to the general
community.
- Employ a group-based quota system to regulate use of the service,
and a reservation mechanism to ensure sufficient resources for classes
during specific time periods.
- Defer consideration of a billing component.
Conceptual Design
We proposed a service with the following characteristics:
- Transparent; compatible with the Athena interactive experience
- Dedicated execution machines, comparable to those in Athena clusters,
would run one job at a time
- Access controlled via registration
- Submit jobs from any Athena workstation
- Jobs queued to a master server, which dispatches jobs to execution
machines as they become available
- Jobs subject to a strict limit of elapsed time
- Kerberos-based authentication and authorization; use of renewable,
forwardable tickets
- AFS support
- Usage limited by a group-based quota system
Status
- Developed a prototype of the proposed service.
- Conducted a successful test of the prototype with selected faculty
and students.
Next Steps
- Continue testing through the fall 2001 term.
- Analyze test results.
- Complete development based on existing prototype.
- Consider needed enhancements.
Business Case
Currently, our customers have no reliable way to execute long-running
procedures, even those requiring no user interaction, without physically
remaining at a workstation console. The Athena Rules of Use state
that workstations cannot be left unattended for longer than 20
minutes, and that such an unattended session can be logged out by
another user. The inability to run such jobs has resulted in a steady
demand for a "long job" service, by faculty who assign work requiring
long-running computations, and by individual students, both graduate
and undergraduate, as part of their academic work. The lack of such
a service has inevitably led to frustration and abuse, and
necessitated our spending more resources on enforcement than is
desirable.
In order to characterize the specific need further, we outlined the
design of a service which, as conceived, would be compatible with the
Athena environment, and prepared a questionnaire to solicit feedback
on it from interested faculty and representative students. The
results demonstrated that our envisioned solution would meet a
preponderance of the expressed need, and provided a basis for
quantifying that need.
We concluded that we should design a service which will give users
the ability to run non-interactive, unattended jobs within the Athena
environment, by providing a pool of dedicated, centrally-managed
Athena server machines on which users can execute such non-interactive
jobs remotely, in as regulated, ordered, and predictable a manner as
possible.
The Proposed Solution
[The following is adapted from the Proposed test of Longjobs service
document.]
The Goal
The intent is to provide a system that is as "transparent", and
compatible with the Athena interactive computing experience, as
possible; a user should easily be able to submit a job as a script of
the same shell commands that would be entered during a normal
interactive login session. The execution machines would be comparable
to normal Athena workstations, with the same facilities available,
except for those facilities that are only meaningful in an
interactive session (e.g. a display).
General Architecture
Access to the service would be granted through a registration process
administered by the appropriate support group (see
Registration, Reservations, and Support). Usage would be limited by a
group-based quota system, maintained on the master; a quota of job
hours would be granted at registration time.
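As an illustration of the group-based quota scheme described above, the following sketch tracks a quota of job hours per group; the names (QuotaDB, grant, charge) are hypothetical and are not taken from the prototype:

```python
# Illustrative sketch of a group-based quota of job hours.
# All names here are hypothetical, not from the actual prototype.

class QuotaExceeded(Exception):
    pass

class QuotaDB:
    def __init__(self):
        # group name -> remaining job hours
        self._remaining = {}

    def grant(self, group, hours):
        """Grant job hours to a group at registration time."""
        self._remaining[group] = self._remaining.get(group, 0.0) + hours

    def charge(self, group, hours):
        """Debit a completed job's elapsed hours; refuse overdrafts."""
        left = self._remaining.get(group, 0.0)
        if hours > left:
            raise QuotaExceeded(f"{group} has only {left:.2f} h left")
        self._remaining[group] = left - hours
        return self._remaining[group]

db = QuotaDB()
db.grant("6.170-students", 100.0)   # quota granted at registration
left = db.charge("6.170-students", 2.5)  # job ran for 2.5 hours
print(left)  # 97.5
```

In the actual service the master would maintain this state, debiting a group's balance as each of its members' jobs completes.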
A registered user could submit a job from any Athena workstation,
specifying the type of Athena machine they want the job to run on, and
other desired job parameters. The job would be directed immediately
to a central "master" server, which would dispatch jobs to "slave"
execution machines, as they become available. An execution slave
would only run one job at a time, so a job would remain queued at the
master until a suitable slave machine became available. Authentication
and authorization would be based on the user's Kerberos principal.
Queues and Scheduling
All submitted jobs will enter precisely one of several queues; the
queues will distinguish jobs based on machine type and time limit,
and will have access control lists. We recommend having queues with
at least two different time limits, for relatively short or long jobs.
Scheduling of all jobs will be based on a modified first-in, first-out
scheme. A slave machine will serve one or more queues (e.g. the
"short" or "long" queue, or both) and may be migrated between queues
to satisfy varying load.
Users could query the server to obtain job status, and cancel jobs
that are no longer wanted. Users would optionally be notified, via
e-mail and/or Zephyr, when a job begins and/or ends.
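The dispatch scheme above can be sketched as follows; this is only an illustration of the modified first-in, first-out behavior (per-queue FIFO order, one job per slave, slaves serving one or more queues), not code from the prototype or from PBS:

```python
# Sketch of FIFO dispatch from per-type queues to one-job-at-a-time
# slaves. Illustrative only; names and structures are hypothetical.
from collections import deque

queues = {"short": deque(), "long": deque()}
# Each slave serves one or more queues and runs at most one job.
slaves = [
    {"name": "slave1", "serves": {"short"}, "busy": False},
    {"name": "slave2", "serves": {"short", "long"}, "busy": False},
]

def submit(queue_name, job):
    queues[queue_name].append(job)   # strict first-in, first-out order

def dispatch():
    """Assign queued jobs to idle slaves that serve the job's queue."""
    started = []
    for slave in slaves:
        if slave["busy"]:
            continue
        for qname in ("short", "long"):   # simple fixed priority order
            if qname in slave["serves"] and queues[qname]:
                job = queues[qname].popleft()
                slave["busy"] = True
                started.append((job, slave["name"]))
                break
    return started

submit("short", "jobA")
submit("long", "jobB")
started = dispatch()
print(started)  # [('jobA', 'slave1'), ('jobB', 'slave2')]
```

A job submitted to the "long" queue simply remains queued until an idle slave serving that queue exists, which is the behavior described above.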
Job Execution
User accounts would be added, and home directories attached, on the
execution machine only for the duration of the job. Jobs would be
subject to a strict limit of elapsed time; any job exceeding that
limit would be killed. Since a slave machine will only run one job at
a time, the user is assured of having no contention for CPU cycles or
other resources while the job is running, so that the time needed to
complete the job will be as predictable as possible. At the end of
each job, the slave will perform a cleanup and check procedure, to
ensure system integrity. Standard output and error files would be
written to the user's directory, or emailed to the user.
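The strict elapsed-time limit can be sketched as follows; here a subprocess timeout stands in for whatever enforcement mechanism the service actually uses, and the function name is hypothetical:

```python
# Sketch of the strict elapsed-time limit: run the job and kill it if
# it exceeds its walltime. Illustrative only, not prototype code.
import subprocess
import sys

def run_with_limit(cmd, walltime_seconds):
    """Run cmd; return (status, killed), killed=True if the limit hit."""
    try:
        proc = subprocess.run(cmd, timeout=walltime_seconds)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired,
        # mirroring the slave killing a job that exceeds its limit.
        return None, True

# A "job" that would run for 5 seconds, limited to 1 second:
status, killed = run_with_limit(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    walltime_seconds=1)
print(killed)  # True
```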
Tickets/Tokens
In order for the job to run with Kerberos credentials, as is required
for most authentication and authorization purposes in the Athena
environment, the user would optionally be able to acquire a long-lived,
renewable Kerberos 5 ticket-granting ticket, which would be forwarded
as part of the job. The master and slave servers would manage these
tickets, renewing them as needed. In addition, the execution server
would use this ticket to acquire Kerberos 4 tickets, and AFS tokens,
the latter being most critical for users to be able to access their
home and other directories from within the job context. Users who did
not wish to forward a long-lived ticket could optionally choose to
forward an existing short-lived ticket, or to run the job without any
tickets or tokens. A renewable ticket could be maintained up to the
maximum life permitted by our Kerberos configuration (currently one
week).
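The renewal bookkeeping the servers would perform can be illustrated with a small calculation: renew shortly before each ticket lifetime expires, but never schedule a renewal that would extend past the maximum renewable life. The function and margin below are hypothetical, not our actual Kerberos configuration logic:

```python
# Sketch of scheduling renewals of a renewable TGT. Hypothetical names;
# real lifetimes come from the site's Kerberos configuration.
def renewal_times(lifetime_h, renewable_life_h, margin_h=1):
    """Hours (from issue time) at which the ticket should be renewed."""
    times, t = [], 0
    while t + lifetime_h < renewable_life_h:
        t += lifetime_h - margin_h   # renew a bit before expiry
        times.append(t)
    return times

# A 10-hour ticket with a 24-hour renewable life needs two renewals:
print(renewal_times(lifetime_h=10, renewable_life_h=24))  # [9, 18]
```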
Implementation
We recommend using the Portable Batch System (PBS) as a basis for
our system. PBS was developed at NASA Ames Research Center, and is
now supported by Veridian Corporation. The open-source version, OpenPBS, is freely
available; a Professional version,
PBS Pro, is also available at very low cost, including source, to
educational sites. Our prototype is based on the open-source version
of PBS.
Athena Development will make the necessary enhancements and additions
to PBS, including the support for Kerberos authentication and
authorization, the management of renewable tickets and tokens,
compatibility with the Athena login system, Zephyr support, and
integration with a group-based quota system. The server machines will
be administered by Athena Server Operations, and registrations by
the appropriate support group (see below).
Registration, Reservations, and Support
The Faculty Liaisons will handle registrations, reservations, and
support for course-based use of the service (through faculty and
TAs). Athena Accounts and Consulting would handle registrations and
support for other uses, according to whatever policies are developed
for such use of the service.
To ensure that sufficient resources are available for their classes,
instructors would be able to request a reservation from the Faculty
Liaisons for a particular number and type of machines during a
specific time period. This would be implemented via a separate queue
mapping to an appropriate subset of machines in the pool, ACLd to the
group for that class. We would require advance notice for such
requests (in order to ensure sufficient machines are available, to
transfer general queues to non-reserved machines, etc.). Similarly,
reservations would probably need to be staggered to build in some free
time for handling transitions.
Registrations would include an automatic expiration date, which would
typically be the end of the current semester.
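The staggering constraint on reservations noted above can be sketched as a simple overlap check with a transition gap; the representation and names below are hypothetical:

```python
# Sketch of the staggered-reservation constraint: two reservations on
# shared machines must be separated by a transition gap. Hypothetical.
def conflicts(res_a, res_b, gap_h=1):
    """res = (start_h, end_h, machines). True if the reservations touch
    a shared machine without the required transition gap between them."""
    if not (res_a[2] & res_b[2]):
        return False                      # disjoint machine sets: fine
    a_start, a_end = res_a[0], res_a[1]
    b_start, b_end = res_b[0], res_b[1]
    return a_start < b_end + gap_h and b_start < a_end + gap_h

r1 = (9, 12, {"slave1", "slave2"})
r2 = (12, 15, {"slave2"})   # back-to-back on slave2: too tight
r3 = (14, 16, {"slave1"})   # shares slave1, but leaves a 2-hour gap
print(conflicts(r1, r2))  # True
print(conflicts(r1, r3))  # False
```

In practice the Faculty Liaisons would apply such a check when accepting reservation requests, which is why advance notice is required.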
Accounting
The system will produce accounting records for each job, containing
user and account names, as well as other pertinent information.
Initially, we will use these to track utilization; eventually, these
records could be input to a billing system, if such a component were
desired or needed to fund expansion of the service.
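As an illustration of how the per-job accounting records could be used to track utilization, the sketch below sums job hours per user; the field names are illustrative, not the actual record format:

```python
# Sketch of summarizing per-job accounting records into utilization
# per user. Field names are illustrative, not the real record format.
from collections import defaultdict

records = [
    {"user": "alice", "account": "6.170", "queue": "short", "hours": 0.5},
    {"user": "bob",   "account": "6.170", "queue": "long",  "hours": 6.0},
    {"user": "alice", "account": "8.02",  "queue": "long",  "hours": 3.5},
]

usage = defaultdict(float)
for rec in records:
    usage[rec["user"]] += rec["hours"]

print(dict(usage))  # {'alice': 4.0, 'bob': 6.0}
```

The same records, keyed by account rather than user, could later feed a billing system if one were added.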
Machines
We recommend that the machines in the slave pool be comparable to
the workstations found in Athena clusters. If feasible, all platforms
in the current Athena release should be represented in the slave pool.
We recommend a two-year renewal cycle for the slaves, in order to
remain comparable with the latest hardware found in the clusters.
Some number of the older slaves could remain in use, perhaps serving
special queues for jobs where top performance is not essential.
Design Issues
We identified the following issues concerning our proposed design:
- While predictability was an important factor in our design decisions,
there is no guarantee that a user's job will run before a certain
time. However, unacceptable delays should not be a chronic problem.
The registration and quota system will serve to regulate the
service, and prevent over-subscription. As with most other
services, users need to consider contention for the service when
planning their work.
- There is also no guarantee that any external resources the job might
require, such as locker software, or a software license based on the
number of active users, will be available when the job executes. We
believe that this issue does not need to be addressed in the
initial implementation; if experience warrants, we should consider
the feasibility of ways for the service to keep jobs queued until
the required resource is available.
- Jobs in this environment will be completely non-interactive; there
will be no X display or controlling terminal. Programs which require
such an interactive component, even for a brief initialization stage,
will not be supported. In gathering data on user needs, we discovered
that the ability to run interactive procedures would seldom, if ever,
be required; we found that any software typically used in long-running
computations tends to be capable of running in a non-interactive mode.
We proposed providing a test facility, by which users could test that a
program runs successfully in the non-interactive longjobs environment.
We addressed this in the prototype by providing a program that tests
a job script, by executing the script in a local environment which
simulates the environment in which the script would run if submitted
as a longjob. We consider documentation of the non-interactive
capabilities of the software available on Athena to be an important
part of the support effort.
- Even if the user acquires a long-lived renewable Kerberos ticket
for use by the job, there is no guarantee that the ticket will not
expire while the job is waiting to execute. We addressed this problem
in the prototype by implementing a program by which the user can
create a new ticket-granting ticket for an already-queued job, and
having the server notify the user when a job's ticket is nearing
expiration.
- Using long-lived Kerberos tickets would be a serious problem if the
system were ever compromised; currently there is no way for a stolen
ticket to be revoked, even if reported. Cautious users may
prefer to submit jobs without a ticket; however, the job would then
not be able to access the user's AFS directories (unless the
directory was world-accessible). A way to alleviate this problem
would be to have an option by which the execution server could
acquire default AFS tokens for the job. One such possibility would
be to use a special AFS group for the execution servers; the user
would have to add this group to the directory's ACL, thus also
making the directory accessible to other users of the service.
Another possibility would be to create a per-user group to which the
longjobs server had administrative rights; this is more secure than
the first option, but raises significant overhead and support cost
issues.
- Due to the special security considerations of this system (particularly
with respect to managing long-lived tickets), the execution servers
will be configured to run whatever system management procedures
are deemed necessary to ensure system integrity (similar to what is
done now on Athena dial-up machines). Since this could involve a
prohibitive amount of new development work for a particular
platform, it is conceivable we may choose not to provide any
execution servers for that platform. The goal, however, would be to
have available all platforms which are part of the current Athena
release, if feasible.
- Based on the perceived need, the intent is to provide execution
machines which are essentially equivalent to workstations in the
current Athena deployment. Our proposed service will not provide
users with additional computational capacities, beyond what is
available on a typical Athena workstation; in particular, we do
not propose supporting any of the following at this time:
- "Supercomputer" machines, or machines with much more memory or disk
space than might be found on a machine in an Athena cluster.
- "Parallel" execution, i.e. a job requiring
multiple execution machines.
- Executing jobs on, or submitting jobs from, non-Athena platforms.
However, there should be nothing inherent in the system which would
preclude its being extended to support any of the above in the
future (though other considerations might make such support infeasible).
- During the initial design discussions, we did not consider the
issue of providing temporary disk space to be critical; our
expectation was that users would copy any needed job input/output
from/to their lockers, just as they would during interactive
sessions. However, testing revealed that there would be a need for
temporary space, to handle situations where, in an interactive
session, users would use local devices (e.g. Zip drive) to copy large
data files in and/or out. A short term solution is to create
temporary lockers or volumes manually; during testing, this was done
by the FLs for one class (following the existing procedure for providing
extra student coursework space via volumes mounted in the course locker).
If users outside of classes request extra space, that would create an added
support burden (perhaps to be handled by Athena Accounts) and would
require clear policy guidelines. Long term,
we should consider implementing an automated system to create and
clean up temporary disk space in AFS, perhaps using a separate
AFS cell in which the master's Kerberos principal would have
administrative control. (In PBS, this issue is dealt with via a
"stage-in/stage-out" facility, in which the user specifies files
to be copied in to, or out of, the execution server at the beginning
or end of job; we feel this is not terribly useful in our
environment, where the real issue is having the files on-line in
the first place).
- We do not envision supporting "very long jobs", i.e. those requiring
many days or weeks to complete. The maximum time limit for any
queue will probably be on the order of one or two days. Even if a
procedure requires more execution time, it is generally possible, and
advisable, to break it up into smaller pieces.
Eventually, it may make sense to extend the maximum time limit for
jobs running in special queues on older slaves. In any case, the
choice of maximum time limit is technically constrained by the
maximum lifetime of renewable Kerberos tickets.
- A controversial issue was whether to require a billing component
from the outset, in order to be able to recover the costs of meeting
greater demand. We recommended deferring this question because:
- Many feel that it is inappropriate to charge for a service
meeting a basic academic computing need.
- We expect that the budget would cover the cost of running a
service at the anticipated initial scale. Planning for future
expansion or rationing in case of demand exceeding
resources would presumably be folded into the existing Academic
Computing rounds, faculty proposals, and renewal processes.
In addition, given the lack of a general billing infrastructure
for handling
academic computing services, adding a billing component could
incur a significant cost in staff resources.
- Because of security requirements, users will not be able to run
jobs as root. We see no need to support submitting or executing
jobs except as the user's null-instance Kerberos principal, using
the normal user ID as given by Hesiod.
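On the "very long jobs" point above, the advice to break a long procedure into smaller pieces can be illustrated with a trivial calculation: divide the total work into segments that each fit within a queue's time limit, to be submitted in sequence. This is purely illustrative:

```python
# Sketch of breaking a long procedure into pieces that each fit a
# queue's time limit. Purely illustrative; names are hypothetical.
def split_into_segments(total_hours, limit_hours):
    """Return per-segment durations, each within the queue time limit."""
    segments = []
    remaining = total_hours
    while remaining > 0:
        chunk = min(remaining, limit_hours)
        segments.append(chunk)
        remaining -= chunk
    return segments

# A 100-hour computation against a 48-hour queue limit:
print(split_into_segments(100, 48))  # [48, 48, 4]
```

Each segment could be submitted as its own job, with intermediate state saved to the user's locker between segments.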
Status
Following the report of our initial recommendations, at the request
of the Sponsor, we proposed a feasibility test of the service. Upon
approval, we completed the necessary development of the prototype,
and launched the test for selected courses and individual students
in the Spring 2001 term.
Five SGI O2 and five Sun Ultra 5 machines were purchased for the
test slave pool, plus another Ultra 5 to serve as the master.
At the end of the term, we reported the results of the test. While
the results were generally favorable, we would have preferred to
have had greater load, and to have tested the reservation system.
We recommended continuing the test in the Fall 2001 term, and successfully
launched the test in September 2001, after having ported the existing
software to Athena 9.0 and added a Moira DCM feed for group membership
queries.
The test experience also helped to identify and clarify potentially
needed future development work, which could be the basis for
subsequent Delivery tasks.
Resources
Costs
- Hardware
For the test, $36K was spent on the purchase of test machines:
6 Sun Ultra 5 (1 master, 5 slaves) $10,388
5 SGI O2 $25,980
---------------------------------------------
total $36,368
The cost of additional hardware will be determined at the appropriate
time. However, we recommend that the actual service roll-out include
at least 5 Linux slaves, if at all feasible; also, we recommend that
no additional SGI machines be purchased, given the expected short
lifetime of that platform in the Athena environment. Finally, we
may want to consider purchasing additional hardware (e.g. a storage
system) to address master server redundancy issues.
- Staff
The following staff time was spent in preparing for the test:
- Athena Development: 6 FTE months
- Athena Server Operations: 1.5 FTE months
- Support: 1 FTE month
We estimate the following staff resources would be required for
ongoing maintenance:
- Development: 10-20% FTE
- Operations: 1/3 FTE
- Support: 15-25% FTE
Future Work
In the course of implementing and deploying the service prototype, we
identified potential
future development tasks. Generally, these can
be categorized as follows:
- Completing and polishing the existing implementation, e.g. where
short cuts may have been taken in the interest of faster deployment
of the prototype. This work should largely be done before an actual
service roll-out.
- Adding features which were not considered essential initially,
e.g. temporary disk space. Such tasks might be done before an
initial roll-out, if feasible, or deferred as possible future
enhancements, based on customer feedback.
- Upgrading or replacing system components, e.g. OpenPBS and/or
the quota database, in order to support a larger-scale service, or
more advanced capabilities. It would not be required to complete
these tasks for the expected initial scale.
Whether and when to do any of these tasks would depend on specific
Delivery scope decisions. (See our notes on the issues involved in
these tasks.)