MIT Information Systems

Proposed test of longjobs service


We propose to implement a test longjobs service, based on the service design we have outlined, involving a small number of courses and possibly some individual users. The broad goals of the test are to determine whether such a service meets users' needs and to gather data on its resource requirements; they are discussed in detail under Proposed service test below.


Document Layout

This document describes the problem we are addressing, the proposed solution, our concerns about such a service, the open issues, and the proposed service test, including the factors affecting its scale.


The Problem

The major problem we are addressing is the requirement, as stated in the Athena Rules of Use, that users remain present at the console of an Athena workstation for the duration of a session. This requirement presents a dilemma for users who need to execute long-running, typically non-interactive jobs for legitimate academic purposes, and it has led to both frustration and abuse.

Our proposed service would give users the ability to run non-interactive, unattended jobs within the Athena environment, by providing a pool of dedicated, centrally-managed Athena server machines on which users can execute such non-interactive jobs remotely, in as regulated, ordered, and predictable a manner as possible.


The Proposed Solution

The goal

The intent would be to provide a system that is as "transparent" and as compatible with the Athena interactive computing experience as possible; a user should easily be able to submit a job as a script of the same shell commands that would be entered during a normal interactive login session. The execution machines would be comparable to normal Athena workstations, with the same facilities available, except for those suited only to interactive use (e.g. a display).
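
As a minimal sketch of this model (the directory and program names are hypothetical), a job script would contain little more than the commands the user would otherwise type at an interactive prompt:

  #!/bin/sh
  # Hypothetical longjobs script: run a simulation exactly as it would be run
  # from an interactive Athena session, writing results into the user's
  # (AFS) home directory.
  cd $HOME/thesis/run1
  ./simulate -in model.dat -out results.dat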

General architecture

Access to the service would be granted through a registration process administered by the appropriate support group (see Registration, Support, and Reservations). Usage would be limited by a group-based quota system, maintained on the master; a quota of job hours would be granted at registration time.

A registered user could submit a job from any Athena workstation, specifying the type of Athena machine they want the job to run on, and other desired job parameters. The job would be directed immediately to a central "master" server, which would dispatch jobs to "slave" execution machines, as they become available. An execution slave would only run one job at a time, so a job would remain queued at the master until a suitable slave machine became available. Authentication and authorization would be based on the user's Kerberos principal.
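
As an illustration of the submission step (the queue name, resource keywords, and time-limit syntax below are assumptions about how our PBS configuration might look, not a committed interface), a user might type:

  # Submit a job script to a hypothetical "long" queue, requesting a Sun
  # execution machine and a 24-hour time limit; qsub prints the assigned job id.
  qsub -q long -l arch=sun4u,walltime=24:00:00 myjob.sh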

Queues and Scheduling

All submitted jobs will enter exactly one of several queues; the queues will distinguish jobs based on machine type and time limit, and will have access control lists. We recommend having queues with at least two different time limits, one for relatively short jobs and one for long jobs.

Scheduling of all jobs will be based on a first-in, first-out scheme. A slave machine will serve one or more queues (e.g. the "short" queue, the "long" queue, or both) and may be migrated between queues to satisfy varying demand.

Users can query the server to obtain job status, and cancel jobs that are no longer wanted. Users could optionally be notified, via e-mail and/or Zephyr, when a job begins and/or ends.
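
For example, using the standard PBS client commands (the job id and addresses shown are placeholders; Zephyr notification would be a local enhancement, for which PBS has no standard flag), a user might do:

  qstat -u username                       # list that user's queued and running jobs
  qdel 123.master                         # cancel job 123 (job-id format is illustrative)
  qsub -m be -M username@MIT.EDU job.sh   # ask for mail when the job begins and ends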

Job execution

User accounts would be added, and home directories attached, on the execution machine only for the duration of the job. Jobs would be subject to a strict limit on elapsed time; any job exceeding that limit would be killed. Since a slave machine will only run one job at a time, the user is assured of having no contention for CPU cycles or other resources while the job is running, so the time needed to complete the job will be as predictable as possible. At the end of each job, the slave will perform a cleanup and check procedure to ensure system integrity. Standard output and error files would be written to the user's directory, or e-mailed to the user.

Tickets/tokens

In order for the job to run with Kerberos credentials, as is required for most authentication and authorization purposes in the Athena environment, the user would optionally be able to acquire a long-lived, renewable Kerberos 5 ticket-granting ticket, which would be forwarded along with the job. The master and slave servers would manage these tickets, renewing them as needed. In addition, the execution server would use this ticket to acquire Kerberos 4 tickets and AFS tokens; the tokens are the most critical, since they allow the job to access the user's home and other directories. Users who did not wish to forward a long-lived ticket could instead forward an existing short-lived ticket, or run the job without any tickets or tokens. A renewable ticket could be maintained up to the maximum lifetime permitted by our Kerberos configuration (currently one week).
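
As a sketch of the intended credential flow (the kinit options shown are those of MIT Kerberos 5; the forwarding and renewal mechanics would be handled by our PBS enhancements), a user wanting a long-lived ticket might run:

  # Obtain a forwardable ticket-granting ticket renewable for up to one week.
  kinit -f -r 7d username
  # The submission client would forward this TGT to the master along with the
  # job; the master and slave would renew it as needed, and the slave would use
  # it to obtain Kerberos 4 tickets and AFS tokens (e.g. via aklog) for the job.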

Implementation

We will use the Portable Batch System (PBS) as the basis for our system. PBS was developed at NASA Ames Research Center and is supported by MRJ Technology Solutions, recently acquired by Veridian Corporation. It is freely available in source form, and we are free to modify it, though not to redistribute it.

Athena Development will make the necessary enhancements and additions to PBS, including support for Kerberos authentication and authorization, management of renewable tickets and tokens, compatibility with the Athena login system, Zephyr support, and integration with a group-based quota system. The server machines will be administered by Athena Server Operations, and registrations by the appropriate support group (see below).

Registration, Reservations, and Support

The Faculty Liaisons will handle registrations, reservations, and support for course-based use of the service (through faculty and TAs). Athena Accounts and Consulting would handle registrations and support for other uses, according to whatever policies are developed for such use of the service.

To ensure that sufficient resources are available for their classes, instructors would be able to request a reservation from the Faculty Liaisons for a particular number and type of machines during a specific time period. This would be implemented via a separate queue, mapped to an appropriate subset of machines in the pool and ACLd to the group for that class. We would require advance notice for such requests (in order to ensure that sufficient machines are available, to transfer general queues to non-reserved machines, etc.). In addition, reservations would probably need to be staggered to allow some free time for handling transitions.

See Over-subscription for more details on the registration process.


Concerns

Over-subscription

A low or non-existent cost for using the longjobs service may encourage more usage than can readily be supported. However, if the service is to be viable, the cost for the user needs to be low enough to encourage the use of the service instead of the already "free" service provided by the available cluster machines. In other words, it is important not to create an under-subscription problem in addressing this issue.

Preventing over-subscription need not rely solely on charging for use of the service; barriers to entry or throttles on the amount of usage are also possible. The methods available to this service are registration, quotas that limit usage, billing for actual use, and the queueing system itself.

For a service test, we propose the use of quotas and registrations. The issues surrounding billing are discussed in the next section.

The service will be available only to registered users. Registration will be based on the user's Kerberos principal, and will require some description of the intended use, and an estimate of the quota needed. (If we decide to implement billing, a billable account would also be required).

Registrations would typically be done on a group basis, by departments and/or faculty members on behalf of students in a class. Registered users will be assigned one or more group names, which serve to classify the intended uses. A registration request will be denied if the service is already at its subscription limit. A user must provide a valid (user, group-name) pair in order to submit a job to the service.

Registrations will expire. The expiration time will be specified or negotiated at registration time, imposed by policy (e.g. no registration may last beyond the current semester or year), or some combination thereof. Expiration of registrations is necessary for reasonably accurate estimates of the quota likely to be used.

The quota available to a user is the sum of the quotas for all of that user's (user, group-name) pairs. The quota granted is not necessarily what was requested at registration; it may be less, due to constraints on the total quota hours available. Quotas will be valid until a registration expires or is renewed; in the case of renewal, a new quota will be assigned. Quotas may be valid only on certain types of machines (e.g. SGI only, or older Sparcs only), to allow for better estimation of future service usage.
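
For example (the numbers are purely illustrative), a user registered under a class group granted 40 job-hours and under a research group granted 25 job-hours would have 65 job-hours of quota available in total, possibly restricted to particular machine types.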

The queueing system also helps keep the service from being over-subscribed by providing both positive and negative reinforcement: specifically, it teaches users to plan to submit their jobs at off-peak times in order to get their jobs running and finished sooner.

Since delays in running jobs may appear to users as the highest cost of the service, we have an incentive to limit the registrations and quota assigned. Doing so, however, comes at the cost of turning people away from the service and pushing them back toward cluster machines. Balancing these costs will take some experience.

Costs

As our policy will state explicitly that execution machines will be comparable to cluster workstations, with no intention of providing super-computing capability, we estimate that the cost of purchasing a slave for the proposed test currently would be about $2800 per machine for a Sun (Ultra 5/400, 256MB RAM), $1700 for a Linux PC (Dell GX110, 256MB RAM), and $4800 for an SGI (rackmount O2, 128MB RAM).

For an eventual service roll-out, we propose a 2-year renewal cycle for the slaves, in order to keep pace with the cluster machines. Old slaves could be recycled for other services, or kept in the pool with a conversion factor relating their job-hours to those of the faster slaves.

For the master server machine to be used in an eventual production service, our ballpark estimate is a cost of up to $15000; this includes the optional cost of some form of hardware-based redundancy. This would not be needed for the proposed test, for which we estimate a cost of about $3900 (Ultra 5/400, 512MB RAM).

Finally, we estimate that the following staff costs would be incurred:

  • Development: 6 FTE months of initial development, plus 10% of an FTE for ongoing maintenance.
  • Operations: 1-2 FTE months initial, 1/3 of an FTE ongoing.
  • Support: 1 FTE month initial, 15-25% of an FTE ongoing.

We will track the staff time spent maintaining the service, to make sure that it does not become an inordinate drain on IS resources.

Security

  • The system will perform Kerberos 5-based authentication and authorization.
  • Users will not be given root access to the execution machine.
  • Users will not be allowed to log in remotely to the execution machine. However, we do not plan to implement additional measures to suppress the ability to run interactive commands (e.g. xterm), as is done on the dialup machines: system load is not an issue here, and such measures would likely be ineffective, given that the user can already execute arbitrary commands on the slave. (See Open Issues.)
  • Users will have the option of running jobs without tickets, and with no or default tokens, if they do not want to create a long-lived ticket for their jobs.
  • Regular integrity checks will be performed on the slave and master machines. Slaves will only run one job at a time; any user processes still running at the end of a job will be killed.
  • No jobs will run on the master.

Scale, Robustness

  • One suitably-configured master server should be able to support many times the number of slaves currently being considered.
  • The number of jobs being managed by the master is probably a better measure of scale than the number of slaves, and this number will probably be relatively low in our environment compared to other sites' use of batch facilities. The strict regulation of the service through registration and quotas will also tend to keep the scale smaller than might be feared.
  • We envision the use of some form of redundancy, either hardware or software-based, so that the master server will not be a single point of failure. This would not be done for the proposed test, however.
  • Should the service grow sufficiently, additional masters could be deployed, each with its own set of slaves; we could also consider implementing a multiple-master scheme.
  • Registration could be automated, if we decide to make the service available to a wider community.


Open Issues

Billing

The possibility of billing for use is a controversial issue. On the one hand, charging users directly for usage could help prevent the system from being over-subscribed (see above), and/or allow us to recover at least part of the costs of running the service. On the other hand, many of the users we have corresponded with seem to feel strongly that, just as with the use of "free" cluster workstations, they should not be billed for using the service for valid academic purposes. Also, currently there is no general billing infrastructure for handling academic computing services, so adding a billing component would likely incur a significant cost in staff resources.

Given the presumption that at least some fraction of the service will be subsidized (i.e. users would not be billed directly for such use), the lack of supporting infrastructure for billing, and the controversy surrounding charges for a service viewed as meeting a basic academic computing need, we recommend that no billing system be implemented for our proposed test of the service.

IS could decide to add a billing component, possibly as part of an actual roll-out of the service, or later, as a basis for expanding the service or recovering the costs of maintaining it. Regardless, the system will always write accounting records containing any fields (e.g. account number, type) that might be required to interface to an eventual billing system; this data would be derived from the group under which the user submitted the request. Such records will also be critical for determining how well the system performs; we will write tools to extract information such as average job and queue wait times.

We expect that the proposed service test would help to suggest what eventual billing model, if any, might be appropriate.

SGI platform support

A disproportionate share of the expressed need for the service is for software that runs only on the SGI platform, even though we are striving to reduce the share of SGI machines in the Athena clusters. Given that we intend to support SGI software on cluster machines until 2003, and given the suggested 2-year renewal cycle for slave servers, we recommend supporting the SGI platform for the test. Our hope is that the migration path for the software in question will be clearer by the time we need to purchase the machines; if so, we could revise this recommendation accordingly.

Suppressing interactive capabilities

There is concern that the system should prevent users from running interactive shells in their jobs (e.g. by starting an xterm displaying back to the user's workstation), so that they would not be able to probe the execution machine for possible security weaknesses. While our Sun dialup servers do forbid running xterm (and other X client) processes that display remotely, via modifications to the kernel, that is done in the interest of minimizing system load, which is not an issue for a longjobs slave. It could be argued that the significant effort required to make similar modifications to the other operating systems used by this service would not be worthwhile. (Note that any such process started by the user's job would be killed at the end of the job.)


Proposed service test

We propose a service test, restricted to a few classes and possibly a small number of individual students, in order to examine the basic feasibility of the proposed service in satisfying the core need: running long, unattended jobs on Athena for academic purposes. The main goals of this test will be to determine whether this approach provides users with an effective solution, and to gather data on its resource requirements. Whether to actually establish a longjobs service, on what scale, and with what funding sources are not questions for this team to decide, but we believe the test will yield important information needed as input to those decisions.

Questions to be answered by the test

  • Does a batch system satisfy the preponderance of user needs?
  • How well does PBS work in our environment (including local enhancements)?
  • Are there any issues with particular third-party (3partysw) applications?
  • Is the basic scheduler implementation satisfactory?
  • Is our security model appropriate and adequate?
  • Does the service's automatic renewal of tickets and tokens work?
  • Is the master server configuration satisfactory?
  • Is the user interface satisfactory?
  • Are our provisions for dealing with failure modes satisfactory?
  • Are our accounting records satisfactory, both for analysis and SAP?
  • What are the usage patterns?
    • usage level in idle periods vs. peak periods
    • wait times and queue length
    • usage during reserved times vs. other periods
    • collisions between classes
    • implications of these patterns for Ops and support
    • implications for capacity planning and budget
  • How well does the provided capacity meet the actual usage?
  • Is the proposed quota scheme workable?
    • quota calculations vs. actual usage
    • user behavior, requests for adjustments, other support questions
  • Is the proposed reservation system workable?
  • How accurate were our operational and support estimates?
    • Support burden of registrations, quotas, reservations
    • Burden of additional server platforms for ops
  • How much did the test really cost?
  • What are likely additional costs for Ops and support with scale?
  • What billing model is suggested by the test experience?

Participants and Expectations

Selection of participants would be made closer to implementation time, when such details as platform support for SGI-only applications and needs for the specific term are clearer; some likely candidates are those whose estimates are shown in the section on scale factors, below. The number and type of machines allotted to the test will also be a factor in how many classes, and of what sizes, to include, but we would recommend distributing participation across several departments with the greatest perceived need, a willingness to help test under the prescribed conditions, and the best apparent chance of providing representative data.

We recommend also including some individual students with legitimate academic needs outside of classes, to gather as much data as possible on the continuing cases of students who appear to be violating the Rules of Use (against leaving a machine unattended or using multiple machines) in the clusters because they have no other means of running non-interactive jobs.

We will explain to the participants that we are conducting the test to determine whether such a service meets their needs, to answer technical questions, and to gather data on what such a service would cost IS as the basis for determining whether an acceptable billing model could be applied to it. All participants would be required to agree to some basic conditions:

  • this is a feasibility test, not a beta; there is no guarantee of service and participants must have backup plans ready in case of problems
  • there is no implied promise of a future service roll-out, and no commitment to subsidize usage costs if a service is established
  • testers will be expected to provide requested feedback in a timely fashion

Timeline

Realistically, spring 2001 is the earliest term when we could begin a test. The test should run for a full term, after which there should be a decision to do one of the following:
  • extend the test (if additional data is needed)
  • begin implementing an actual service (perhaps a staged pilot)
  • abandon the longjobs effort, explain why to the testers and other users awaiting a decision, and decide how else to address their needs
The following steps need to be completed before the test can begin:
  • write software required for the test
  • line up the participants
  • buy machines (requires two months' lead time)
  • code review (at least with respect to security issues)
  • usability test
  • documentation

Scale Factors

Factors in considering the scale for the test include:
  1. Peak usage periods: the calculations below give examples of the relation between the number of machines and the time needed to process the set of jobs for a single class during a given peak period.
  2. Collisions between classes: in order to learn about behavior of both the system and users when more than one class is submitting jobs, we would ideally include 3 or more classes in the test.
  3. Third-party (3partysw) applications: A large proportion of classes are interested in longjobs for BioSym/MSI software, which is currently available only for SGI. While this software may migrate to other platforms over the next several years, and we plan to phase SGIs out of the clusters by 2003, given that we aim to support software that runs in the clusters and would renew equipment on a 2-year cycle, it does not seem reasonable to rule out SGI support at the outset. As we near implementation, future platform support for these applications may be clearer, perhaps leading to revised recommendations.
  4. Machine cost: For slave machines, we estimate hardware will cost:
    • $2800 per Sun (Ultra 5/400, 256MB RAM)
    • $1700 per Linux PC (Dell GX110, 256MB RAM)
    • $4800 per SGI (rackmount O2, 128MB RAM)
    and for the master server, the estimated cost is $3900 (Ultra 5/400, 512MB RAM). A more complete discussion is in the Costs section above.

Peak Usage Calculation

Assuming that students working on an assignment will tend to submit their jobs at about the same time, we look at the time span needed to complete the whole set of jobs for a given class, as a function of how many machines are available to them.

Observations:

  • given M dedicated machines, the time span to complete a total of J job-hours will be J/M hours, or J/(24M) days
  • M of the students can have their jobs complete in real time (e.g., 24 hrs for a 24-hr job); the remainder would wait some multiple of the job length, up to the aggregate time span (e.g., the next M students wait 48 hrs, the next M wait 72 hrs, ..., and the last M wait J/M hrs).
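
For example, in Case 1 below, 600 total job-hours spread over 5 machines gives a span of 600/5 = 120 hours, or 5 days; the first 5 students' jobs finish in about a day, while the last 5 wait roughly 5 times the length of a single 24-hour job.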

Case 1

  • 25-30 students
  • 20-24 hours per job
  • 600 total job-hours
  no. of machines    approx. time span    ratio (span : job length)
         1                25 days                  25x
         5                 5 days                   5x
         7               3.5 days                 3.5x
        10               2.5 days                 2.5x
        25                 1 day                    1x

This corresponds to estimates from Ceder (MatSci, 30 students x 20 hrs/job) and Rutledge (ChemEng, 25 students x 24 hrs/job), both on SGI.

Case 2

  • 50 students
  • 3 hours per job
  • 150 total job-hours
  no. of machines    approx. time span    ratio (span : job length)
         1                 1 week                  56x
         2                 3 days                  24x
         6                 1 day                    8x
        10                15 hours                  5x
        25                 6 hours                  2x
        50                 3 hours                  1x

This corresponds to estimates from Cesnik (AeroAstro) and Gupta (Sloan), who each estimated 50 students for "a few hours", both on Sun.

Number of machines: Three Options

We can take one of several approaches (a rough equipment-cost comparison follows the list):
  • Small number of Suns and SGIs (e.g. 5-7)
    pro:
    • tests applications on both platforms
    • tests operational and support issues for multiple platforms

    con:
    • fewer classes on each platform ==> fewer collisions to observe

  • Larger number of Suns, no SGIs (e.g. 10-15)
    pro:
    • more classes ==> more chances to observe collisions

    con:
    • no data on SGI applications
    • no data on supporting multiple platforms

  • Larger number of one platform, small number of other (e.g. 10, 5-7)
    pro:
    • compromise between the above

    con:
    • higher equipment cost
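
As a rough equipment-cost comparison using the hardware estimates above (reading the first option as roughly 5 of each platform, the second as 12 Suns, and the third as 10 Suns plus 5 SGIs, all purely illustrative points within the suggested ranges), the options would cost on the order of $42,000, $37,500, and $56,000 respectively, including the roughly $3900 test master.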

Last modified: Thu Sep 27 18:53:24 EDT 2001