We are prepared to report back on the expressed needs for a longjobs
service.  We solicited and received feedback on a service description
outline that we prepared, and feel that we now have enough data to
recommend that we proceed to develop a project pilot.

To summarize, we outlined a service with the following features:

    * Users could submit jobs from any Athena workstation.  They could
      also monitor the job queues and status, and cancel an unwanted
      job.  Job parameters and requirements (e.g. platform, time)
      would be supplied at submit time.

    * A master server would schedule and dispatch jobs to slave
      execution servers using a first-in/first-out algorithm,
      modified for fairness.

    * Execution machines would run only one job at a time.

    * Jobs would be subject to a strict limit on elapsed time; the
      maximum time allowed would be on the order of a couple of days.

    * Authentication and authorization would be based on the user's
      Kerberos principal.

    * The service would support the use of forwardable, renewable
      Kerberos 5 tickets, which would be used to acquire Kerberos 4
      tickets and AFS tokens.  The system would handle renewal
      automatically.  Users would also have the option of forwarding
      a non-renewable ticket, or of running the job with no ticket at
      all, possibly with server-based default tokens.

    * All software which can be run non-interactively on a current
      Athena machine would be supported.

    * Execution machines would be comparable to Athena cluster
      machines; all platforms that are part of the current Athena
      release would be represented in the machine pool.
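
As an illustration of the scheduling feature above, here is a minimal
sketch of a first-in/first-out queue modified for fairness.  The
fairness rule shown (prefer the user with the fewest running jobs,
then fall back to submission order) is an assumption made for this
example; the outline does not specify one.  The names Job and
pick_next are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    seq: int    # submission order; lower means submitted earlier
    user: str   # owner of the job (e.g. a Kerberos principal)


def pick_next(queue, running):
    """Return the queued job to dispatch next, or None if none is queued.

    Assumed fairness rule: users with fewer running jobs go first;
    ties are broken by plain first-in/first-out order.
    """
    if not queue:
        return None
    active = Counter(job.user for job in running)
    return min(queue, key=lambda job: (active[job.user], job.seq))
```

With nothing running, this reduces to plain first-in/first-out; once a
user has a job running, that user's remaining jobs yield to other
users' jobs.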

[The questionnaire form is at:

     http://web.mit.edu/longjobs/doc/s-d-feedback.html

The service description is at:

     http://web.mit.edu/longjobs/doc/service-description.html]

The questionnaire asked how a longjobs service might be used, and
to what extent a service such as we described would meet user needs.
We corresponded with faculty members, graduate students, and Stopit
offenders; the responses we received were generally favorable toward
the proposed service as described.

Summing up the responses, we estimate a potential need on the order
of 10,000 job hours in a peak month.  Since one execution machine
could provide about 700 job hours in a month, an initial pool of
15-20 execution servers could handle the expected peak load.
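
The arithmetic behind this estimate can be checked with a short
sketch.  The two input figures come from the estimate above; the
function names are hypothetical, and the utilization figure is only
an illustration of the headroom a 20-machine pool would leave.

```python
import math

PEAK_JOB_HOURS = 10_000     # estimated demand in a peak month
HOURS_PER_MACHINE = 700     # job hours one execution machine provides per month


def servers_needed(job_hours, per_machine):
    """Smallest whole number of machines that covers the demand."""
    return math.ceil(job_hours / per_machine)


def peak_utilization(job_hours, per_machine, machines):
    """Fraction of the pool's monthly capacity used at peak demand."""
    return job_hours / (per_machine * machines)


print(servers_needed(PEAK_JOB_HOURS, HOURS_PER_MACHINE))                   # 15
print(round(peak_utilization(PEAK_JOB_HOURS, HOURS_PER_MACHINE, 20), 2))   # 0.71
```

Fifteen machines is the bare minimum for the estimated peak; the upper
end of the range leaves room for uneven arrival of jobs and for
machines that are out of service.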

Our description left an accounting/billing component as an open question.
Given that:

      1) many feel that it is inappropriate to charge for a service
         meeting a basic academic computing need; and

      2) we expect that the budget would cover the cost of running a
         service at the anticipated initial scale,

we feel that the best approach to prevent overuse, at least initially,
is to implement a group-based quota system.  Under this scheme, users
would receive a certain quota of job hours based on membership in
registered groups, e.g. course-based groups.  Individual users could
also register for the service, by showing that they would use it
for a valid academic purpose.

Key features of this scheme include:

    * Users could have quotas for multiple groups, including the general
      group for individual use; they would specify which quota to
      charge at job submission time.

    * Professors could arrange to reserve a set of execution servers
      for a course-based group, similar to the way rooms are reserved now,
      for class assignments.  These machines would be unavailable to
      serve other jobs for the duration of the reservation.

    * Registrations and quotas would have an automatic expiration period,
      probably defaulting to one semester.

    * The system would write accounting records for all jobs.  Initially,
      these would be used only to help us monitor how well the service is
      performing.

    * A billing component could be added later, if necessary, to
      recover the costs of meeting greater demand, though there would
      presumably always be quotas for some amount of free use for
      appropriate needs.
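
To make the quota scheme concrete, here is a minimal sketch of charging
a job against one of a user's group quotas.  Everything in it is an
assumption for illustration: the Quota record, the charge function,
the group names, and the choice to debit the requested hours when the
job is submitted (the scheme above does not say whether requested or
actual hours would be charged).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Quota:
    group: str          # registered group this quota belongs to
    hours_left: float   # job hours still available
    expires: date       # last day on which the quota is valid


def charge(quotas, group, hours, today):
    """Debit `hours` from the user's quota for `group`.

    `quotas` maps group name to Quota for a single user.  Returns True
    if the charge succeeded, False if the user has no quota for that
    group, the quota has expired, or too few hours remain.
    """
    quota = quotas.get(group)
    if quota is None or today > quota.expires or hours > quota.hours_left:
        return False
    quota.hours_left -= hours
    return True
```

A user with both a course-based quota and a general quota would simply
hold two entries in the mapping and name one of them at submit time.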

We also have several areas of concern:

    * Use of the service seems likely to be characterized by peaks
      of heavy usage (e.g. for class assignments), along with periods
      of low usage.  We need to have enough execution servers to
      provide reasonable service at peak times.

    * A disproportionate share of the expressed need is to run jobs which
      only run on the SGI platform, though we are striving to reduce the
      share of SGI machines in Athena clusters.  Our hope is that the
      migration path for the software in question will be clearer by
      the time we need to purchase the machines.

    * Given the lack of a general billing infrastructure for handling
      academic computing services, adding a billing component would
      likely incur a significant cost in staff resources.

The master and slave server machines would be administered by Athena
Server Operations.  General registrations would be handled by
Accounts, and Faculty Liaisons would administer the course-based
reservations.  General user support would be provided by OLC, and
course-related support by the FLs.

The software for the system would be based on the Portable Batch System,
a freely-available network batch package.

We estimate that the following staff costs would be incurred:

    * Dev: 6 FTE months initial development, plus 10% FTE for ongoing
      maintenance.

    * Ops: 1-2 FTE months initial, 1/3 FTE ongoing.

    * Support: 1 FTE month initial, 15-25% FTE ongoing.

We also estimate the following hardware costs:

    * Up to $5000 per slave server.  These would be on a 2-year
      renewal cycle; at least some of the retired servers could
      possibly remain in the pool, and others could be used for other
      services.

    * Up to $15000 for a master server; this includes the cost of some
      form of hardware-based redundancy, which would not be required
      for a pilot.  One suitable master server would be able to support
      several times the number of slaves we are considering.
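
For a rough sense of scale, the per-machine ceilings above can be
combined with the pool size of 15-20 execution servers.  This is a
sketch of an upper bound only: both prices are "up to" figures, and
the master server figure includes redundancy that a pilot would not
need.  The function name is hypothetical.

```python
SLAVE_COST_MAX = 5_000      # upper bound per slave server, in dollars
MASTER_COST_MAX = 15_000    # upper bound for the master, with redundancy


def hardware_ceiling(slaves):
    """Upper bound on initial hardware cost for a pool of `slaves` machines."""
    return slaves * SLAVE_COST_MAX + MASTER_COST_MAX


print(hardware_ceiling(15))   # 90000
print(hardware_ceiling(20))   # 115000
```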

In conclusion, the team recommends that we proceed to develop a pilot
longjobs service, with 15-20 execution servers.  We would invite
several representative faculty and courses to participate in the
pilot, along with certain individual users whose requirements would be
a good test for the system.