We are prepared to report back on the expressed needs for a longjobs service. We solicited and received feedback on a service description outline that we prepared, and feel that we now have enough data to recommend that we proceed to develop a project pilot.

To summarize, we outlined a service with the following features:

* Users could submit jobs from any Athena workstation. They could also monitor the job queues and status, and cancel an unwanted job. Job parameters and requirements (e.g. platform, time) would be supplied at submit time.
* A master server would schedule and dispatch jobs to slave execution servers using a first-in/first-out algorithm, modified for fairness.
* Execution machines would run only one job at a time.
* Jobs would be subject to a strict limit of elapsed time; the maximum time allowed would be on the order of a couple of days.
* Authentication and authorization would be based on the user's Kerberos principal.
* The service would support the use of forwardable, renewable Kerberos 5 tickets, which would be used to acquire Kerberos 4 tickets and AFS tokens. The system would handle renewal automatically. Users would also have the option of forwarding a non-renewable ticket, or of running the job with no ticket at all, possibly with server-based default tokens.
* All software which can be run non-interactively on a current Athena machine would be supported.
* Execution machines would be comparable to Athena cluster machines; all platforms that are part of the current Athena release would be represented in the machine pool.

[The questionnaire form is at: http://web.mit.edu/longjobs/doc/s-d-feedback.html
The service description is at: http://web.mit.edu/longjobs/doc/service-description.html]

The questionnaire asked how a longjobs service might be used, and to what extent a service such as we described would meet user needs.
We corresponded with faculty members, graduate students, and Stopit offenders; the responses we received were generally favorable toward the proposed service as described. Summing up the responses, we estimated a potential need for on the order of 10,000 job hours in a peak month. Since one execution machine could provide about 700 job hours in a month, that load would require roughly 14 machines running at full capacity; we thus estimate that an initial pool of 15-20 execution servers could handle the expected peak load.

Our description left an accounting/billing component as an open question. Given that (1) many feel that it is inappropriate to charge for a service meeting a basic academic computing need, and (2) we expect that the budget would cover the cost of running a service at the anticipated initial scale, we feel that the best approach to preventing overuse, at least initially, is to implement a group-based quota system. Under this scheme, users would receive a certain quota of job-hours based on membership in registered groups, e.g. course-based groups. Individual users could also register for the service, by justifying that they would use it for a valid academic purpose. Key features of this scheme include:

* Users could have quotas for multiple groups, including the general group for individual use; they would specify which quota to charge at job submission time.
* Professors could arrange to reserve a set of execution servers for a course-based group, similar to the way rooms are reserved now, for class assignments. These machines would be unavailable to serve other jobs for the duration of the reservation.
* Registrations and quotas would have an automatic expiration period, probably defaulting to one semester.
* The system would write accounting records for all jobs. Initially, these would be used only to help us monitor how well the service is performing.
* A billing component could be added later, if necessary, to recover the costs of meeting greater demand, though there would presumably always be quotas for some amount of free use for appropriate needs.

We also have several areas of concern:

* Use of the service seems likely to be characterized by peaks of heavy usage (e.g. for class assignments), along with periods of low usage. We need to have enough execution servers to provide reasonable service at peak times.
* A disproportionate share of the expressed need is to run jobs which only run on the SGI platform, though we are striving to reduce the share of SGI machines in Athena clusters. Our hope is that the migration path for the software in question will be clearer by the time we need to purchase the machines.
* Given the lack of a general billing infrastructure for handling academic computing services, adding a billing component would likely incur a significant cost in staff resources.

The master and slave server machines would be administered by Athena Server Operations. General registrations would be handled by Accounts, and Faculty Liaisons would administer the course-based reservations. General user support would be provided by OLC, course-related support by the FLs.

The software for the system would be based on the Portable Batch System, a freely-available network batch package.

We estimate that the following staff costs would be incurred:

* Dev: 6 FTE mos. initial development, plus 10% FTE for ongoing maintenance.
* Ops: 1-2 FTE mos. initial, 1/3 FTE ongoing.
* Support: 1 FTE month initial, 15-25% FTE ongoing.

We also estimate the following hardware costs:

* Up to $5000 per slave server. These would be on a 2-year renewal cycle; at least some of the retired servers could possibly remain in the pool, others could be used for other services.
* Up to $15000 for a master server; this includes the cost of some form of hardware-based redundancy, which would not be required for a pilot.
  One suitable master server would be able to support several times the number of slaves we are considering.

In conclusion, the team recommends that we proceed to develop a pilot longjobs service, with 15-20 execution servers. We would invite several representative faculty and courses to participate in the pilot, along with certain individual users whose requirements would be a good test for the system.
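
As a rough check on the recommended pool size, the arithmetic behind the 15-20 server figure can be sketched as follows. This is only an illustrative calculation, not part of the proposed system. The 10,000 and 700 job-hour figures and the per-machine cost ceilings are the estimates given in this report; the function names and the 25% headroom value are assumptions introduced here for illustration.

```python
import math

# Estimates taken from this report.
PEAK_JOB_HOURS_PER_MONTH = 10_000      # estimated peak monthly demand
JOB_HOURS_PER_SERVER_PER_MONTH = 700   # one execution machine, one job at a time
MAX_COST_PER_SLAVE = 5_000             # dollars, upper bound per slave server
MAX_COST_MASTER = 15_000               # dollars, upper bound, includes redundancy


def servers_needed(peak_job_hours, hours_per_server, headroom=0.0):
    """Smallest whole number of servers covering the peak load.

    `headroom` is a hypothetical safety margin (0.25 means 25% extra
    capacity); the report does not specify one.
    """
    return math.ceil(peak_job_hours * (1 + headroom) / hours_per_server)


def hardware_upper_bound(num_slaves):
    """Upper bound on hardware cost for one master plus `num_slaves` slaves."""
    return num_slaves * MAX_COST_PER_SLAVE + MAX_COST_MASTER


if __name__ == "__main__":
    bare_minimum = servers_needed(
        PEAK_JOB_HOURS_PER_MONTH, JOB_HOURS_PER_SERVER_PER_MONTH)
    with_margin = servers_needed(
        PEAK_JOB_HOURS_PER_MONTH, JOB_HOURS_PER_SERVER_PER_MONTH, headroom=0.25)
    print("servers at full utilization:", bare_minimum)
    print("servers with 25% headroom:  ", with_margin)
    for n in (15, 20):
        print(f"hardware upper bound for {n} slaves: ${hardware_upper_bound(n)}")
```

Under these assumptions the bare minimum falls at the low end of the recommended range, and a modest margin for uneven demand stays within it. The cost figure is a ceiling only, since a pilot would not need the redundant master configuration.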