Assumptions:
)Pool of machines for each CPU type will exceed 15 machines in size.
)Maintenance will be provided by existing staff.

Currently known, important design details of the service:
)Slave servers run the users' jobs; a single master server manages the queue and scheduling of users' jobs.
)Jobs may have the submitting user's kerberos tickets.
)A slave machine will run only one job at a time.
)No job data (except for log info) is maintained on any service machine after a job's completion.

Issues:

System Integrity

Part of the service involves providing slave machines in known states for users to run their jobs on. Therefore, steps must be taken to ensure that the slaves are indeed configured as expected. The slaves need special attention because they will be running arbitrary user code. Thus there must be measures in place to detect any system changes possibly wrought by the user code.

Possible solutions

1) Running cleanup/reactivate between jobs
2) Full scale security mods in the pattern of the dialups
3) Some compromise in between

In some sense, these machines are like cluster machines, except users won't know the root password - so one could argue that measures that are good enough for cluster machines are good enough for the slaves. However, cluster machines are trivially re-installed at the merest hint of user modification, and re-installation is not an acceptable fallback for server machines running special software.

Full scale dialup mods are likely an option, but some of these involve kernel mods and it would be unwise to assume that this is necessarily trivial - tho' it likely is. Another detail of #2 is that the actual security check is triggered by a separate machine - the master would be perfect for this, given the desire for a security check not to impact a running job.

Ops will likely own and maintain whatever solution is chosen. However, having the master coordinate an outside integrity check may require that the service support such an option to avoid interfering with a slave's running job.

-----------

Coping with and cleaning up after outages

In an ideal world, the following would happen:

1i) Dead machines can be replaced by hot spares or new machines that are set up from scratch.
2i) For a single slave outage, there is sufficient info recorded that any interrupted job can be requeued to be restarted on another slave (see the sketch following these lists).
3i) For a master outage, any affected queue can be recreated.
4i) For a service outage:
    * queue data is cached in multiple locations so that queue recreation, if necessary, is likely;
    * logging is sufficient to construct a list of users, if any, whose jobs were lost/interrupted.

In a pessimal world (i.e. the worst case that is acceptable), the following would happen:

1p) Dead machines can be replaced by spares or new machines set up from scratch, and any missing configuration data is restored from backup.
2p) For a single slave outage, there is sufficient info to note which user was affected - allowing the user to be notified.
3p) For a master outage involving the loss of queue state, there is likely to be sufficient info to construct a list of affected users. In the event that there is no such info, respond as a service outage.
4p) For a service outage, OLC and other support services are notified of the general failure so they may properly respond to the inevitable user questions.
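For 2i, 2p, and the logging half of 4i, the following is a minimal sketch of the kind of per-job record the master might keep. It is illustrative only: the field names, the one-JSON-object-per-line log format, and the user-settable "restartable" flag are assumptions made for the example, not decisions about the actual service.

    import json
    import time

    def job_record(job_id, user, submit_host, command, state,
                   slave=None, restartable=False):
        """One record per job state change; the master appends these to a
        log, one JSON object per line."""
        return {
            "job_id": job_id,
            "user": user,                 # submitting user
            "submit_host": submit_host,   # submitting machine
            "command": command,           # job command and its meta data
            "state": state,               # e.g. queued / running / finished
            "slave": slave,               # executing slave, once assigned
            "restartable": restartable,   # hypothetical user-settable flag
            "time": time.time(),
        }

    def append_record(log_path, record):
        with open(log_path, "a") as log:
            log.write(json.dumps(record) + "\n")

    def affected_users(log_path):
        """Users whose jobs were still queued or running according to the
        log, i.e. the jobs an outage would have lost or interrupted."""
        last = {}
        with open(log_path) as log:
            for line in log:
                rec = json.loads(line)
                last[rec["job_id"]] = rec
        return sorted({rec["user"] for rec in last.values()
                       if rec["state"] in ("queued", "running")})

Whether such a flag is honored, and where this log would have to live in order to survive a master outage, are among the design questions discussed below.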
Of course, designing a system that only meets the worst case requirements is justified only in the face of other conflicting issues - thus requiring some decision regarding the tradeoffs.

The difference between 1i and 1p, in ops' experience, tends to be due entirely to where the configuration data is maintained. If the configuration data is maintained in AFS and propagated to the servers, then 1i is usually achieved.

The difference between 2i and 2p appears to depend in part on how thorough the logging info is, and in part on whether users' jobs are expected to be restartable. The logging info we can control during the design phase of the system. However, restartability will depend in part on how the users write their jobs. For example, their jobs may change some state external to the slave, in AFS perhaps, and not deal gracefully with that altered state upon a restart. It may be technically possible to associate some user-definable flag with a job that ops could use to indicate whether we can restart said interrupted job. Still, 2p appears to be closer to the eventual solution than 2i.

The difference between 3i and 3p is *very* dependent on the design of a fallback system for maintaining queue state. The current service design has only one master in use by the service. The queue data will likely be very dynamic, and thus a recovery system must somehow address the issue of whether and how to keep from losing the queue state in the event of a master outage. One possible solution that would meet 3i appears to be maintaining a second, ancillary, master. This ancillary master would receive live updates to the queues managed by the real master but not actually manage any slaves. This ancillary master would be configured identically to the live master except for the details surrounding the actual queue processing. In the event of an outage of the real master, the configuration of the ancillary master would be changed, hopefully by a small, simple change, and placed into service replacing the real master. The outage to users would then appear as a short interruption in the ability to queue new jobs and a short delay in jobs starting on the slaves. One obvious detail that leaves some doubt as to the efficacy of this solution is the management of user authentication. (A minimal sketch of this live queue mirroring appears at the end of this section.)

As to the differences between 3 and 4, and thus 4i and 4p, the design of a single-master system appears to equate a master outage with a service outage. As a result, ops is likely to place strong emphasis on having a fallback design that approaches 3i/4i. A failure to approach 3i/4i sufficiently closely will likely force ops to purchase a system for the master that violates our long-standing model of cookie-cutter servers. This violation would be necessary to implement redundancy in the hardware of the system to compensate for the lack of redundancy in the service design. It should be pointed out that the likelihood of ops needing to make such a trade-off depends greatly on the value placed in the data for the queue state. It is currently assumed that the value will be high, but experience may prove otherwise.
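As a rough illustration of the ancillary master idea above, the sketch below mirrors each queue change from the real master to a second host that keeps a copy of the queue but never dispatches jobs. The host name, port, update format, and one-connection-per-update scheme are all invented for the example, and the sketch deliberately ignores the user-authentication question raised above.

    import json
    import socket

    # Assumed name and port for the ancillary master; purely illustrative.
    ANCILLARY = ("ancillary-master.example.edu", 7001)

    def mirror_update(update, addr=ANCILLARY):
        """Real master side: forward one queue change (submit, start,
        finish, cancel, ...) to the ancillary master as a JSON line."""
        with socket.create_connection(addr, timeout=5) as conn:
            conn.sendall((json.dumps(update) + "\n").encode())

    def run_ancillary(port=7001):
        """Ancillary side: apply each mirrored change to a local copy of
        the queue, but never dispatch any jobs to slaves."""
        queue = {}  # job_id -> last known record for queued/running jobs
        server = socket.create_server(("", port))
        while True:
            conn, _ = server.accept()
            with conn, conn.makefile() as stream:
                for line in stream:
                    update = json.loads(line)
                    if update["state"] in ("finished", "cancelled"):
                        queue.pop(update["job_id"], None)
                    else:
                        queue[update["job_id"]] = update

The point of the sketch is only that keeping the ancillary master's copy of the queue current need not require much mechanism; authenticating the updates and switching the ancillary master into service are the harder parts.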
---------

Long term maintenance

Many of the resolutions for coping with outages also affect the ease of maintenance. Maintenance issues that are separate from coping with outages include:

1) adding and removing individual machines from service
1a) forcibly rebooting a machine
2) upgrading server OS or hardware
3) upgrading service software

These issues generally depend on each other. For example, it's rather difficult to upgrade a server's hardware w/o removing it from service. Also, a new machine may require a new OS, on which the old service software may not compile, thus requiring a new version of the service software. Still, isolating and reducing these interdependencies is desirable.

In any case, upgrades to hardware (which fall under #1) and software (#2, #3) need to be testable before inflicting them upon the unwary general public - ops will need a test configuration. Tasks #1 and #1a are likely to be very common given the number of machines involved, and thus they need to be as simple as possible. Tasks #2 and #3 are expected to happen in a flurry of activity, once yearly, with patches to be applied throughout the year.

Difficulty in attaining easy maintenance is expected on three fronts:

1) The work necessary to install machines and reconfigure the system to bring new machines into service.
2) The work necessary to coordinate/cope with differences between running machines as they are upgraded, as doing so en masse is infeasible.
3) Determining the results of service outages/downtime and informing users thereof.

If maintenance tasks must result in service configuration changes (e.g. the service needs a config file listing which machines do what), then it is better if the config file needs to live in only one place (e.g., the master). The choice of solution for the problem of system integrity will very likely impact the ease of adding and removing machines from service and of doing upgrades. The trade-offs need to be addressed. Option #2 in the integrity section will likely be untenable from the standpoint of easy maintenance.

Possible Solutions

These very much depend on design decisions for the service, some of which may need to be made anew each year as new platforms come into use for Athena. Whatever the solution, it would preferably include:

1) Configuration details concisely located in few locations.
2) Interdependence of servers is minimized.
3) Logging is extensive and easily filtered/processed.
4) Backward compatibility of any new service software with earlier versions. Also helpful, though perhaps not as strongly desired, is forward compatibility.
5) A small test configuration, open to the willing and wary users of the production service.
6) The slaves fit ops' cookie-cutter model and, given the likely number of machines, are rack-mountable.

---------

System logging

As mentioned in the maintenance section, system logging is required. General rules of thumb:

1) The more that's logged, the better.
2) Demand/usage of the service needs to be measurable.
3) Abuses should be noticeable and trackable. Abuse is best defined in some clear-cut fashion. For example, on the dialups, significantly impairing use of the system for others is considered abuse.

Possible "Solutions"

It is currently expected that logging info will include:
)user name
)submitting machine
)job command and associated meta data for running the job
)job submit time
)job run start time
)job run end time
)executing slave
)regular snapshots of queues' lengths and slaves in use (see the sketch below)

More info may be deemed necessary as service experience is gained. Also, a definition of service abuse is needed. Defining how the usual Athena "Rules of Use" apply to this service is likely to be sufficient.
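To make the "regular snapshots of queues' lengths and slaves in use" item concrete, here is a minimal sketch of what one such snapshot record might contain; the queue names, slave names, and JSON format are invented for the example (per-job fields were sketched earlier, in the outages section). Snapshots of this sort are what would make rule of thumb 2, measurable demand/usage, possible.

    import json
    import time

    def queue_snapshot(queues, busy_slaves, idle_slaves):
        """One periodic snapshot of queue lengths and slave usage."""
        return {
            "time": time.time(),
            "queue_lengths": {name: len(jobs) for name, jobs in queues.items()},
            "slaves_busy": sorted(busy_slaves),
            "slaves_idle": sorted(idle_slaves),
        }

    # Example snapshot: two per-CPU-type queues and three slaves.
    print(json.dumps(queue_snapshot(
        {"sparc": ["job12", "job15"], "linux": ["job13"]},
        busy_slaves={"slave01", "slave02"},
        idle_slaves={"slave03"})))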
---------

Pool size

The demand for the service is expected to fluctuate considerably, most likely peaking near the end of each semester. There need to be enough machines in the pool to reasonably handle those peaks. However "reasonable" is defined in this context, the issues for ops are:

1) What is the maximum number of machines ops can support?
2) How much lead time is required to insert new slaves into the service pool?
3) How much lead time is available for predicting/estimating the likely size of the peaks?
4) On what time scale is the size of the slave pool considered dynamic?

The easy answer for ops would be roughly, "Make sure we have enough machines so that we don't have to worry about the peaks." However, we have to recognize in this document that various constraints may prevent us from simply "throwing hardware at the problem."

In any case, various constraints (e.g. delivery and set-up time, possibly more power and networking, etc.) conspire to prevent ops from making significant increases in the pool size in a time frame of less than 2 months (#2). This is the time from realization to implementation. So whatever method is found for managing the pool size must recognize this fact.

On the flip side, the answer to #3 likely reaches a maximum of two months, for the following reasons:

a) Service demand will likely change from one semester to the next as professors change their course requirements.
b) Ops will be wary of changing the slave pool size after drop-date.
c) It is this author's understanding, from the experience of the faculty liaisons, that professors tend to wait until the last minute before expressing a service need.

In this context, then, the *best* ops can hope for is learning of the extent of course-related service demand as the semester begins. Assuming that the course-related service demand is a reasonable indicator of campus-wide service demand, this gives a lead time of about two months between the start of a semester and drop-date.

As a result, the answer to #4 is on the order of two months, given the current constraints from #2 and #3. The answer to #1 is, for now, unknown due to the impending change to a new machine room.

Possible Solution

Whatever method is eventually used for managing the pool size, it is likely to include the following:

1) As each semester begins, ops and the FLs assess the demand expressed by professors, based on past experience, to estimate the pool size anticipated as necessary for crunch time.
2) Ops notifies the FLs of how well the anticipated need can be met.
3) As the semester draws to a close, ops and the FLs assess how accurately the need was predicted, thus refining whatever technique is used for step 1.

It is likely that the cost of estimating low will be incurred as frustrated users and badgered FLs. Ops therefore will want to estimate high, if allowed by budget, space, and other resource constraints. (A small numerical sketch of the estimate in step 1 follows.)
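As a purely numerical illustration of step 1 and of the bias toward estimating high, the sketch below turns per-course demand estimates into an anticipated pool size. Every course name, number, fraction, and margin in it is invented; the real inputs would be the FLs' and ops' actual assessments. Since a slave runs only one job at a time, peak concurrent jobs is treated as the number of slaves needed.

    import math

    # Per-course estimates of concurrent jobs expected at crunch time,
    # as gathered from professors via the FLs.  All numbers are invented.
    course_demand = {"course A": 10, "course B": 6, "course C": 4}

    # One crude way to treat course demand as an "indicator" of campus-wide
    # demand is to assume it is a fixed fraction of the total; the fraction
    # and the safety margin are assumptions, not measured values.
    course_fraction = 0.5
    safety_margin = 1.25   # bias toward estimating high, not low

    estimated_peak_jobs = sum(course_demand.values()) / course_fraction
    pool_size = math.ceil(estimated_peak_jobs * safety_margin)
    print("anticipated slave pool size for crunch time:", pool_size)  # 50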