Assumptions:
)Pool of machines for each CPU type will exceed 15 machines in size.
)Maintenance will be provided by existing staff.

Currently known, important design details of the service:
)Slave servers run the users' jobs; a single master server manages the queue and scheduling of users' jobs.
)Jobs may have the submitting user's kerberos tickets.
)A slave machine will run only one job at a time.
)No job data (except for log info) is maintained on any service machine after a job's completion.

Issues:

System Integrity

Part of the service involves providing slave machines in known states for users to run their jobs on. Therefore, steps must be taken to ensure that the slaves are indeed configured as expected. The slaves need special attention because they will be running arbitrary user code. Thus there must be measures in place to detect any system changes possibly wrought by the user code.

Possible solutions

1) Running cleanup/reactivate between jobs
2) Full scale security mods in the pattern of the dialups
3) Some compromise in between

In some sense, these machines are like cluster machines, except users won't know the root password - so one could argue that measures that are good enough for cluster machines are good enough for the slaves. However, cluster machines are trivially re-installed at the merest hint of user modification, and re-installation is not an acceptable fallback for server machines running special software.

Full scale dialup mods are likely an option, but some of these involve kernel mods and it would be unwise to assume that this is necessarily trivial - tho' it likely is. Another detail of #2 is that the actual security check is triggered by a separate machine - the master would be perfect for this, given the desire for a security check not to impact a running job.

Ops will likely own and maintain whatever solution is chosen. However, having the master coordinate an outside integrity check may require that the service support such an option to avoid interfering with a slave's running job.

-----------

Coping with and cleaning up after outages

In an ideal world, the following would happen:

1i) Dead machines can be replaced by hot spares or new machines that are set up from scratch.
2i) For a single slave outage, there is sufficient info recorded that any interrupted job can be requeued to be restarted on another slave (see the sketch following these lists).
3i) For a master outage, any affected queue can be recreated.
4i) For a service outage:
    * queue data is cached in multiple locations so that queue recreation, if necessary, is likely;
    * logging is sufficient to construct a list of users, if any, whose jobs were lost/interrupted.

In a pessimal world (i.e. the worst case that is acceptable), the following would happen:

1p) Dead machines can be replaced by spares or new machines set up from scratch, and any missing configuration data is restored from backup.
2p) For a single slave outage, there is sufficient info to note which user was affected - allowing the user to be notified.
3p) For a master outage involving the loss of queue state, there is likely to be sufficient info to construct a list of affected users. In the event that there is no such info, respond as a service outage.
4p) For a service outage, OLC and other support services are notified of the general failure so they may properly respond to the inevitable user questions.
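For 2i, 2p, and the logging half of 4i, the following is a minimal sketch of the kind of per-job record the master might keep. It is illustrative only: the field names, the one-JSON-object-per-line log format, and the user-settable "restartable" flag are assumptions made for the example, not decisions about the actual service.

    import json
    import time

    def job_record(job_id, user, submit_host, command, state,
                   slave=None, restartable=False):
        """One record per job state change; the master appends these to a
        log, one JSON object per line."""
        return {
            "job_id": job_id,
            "user": user,                 # submitting user
            "submit_host": submit_host,   # submitting machine
            "command": command,           # job command and its meta data
            "state": state,               # e.g. queued / running / finished
            "slave": slave,               # executing slave, once assigned
            "restartable": restartable,   # hypothetical user-settable flag
            "time": time.time(),
        }

    def append_record(log_path, record):
        with open(log_path, "a") as log:
            log.write(json.dumps(record) + "\n")

    def affected_users(log_path):
        """Users whose jobs were still queued or running according to the
        log, i.e. the jobs an outage would have lost or interrupted."""
        last = {}
        with open(log_path) as log:
            for line in log:
                rec = json.loads(line)
                last[rec["job_id"]] = rec
        return sorted({rec["user"] for rec in last.values()
                       if rec["state"] in ("queued", "running")})

Whether such a flag is honored, and where this log would have to live in order to survive a master outage, are among the design questions discussed below.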
Of course, designing a system that only meets the worst case requirements is justified only in the face of other conflicting issues - thus requiring some decision regarding the tradeoffs.

The difference between 1i and 1p, in ops' experience, tends to be due entirely to where the configuration data is maintained. If the configuration data is maintained in AFS and propagated to the servers, then 1i is usually achieved.

The difference between 2i and 2p appears to depend in part on how thorough the logging info is, and in part on whether users' jobs are expected to be restartable. The logging info we can control during the design phase of the system. However, restartability will depend in part on how the users write their jobs. For example, their jobs may change some state external to the slave, in AFS perhaps, and not deal gracefully with that altered state upon a restart. It may be technically possible to associate some user-definable flag with a job that ops could use to indicate whether we can restart said interrupted job. Still, 2p appears to be closer to the eventual solution than 2i.

The difference between 3i and 3p is *very* dependent on the design of a fallback system for maintaining queue state. The current service design has only one master in use by the service. The queue data will likely be very dynamic, and thus a recovery system must somehow address the issue of whether and how to keep from losing the queue state in the event of a master outage. One possible solution that would meet 3i appears to be maintaining a second, ancillary, master. This ancillary master would receive live updates to the queues managed by the real master but not actually manage any slaves. This ancillary master would be configured identically to the live master except for the details surrounding the actual queue processing. In the event of an outage of the real master, the configuration of the ancillary master would be changed, hopefully by a small, simple change, and placed into service replacing the real master. The outage to users would then appear as a short interruption in the ability to queue new jobs and a short delay in jobs starting on the slaves. One obvious detail that leaves some doubt as to the efficacy of this solution is the management of user authentication. (A minimal sketch of this live queue mirroring appears at the end of this section.)

As to the differences between 3 and 4, and thus 4i and 4p, the design of a single-master system appears to equate a master outage with a service outage. As a result, ops is likely to place strong emphasis on having a fallback design that approaches 3i/4i. A failure to approach 3i/4i sufficiently closely will likely force ops to purchase a system for the master that violates our long-standing model of cookie-cutter servers. This violation would be necessary to implement redundancy in the hardware of the system to compensate for the lack of redundancy in the service design. It should be pointed out that the likelihood of ops needing to make such a trade-off depends greatly on the value placed in the data for the queue state. It is currently assumed that the value will be high, but experience may prove otherwise.
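As a rough illustration of the ancillary master idea above, the sketch below mirrors each queue change from the real master to a second host that keeps a copy of the queue but never dispatches jobs. The host name, port, update format, and one-connection-per-update scheme are all invented for the example, and the sketch deliberately ignores the user-authentication question raised above.

    import json
    import socket

    # Assumed name and port for the ancillary master; purely illustrative.
    ANCILLARY = ("ancillary-master.example.edu", 7001)

    def mirror_update(update, addr=ANCILLARY):
        """Real master side: forward one queue change (submit, start,
        finish, cancel, ...) to the ancillary master as a JSON line."""
        with socket.create_connection(addr, timeout=5) as conn:
            conn.sendall((json.dumps(update) + "\n").encode())

    def run_ancillary(port=7001):
        """Ancillary side: apply each mirrored change to a local copy of
        the queue, but never dispatch any jobs to slaves."""
        queue = {}  # job_id -> last known record for queued/running jobs
        server = socket.create_server(("", port))
        while True:
            conn, _ = server.accept()
            with conn, conn.makefile() as stream:
                for line in stream:
                    update = json.loads(line)
                    if update["state"] in ("finished", "cancelled"):
                        queue.pop(update["job_id"], None)
                    else:
                        queue[update["job_id"]] = update

The point of the sketch is only that keeping the ancillary master's copy of the queue current need not require much mechanism; authenticating the updates and switching the ancillary master into service are the harder parts.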
---------

Long term maintenance

Many of the resolutions for coping with outages also affect the ease of maintenance. Maintenance issues that are separate from coping with outages include:

1) adding and removing individual machines from service
1a) forcibly rebooting a machine
2) upgrading server OS or hardware
3) upgrading service software

These issues generally depend on each other. For example, it's rather difficult to upgrade a server's hardware w/o removing it from service. Also, a new machine may require a new OS, on which the old service software may not compile, thus requiring a new version of the service software. Still, isolating and reducing these interdependencies is desirable.

In any case, upgrades to hardware (which fall under #1) and software (#2, #3) need to be testable before inflicting them upon the unwary general public - ops will need a test configuration. Tasks #1 and #1a are likely to be very common given the number of machines involved, and thus they need to be as simple as possible. Tasks #2 and #3 are expected to happen in a flurry of activity, once yearly, with patches to be applied throughout the year.

Difficulty in attaining easy maintenance is expected on three fronts:

1) The work necessary to install machines and reconfigure the system to bring new machines into service.
2) The work necessary to coordinate/cope with differences between running machines as they are upgraded, as doing so en masse is infeasible.
3) Determining the results of service outages/downtime and informing users thereof.

If maintenance tasks must result in service configuration changes (e.g. the service needs a config file listing which machines do what), then it is better if the config file needs to live in only one place (e.g., the master). The choice of solution for the problem of system integrity will very likely impact the ease of adding and removing machines from service and of doing upgrades. The trade-offs need to be addressed. Option #2 in the integrity section will likely be untenable from the standpoint of easy maintenance.

Possible Solutions

These very much depend on design decisions for the service, some of which may need to be made anew each year as new platforms come into use for Athena. Whatever the solution, it would preferably include:

1) Configuration details concisely located in few locations.
2) Interdependence of servers is minimized.
3) Logging is extensive and easily filtered/processed.
4) Backward compatibility of any new service software with earlier versions. Also helpful, though perhaps not as strongly desired, is forward compatibility.
5) A small test configuration, open to the willing and wary users of the production service.
6) The slaves fit ops' cookie-cutter model and, given the likely number of machines, are rack-mountable.

---------

System logging

As mentioned in the maintenance section, system logging is required. General rules of thumb:

1) The more that's logged, the better.
2) Demand/usage of the service needs to be measurable.
3) Abuses should be noticeable and trackable. Abuse is best defined in some clear-cut fashion. For example, on the dialups, significantly impairing use of the system for others is considered abuse.

Possible "Solutions"

It is currently expected that logging info will include:
)user name
)submitting machine
)job command and associated meta data for running the job
)job submit time
)job run start time
)job run end time
)executing slave
)regular snapshots of queues' lengths and slaves in use (see the sketch below)

More info may be deemed necessary as service experience is gained. Also, a definition of service abuse is needed. Defining how the usual Athena "Rules of Use" apply to this service is likely to be sufficient.
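To make the "regular snapshots of queues' lengths and slaves in use" item concrete, here is a minimal sketch of what one such snapshot record might contain; the queue names, slave names, and JSON format are invented for the example (per-job fields were sketched earlier, in the outages section). Snapshots of this sort are what would make rule of thumb 2, measurable demand/usage, possible.

    import json
    import time

    def queue_snapshot(queues, busy_slaves, idle_slaves):
        """One periodic snapshot of queue lengths and slave usage."""
        return {
            "time": time.time(),
            "queue_lengths": {name: len(jobs) for name, jobs in queues.items()},
            "slaves_busy": sorted(busy_slaves),
            "slaves_idle": sorted(idle_slaves),
        }

    # Example snapshot: two per-CPU-type queues and three slaves.
    print(json.dumps(queue_snapshot(
        {"sparc": ["job12", "job15"], "linux": ["job13"]},
        busy_slaves={"slave01", "slave02"},
        idle_slaves={"slave03"})))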
---------

Pool size

The demand for the service is expected to fluctuate considerably, most likely peaking near the end of each semester. There need to be enough machines in the pool to reasonably handle those peaks. However "reasonable" is defined in this context, the issues for ops are:

1) What is the maximum number of machines ops can support?
2) How much lead time is required to insert new slaves into the service pool?
3) How much lead time is available for predicting/estimating the likely size of the peaks?
4) On what time scale is the size of the slave pool considered dynamic?

The easy answer for ops would be roughly, "Make sure we have enough machines so that we don't have to worry about the peaks." However, we have to recognize in this document that various constraints may prevent us from simply "throwing hardware at the problem."

In any case, various constraints (e.g. delivery and set-up time, possibly more power and networking, etc.) conspire to prevent ops from making significant increases in the pool size in a time frame of less than 2 months (#2). This is the time from realization to implementation. So whatever method is found for managing the pool size must recognize this fact.

On the flip side, the answer to #3 likely reaches a maximum of two months, for the following reasons:

a) Service demand will likely change from one semester to the next as professors change their course requirements.
b) Ops will be wary of changing the slave pool size after drop-date.
c) It is this author's understanding, from the experience of the faculty liaisons, that professors tend to wait until the last minute before expressing a service need.

In this context, then, the *best* ops can hope for is learning of the extent of course-related service demand as the semester begins. Assuming that the course-related service demand is a reasonable indicator of campus-wide service demand, this gives a lead time of about two months between the start of a semester and drop-date.

As a result, the answer to #4 is on the order of two months, given the current constraints from #2 and #3. The answer to #1 is, for now, unknown due to the impending change to a new machine room.

Possible Solution

Whatever method is eventually used for managing the pool size, it is likely to include the following:

1) As each semester begins, ops and the FLs assess the demand expressed by professors, based on past experience, to estimate the pool size anticipated as necessary for crunch time.
2) Ops notifies the FLs of how well the anticipated need can be met.
3) As the semester draws to a close, ops and the FLs assess how accurately the need was predicted, thus refining whatever technique is used for step 1.

It is likely that the cost of estimating low will be incurred as frustrated users and badgered FLs. Ops therefore will want to estimate high, if allowed by budget, space, and other resource constraints. (A small numerical sketch of the estimate in step 1 follows.)
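As a purely numerical illustration of step 1 and of the bias toward estimating high, the sketch below turns per-course demand estimates into an anticipated pool size. Every course name, number, fraction, and margin in it is invented; the real inputs would be the FLs' and ops' actual assessments. Since a slave runs only one job at a time, peak concurrent jobs is treated as the number of slaves needed.

    import math

    # Per-course estimates of concurrent jobs expected at crunch time,
    # as gathered from professors via the FLs.  All numbers are invented.
    course_demand = {"course A": 10, "course B": 6, "course C": 4}

    # One crude way to treat course demand as an "indicator" of campus-wide
    # demand is to assume it is a fixed fraction of the total; the fraction
    # and the safety margin are assumptions, not measured values.
    course_fraction = 0.5
    safety_margin = 1.25   # bias toward estimating high, not low

    estimated_peak_jobs = sum(course_demand.values()) / course_fraction
    pool_size = math.ceil(estimated_peak_jobs * safety_margin)
    print("anticipated slave pool size for crunch time:", pool_size)  # 50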