Notes on the items identified as potentially needing further
development work, including lists of some of the issues, and
guesstimates as to the amount of work involved.

* Don't allow group membership to be inferred
  - make owner name (and job name?) protected attributes (change qstat)
  - hack select code to prevent selecting by account or user name
  2-3 days dev

* Automatic temp disk space creation/clean-up
  - make separate AFS cell where master has admin access
  - master would track cell usage
  - add space resource that could be specified in qsub
    - job would wait in queue until space could be created
  - master creates volume on suitable partition, creates user, makes
    mount point, adds user to ACL
  - AFS path must be passed to job
    - env. variable?
  - at EOJ, master must mark volume to be deleted somehow
    - file with EOJ timestamp in top level directory?
  - scheduler issues
    - resource not tied to node: race problem?
  - periodic master task would nuke volumes after some grace period
    following EOJ
  - what about input side? (much less likely case, and harder to solve)
  1-3 months dev, + ops maintenance (likely absorbable), + machine cost

* Upgrade to OpenPBS 2.3 or PBS Pro
  - resource monitor and ping queries between master and MOM now
    UDP-based by default
    - configure option can be used for TCP
    - done mostly to keep a down node from hanging things; we avoid
      this by timing out on the connect
    - these connections are not now Kerberized, so need some work anyway
  - we already updated to support later Solaris revisions, and
    incorporated relevant bug fixes
  - PBS Pro has useful features:
    - advance reservations
    - requeue or delete jobs running on dead nodes
    - can associate node(s) with queue
    - new node attributes, resources
    - improved pbsnodes command
  ~2 weeks to upgrade to OpenPBS 2.3, more for PBS Pro (latter not
  including time to incorporate wanted scheduler features)
  + $1000/yr support for PBS Pro

* Better scheduler
  - Reservations
    - Maui and PBS Pro have some sort of reservation support
  - node-based queue limits (e.g. aggregate)
    - PBS Pro may support this
    - maybe use resources_available queue attribute somehow
  - predictable execution time
    - might be feasible to change the current scheduler to transmit
      the job sort order to the server, and have the latter calculate
      the estimated time
    - this would be problematic if the "one job on one node" model
      were to change (scheduler backfill, etc.)
  - ability to query
    - Maui has this, though not Kerberized
  ~2-3 months to replace the scheduler - needs lots of investigation,
  testing

* Replace quota DB with Oracle
  - probably don't bother for anticipated scale of service
  1-2 months

* Quota system tweaks
  - Better definition of how to deal with a member's record when the
    containing group's record is modified
  1 week

* Default tokens
  - IP host-based group that all slaves would be in
    - < 1 week
  - User-based group for which master has admin rights (add slave to
    group for duration of job)
    - more overhead and support costs (esp. with respect to clean-up
      of groups)
    - groups either added to Moira, or directly to AFS protection
      database
    - more secure than first option
  guess 2 weeks dev, but need to sort out support issues first

* Encrypt all network requests
  - currently only encrypts data such as credentials, files
  - priority would be to encrypt all user/master communications
  2 weeks

* Clean up "RPP" messages (inter-server status, UDP-based)
  - server uses this to ping nodes, note which are down
  - options include removing this entirely, or Kerberizing
  1-3 weeks

* General code clean-up
  - inspect for short-cuts taken in prototype (important)
    - 1-2 weeks
  - make suitable for incorporation
    - nice long-term goal, not urgent
    - distinguish between Athena-specific (e.g. quota system, login
      system) and generally useful features (Kerberos, AFS support)
    - big effort, may not be worth the cost at this point
    - OpenPBS future is uncertain

* Command namespace
  - wrappers for all PBS commands?
  - rename commands?
  1 week

* Linux slaves
  - qualify and purchase machines
  - port ops server modifications
  - PBS (with local mods) already ported
  ~1 week to qualify machine, 1-3 weeks for ops port

* User documentation
  - review by other Support staff, reorganization and polishing by TPS
    for wider use
  guess 3 weeks, collective

* Internal support documentation
  - collect basic admin, diagnostic commands in one place
  - document policies for non-course use, particularly for:
    - determining eligibility of users, mapping to account types
    - initial quota, guidelines for increase requests
    - requests for additional disk space
    - quota and capacity calculations
    - what to do with requests beyond current resources or outside
      allowed uses (in particular, to avoid bouncing stopit cases back
      to clusters)
  2 weeks for doc, but many policy decisions need to be made first
  see /mit/longjobs/doc/support-issues for more details
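
Sketch for the "Automatic temp disk space creation/clean-up" item: the
periodic master task that removes volumes after a grace period following
EOJ might decide what to reclaim roughly as below. This is only an
illustration of the grace-period check. The TempVolume record, the
volumes_to_delete function, the volume names and the one-day grace
period are all hypothetical, and the actual AFS volume removal and the
way the EOJ timestamp is stored (e.g. a file in the top level directory)
are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class TempVolume:
    """Hypothetical record for one per-job temporary volume."""
    name: str
    # EOJ time in seconds since the epoch; None while the job has not ended.
    eoj_timestamp: Optional[float]


def volumes_to_delete(volumes: Iterable[TempVolume],
                      now: float,
                      grace_seconds: float) -> List[str]:
    """Return names of volumes whose grace period after EOJ has expired."""
    expired = []
    for vol in volumes:
        if vol.eoj_timestamp is None:
            # No EOJ mark yet: the job may still be using the space.
            continue
        if now - vol.eoj_timestamp >= grace_seconds:
            expired.append(vol.name)
    return expired


if __name__ == "__main__":
    # Assumed example data: one finished long ago, one running, one recent.
    vols = [
        TempVolume("tmp.job1", 1000.0),
        TempVolume("tmp.job2", None),
        TempVolume("tmp.job3", 90000.0),
    ]
    print(volumes_to_delete(vols, now=100000.0, grace_seconds=86400.0))
```

Keeping the decision separate from the deletion itself would let the
master log or dry-run the list before anything is actually removed.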