Ken's (not-so-)brief guide to the nightly testing setup.

----
The future:

I've written notes elsewhere (multiple times?) on ways we might improve this, if we had the time and interest to invest in it. Currently it's very clunky, with data collection driven by me remembering to run a script each day (or every few days) while holding my Kerberos tickets *and* my ssh key. (Some machines run Kerberos rlogind, some run sshd; neither is/was set up on all of them.)

Something set up under a web server, with accounts for the (possibly non-MIT) test machines and backed by a database for storing the actual info, would make a lot more sense. Generate the web pages dynamically by pulling data out of the database -- which, frankly, would probably happen less often than updates to the database -- and much of the collection-time work is gone. We'd probably want to augment it with some kind of account-management hooks so the account registration wouldn't be 100% manual, and so you could attach data like an email address for the responsible contact person, etc.

Other packages like "buildbot" may be worth looking into as well. My impression of buildbot is that it's probably good for giving a simple status report and running centrally-driven tests, and doing them continually as changes are made; maybe not so good for letting outside contributors drop their own machines of whatever configuration into the test setup. (Ours isn't great for that either, but in a new system we might want to improve on it.)

But enough about that; back to the *current* setup....

----
Running the nightly builds and tests:

The test machines run the "cron-script" script, via cron, obviously. Generally this is done in the non-privileged "krbsnap" account. Some systems run it from AFS; others have local copies, especially if the OS doesn't support AFS well, so updates to the scripts have to be propagated manually.

It should kick off after the nightly snapshot is done. There's no hook into the snapshot mechanism here, but assuming 4AM or later US/Eastern time is generally safe, if the nightly snapshot works at all. I usually spread the start times out a little so they don't all slam the ftp server at once, not that it would be a terribly high load anyway. If they're not started too late, most of the builds will be done before I'm likely to collect the results. However, doing {shared,static}x{32-bit,64-bit}x{release,trunk} on a slow machine will take a long time, whatever you do. And because of the use of fixed port numbers in the test suite, and the tendency to just kill off all processes under the current uid that look like they might have been servers started up for testing, you can't do separate runs in parallel without some redesign. (Build+check in parallel, sure; check+check, no.)

The cron script decides, based on the machine name, what combinations of options to test -- gcc vs. native cc, 32-bit vs. 64-bit, trunk vs. release branch, that sort of thing. It used to include shared vs. static libraries, but we don't support static builds any more. The script also tweaks $PATH if needed, and knows for each host how to add a subject line to mail generated off the command line. (The generated email goes to krbdev-auto@mit.edu, a public list.) For each set of options -- selected by setting certain environment variables -- the "run-build" script is invoked once.
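Very roughly, the dispatch looks something like the following -- a simplified sketch only, not the actual cron-script; the option-set names, variable names, and per-host choices here are all made up for illustration:

    #!/bin/sh
    # Hypothetical sketch of cron-script's per-host dispatch.  The real script
    # differs; the point is just the shape: pick option sets by host name,
    # export them, and run "run-build" once per combination.
    host=`hostname | sed 's/\..*//'`
    case $host in
      dcl)        configs="cc-32 cc-64 gcc-32" ;;   # native cc and gcc, 32/64-bit
      salamander) configs="gcc-64" ;;
      *)          configs="gcc-32" ;;
    esac
    for cfg in $configs; do
      for branch in trunk release; do
        # CONFIG and BRANCH stand in for whatever variables run-build really
        # looks at; output goes to the krbdev-auto list with a subject line
        # (exactly how the Subject: gets added varies by host).
        CONFIG=$cfg BRANCH=$branch ./run-build 2>&1 |
          mail -s "nightly build: $host $cfg $branch" krbdev-auto@mit.edu
      done
    done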
The run-build script adjusts the path, figures out where TCL is, stuff like that, writes various bits of info (compiler, locations and versions of various programs, OS version, environment variables, build options) to its log file, and then executes several phases -- download, unpack, reconf, configure, build, install, check. It disables the use of /dev/random before compiling, because on some mostly-idle test systems that can lead to long delays that cause tests to fail or just not run to completion. On some systems this script also turns on malloc debugging options or other local bits of magic that might help catch bugs. Some builds are done in parallel (GNU make "-j" option); others are not.

The release branch snapshot to test is identified in one of the scripts; currently it's still set to 1.6. It might be possible to specify that both the 1.6 and 1.7 branches should be tested, as well as the trunk; I haven't tried.

The run-build script sets PORTBASE, which alters the service port numbers used by the DejaGnu tests for running the KDC and admin server; this should reduce conflicts with other test invocations that might happen to be running around the same time under other uids. If Zephyr is available on the local machine, it also reports its completion status on class krb5, instance autobuild; there may be one or two messages going to me personally. Reporting problems via xmpp to the krb5 chat room might get them more attention. Maybe.

Most of the machines in the nightly testing setup are MIT's, many deactivated and some picked up from the reuse list. Some are on the bottom shelf of the test cluster; others are in offices. One or two are mine, and I'll have to review the list to see which ones will be going away. That won't mess things up too much; there'll just be an ssh timeout or several, and empty spots in the summary table.

It would be good to ensure that at least one NetBSD platform stays active in the list, especially if tests of multithreaded use get added, since NetBSD has one of the pickier thread implementations around, whereas Linux and others sometimes let you get away with doing broken things. Similarly, at least one tested architecture should have strict memory access alignment restrictions, to catch another class of bugs; x86 platforms will still "work" with misaligned accesses, just maybe a bit slower than if the code were actually correct.

The current list of test machines from which we try to collect results is in get-logs.mk:

  dcl            (Athena Solaris)
  all-in-one     (Athena Linux)
  rsts-11        (Tru64 5.1A, offline)
  gamma-11       (Mac, offline)
  opteron-prime  (Debian Linux, W92 machine room)
  salamander     (sparc64 NetBSD, test cluster)

Others not currently online or not currently in use:

  sfdf            (ia64?, Garry's office?)
  capacitor-bank  (SGI, Tom's office)
  rsx-11          (ppc-aix4.3.3, possibly still in Thomas's office, broken disk?)

The test runs on dcl will often fail if someone isn't logged in, because the "reactivate" process causes the locker with the Sun compiler in it to disappear while the build is running. There's probably some way to hack around that with "screen" or something; I just haven't bothered. (UPDATE: I've just hacked the crontab script to restrict the hours during which it'll reactivate, so staying logged in shouldn't be necessary, but if there's another Athena OS update for Solaris it'll likely get clobbered.) Also, dcl uses a second cron job, run just before the regular nightly build script, to acquire tickets so it can access the sunsoft compiler.
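That ticket-acquiring job only needs to do a couple of things; a minimal sketch, assuming a keytab for the krbsnap principal (the keytab path and principal name are invented -- check dcl's actual crontab for the real thing):

    #!/bin/sh
    # Hypothetical pre-build cron job for dcl: get Kerberos tickets for the
    # krbsnap account, then AFS tokens, so the Sun compiler locker is
    # attachable during the build.  Keytab path and principal are made up.
    kinit -k -t /var/local/krbsnap.keytab krbsnap/dcl.mit.edu || exit 1
    aklog
    # Run from krbsnap's crontab a few minutes before cron-script, e.g.:
    #   45 3 * * * /var/local/get-build-tickets.sh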
----
Collecting results:

The "update" script pulls in the various log files from the build systems via rsh or ssh to the invoker's own account on the test machine (not the test account), and unpacks them into a local tree under /var/raeburn/nightly/tmp. It fetches them via an invocation of GNU make, so the fetches can run in parallel, which might be a bit gratuitous. The script "grab-logs" knows how to fetch the logs from one test run, and has hard-coded knowledge of which remote-shell program to use for which remote host.

The trees are then flattened out into a single directory each, with altered file names, so "kadmin/passwd/unit-test/kpasswd.sum" would become "kadmin!passwd!unit-test!kpasswd.sum". The log files (everything but the smallest status files) are compressed and given a ".txtgz" suffix, which in the web server configuration indicates that the file should be described as text and transferred compressed if possible; the receiving web browser just sees the uncompressed text file. Even compressed, some of these files can be over a megabyte, but the dejagnu output is highly repetitive and compresses very well. Without compression, the log files use up a huge amount of space, and very quickly.

The files are then copied over to krbdev.mit.edu, and the "make-web-pages.pl" script is run over them. This examines the status files and (re)generates a collection of web pages, giving an overall status summary, per-platform reports, and per-build listings of all available log files. The scripts that collect log data need to know what configurations are tested and where the files live, so they have to be kept in sync with the scripts actually running the tests.

The code in util.pl has hardcoded paths for the location of the data to process -- /var/www/nightly if you're on krbdev.mit.edu, or /var/raeburn/nightly/tmp on any other system. Obviously that'll need to be fixed. For a while I was generating the HTML on my desktop machine and then pushing the pages to the web server, but since I didn't keep the older data on my desktop machine, the web pages needed regenerating on the server anyway. I think the files in /var/www/nightly may be owned by me, probably without group write access. That's easy enough to repair.

Once in a long while, the old data in /var/www/nightly might need thinning out to reclaim disk space. Since I started compressing the log files, this hasn't been needed much.
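The flattening-and-compressing step, by the way, is conceptually just a rename-plus-gzip pass over each fetched tree; a rough sketch (not the real grab-logs/update code -- the arguments, and the choice of which files stay uncompressed, are guesses):

    #!/bin/sh
    # Hypothetical sketch of the flatten-and-compress step: every file in one
    # fetched build tree gets its path separators turned into "!" so the run
    # can live in a single flat directory, and the big logs get gzipped with a
    # ".txtgz" suffix for the web server to serve as compressed text.
    srctree=$1      # unpacked tree for one build (assumed argument)
    flatdir=$2      # flat output directory for that run (assumed argument)
    cd "$srctree" || exit 1
    find . -type f | sed 's,^\./,,' | while read f; do
      flat=`echo "$f" | tr / '!'`   # kadmin/passwd/unit-test/kpasswd.sum
                                    #   -> kadmin!passwd!unit-test!kpasswd.sum
      case `basename "$f"` in
        status*)                    # guess: small status files stay uncompressed
          cp "$f" "$flatdir/$flat" ;;
        *)
          gzip -9 -c "$f" > "$flatdir/$flat.txtgz" ;;
      esac
    done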
----
Untested:

We don't do automatic nightly builds on Windows. I think there might still be a cron job run by the pismere folks, but we tend not to hear any results from it. Without something like this, the Windows support code is likely to suffer bitrot.

We don't automatically run tests from outside our tree, like gssmonger or gsstest (Martin Rex's test suite). None of the test cases needing root privileges (telnet, rlogin) are run automatically. And, of course, anything our in-tree test suite doesn't cover (LDAP, plugins, multithreaded programs, MS extensions, vague-errors, kdc-kdb-update) isn't tested by the automated nightly testing.

----
The test cluster:

The four W92 test cluster machines we have (all located on the bottom shelf) are in kind of a sorry state, having gotten little love over the years:

  * rsts-11.mit.edu: an Alpha running Tru64 5.1A, but with a broken disk
  * salamander.mit.edu: an old Sun running an old NetBSD release, in 64-bit mode I believe
  * lxiv.mit.edu: an x86_64(?) box we originally got for Windows testing; it might still be providing remote access for jaltman, but I doubt it
  * another old Sun that, as far as I recall, I never got around to installing

The Sun is using our "usual" root password. It's configured to reboot daily, because some bug in the kernel or in expect sometimes causes the expect processes run by dejagnu to linger. (Actually, it looks like it's configured to reboot twice each night. Oops.)

We've got 64-bit SPARC testing happening on Solaris (dcl), too, so it would probably be reasonable to get an x86(_64) VM to run NetBSD for testing, move anything we still care about off the x86_64 box into VMs, and ditch all of this hardware. (If you do deactivate and discard the Alpha, I might be interested in adding it to my collection.) Nothing we work on currently relates to BIOS, hardware, boot time, etc., so virtualization is probably fine.

Andrew Boardman is in charge of the test cluster these days, I think. At least, he was involved in the big move and reorganization not too long ago. If you keep hardware there, someone should be on the w92-test-cluster list to keep apprised of any goings-on there.

----
Misc:

The main obstacle to doing multiple "make check" runs in parallel is probably the kadmin-related tests, which run one script to start the server processes, then run some tests, then run another script to kill off the servers. If the launching and shutting down of the servers were folded into those test suites themselves, as happens in tests/dejagnu, we could track the correct process id and kill just that server process when the tests are done. Or we could use a wrapper script that launches the server programs, invokes another command (runtest or whatever), and then kills the servers -- and kills them on ^C as well; see the sketch below. The choice of port numbers for the servers also needs not to conflict, but that would probably be easy to manage in the program or script that launches the multiple parallel invocations.
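Something along these lines would do it -- purely a sketch, with placeholder commands where the real KDC/kadmind startup would go, and an arbitrary example port base:

    #!/bin/sh
    # Hypothetical wrapper for one test invocation: start the servers, remember
    # their pids, run the test command, and make sure exactly those servers get
    # killed afterwards -- including on ^C -- rather than killing everything
    # under this uid that looks like a test server.
    PORTBASE=${PORTBASE-3088}       # arbitrary example value; each parallel
    export PORTBASE                 # run would get a different base

    pids=
    cleanup () {
        test -n "$pids" && kill $pids 2>/dev/null
    }
    trap 'cleanup; exit 1' 1 2 15   # HUP, INT (^C), TERM

    # Placeholders: however the kadmin test setup actually launches its servers.
    start_kdc_somehow &     pids="$pids $!"
    start_kadmind_somehow & pids="$pids $!"

    "$@"                    # e.g. runtest with whatever options were passed in
    status=$?

    cleanup
    exit $status

The thing driving the parallel runs would then just invoke this wrapper with a different PORTBASE for each copy.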