Ken's (not-so-)brief guide to the nightly testing setup.

----
The future:

I've written notes elsewhere (multiple times?) on ways we might improve this, if we had the time and interest to invest in it. Currently it's very clunky, with data collection driven by me remembering to run a script each day (or every few days) while holding my Kerberos tickets *and* my ssh key. (Some machines run Kerberos rlogind, some run sshd; neither is/was set up on all of them.)

Something set up under a web server, with accounts for the (possibly non-MIT) test machines and backed by a database for storing the actual info, would make a lot more sense. Generate the web pages dynamically by pulling data out of the database -- which, frankly, would probably happen less often than updates to the database -- and much of the collection-time work is gone. We'd probably want to augment it with some kind of account-management hooks so the account registration wouldn't be 100% manual, and so you could attach data like an email address for the responsible contact person, etc.

Other packages like "buildbot" may be worth looking into as well. My impression of buildbot is that it's probably good for giving a simple status report and running centrally-driven tests, and doing them continually as changes are made; maybe not so good for letting outside contributors drop their own machines of whatever configuration into the test setup. (Ours isn't great for that either, but in a new system we might want to improve on it.)

But enough about that; back to the *current* setup....

----
Running the nightly builds and tests:

The test machines run the "cron-script" script, via cron, obviously. Generally this is done in the non-privileged "krbsnap" account. Some systems run it from AFS; others have local copies, especially if the OS doesn't support AFS well, so updates to the scripts have to be propagated manually.

It should kick off after the nightly snapshot is done. There's no hook into the snapshot mechanism here, but assuming 4AM or later US/Eastern time is generally safe, if the nightly snapshot works at all. I usually spread the start times out a little so they don't all slam the ftp server at once, not that it would be a terribly high load anyway. If they're not started too late, most of the builds will be done before I'm likely to collect the results. However, doing {shared,static}x{32-bit,64-bit}x{release,trunk} on a slow machine will take a long time, whatever you do. And because of the use of fixed port numbers in the test suite, and the tendency to just kill off all processes under the current uid that look like they might have been servers started up for testing, you can't do separate runs in parallel without some redesign. (Build+check in parallel, sure; check+check, no.)

The cron script decides, based on the machine name, what combinations of options to test -- gcc vs. native cc, 32-bit vs. 64-bit, trunk vs. release branch, that sort of thing. It used to include shared vs. static libraries, but we don't support static builds any more. The script also tweaks $PATH if needed, and knows for each host how to add a subject line to mail generated off the command line. (The generated email goes to krbdev-auto@mit.edu, a public list.) For each set of options -- selected by setting certain environment variables -- the "run-build" script is invoked once.
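Very roughly, the dispatch looks something like the following -- a simplified sketch only, not the actual cron-script; the option-set names, variable names, and per-host choices here are all made up for illustration:

    #!/bin/sh
    # Hypothetical sketch of cron-script's per-host dispatch.  The real script
    # differs; the point is just the shape: pick option sets by host name,
    # export them, and run "run-build" once per combination.
    host=`hostname | sed 's/\..*//'`
    case $host in
      dcl)        configs="cc-32 cc-64 gcc-32" ;;   # native cc and gcc, 32/64-bit
      salamander) configs="gcc-64" ;;
      *)          configs="gcc-32" ;;
    esac
    for cfg in $configs; do
      for branch in trunk release; do
        # CONFIG and BRANCH stand in for whatever variables run-build really
        # looks at; output goes to the krbdev-auto list with a subject line
        # (exactly how the Subject: gets added varies by host).
        CONFIG=$cfg BRANCH=$branch ./run-build 2>&1 |
          mail -s "nightly build: $host $cfg $branch" krbdev-auto@mit.edu
      done
    done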
The run-build script adjusts the path, figures out where TCL is, stuff like that, writes various bits of info (compiler, locations and versions of various programs, OS version, environment variables, build options) to its log file, and then executes several phases -- download, unpack, reconf, configure, build, install, check. It disables the use of /dev/random before compiling, because on some mostly-idle test systems that can lead to long delays that cause tests to fail or just not run to completion. On some systems this script also turns on malloc debugging options or other local bits of magic that might help catch bugs. Some builds are done in parallel (GNU make "-j" option); others are not.

The release branch snapshot to test is identified in one of the scripts; currently it's still set to 1.6. It might be possible to specify that both the 1.6 and 1.7 branches should be tested, as well as the trunk; I haven't tried.

The run-build script sets PORTBASE, which alters the service port numbers used by the DejaGnu tests for running the KDC and admin server; this should reduce conflicts with other test invocations that might happen to be running around the same time under other uids. If Zephyr is available on the local machine, it also reports its completion status on class krb5, instance autobuild; there may be one or two messages going to me personally. Reporting problems via xmpp to the krb5 chat room might get them more attention. Maybe.

Most of the machines in the nightly testing setup are MIT's, many deactivated and some picked up from the reuse list. Some are on the bottom shelf of the test cluster; others are in offices. One or two are mine, and I'll have to review the list to see which ones will be going away. That won't mess things up too much; there'll just be an ssh timeout or several, and empty spots in the summary table.

It would be good to ensure that at least one NetBSD platform stays active in the list, especially if tests of multithreaded use get added, since NetBSD has one of the pickier thread implementations around, whereas Linux and others sometimes let you get away with doing broken things. Similarly, at least one tested architecture should have strict memory access alignment restrictions, to catch another class of bugs; x86 platforms will still "work" with misaligned accesses, just maybe a bit slower than if the code were actually correct.

The current list of test machines from which we try to collect results is in get-logs.mk:

  dcl            (Athena Solaris)
  all-in-one     (Athena Linux)
  rsts-11        (Tru64 5.1A, offline)
  gamma-11       (Mac, offline)
  opteron-prime  (Debian Linux, W92 machine room)
  salamander     (sparc64 NetBSD, test cluster)

Others not currently online or not currently in use:

  sfdf            (ia64?, Garry's office?)
  capacitor-bank  (SGI, Tom's office)
  rsx-11          (ppc-aix4.3.3, possibly still in Thomas's office, broken disk?)

The test runs on dcl will often fail if someone isn't logged in, because the "reactivate" process causes the locker with the Sun compiler in it to disappear while the build is running. There's probably some way to hack around that with "screen" or something; I just haven't bothered. (UPDATE: I've just hacked the crontab script to restrict the hours during which it'll reactivate, so staying logged in shouldn't be necessary, but if there's another Athena OS update for Solaris it'll likely get clobbered.) Also, dcl uses a second cron job, run just before the regular nightly build script, to acquire tickets so it can access the sunsoft compiler.
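That ticket-acquiring job only needs to do a couple of things; a minimal sketch, assuming a keytab for the krbsnap principal (the keytab path and principal name are invented -- check dcl's actual crontab for the real thing):

    #!/bin/sh
    # Hypothetical pre-build cron job for dcl: get Kerberos tickets for the
    # krbsnap account, then AFS tokens, so the Sun compiler locker is
    # attachable during the build.  Keytab path and principal are made up.
    kinit -k -t /var/local/krbsnap.keytab krbsnap/dcl.mit.edu || exit 1
    aklog
    # Run from krbsnap's crontab a few minutes before cron-script, e.g.:
    #   45 3 * * * /var/local/get-build-tickets.sh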
----
Collecting results:

The "update" script pulls in the various log files from the build systems via rsh or ssh to the invoker's own account on the test machine (not the test account), and unpacks them into a local tree under /var/raeburn/nightly/tmp. It fetches them via an invocation of GNU make, so the fetches can run in parallel, which might be a bit gratuitous. The script "grab-logs" knows how to fetch the logs from one test run, and has hard-coded knowledge of which remote-shell program to use for which remote host.

The trees are then flattened out into a single directory each, with altered file names, so "kadmin/passwd/unit-test/kpasswd.sum" would become "kadmin!passwd!unit-test!kpasswd.sum". The log files (everything but the smallest status files) are compressed and given a ".txtgz" suffix, which in the web server configuration indicates that the file should be described as text and transferred compressed if possible; the receiving web browser just sees the uncompressed text file. Even compressed, some of these files can be over a megabyte, but the dejagnu output is highly repetitive and compresses very well. Without compression, the log files use up a huge amount of space, and very quickly.

The files are then copied over to krbdev.mit.edu, and the "make-web-pages.pl" script is run over them. This examines the status files and (re)generates a collection of web pages, giving an overall status summary, per-platform reports, and per-build listings of all available log files. The scripts that collect log data need to know what configurations are tested and where the files live, so they have to be kept in sync with the scripts actually running the tests.

The code in util.pl has hardcoded paths for the location of the data to process -- /var/www/nightly if you're on krbdev.mit.edu, or /var/raeburn/nightly/tmp on any other system. Obviously that'll need to be fixed. For a while I was generating the HTML on my desktop machine and then pushing the pages to the web server, but since I didn't keep the older data on my desktop machine, the web pages needed regenerating on the server anyway. I think the files in /var/www/nightly may be owned by me, probably without group write access. That's easy enough to repair.

Once in a long while, the old data in /var/www/nightly might need thinning out to reclaim disk space. Since I started compressing the log files, this hasn't been needed much.
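The flattening-and-compressing step, by the way, is conceptually just a rename-plus-gzip pass over each fetched tree; a rough sketch (not the real grab-logs/update code -- the arguments, and the choice of which files stay uncompressed, are guesses):

    #!/bin/sh
    # Hypothetical sketch of the flatten-and-compress step: every file in one
    # fetched build tree gets its path separators turned into "!" so the run
    # can live in a single flat directory, and the big logs get gzipped with a
    # ".txtgz" suffix for the web server to serve as compressed text.
    srctree=$1      # unpacked tree for one build (assumed argument)
    flatdir=$2      # flat output directory for that run (assumed argument)
    cd "$srctree" || exit 1
    find . -type f | sed 's,^\./,,' | while read f; do
      flat=`echo "$f" | tr / '!'`   # kadmin/passwd/unit-test/kpasswd.sum
                                    #   -> kadmin!passwd!unit-test!kpasswd.sum
      case `basename "$f"` in
        status*)                    # guess: small status files stay uncompressed
          cp "$f" "$flatdir/$flat" ;;
        *)
          gzip -9 -c "$f" > "$flatdir/$flat.txtgz" ;;
      esac
    done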
----
Untested:

We don't do automatic nightly builds on Windows. I think there might still be a cron job run by the pismere folks, but we tend not to hear any results from it. Without something like this, the Windows support code is likely to suffer bitrot.

We don't automatically run tests from outside our tree, like gssmonger or gsstest (Martin Rex's test suite). None of the test cases needing root privileges (telnet, rlogin) are run automatically. And, of course, anything our in-tree test suite doesn't cover (LDAP, plugins, multithreaded programs, MS extensions, vague-errors, kdc-kdb-update) isn't tested by the automated nightly testing.

----
The test cluster:

The four W92 test cluster machines we have (all located on the bottom shelf) are in kind of a sorry state, having gotten little love over the years:

  * rsts-11.mit.edu: an Alpha running Tru64 5.1A, but with a broken disk
  * salamander.mit.edu: an old Sun running an old NetBSD release, in 64-bit mode I believe
  * lxiv.mit.edu: an x86_64(?) box we originally got for Windows testing; it might still be providing remote access for jaltman, but I doubt it
  * another old Sun that, as far as I recall, I never got around to installing

The Sun is using our "usual" root password. It's configured to reboot daily, because some bug in the kernel or in expect sometimes causes the expect processes run by dejagnu to linger. (Actually, it looks like it's configured to reboot twice each night. Oops.)

We've got 64-bit SPARC testing happening on Solaris (dcl), too, so it would probably be reasonable to get an x86(_64) VM to run NetBSD for testing, move anything we still care about off the x86_64 box into VMs, and ditch all of this hardware. (If you do deactivate and discard the Alpha, I might be interested in adding it to my collection.) Nothing we work on currently relates to BIOS, hardware, boot time, etc., so virtualization is probably fine.

Andrew Boardman is in charge of the test cluster these days, I think. At least, he was involved in the big move and reorganization not too long ago. If you keep hardware there, someone should be on the w92-test-cluster list to keep apprised of any goings-on there.

----
Misc:

The main obstacle to doing multiple "make check" runs in parallel is probably the kadmin-related tests, which run one script to start the server processes, then run some tests, then run another script to kill off the servers. If the launching and shutting down of the servers were folded into those test suites themselves, as happens in tests/dejagnu, we could track the correct process id and kill just that server process when the tests are done. Or we could use a wrapper script that launches the server programs, invokes another command (runtest or whatever), and then kills the servers -- and kills them on ^C as well; see the sketch below. The choice of port numbers for the servers also needs not to conflict, but that would probably be easy to manage in the program or script that launches the multiple parallel invocations.
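Something along these lines would do it -- purely a sketch, with placeholder commands where the real KDC/kadmind startup would go, and an arbitrary example port base:

    #!/bin/sh
    # Hypothetical wrapper for one test invocation: start the servers, remember
    # their pids, run the test command, and make sure exactly those servers get
    # killed afterwards -- including on ^C -- rather than killing everything
    # under this uid that looks like a test server.
    PORTBASE=${PORTBASE-3088}       # arbitrary example value; each parallel
    export PORTBASE                 # run would get a different base

    pids=
    cleanup () {
        test -n "$pids" && kill $pids 2>/dev/null
    }
    trap 'cleanup; exit 1' 1 2 15   # HUP, INT (^C), TERM

    # Placeholders: however the kadmin test setup actually launches its servers.
    start_kdc_somehow &     pids="$pids $!"
    start_kadmind_somehow & pids="$pids $!"

    "$@"                    # e.g. runtest with whatever options were passed in
    status=$?

    cleanup
    exit $status

The thing driving the parallel runs would then just invoke this wrapper with a different PORTBASE for each copy.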