Thu Oct 22 01:33: busy-beaver

    b-b spontaneously rebooted. The best explanation I can come up with,
    and it's a bad one, is that I mistyped while issuing sysrqs and hit
    sysrq-b, which is reboot. But I don't buy that, because I didn't see
    any console output from it, and it happened while I was not typing.

    There's nothing useful in any of the logs (they all show a boot, but no
    cause for the shutdown/reboot).

Thu Oct 22 19:14: busy-beaver

    AFS was wedged in a really interesting way. Even though load was over
    2000, we were able to get a shell.  nslcd was kill -9'ed and then
    restarted; kill -9'ing the shell unwedged the console.  There were 1540
    processes from failed Nagios probes.  We removed the machine from the
    pool by shutting down postfix.

    /proc/meminfo reports 1.5/4GB physical and 5.5/8GB swap free.

    Processes (ns-slapd, httpd.worker, fs, check_disk) were
    mostly stuck in traces like:
    ns-slapd      D 0000000000000002     0  1534      1   
     ffff8800b4c53de8 0000000000000286 0000000082dcc877 0000000082dcc877
     ffff8800b4c53da0 ffffffff8100ee82 0000000000000206 0000000082dcc877
     ffff8800b4d03248 000000000000e2e8 ffff8800b4d03248 0000000000012d00
    Call Trace:
     [<ffffffff8100ee82>] ? check_events+0x12/0x20
     [<ffffffff8100b46d>] ? xen_mc_issue.clone.0+0x34/0x4d
     [<ffffffff8100b58a>] ? xen_write_cr0+0x3f/0x46
     [<ffffffff81496bf6>] schedule+0x21/0x49
     [<ffffffff81498c3e>] __down_read+0xa9/0xd5 
     [<ffffffff810519e5>] ? finish_task_switch+0x6c/0xfb
     [<ffffffff81497d14>] down_read+0x3e/0x59
     [<ffffffff810e94b0>] sys_madvise+0x88/0x510
     [<ffffffff81498ce0>] ? trace_hardirqs_off_thunk+0x3a/0x6c
     [<ffffffff810a6db4>] ? audit_syscall_entry+0x12d/0x16d
     [<ffffffff81498ca4>] ? trace_hardirqs_on_thunk+0x3a/0x3c
     [<ffffffff81012082>] system_call_fastpath+0x16/0x1b
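
    To see which commands are stuck in uninterruptible sleep (state D, as
    in the trace above) without reading every trace, something like the
    following could tally them from /proc. This is only a sketch: the
    helper names are made up for illustration, and it assumes a Linux
    /proc with the usual "pid (comm) state ..." layout in /proc/<pid>/stat.

```python
import collections
import glob


def parse_stat(text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the first '(' and the last ')'.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return comm, state


def tally_blocked(stat_texts):
    """Count processes in state D, grouped by command name."""
    counts = collections.Counter()
    for text in stat_texts:
        comm, state = parse_stat(text)
        if state == "D":
            counts[comm] += 1
    return counts


def read_all_stats():
    """Read every /proc/<pid>/stat, skipping processes that exit meanwhile."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                texts.append(f.read())
        except OSError:
            continue
    return texts


if __name__ == "__main__":
    for comm, n in tally_blocked(read_all_stats()).most_common():
        print(n, comm)
```

    Note that comm in /proc/<pid>/stat is truncated by the kernel, so long
    command names will be cut short in the tally.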

    No BUGs or OOPSes were in dmesg output.  /proc/1534/syscall was:

        28 0x7f589cc16000 0x289000 0x4 0x2a9450 0x1000 0x8 0x7f58a1e1ad38 0x39330dae77

    So this is madvise((void *)0x7f589cc16000, (size_t)0x289000,
    MADV_DONTNEED): syscall 28 on x86_64 with advice 4, covering 0x289
    (649) 4K pages.

    MADV_DONTNEED
           Do not expect access in the near future.  (For the  time  being,
           the  application is finished with the given range, so the kernel
           can free resources associated with it.)  Subsequent accesses  of
           pages  in this range will succeed, but will result either in re-
           loading of the memory contents from the underlying  mapped  file
           (see  mmap(2)) or zero-fill-on-demand pages for mappings without
           an underlying file.
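
    The decoding of the /proc/1534/syscall line above can be reproduced
    mechanically. A minimal sketch, assuming x86_64 (where syscall 28 is
    madvise), 4K pages, and the Linux advice values 0 through 4; the
    function name is made up for illustration:

```python
PAGE_SIZE = 4096   # assumption: 4K pages
SYS_MADVISE = 28   # assumption: x86_64 syscall numbering

# Linux madvise(2) advice values 0..4.
MADV_NAMES = {
    0: "MADV_NORMAL",
    1: "MADV_RANDOM",
    2: "MADV_SEQUENTIAL",
    3: "MADV_WILLNEED",
    4: "MADV_DONTNEED",
}


def decode_madvise(line):
    """Decode a /proc/<pid>/syscall line for a task blocked in madvise.

    The line holds the syscall number in decimal, then the argument
    registers in hex, then the stack pointer and program counter.
    madvise only uses the first three arguments.
    """
    fields = line.split()
    nr = int(fields[0])
    if nr != SYS_MADVISE:
        raise ValueError("not madvise: syscall %d" % nr)
    addr, length, advice = (int(f, 16) for f in fields[1:4])
    return {
        "addr": addr,
        "length": length,
        "pages": length // PAGE_SIZE,
        "advice": MADV_NAMES.get(advice, hex(advice)),
    }


LINE = ("28 0x7f589cc16000 0x289000 0x4 0x2a9450 0x1000 0x8 "
        "0x7f58a1e1ad38 0x39330dae77")

if __name__ == "__main__":
    info = decode_madvise(LINE)
    print("madvise(%#x, %#x, %s): %d pages"
          % (info["addr"], info["length"], info["advice"], info["pages"]))
```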

    We should set up remote syslog.

    Can we decide that the syscall itself is irrelevant and the presence
    of schedule() etc. means that something else strange happened?

    That doesn't seem unreasonable to me, but you're far more knowledgeable
    about the kernel.