First, I'm going to tell you what scheduler activations are, and why you want them.
Next, I'll describe how I implemented scheduler activations in the NetBSD kernel, and give some examples of the operations that take place.
After that, I'll discuss the thread library that I implemented using this system,
and finally I'll sum up and take questions.
Threads are a popular mechanism for expressing concurrency in imperative programming. They permit a programmer to provide several sequences of code to execute independently - in parallel, if possible - with explicit synchronization of shared data.
In a Unix system, the kernel is responsible for scheduling code to run. Traditionally, the kernel scheduled processes. Each process contained an address space and a single stream of execution.
With the introduction of threads, suddenly a process is more complicated. We have to ask what controls the scheduling of the individual threads within a process.
[slide: user threads]
One possible answer is that the kernel is oblivious to the presence
of threads. The kernel schedules processes as usual, and each
process is responsible for scheduling its own threads.
However, remember that we got into the thread programming business in order to express concurrency. Consider some situations where concurrency exists: the presence of multiple CPUs, or the ability to run an I/O operation in the background while also executing code. An application-level thread system cannot take advantage of these situations, because the kernel believes that there is only one stream of execution. One stream cannot be split onto several processors, and if that stream blocks - by entering a slow I/O operation, for example - the entire process must wait until the blocking operation completes. So user-level threads turn out to be a poor way to exploit concurrency.
[slide: kernel threads]
A different possible answer to the thread scheduling question is to
make the kernel schedule all threads. This requires modifying the
kernel to support creating new streams of execution that share the
other resources of a process, such as the address space and file
descriptors. Frequently, this is done with a not-quite-fork()
system call.
One-to-one thread scheduling, as this technique is known, does provide the concurrency we desire: the kernel, aware of each of the threads, can distribute them across processors, and can schedule threads to run in the place of a thread that has blocked.
But there are some drawbacks to this approach. The threads consume more kernel memory, which is a moderately precious resource; inter-thread synchronization generally requires kernel calls, which are much slower than user-level thread operations; and if the number of threads does not exactly match the number of processors that the kernel has allocated, time is wasted by time-slicing between threads of the same process that ought to be cooperating with one another.
[slide: two-level threads]
Given these two extremes of implementation, it is natural to want
to strike some kind of compromise that gives us the benefits of
both. We want to expose an application's parallelism to the kernel,
but not burden the kernel with managing each and every thread
operation. Scheduler activations, a system invented about ten years
ago by Thomas Anderson et al. at the University of Washington, is a
way of achieving this compromise.
Under scheduler activations, the kernel provides the application with the abstraction of virtual processors: a guarantee to provide a certain number of CPUs to execute thread code. Events that would affect the number of running threads are communicated directly to the application in a way that maintains the number of virtual processors. The kernel does not make thread scheduling decisions directly; it passes all the relevant information to the application to permit it to make the thread scheduling decision. This deferential approach is crucial to the scheduler activations model.
Some scheduling events include application code being blocked, stopped, or restarted by the kernel - for example, by a thread executing a read() system call on a network socket. When a blocking event such as that occurs, the number of processors executing application code decreases. The kernel uses that processor to deliver a message to the application about the event, and then permits the application to continue to use it to execute code.
Now that we have an outline of the structure of a scheduler activations system, let's look at the implementation in NetBSD.
[slide: LWP]
The prelude to implementing scheduler activations was
separating process context from execution context in the NetBSD
kernel, and permitting multiple execution contexts in a single
process. This was a slow but largely mechanical undertaking. The
parts of the classic BSD struct proc that related to execution
context were relocated to a new structure, struct lwp (LWP for
``light-weight process'', following Solaris and others). This
included such fields as scheduling counters and priority, the run
state and sleep fields, and the kernel stack.
Following this I examined every use of a struct proc variable, to determine whether it was being used for its process context or execution context, and adjusted them to use struct lwp instead when appropriate. Execution context turned out to be the prevalent use of such variables, especially the global variable curproc.
I also added several system calls to directly manipulate LWPs, borrowing from the Solaris interface. This was useful as a testbed for the LWP infrastructure, before diving into scheduler activations.
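To make that concrete, here is a minimal sketch of driving the LWP interface directly. The exact prototypes (a Solaris-style _lwp_create() taking a prepared ucontext_t, plus _lwp_self(), _lwp_wait(), and _lwp_exit()) and the context-setup details are assumptions for illustration, not necessarily the precise NetBSD declarations:

    /* Sketch only: prototypes and context setup are assumed to follow the
     * Solaris-style interface described above; details may differ. */
    #include <lwp.h>        /* assumed to declare _lwp_create() and friends */
    #include <ucontext.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STACKSIZE (64 * 1024)

    static void
    worker(void)
    {
        printf("hello from LWP %d\n", (int)_lwp_self());
        _lwp_exit();            /* an LWP must exit explicitly */
    }

    int
    main(void)
    {
        ucontext_t uc;
        lwpid_t lid;

        /* Build an execution context for the new LWP: a stack and an
         * entry point.  (A real program might use a helper such as
         * _lwp_makecontext(), if available.) */
        if (getcontext(&uc) == -1)
            err(1, "getcontext");
        uc.uc_stack.ss_sp = malloc(STACKSIZE);
        uc.uc_stack.ss_size = STACKSIZE;
        uc.uc_link = NULL;
        makecontext(&uc, worker, 0);

        if (_lwp_create(&uc, 0, &lid) != 0)
            err(1, "_lwp_create");
        _lwp_wait(lid, NULL);   /* reap the LWP when it exits */
        return 0;
    }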
Now I'm going to explain the implementation of scheduler activations themselves by walking through the lifecycle of a multithreaded application that uses SAs - for simplicity of explanation, on a uniprocessor machine.
[slide: SA initialization]
First, the application notifies the kernel of its intent to use
scheduler activations with the sa_register() system call. As a
parameter to this call, the application passes a pointer to its
upcall handler.
Next, the application allocates memory for upcall stacks and passes it to the kernel with the sa_stacks() call.
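Putting these steps together (plus the sa_enable() call described on the next slide), the setup looks roughly like the sketch below. The <sa.h> header name, the extra arguments to sa_register(), and the (count, array) form of sa_stacks() are assumptions based on the description here, not a definitive statement of the kernel interface:

    /* Sketch of SA setup as described above.  Header name and exact
     * prototypes are assumptions, not the definitive interface. */
    #include <sa.h>          /* assumed: sa_register(), sa_stacks(), sa_enable() */
    #include <signal.h>      /* stack_t */
    #include <stdlib.h>
    #include <err.h>

    #define NSTACKS   16
    #define STACKSIZE (16 * 1024)

    /* The upcall handler; its signature matches the upcall invocation
     * shown later in this talk. */
    extern void upcall_handler(int type, struct sa_t *sas[], int events,
        int interrupted, void *arg);

    void
    start_sa(void)
    {
        stack_t stacks[NSTACKS];
        int i;

        /* 1. Tell the kernel where the upcall handler lives.  (The
         *    second and third arguments, for the old handler and flags,
         *    are assumptions.) */
        if (sa_register(upcall_handler, NULL, 0) == -1)
            err(1, "sa_register");

        /* 2. Hand the kernel a pool of stacks to run upcalls on. */
        for (i = 0; i < NSTACKS; i++) {
            stacks[i].ss_sp = malloc(STACKSIZE);
            stacks[i].ss_size = STACKSIZE;
            stacks[i].ss_flags = 0;
        }
        if (sa_stacks(NSTACKS, stacks) == -1)
            err(1, "sa_stacks");

        /* 3. Start the SA subsystem (next slide); this triggers the
         *    first SA_UPCALL_NEWPROC upcall. */
        sa_enable();
    }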
[slide: SA starting]
Next, the application starts the SA subsystem by calling
sa_enable(). From this point on, the kernel will use upcalls to
communicate with the application. sa_enable() itself triggers the
first upcall, and we get to see how that operates.
The kernel function sa_upcall() is invoked by sa_enable(). sa_upcall() reads the state of the application registers out of the trap frame and stores them in a ucontext_t structure. The ucontext_t, one of the stacks passed in by the application, and the type of the upcall - in this case, SA_UPCALL_NEWPROC - are stored in a structure, and a flag is set in the LWP's flag field to indicate that it has an upcall pending.
As the code exits the kernel, it passes through the userret() function, which is the last stop before returning to userlevel. userret() notices that the upcall flag is set, and calls sa_upcall_userret(). sa_upcall_userret() copies out the ucontext_t onto the upcall stack, and sets up the trapframe and stack so that the application will execute in the upcall handler with the proper arguments:
sa_upcall(SA_UPCALL_NEWPROC, {{NULL, 1, 0}, {&context, 1, 0}}, 1, 0, NULL)
Upon returning to userland, the upcall handler runs, sees that the application has been issued a virtual processor, and uses the ucontext_t structure to jump back to the code path that contained the sa_enable() call.
Unlike a signal handler, an upcall handler does not jump back into the kernel when it is finished; instead, it simply runs application code. This presents a small bookkeeping problem: the kernel does not know when the stack it supplied to an upcall is no longer needed, and it would eventually run out of stacks. Rather than have the kernel attempt to examine the stack locations of LWPs and guess whether the upcall stacks are still in use, the application is responsible for recycling the upcall stacks with the sa_stacks() system call once they are no longer needed. This operation can be batched so that the system call is incurred only once every N upcalls.
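As an illustration, a skeleton of an upcall handler with batched stack recycling might look like the following. The helper routines and the layout of the sas[] array (sas[0] describing the upcall itself, sas[1] the event) are inferred from the call shown above; they are placeholders, not the actual library code:

    /* Skeleton upcall handler; the helpers marked "hypothetical" stand
     * in for the thread library's internals. */
    #include <sa.h>          /* assumed: struct sa_t, sa_stacks(), upcall types */
    #include <signal.h>

    #define RECYCLE_BATCH 8      /* the "N" in the batching described above */

    extern stack_t upcall_stack_of(struct sa_t *);   /* hypothetical */
    extern void resume_sa_context(struct sa_t *);    /* hypothetical */

    static stack_t used_stacks[RECYCLE_BATCH];
    static int nused;

    void
    upcall_handler(int type, struct sa_t *sas[], int events,
        int interrupted, void *arg)
    {
        /* Note the stack this upcall ran on; return a whole batch of
         * stacks to the kernel with a single sa_stacks() call. */
        used_stacks[nused++] = upcall_stack_of(sas[0]);
        if (nused == RECYCLE_BATCH) {
            sa_stacks(nused, used_stacks);
            nused = 0;
        }

        switch (type) {
        case SA_UPCALL_NEWPROC:
            /* We have been granted a virtual processor: jump back into
             * application code, e.g. the context that called
             * sa_enable(). */
            resume_sa_context(sas[1]);
            break;
        /* ... other upcall types are handled in later examples ... */
        }
        /* Not reached for NEWPROC: the handler switches to application
         * code rather than returning to the kernel. */
    }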
[slide: SA preemption]
The scheduler activations application is not at all immune to
normal process scheduling. If other processes are runnable, the
kernel will decide to run another process when the quantum
expires. The running process is made to call the preempt() function
in the kernel, where it puts itself on a run queue and switches to
another process. For a scheduler activations process, this
represents the reduction of the virtual processor
allocation. However, since the allocation has just been reduced to
zero, there is no way to inform the process of this - the
notification of preemption only occurs after the process has been
resumed by the kernel.
[slide: SA block/unblock]
Some time later, the process makes a system call which blocks -
let's say select(). Select() blocks by using the kernel interface
tsleep(). Tsleep() sees that the process is using scheduler
activations, and once it has done its bookkeeping, instead of
calling mi_switch() to jump to another runnable process, it calls
sa_switch(). sa_switch() creates a new LWP 2, calls
sa_upcall(SA_UPCALL_BLOCKED, LWP 1) on it, and then switches
directly to LWP 2, which returns to userlevel as described
before. There, the upcall handler sees that LWP 1 was blocked. It
maps that to a thread ID, notes the fact that that thread is in the
kernel, and then picks a thread to run from its queue (implicitly,
on the new LWP).
Eventually, the kernel wakes up the LWP sleeping in select() when data arrives. This results in LWP 1 continuing inside sa_switch(), which is inside tsleep(). sa_switch() arranges for an SA_UPCALL_UNBLOCKED upcall, with one twist: LWP 2 is stopped, and its context is passed to the upcall handler as the "interrupted" LWP. When LWP 1 returns to userlevel and runs the upcall handler, it sees that LWP 1 was unblocked and LWP 2 was interrupted; both of the associated threads can be added to the thread run queue.
If this were not done, LWP 1 and 2 would both be running, and would be time-sliced against one another by the kernel scheduler, taking scheduling control out of the hands of the application, and violating the invariant that the number of virtual processors executing is the same as the number of physical processors allocated.
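In the upcall handler, the corresponding logic can be sketched roughly as follows, assuming (purely for illustration) that the sas[] array carries the event contexts first and the interrupted ones after them, with hypothetical helpers standing in for the thread library's run queue and thread lookup:

    /* Sketch of BLOCKED/UNBLOCKED handling; helpers and layout assumed. */
    #include <sa.h>                  /* assumed SA definitions */

    struct pthread_st;                                         /* thread control block */
    extern struct pthread_st *thread_for_sa(struct sa_t *);    /* hypothetical */
    extern void mark_blocked_in_kernel(struct pthread_st *);   /* hypothetical */
    extern void add_to_runqueue(struct pthread_st *);          /* hypothetical */
    extern void run_next_thread(void);                         /* hypothetical */

    static void
    handle_blocking_events(int type, struct sa_t *sas[], int events,
        int interrupted)
    {
        int i;

        if (type == SA_UPCALL_BLOCKED) {
            /* sas[1..events]: LWPs that blocked in the kernel.  Their
             * threads stay off the run queue until they unblock. */
            for (i = 1; i <= events; i++)
                mark_blocked_in_kernel(thread_for_sa(sas[i]));
        } else if (type == SA_UPCALL_UNBLOCKED) {
            /* sas[1..events]: threads whose kernel operation completed;
             * they are runnable again. */
            for (i = 1; i <= events; i++)
                add_to_runqueue(thread_for_sa(sas[i]));
        }

        /* sas[events+1 .. events+interrupted]: LWPs the kernel stopped
         * to keep the virtual-processor count constant; their threads
         * also go back on the run queue. */
        for (i = events + 1; i <= events + interrupted; i++)
            add_to_runqueue(thread_for_sa(sas[i]));

        /* Finally, pick a thread to run on this virtual processor. */
        run_next_thread();
    }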
[slide: SA yield]
Later, another LWP blocks and the upcall handler is invoked. This
time, however, the upcall handler sees that there are no runnable
threads - the process has exhausted its concurrency. In order to be
a good citizen and reallocate the processor to another use, the
upcall handler calls sa_yield(), informing the kernel that the
application no longer needs that virtual processor. The sa_yield()
system call does not return.
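In code, that idle path amounts to something like the following (the run-queue helpers are hypothetical):

    #include <sa.h>                        /* assumed: sa_yield() */

    extern int runqueue_empty(void);       /* hypothetical */
    extern void run_next_thread(void);     /* hypothetical */

    /* Called at the end of upcall processing: either run a thread on
     * this virtual processor or hand the processor back to the kernel. */
    static void
    dispatch_or_yield(void)
    {
        if (runqueue_empty())
            sa_yield();        /* does not return */
        run_next_thread();
    }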
Finally, the application exits. The exit code in the kernel disables scheduler activations, cleans up all of the LWPs (in addition to the one that called exit()), and then terminates the process.
[slide: list of events]
The complete list of events that generate upcalls is shown on this slide.
The kernel code that implements scheduler activations does have some machine-dependent bits, and requires LWP conversion of the platform-specific code. For NetBSD, where we support almost 50 platforms and a dozen CPU types, this means a lot of total effort, even if each bit is small. So far, we have done the CPU-specific work for the alpha, arm, i386, m68k, mips, powerpc, and vax, and it seems to be about a day's work per target CPU.
The thread library registers an upcall handler and enables scheduler activations when the first non-main() thread is created. It maintains a run queue of threads, tracks threads blocked in the kernel or sleeping on user-level synchronization resources, and generally does all the things you want a pthreads library to do. As expected, most of the upcall events are blocking and unblocking events for I/O, and preemptions to run other programs. Logging in over the network to run test programs produces a lot of preemptions, as the sshd process keeps preempting the tests.
One of the first tasks that the upcall handler faces is identifying the threads that were affected by scheduler activation events. The thread library uses stack-based identification; the stack pointer in a saved context can very quickly be looked up and used to find the thread control block. This is somewhat easier than explicitly tracking the mapping between LWPs and threads, and it puts the burden of work in the upcall code rather than in the thread switch code. On architectures where the ABI dedicates a register to thread support, that can be used instead.
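One common way to arrange this - an assumption here, not necessarily the library's exact scheme - is to allocate each thread's stack on a boundary aligned to its (power-of-two) size and to place the thread control block at the stack base, so that masking the saved stack pointer yields the thread directly:

    /* Sketch of stack-pointer-based thread identification, assuming
     * size-aligned stacks with the TCB at the base of each stack. */
    #include <stdint.h>
    #include <ucontext.h>

    #define STACK_SIZE ((uintptr_t)(256 * 1024))     /* assumed; power of two */
    #define STACK_MASK (~(STACK_SIZE - 1))

    struct pthread_st;                               /* thread control block */

    /* Pulling the stack pointer out of a ucontext_t is machine-dependent;
     * this helper is a placeholder for that per-architecture code. */
    extern uintptr_t sp_from_context(const ucontext_t *);

    static struct pthread_st *
    thread_for_context(const ucontext_t *uc)
    {
        /* Round the saved stack pointer down to the stack base, where
         * the TCB was stored when the stack was allocated. */
        return (struct pthread_st *)(sp_from_context(uc) & STACK_MASK);
    }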
A difficulty faced by the library is that upcalls can occur at any moment and interrupt any section of code. This presents problems for critical sections, which are normally protected by spinlocks. It would be very easy for an application to deadlock if an upcall interrupted anything holding the run queue lock, for example. Following the lead of the original scheduler activations implementation and the Mach implementation, I use a strategy of recovery, rather than prevention. That is, no effort is made to prevent upcalls at inconvenient moments. Instead, before the upcall handler acquires any spinlocks itself, it examines any interrupted threads to determine if they were holding any spinlocks (the spinlock code maintains a spinlock counter in the TCB). If they were, then they are permitted to run until their lock count goes to zero, at which point they return to the upcall handler. The same treatment is applied if any of the interrupted threads were themselves upcalls; they are permitted to run to completion, as they may hold state about other threads that is not yet known by the rest of the system. The code that handles this is unfortunately quite complicated, due to the possible recursive preemption of upcalls.
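A sketch of the recovery check, with hypothetical field and helper names standing in for the library's internals, looks roughly like this:

    /* Sketch of the recovery strategy: before the upcall handler takes
     * any spinlocks of its own, interrupted upcalls are run to
     * completion and interrupted lock-holders are run until they drop
     * their locks.  Field and helper names are hypothetical. */
    struct pthread_st {
        int pt_spinlocks;     /* counter maintained by the spinlock code */
        int pt_is_upcall;     /* this context is itself an upcall */
        /* ... the rest of the thread control block ... */
    };

    extern void continue_to_completion(struct pthread_st *);        /* hypothetical */
    extern void continue_until_locks_released(struct pthread_st *); /* hypothetical */

    static void
    recover_interrupted(struct pthread_st **victims, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (victims[i]->pt_is_upcall) {
                /* An interrupted upcall may hold state about other
                 * threads not yet known to the rest of the system;
                 * let it finish. */
                continue_to_completion(victims[i]);
            } else if (victims[i]->pt_spinlocks > 0) {
                /* Run the thread until its lock count reaches zero; it
                 * then returns control to this upcall handler. */
                continue_until_locks_released(victims[i]);
            }
        }
    }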
The thread library also has some CPU-specific code to implement the user-level context switches.
Preliminary microbenchmarks show that the thread library keeps its promise of providing thread operations that are as fast as a pure user-level implementation, but without sacrificing concurrency in order to do so. It is, however, too early and too immature to present numerical measurements.
There are several avenues for future work. First is to get the code used and tested by more people: Scheduler activations will be integrated into the main development sources of NetBSD in the near future. Next is integration with NetBSD's SMP support. Performance tuning will, of course, play a large role. Some low-hanging fruit for performance optimization includes inter-thread lazy-FPU context switching (which has some difficult interactions with user-level thread switches), speeding up the thread identification routine, and restructuring the spinlock-resolution code to avoid touching the TCB on every lock and unlock. With luck, other people will experiment with this framework and develop even more interesting systems that they can tell us about at future conferences.