I/O Event Handling Under Linux

Richard Gooch

3-JUN-1999


Introduction

I/O event handling is about how your Operating System allows you to manage a large number of open files (file descriptors in UNIX/POSIX, or FDs) in your application. You want the OS to notify you when FDs become active (have data ready to be read or are ready for writing). Ideally you want a mechanism that is scalable: a large number of inactive FDs should cost very little in memory and CPU time to manage.

First I should start off by stating my goal: to develop a thin interface that makes best use of the facilities the OS provides, scaling as well as possible. Where possible, use of standard POSIX facilities is preferred. A secondary goal is to provide a convenient interface for applications. The reason for preferring standard POSIX interfaces is that they introduce fewer conditional blocks in the interface code.

The Traditional UNIX Way

Traditional Unix systems provide the select(2) and/or poll(2) system calls. With both of these you pass an array of FDs to the kernel, with an optional timeout. When there is activity, or when the call times out, the system call will return. The application must then scan the array to see which FDs are active. This scheme works well with small numbers of FDs, and is simple to use. Unfortunately, for thousands of FDs, this does not work so well.
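
To make the costs concrete, here is a minimal sketch of such a polling loop using poll(2); the array size is illustrative and error handling is trimmed:

    #include <poll.h>
    #include <unistd.h>

    #define NFDS 1000

    /* A minimal sketch of a traditional poll(2) event loop. */
    void event_loop(struct pollfd fds[NFDS])
    {
        char buf[4096];
        int i, nready;

        for (;;) {
            /* The kernel scans all NFDS entries on every call, even if
               only a handful of them are active.  */
            nready = poll(fds, NFDS, -1);
            if (nready <= 0)
                continue;  /* timeout or EINTR: just retry */

            /* The application must then scan the whole array as well. */
            for (i = 0; i < NFDS; i++)
                if (fds[i].revents & POLLIN)
                    read(fds[i].fd, buf, sizeof buf);  /* process activity */
        }
    }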

The kernel has to scan your array of FDs and check which ones are active. This takes approximately 3 microseconds (3 us) per FD on a Pentium 100 running Linux 2.1.x. Now you might think that 3 us is quite fast, but consider if you have an array of 1000 FDs. This is now 3 milliseconds (3 ms), which is 30% of your timeslice (each timeslice is 10 ms). If it happens that there is initially no activity and you specified a timeout, the kernel will have to perform a second scan after some activity occurs or the syscall times out. Ouch! If you have an even bigger application (like a large http server), you can easily have 10000 FDs. Scanning times will then take 30 ms, which is three timeslices! This is just way too much.

You might say that 3 ms for 1000 FDs is not such a big deal: a user will hardly notice that. The problem is that the entire array of FDs is scanned each time you want to go back to your polling loop. The way these applications work is that after checking for activity on FDs, the application processes the activity (for example, reading data from active FDs). When all the activity has been processed, the application goes back to polling the OS for more FD activity. In many cases, only a small number of FDs are active at any one time (say during each timeslice), so it may only take a few milliseconds to process all the activity. High-performance HTTP servers can process hundreds or thousands of transactions per second. A server that takes 2 ms to process each active FD can process 500 transactions per second. If you add 3 ms for FD scanning in the kernel, you now have 5 ms per transaction. That only gives 200 transactions per second, a massive drop in performance.

There is another problem: the application also needs to scan the "returned" FD array that the kernel has updated to see which FDs are active. This is yet another scan of a large array. It isn't as costly as the kernel scan, for reasons I'll get to later, but it is still a finite cost.

New POSIX Interfaces

A fairly simple proposal is to use the POSIX.4 Asynchronous I/O (AIO) interface (aio_read() and friends). Here we would call aio_read() for each FD. This would then queue thousands of asynchronous I/O requests. This model looks appealing, until we look under the hood of some aio_*() implementations. The Linux glibc implementation is a case in point: there is no kernel support. Instead, the C library (glibc 2.1) launches a thread per FD for which there are outstanding AIO requests (up to the maximum number of configured threads). In general, implementing this facility in the C library is reasonable, as it avoids kernel bloat. However, if you use this facility to start thousands of AIO requests, you may end up creating thousands of threads. This is no good, since threads are costly. The "obvious" solution, then, is to implement AIO in the Linux kernel. Another solution is to use userspace tricks to avoid the scalability problems (see the description of migrating FDs below). These solutions may be fine if you only want to run under Linux, but they are not much help if you want to run under another OS which also implements AIO using threads (and for which you don't have the source code, so you cannot change the implementation). The point here is that there appears to be no guarantee that aio_*() implementations are scalable across the platforms which support them.
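
For concreteness, here is a minimal sketch of queueing one request through this interface (the queue_read() wrapper is my own name for the example; error handling is omitted):

    #include <aio.h>
    #include <string.h>

    /* Minimal POSIX.4 AIO sketch: queue an asynchronous read on fd.
       The aiocb must stay valid until the request completes.  */
    int queue_read(struct aiocb *cb, int fd, void *buf, size_t len)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = 0;
        return aio_read(cb);  /* returns at once; glibc 2.1 services the
                                 request with a thread behind the scenes */
    }

    /* Completion: call aio_error(cb) until it stops returning EINPROGRESS,
       then collect the transfer count with aio_return(cb).  */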

It is also worth noting that POSIX.4 Asynchronous I/O is not necessarily available on all POSIX.4 compliant systems (facilities defined by POSIX.4 are optional). So even if you were prepared to limit your application to POSIX.4 systems, there is still no guarantee that AIO is available. Many or most implementations will be scalable, but we can't be sure all are scalable, so we need an alternative.

I should also point out that I don't think the POSIX.4 AIO interface is particularly appropriate for a network server application. AIO doesn't provide a callback interface, so it is harder to build higher-layer network connection objects in your application. So just writing a better (scalable) implementation of AIO is not ideal, either.

Readiness Event Queues and other OSes

Some other operating systems provide a mechanism called I/O completion ports and some have event queues. Both are mechanisms for telling the OS that you want to be notified when there is activity on a FD. Usually, when a FD becomes active (i.e. becomes ready for reading or writing), the OS sends a message (a "readiness event") to a "port" (perhaps another FD such as a pipe). This message contains the FD that has become active. A single port can receive readiness events from many FDs. The key difference here is that you do not need to pass the OS a massive FD array each time you want to listen for events: you only need to tell the OS once, for each FD, that you want to receive readiness events for it. The kernel no longer needs to scan a massive FD array each time through your polling loop, and nor does the application. This is an appealing approach, and scales very well.
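
To illustrate the model, here is a sketch of what using such a mechanism might look like. The event_queue_create(), event_watch_fd() and event_wait() calls, the EVENT_* flags and process_fd() are all hypothetical, invented for this example; no standard interface of this kind exists:

    /* Hypothetical readiness-queue interface, for illustration only. */
    struct readiness_event {
        int fd;          /* the FD that became active */
        int events;      /* what it is ready for */
    };

    void run(int *fds, int num_fds)
    {
        int i, q = event_queue_create();

        /* Register each FD once, instead of passing a huge array to the
           kernel on every pass through the polling loop.  */
        for (i = 0; i < num_fds; i++)
            event_watch_fd(q, fds[i], EVENT_READ | EVENT_WRITE);

        for (;;) {
            struct readiness_event ev;

            event_wait(q, &ev);            /* blocks; no array scan anywhere */
            process_fd(ev.fd, ev.events);  /* hypothetical handler */
        }
    }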

Unfortunately, I/O readiness queues are not POSIX. They're not even Unix. We could add them to Linux, but it would mean that applications that relied on this mechanism would be unportable, Linux-only. Also, it could involve significant additions to the kernel, which sets off my bloat alarms. I would hope that it is possible to develop an effective alternative that uses standard POSIX/UNIX functionality.

Optimising Existing UNIX Interfaces

There are improvements we can make to address the massive FD scanning problem. Firstly, we can optimise the way the scanning is done inside the kernel. Right now (2.1.106) the kernel has to call the poll() method for each file structure. This is expensive. Back in the 2.1.5x kernels, I coded a better implementation for the kernel which sped things up almost 3 times. While drivers must be modified to take advantage of it, this approach has the advantage of not changing the semantics we expect from UNIX. Note one other interesting feature of this optimisation: it centralises event notification, which in turn would make implementing I/O readiness queues simpler. I'm not sure how the closure of FDs before their readiness events are read should be handled; this could complicate the implementation of readiness queues.

Doing this optimisation does not solve our problem, though. It only pushes the problem away for a while.

Making Better Use of Existing UNIX Interfaces

Note that for my purposes, it is better to optimise the application so that it works well on many OSes rather than optimising a single OS. Creating new interfaces for Linux is a last resort. Also note that this section assumes that an OS of interest does not have an existing (preferably POSIX) mechanism that supports FD management in a scalable way.

Another solution (which would also benefit from the kernel optimisation discussed above) is for the application to divide the FD array into a number of smaller FD arrays, say 10. You then create 10 threads, each of which has a polling loop using its smaller FD array. So each FD array is now 100 entries long. While this doesn't change the total number of FDs that must be scanned, it does change when they have to be scanned. Since most FDs are inactive, not all the threads will be woken up. To see how this works, consider the example where, at any time (say during a single timeslice of 10 ms), only 5 FDs are active. Assuming these FDs are randomly and uniformly distributed, at most 5 threads will need to be woken up. These threads then process the activity and go back to the start of their polling loops. Where we win is that only 5 threads had to go back and call select(2) or poll(2). Since they each have 100-entry FD arrays, the kernel only has to scan 500 FDs. This has halved the amount of scanning required: the scanning load has gone from 30% to 15% with this simple change. If you were instead to use 100 threads, you would still have at most 5 threads woken up for activity, and hence the total number of FDs scanned this timeslice would be 50. This takes the scanning load down to 1.5%, which is negligible.
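
Here is a sketch of the partitioning technique using POSIX threads; the thread count and slice size are the illustrative figures from above, and the per-FD processing is stubbed out:

    #include <poll.h>
    #include <pthread.h>

    #define NTHREADS 10
    #define FDS_PER_THREAD 100

    static struct pollfd all_fds[NTHREADS][FDS_PER_THREAD];

    /* Each thread polls only its own 100-entry slice, so a wakeup costs
       a scan of 100 FDs rather than 1000.  */
    static void *polling_thread(void *arg)
    {
        struct pollfd *my_fds = arg;
        int i;

        for (;;) {
            if (poll(my_fds, FDS_PER_THREAD, -1) <= 0)
                continue;
            for (i = 0; i < FDS_PER_THREAD; i++)
                if (my_fds[i].revents)
                    ;  /* process activity on my_fds[i].fd */
        }
        return NULL;
    }

    void start_threads(void)
    {
        pthread_t tid;
        int t;

        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid, NULL, polling_thread, all_fds[t]);
    }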

There is one thing to watch out for here: if you use select(2) in your polling loop, be aware that the effective size of your FD set is determined by the value of your largest FD, not by how many FDs you are interested in. This is because select(2) uses a bitmask for its FD set. Suppose one of your threads wants to poll FDs 991 to 1000: its bitmask is still 1000 bits long, and, what's worse, the kernel still has to do a minimal scan over all 1000 of those bits. The solution is to use poll(2) instead, where you pass only as many FDs as you want to poll, and the kernel scans only those.
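
To see the difference, compare the two calls for a thread interested only in FDs 991 to 1000 (assume the FD sets have already been filled in; error handling is omitted):

    #include <sys/time.h>
    #include <unistd.h>
    #include <poll.h>

    void wait_for_fds_991_to_1000(fd_set *readfds, struct pollfd pfds[10])
    {
        /* select(2): the first argument is highest-FD-plus-one, so the
           kernel must sweep the bitmask all the way from FD 0 to FD 1000. */
        select(1001, readfds, NULL, NULL, NULL);

        /* poll(2): pass only the 10 entries of interest; the kernel
           scans just those.  */
        poll(pfds, 10, -1);
    }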

This solution sounds ideal: just create lots and lots of threads. At the extreme, you create one thread per FD. There is a problem here, however, as each thread consumes system resources. So you need to compromise between the number of threads and the FD scanning load. Also, the more threads you have, the more cache misses you induce, so this is something to avoid as well. Fortunately, in this case most threads will be running nearly the same code at the same time, so cache pollution should not be a significant problem.

A more advanced solution is to have dynamic migration of FDs depending on whether they are mostly active or inactive. In the simplest case, you have only two threads: one polls the mostly active FDs, and the other polls the mostly inactive FDs. The thread for active FDs will be woken up very frequently, but will have only a small number of FDs to scan. The other thread will have to scan a large number of FDs, but will only be woken up occasionally. An activity counter is kept for each FD. When a FD on the mostly inactive list is deemed to be fairly active, it is migrated to the mostly active list. The reverse operation occurs for fairly inactive FDs on the mostly active list.
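
A sketch of the bookkeeping this needs is below; the thresholds and the two migrate_*() helpers are invented for the example:

    #define ACTIVE_THRESHOLD 3   /* illustrative values only */
    #define IDLE_THRESHOLD   0

    struct managed_fd {
        int fd;
        int activity;        /* decaying per-FD activity counter */
        int on_active_list;
    };

    /* Hypothetical helpers that move a FD between the two threads' lists. */
    void migrate_to_active_list(struct managed_fd *m);
    void migrate_to_inactive_list(struct managed_fd *m);

    /* Called for each scanned FD on every pass through a polling loop. */
    void update_fd(struct managed_fd *m, int was_active)
    {
        if (was_active)
            m->activity++;
        else if (m->activity > 0)
            m->activity--;

        if (!m->on_active_list && m->activity >= ACTIVE_THRESHOLD)
            migrate_to_active_list(m);    /* now polled by the busy thread */
        else if (m->on_active_list && m->activity <= IDLE_THRESHOLD)
            migrate_to_inactive_list(m);  /* back to the big, quiet list */
    }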

I favour this solution, since it can be implemented solely in userspace and is portable to other POSIX systems. I have an existing software library which has (amongst other things) support for managing events on FDs. I plan on extending this library to use the above technique. The library is distributed under the LGPL. Watch this space for results.

My approach is to squeeze as much performance as we can out of the existing POSIX/UNIX interface by optimising the kernel and doing clever things in userspace. We can then evaluate how much of a bottleneck polling is under Linux. If polling overheads (for very large numbers of FDs) are kept to within a few percent, I firmly believe there is no need for I/O readiness queues. If polling overheads remain above 10%, then we may consider I/O readiness queues or other extensions to Linux. However, it would be better to add kernel support for scalable AIO rather than implement readiness queues, since AIO is a POSIX standard whereas readiness queues are not.

Minor Additions to UNIX Interfaces

OK, now I'll get back to another point I raised earlier: the time taken for the application to scan a list of FDs for activity. We would like to reduce this time as well, and we can, without much effort at all. Instead of using poll(2) we can implement poll2(2), a new syscall which is very much like poll(2) but returns an array of the active FDs only. I implemented this last year when I first started thinking about optimising FD management under Linux.

So now the application only needs to search a very small array. This new syscall is based on the principle of not duplicating work. The kernel has just scanned all these FDs for us; why not keep a record of that work, rather than doing it again in the application? Note that the current Linux polling implementation is so slow that implementing poll2(2) would make little difference. However, once the optimisations I've proposed are added, it will make a difference (yes, I've done the benchmarks). Oh, and before you take my comments on the Linux polling implementation too far, note that I've benchmarked it against several other OSes, and Linux came out far better in every case. My point is that we can still do a lot better.
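
For concreteness, the calling convention might look something like the sketch below; the signature and the process_fd() helper are invented for illustration and are not necessarily the exact interface of my patch:

    #include <poll.h>

    #define MAX_ACTIVE 64

    void handle_events(struct pollfd *fds, unsigned int nfds)
    {
        struct pollfd results[MAX_ACTIVE];
        int i, nactive;

        /* Hypothetical: like poll(2), but the kernel copies the active
           entries into a compact output array.  */
        nactive = poll2(fds, nfds, results, MAX_ACTIVE, -1);

        /* The application walks nactive entries instead of all nfds. */
        for (i = 0; i < nactive; i++)
            process_fd(results[i].fd, results[i].revents);
    }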

You might ask: why implement poll2(2) if we have this clever userspace solution that avoids many of the problems? Well, the point is that it would speed up the application's processing of the inactive list of FDs, and that is always a good thing. However, it may turn out that the gain from poll2(2) is marginal. You could also argue that since poll2(2) is non-POSIX and non-UNIX, why bother with it at all, and why not simply implement I/O readiness queues? Well, I could say that poll2(2) is only a small departure from UNIX whereas I/O readiness queues are more radical. However, this isn't a very strong argument. I'm waiting to see the results of the userspace solution before considering poll2(2) further. Furthermore, it would be better to make the Linux implementation of AIO scalable, since that is a standard POSIX interface. So poll2(2) is probably dead. It does have one nice feature, though: it would be easier for an application to change to using poll2(2) than to change to using AIO.

Another possibility is to make a slight extension to the existing UNIX asynchronous signal delivery mechanism. Currently, you can request the OS to deliver a signal when a FD becomes ready for reading or writing. Unfortunately, if you do this for multiple FDs, your signal handler doesn't know which FD is ready for activity, because classic UNIX signals do not carry extra information with them. You can't use a separate signal number for each FD, since UNIX has only a few signal numbers. However, POSIX real-time signals can carry a word of data to the signal handler. So we could extend Linux such that, if you request a POSIX RT signal to be delivered when an FD is ready for I/O, the word of data is the FD number itself. Unfortunately, depending on this behaviour would once again be Linux-specific. However, such a system does look like an attractive way of providing the foundations for a scalable AIO implementation.
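
Here is a sketch of how an application might arm a FD under such a scheme. It assumes a Linux-specific fcntl flag, F_SETSIG, which selects the RT signal to deliver and reports the ready FD in the siginfo_t si_fd field; treat the details as illustrative rather than portable:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    #define IO_SIGNAL (SIGRTMIN + 1)   /* any spare POSIX RT signal */

    static void io_handler(int sig, siginfo_t *info, void *context)
    {
        int fd = info->si_fd;   /* the word of data: which FD is ready */
        /* ... queue fd for later processing; keep the handler short ... */
    }

    void arm_fd(int fd)
    {
        struct sigaction sa;

        sa.sa_sigaction = io_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(IO_SIGNAL, &sa, NULL);

        fcntl(fd, F_SETOWN, getpid());    /* deliver the signal to us */
        fcntl(fd, F_SETSIG, IO_SIGNAL);   /* Linux-specific: RT signal
                                             carrying the FD number */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
    }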

Mixing and Matching

A good implementation of POSIX.4 AIO should be superior to my migrating-FD scheme, since AIO should require no polling whatsoever. Therefore my interface code should be able to make use of AIO if it is available. However, since an AIO implementation may in fact not scale well, its performance will have to be compared with that of the migrating-FD scheme to determine whether or not it should be utilised.

Similarly, for a system without POSIX.4 AIO but with readiness queues, it would make sense for the interface code to utilise that facility.

The Thundering Herd Problem

A note on the "thundering herd" problem: it's not really of much interest in this discussion. The "problem" arises because people attempt to increase the concurrency of accepting new connections by having multiple threads all blocked in select(2) on the same listening socket. Because the kernel wakes up all the threads, rather than one thread per new connection, we have a "thundering herd" of freshly woken threads. This is best solved by treating the accepting of incoming connections as just another case of I/O management, rather than special-casing it.
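
In practice that just means putting the listening socket into the same poll set as the established connections; a minimal sketch:

    #include <poll.h>
    #include <sys/socket.h>

    /* Treat accepting connections as ordinary I/O: the listening socket
       is just another entry in the poll set.  */
    void handle_activity(struct pollfd *fds, int nfds, int listen_fd)
    {
        int i;

        for (i = 0; i < nfds; i++) {
            if (!fds[i].revents)
                continue;
            if (fds[i].fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                /* ... add conn to the FD set ... */
            } else {
                /* ... ordinary read/write processing ... */
            }
        }
    }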

Other Resources

Readers may find a recent paper presented at USENIX98 of interest. The author also has a list of other research on this topic. The paper essentially describes a mechanism for improving the speed of select(2) by adding internal state information to the kernel about FD activity.

Comments Please

I invite comments or additions to this document. My purpose is to explain the various issues involved and serve as a primer for more debate. I hope to avoid recurring debates on the linux-kernel list which go over the same ground and either never contribute anything new to the arguments, or take many messages before something new is added. If you have a strong technical argument that I've missed, I'll be happy to add it to this document.

Dan Kegel has written an interesting page on scaling WWW and ftp servers. Check it out.


Original: 22-JUN-1998
Richard Gooch (rgooch@atnf.csiro.au)