MULTICS DESIGN DOCUMENT                                          MDD-019

To:       MDD Distribution
From:     Robert S. Coren
Date:     July 15, 1986
Subject:  Traffic Control

Abstract:

This document describes the policies and algorithms of Multics Traffic Control, which is that part of the supervisor that manages the allocation of processors among processes.

Revisions:

REVISION     DATE        AUTHOR
initial      86-07-15    Robert S. Coren

_________________________________________________________________

Multics Design Documents are the official design descriptions of the Multics Trusted Computing Base. They are internal documents, which may be released outside of Multics System Development only with the approval of the Director.


CONTENTS

Section 1   Overview of Multics Traffic Control
   1.1   Definitions
      1.1.1   Scheduling States
      1.1.2   Other Definitions
   1.2   Data Bases
   1.3   Transitions Between Scheduling States
   1.4   General Principles of Scheduling Policy
      1.4.1   Interactive Priority
      1.4.2   Administrative Controls
         1.4.2.1   Percent Scheduling
         1.4.2.2   Deadline Scheduling
         1.4.2.3   Real-time Deadlines
         1.4.2.4   Specifying Response Times and Eligibility Quanta
         1.4.2.5   Eligibility Limits
   1.5   Implementation of Scheduling Algorithms
      1.5.1   Ready Queues
         1.5.1.1   Ready Queues in Percent Mode
         1.5.1.2   Ready Queue in Deadline Mode
         1.5.1.3   Real-time Ready Queue
      1.5.2   Choosing a Process for Eligibility
         1.5.2.1   Work Class Credits
         1.5.2.2   Governed Work Classes
         1.5.2.3   Awarding Eligibility
      1.5.3   Choosing a Process to Run
      1.5.4   Preemption
   1.6   Wakeups
      1.6.1   Interprocess Signals
   1.7   Timers
      1.7.1   Alarm Timers
      1.7.2   CPU Timers
      1.7.3   Timer Register

Section 2   Security Considerations
   2.1   Process Scheduling Parameters
   2.2   Sending Wakeups


SECTION 1

OVERVIEW OF MULTICS TRAFFIC CONTROL


Traffic Control is the subsystem of the Multics supervisor that allocates processors among processes. Its job is to implement the scheduling policies of the system, which involves selecting which process(es) to run at any given time, and determining how long each process can run without having to yield to some other process.

Broadly speaking, the responsibilities of traffic control can be broken down into the following categories:

o  effecting the transition of each process on the system between the various possible scheduling states (see below);

o  passing wakeup signals (with their associated event messages) and interrupt signals from one process to another;

o  controlling the priorities of the various processes by sorting them according to the appropriate parameters;

o  determining when to find a new process to run;

o  finding a process to run whenever a processor becomes available.

The traffic control subsystem is implemented almost entirely by one large ALM program named pxss.
Various entries into pxss are used by other ring-0 procedures to change the state of the current process or to send a wakeup signal to some other process. Most of pxss runs in a wired environment with interrupts masked, in order to protect critical data bases (most notably the wired segment tc_data).

Because traffic control runs entirely in ring 0, and its functions are largely inaccessible to outer-ring programs, there are not many security considerations in traffic control. Those that do exist are discussed in Section 2.

The following modules are covered by this document: pxss and tc.


1.1 DEFINITIONS

1.1.1 Scheduling States

At any given instant, a process is in exactly one of the states defined below.

o  running: A running process is one that is currently executing on some processor.

o  waiting: A waiting process is one that cannot run until the occurrence of some event that is expected to happen in a reasonably short time, such as the arrival of a page in memory or the release of a ring-0 lock.

o  blocked: A blocked process is one that cannot (or will not) run until the occurrence of an event that may take an arbitrarily long time, such as a signal from some other process, or the entry of input from a terminal.

o  ready: A ready process is one that is not currently running, but could run at any time that traffic control is willing to grant it a processor.

o  stopped: A stopped process is one that is awaiting destruction, and cannot be allowed to run.

o  ptlocking: Traffic control recognizes a special state called "ptlocking" (page-table-locking), which is equivalent to waiting when the event is the release of the page table lock. This state exists in order to optimize the interactions between traffic control and page control when there is contention on the page table lock. This document contains few references to this state; for the purposes of description, it is essentially the same as waiting.

A description of a process's normal flow through the various scheduling states is given in Section 1.3, below.

1.1.2 Other Definitions

o  eligible: An eligible process is one that is entitled to run. Eligibility is the mechanism that is used to implement the scheduling algorithm; a ready process does not run until it has been made eligible. Once a process becomes eligible, it remains so until it either goes blocked or exhausts its eligibility quantum (see below).

o  loaded: A process is loaded if certain critical pages (specifically, the first page of each of the descriptor segment and the process data segment (PDS)) are in memory and wired. A process cannot run unless it is loaded. When an eligible process is being considered as a candidate to run, if it is not loaded, the I/O to get the required pages is started, and the process is put in the waiting state.

o  quantum: A process's quantum (also called "eligibility quantum" or "time slice") is the amount of virtual CPU time the process will be allowed to use before it loses eligibility.

o  interaction: An interaction is an event that indicates that a human user is interacting with a process; in other words, input from the user's terminal. In practice, any wakeup originating in ring 0 when the target process is blocked is treated as an interaction.
   Traffic control's scheduling algorithm tends to favor processes that have frequent interactions (see 1.4, "General Principles of Scheduling Policy", below).

o  work class: A work class is a group of processes that have the same set of scheduling parameters. Work classes are initially defined in the Master Group Table (MGT), which is maintained by a System Administrator. Work class 0 is reserved for the Initializer process; up to 16 work classes may be defined for other users.

o  notify: When an event occurs that one or more processes are waiting for, a "notify" signal is sent to each process in the waiting state for that particular event. The effect of a notify is to put the target process in the ready state. (Once the process gets to run, the program that entered the waiting state is expected to check to make sure that the event it is interested in has actually occurred.)

o  wakeup: A wakeup is a signal sent to a process that has, or may have, entered the blocked state. A wakeup may originate in ring 0 (as a result of an event such as terminal input), or it may be generated by a program running in some process in an outer ring.

o  idle process: During traffic control initialization, one "idle" process is created for each processor. The purpose of an idle process is to occupy a processor, and handle interrupts directed at that processor, if no other process can use it.

1.2 DATA BASES

The principal data bases used by traffic control are contained in the wired segment tc_data, which consists of the following sections:

o  header
   The tc_data header contains meters maintained by traffic control, and a variety of system-wide traffic control parameters.

o  work class table
   The work class table is a fixed-size array of work class table entries (WCTEs), one for each possible work class. Each WCTE contains the scheduling parameters for the corresponding work class, and per-work-class meters.

o  active process table
   Each process in existence at any given time has an entry in the active process table, referred to as an active process table entry (APTE). The active process table is an array of APTEs; the size of this array is determined at system initialization. The number of APTEs (and hence the maximum number of active processes) is taken from the "tcd" card in the configuration deck (if no such card is supplied, a system default value of 40 is used). When a process is created, an empty APTE is found and assigned to it; when it is destroyed, its APTE is set back to empty. A process's APTE contains its process ID, its current scheduling state, various flags and parameters used in scheduling the process, and a forward and backward thread used to place the process in one of several queues, as explained in Section 1.5, "Implementation of Scheduling Algorithms," below. The high-order 18 bits of a process ID contain the offset within tc_data of the process's APTE.

o  interprocess transmission table
   The interprocess transmission table (ITT) is a single system-wide table in which the sending process stores the information associated with each wakeup so that it can be found by the receiving process. It is described in more detail in MDD-018, "Interprocess Communication".
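The relationship between a process ID and its APTE can be pictured with the short C sketch below. It is only a caricature: the field names, sizes, and C types are invented for illustration (Multics words are 36 bits wide, and the real declarations are PL/I include files), but it shows the idea of an active process table whose entries are located by the high-order 18 bits of the process ID.

    #include <stdint.h>

    /* Possible scheduling states, as defined in Section 1.1.1. */
    enum sched_state { RUNNING, WAITING, BLOCKED, READY, STOPPED, PTLOCKING };

    /* A much-simplified stand-in for an APTE; field names are hypothetical. */
    struct apte {
        uint64_t         process_id;         /* high-order 18 bits = APTE offset */
        enum sched_state state;              /* current scheduling state */
        unsigned         stop_pending : 1;   /* Initializer has asked for a stop */
        unsigned         wakeup_pending : 1; /* a wakeup has been sent */
        uint32_t         fw_thread;          /* forward thread of current queue */
        uint32_t         bw_thread;          /* backward thread of current queue */
    };

    /* tc_data modeled as a flat array of words; the APTE of a process is found
       by using the high-order 18 bits of its 36-bit process ID as an offset
       into the segment (treated here as a simple array index). */
    static uint64_t tc_data[262144];

    struct apte *apte_of(uint64_t process_id)
    {
        uint64_t offset = (process_id >> 18) & 0777777;  /* top 18 of 36 bits */
        return (struct apte *)&tc_data[offset];
    }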
1.3 TRANSITIONS BETWEEN SCHEDULING STATES

In this section we will follow a process through a sample sequence of scheduling states.

1.  When a process is created, it is in the blocked state.

2.  A wakeup is sent to the process. It is placed in the ready state, and threaded into the appropriate "ready queue" (see "Scheduling Algorithms", below).

3.  At some later time, the process is chosen for eligibility. It is removed from the ready queue and placed in the "eligible queue", remaining in the ready state.

4.  The process reaches the head of the eligible queue, and is selected to run. It is placed in the running state, and starts to run on some processor.

5.  The process takes a page fault, and cannot proceed until the needed page is read in. It enters the waiting state, without losing eligibility.

6.  The page I/O completes, and the process receives a notify. It is placed in the ready state. If any processor is idle, the process is placed in the running state, and starts to run on that processor. Otherwise, it remains ready and eligible, and will be selected to run at some later time.

7a. The process exhausts its eligibility quantum. It gives up its processor and its eligibility is revoked. It is placed in the ready state, and threaded into a ready queue as in 2, above.

7b. The process goes blocked. It gives up its processor and its eligibility is revoked. It is placed in the blocked state, where it remains until it receives a wakeup, at which point the cycle resumes at 2, above.

8.  The Initializer sends a "stop" wakeup to the process, in preparation for destroying it. The "stop_pending" flag in the process's APTE is turned on. If the process is blocked, it is put in the ready state and threaded into a ready queue. If it is running, a connect is sent to interrupt the processor on which it is running.

9.  When a process with apte.stop_pending on is selected to run, the ring alarm register is set so that, as soon as the process leaves ring 0, it will call back in to pxss$force_stop, which revokes its eligibility, removes it from any ready queue, and places it in the stopped state, where it remains until it is destroyed. Similarly, when a running process receives a connect interrupt with apte.stop_pending on, wired_fim calls pxss$force_stop.
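As a compact summary of the walkthrough above, the C sketch below encodes the basic state transitions. It is a reading aid, not a rendering of pxss: eligibility and queue membership are deliberately left out, the event names are invented, and the stop sequence of steps 8 and 9 is collapsed into a single event.

    #include <stdio.h>

    enum state { BLOCKED, READY, RUNNING, WAITING, STOPPED };
    enum event { EV_WAKEUP, EV_RUN, EV_PAGE_FAULT, EV_NOTIFY,
                 EV_QUANTUM_EXHAUSTED, EV_BLOCK, EV_STOP };

    /* Transition function corresponding to steps 1-9 above (names hypothetical). */
    static enum state next_state(enum state s, enum event e)
    {
        switch (e) {
        case EV_WAKEUP:            return (s == BLOCKED) ? READY   : s;  /* step 2  */
        case EV_RUN:               return (s == READY)   ? RUNNING : s;  /* step 4  */
        case EV_PAGE_FAULT:        return (s == RUNNING) ? WAITING : s;  /* step 5  */
        case EV_NOTIFY:            return (s == WAITING) ? READY   : s;  /* step 6  */
        case EV_QUANTUM_EXHAUSTED: return (s == RUNNING) ? READY   : s;  /* step 7a */
        case EV_BLOCK:             return (s == RUNNING) ? BLOCKED : s;  /* step 7b */
        case EV_STOP:              return STOPPED;                       /* steps 8-9 */
        }
        return s;
    }

    int main(void)
    {
        enum state s = BLOCKED;                  /* step 1: newly created */
        enum event script[] = { EV_WAKEUP, EV_RUN, EV_PAGE_FAULT,
                                EV_NOTIFY, EV_RUN, EV_QUANTUM_EXHAUSTED };
        for (unsigned i = 0; i < sizeof script / sizeof script[0]; i++)
            s = next_state(s, script[i]);
        printf("final state: %d\n", (int)s);     /* READY: back in a ready queue */
        return 0;
    }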
1.4 GENERAL PRINCIPLES OF SCHEDULING POLICY

This section outlines the principles followed by traffic control in assigning priorities to processes.

1.4.1 Interactive Priority

Multics is intended primarily for interactive use, so it makes sense that interactive users should get as good response as possible. Traffic control ensures that, as a general rule, the more recently a process has interacted (i.e., received terminal input for which it had been blocked), the more likely it is to run soon. Another way to express this idea is that the more frequently a process interacts, the more often it will be awarded a processor. There are some administrative controls that can be used to determine what constitutes frequent interaction, as explained below.

1.4.2 Administrative Controls

A system administrator can divide the projects registered on the system into work classes, where each work class has a different set of scheduling parameters. (Strictly speaking, each work class contains one or more load control groups, and each load control group usually contains one or more projects.) The work classes are defined in the Master Group Table (MGT), where the prevailing scheduling mode (see below) is also defined. The parameters of a particular work class can be changed dynamically by means of the tune_work_class command (such a change is temporary; the values in the MGT are restored at the next shift change), and an individual process can be moved from one work class to another by means of the set_work_class command. These two commands require access to the hphcs_ gate.

Essentially, the system can be run in either of two scheduling modes: "percent" mode, in which the work classes are distinguished by relative amounts of throughput, or "deadline" mode, in which the work classes are distinguished by relative response times. In either mode, it is possible to specify that certain work classes are guaranteed a maximum real-time response.

1.4.2.1 PERCENT SCHEDULING

In percent mode, each work class is intended to receive a specified percentage of the available CPU time. The percentages specified for the various work classes must add up to 100%. If in fact the demand for CPU time is less than the total available, the remainder is allocated to those work classes that contain processes that have work to do, in proportion to their assigned percentages. A work class may be designated as "governed", in which case it is not permitted to receive more than its assigned percentage, even if the alternative is to allow a processor to go idle.

1.4.2.2 DEADLINE SCHEDULING

In deadline mode, each work class is assigned an expected response time after interaction; that is, whenever a process interacts or exhausts its eligibility quantum, a deadline is established by adding the work class's response time to the current time, and the process is expected to run at or before the deadline. These are "soft" deadlines, in that processes are sorted by deadline, but no guarantee is made that a process will actually run before its deadline arrives.

1.4.2.3 REAL-TIME DEADLINES

In either mode, one or more work classes may be given a guaranteed maximum real response time. For processes in such a work class, response deadlines are established as described above for deadline mode; however, when choosing a process to run, traffic control will always choose a ready process whose real-time deadline has arrived over any other process.

In addition, the "realtime_io_priority" scheduling parameter may be turned on in order to guarantee a maximum response time after the completion of certain I/O operations (such as tape reads and writes). If the parameter is on, a real-time deadline is established whenever any process receives an "I/O wakeup" (sent by ioi_).

1.4.2.4 SPECIFYING RESPONSE TIMES AND ELIGIBILITY QUANTA

In setting response deadlines (either in deadline mode or for real-time work classes), the system administrator actually specifies two response times: an interactive response time, which is used to establish the deadline when a process interacts, and a "subsequent" response time, which is used when the process is rescheduled after exhausting its quantum or upon receipt of a non-interactive wakeup.
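To make the use of the two response times and two quanta concrete, here is a small C sketch of how a deadline and a quantum might be chosen when a process is rescheduled. The structure and function names are invented; the real parameters live in the WCTE (or, for percent-mode quanta, in the tc_data header).

    #include <stdint.h>

    typedef uint64_t usec_t;      /* microseconds, as traffic control keeps time */

    struct wc_params {
        usec_t int_response;      /* interactive response time */
        usec_t sub_response;      /* "subsequent" response time */
        usec_t int_quantum;       /* interactive eligibility quantum */
        usec_t sub_quantum;       /* "subsequent" eligibility quantum */
    };

    /* Deadline used to sort the process into the deadline-mode or real-time
       ready queue: interactive response after an interaction, subsequent
       response after quantum runout or a non-interactive wakeup. */
    usec_t new_deadline(const struct wc_params *wc, usec_t now, int interacted)
    {
        return now + (interacted ? wc->int_response : wc->sub_response);
    }

    /* Quantum granted when the process is next made eligible, chosen the
       same way. */
    usec_t new_quantum(const struct wc_params *wc, int interacted)
    {
        return interacted ? wc->int_quantum : wc->sub_quantum;
    }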
If the interactive response time is much smaller than the subsequent response time, the bias in favor of interactive response is increased; on the other hand, if the two times are equal, a highly interactive process will not necessarily run any sooner than a relatively non-interactive process in the same work class.

Similarly, each work class has two eligibility quanta associated with it: an interactive quantum, which is the amount of time it remains eligible when it first runs after interaction, and a "subsequent" quantum, which is used when it is rescheduled as above. In deadline mode and for real-time work classes, these two quanta are specified for each work class; in percent mode, the two quanta are system-wide, and apply to all non-real-time work classes. Small values of the interactive quantum restrict the definition of highly-interactive processes to those that use comparatively little CPU time per interaction, since those that compute for longer than the interactive quantum lose eligibility before they are ready to interact again; conversely, larger values of the interactive quantum relax the definition of frequent interaction. Large values of the subsequent quantum allow a process engaged in a long computation (which may have to wait a while before it gets to run) a chance to complete a substantial portion of that computation once it does get a processor.

The response time and quantum specified for real-time I/O response apply only to the process's first burst of processor usage after an I/O wakeup; at other times, the scheduling parameters in effect for the process's work class are used.

1.4.2.5 ELIGIBILITY LIMITS

There is a set of parameters that can be used to control the number of processes that can be eligible at once. In particular, the system-wide parameter "max_eligible" is the maximum number of processes that can be eligible; its value can be adjusted to affect the paging behavior of the system, depending on the amount of memory available, as explained in the Multics System Maintenance Procedures manual. (In practice, max_eligible can be exceeded if a real-time deadline arrives when max_eligible processes are already eligible.) In addition, in deadline mode each work class may have a maximum number of eligible processes, to prevent any one work class from completely dominating the system. Finally, the system-wide parameter "max_max_eligible" is the maximum value that max_eligible can ever attain during the course of a single bootload; it is used during system initialization as the number of segments to reserve for use as ring-0 stacks.

1.5 IMPLEMENTATION OF SCHEDULING ALGORITHMS

This section contains a brief overview of how the scheduling policies described above are implemented. The key steps are:

o  threading processes into ready queues;

o  choosing a ready process for eligibility;

o  choosing an eligible process to run.

1.5.1 Ready Queues

When a non-eligible process enters the ready state (this includes the case of a running process losing eligibility, in which case it becomes ready and non-eligible simultaneously), it is placed in one of several possible "ready queues". These queues are implemented through forward and backward threads in the APTE. The choice of queue and the process's position in the queue reflect its priority for obtaining eligibility. The use of ready queues is different for each scheduling mode, as described below.

1.5.1.1 READY QUEUES IN PERCENT MODE

In the discussion of percent mode, the names of the following data items are used:

o  tefirst: The system-wide interactive quantum.

o  telast: The system-wide subsequent quantum.
o  ti (apte.ti): "time since interaction": the amount of virtual CPU time a process has used since it last interacted.

o  ts (apte.ts): "time since scheduling": the amount of virtual CPU time a process has used since it was last placed in a ready queue.

o  timax: an arbitrary system-wide maximum value imposed on apte.ti.

This discussion does not cover processes with real-time deadlines, which are discussed separately in Section 1.5.1.3.

In percent mode, the queues used are a per-system "interactive" queue and one ready queue per work class. The interactive queue is threaded from a cell in the tc_data header, and each per-work-class queue is threaded from a cell in its respective WCTE. A process is placed in the interactive queue when it first becomes ready after interacting; at other times, it is placed in its work-class queue. (This mechanism is explained in more detail below.) Processes in the interactive queue will be considered for eligibility before processes in the work class queues, as explained in Section 1.5.2.

There are two fields in the APTE that control a process's position in a ready queue: apte.ti (time since interaction) and apte.ts (time since scheduling). If they are both zero, the process is placed at the end of the interactive queue; otherwise, it is placed in the work class queue, which is sorted by values of ti, smallest value first.

These two values (ti and ts) are set to zero when the process is credited with an interaction. When a process loses eligibility, its ts is incremented by the virtual CPU time used since it became eligible; whenever the resulting value of ts exceeds ti by the value of the interactive quantum (called "tefirst"), ti is incremented by the value of ts, and ts is reset to 0. (The resulting value of ti is used to determine the process's position in a ready queue the next time it is made ready.) The effect of this is that ti takes on successive values of (approximately) ((2**n)-1)*tefirst, where n=1, 2, 3, etc., until either it reaches an arbitrary maximum (the system-wide parameter "timax") or the process interacts again. This subdivides the work class ready queue into a collection of subqueues according to how much time each process has used since its last interaction.

A small refinement to this mechanism arises in a few cases. Normally, when a process is sorted into a ready queue, it is sorted behind other processes that have the same value of ti. Exceptions are a process that is to be placed in the "stopped" state, and a process whose work class has just been changed. Such processes are sorted ahead of other processes with the same value of ti.
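The ti/ts bookkeeping described above for percent mode amounts to only a few lines; the C sketch below restates the rule. The names ti, ts, tefirst, and timax come from the discussion, but the functions themselves are invented for illustration and do not correspond to pxss entry points.

    #include <stdint.h>

    typedef uint64_t usec_t;      /* virtual CPU time in microseconds */

    /* Credited interaction: the process will go to the interactive queue the
       next time it becomes ready. */
    void note_interaction(usec_t *ti, usec_t *ts)
    {
        *ti = 0;
        *ts = 0;
    }

    /* Loss of eligibility, after using vcpu_used since becoming eligible.
       ti climbs through values of roughly ((2**n)-1)*tefirst until it is
       capped at timax or the process interacts again. */
    void note_loss_of_eligibility(usec_t *ti, usec_t *ts, usec_t vcpu_used,
                                  usec_t tefirst, usec_t timax)
    {
        *ts += vcpu_used;
        if (*ts >= *ti + tefirst) {     /* ts now exceeds ti by tefirst */
            *ti += *ts;
            if (*ti > timax)
                *ti = timax;
            *ts = 0;
        }
    }

    /* Queue selection on becoming ready: the interactive queue right after an
       interaction, otherwise the work class queue, which is sorted by ti. */
    int goes_to_interactive_queue(usec_t ti, usec_t ts)
    {
        return ti == 0 && ts == 0;
    }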
1.5.1.2 READY QUEUE IN DEADLINE MODE

In deadline mode, all processes (except for those with real-time deadlines) are sorted into a single queue, ordered by deadline. The queue used for deadline mode is in fact the interactive queue. Thus, when a non-eligible process becomes ready, its deadline is calculated by adding the appropriate response time for its work class (that is, the interactive response time if it has just received an interactive wakeup, and the subsequent response time otherwise) to the current time, and sorting it into the interactive queue after all processes with the same or earlier deadlines, and before all processes with later deadlines.

1.5.1.3 REAL-TIME READY QUEUE

A separate queue (the "real-time queue") is used for processes with real-time deadlines. The deadline is calculated as in deadline mode (above), and the process is similarly sorted into the real-time queue by deadline.

1.5.2 Choosing a Process for Eligibility

The subroutine "find_next_eligible" is called when it is necessary to make a process eligible (because no eligible process is available to run, as explained in Section 1.5.3). It attempts to add exactly one process to the "eligible queue", which is a queue of APTEs threaded from a cell in the tc_data header. (The threads in the APTE are used for whichever queue it is in, so a process cannot be in a ready queue and the eligible queue simultaneously.)

Find_next_eligible first looks at the process at the head of the interactive queue; if the work class of that process has not reached its limit of eligible processes (in deadline mode), or if the system is running in percent mode, this is the process that is selected. Otherwise, the next process in the queue is examined. Since in deadline mode all non-real-time ready processes are placed in the interactive queue, this single test chooses the process with the earliest deadline (in deadline mode) or the process that has been on the interactive queue for the longest time as a result of a recent interaction (in percent mode). Processes with real-time deadlines are handled specially, as explained in Section 1.5.3.

If no process was found in the interactive queue, and percent mode is in effect, find_next_eligible scans the work class ready queues. The algorithm for choosing the process to be made eligible combines the preference for recently-interacting processes with the percentages assigned to the various work classes.

1.5.2.1 WORK CLASS CREDITS

Traffic control uses a system of "credits" to keep track of the processor time used by the various work classes. A "credit bank" is maintained in the tc_data header that represents the number of microseconds of CPU time available to the various work classes. (The value of the credit bank at system initialization is 0; its contents increase as processes start to use CPU time, as described below.) Before scanning the work class queues, find_next_eligible distributes all the credits currently in the bank to the various work classes in proportion to their assigned percentages, adding them to the cell wcte.credits. Whenever a process gives up a processor, the number of credits corresponding to the amount of virtual CPU time it has used is subtracted from its work class's store of credits and returned to the bank. The number of credits assigned to any one work class is constrained to lie between 0 and 4*telast.
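The credit mechanism just described can be sketched as follows. The variables stand in for the tc_data header and WCTE cells; the function names, the fixed number of work classes, and the point at which the 0 to 4*telast constraint is applied are illustrative assumptions rather than a description of pxss.

    #include <stdint.h>

    #define N_WC 16

    typedef int64_t usec_t;             /* credits are microseconds of CPU time */

    static usec_t credit_bank;          /* system-wide bank in the tc_data header */
    static usec_t wc_credits[N_WC];     /* wcte.credits for each work class */
    static int    wc_percent[N_WC];     /* assigned percentages, summing to 100 */

    /* Before scanning the work class queues, find_next_eligible hands the bank
       out to the work classes in proportion to their percentages; each work
       class's balance is kept in the range 0 .. 4*telast (the clamp is shown
       here at distribution time for simplicity). */
    void distribute_credits(usec_t telast)
    {
        for (int w = 0; w < N_WC; w++) {
            wc_credits[w] += credit_bank * wc_percent[w] / 100;
            if (wc_credits[w] < 0)           wc_credits[w] = 0;
            if (wc_credits[w] > 4 * telast)  wc_credits[w] = 4 * telast;
        }
        credit_bank = 0;
    }

    /* When a process gives up a processor, the CPU time it used is taken from
       its work class and returned to the bank. */
    void return_credits(int wc, usec_t vcpu_used)
    {
        wc_credits[wc] -= vcpu_used;
        credit_bank    += vcpu_used;
    }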
After distributing credits, find_next_eligible looks at the first process in each work class's ready queue in order to find the one for which the value of wcte.credits - apte.ti is a maximum (actually it uses wcte.credits - min (apte.ti, 2*telast)); this process is chosen to be made eligible. (It is not necessary to look at other processes in the work class's queue, since the ready queues are sorted by ti.) This algorithm favors work classes that are furthest below their assigned percentages (since they have large accumulations of credits) and processes that have interacted recently (since they have small values of ti). For example, a work class whose processes interact with moderate frequency, but tend to use more than tefirst per interaction, tends to dominate at first because of its processes' low values of ti; however, such a work class gradually uses up its credits, and other work classes start to take precedence over it.

1.5.2.2 GOVERNED WORK CLASSES

A similar credit mechanism is used to limit the CPU time used by governed work classes. When scanning the work class queues, find_next_eligible passes out "governing" credits to governed work classes in proportion to their assigned maximum percentage; these credits are returned to the governing-credit bank when a process in a governed work class gives up a processor. If a work class's governing credits are found to be zero or negative during the scan of the work class queues, that work class has reached or exceeded its limit, and is not considered for eligibility.

1.5.2.3 AWARDING ELIGIBILITY

If no process that can be made eligible is found in the interactive queue or any work class queue, the process at the head of the real-time queue (if any) is selected, even if its deadline has not arrived.

Assuming that some process has been selected for eligibility, find_next_eligible checks to see if either max_eligible or max_max_eligible has been reached. If neither limit has been reached, the process is placed at the head of the eligible queue if it has a real-time deadline, or the tail of the eligible queue otherwise.

1.5.3 Choosing a Process to Run

The switching of processes takes place in the subroutine "getwork" in pxss, which is called by a process that has just changed its state in preparation for giving up its processor.

The first thing getwork does is examine the real-time queue; if the deadline for the process at the head of the queue is not later than the current time, that process is immediately placed at the head of the eligible queue.

If no such process is found in the real-time queue, getwork starts scanning the eligible queue from the beginning, performing the following steps:

o  If the process is not in the ready state (a process in the eligible queue could also be running, waiting, or ptlocking), go to the next process in the queue.

o  If the process is not loaded, start loading it and restart the scan of the queue (in case the I/O to load the process completes before another runnable process is found).

o  If getwork on another processor is in the midst of starting to run this process, go to the next process in the queue.

o  If the process's required set of CPUs does not include the current processor, go to the next process in the queue.
o  If the process passes all of the above tests, load the DBR for the new process so it will start to run on the current processor.

After updating the new and old processes' APTEs in various ways, getwork returns, in the new process, to the point from which it called getwork the last time it gave up a processor.

If getwork searches the entire eligible queue without finding a process that can be run, it calls "find_next_eligible" to try to make an additional process eligible, as described above. If find_next_eligible succeeds, control returns to getwork at the point at which it had just found a ready process in the eligible queue. If getwork finds no runnable eligible process, and find_next_eligible finds no process to make eligible, getwork assigns the processor to its associated idle process.

1.5.4 Preemption

Under certain circumstances, a running process may be "preempted" by another, higher-priority process -- i.e., even though the process has not exhausted its quantum, it is forced to give up its processor. What this actually entails is changing the process's state to ready and calling getwork in order to find a higher-priority eligible process, of which there may not be any, in which case the original process continues to run. Preemption is initiated in one of two ways:

o  The "get_processor" subroutine is called to find a processor to give to a higher-priority process. This subroutine checks to see if any idle process is running; if not, it selects the running process that is lowest on the eligible queue. In either case, the processor on which the selected process is running is sent a "preempt" connect, as a result of which the running process calls getwork, as described above. The get_processor subroutine is called under the following circumstances:

   -  notify_timeout: If a process remains in the waiting state without being notified for longer than the system's "notify_timeout" time, it is notified of a "notify_timeout" event and made ready; get_processor is called so that the target process may run if it is higher in the eligible queue than some currently running process.

   -  notify: If the tuning parameter "gp_at_notify" is on, any time a waiting process is notified, get_processor is called so that the target process may preempt a running process. (If gp_at_notify is off, only an idle process can be preempted as a result of a notify event.)

   -  wakeup: When a blocked process is made ready as the result of a wakeup, get_processor is called. In practice, the process receiving the wakeup only gets to run as a result of the preemption if it is in a real-time work class with a very short interactive response time; otherwise, it will not be eligible at the time of the call to getwork, and the preempted process will be selected before the recipient of the wakeup.

o  At intervals controlled by the tuning parameter "pre_empt_sample_time", a timer interrupt is generated (see the discussion of timers in 1.7, below). The process running on the processor receiving the interrupt is preempted in the same fashion as if the processor had received a preempt connect, as described above.
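The way get_processor chooses a victim can be sketched in C as follows; the data structures and the send_preempt_connect stub are invented stand-ins for the eligible queue threads and the real connect mechanism.

    #include <stddef.h>
    #include <stdio.h>

    struct proc {
        int          running;        /* currently on some processor */
        int          cpu;            /* which processor, if running */
        struct proc *next_eligible;  /* next lower process in the eligible queue */
    };

    /* Stand-in for sending a "preempt" connect to a processor. */
    static void send_preempt_connect(int cpu)
    {
        printf("preempt connect -> CPU %d\n", cpu);
    }

    /* Prefer a processor occupied by an idle process; otherwise preempt the
       running process that is lowest in the eligible queue. */
    void get_processor(struct proc *eligible_head, struct proc *idle[], int ncpus)
    {
        for (int c = 0; c < ncpus; c++)
            if (idle[c]->running) {              /* CPU c has nothing better to do */
                send_preempt_connect(c);
                return;
            }

        struct proc *victim = NULL;
        for (struct proc *p = eligible_head; p != NULL; p = p->next_eligible)
            if (p->running)
                victim = p;                      /* last one seen is lowest in queue */

        if (victim != NULL)
            send_preempt_connect(victim->cpu);
    }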
1.6 WAKEUPS

A wakeup is a signal sent to a process to take it out of the blocked state, optionally accompanied by 72 bits' worth of arbitrary data. A wakeup may be generated by an event in ring 0 (such as an I/O completion) or as a result of a call to hcs_$wakeup from an outer ring (the ultimate target of such a call is pxss$wakeup). The information contained in a wakeup signal is communicated to the process via an event channel, as described in more detail in MDD-018, "Interprocess Communication".

When a wakeup is sent to a process, the fact is reflected by setting a flag in its APTE. If the event channel to which the wakeup is directed is a "special" or "fast" event channel, the fact that such a wakeup is pending is also recorded in the APTE; otherwise, an entry is added to the Interprocess Transmission Table (ITT) containing the target process ID, event channel, and event message. If the target process is not blocked, nothing further is done; the information associated with the wakeup will be found by ring-0 IPC the next time the process calls block, so that it will not actually enter the blocked state (see MDD-018 for details). If the process is blocked, its state is changed to ready, it is placed in the appropriate ready queue, and a preempt connect is generated, as described above.

If the wakeup originated in ring 0 (which is reflected as a call to pxss$ring_0_wakeup, rather than pxss$wakeup) it is considered to be an interaction; before sorting into a ready queue, ts and ti are set to 0 (in percent mode) or the interactive response time is used to calculate the deadline (in deadline mode and for real-time work classes). If the wakeup originated in an outer ring, no interactive credit is given (an exception is made if the process has been blocked for longer than the tuning parameter "priority_scheduling_increment"), and a check is made to see if the wakeup violates mandatory access control rules, as described in Section 2.

1.6.1 Interprocess Signals

An interprocess signal (IPS) is a wakeup-like event that is intended for immediate delivery to the process whether it is blocked or not. (The most common use of IPS is the "quit" signal generated when a user hits the "break" key of the terminal.) Such a signal is recorded in the APTE, and checked by getwork whenever it starts running a new process; if an IPS is pending, getwork sets the ring alarm register so that the process will return to ring 0 to process the signal.

An IPS may be either interactive or non-interactive, depending on whether the sender calls pxss$ips_wakeup_int or pxss$ips_wakeup. The difference is that ips_wakeup_int updates the process's priority as for an interactive wakeup before delivering the signal. In practice, the ips_wakeup_int entry is only called in response to a "break" from a user terminal.

If the target process is running, a connect is sent to its processor to ensure that it runs through getwork and discovers that the signal is pending. If the process is eligible but not running, nothing further is done, since the signal will be detected by getwork the next time the process is selected to run. If the process is neither eligible nor running, its state is set to ready, it is sorted into a ready queue as above (even if it was already ready, in order to reflect a possible change in priority), and a preemption is initiated.
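Pulling the wakeup delivery rules of Section 1.6 together, the following C sketch shows the order of the checks. All of the types and helper functions are hypothetical stand-ins for the APTE flags, the ITT, and the ready-queue and preemption machinery; it is a summary of the text above, not of pxss$wakeup itself.

    #include <stdint.h>

    enum { ST_BLOCKED = 1, ST_READY = 2 };

    struct apte_like {
        int      state;                     /* ST_BLOCKED, ST_READY, ... */
        unsigned wakeup_pending : 1;
        unsigned fast_channel_wakeup : 1;
    };

    /* Hypothetical stand-ins for the mechanisms described in the text. */
    static void add_itt_entry(uint64_t pid, uint64_t channel, const uint64_t msg[2])
        { (void)pid; (void)channel; (void)msg; }
    static void thread_into_ready_queue(struct apte_like *t, int interactive)
        { (void)t; (void)interactive; }
    static void initiate_preemption(void) { }

    /* Order of operations when a wakeup is delivered; ring0_origin corresponds
       to a call on pxss$ring_0_wakeup and earns the target interactive credit. */
    void deliver_wakeup(struct apte_like *t, uint64_t pid, uint64_t channel,
                        const uint64_t msg[2], int fast_channel, int ring0_origin)
    {
        t->wakeup_pending = 1;
        if (fast_channel)
            t->fast_channel_wakeup = 1;           /* recorded directly in the APTE */
        else
            add_itt_entry(pid, channel, msg);     /* found later by ring-0 IPC */

        if (t->state != ST_BLOCKED)
            return;                               /* block() will notice the wakeup */

        t->state = ST_READY;                      /* blocked -> ready */
        thread_into_ready_queue(t, ring0_origin); /* interaction credit only if ring 0 */
        initiate_preemption();                    /* a running process may yield */
    }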
1.7 TIMERS

Software timers on Multics are implemented on the assumption that process-switching is a frequent event, so timers are only checked for maturity in getwork. There are two basic kinds of timers: alarm timers, which are intended to notify a process at a given real time, and CPU timers, which are intended to notify a process after it has used a specified amount of virtual CPU time.

1.7.1 Alarm Timers

Alarm timers are maintained via a list, sorted by maturity time, threaded from a cell in the tc_data header. When a process sets an alarm timer, an entry is added to the alarm_timer list, which records the time of maturity, the process ID, and the event channel to use to wake up the process when the timer matures.

Whenever getwork is entered, it first checks this list to see if any alarm timers are mature. If so, each such timer entry is removed from the list; if an event channel is associated with the timer, a wakeup is sent to the target process, as described in Section 1.6; otherwise, an IPS is sent, as described in Section 1.6.1.

1.7.2 CPU Timers

The amount of virtual CPU time used by a process is accumulated throughout its life in its APTE (in apte.virtual_cpu_time). When a process sets a CPU timer, the value for total CPU time used that corresponds to the maturity of the timer is stored in its PDS (in pds$timer_time_out); an event channel may optionally be associated with the timer.

When the process is selected to run, getwork checks to see if apte.virtual_cpu_time has reached the value in pds$timer_time_out, and if so, arranges to notify the process. If there is an event channel for the timer, an entry is added to the ITT, and the process's APTE is marked to indicate a pending wakeup; no change needs to be made to the process's state, since it is already running, and the wakeup will be found by IPC as described in Section 1.6, above. If there is no event channel, the timer is treated as an IPS: the appropriate IPS bit is set in the APTE, and the ring alarm register is set.

1.7.3 Timer Register

The timer register is a hardware register (one per CPU) that counts down as time elapses, and generates a timer interrupt when it reaches 0. Whenever a process is selected by getwork, the timer register is set to an appropriate value. For non-idle processes, this is normally the remainder of the process's eligibility quantum; however, it is never set to less than 4 milliseconds, nor to more than the preempt sample time. For an idle process, the timer register is set to the amount of time until the next alarm timer or 250 milliseconds, whichever is smaller.

When a timer interrupt occurs, pxss is called at the entry that handles preemptions. If the process running at the time has exhausted its quantum, its eligibility is revoked (i.e., it is removed from the eligible queue), its state is set to ready, and it is rescheduled (i.e., its ts and ti are updated or its new deadline is calculated) and sorted into the appropriate ready queue. Otherwise, it is simply preempted, that is, it remains in the eligible queue but its state is changed to ready. In either case, getwork is called to select another process.
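The choice of timer register value described above reduces to a pair of clamps; the sketch below restates it in C, with the 4 millisecond and 250 millisecond constants taken from the text and everything else (names, types) invented for illustration.

    #include <stdint.h>

    typedef uint64_t usec_t;

    #define MIN_SLICE        4000u     /* never less than 4 milliseconds */
    #define IDLE_MAX_SLICE 250000u     /* idle process: at most 250 milliseconds */

    static usec_t min_usec(usec_t a, usec_t b) { return a < b ? a : b; }
    static usec_t max_usec(usec_t a, usec_t b) { return a > b ? a : b; }

    /* Value loaded into the per-CPU timer register when getwork dispatches a
       process: the remaining quantum, clamped between 4 ms and the preempt
       sample time; for an idle process, the time until the next alarm timer,
       capped at 250 ms. */
    usec_t timer_register_value(int is_idle, usec_t quantum_remaining,
                                usec_t preempt_sample_time, usec_t next_alarm_delta)
    {
        if (is_idle)
            return min_usec(next_alarm_delta, IDLE_MAX_SLICE);
        return max_usec(MIN_SLICE,
                        min_usec(quantum_remaining, preempt_sample_time));
    }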
SECTION 2

SECURITY CONSIDERATIONS


Traffic control is implemented entirely in ring 0, and most of its functions are inaccessible from outer rings; accordingly, it has scarcely any security implications or concerns. The two areas where security comes into play are the setting of process scheduling parameters and the delivery of wakeups.

2.1 PROCESS SCHEDULING PARAMETERS

If a user were capable of changing at will the parameters of his process's work class and/or the work class to which his process is assigned, he could arbitrarily improve his own response and throughput at the expense of all other users. Although there are in fact entries to perform both of the above functions (tc$tune_work_class and pxss$set_work_class, respectively), they are accessible only through the hphcs_ gate, so the ability to affect scheduling parameters is restricted by the access control list of that gate, which is normally very tightly controlled.

2.2 SENDING WAKEUPS

Since wakeups are a mechanism for sending information from one process to another, they cannot be allowed to transfer information across authorization boundaries in violation of mandatory access controls -- i.e., a process must be prevented from "sending down" a wakeup to a process at a lower authorization. Accordingly, if a wakeup is sent whose origin is outside ring 0 (as determined by the fact that pxss$wakeup was called, rather than pxss$ring_0_wakeup), pxss ensures that the sender's security level is not higher than the target process's level, and that the sender's category set equals or is contained in the target process's category set. (This check is waived if either the sender or the receiver has the "ipc" privilege turned on.) If the above conditions are not met, no message is added to the ITT, no notification is sent to the target process, and an error code is returned to the caller of pxss$wakeup.
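The check described above can be sketched as a simple level and category comparison. The representation of an authorization as a level number plus a category bit mask, and the function and field names, are assumptions made for illustration only.

    #include <stdint.h>

    struct authorization {
        unsigned level;        /* sensitivity level */
        uint64_t categories;   /* one bit per category */
        int      ipc_priv;     /* "ipc" privilege flag */
    };

    /* Sketch of the check applied to wakeups that originate outside ring 0
       (pxss$wakeup): the sender's level must not exceed the target's, and the
       sender's category set must be contained in the target's, unless either
       party holds the ipc privilege.  Returns nonzero if the wakeup may be
       delivered. */
    int wakeup_allowed(const struct authorization *sender,
                       const struct authorization *target)
    {
        if (sender->ipc_priv || target->ipc_priv)
            return 1;                                   /* check is waived */
        if (sender->level > target->level)
            return 0;                                   /* would "send down" */
        if ((sender->categories & ~target->categories) != 0)
            return 0;                                   /* category target lacks */
        return 1;
    }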