6.033 2011 Lecture 6: Client/server in one computer

today's topics
  client/server on one computer
  bounded buffer
  locks

goal: client/server design in one computer
  example: X client, X server (no kernel yet)
  client wants to send a request to the server (e.g., draw an image)
  goal: arms-length, so the X server is not vulnerable to client bugs
  idea: let the kernel manage the interaction
    client and server interact with the trusted kernel, not with each other
    recall from last lecture: we can use virtual memory to isolate client and server
  [diagram: X client, X server, kernel]
  let's focus on one-way flow (can use two flows for RPC)

buffer memory in the kernel
  each entry holds a message pointer
  send(m) adds a message to the buffer
  m <- receive() reads a message out of the buffer
  the buffer is finite, so send() may have to wait
  the buffer may be empty, so receive() may have to wait
  why does the buffer have multiple entries?
    sender and receiver rates may vary
    when the sender is faster, it can accumulate a backlog
    when the sender is slower, the receiver still has input and can catch up
  very much like a UNIX pipe

problem: concurrency
  some data structure inside the kernel; send() and receive() modify it
  what if send() and receive() execute at the same time? they may interfere
  concurrency is a giant problem; it will come up many times
  let's start simple:
    each program gets its own CPU
    there is only one memory system
    [diagram: two CPUs, one memory]
    system calls run on both CPUs!
      i.e., if program A calls send(), send() runs on A's CPU
    send() and receive() interact via the single shared memory system

data structure: "bounded buffer"
  [diagram: BUFFER[5], IN, OUT]
  each array entry: pointer to a message
  IN: number of messages put into the BB
  OUT: number of messages read out of the BB
  IN mod 5 is the next slot for send() to write
  OUT mod 5 is the next slot for receive() to look at
  example: in = 28, out = 26
    two messages waiting, in slots 1 and 2
  in > out (or just in != out, since in >= out always): BB not empty
  in - out < 5: BB not full

send() code slide
  bb is a pointer to an instance of a bounded buffer, so we can have many of them
    e.g., one per client/server pair, or one per UNIX pipe
  loop to wait until there is room ("busy-wait")
  write the slot
  increment the input count

receive() code slide
  loop to wait until there have been more send()s than receive()s,
    i.e., there's a message
  increment bb.out AFTER copying the message
    the increment may signal send() that it's OK to overwrite the slot,
    if send() is waiting for space

we believe this simple BB code works [show slide with both; a C sketch of both
  routines appears below, after the race example]
  even if send() and receive() are called at the same time
  concurrency rarely works out this well!

assumptions for the simple BB send/receive
  1. one sender, one receiver
  2. each has its own CPU (otherwise the busy-wait loop prevents the other from running)
  3. the CPUs perform memory reads and writes in the order shown
     oops! this code probably won't work as shown!
       the compiler might keep in/out in registers and not see the other's changes
       the CPU might increment bb.in before writing buffer[]
     we will assume memory reads/writes happen in program order today
  today: how to get rid of assumption 1
  tomorrow: how to get rid of assumption 2
  assumption 3 is mostly a matter of cooperating with the language/compiler

suppose we want multiple senders
  e.g., so that many clients can send messages to the X server
  would our send() work?

concurrent send()
  A: send(bb, m1)
  B: send(bb, m2)
  what do we *want* to happen?
  what would be the most useful behavior?
  desired outcome: two messages in the buffer, in == 2
    we don't care about the order

example problem with concurrent send()
  on different CPUs, at the same time, on the same bb:
    A                B
    r in, out
                     r in, out
    w buf[0]
                     w buf[0]
    r in=0
                     r in=0
    w in=1
                     w in=1
  result: in = 1, one message in the bounded buffer, and one was lost!
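the send()/receive() code lives on slides and isn't reproduced in these notes;
below is a minimal C sketch of the same idea, under the three assumptions above
(one sender, one receiver, each on its own CPU, memory operations in program
order). the type names (message, bb) and the constant N are illustrative, not
the slides' exact identifiers.

    #define N 5                        /* number of slots, as in the BUFFER[5] diagram */

    typedef struct message message;    /* opaque; each slot holds a pointer to one */

    typedef struct bb {
        message *buf[N];
        unsigned long in;              /* number of messages ever put in */
        unsigned long out;             /* number of messages ever read out */
    } bb;

    /* One sender, one receiver, memory reads/writes in program order.
       A real compiler may cache in/out in registers (assumption 3),
       so this is a sketch of the logic, not portable C. */
    void send(bb *b, message *m) {
        while (b->in - b->out >= N)    /* busy-wait until there is room */
            ;
        b->buf[b->in % N] = m;         /* fill the slot first... */
        b->in = b->in + 1;             /* ...then publish it by bumping in */
    }

    message *receive(bb *b) {
        while (b->in == b->out)        /* busy-wait until a message exists */
            ;
        message *m = b->buf[b->out % N];   /* copy the message out first... */
        b->out = b->out + 1;           /* ...then free the slot for send() */
        return m;
    }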
this kind of bug is called a "race"
  once A starts executing the "if", it has to hurry to finish the increment of bb.in!

other races in this code
  suppose only one slot is left
  A and B may both observe in - out < N
  they put *two* items in the buffer, overwriting the oldest entry

races are a serious problem
  easy mistake to make -- send() looks perfectly reasonable!
  hard to find
  depends on timing, may arise infrequently
    e.g., Therac-25: only experienced operators typed fast enough to trigger the race

how to fix send()'s races?
  the original code assumed no one else was messing with bb.in &c
  it worked because only one CPU at a time was in send()
  can we restore that property?

solution: locks
  a lock is a data type with two operations, wrapped around the statements they protect:
    acquire(l)
      s1
      s2
    release(l)
  the lock contains state: locked or unlocked
  if you try to acquire a locked lock, acquire() waits until it is released
  if two acquire()s try to get a lock at the same time,
    one will succeed and the other will wait

how to fix send() with locking? [locking send() slide; a C sketch appears
  below, after the fine-grained example]
  associate a lock with each BB
  acquire before using the BB
  release only after done using the BB
  high-level view:
    no interleaving of multiple send()s
    only one send() will be executing the guts of send()
    likely to be correct if the single-sender send() was correct
  does it matter how send() uses the lock?
    move acquire after the "if" statement? [slide]

demo: two `urxvt -fn xft:Monospace-14 +sb` side-by-side
  vi lockdemo.c on the left
  shell on the right
  Q: what will array[] hold?
  run make && ./lockdemo to see various correct answers
    no one right answer; it depends on the interleaving, and all are correct
  comment out the locks
  Q: what are we likely to see now?
  compile, run
    sometimes we get correct output! in the real world this is usually what happens
    we can see too few elements
      both wrote to the same array[n], then both set n to the same new value
    we can see zero slots
      both CPUs write array[n], then increment n in order
  put separate locks around the two statements in append()
  Q: what are we likely to see now?
  compile, run again
    always the right number of elements, but some are zero
  why is locking around each line of code not enough?
    look back at the machine-instruction timeline...

locks provide before-or-after atomicity
  a group of actions happens atomically before or after another group,
    but NEVER concurrently (e.g., with individual instructions interleaved)
  providing atomicity for just the increment's instructions is not enough
  how big should the atomic action be? when is it enough?
  helpful to think of locks as protecting invariants
    the BB's invariant is broken inside send(), between the store and the increment
    the atomic action should be large enough that the invariant holds before and after it
    it's OK to break the invariant in the middle, because no other action sees it broken

locking in large systems quickly becomes complicated
  consider a hypothetical file system that provides a move operation, without locks:
    move(d1, name, d2):
      delete name from d1
      add name to d2
  we need a lock, but how many, and what should each protect?

"coarse-grained" locking: one lock for the entire file system
    move(d1, name, d2):
      acquire(FS.lock)
      delete name from d1
      add name to d2
      release(FS.lock)
  pro: likely correct, just like on a single CPU
  con: no concurrency inside the FS, which can lead to low performance on many CPUs
    two programs that access different directories cannot run concurrently

"fine-grained" locking: one lock per directory [broken]
    move(d1, name, d2):
      acquire(d1.lock)
      delete name from d1
      release(d1.lock)
      acquire(d2.lock)
      add name to d2
      release(d2.lock)
  pro: allows more concurrency, can lead to higher performance
  con: hard to get correct!
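the locking send() slide isn't reproduced in these notes; below is one plausible
arrangement in C, assuming the acquire()/release() operations described above
(implemented next lecture), the message type and N from the earlier sketch, and
that the single receiver still runs the lock-free receive() from that sketch.
the slide may arrange the details differently.

    typedef struct lock { int locked; } lock;
    void acquire(lock *l);             /* waits until the lock is free, then takes it */
    void release(lock *l);             /* marks the lock free */

    typedef struct bb {
        message *buf[N];
        unsigned long in, out;
        lock l;                        /* one lock per bounded buffer */
    } bb;

    void send(bb *b, message *m) {
        acquire(&b->l);                /* at most one sender past this point */
        while (b->in - b->out >= N)    /* busy-wait for the lock-free receive() to */
            ;                          /*   bump out; if the receiver also took this */
                                       /*   lock, spinning here would deadlock */
        b->buf[b->in % N] = m;
        b->in = b->in + 1;
        release(&b->l);
    }

note on the "move acquire after the 'if'" question: acquiring only around the
store and increment would reopen the one-slot-left race above, since two senders
could both pass the fullness check before either bumps bb.in.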
usually best to start with coarse-grained locking and refine it when needed

strawman 2 (the fine-grained version above) is broken:
  another CPU might not see the file in _either_ directory,
    despite holding all of the right locks
  think about before-or-after atomicity and the invariants it protects
  the invariant spans both d1 and d2, so we need to hold both locks

strawman 3:
    move(d1, name, d2):
      acquire(d1.lock)
      acquire(d2.lock)
      delete name from d1
      add name to d2
      release(d2.lock)
      release(d1.lock)
  problem: deadlock between move(d1, x, d2) and move(d2, x, d1)
  deadlock occurs when multiple programs hold locks, each waiting for another

avoiding deadlock
  look for all places where multiple locks are held
  make sure that, in all such places, the locks are acquired in the same order
  then there can be no locking cycles, and no deadlock
  deadlock-free code:
    move(d1, name, d2):
      if d1.inum < d2.inum:
        acquire(d1.lock)
        acquire(d2.lock)
      else:
        acquire(d2.lock)
        acquire(d1.lock)
      ...
  this can be painful: it requires global reasoning
    acquire(l1)
    print("...")
    does print() acquire a lock? could it deadlock with l1?
  good news: deadlocks are not subtle once they occur

--- for next lecture ---

now that we know how to use acquire and release...
  how do we implement them?
  here's a plan that DOES NOT WORK:
    acquire(l)
      while l != 0
        do nothing
      l = 1
    release(l)
      l = 0
  it has a familiar race:
    A and B both see l = 0
    A and B both set l = 1
    A and B both hold the lock!
  if only we could make the l == 0 test and the l = 1 assignment indivisible...

most CPUs provide the indivisible instruction we need!
  it differs by CPU, but is usually similar to:
    RSM r, a
      r <- mem[a]
      mem[a] <- 1
    sets register r to the old mem[a]
    sets the memory location to 1
    RSM = Read and Set Memory

how does the hardware make RSM indivisible? here's a simple plan
  two CPUs, one bus, one memory system
  there's only one bus, so only one CPU can use it at a time
    an arbiter decides which one
  so RSM can grab the bus, do the read AND the write, then release the bus
  the arbiter prevents a 2nd RSM from starting until the first RSM finishes
  it's easy to build a single arbiter that chooses just one CPU at a time
    e.g., round-robin among the CPUs

how to use RSM for locking? [a C sketch appears at the end of these notes]
  acquire(l)
    do:
      RSM r, l
    while r == 1
  RSM sets r = 0 iff the lock was not already locked
  it always sets the lock to 1
    if already locked: harmless, loops
    if not locked: locks, exits loop
  only one of a set of concurrent RSMs will see 0,
    since the arbiter makes them execute one at a time

tidbit: you can implement locks with ordinary LOADs and STOREs
  i.e., without hardware-supported RSM
  it's just awkward and slow
  look up Dekker's Algorithm

summary
  the BB is what you need for client/server on one computer
  concurrent programming is tough
  locks can help make several actions look sequential: before-or-after atomicity
  next: more than one program per CPU
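to make the RSM-based acquire() concrete, here is a small C sketch of
acquire()/release() built on an indivisible test-and-set. RSM is the lecture's
idealized instruction; this sketch assumes C11's atomic_flag_test_and_set() as
a stand-in, which likewise returns the old value and sets the flag in one
indivisible step.

    #include <stdatomic.h>

    typedef struct lock {
        atomic_flag locked;            /* clear = unlocked, set = locked */
    } lock;                            /* initialize with ATOMIC_FLAG_INIT */

    void acquire(lock *l) {
        /* Spin doing the RSM-style operation: atomically read the old
           value and set the flag to "locked".  Exit only when the old
           value was "unlocked" (test_and_set returns false) -- only one
           of a set of concurrent callers can observe that. */
        while (atomic_flag_test_and_set(&l->locked))
            ;
    }

    void release(lock *l) {
        atomic_flag_clear(&l->locked); /* mark the lock free again */
    }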