6.033 2011 Lecture 6: Client/server in one computer

today's topics
  client/server on one computer
  bounded buffer
  locks

goal: client/server design in one computer
  example: X client, X server (no kernel yet)
  client wants to send a request to the server (e.g., draw an image)
  goal: arms-length, so the X server is not vulnerable to client bugs
  idea: let the kernel manage the interaction
    client and server interact with the trusted kernel, not with each other
    recall from last lecture: we can use virtual memory to isolate client and server
  [diagram: X client, X server, kernel]
  let's focus on one-way flow (can use two flows for RPC)

buffer memory in the kernel
  each entry holds a message pointer
  send(m) adds a message to the buffer
  m <- receive() reads a message out of the buffer
  the buffer is finite, so send() may have to wait
  the buffer may be empty, so receive() may have to wait
  why does the buffer have multiple entries?
    sender and receiver rates may vary
    when the sender is faster, it can accumulate a backlog
    when the sender is slower, the receiver still has input and can catch up
  very much like a UNIX pipe

problem: concurrency
  some data structure inside the kernel; send() and receive() modify it
  what if send() and receive() execute at the same time? they may interfere
  concurrency is a giant problem; it will come up many times
  let's start simple:
    each program gets its own CPU
    there is only one memory system
    [diagram: two CPUs, one memory]
    system calls run on both CPUs!
      i.e., if program A calls send(), send() runs on A's CPU
    send() and receive() interact via the single shared memory system

data structure: "bounded buffer"
  [diagram: BUFFER[5], IN, OUT]
  each array entry: pointer to a message
  IN: number of messages put into the BB
  OUT: number of messages read out of the BB
  IN mod 5 is the next slot for send() to write
  OUT mod 5 is the next slot for receive() to look at
  example: in = 28, out = 26
    two messages waiting, in slots 1 and 2
  in > out (or just in != out, since in >= out always): BB not empty
  in - out < 5: BB not full

send() code slide
  bb is a pointer to an instance of a bounded buffer, so we can have many of them
    e.g., one per client/server pair, or one per UNIX pipe
  loop to wait until there is room ("busy-wait")
  write the slot
  increment the input count

receive() code slide
  loop to wait until there have been more send()s than receive()s,
    i.e., there's a message
  increment bb.out AFTER copying the message
    the increment may signal send() that it's OK to overwrite the slot,
    if send() is waiting for space

we believe this simple BB code works [show slide with both; a C sketch of both
  routines appears below, after the race example]
  even if send() and receive() are called at the same time
  concurrency rarely works out this well!

assumptions for the simple BB send/receive
  1. one sender, one receiver
  2. each has its own CPU (otherwise the busy-wait loop prevents the other from running)
  3. the CPUs perform memory reads and writes in the order shown
     oops! this code probably won't work as shown!
       the compiler might keep in/out in registers and not see the other's changes
       the CPU might increment bb.in before writing buffer[]
     we will assume memory reads/writes happen in program order today
  today: how to get rid of assumption 1
  tomorrow: how to get rid of assumption 2
  assumption 3 is mostly a matter of cooperating with the language/compiler

suppose we want multiple senders
  e.g., so that many clients can send messages to the X server
  would our send() work?

concurrent send()
  A: send(bb, m1)
  B: send(bb, m2)
  what do we *want* to happen?
  what would be the most useful behavior?
  desired outcome: two messages in the buffer, in == 2
    we don't care about the order

example problem with concurrent send()
  on different CPUs, at the same time, on the same bb:
    A                B
    r in, out
                     r in, out
    w buf[0]
                     w buf[0]
    r in=0
                     r in=0
    w in=1
                     w in=1
  result: in = 1, one message in the bounded buffer, and one was lost!
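the send()/receive() code lives on slides and isn't reproduced in these notes;
below is a minimal C sketch of the same idea, under the three assumptions above
(one sender, one receiver, each on its own CPU, memory operations in program
order). the type names (message, bb) and the constant N are illustrative, not
the slides' exact identifiers.

    #define N 5                        /* number of slots, as in the BUFFER[5] diagram */

    typedef struct message message;    /* opaque; each slot holds a pointer to one */

    typedef struct bb {
        message *buf[N];
        unsigned long in;              /* number of messages ever put in */
        unsigned long out;             /* number of messages ever read out */
    } bb;

    /* One sender, one receiver, memory reads/writes in program order.
       A real compiler may cache in/out in registers (assumption 3),
       so this is a sketch of the logic, not portable C. */
    void send(bb *b, message *m) {
        while (b->in - b->out >= N)    /* busy-wait until there is room */
            ;
        b->buf[b->in % N] = m;         /* fill the slot first... */
        b->in = b->in + 1;             /* ...then publish it by bumping in */
    }

    message *receive(bb *b) {
        while (b->in == b->out)        /* busy-wait until a message exists */
            ;
        message *m = b->buf[b->out % N];   /* copy the message out first... */
        b->out = b->out + 1;           /* ...then free the slot for send() */
        return m;
    }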
this kind of bug is called a "race"
  once A starts executing the "if", it has to hurry to finish the increment of bb.in!

other races in this code
  suppose only one slot is left
  A and B may both observe in - out < N
  they put *two* items in the buffer, overwriting the oldest entry

races are a serious problem
  easy mistake to make -- send() looks perfectly reasonable!
  hard to find
  depends on timing, may arise infrequently
    e.g., Therac-25: only experienced operators typed fast enough to trigger the race

how to fix send()'s races?
  the original code assumed no one else was messing with bb.in &c
  it worked because only one CPU at a time was in send()
  can we restore that property?

solution: locks
  a lock is a data type with two operations, wrapped around the statements they protect:
    acquire(l)
      s1
      s2
    release(l)
  the lock contains state: locked or unlocked
  if you try to acquire a locked lock, acquire() waits until it is released
  if two acquire()s try to get a lock at the same time,
    one will succeed and the other will wait

how to fix send() with locking? [locking send() slide; a C sketch appears
  below, after the fine-grained example]
  associate a lock with each BB
  acquire before using the BB
  release only after done using the BB
  high-level view:
    no interleaving of multiple send()s
    only one send() will be executing the guts of send()
    likely to be correct if the single-sender send() was correct
  does it matter how send() uses the lock?
    move acquire after the "if" statement? [slide]

demo: two `urxvt -fn xft:Monospace-14 +sb` side-by-side
  vi lockdemo.c on the left
  shell on the right
  Q: what will array[] hold?
  run make && ./lockdemo to see various correct answers
    no one right answer; it depends on the interleaving, and all are correct
  comment out the locks
  Q: what are we likely to see now?
  compile, run
    sometimes we get correct output! in the real world this is usually what happens
    we can see too few elements
      both wrote to the same array[n], then both set n to the same new value
    we can see zero slots
      both CPUs write array[n], then increment n in order
  put separate locks around the two statements in append()
  Q: what are we likely to see now?
  compile, run again
    always the right number of elements, but some are zero
  why is locking around each line of code not enough?
    look back at the machine-instruction timeline...

locks provide before-or-after atomicity
  a group of actions happens atomically before or after another group,
    but NEVER concurrently (e.g., with individual instructions interleaved)
  providing atomicity for just the increment's instructions is not enough
  how big should the atomic action be? when is it enough?
  helpful to think of locks as protecting invariants
    the BB's invariant is broken inside send(), between the store and the increment
    the atomic action should be large enough that the invariant holds before and after it
    it's OK to break the invariant in the middle, because no other action sees it broken

locking in large systems quickly becomes complicated
  consider a hypothetical file system that provides a move operation, without locks:
    move(d1, name, d2):
      delete name from d1
      add name to d2
  we need a lock, but how many, and what should each protect?

"coarse-grained" locking: one lock for the entire file system
    move(d1, name, d2):
      acquire(FS.lock)
      delete name from d1
      add name to d2
      release(FS.lock)
  pro: likely correct, just like on a single CPU
  con: no concurrency inside the FS, which can lead to low performance on many CPUs
    two programs that access different directories cannot run concurrently

"fine-grained" locking: one lock per directory [broken]
    move(d1, name, d2):
      acquire(d1.lock)
      delete name from d1
      release(d1.lock)
      acquire(d2.lock)
      add name to d2
      release(d2.lock)
  pro: allows more concurrency, can lead to higher performance
  con: hard to get correct!
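the locking send() slide isn't reproduced in these notes; below is one plausible
arrangement in C, assuming the acquire()/release() operations described above
(implemented next lecture), the message type and N from the earlier sketch, and
that the single receiver still runs the lock-free receive() from that sketch.
the slide may arrange the details differently.

    typedef struct lock { int locked; } lock;
    void acquire(lock *l);             /* waits until the lock is free, then takes it */
    void release(lock *l);             /* marks the lock free */

    typedef struct bb {
        message *buf[N];
        unsigned long in, out;
        lock l;                        /* one lock per bounded buffer */
    } bb;

    void send(bb *b, message *m) {
        acquire(&b->l);                /* at most one sender past this point */
        while (b->in - b->out >= N)    /* busy-wait for the lock-free receive() to */
            ;                          /*   bump out; if the receiver also took this */
                                       /*   lock, spinning here would deadlock */
        b->buf[b->in % N] = m;
        b->in = b->in + 1;
        release(&b->l);
    }

note on the "move acquire after the 'if'" question: acquiring only around the
store and increment would reopen the one-slot-left race above, since two senders
could both pass the fullness check before either bumps bb.in.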
usually best to start with coarse-grained locking and refine it when needed

strawman 2 (the fine-grained version above) is broken:
  another CPU might not see the file in _either_ directory,
    despite holding all of the right locks
  think about before-or-after atomicity and the invariants it protects
  the invariant spans both d1 and d2, so we need to hold both locks

strawman 3:
    move(d1, name, d2):
      acquire(d1.lock)
      acquire(d2.lock)
      delete name from d1
      add name to d2
      release(d2.lock)
      release(d1.lock)
  problem: deadlock between move(d1, x, d2) and move(d2, x, d1)
  deadlock occurs when multiple programs hold locks, each waiting for another

avoiding deadlock
  look for all places where multiple locks are held
  make sure that, in all such places, the locks are acquired in the same order
  then there can be no locking cycles, and no deadlock
  deadlock-free code:
    move(d1, name, d2):
      if d1.inum < d2.inum:
        acquire(d1.lock)
        acquire(d2.lock)
      else:
        acquire(d2.lock)
        acquire(d1.lock)
      ...
  this can be painful: it requires global reasoning
    acquire(l1)
    print("...")
    does print() acquire a lock? could it deadlock with l1?
  good news: deadlocks are not subtle once they occur

--- for next lecture ---

now that we know how to use acquire and release...
  how do we implement them?
  here's a plan that DOES NOT WORK:
    acquire(l)
      while l != 0
        do nothing
      l = 1
    release(l)
      l = 0
  it has a familiar race:
    A and B both see l = 0
    A and B both set l = 1
    A and B both hold the lock!
  if only we could make the l == 0 test and the l = 1 assignment indivisible...

most CPUs provide the indivisible instruction we need!
  it differs by CPU, but is usually similar to:
    RSM r, a
      r <- mem[a]
      mem[a] <- 1
    sets register r to the old mem[a]
    sets the memory location to 1
    RSM = Read and Set Memory

how does the hardware make RSM indivisible? here's a simple plan
  two CPUs, one bus, one memory system
  there's only one bus, so only one CPU can use it at a time
    an arbiter decides which one
  so RSM can grab the bus, do the read AND the write, then release the bus
  the arbiter prevents a 2nd RSM from starting until the first RSM finishes
  it's easy to build a single arbiter that chooses just one CPU at a time
    e.g., round-robin among the CPUs

how to use RSM for locking? [a C sketch appears at the end of these notes]
  acquire(l)
    do:
      RSM r, l
    while r == 1
  RSM sets r = 0 iff the lock was not already locked
  it always sets the lock to 1
    if already locked: harmless, loops
    if not locked: locks, exits loop
  only one of a set of concurrent RSMs will see 0,
    since the arbiter makes them execute one at a time

tidbit: you can implement locks with ordinary LOADs and STOREs
  i.e., without hardware-supported RSM
  it's just awkward and slow
  look up Dekker's Algorithm

summary
  the BB is what you need for client/server on one computer
  concurrent programming is tough
  locks can help make several actions look sequential: before-or-after atomicity
  next: more than one program per CPU
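to make the RSM-based acquire() concrete, here is a small C sketch of
acquire()/release() built on an indivisible test-and-set. RSM is the lecture's
idealized instruction; this sketch assumes C11's atomic_flag_test_and_set() as
a stand-in, which likewise returns the old value and sets the flag in one
indivisible step.

    #include <stdatomic.h>

    typedef struct lock {
        atomic_flag locked;            /* clear = unlocked, set = locked */
    } lock;                            /* initialize with ATOMIC_FLAG_INIT */

    void acquire(lock *l) {
        /* Spin doing the RSM-style operation: atomically read the old
           value and set the flag to "locked".  Exit only when the old
           value was "unlocked" (test_and_set returns false) -- only one
           of a set of concurrent callers can observe that. */
        while (atomic_flag_test_and_set(&l->locked))
            ;
    }

    void release(lock *l) {
        atomic_flag_clear(&l->locked); /* mark the lock free again */
    }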