Robert Morris
6.033 2009
Lecture 6: Client/server on one computer

Intro
  how to implement c/s on one computer
    valuable in itself
    involves concurrency, an independently interesting topic
    DP1 is all about concurrency
  what do we want?
    [diagram: X client, X server, NO KERNEL YET]
    client wants to send e.g. an image to the server
    goal: arms-length, so the X server is not vulnerable to client bugs
  idea: let the kernel manage the interaction
    client/server interact w/ the trusted kernel, not each other
    [diagram: X client, X server, kernel]
  let's focus on one-way flow (can use two for RPC)
  buffer memory in the kernel
    each entry holds a message pointer
    send(m) to add a msg to the buffer
    receive(m) to read a msg out of the buffer
    finite buffer, so send() may have to wait
    buffer may be empty, so receive() may have to wait
  why does the buffer have multiple entries?
    sender / receiver rates may vary around an average
    let the sender accumulate a backlog when it's faster
    so the receiver has input when the sender is slower
    very much like a UNIX pipe
  problem: concurrency
    some data structure inside the kernel; send() and receive() modify it
    what if send() and receive() are active at the same time? they may interfere
    concurrency is a giant problem, will come up many times
  let's start simple: each program gets its own CPU
    there is only one memory system
    [diagram: two CPUs, one memory]
    system calls run on both CPUs!
      i.e. if program A calls send(), send() runs on A's CPU
    send() and receive() interact via the single shared memory system

data structure: "bounded buffer"
  [diagram: BUFFER[5], IN, OUT]
  each array entry: pointer to a message buffer
  IN: number of messages put into the BB
  OUT: number of messages read out of the BB
  IN mod 5 is the next place for send() to write
  OUT mod 5 is the next place for receive() to look
  example: in = 28, out = 26
    two messages waiting, in slots 1 and 2
    in > out => BB not empty
    in - out < 5 => BB not full

send() code slide
  p is a "port": points to an instance of the BB, so we can have many of them
    e.g. one per c/s pair, or one per UNIX pipe
  loop to wait until there's room ("busy-wait")
  write the slot
  increment the input count

receive() code slide
  loop to wait until more sends than recvs
  if there's a message:
    increment p.out AFTER copying the msg
    since p.out++ may signal send() to overwrite
    if send() is waiting for space

I believe this simple BB code works [show slide with both]
  even if send() and receive() are called at the same time
  concurrency rarely works out this well!
  (a C sketch of send() and receive() appears after the race example below)

Assumptions for the simple BB send/recv
  1. one sender, one receiver
  2. each has its own CPU (otherwise a busy-wait loop prevents the other from running)
  3. in and out don't overflow
  4. the CPUs perform memory reads and writes in the order shown
  oops! this code probably won't work as shown!
    the compiler might keep in/out in registers, never seeing the other's changes
    the CPU might increment p.in before writing buffer[]
  I will assume memory reads/writes happen in program order

Suppose we want multiple senders
  e.g. so that many clients can send msgs to the X server
  would our send() work?

Concurrent send()
  A: send(p, m1)
  B: send(p, m2)
  what do we *want* to happen?
    what would be the most useful behavior?
  goal: two msgs in the buffer, in == 2
    we don't care about the order

Example problem w/ concurrent send()
  on different CPUs, at the same time, on the same p:
    A              B
    r in, out      r in, out
    w buf[0]       w buf[0]
    r in = 0       r in = 0
    w in = 1       w in = 1
  result: in = 1, one message in the bounded buffer, and one was lost!
  this kind of bug is called a "race"
    once A puts data in the buffer, it has to hurry to finish w/ the incr of p.in!
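Here is the code from the slides as a minimal C sketch, under exactly the four
assumptions above (the names struct port, struct message, and NSLOTS are mine,
not from the slides):

    #define NSLOTS 5

    struct message;                      /* opaque: whatever the client sends */

    struct port {                        /* one bounded buffer ("port") */
        struct message *buffer[NSLOTS];  /* each slot holds a message pointer */
        unsigned long in;                /* # messages ever put in */
        unsigned long out;               /* # messages ever taken out */
    };

    /* send: spin until there's room, write the slot, and only then
       advance p->in -- the order matters, since incrementing in is
       what tells receive() the slot is ready. */
    void send(struct port *p, struct message *m)
    {
        while (p->in - p->out >= NSLOTS)
            ;                            /* full: busy-wait */
        p->buffer[p->in % NSLOTS] = m;
        p->in = p->in + 1;               /* publish the message */
    }

    /* receive: spin until in > out, copy the pointer out, and only
       then advance p->out, since incrementing out is what tells a
       waiting send() it may overwrite the slot. */
    struct message *receive(struct port *p)
    {
        struct message *m;
        while (p->in <= p->out)
            ;                            /* empty: busy-wait */
        m = p->buffer[p->out % NSLOTS];
        p->out = p->out + 1;             /* free the slot */
        return m;
    }

As assumption 4 warns, this only works if reads and writes really happen in
program order: a real compiler would cache p->in and p->out in registers
(volatile, or real synchronization, would be needed), and a real CPU may
reorder the buffer write past the increment of in.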
Other races in this code
  suppose only one slot is left
  A and B may both observe in - out < N
  then put *two* items in the buffer, overwriting the oldest entry

Races are a serious problem
  an easy mistake to make -- send() looks perfectly reasonable!
  hard to find
    depends on timing, may arise infrequently
    e.g. Therac-25: only experienced operators typed fast enough to trigger it

How to fix send()'s races?
  the original code assumed no one else was messing w/ p.in &c
    only one CPU at a time in send() == isolated execution
  can we restore that isolation?

Locks
  a lock is a data type with two operations:
    acquire(l)
      s1
      s2
    release(l)
  the lock contains state: locked or unlocked
  if you try to acquire a locked lock,
    acquire will wait until it's released
  if two acquire()s try to get a lock at the same time,
    one will succeed, the other will wait

How to fix send() with locking? [locking send() slide]
  associate a lock w/ each BB
  acquire before using the BB
  release only after done using the BB
  high-level view: no interleaving of multiple send()s
    only one send() will be executing
    likely to be correct if the single-sender send() was correct
  (a C sketch appears at the end of this section)

Does it matter how send() uses the lock?
  move the acquire to after the IF? [slide]
  no good: two senders could again both pass the space test,
    then both write the same slot -- the test and the write
    must execute as one isolated action

Why a separate lock per bounded buffer?
  rather than e.g. all BBs sharing one lock
  a single lock would allow only one BB to be active
  but it's OK if send()s on different BBs are concurrent
  lock-per-BB improves performance / parallelism

Deadlock
  a big program can have thousands of locks, some per module
  once you have more than one lock in your system,
    you have to worry about deadlock
  deadlock: two CPUs each hold a lock, each waiting for the other to release
  example: implementing a file system
    need to ensure two programs don't modify a directory at the same time
    have a lock per directory
    create(d, name):
      acquire d.lock
      if name exists: error
      else: create dir entry for name
      release d.lock
    what about moving a file from one dir to another? like mv
    move(d1, name, d2):
      acquire d1.lock
      acquire d2.lock
      delete name from d1
      add name to d2
      release d2.lock
      release d1.lock
    what is the problem here?
      concurrent move(d1, x, d2) and move(d2, y, d1)
      each grabs its first lock, then waits forever for the other's

Avoiding deadlock
  look for all places where multiple locks are held
  make sure, for every such place, they are acquired in the same order
  then there can be no locking cycles, and no deadlock
  for move(): sort the directories by i-number, lock the lower i-number first
    if d1.inum < d2.inum:
      acquire d1.lock
      acquire d2.lock
    else:
      acquire d2.lock
      acquire d1.lock
  this can be painful: it requires global reasoning
    acquire l1
    print("...")
    does print() acquire a lock? could it deadlock w/ l1?
  the good news: deadlocks are not subtle once they occur

Lock granularity
  how to decide how many locks to have, and what each should protect?
  a spectrum: coarse vs. fine
  coarse: just one lock, or one lock per module
    e.g. one lock for the whole file system
    more likely to be correct
    but CPUs may wait/spin for the lock, wasting CPU time
    "serial execution": performance no better than one CPU
  fine: split the data into many pieces, each with a separate lock
    different CPUs can use different data, different locks
    operate in parallel, more work gets done
    but harder to get correct
      more thought to be sure ops on different pieces don't interact
      e.g. deadlock when moving between directories
  always start as coarse as possible!
    use multiple locks only if you are forced to, by low parallel performance

How to implement acquire and release?
  here's a plan that DOES NOT WORK:
    acquire(l):
      while l == 1:
        do nothing
      l = 1
    release(l):
      l = 0
  it has a familiar race:
    A and B both see l = 0
    A and B both set l = 1
    A and B both hold the lock!
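Two hedged C sketches of the fixes above. First, send() with a per-port lock
(reusing struct port from the earlier sketch, with a lock field added; acquire()
and release() are taken on faith for now and implemented at the end of these
notes):

    #define NSLOTS 5
    struct message;

    struct lock { int locked; };          /* 0 = free, 1 = held */
    void acquire(struct lock *l);         /* wait until free, then take it */
    void release(struct lock *l);

    struct port {
        struct lock lock;                 /* one lock per bounded buffer */
        struct message *buffer[NSLOTS];
        unsigned long in;
        unsigned long out;
    };

    void send(struct port *p, struct message *m)
    {
        acquire(&p->lock);                /* before the in/out test, not after */
        while (p->in - p->out >= NSLOTS) {
            /* full: drop the lock while waiting, so a receive() that
               also uses the lock has a chance to make room */
            release(&p->lock);
            acquire(&p->lock);
        }
        p->buffer[p->in % NSLOTS] = m;
        p->in = p->in + 1;
        release(&p->lock);
    }

Second, move() with the i-number ordering rule (struct dir and its fields are
illustrative; assumes d1 != d2, since a same-directory move would acquire the
same lock twice):

    struct dir { unsigned long inum; struct lock lock; };

    /* acquire directory locks in a fixed global order (lower i-number
       first) so no pair of concurrent move()s can each hold one lock
       while waiting for the other: no cycle, no deadlock. */
    void move(struct dir *d1, const char *name, struct dir *d2)
    {
        struct dir *first  = d1->inum < d2->inum ? d1 : d2;
        struct dir *second = first == d1 ? d2 : d1;

        acquire(&first->lock);
        acquire(&second->lock);
        /* ... delete name from d1, add name to d2 ... */
        release(&second->lock);
        release(&first->lock);
    }

Both sketches lean on acquire() and release() being correct, and the broken
load/store version above shows why that needs help from the hardware.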
If only we could make the l == 0 test and the l = 1 assignment indivisible...
  most CPUs provide the indivisible instruction we need!
  it differs by CPU, but is usually similar to:
    RSM(a):
      r <- mem[a]
      mem[a] <- 1
      return r
  sets the memory to 1, returns the old value
  RSM = Read and Set Memory

How does h/w make RSM indivisible? a simple plan:
  two CPUs, a bus, memory
  there's only one bus, so only one CPU can use it at a time
  an arbiter decides who gets it
  a CPU grabs the bus, does the read AND the write, releases the bus
  the arbiter forces one RSM to finish before the other can start

How to use RSM for locking?
  acquire(l):
    while RSM(l) == 1:
      do nothing
  always sets the lock to 1
    if already locked: harmless
    if not locked: locks it
  RSM returns 0 iff the lock was not already locked
  only one of a set of concurrent CPUs will see the 0
  (a C sketch follows the summary)
  tidbit: you can implement locks w/ ordinary LOADs and STOREs
    i.e. w/o a hardware-supported RSM
    it's just awkward and slow
    look up Dekker's Algorithm

Summary
  a BB is what you need for client/server on one computer
  concurrent programming is tough
  locks can help make concurrency look more sequential
  watch out for deadlock
  Next: more than one program per CPU
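A closing sketch: the RSM spinlock in C11, under the assumption that
atomic_flag_test_and_set is a fair portable spelling of RSM (it atomically
reads the old value and sets the flag, compiling to an instruction like
XCHG on x86):

    #include <stdatomic.h>

    struct lock {
        atomic_flag locked;               /* clear (0) = free */
    };
    /* initialize with: struct lock l = { ATOMIC_FLAG_INIT }; */

    void acquire(struct lock *l)
    {
        /* RSM semantics: set the flag to 1, return the OLD value.
           0 => we took it from free to held, so we hold the lock;
           1 => someone else holds it -- setting it again is harmless,
           so just keep spinning until we see a 0. */
        while (atomic_flag_test_and_set(&l->locked))
            ;                             /* busy-wait */
    }

    void release(struct lock *l)
    {
        atomic_flag_clear(&l->locked);    /* l = 0 */
    }

As with the bus arbiter story, the hardware guarantees that only one of a set
of concurrent acquirers sees the 0.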