6.033 2011 Lecture 9: Performance

Performance: why am I lecturing about it?
  often has a huge effect on design
  often forces major re-design as loads grow
  faster CPUs haven't "solved" performance
    problems have grown too, e.g. Internet-scale
    disk and net latency hasn't improved

Performance Introduction
  your system is too slow
    [diagram: clients, net, server CPU, server disk]
    perhaps too many users, and they are complaining
  what can you do?
    1. measure to find the bottleneck
       could be any of the components, including the client
       you hope to find a dominating bottleneck!
    2. relax the bottleneck
       increase efficiency or add hardware

Decide what you mean by "performance"
  throughput: requests/time, for lots of waiting requests
  latency: time/request, for a single request
  sometimes inverses: if one request takes 0.1 second of CPU, and there is
    one CPU, then throughput = 10/sec
  often not inverses:
    concurrency: with 2 CPUs, latency is still 0.1 sec, but throughput is 20/sec
      (pipelining is a form of concurrency)
    queuing: 0.1 seconds of CPU but 10 jobs in line = 1 second latency,
      10 per second throughput
  I will focus on throughput, appropriate for heavily loaded systems
  most systems get slow as the # of users goes up
    at first, each new user uses some otherwise idle resources
    then they start to queue
    [graph: # users vs replies/sec: linear up, then hits a plateau]
    [graph: # users vs delay: stays near zero, then linear up (queuing delay)]

How to find the bottleneck?
  1. measure utilization of each resource
     perhaps sample: busy/idle, what is it doing? (profile...)
     e.g. CPU 100% busy, disk only 20% busy
     but maybe not conclusive, e.g. CPU 50% used and disk 50% used,
       just at different times
  2. model
     net should take 10 ms, CPU 50 ms, disk 10 ms
  3. guess
     you will probably have to do this anyway

You may need to make application logic more efficient
  fewer features, better algorithms
  I cannot help you with that -- application-specific

System performance techniques (i.e. what O/S infrastructure can provide)
  1. caching
  2. batching
  3. concurrency
     I/O concurrency
     parallel hardware
  4. scheduling

I'm going to use the disk as the main example
  disk is often the main bottleneck in real systems
  every year it gets harder to be CPU-bound
  crucial to use the disk efficiently
  Hitachi 7K400: 400 GB, 7200 RPM, 8.5 ms read seek, 9.2 ms write seek,
    567 to 1170 sectors/track, 10 heads, about 87,000 cylinders

primer on disks
  [disk slide]
  physical arrangement
    rotating platter: 7,200 RPM = 120/sec = 8.3 ms per rotation
      continuous rotation
    circular tracks, 512-byte sectors, about 1000 sectors/track
      r/w in sector units
    perhaps 100,000 tracks, concentric rings
    multiple platters, 5 for the 7K400, so 10 surfaces
    cylinder: set of vertically aligned tracks
    one disk arm, one head per surface, they all move together
      can only read from one head at a time
  three steps to read or write a sector:
    "seek" the arm to the desired track: varies with distance, 1 to 15 ms
    wait for the disk to "rotate" the desired sector under the head: 0 to 8.3 ms
      each sector has a header with sector # and ECC
    r/w the bits as they rotate under the head at 7200 RPM
  disk access patterns: two broad classes
    big multi-track sequential transfers: one track per rotation
      512*1000 / 0.0083 = 62 megabytes/second
      that is fast! unlikely to be a bottleneck for most systems
        DRAM is only 10x faster, a typical net only 2x faster
    small transfers, e.g. a 5000-byte web page, from a *random* disk location
      seek + rotate + transfer
        avg seek: 9 ms
        avg rotate: 1/2 a full rotation for a random sector
        transfer: size / 62 MB/sec
      9 + 4 + 0.1 = 13.1 ms
      rate = 5000 / 0.0131 = 0.4 MB/sec
      i.e. about 1% of the big sequential rate. this is slow.
      sadly this is typical of real life
  Lesson: avoid random disk access!
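  a quick back-of-envelope check of the numbers above, as a small Python sketch
  (the constants are the lecture's example figures, not measurements of any
  particular drive):

    # rough disk-access arithmetic, using the example numbers from above
    SEEK_S     = 0.009        # average seek
    ROTATION_S = 0.0083       # one rotation at 7200 RPM
    XFER_BPS   = 62e6         # sequential transfer rate, bytes/sec

    def random_read_time(nbytes):
        # one random read = seek + half a rotation (on average) + transfer
        return SEEK_S + ROTATION_S / 2 + nbytes / XFER_BPS

    t = random_read_time(5000)                               # a small web page
    print("random 5000-byte read: %.1f ms" % (t * 1e3))      # ~13 ms
    print("random-read rate: %.2f MB/s" % (5000 / t / 1e6))  # ~0.4 MB/s
    print("sequential rate: %.0f MB/s" % (XFER_BPS / 1e6))   # 62 MB/s

  the roughly 150x gap between ~0.4 and 62 MB/s is why the layout and caching
  ideas below matter so much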
if you r/w entire big files, lay them out contiguously
  not as easy as it seems
  the FS needs to allocate contiguous sectors as you write
    so the FS must organize info about free sectors appropriately
    bad: linked list, as in 1970s UNIX
    better: bitmap
  danger: free-space fragmentation, if you create then delete files
    prevents allocation of large contiguous areas

if you read lots of small pieces of data, group them
  UNIX FS: dir inode, dir entry, file inode, indirect block, block
  original UNIX placed them in essentially random parts of the disk
    that's many 13-ms random accesses just to open()+read()
  modern FSes put dir/inodes/content in nearby tracks
  (is layout a form of scheduling?)

caching disk sectors
  use some of RAM to remember recently read disk sectors
    this is often the main purpose of RAM...
    your O/S kernel does this for you
    table:
      SN    DATA
      ...   ...
  read(sn):
    if sector in cache, use it
    else:
      evict some sector
      read sn from disk
      put sn, sector into cache
  how to calculate the performance win from caching?
    the usual analysis tool is "average access time"
    average access time = hit time + miss time * miss ratio
      hit time: about 0 s
      miss time: 0.010 s
      miss ratio: fraction of accesses that aren't in the cache
        depends hugely on the access pattern
    example: 100 small files
      90% of accesses are for just 10 files
      so make the cache able to hold 10 files
      then the miss ratio will likely be just over 0.1
      avg access time = 0 + 0.01 * 0.1 = 0.001 seconds
  eviction / replacement policies
    goal: evict the cache entry that's unlikely to be used soon
    least-recently-used (LRU) usually works well
      intuition: used recently => likely to be used again soon
      good if some data is more popular than other data, e.g. the UNIX root directory
      maybe implement with a double-ended list (see the sketch at the end of
        this section)
    Question: when would LRU work poorly?
      app loops over a huge file that doesn't fit in the cache
        if a block has been used recently, it won't be used again for a while!
        LRU would evict everything useful, e.g. the root directory,
          yet the cache contents would never be useful
      MRU? random?
  how to decide if caching is likely to be effective?
    first: do you have a busy disk and an idle CPU?
    easy if all your disk data fits in RAM!
    harder: you have 1 GB of data on disk and 0.1 GB of RAM
      will that work out well?
    temporal locality?
      a small subset (< 0.1 GB) that is used a lot
        working set < cache size
      and the rest of your 1 GB is relatively unpopular
      e.g. home directories of currently-logged-in users
    spatial locality?
      if other content in a sector is needed soon after the first access
      e.g. ls -l looks at each dir entry in turn
        a sector holds many directory entries
        maybe the FS puts the inodes of files in a directory in the same sector
      one reason the FS uses block = multiple sectors
        once you've seeked, why not read multiple sectors
        also decreases book-keeping (free list, inode block array)
    some access patterns have no locality
      reading huge files
      uniform random reads
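  a minimal Python sketch of the read(sn) path above with LRU eviction; the
  capacity, sector size, and read_from_disk stand-in are made up for illustration:

    from collections import OrderedDict

    CACHE_SLOTS = 1000                  # illustrative capacity, in sectors
    cache = OrderedDict()               # sector number -> data, oldest first

    def read_from_disk(sn):
        # stand-in for a real disk read (~10 ms if it's a random access)
        return b"\x00" * 512

    def read(sn):
        if sn in cache:                 # hit: ~0 cost
            cache.move_to_end(sn)       # mark as most recently used
            return cache[sn]
        if len(cache) >= CACHE_SLOTS:   # miss: evict the least recently used
            cache.popitem(last=False)
        data = read_from_disk(sn)       # the expensive part
        cache[sn] = data
        return data

    # average access time = hit time + miss time * miss ratio,
    # e.g. 0 + 0.010 * 0.1 = 0.001 s per access if 90% of reads hit

  the loop-over-a-huge-file pattern above defeats exactly this policy: every
  read misses, and every insertion evicts something more useful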
i/o concurrency
  what if each request uses both CPU and disk?
    CPU:  A--      B--      C--
    Disk:     A--      B--      C--
    both are idle half the time
  we can double throughput by overlapping CPU and disk
    CPU executes B while the disk seeks for A
  works well if many activities are waiting, so there is always something
    to run or read
  requires a way to concurrently execute many activities and switch between them
    threads! handy even if there is only one CPU
    yield the CPU with wait() while waiting for the disk
    the disk interrupts when done; the interrupt handler calls notify()
  special case: prefetch into the cache (works with only one thread)
  special case: write-behind out of the cache
  what if CPU requirements << disk?
    the above example only works well if CPU and disk demands are roughly equal
    concurrency may still win if some requests hit in the cache and some miss
      [time diagram: hit mmmiiisss hit hit]
      let hits complete from the cache while a miss is waiting for the disk
    again, often implemented with threads

writes are a special problem
  reads are often cacheable; writes often must go to the disk (in case of a
    crash) as well as to the cache
  result: caching is not very effective for write-intensive applications
  worst case is random writes (seeks)
    e.g. a mail server storing incoming messages, each in a small file
  many good ideas for coping with writes, e.g. logging, discussed later in the course

useful to think of two disk performance axes, matching access patterns:
  sequential vs random, read vs write
                          Disk    SSD
    seq read  (MB/sec)      60    145
    seq write (MB/sec)      60    123
    rand read  (IO/sec)     77   6800  (4 KB reads; 28 MB/s)
    rand write (IO/sec)     77   1000  (4 MB/s)

there are faster storage technologies than the hard disk
  solid state disk (SSD) -- flash with a fast SATA "disk" interface
    flash retains information even without power
  you might think it would be very fast, since there are no moving parts
  on my OCZ Vertex 30-GB drive:
    seq read:  145 MB/s
    seq write: 123 MB/s
    rand 4K read:  6800/s (28 MB/s)
    rand 4K write: 1000/s (4 MB/s)
  why such poor random write performance? "write amplification"
    must erase flash before writing or re-writing
    erase takes one or two milliseconds
    only erases in 512-KB blocks! due to chip layout
    so a small write requires:
      read 512 KB into the DRAM cache
      erase 512 KB
      write 512 KB back to flash
  the drive is actually smarter than that
    a remapping table lets each smallish page be stored in any erase block
    keep one block open for writing
    a one-page write:
      write to the open block
      update the page mapping to point to the open block
    decreases write amplification
    costs: table size, power-failure handling
  performance isn't everything:
              Disk   SSD
    $ / GB     0.1     2

scheduling
  sometimes changing the order of execution can increase efficiency
  suppose many small disk requests are waiting (from concurrent threads)
  use the "elevator algorithm": sort by track, r/w in track order
    before: 71 10 92 45 29
    after:  10 29 45 71 92
    (a small sketch appears after the rough numbers below)
  results: short seeks (1 ms instead of 8 ms)
    the bigger the backlog, the smaller the average seek
    the system gets more efficient as load goes up! higher throughput
  maybe you can sort rotationally also, but that's hard
  cons: unfair -- increases delay for some requests

what if one disk doesn't provide enough performance?
  use multiple disks in parallel
  how to split the data over the disks?
  big reads, throughput-limited?
    "stripe" each file across multiple disks
      first 1 MB on disk 1, 2nd MB on disk 2, 3rd MB on disk 1, &c
    app does a big read; all disks read their part in parallel
    app gets data N times faster
  small reads, seek-limited?
    put each read unit on only one disk (e.g. each customer record)
    so each app read touches only one disk
    the disks can seek for multiple different records at the same time
  load balance is critical in both cases
    2 disks only get 2x the performance if each does half the work
    easy to get imbalances from popular data

here are some handy rough numbers for thinking about performance
  latency:
    0.000001 ms : instruction time (1 ns; really maybe 1/3 ns)
    0.0001 ms   : DRAM load (100 ns)
    0.2 ms      : LAN network (e.g. RPC to a nearby server)
    10 ms       : random disk I/O
    25 ms       : Internet, east coast -> west coast
  throughput:
    1000 MB/s : DRAM
    100 MB/s  : LAN (maybe 1000 MB/s)
    100 MB/s  : sequential disk
    1 MB/s    : random disk I/O
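  the elevator-scheduling sketch promised above, in Python; the queued track
  numbers are the example from the scheduling notes, and the starting arm
  position is arbitrary:

    # serve queued disk requests in track order instead of arrival order
    pending = [71, 10, 92, 45, 29]      # tracks of waiting requests

    def arm_travel(order, start=0):
        # total number of tracks the arm sweeps over for a given service order
        pos, total = start, 0
        for track in order:
            total += abs(track - pos)
            pos = track
        return total

    fifo     = list(pending)            # arrival order
    elevator = sorted(pending)          # one sweep across the disk

    print("FIFO:    ", fifo,     "travel =", arm_travel(fifo))      # 277 tracks
    print("elevator:", elevator, "travel =", arm_travel(elevator))  #  92 tracks

  the longer the queue, the closer together the sorted tracks, so the average
  seek per request shrinks as load grows -- at some cost in fairness to requests
  at the far end of the sweep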
demo:
  1. run httpd on a port on Linux
     modify to use the same port
     fadvise (so file data isn't kept in the FS cache; step 2 turns this off)
     single-threaded server
     echo 1 > /proc/sys/vm/drop_caches    (flush the FS cache)
     cat garbage > /dev/null              (flush the disk's cache!)
     all clients access the same file concurrently
     ./simpleclient -n 10
     run iostat -> the disk is the bottleneck
     draw the throughput and latency graphs
  1b. measure disk time
      cat htdocs/index.html > /dev/null
  2. add caching
     turn off fadvise
     second client request is fast: 131 -> 0.3
     improved latency and throughput
  3. add concurrency
     what if something isn't in the cache?
     problem: slow again. throughput is limited by the misses
     modify httpd to be multithreaded to get I/O concurrency
       (a minimal sketch follows below)
     the client must be multithreaded too
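  a minimal sketch of step 3, assuming a toy file-serving protocol (the port,
  document root, and request format here are invented, not the demo's actual
  httpd): one thread per connection, so a request that misses in the cache
  blocks only its own thread while others keep the CPU and disk busy

    import socket, threading

    DOCROOT = "htdocs"        # hypothetical document root
    PORT = 8080               # hypothetical port

    def handle(conn):
        # serve one request: read a file name, send back the file's bytes
        with conn:
            name = conn.recv(1024).decode().strip()
            try:
                with open(DOCROOT + "/" + name, "rb") as f:
                    conn.sendall(f.read())   # may block on disk; other threads keep running
            except OSError:
                conn.sendall(b"error\n")

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(64)
    while True:
        c, _ = srv.accept()
        # one thread per request: a cache miss stalls only that thread
        threading.Thread(target=handle, args=(c,), daemon=True).start()

  the load generator needs the same treatment: unless the client issues many
  requests concurrently, the server never has more than one request to overlap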