6.033 2011 Lecture 9: Performance

Performance: why am I lecturing about it?
  often has a huge effect on design
  often forces major re-design as loads grow
  faster CPUs haven't "solved" performance
    problems have grown too, e.g. Internet-scale
    disk and net latency hasn't improved

Performance Introduction
  your system is too slow
    [diagram: clients, net, server CPU, server disk]
    perhaps too many users, and they are complaining
  what can you do?
    1. measure to find the bottleneck
       could be any of the components, including the client
       you hope to find a dominating bottleneck!
    2. relax the bottleneck
       increase efficiency or add hardware

Decide what you mean by "performance"
  throughput: requests/time, for lots of waiting requests
  latency: time/request, for a single request
  sometimes inverses: if one request takes 0.1 second of CPU, and there is
    one CPU, then throughput = 10/sec
  often not inverses:
    concurrency: with 2 CPUs, latency is still 0.1 sec, but throughput is 20/sec
      (pipelining is a form of concurrency)
    queuing: 0.1 seconds of CPU but 10 jobs in line = 1 second latency,
      10 per second throughput
  I will focus on throughput, appropriate for heavily loaded systems
  most systems get slow as the # of users goes up
    at first, each new user uses some otherwise idle resources
    then they start to queue
    [graph: # users vs replies/sec: linear up, then hits a plateau]
    [graph: # users vs delay: stays near zero, then linear up (queuing delay)]

How to find the bottleneck?
  1. measure utilization of each resource
     perhaps sample: busy/idle, what is it doing? (profile...)
     e.g. CPU 100% busy, disk only 20% busy
     but maybe not conclusive, e.g. CPU 50% used and disk 50% used,
       just at different times
  2. model
     net should take 10 ms, CPU 50 ms, disk 10 ms
  3. guess
     you will probably have to do this anyway

You may need to make application logic more efficient
  fewer features, better algorithms
  I cannot help you with that -- application-specific

System performance techniques (i.e. what O/S infrastructure can provide)
  1. caching
  2. batching
  3. concurrency
     I/O concurrency
     parallel hardware
  4. scheduling

I'm going to use the disk as the main example
  disk is often the main bottleneck in real systems
  every year it gets harder to be CPU-bound
  crucial to use the disk efficiently
  Hitachi 7K400: 400 GB, 7200 RPM, 8.5 ms read seek, 9.2 ms write seek,
    567 to 1170 sectors/track, 10 heads, about 87,000 cylinders

primer on disks
  [disk slide]
  physical arrangement
    rotating platter: 7,200 RPM = 120/sec = 8.3 ms per rotation
      continuous rotation
    circular tracks, 512-byte sectors, about 1000 sectors/track
      r/w in sector units
    perhaps 100,000 tracks, concentric rings
    multiple platters, 5 for the 7K400, so 10 surfaces
    cylinder: set of vertically aligned tracks
    one disk arm, one head per surface, they all move together
      can only read from one head at a time
  three steps to read or write a sector:
    "seek" the arm to the desired track: varies with distance, 1 to 15 ms
    wait for the disk to "rotate" the desired sector under the head: 0 to 8.3 ms
      each sector has a header with sector # and ECC
    r/w the bits as they rotate under the head at 7200 RPM
  disk access patterns: two broad classes
    big multi-track sequential transfers: one track per rotation
      512*1000 / 0.0083 = 62 megabytes/second
      that is fast! unlikely to be a bottleneck for most systems
        DRAM is only 10x faster, a typical net only 2x faster
    small transfers, e.g. a 5000-byte web page, from a *random* disk location
      seek + rotate + transfer
        avg seek: 9 ms
        avg rotate: 1/2 a full rotation for a random sector
        transfer: size / 62 MB/sec
      9 + 4 + 0.1 = 13.1 ms
      rate = 5000 / 0.0131 = 0.4 MB/sec
      i.e. about 1% of the big sequential rate. this is slow.
      sadly this is typical of real life
  Lesson: avoid random disk access!
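  a quick back-of-envelope check of the numbers above, as a small Python sketch
  (the constants are the lecture's example figures, not measurements of any
  particular drive):

    # rough disk-access arithmetic, using the example numbers from above
    SEEK_S     = 0.009        # average seek
    ROTATION_S = 0.0083       # one rotation at 7200 RPM
    XFER_BPS   = 62e6         # sequential transfer rate, bytes/sec

    def random_read_time(nbytes):
        # one random read = seek + half a rotation (on average) + transfer
        return SEEK_S + ROTATION_S / 2 + nbytes / XFER_BPS

    t = random_read_time(5000)                               # a small web page
    print("random 5000-byte read: %.1f ms" % (t * 1e3))      # ~13 ms
    print("random-read rate: %.2f MB/s" % (5000 / t / 1e6))  # ~0.4 MB/s
    print("sequential rate: %.0f MB/s" % (XFER_BPS / 1e6))   # 62 MB/s

  the roughly 150x gap between ~0.4 and 62 MB/s is why the layout and caching
  ideas below matter so much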
if you r/w entire big files, lay them out contiguously
  not as easy as it seems
  the FS needs to allocate contiguous sectors as you write
    so the FS must organize info about free sectors appropriately
    bad: linked list, as in 1970s UNIX
    better: bitmap
  danger: free-space fragmentation, if you create then delete files
    prevents allocation of large contiguous areas

if you read lots of small pieces of data, group them
  UNIX FS: dir inode, dir entry, file inode, indirect block, block
  original UNIX placed them in essentially random parts of the disk
    that's many 13-ms random accesses just to open()+read()
  modern FSes put dir/inodes/content in nearby tracks
  (is layout a form of scheduling?)

caching disk sectors
  use some of RAM to remember recently read disk sectors
    this is often the main purpose of RAM...
    your O/S kernel does this for you
    table:
      SN    DATA
      ...   ...
  read(sn):
    if sector in cache, use it
    else:
      evict some sector
      read sn from disk
      put sn, sector into cache
  how to calculate the performance win from caching?
    the usual analysis tool is "average access time"
    average access time = hit time + miss time * miss ratio
      hit time: about 0 s
      miss time: 0.010 s
      miss ratio: fraction of accesses that aren't in the cache
        depends hugely on the access pattern
    example: 100 small files
      90% of accesses are for just 10 files
      so make the cache able to hold 10 files
      then the miss ratio will likely be just over 0.1
      avg access time = 0 + 0.01 * 0.1 = 0.001 seconds
  eviction / replacement policies
    goal: evict the cache entry that's unlikely to be used soon
    least-recently-used (LRU) usually works well
      intuition: used recently => likely to be used again soon
      good if some data is more popular than other data, e.g. the UNIX root directory
      maybe implement with a double-ended list (see the sketch at the end of
        this section)
    Question: when would LRU work poorly?
      app loops over a huge file that doesn't fit in the cache
        if a block has been used recently, it won't be used again for a while!
        LRU would evict everything useful, e.g. the root directory,
          yet the cache contents would never be useful
      MRU? random?
  how to decide if caching is likely to be effective?
    first: do you have a busy disk and an idle CPU?
    easy if all your disk data fits in RAM!
    harder: you have 1 GB of data on disk and 0.1 GB of RAM
      will that work out well?
    temporal locality?
      a small subset (< 0.1 GB) that is used a lot
        working set < cache size
      and the rest of your 1 GB is relatively unpopular
      e.g. home directories of currently-logged-in users
    spatial locality?
      if other content in a sector is needed soon after the first access
      e.g. ls -l looks at each dir entry in turn
        a sector holds many directory entries
        maybe the FS puts the inodes of files in a directory in the same sector
      one reason the FS uses block = multiple sectors
        once you've seeked, why not read multiple sectors
        also decreases book-keeping (free list, inode block array)
    some access patterns have no locality
      reading huge files
      uniform random reads
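  a minimal Python sketch of the read(sn) path above with LRU eviction; the
  capacity, sector size, and read_from_disk stand-in are made up for illustration:

    from collections import OrderedDict

    CACHE_SLOTS = 1000                  # illustrative capacity, in sectors
    cache = OrderedDict()               # sector number -> data, oldest first

    def read_from_disk(sn):
        # stand-in for a real disk read (~10 ms if it's a random access)
        return b"\x00" * 512

    def read(sn):
        if sn in cache:                 # hit: ~0 cost
            cache.move_to_end(sn)       # mark as most recently used
            return cache[sn]
        if len(cache) >= CACHE_SLOTS:   # miss: evict the least recently used
            cache.popitem(last=False)
        data = read_from_disk(sn)       # the expensive part
        cache[sn] = data
        return data

    # average access time = hit time + miss time * miss ratio,
    # e.g. 0 + 0.010 * 0.1 = 0.001 s per access if 90% of reads hit

  the loop-over-a-huge-file pattern above defeats exactly this policy: every
  read misses, and every insertion evicts something more useful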
i/o concurrency
  what if each request uses both CPU and disk?
    CPU:  A--      B--      C--
    Disk:     A--      B--      C--
    both are idle half the time
  we can double throughput by overlapping CPU and disk
    CPU executes B while the disk seeks for A
  works well if many activities are waiting, so there is always something
    to run or read
  requires a way to concurrently execute many activities and switch between them
    threads! handy even if there is only one CPU
    yield the CPU with wait() while waiting for the disk
    the disk interrupts when done; the interrupt handler calls notify()
  special case: prefetch into the cache (works with only one thread)
  special case: write-behind out of the cache
  what if CPU requirements << disk?
    the above example only works well if CPU and disk demands are roughly equal
    concurrency may still win if some requests hit in the cache and some miss
      [time diagram: hit mmmiiisss hit hit]
      let hits complete from the cache while a miss is waiting for the disk
    again, often implemented with threads

writes are a special problem
  reads are often cacheable; writes often must go to the disk (in case of a
    crash) as well as to the cache
  result: caching is not very effective for write-intensive applications
  worst case is random writes (seeks)
    e.g. a mail server storing incoming messages, each in a small file
  many good ideas for coping with writes, e.g. logging, discussed later in the course

useful to think of two disk performance axes, matching access patterns:
  sequential vs random, read vs write
                          Disk    SSD
    seq read  (MB/sec)      60    145
    seq write (MB/sec)      60    123
    rand read  (IO/sec)     77   6800  (4 KB reads; 28 MB/s)
    rand write (IO/sec)     77   1000  (4 MB/s)

there are faster storage technologies than the hard disk
  solid state disk (SSD) -- flash with a fast SATA "disk" interface
    flash retains information even without power
  you might think it would be very fast, since there are no moving parts
  on my OCZ Vertex 30-GB drive:
    seq read:  145 MB/s
    seq write: 123 MB/s
    rand 4K read:  6800/s (28 MB/s)
    rand 4K write: 1000/s (4 MB/s)
  why such poor random write performance? "write amplification"
    must erase flash before writing or re-writing
    erase takes one or two milliseconds
    only erases in 512-KB blocks! due to chip layout
    so a small write requires:
      read 512 KB into the DRAM cache
      erase 512 KB
      write 512 KB back to flash
  the drive is actually smarter than that
    a remapping table lets each smallish page be stored in any erase block
    keep one block open for writing
    a one-page write:
      write to the open block
      update the page mapping to point to the open block
    decreases write amplification
    costs: table size, power-failure handling
  performance isn't everything:
              Disk   SSD
    $ / GB     0.1     2

scheduling
  sometimes changing the order of execution can increase efficiency
  suppose many small disk requests are waiting (from concurrent threads)
  use the "elevator algorithm": sort by track, r/w in track order
    before: 71 10 92 45 29
    after:  10 29 45 71 92
    (a small sketch appears after the rough numbers below)
  results: short seeks (1 ms instead of 8 ms)
    the bigger the backlog, the smaller the average seek
    the system gets more efficient as load goes up! higher throughput
  maybe you can sort rotationally also, but that's hard
  cons: unfair -- increases delay for some requests

what if one disk doesn't provide enough performance?
  use multiple disks in parallel
  how to split the data over the disks?
  big reads, throughput-limited?
    "stripe" each file across multiple disks
      first 1 MB on disk 1, 2nd MB on disk 2, 3rd MB on disk 1, &c
    app does a big read; all disks read their part in parallel
    app gets data N times faster
  small reads, seek-limited?
    put each read unit on only one disk (e.g. each customer record)
    so each app read touches only one disk
    the disks can seek for multiple different records at the same time
  load balance is critical in both cases
    2 disks only get 2x the performance if each does half the work
    easy to get imbalances from popular data

here are some handy rough numbers for thinking about performance
  latency:
    0.000001 ms : instruction time (1 ns; really maybe 1/3 ns)
    0.0001 ms   : DRAM load (100 ns)
    0.2 ms      : LAN network (e.g. RPC to a nearby server)
    10 ms       : random disk I/O
    25 ms       : Internet, east coast -> west coast
  throughput:
    1000 MB/s : DRAM
    100 MB/s  : LAN (maybe 1000 MB/s)
    100 MB/s  : sequential disk
    1 MB/s    : random disk I/O
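  the elevator-scheduling sketch promised above, in Python; the queued track
  numbers are the example from the scheduling notes, and the starting arm
  position is arbitrary:

    # serve queued disk requests in track order instead of arrival order
    pending = [71, 10, 92, 45, 29]      # tracks of waiting requests

    def arm_travel(order, start=0):
        # total number of tracks the arm sweeps over for a given service order
        pos, total = start, 0
        for track in order:
            total += abs(track - pos)
            pos = track
        return total

    fifo     = list(pending)            # arrival order
    elevator = sorted(pending)          # one sweep across the disk

    print("FIFO:    ", fifo,     "travel =", arm_travel(fifo))      # 277 tracks
    print("elevator:", elevator, "travel =", arm_travel(elevator))  #  92 tracks

  the longer the queue, the closer together the sorted tracks, so the average
  seek per request shrinks as load grows -- at some cost in fairness to requests
  at the far end of the sweep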
demo:
  1. run httpd on a port on Linux
     modify to use the same port
     fadvise (so file data isn't kept in the FS cache; step 2 turns this off)
     single-threaded server
     echo 1 > /proc/sys/vm/drop_caches    (flush the FS cache)
     cat garbage > /dev/null              (flush the disk's cache!)
     all clients access the same file concurrently
     ./simpleclient -n 10
     run iostat -> the disk is the bottleneck
     draw the throughput and latency graphs
  1b. measure disk time
      cat htdocs/index.html > /dev/null
  2. add caching
     turn off fadvise
     second client request is fast: 131 -> 0.3
     improved latency and throughput
  3. add concurrency
     what if something isn't in the cache?
     problem: slow again. throughput is limited by the misses
     modify httpd to be multithreaded to get I/O concurrency
       (a minimal sketch follows below)
     the client must be multithreaded too
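  a minimal sketch of step 3, assuming a toy file-serving protocol (the port,
  document root, and request format here are invented, not the demo's actual
  httpd): one thread per connection, so a request that misses in the cache
  blocks only its own thread while others keep the CPU and disk busy

    import socket, threading

    DOCROOT = "htdocs"        # hypothetical document root
    PORT = 8080               # hypothetical port

    def handle(conn):
        # serve one request: read a file name, send back the file's bytes
        with conn:
            name = conn.recv(1024).decode().strip()
            try:
                with open(DOCROOT + "/" + name, "rb") as f:
                    conn.sendall(f.read())   # may block on disk; other threads keep running
            except OSError:
                conn.sendall(b"error\n")

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(64)
    while True:
        c, _ = srv.accept()
        # one thread per request: a cache miss stalls only that thread
        threading.Thread(target=handle, args=(c,), daemon=True).start()

  the load generator needs the same treatment: unless the client issues many
  requests concurrently, the server never has more than one request to overlap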