6.033 Handout 21, 28 March 2000

6.033 Spring 2000 Design Project 1: Sample Analysis

Analysis of energy-efficiency

This note attempts to clarify the details of the energy analysis asked in the design of the compression-based VM system for DP1.  The one-sentence summary is that this analysis is complicated and depends on what the assumption is about the system bottleneck.  The problem is that the specification does not provide enough direction to make an informed decision on where the bottleneck really is.  This note analyzes the different potential bottlenecks and describes a complete analysis.  The end-result is that the expected energy advantage of the compressed virtual memory (CVM) system is at least 17.5X and can be as high as 28X.  The variation is due to the expected fraction of pages that are "dirty" and need to be written to disk upon eviction.

There are three potential bottlenecks in the system: the disk (too much paging activity to/from disk), the CVM system (a CPU bottleneck), and application processing (a CPU bottleneck).  Let's look at each in turn:

I. Disk bottleneck

The disk takes 10ms per operation (read or write of an 8-KByte page). It can therefore handle 100 operations per second.  A RAM miss causes a page to be read from disk.  If the page to be evicted by the replacement policy has been written to previously, it is dirty and needs to be written to disk as well.  Thus, if the RAM miss rate is m, the maximum number of memory references per second that can be handled by the system is between 50/m (all pages are dirty) and 100/m (no pages are dirty).  In particular, when we use 32MBytes of RAM without compression, m = 0.01, and the number of memory references per second cannot exceed 5000 to 10000.

Let fd be the fraction of dirty pages.  The amount of time spent idle by the CPU, per memory reference, is given by 0.01*10*(1+fd)ms and the number of memory references per second cannot exceed 100/ {(0.01) * (1 + fd) = 10000/(1 + fd).

II. CVM bottleneck

Let's calculate the number of pages per second that can be compressed and decompressed to understand where the bottleneck might be in CVM.  At 200 MHz, 1 instruction/cycle, no CPU stalls, and each 8-KByte page taking 32,768 instructions to compress and the same number to decompress, the number of pages that can be compressed and decompressed in 1 second = 1 / (2*32768/200*106) = 3,051.  Notice that in general, reading in a page from compressed memory to uncompressed memory requires both compression (of a page being moved to compressed memory) and a decompression (of the page being accessed).

Suppose we have M MBytes of RAM (M=32 MBytes for us) and that X MBytes are uncompressed and (M - X ) are compressed.  Then, the miss rate to compressed memory = m(X), where m() is the miss-rate function given by the table in the specification.  The maximum number of memory references per second, r, that can be handled can then be found using the equation:

r * m(X) = 3051 ==> r = 3051 / m(X)

When X = 16 MBytes, m(X) = 0.04 and = 76,275 memory references per second.

When X = 16 MBytes, the total amount of compressed + uncompressed memory = 16 + 4*16 = 80 MBytes, and m(80) = 0.0156%.  This means that the maximum number of disk operations per second required = 76275 * 0.000156 = 12 when no pages are dirty, and 12*2 = 24 if all pages are dirty.  In terms of the fraction of pages that are dirty, the number of disk operations cannot exceed 12(1+fd).

III. Application processing bottleneck

Suppose that the bottleneck in the system is the application's CPU processing.  When there's no compression and we use 32 MBytes of RAM, the energy consumed by a single disk operation is 2.3W * 10ms = 23 mJ, which implies that writing a dirty page and reading in the required one consumes 46 mJ.  This happens for 1% of all RAM references.  So, the average energy per RAM reference is 0.46 mJ for dirty pages.  In terms of the fraction of pages that are dirty, the average energy per RAM reference is 0.23(2-fd). Call this Eram_nocomp.

Eram_nocomp = 0.23(1 + fd)                (1)

The energy required to compress and decompress an 8-KByte page at 200 MHz = 2*32768/(200*106) * 400 * 10-3 = 0.13 mJ.  For the same (as in the uncompressed case) RAM reference pattern and processor caches, the amount of energy consumed by compression and decompression and disk access per RAM reference can be calculated as: {m(16) - m(80)} * 0.13 + m(80)* 23 * (1+fd) =  (0.009+0.003fd) mJ.  Call this Eram_comp.

Eram_comp = 0.009 + 0.003fd           (2)

IV. Should we spin down the disk?

The miss rate table in the DP1 specification does not capture any information about the frequency (in time or as a fraction of the number of instructions) of memory accesses in the program.  Furthermore, most modern processors have instruction and data caches that absorb many of the instruction fetches and load/store references.  To see whether it is worth spinning the disk down from time to time to conserve energy, we need to introduce a new parameter, R, which is the number of RAM references for every 1,000,000 executed program instructions.

Consider any VM system for this computer and let d be the average time (in seconds) between two successive accesses to disk.  When idle, the disk consumes 0.1W of power, but requires 4.7W * 20ms = 94mJ to spin up.  On the other hand, if d is small, then it may be worth keeping the disk spinning, but this takes up 0.95W of power.  Thus, it is worth spinning the disk down provided 0.1d + 94 < 0.95d, or d > 110.5 ms.

In 110.5ms, the processor running at 200 MHz can execute about 22 million instructions, and the effective RAM miss rate for CVM must be as small as 4.5*10-8, which is orders of magnitude smaller than m(80), the miss best rate for CVM.  Thus, we arrive at the conclusion that spinning down the disk is a losing proposition.
 

V. A consolidated analysis

Having calculated the energy consumed in various opertations and established that the disk must continue spinning, we are now in a position to consolidate all this information and perform a complete analysis of the system.  The question we ask is: What's the energy consumed in executing 1,000,000 instructions of the application program for the original VM and for CVM, and take the ratio to obtain the energy efficiency of CVM.

This depends on the total time of execution as well as on R.  If R is very small, then both compression isn't a big win because there are hardly any RAM references and therefore hardly any accesses to disk.  If R is large, then compression could be beneficial because it reduces the number of accesses required to disk.  We perform the analysis assuming that there is exactly one user program being run on the system.

Regardless of whether compression is used or not, the disk continues to spin, so the energy it consumes is 0.95T, where T is the total execution time of the program.  The processor takes up 400 mW at 200 MHz, and if we clock it down to 33 MHz while waiting for disk (this would be a good design decision), it dissipates proportionally less power.

For both the original VM and CVM, the energy consumption is given by:

E = Pspin*T + Pcpu_busy * Tcpu_busy + Pcpu_idle * Tcpu_idle + R * Eram                    (3)
where T = time to execute instructions + time to wait for paging to occur
= Tcpu_busy  + Tcpu_idle, and
Eram = average energy consumed per memory reference (Section III, equations 1 and 2)

We know all the power numbers: Pspin = 0.95W, Pcpu_busy = 0.4W, and Pcpu_idle = 0.4*33/200W = 0.066W.

Case 1: No compression

With no compression, the Tcpu_busy = 5ms (1 million instructions at 200 MHz).  The CPU idle time is obtained from Section I as 0.1(1+fd)*R ms.  Therefore,

Tnocomp = (5 + 0.1R + 0.1fdR) ms                  (4)

Using Equation (1) and Equation (3), we find that:

Enocomp = (6.75 + 0.332R + 0.332fdR) mJ   (5)

Case 2: CVM

For CVM case, calculating the execution time Tcomp is a little more complicated since it depends on the likelihood that a page not in uncompressed RAM is in compressed RAM and on the likelihood that it is on disk.  This likelihood is given by the miss rate m(): the probability that a page is in compressed RAM = m(16) - m(80) and the probability that it's on disk = m(80).  Thus,

Tcomp = {(1*106)/ (200 * 106) + (time to compress & decompress)*[m(16) - m(80)]*R + 0.01(1+fd)*m(80) * R} seconds.

Section II tells us the time to compress & decompress a page, it is 1/3051 ms = 0.327ms.  During compression/decompression, the CPU runs at 200 MHz, m(16) = 0.04, and m(80) = 0.000156, so

Tcpu_busy = (5 + 0.013R) ms            (6)

Tcomp = (5 + 0.013R + 0.00156R + 0.00156fdR) ms = (5 + 0.015R + 0.00156fdR) ms  (7)

Using Equations (2), (3), and (6) allows us to express Ecomp after some algebra:

Ecomp = (6.75+ 0.019R + 0.005fdR) mJ          (8)

The bottom line

The energy ratio between the two schemes is therefore the ratio of Equations (5) and (8), and depends on the number of memory references per million executed instructions.  If R is large (say at least a few hundreds or thousands of RAM references per million instructions, then this ratio tends toward 0.664/0.024 = 27.7 when all pages are dirty, and towards 0.332/0.019 = 17.5 when no pages are dirty.  Either way, a substantial win for CVM.

If R is small (but not tiny enough to spin the disk down), then there is no significant benefit to compression, but this cannot usually happen in reality.  For a typical value of R, suppose that the program makes 1.25 memory accesses per instruction (one to fetch and every fourth is a load/store).  Suppose that one in ten (10%) references make it to RAM past the cache.  Then R = 1.25 * 106 * 0.1 = 1.25 * 105

(As an aside, Patterson and Hennessy, in their Computer Architecture book claim that the Spec92 suite sees a cache miss rate of about 1% for a 128-KByte processor cache.  In this case, R = 1.25 * 106 * 0.006 = 7,500.  The corresponding energy benefit of compression would be about 28.8X in this case too.  But this value of L1 miss-rate doesn't really fit in with the numbers mentioned in the project specification.)

Also note that this analysis is for a single program running on the computer.  The case when the computer runs many concurrent programs is analogous, but the results are somewhat different under the assumption that the CPU is never idle (since some other program can execute when one is blocked).  The way to do this analysis is to start with a total amount of energy E and calculate the total system lifetime until this energy is exhausted for this workload.  This may be an instructive exercise for the reader to do.

In conclusion, we find that the energy benefits of CVM are between 17.5X and 27.7X depending on the fraction of dirty pages, for a single program running on our handheld computer.

Acknowledgment

I thank Jerry Saltzer for describing the situations when the processor isn't the bottleneck (Sections I and II), showing why my original calculations (Section III) didn't capture the complete picture.  Although he made several good suggestions that improved the presentation of this note, I can't blame him for any errors that might be present!  Please send comments, suggestions, or bug reports/fixes to hari@lcs.mit.edu.

Last update: Fri Mar 24 19:54:48 EST 2000 (hari)