Reconfiguring Issue Logic for Microprocessor Power/Performance Throttling


Edwin Olson eolson@mit.edu, Dave Maze dmaze@mit.edu, Andrew Menard armenard@mit.edu


Abstract—A growing need for computational power in mobile devices has spawned increased interest in low-power microprocessors. Some low-power applications, such as real-time video decoding on Personal Digital Assistants, also require very high performance. A growing body of work has examined how to provide this high performance at low power, and how to throttle performance so that power consumption can drop to very low levels when only modest performance is needed. Observing that the issue logic in an out-of-order microprocessor consumes a significant amount of power, several groups have attempted to modify the issue logic so that it can enter low-power modes dynamically. We revisit these topics, and our results show that simple approaches to modifying the issue logic do not reduce the average energy per instruction. We also examine the possibility of including a low-power single-issue processor on the same die as a high-performance multiple-issue processor, so that the system can swap between them to dynamically trade off power against performance.

 

Index terms—Issue Window, Issue Logic, Out-of-Order, Low Power, Power/Performance Throttling

 

I.     INTRODUCTION

 

Much of the thrust of recent computer architecture work has been in search of increased performance.  As transistor budgets increased, more and more technologies from mainframes were incorporated in microprocessor designs. The product of this evolution was high performance microprocessors that sacrificed power consumption for performance. With the emergence of low-power markets, these speed demons have been retrofitted to consume less power by incorporating clock gating, voltage scaling, and more recently, dynamic resizing of key architectural features such as the issue window.

 

Many existing techniques for reducing power are well established and extremely effective, including dynamically reconfiguring the cache [1] and voltage scaling. Reducing the supply voltage of a microprocessor has a roughly linear effect on performance (due to weaker electric fields) but a squared effect on power dissipation (since dynamic power consumption is proportional to ½·C·V²·frequency).
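This scaling relationship can be sketched numerically; the capacitance, activity factor, and frequency constants below are arbitrary placeholders, not measured values:

```python
# Dynamic power model: P = a * C * V^2 * f, with f roughly proportional to V.
# All constants are illustrative placeholders, not measurements.
def dynamic_power(v, a=0.5, c=1.0, f_per_volt=1.0):
    f = f_per_volt * v          # performance (frequency) scales roughly linearly with V
    return a * c * v**2 * f     # power scales with V^2 * f, i.e. roughly V^3

# Halving the supply voltage halves performance but cuts power by ~8x,
# so energy per instruction (power / frequency) falls by ~4x.
p_full, p_half = dynamic_power(1.8), dynamic_power(0.9)
print(p_half / p_full)  # ~0.125
```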

 

When comparing two architectures for power efficiency, it is tempting to use a metric such as energy/instruction or the power-delay product. However, one must also take the required performance level into account. A processor with seemingly poor energy/instruction characteristics, but with more performance than required, can be run at a lower voltage, reducing both its performance and its energy/instruction. That processor might then be much more attractive.
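This tradeoff can be made concrete with hypothetical numbers (the MIPS and nJ figures below are invented for illustration, under the usual first-order model that performance scales with V and energy/instruction with V²):

```python
# Hypothetical comparison: processor A has worse energy/instruction than B
# at full voltage, but exceeds the required performance, so it can be
# voltage-scaled down. Model: performance ~ V, energy/instruction ~ V^2.
def scale_to_performance(perf, epi, required_perf):
    s = required_perf / perf   # voltage scaling factor needed to just meet the target
    return epi * s**2          # energy/instruction after scaling

# A: 2000 MIPS at 20 nJ/instr; B: 1000 MIPS at 15 nJ/instr; 1000 MIPS required.
epi_a = scale_to_performance(2000, 20.0, 1000)   # -> 5.0 nJ/instr after scaling
epi_b = scale_to_performance(1000, 15.0, 1000)   # -> 15.0 nJ/instr (no headroom)
print(epi_a, epi_b)
```

Despite its worse nominal energy/instruction, the faster processor wins once its performance headroom is converted into voltage reduction.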

 

While voltage scaling is a very good way of providing additional power/performance modes, it has limits. As the operating voltage approaches the threshold voltage of the transistors, transistor performance begins dropping off much faster than linearly. And as threshold voltages are reduced to compensate, leakage currents increase, which increases power consumption. The SIA predicts that in the year 2005, supply voltages for low-power applications will be 0.9-1.2V, compared to a typical modern supply of 1.8V [3]. Even with this reduction in supply voltage, the SIA predicts a net increase of 70% in the total power requirements of battery-operated devices. Clearly there is a need for additional power/performance throttling mechanisms.
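The falloff near threshold can be illustrated with the alpha-power delay model, a common first-order device model (the threshold voltage and alpha values below are illustrative, not taken from the SIA roadmap):

```python
# Alpha-power-law frequency model: f ~ (V - Vt)^alpha / V.
# Vt and alpha are illustrative values only.
def rel_frequency(v, vt=0.4, alpha=1.3):
    return (v - vt)**alpha / v

# Near the nominal supply, frequency tracks voltage roughly linearly, but
# as V approaches Vt the achievable frequency collapses much faster.
f_nominal = rel_frequency(1.8)
for v in (1.2, 0.9, 0.6):
    print(v, rel_frequency(v) / f_nominal)
```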

 

It has been observed that one major power drain in modern out-of-order processors is the issue logic: every clock cycle, every instruction in the issue queue must be checked to see if it can be dispatched. Completing instructions broadcast the availability of new operands on long bit lines spanning the entire issue window. Some processors, such as the Alpha 21264, compact the issue queue in order to implement an oldest-first priority algorithm, and this process requires even more energy. In the 21264, between 18 and 46 percent of the total power of the processor is consumed by the issue logic [4].
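The per-cycle cost of the operand broadcast can be caricatured with a simple tag-comparison count (an illustrative model, not the 21264's actual circuit):

```python
# Toy count of wakeup-logic work: each completing instruction broadcasts its
# result tag across the window, and every waiting entry compares both of its
# source tags against every broadcast. Illustrative model only.
def wakeup_comparisons(window_entries, completions_per_cycle, tags_per_entry=2):
    return window_entries * completions_per_cycle * tags_per_entry

print(wakeup_comparisons(16, 4))   # -> 128 tag comparisons per cycle
print(wakeup_comparisons(64, 8))   # -> 1024: growing the window multiplies the work
```

The work grows with the product of window size and completion width, which is why the issue window is an attractive target for throttling.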

 

 

Thus, methods of scaling back the size of the issue window and the number of instructions issued each cycle have been proposed to minimize this source of power consumption, at the cost of reduced performance. These methods are compatible with cache disabling and voltage scaling: for the maximum reduction in power consumption, power management software could simultaneously reduce the voltage to the lowest possible level, disable parts of the cache, and shrink the issue window, or it could find intermediate power/performance points by applying only one or two of these optimizations. We also consider an alternate scheme that bypasses complex issue logic completely: placing an in-order, single-issue core alongside the out-of-order multiple-issue core, with the OS able to swap between them, thus avoiding the issue logic altogether when necessary. In this paper, we do not assume a required performance level that must be achieved, but instead consider the metric of energy consumed per instruction.

 

Several studies have shown that relatively simple modifications to an operating system can allow it to perform this kind of throttling without spending an excessive amount of time profiling the code being executed [*,*].

II.     Methodology

 

In order to conduct our study, we needed to measure the impact of changing architectural resource sizes (such as the number of slots in the issue window) on both power and performance. The SimpleScalar toolset provides detailed performance simulators [5]. SimpleScalar's out-of-order simulator uses an unusual microarchitecture built around a “Register Update Unit” (RUU), an architectural resource combining the functions of the issue window and the register renaming unit. This is somewhat unfortunate, since it does not correlate well with real physical chips.

 

However, the results we obtain from these studies can still yield insight into the effects of scaling architectural features, and the relative results are still meaningful. Many other architecture studies have also used SimpleScalar, so our results can be directly compared with them. Future work may involve repeating our studies with a model more closely resembling commercially successful architectures.

 

SimpleScalar does not provide a mechanism for measuring power. However, several research groups have added power models to SimpleScalar, such as David Brooks' Wattch tool [6] and the Cai-Lim models [7]. Wattch's models were better suited to our study, since they are heavily parameterized and can therefore reflect various changes in configuration without requiring a new SPICE model for each variation. We used version 1.02 of the Wattch power.c model.

 

Power estimation tools like Wattch and the Cai-Lim models have recently come under considerable scrutiny [*]. Ghiasi and Grunwald and others have shown that not only do direct comparisons of absolute energy measurements disagree hopelessly between tools, but even relative comparisons often fail to agree. A key problem is that an architectural description simply does not contain enough information to properly estimate power, and even reasonable parameterized models quickly become unrealistic when their parameters are pushed beyond a limited range. For example, the Wattch CAM model used for the RUU structure is reasonable for a 16-entry structure, but a 256-entry structure would have been implemented in a completely different way; it might, for example, have been banked. We therefore limited our study to modifying parameters by only small factors.
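To illustrate the extrapolation problem, consider a crude parameterized CAM search-energy model in the spirit of (but not identical to) Wattch's; the constants are arbitrary:

```python
# Sketch of a parameterized CAM energy model (NOT Wattch's actual code;
# constants are arbitrary). Energy per search grows with the number of
# entries because every entry's matchlines and taglines are driven.
def cam_search_energy(entries, tag_bits=8, e_cell=1.0):
    return entries * tag_bits * e_cell   # all entries participate in a search

# The linear extrapolation is the problem: a 256-entry CAM would not be
# built as one flat array (it would likely be banked), so the model's
# prediction at that size is unrealistic.
print(cam_search_energy(16))    # -> 128.0: plausible regime
print(cam_search_energy(256))   # -> 2048.0: 16x, an unrealistic design point
```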

 

We primarily consider 4-issue and 8-issue processors with varying RUU sizes. The other characteristics of our processors are listed in Table 1. SimpleScalar allows many parameters to be adjusted, but we changed only a handful; Table 1 is primarily a list of non-default settings.

 

Table 1. Processor configurations (non-default settings).

                         4-issue   8-issue
Decode Width                4         8
Commit Width                4         8
Load Store Queue Size       8         8
Integer ALUs                4         6
Integer Multipliers         1         2
FP ALUs                     4         4
FP Mul/Div                  1         2
Memory Ports                2         4
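For reference, a configuration like the 4-issue column above would be expressed to SimpleScalar's sim-outorder roughly as follows; the flag spellings follow sim-outorder's standard options but should be verified against the local build, and the benchmark binary and arguments are placeholders:

```shell
# Sketch of a sim-outorder invocation for the 4-issue configuration.
# Flag names follow sim-outorder's standard options; verify them against
# your local SimpleScalar build. The benchmark is a placeholder.
sim-outorder \
  -issue:width 4 -decode:width 4 -commit:width 4 \
  -ruu:size 16 -lsq:size 8 \
  -res:ialu 4 -res:imult 1 -res:fpalu 4 -res:fpmult 1 -res:memport 2 \
  benchmark.ss benchmark-args
```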

 

 

Our benchmarks are derived from the SPECint95 suite. Due to the limited speed of the SimpleScalar simulator (about 90k instructions per second), it was impractical to run the entire suite, or even a single complete benchmark. Instead, as is common practice in the simulation field, reduced input sets were used. These input sets take substantially less time to run, but still exercise the processor in much the same way as the official input sets. It follows that the performance data we generated cannot be compared to actual SPECint scores, but we are primarily interested in the relative performance of our various models.

 

 

Table 2. Benchmarks and reduced input sets.

Benchmark     Input
li            nqueens 6
perl          test.in
compress95    5000 q 2131

 

For all of these benchmarks, runtime was dominated by the kernel of the program rather than by initialization code. In addition, the simulator is completely deterministic, so there is no need to repeat simulations and average the scores.

 

III.     Determining Optimal RUU Capacity

 

Understanding the optimal size of the Register Update Unit is an essential first step in deciding how far its capacity can be reduced. Several factors influence this optimal size. The goal of the RUU is to always have enough instructions ready to feed the available functional units. As the number of functional units increases, the size of the RUU should intuitively increase to provide more candidate instructions. However, due to data dependencies, it is often the case that the number of instructions that can be fetched is greater than the number that can be issued. We want the RUU to hold a certain “surplus” of instructions so that when an instruction cache miss occurs (and the fetch rate drops to zero) the functional units can be kept busy, but there is no reason to make the RUU unreasonably large.

 

Our first experiment's goal was to determine an absolute upper bound on the size of the RUU. We configured SimpleScalar to use an extremely large RUU and modified the simulator to collect statistics on the occupancy of the RUU every cycle. The resulting structure could hold enough instructions to keep the functional units busy for dozens of cycles, and is therefore excessive; however, it provides an upper bound on the RUU size from which to work.
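The statistic we collected can be illustrated with a toy fetch/issue model (the miss rate and readiness fraction below are arbitrary illustrative parameters, not calibrated to SimpleScalar): fetch occasionally stalls, data dependences leave only a fraction of buffered instructions ready, and the remaining occupancy is recorded each cycle.

```python
import random

# Toy model of per-cycle RUU occupancy collection. Parameters are
# arbitrary illustrative values, not measurements.
def occupancy_histogram(cycles=10000, fetch_width=4, issue_width=4,
                        miss_rate=0.1, ready_frac=0.3, seed=0):
    rng = random.Random(seed)
    occ, hist = 0, {}
    for _ in range(cycles):
        if rng.random() >= miss_rate:   # fetch stalls entirely on a "miss"
            occ += fetch_width
        ready = int(occ * ready_frac)   # dependences limit ready instructions
        occ -= min(ready, issue_width)  # issue up to issue_width of them
        hist[occ] = hist.get(occ, 0) + 1
    return hist

hist = occupancy_histogram()
# Occupancy stays bounded well below the worst case, mirroring Figure 1's
# observation that a very large RUU is mostly empty.
print(max(hist), sum(hist.values()))
```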

 

Figure 1.

 

 

 

Figure 2.

 

 

Figure 1 shows that for all three benchmarks, the RUU almost never contains more than 32 instructions at either issue width. Making the RUU any larger than 32 entries would serve no purpose; the extra entries would be empty virtually all the time.

 

When the RUU's physical size is bounded, RUU usage closely mirrors the unlimited case, except that the RUU “saturates”. In Figure 3, we see that a 16-entry RUU has almost exactly the same occupancy characteristics for occupancies between 0 and 15, and that the 16-entry RUU is fully occupied about as often as the unlimited RUU holds 16 or more entries. This is as expected; although Figure 3 uses the li benchmark, the other benchmarks behave the same way.
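This saturation behavior amounts to a first-order transformation of the unlimited-RUU histogram: occupancies below the capacity match, and the probability mass at or above the capacity piles up at "full" (the counts below are hypothetical):

```python
# Predict a bounded-RUU occupancy histogram from the unlimited-RUU one.
# First-order approximation: bounding the RUU is assumed not to change the
# underlying occupancy dynamics, consistent with Figure 3's observation.
def bound_histogram(unlimited_hist, capacity):
    bounded = {}
    for occ, count in unlimited_hist.items():
        key = min(occ, capacity)         # everything >= capacity saturates
        bounded[key] = bounded.get(key, 0) + count
    return bounded

hist = {8: 50, 12: 30, 16: 12, 20: 6, 24: 2}   # hypothetical cycle counts
print(bound_histogram(hist, 16))  # -> {8: 50, 12: 30, 16: 20}
```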

 

Figure 3.

 

The important question now is: if the RUU capacity is reduced below this ideal size, what happens to performance? We measure performance in terms of Instructions Per Cycle (IPC), since we cannot accurately determine changes in clock period from within SimpleScalar.

 

Figure 4.

 

Figure 5.

 

Figure 4 shows the performance of the processor, in terms of IPC, versus the capacity of the RUU. We notice immediately that for a 4-issue processor, the performance on compress and perl is very similar at RUU capacities of 16 and 32, with only a small increase for li. As we expected, there is almost no benefit to scaling the RUU beyond 32.

 

If we consider an 8-issue machine, we would expect its performance to drop off more rapidly than the 4-issue machine's as RUU capacity decreases: the RUU can (potentially) be depleted twice as quickly, so the processor is more likely to be unable to keep its functional units busy. We see precisely this behavior in Figure 5; there is a noticeable performance difference for both li and perl between RUU capacities of 16 and 32.

 

Some research groups have proposed dynamically varying the issue window capacity [*]. A parameterized model of an RUU will obviously predict substantially greater power consumption for a 32-entry RUU than for a 16-entry RUU. We must resist the temptation to declare this an efficient mechanism for throttling power/performance: while performance is affected, a power-conscious architect is unlikely to make the RUU so much larger for such a minuscule return. This is an uninteresting regime, since the IPC vs. RUU size curve is essentially flat.

 

An interesting question still remains, however. What happens to energy per instruction statistics as we decrease the RUU well into the region of decreased performance? It might be a good idea, for example, to allow a processor to dynamically decrease its RUU size from 16 to 8 if the decrease in power offsets the decrease in performance.

 

Using the Wattch tool, we measured the power consumption of the processors and calculated the average energy per instruction assuming optimal clock gating (Wattch’s cc3 models).
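The derived statistic is simply the average per-cycle energy divided by IPC; the numbers below are hypothetical, chosen to show how a reduction in power can still lose on a per-instruction basis:

```python
# Energy per instruction from simulator outputs: average per-cycle energy
# (from the power model) divided by IPC. Numbers below are hypothetical.
def energy_per_instruction(energy_per_cycle, ipc):
    return energy_per_cycle / ipc

# Shrinking a structure helps only if power falls faster than IPC does.
print(energy_per_instruction(24.0, 2.0))  # -> 12.0
print(energy_per_instruction(18.0, 1.2))  # ~15: less power, yet a net loss per instruction
```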

 

Table 3. Energy per instruction for a 4-issue processor, by RUU capacity.

Structure                 4x4    4x8    4x16   4x32   4x64
Energy/Inst (li)          15.8   13.0   11.8   12.8   14.1
Energy/Inst (perl)        16.5   14.3   13.6   14.7   16.1
Energy/Inst (compress)    14.4   11.5   10.6   11.3   12.5

 

Table 3 shows the average energy per instruction for each benchmark, for various RUU capacities of a 4-issue processor. We already expected the 4x32 and 4x64 configurations to be suboptimal, since their RUUs are essentially oversized. It is interesting, however, that the cost of executing instructions also increases when the RUU is shrunk below 16 entries: while the power consumption of the issue logic decreases with RUU capacity, performance drops super-linearly.

 

We can also see that we’re spending more energy per instruction on codes with less inherent parallelism (perl in particular). This makes sense since there are a lot of hardware resources in an out-of-order superscalar processor looking for parallelism to exploit, but there’s simply very little parallelism to be found. This overhead cost is being amortized over very few issued instructions every cycle and thus the average energy per instruction is higher.

 

We also note that we do not trust the power numbers for the extreme RUU configurations (x4 and x64), since they deviate from Wattch's baseline capacity by a large factor.

 


 

 

 

 

Table 4. Energy per instruction for an 8-issue processor, by RUU capacity.

Structure                 8x8    8x16   8x32   8x64
Energy/Inst (li)          13.8   12.5   13.4   14.9
Energy/Inst (perl)        15.1   14.7   15.8   17.6
Energy/Inst (compress)    12.4   11.4   11.9   13.3

 

In Table 4, we present energy-per-instruction statistics for an 8-issue processor, and we see trends very similar to those of the 4-issue processor. We do notice that the average energy tends to be lower on the 8-issue than on the 4-issue machine; we attribute this partially to lower average occupancy in the RUU, which means less logic must be evaluated every cycle. We are also somewhat skeptical of Wattch in this case, so we would not advise comparing absolute numbers between Tables 3 and 4, though we believe the relative trends within each table are reasonable. As in the 4-issue case, we observe that the 16-entry RUU is the minimum energy-per-instruction point.

IV.     Other Problems With Retrofitting

 

It seems that scaling a processor's issue window will not provide the power/performance throttling we would like. Other difficult issues also accompany retrofitting a complex microprocessor for low power.

 

We see from Figure ??? that the register file consumes a significant percentage of total power. SimpleScalar's RUU structure works in its favor here by minimizing the complexity of the register file; in a mainstream design, the register file is likely to consume even more power as issue width increases. The massive size of a many-ported register file makes any low-power use of it nearly impossible: even when ports are unused, their physical capacitance remains.

 


 

V.     Using a completely separate core

Since simply scaling back the issue logic appears not to generate much savings, we also looked at the possibility of placing multiple cores on a single chip. Comparing Figures 4 and 5 above, the 4-issue machine, with fewer functional units, appears to consume somewhat less power per instruction than the 8-issue machine, which is encouraging. However, numbers generated by SimpleScalar are not entirely comparable across significantly different architectures, so we surveyed available processors with comparable instruction sets and process technologies to check whether a simpler machine really would gain a significant benefit.

 

IBM's PowerPC line includes the 440 CPU, a dual-issue machine with a 7-stage pipeline, and the 405 CPU, a single-issue machine with a 5-stage pipeline, both implemented in the same 0.18-micron copper process [*,*]. The 440, operating at 550MHz, consumes approximately 1.0W and performs at 1000 mips on the Dhrystone 2.1 benchmark, while the 405, operating at 266MHz, consumes approximately 0.5W while performing 375 mips on the same benchmark. Thus, the energy per instruction on the 440 is approximately 1.0nJ, while on the 405 it is approximately 1.3nJ. This is a very disappointing result: the faster processor actually uses less energy per instruction. One would clearly benefit more from an approach like voltage scaling to reduce the total energy used across a calculation, or from simply running the faster processor until the calculation finishes and then putting it into a sleep mode; both alternatives also avoid the significant area overhead of the dual-processor approach.
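Expressed as energy rather than power per mips, the datasheet figures quoted above work out as follows:

```python
# Convert the quoted power and Dhrystone mips figures into energy per
# instruction: watts / (instructions per second), expressed in nJ.
def nj_per_instruction(watts, mips):
    return watts / (mips * 1e6) * 1e9   # J/instr -> nJ/instr

e_440 = nj_per_instruction(1.0, 1000)   # PowerPC 440: 1.0 nJ per instruction
e_405 = nj_per_instruction(0.5, 375)    # PowerPC 405: ~1.33 nJ per instruction
print(e_440, e_405)
```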

 

The mobile versions of the Intel Pentium III use voltage scaling to achieve a more than 50% reduction in power consumption while still delivering 70% of full performance.
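In energy-per-instruction terms, those figures imply:

```python
# Relative energy per instruction implied by the mobile Pentium III figures
# quoted above: relative power divided by relative performance.
def relative_epi(rel_power, rel_perf):
    return rel_power / rel_perf

print(relative_epi(0.5, 0.7))   # ~0.71: each instruction costs ~29% less energy
```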

VI.     Conclusions

In this paper we have examined two techniques for performance throttling: reducing the size of the issue window, and using a second, simpler processor core on the same die. Both approaches have proven disappointing. Scaling the issue window below 16 entries caused performance penalties that outweighed the energy savings, and comparisons of simple and complex processors manufactured in similar technologies showed that the more complex processor achieved significantly higher performance while maintaining nearly the same energy per instruction. Thus, in both cases, conventional voltage scaling approaches offer significantly better power/performance tradeoffs.

VII.     References

David H. Albonesi, “Dynamic IPC/Clock Rate Optimization,” 25th International Symposium on Computer Architecture, 282--292, June 1998.

 

W. Ye and N. Vijaykrishnan and M. Kandemir and M. J. Irwin,  “The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool,” 37th Design Automation Conference, 340--345,  June 2000

 

David Brooks, Vivek Tiwari, and Margaret Martonosi, “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,” 27th Annual International Symposium on Computer Architecture, 83--94, June 2000.

 

Doug Burger and Todd M. Austin, “The SimpleScalar Tool Set, Version 2.0,” June 1997.

 

David H. Albonesi, “The Inherent Energy Efficiency of Complexity-Adaptive Processors,” 1998 Power-Driven Microarchitecture Workshop, held at the 25th International Symposium on Computer Architecture, 107--112, June 1998.

 

Michael K. Gowan, Larry L. Biro, and Daniel B. Jackson, “Power Considerations in the Design of the Alpha 21264 Microprocessor,” 35th Annual Conference on Design Automation, 726--731, June 1998.

 

R. Y. Chen and M. J. Irwin, “An Architectural Level Power Simulator,” 25th International Symposium on Computer Architecture, June 1998.

 

T. Pering, T. Burd, and R. Brodersen, “Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System,” 1998 Power-Driven Microarchitecture Workshop, held at the 25th International Symposium on Computer Architecture, 107--112, June 1998.

 

G. Cai and C. H. Lim, “Architectural Level Power/Performance Optimization and Dynamic Power Estimation,” MICRO-32, November 1999.

 

Soraya Ghiasi and Dirk Grunwald,  “A Comparison of Two Architectural Power Models,”  Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, November, 2000

 

L. Benini and A. Bogliolo and S. Cavallucci and B. Ricco, “Monitoring System Activity for OS-Directed Dynamic Power Management,”  International Symposium on Low Power Electronics and Design, August, 1998

 

Vivek Tiwari, Deo Singh, Suresh Rajgopal, Gaurav Mehta, Rakesh Patel, and Franklin Baez, “Reducing Power in High-Performance Microprocessors,” 35th Annual Conference on Design Automation, June 1998.

 

IBM Product Datasheet for the PowerPC 440 Core

 

IBM Product Datasheet for the PowerPC 405 Core

 



  The authors are graduate students at the Massachusetts Institute of Technology, Cambridge, MA.