2 - University of California, Los Angeles



WArPE 1.0

Wisconsin Architecture Power Estimator

MICRO ARCHITECTURAL POWER ESTIMATION TOOL

1 Introduction

2 WArPE Processor Model

2.1 Microarchitecture

2.1.1 Instruction Fetch

2.1.2 Instruction Decode / Dispatch Stage

2.1.3 Instruction Execution and Writeback

3 Analytical Models

3.1 Power Density Model

3.2 Analytical RAM Model

3.2.1 Decoder Buffer

3.2.2 Decoder

3.2.3 Wordline

3.2.4 Bitline

3.2.5 Sense Amplifier

3.2.6 Output driver

3.2.7 Generic mux

3.2.8 Comparator

3.3 Latch Model

3.4 Special Model for Issue Window

4 Options, Configuration, Output

4.1 Options

4.2 Configuration files

4.2.1 Basic configuration file

4.2.2 Process Technology Data File

4.3 Output file

5 File Structure

5.1.1 power.h

5.1.2 power.c

5.1.3 anal.h

5.1.4 anal.c

5.1.5 sim-outorder.c, main.c

5.2 Control Flow

6 References

Appendix

Index

Table of figures

Figure 1: Micro architecture of a simple superscalar processor

Figure 2. Table of all the activity counts associated with the fetch stage

Figure 3: Activity Counters associated with the decode/dispatch stage

Figure 4: Instruction Issue Window

Figure 5: Activity counters associated with the execution and the writeback stage

Figure 6: Decoder Buffer.

Figure 7: Static decoder schematic

Figure 8: Circuits used in the two stages

Figure 9: Dynamic decoder.

Figure 10: Word line.

Figure 11: Bitline

Figure 12: Sense Amplifier architecture

Figure 13: Sense Amplifier circuit.

Figure 14: Output driver.

Figure 15: n-bit comparator

Figure 17: A Pipeline Latch

Figure 18: Instruction Issue Window

Figure 19: Basic Configuration File

Figure 20: Technology File.

Figure 21: Output File.

1 Introduction

Power consumption (and dissipation) has become a critical design consideration in modern microprocessors. For battery-powered devices, such as laptop PCs and PDAs, total power consumption is the major issue. For high-performance applications such as servers, the need to dissipate high power requires expensive packaging and cooling technologies. Furthermore, in large-scale systems, power consumption can be a major operating expense.

Microprocessors can be made more power efficient at a number of levels, ranging from the circuit level, to the gate level, all the way up to software. Our particular interest is in improving power efficiency at the microarchitecture level. For studying and developing power-efficient microarchitectures, power estimation tools are almost essential. An important part of our research effort has therefore been the development of a flexible and accurate power estimation tool, WArPE.

WArPE uses detailed microarchitecture simulation to measure energy-consuming activities and execution time. Given an energy estimate for each activity, these simulation-derived measurements can be turned into power estimates. WArPE is based on the SimpleScalar simulator [1], a performance simulator widely used among academic researchers. An important element of power estimation is the energy consumed by each of the modeled microarchitecture-level activities. In WArPE, these energy estimates can be supplied directly by the user as empirical data or, for many important subsystems, generated by analytical models that are part of WArPE.

Other power estimation tools based on the SimpleScalar simulator have been developed [3,4]. WArPE differs from these estimators in the following ways.

1) It can take chip technology data as an input and scale energy numbers appropriately.

2) It models the instruction fetch, decode, rename and issue pipeline in detail, including latches.

This document describes the internal structure and usage of the WArPE tool. Section 2 describes the detailed structure of the simulator, including the estimation methodology. Section 3 describes the analytical models used, and Section 4 covers the options, configuration files and output file details. Section 5 discusses the file structure of the simulator.

2 WArPE Processor Model

WArPE models a modern dynamically scheduled superscalar processor. The processor is divided into a number of functional unit blocks (FUBs) and is simulated in much the same way as in a performance simulator. At the end of each cycle, the estimator determines the activity of each FUB and uses this activity to estimate the energy consumed by that block. The total energy consumed by all the FUBs during a cycle yields an instantaneous power estimate, and the average over all cycles gives an average power estimate. The instantaneous power is useful when di/dt is of concern; di/dt can be estimated from the difference in power consumption between consecutive cycles.
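The per-cycle bookkeeping described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not WArPE's actual code: the type power_trace_t and the functions trace_init, trace_add_cycle and trace_average_power are hypothetical names, and the di/dt proxy is simply the largest change in power between consecutive cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running summary of per-cycle power (not a WArPE structure). */
typedef struct {
    double sum_power;      /* sum of per-cycle power over all cycles (W)   */
    double prev_power;     /* power of the previous cycle (W)              */
    double max_delta;      /* largest cycle-to-cycle power change seen (W) */
    unsigned long cycles;  /* number of cycles accumulated so far          */
} power_trace_t;

void trace_init(power_trace_t *t)
{
    t->sum_power = 0.0;
    t->prev_power = 0.0;
    t->max_delta = 0.0;
    t->cycles = 0;
}

/* fub_energy[i] is the energy (J) estimated for FUB i in this cycle;
 * cycle_time is the clock period in seconds. */
void trace_add_cycle(power_trace_t *t, const double *fub_energy,
                     size_t n_fubs, double cycle_time)
{
    double energy = 0.0, power, delta;
    size_t i;

    assert(cycle_time > 0.0);
    for (i = 0; i < n_fubs; i++)
        energy += fub_energy[i];          /* total energy of all FUBs */
    power = energy / cycle_time;          /* instantaneous power      */

    if (t->cycles > 0) {                  /* di/dt proxy: power difference */
        delta = power - t->prev_power;
        if (delta < 0.0)
            delta = -delta;
        if (delta > t->max_delta)
            t->max_delta = delta;
    }
    t->prev_power = power;
    t->sum_power += power;
    t->cycles++;
}

double trace_average_power(const power_trace_t *t)
{
    return t->cycles ? t->sum_power / (double)t->cycles : 0.0;
}
```

A caller would invoke trace_add_cycle once per simulated cycle and read trace_average_power and max_delta at the end of the run. A check against a user-supplied di/dt threshold could be added inside trace_add_cycle.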

The per-activity energy estimates are determined before the simulator starts. These estimates are determined in one of the following ways.

1) A general analytical model for RAM FUBs

2) A power density model for non-RAM FUBs

3) Latch models (primarily in the instruction pipeline)

4) Special models for critical FUBs such as the issue window.

The following sections describe the overall superscalar microarchitecture, including the specific FUBs that are modeled. This is followed by descriptions of the RAM and power density analytical models. The latch models are described along with the instruction pipeline, and special models are described where the specific FUB is discussed.

2.1 Microarchitecture

In this section we touch upon some of the details of how the individual instruction pipeline units are modeled. The generic micro architecture of a pipelined superscalar processor is shown in Figure 1.

[pic]

Figure 1: Micro architecture of a simple superscalar processor

The associated units include the branch prediction tables, instruction translation lookaside buffer, data caches, data translation lookaside buffer, reorder buffer, register file, result bus, etc. For most of these we have an approximate analytical model. There is no analytical model for the latches. We now describe some details of the power models of each pipeline stage.

2.1.1 Instruction Fetch

The instruction fetch stage involves access to the instruction cache and the itlb as well as the branch prediction logic. The FUBs representing this stage are: the new PC generation logic (npc); the logic associated with branch target buffer access (btblog); the branch target buffer RAM structure itself (btbcac); the return stack buffer (rsbcac); three FUBs for the L1 instruction cache, namely the logic circuits that access the cache (il1log), the L1 tag structure (il1tag) and the physical L1 instruction cache (il1cac); and the latches at the end of the pipeline stage (fdlatch). WArPE has analytical models for almost all of these FUBs. Because most of these structures are cache-like or CAM-like, they have invalidate, replacement, write back, read and write counters associated with them. Figure 2 lists all the counters associated with this stage of execution.

|Counter No. |Name of the counter |Description |

|0 |Brupdate |branch update activity |

|1 |Brlookup |branch lookup activity |

|2 |Rsbpop |return stack pop activity |

|3 |Rsbpush |return stack push activity |

|4 |Il1acc |il1 access activity |

|5 |Il1wbk |il1 writebacks activity |

|6 |Il1rep |il1 replacements activity |

|7 |Il1inv |il1 invalidations activity |

|12 |Il2acc |il2 access activity |

|13 |Il2wbk |il2 writebacks activity |

|14 |Il2rep |il2 replacements activity |

|15 |Il2inv |il2 invalidations activity |

|24 |Itlbmis |itlb miss activity |

|27 |Itlbacc |itlb access activity |

|28 |Itlbwbk |itlb writebacks activity |

|29 |Itlbrep |itlb replacements activity |

|30 |Itlbinv |itlb invalidations activity |

|35 |Npc |next pc logic activity |

|69 |Fdlatch_active |Latch after fetch stage active |

|70 |Fdlatch_stall |Latch after fetch stage stalled |

|71 |Fdlatch_empty |Latch after fetch stage empty |

Figure 2. Table of all the activity counts associated with the fetch stage

To build power numbers for these structures, we map each table to an approximate cache structure. The CACTI tools, which are used by almost all the existing simulators, do this mapping for us. Given parameters such as the cache size, associativity and number of sets, CACTI finds an optimal cache organization for each table, assuming that some cache optimizations have been done at the circuit level. The numbers of row and column decoders are calculated from this mapping. The power models for the caches and the decoders are the same as those suggested by Wilton and Jouppi [2]. Currently, there are no analytical models for the write back, replacement or invalidation logic circuits, but the simulator maintains a count of these activities. To calculate the power, we multiply the activity counts by approximate power numbers obtained from industry. However, the user can supply different numbers and thereby customize the simulator.

At the end of the fetch stage is a set of pipeline latches, which may be of variable width. These latches may be in the Active, Stalled or Empty state, with each state consuming a different amount of energy. The simulator keeps an account of the number of latches in each state per cycle. This gives the power consumed each cycle by the latches. More detail on the latch power model follows in Section 3.3.

2.1.2 Instruction Decode / Dispatch Stage

The decode stage comprises the decoders as well as the register aliasing table associated with the register renaming logic. These units are represented in the simulator by FUBs for the dispatch queue (dispatchq), the instruction decoder (decodepla), the decoder logic for handling mispredictions (decodemisp), the logic for stalling the decoder (decodestall), the register aliasing table (ratarr), the input and output dependence checks (ratidep, ratodep), the register aliasing table stall (ratstall) and the latches at the end of the pipe stage (dilatch). There are counters associated with decoder stall and mispredict activity as well as with the decoder access itself. The register aliasing table has counters associated with the table itself as well as with input and output dependence checking activity. A list of all the counters is given in Figure 3. Presently, we have an analytical model only for the register aliasing table cache. The rest of the activity counters are multiplied by the power numbers obtained from the user input file (pfa mode).

|Counter No. |Name of the Counter |Description of the counter |

|36 |Dispatchqrd |dispatchq read activity |

|37 |Dispatchqwr |dispatchq write activity |

|38 |Dispatchqrel |dispatchq release activity |

|39 |Dispatchqrec |dispatchq recover activity |

|40 |Decoder |decoder activity |

|41 |Decodemispchk |decoder mispredict detect activity |

|42 |Decodemisp |decoder mispredict correction activity |

|43 |Decodestallchk |decoder stall detect activity |

|44 |Decodestall |decoder stall block activity |

|45 |Ratidep |rat idep allocation activity |

|46 |Ratodep |rat odep allocation activity |

|47 |Ratstallchk |rat stall detection activity |

|48 |Ratstall |rat stall block activity |

|72 |Dilatch_active |Latch after decode stage active |

|73 |Dilatch_stall |Latch after decode stage stall |

|74 |Dilatch_empty |Latch after decode stage empty |

Figure 3: Activity Counters associated with the decode/dispatch stage

The decoded instructions are moved into another set of latches, which again may vary in both size and number. These latches may model the delay associated with the renaming logic or with the actual decoding of the instruction. As before, each latch can be in one of three states (Active, Stalled or Empty), with power numbers that may be the same as for the previous latches. We maintain a per-cycle record of the state of the latches (Dilatch_active, Dilatch_stall, Dilatch_empty) and calculate the per-cycle contribution to total power.

[pic]

Figure 4: Instruction Issue Window

Another innovative idea in this power simulator concerns the issue window. The simulator models both collapsible and non-collapsible instruction issue windows with the same FUB: isw. Collapsing the instruction window consumes some power. The simulator has a counter to record these movements per cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle. A detailed power model for the issue window is explained in Section 3.4.

2.1.3 Instruction Execution and Writeback

The selected instructions are then issued to the corresponding functional units or are stored in the load/store queues. The FUBs for this stage include those for the integer functional units (fuint), the floating point functional units (fufp), the L1 data cache logic circuit (dl1log), the L1 data cache tag structure (dl1tag), the L1 data cache (dl1cac), the corresponding FUBs for the unified L2 cache (ul2log, ul2tag, ul2cac), the load/store queue (lsqrdyq) and the data tlb (dtlbcac). The simulator does not have an analytical model for any of the functional units, but the load/store queues can be modeled as a pair of cache-like structures along with a CAM-like structure, with analytical models for both types. Another structure associated with the execution stage is the data cache. The simulator models the data cache along the same lines as the instruction cache, using the CACTI tools. There are counters for data cache access (dl2acc), write back (dl2wbk), replacement (dl2rep) and invalidation (dl2inv). The data tlb is modeled along the lines of the instruction tlb and hence uses the CAM-like analytical model. The results generated by the functional units are broadcast over the result bus, but the current version of the simulator does not calculate the power consumed by this result bus.

All the activities associated with the initialization and the utilization of the register update unit are represented by the FUBs for the ruu array (ruuarr) and the ruu writeback (ruuwb). A complete list of all the FUBs and all the counters is included in the appendix to this manual. The list of counters associated with this stage is as follows:

|Counter No. |Name of the counter |Description of the counter |

|8 |Dl1acc |dl1 access activity |

|9 |Dl1wbk |dl1 writebacks activity |

|10 |Dl1rep |dl1 replacements activity |

|11 |Dl1inv |dl1 invalidations activity |

|16 |Dl2acc |dl2 access activity |

|17 |Dl2wbk |dl2 writebacks activity |

|18 |Dl2rep |dl2 replacements activity |

|19 |Dl2inv |dl2 invalidations activity |

|20 |Ul2acc |ul2 access activity |

|21 |Ul2wbk |ul2 writebacks activity |

|22 |Ul2rep |ul2 replacements activity |

|23 |Ul2inv |ul2 invalidations activity |

|25 |Dtlbmis |dtlb miss activity |

|26 |Ul2mis |ul2 miss activity |

|31 |Dtlbacc |dtlb access activity |

|32 |Dtlbwbk |dtlb writebacks activity |

|33 |Dtlbrep |dtlb replacements activity |

|34 |Dtlbinv |dtlb invalidations activity |

|45 |Ratidep |rat idep allocation activity |

|46 |Ratodep |rat odep allocation activity |

|47 |Ratstallchk |rat stall detection activity |

|48 |Ratstall |rat stall block activity |

|49 |Ruuarr |ruu array activity |

|50 |Ruurdyqsch |ruu readyq allocation activity |

|51 |Ruurec |ruu recover activity |

|52 |Ruuret |ruu retire activity |

|53 |Ruurdyqcam |ruu readyq dependence check activity |

|54 |Ruurdyqrel |ruu readyq resource release activity |

|55 |Lsqarr |lsq array activity |

|56 |Lsqrdyqsch |lsq readyq allocation activity |

|57 |Lsqrec |lsq recover activity |

|58 |Lsqret |lsq retire activity |

|59 |Lsqrdyqcam |lsq readyq dependence check activity |

|60 |Lsqrdyqrel |lsq readyq resource release activity |

|61 |Ruuarb |ruu arbitration activity |

|62 |Ruuwb |ruu writeback scheduler activity |

|63 |Ruuwbq |ruu writebackq activity |

|64 |Lsqarb |lsq arbitration activity |

|65 |Lsqwb |lsq writeback scheduler activity |

|66 |Lsqwbq |lsq writebackq activity |

|67 |Fuint |functional unit integer |

|68 |Fufp |functional unit floating point |

Figure 5: Activity counters associated with the execution and the writeback stage

3 Analytical Models

Architectural power estimation methodologies can be broadly classified into empirical methods and analytical methods. These can be further classified into fixed-activity and activity-sensitive methods. One of the earliest methods of power estimation was a fixed-activity method called the Power Factor Approximation (PFA) method, described by Liu and Svensson [5]. Power estimation techniques have come a long way since then, with activity-based models, transition-sensitive models and so on. The basic estimation methodology is, however, the same: we either calculate the power density constants associated with each structure (the analytical model) or take the power constants as input from the user (the pfa model).

3.1 Power Density Model

Several architectural power estimation schemes have been discussed in literature [6][7]. In WArPE we use a scheme similar to Power Factor Approximation (PFA) [5]. We express the power dissipation in terms of the active/inactive power density of each FUB, the area of the FUB and the activity factor, which is determined via performance simulation.

power = {(active power density)*(activity) + (inactive power density)*(1–activity)}*area

The power density and area numbers are either determined empirically from the real design and scaled to the required technology or are estimated by considering circuit complexity, logic styles, etc. The power density numbers are further divided based on the following circuit styles:

Dynamic logic

Static logic

PLA circuits

Memory type regular circuits

Clock circuits

Thus, for every FUB, one has to define 5*3 = 15 different numbers, corresponding to the active power density, inactive power density and area for each of the five circuit styles. The user can supply these through the configuration file. However, it is not always possible to obtain or estimate these numbers. To overcome this problem, we have included routines that model FUBs analytically. Presently, we can construct models for most regular memory-type structures such as caches, register files, register renaming tables, branch target buffers and reorder buffers. The simulator is designed so that models can be updated and new models added relatively easily.
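The power density formula and the 15 per-FUB numbers can be sketched in C as follows. This is a minimal illustration rather than the code in power.c: the type and function names are hypothetical, and summing the five circuit styles with one shared activity factor per FUB is an assumption made for this example. With power densities in μW/μm² and areas in μm², the result is in μW.

```c
#include <assert.h>

/* The five circuit styles for which densities are supplied. */
enum circuit_style {
    STYLE_DYNAMIC, STYLE_STATIC, STYLE_CLOCK, STYLE_MEMORY, STYLE_PLA,
    NUM_STYLES
};

/* Three numbers per style: active density, inactive density, area. */
typedef struct {
    double pd_active;    /* active power density   (uW/um^2) */
    double pd_inactive;  /* inactive power density (uW/um^2) */
    double area;         /* area of this style in the FUB (um^2) */
} style_params_t;

/* 5 styles * 3 numbers = 15 numbers per FUB (hypothetical layout). */
typedef struct {
    style_params_t style[NUM_STYLES];
} fub_params_t;

/* power = {pd_active*activity + pd_inactive*(1 - activity)} * area,
 * summed over the circuit styles (assumption: one activity per FUB). */
double fub_power(const fub_params_t *fub, double activity)
{
    double total = 0.0;
    int s;

    assert(activity >= 0.0 && activity <= 1.0);
    for (s = 0; s < NUM_STYLES; s++) {
        const style_params_t *p = &fub->style[s];
        total += (p->pd_active * activity
                  + p->pd_inactive * (1.0 - activity)) * p->area;
    }
    return total;
}
```

For example, a FUB whose static logic has densities 4 and 2 μW/μm² over 8 μm² contributes (4*0.5 + 2*0.5)*8 = 24 μW at an activity factor of 0.5.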

In order to take physical structure into consideration, a few more options have been added. The analytical models can, and in fact will have to, be refined continuously to improve result accuracy. Models for other regular structures like PLAs can also be added.

3.2 Analytical RAM Model

In the analytical mode, power constants are generated using the analytical models provided. Presently, we have the capability to model most of the regular and simple logic-based structures. The models are based on a circuit time-delay-energy simulation model similar to that used by Wilton and Jouppi [2]. The idea is to break FUBs into smaller components for which analytical models are available. Some of the differences from [2] include a choice of static vs. dynamic logic for the decoder and a single-ended read option for register files. These models can be used to construct power constants for FUBs that contain regular, memory-type building blocks. The FUBs that have already been modeled are the instruction and data caches, TLBs, branch target cache, register allocation table and return address stack. Other units that can be modeled are the register update unit and load/store queue arrays.

For example, a cache can be divided into a decoder buffer, row decoder, wordlines, bitlines, sense amplifiers, column decoder and output MUXes. The models generate power numbers by calculating the effective switching capacitance, which is estimated by adding the gate, drain and routing capacitances together. These are calculated by functions that take the width and length of the poly used as inputs. The length of all transistors is assumed to be constant and equal to the Leff defined in the technology file. The list of these functions (included in anal.c) follows.

gatecap(): returns the gate capacitance of the transistor.

gatecappass(): returns the gate capacitance for a pass transistor.

draincapp(): returns the drain capacitance for the p-type transistor. It has an added feature of optimizing for stacked transistors, for example the n-type transistors in a 4-input NAND.

draincapn(): similar function for the n-type transistor.
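As a rough illustration of how an effective switching capacitance turns into an energy number, the sketch below sums gate, drain and routing capacitance and applies the common approximation E = C_eff * Vdd^2 per switching event. The helper names, their signatures and the per-unit capacitance fields are assumptions made for this example; they are not the gatecap() or draincapp() routines of anal.c, and a real model may apply a different constant factor or per-node activity.

```c
#include <assert.h>

/* Hypothetical per-unit capacitances; real values would be derived
 * from the process technology data. */
typedef struct {
    double cgate_per_um2;  /* gate capacitance per um^2 of gate area (F) */
    double cdrain_per_um;  /* drain capacitance per um of width (F)      */
    double cwire_per_um;   /* routing capacitance per um of wire (F)     */
    double leff_um;        /* effective channel length (um)              */
    double vdd;            /* supply voltage (V)                         */
} cap_tech_t;

/* Gate capacitance of a transistor of the given width; the length is
 * fixed at Leff, as the text assumes for all transistors. */
double sketch_gate_cap(const cap_tech_t *t, double width_um)
{
    assert(width_um >= 0.0);
    return t->cgate_per_um2 * width_um * t->leff_um;
}

/* Drain capacitance, modeled here as proportional to width only. */
double sketch_drain_cap(const cap_tech_t *t, double width_um)
{
    assert(width_um >= 0.0);
    return t->cdrain_per_um * width_um;
}

/* Routing capacitance of a wire of the given length. */
double sketch_wire_cap(const cap_tech_t *t, double length_um)
{
    assert(length_um >= 0.0);
    return t->cwire_per_um * length_um;
}

/* Energy per switching event under the E = C * Vdd^2 approximation. */
double sketch_switch_energy(const cap_tech_t *t, double c_eff)
{
    return c_eff * t->vdd * t->vdd;
}
```

A component model would add the three capacitances for every node that switches and pass the total to sketch_switch_energy; multiplying the result by the clock frequency and an activity factor gives a power number.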

The following sections describe each of the basic models provided. An example of the usage of these models to create more complex models will be given in the last chapter.

3.2.1 Decoder Buffer

The decoder buffer, as the name suggests, buffers the address lines that go into the decoders. The buffer is an important element if the address lines feed into a large number of gates. Presently, the sizes of the buffer transistors are fixed. These could be changed depending on the number of gates connected to the lines and the speed required. The following figure shows the buffer architecture.

Figure 6: Decoder Buffer.

3.2.2 Decoder

Two types of decoder models have been included, depending on the type of circuits they use. The first is a static decoder based on a two-level decoding scheme. The first stage is constructed from 3x8 and 2x4 NAND-based decoders. The second stage consists of an n-input OR for every output bit, where n is the number of minterms in stage 1. The following schematic shows the basic architecture of this decoder.

Figure 7: Static decoder schematic

Figure 8: Circuits used in the two stages

The second type of decoder is the dynamic decoder, which is based on a domino NOR. However, the maximum number of inputs that should be allowed for this decoder is around six. The following figure shows a schematic of the dynamic decoder.

Figure 9: Dynamic decoder.

3.2.3 Wordline

The wordline power model includes both the wordline and the wordline driver. The driver size is computed by a function called WLdriver_size(). The inputs to this function are the capacitance driven and the expected rise time. For lack of data, the rise time is assumed to be period/8. This can be changed by changing the entry in tech.h. The model also takes into account single-ended read cells, used in register files. A schematic of the wordline is shown below.

Figure 10: Word line.

3.2.4 Bitline

The bitline model takes into account the precharge transistors, line capacitance and isolation transistors. Several minute features have been added and detailed comments in the code explain these. The basic schematic of the bitline is shown below.

Figure 11: Bitline

3.2.5 Sense Amplifier

The sense amplifier is shared by many bitlines using a column MUX. However, one should not multiplex more than eight bitlines together due to leakage issues. The MUX is a standard pass-gate based MUX with a column decoder. The basic architecture and the sense amplifier circuit used are shown below.


Figure 12: Sense Amplifier architecture

Figure 13: Sense Amplifier circuit.

3.2.6 Output driver

The output driver uses an array of tri-state drivers like the one shown in the schematic below.

Figure 14: Output driver.

3.2.7 Generic mux

This is a standard pass-gate based MUX. The only specifications required are the number of inputs to be multiplexed into one bit and the number of output bits. The generic MUX, as the name suggests, can be used to model a general MUX.

3.2.8 Comparator

The comparator design is shown in Fig. 15.

Figure 15: n-bit comparator

3.3 Latch Model

At the end of the fetch stage are the pipeline latches associated with that stage. These pipeline latches model the delay incurred in moving an instruction from the fetch stage to the decode stage. This delay could be due to the BTB lookup or to obtaining the branch prediction. The latches can be of variable size, and the number of latches varies with the delay to be modeled. The size varies because some information may be added at a later latch in the pipeline. At any time each latch is in one of three states: Active, meaning that a new instruction was moved into the latch that cycle; Stalled, meaning that the latch is still holding the instruction it held in the previous cycle; or Empty, meaning that the latch is not storing anything that cycle. The power associated with each of these states is different and is read from the input file.

[pic]

Figure 16: Simple Architecture along with the Pipeline latches

This breakdown of energy-consuming activity allows for a form of clock gating where active instructions may consume more energy than stalled instructions, and where valid instructions may consume more energy than invalid ones (i.e. empty pipeline slots). For example, consider the logic shown in Figure 17. Here, a typical pipeline latch is shown, as might appear in the decode pipeline. An input multiplexor (typically built into the latch) is used to "recirculate" latched pipeline values when the hold signal is active. In addition, the valid bit from the preceding stage is used to gate the latch itself; if there is no valid data being fed into the latch, then the latch is not clocked.

Figure 17: A Pipeline Latch

A Valid Bit from the previous stage is used to gate the clock signal. A hold signal from the succeeding stage is used to switch the multiplexor and recirculate data being stalled.

In this system, a certain amount of energy is consumed if an instruction moves up the pipeline (the hold signal is inactive) and is latched into the next stage. A different (lower) amount is consumed if the hold signal is active: the multiplexor feeds the same data back into the latch and the latch is clocked, but the logic following the latch does not see any of its inputs change. Finally, a different (still lower) amount of energy is consumed if the valid signal is off and the latch is not clocked at all. Similarly, in the issue queue, a particular issue queue slot may consume different amounts of energy depending on whether or not it holds an active instruction and whether or not the instruction actually issues. The pipeline latches were taken from a high-end design environment. A 2-to-1 static mux was used to recirculate the data when stalled. Each cycle, the simulator maintains an account of the latches in the various states and of the total power the latches consume. This is one of the innovative ideas in this simulator.
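The three-state latch accounting can be sketched in C as follows. This is a minimal illustration, assuming one energy value per state and per-cycle counts of latches in each state (as recorded by counters such as Fdlatch_active, Fdlatch_stall and Fdlatch_empty); the type and function names are hypothetical and do not come from the WArPE sources.

```c
#include <assert.h>

/* Energy consumed by one latch in each state during one cycle (J).
 * These values would be read from the user input file. */
typedef struct {
    double e_active;  /* new instruction latched this cycle      */
    double e_stall;   /* same data recirculated through the mux  */
    double e_empty;   /* no valid data, latch not clocked        */
} latch_energy_t;

/* Number of latches observed in each state during one cycle. */
typedef struct {
    unsigned active;
    unsigned stall;
    unsigned empty;
} latch_counts_t;

/* Energy of a group of pipeline latches for a single cycle. */
double latch_cycle_energy(const latch_energy_t *e, const latch_counts_t *n)
{
    /* Expected ordering from the text: active >= stalled >= empty. */
    assert(e->e_active >= e->e_stall && e->e_stall >= e->e_empty);
    return n->active * e->e_active
         + n->stall  * e->e_stall
         + n->empty  * e->e_empty;
}
```

Dividing the returned energy by the clock period gives the latch contribution to that cycle's power. The same function could be applied to the issue window entries using the Iswact, Iswstall and Iswempty counts.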

3.4 Special Model for Issue Window

As stated before, the simulator models both collapsible and non-collapsible instruction issue windows with the same FUB: isw. Collapsing the instruction window consumes some power. The simulator has a counter to record these movements per cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle.

[pic]

Figure 18: Instruction Issue Window

For the issue queue, the wakeup logic is modeled by counting the energy in the comparators. For the selection logic, the energy of one arbiter cell was supplied, and the number of arbiter cells per arbiter was calculated from the number of entries in the issue queue. We assume one arbiter per issue port; in our case there are four issue ports. Every entry in the issue queue has some comparators (for tag match). The wakeup logic associated with the issue window involves tag comparison and has a level of XOR gates followed by NAND gates. Assuming that the NAND gates are smaller than the XOR gates, the simulator records the power consumed in the XOR gates each cycle. There are counters associated with each of the states of the issue window latches as well as with data movement between these latches for a collapsible window.

4 Options, Configuration, Output

This section describes the options, configuration files and output files used in the WArPE power estimation tool.

4.1 Options

The estimator options (in addition to the underlying SimpleScalar options) are defined below. These options are registered in the original SimpleScalar option database. Implementing them required modifying some of the original sim-outorder.c code.

-power_config : This option specifies the power simulator configuration file. The file must have read permissions. The default file name is power.txt.

-power_outfile : This option specifies the file into which output statistics are dumped. The default file name is power_output.txt.

-tech_file : This option specifies the technology definition file name. The file must have read permissions. The default file name is technology.def.

-technology : This option specifies the power simulation technology. The technology is defined by an identifier listed in the technology file, e.g. -technology 0.25um. The default technology is 0.8um.

-sim_limit : This option specifies the number of instructions (in millions) at which the simulation stops and data is dumped into the output file.

4.2 Configuration files

Following is a description of the various configuration files used in the WArPE estimator. Configuration files provide an easy and effective way of defining the large number of parameters used in the simulator.

4.2.1 Basic configuration file

This is the file specified by the -power_config option. It defines the power densities, areas, mode of operation (pfa for empirical data or anal for the analytical model), power thresholds, and physical partitioning parameters. This file can be generated by saving a Microsoft Excel worksheet in tab-delimited text format.

The file has three main options:

1) -global

This defines the power and di/dt thresholds for the full chip. The unit is watts.

2)

unit: name of the FUB (Functional Unit Block) as defined in power_init().

mode: pfa: directs the simulator to use empirical data i.e. dyn_pda,…,pla_a.

anal: directs the simulator to use analytical model for the FUB.

maxpowerth: maximum power threshold for the FUB.

maxdidtth: maximum di/dt threshold for the FUB.

dyn_pda: dynamic circuit power density - active

dyn_pdi: dynamic circuit power density - inactive

dyn_a: dynamic circuit area

sta_pda: static circuit power density - active

sta_pdi: static circuit power density - inactive

sta_a: static circuit area

clk_pda: clock circuit power density - active

clk_pdi: clock circuit power density - inactive

clk_a: clock circuit area

mem_pda: memory type circuit power density - active

mem_pdi: memory type circuit power density - inactive

mem_a: memory type circuit area

pla_pda: PLA power density - active

pla_pdi: PLA power density - inactive

pla_a: PLA circuit area

The units of the power densities are μW/μm², and the units of area are μm².

3) -

Eg. –itlbcac 1 2 1 static dual

This option specifies the physical partition. In the example given above, it defines the partition for itlb. The option name is a “-” followed by the FUB name. The five values that follow are, in order:

nwl: The number of partitions of the wordline. Each partition has a different decoder and wordline driver. The partitions, however, share sense amplifiers.

nbl: The number of partitions of the bitline. Each partition has separate sense amplifiers and decoders.

nsp: Similar to the bitline partition, but the partitions share a decoder.

logic: The type of logic used for decoders, static or dynamic.

rd_mode: Defines the read mode, i.e. dual for dual rail and single for single ended (used in small register files).
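As a minimal sketch of how one such partition line might be read, the following C fragment parses the example above with sscanf(). The type partition_opt and the function parse_partition() are hypothetical names used only for illustration; they are not taken from the WArPE sources. The sketch also assumes the option is written with a plain ASCII "-".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one physical-partition line, e.g.
 * "-itlbcac 1 2 1 static dual". Not part of the WArPE sources. */
typedef struct {
    char fub[32];     /* FUB name without the leading '-' */
    int  nwl;         /* number of wordline partitions */
    int  nbl;         /* number of bitline partitions */
    int  nsp;         /* bitline-like partitions that share a decoder */
    char logic[8];    /* "static" or "dynamic" */
    char rd_mode[8];  /* "dual" or "single" */
} partition_opt;

/* Returns 1 if the line is a well-formed partition option, 0 otherwise. */
int parse_partition(const char *line, partition_opt *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, " -%31s %d %d %d %7s %7s",
               out->fub, &out->nwl, &out->nbl, &out->nsp,
               out->logic, out->rd_mode) != 6)
        return 0;
    if (out->nwl < 1 || out->nbl < 1 || out->nsp < 1)
        return 0;
    if (strcmp(out->logic, "static") != 0 &&
        strcmp(out->logic, "dynamic") != 0)
        return 0;
    if (strcmp(out->rd_mode, "dual") != 0 &&
        strcmp(out->rd_mode, "single") != 0)
        return 0;
    return 1;
}
```

Rejecting unknown logic styles and read modes at parse time is a design choice of this sketch; the document does not say how the simulator itself treats malformed lines.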

2 Process Technology Data File

This file contains the process technology data for several generations. It must contain at least the data for the technology defined by the –technology option. Some of the data provided in the technology file is not used at present; it will be used in later revisions, e.g. for dual Vt technologies. The format for the technology data is as follows:

Eg.

0.8um 0.80 5.00 100 0.75 0.75 1 1

: Technology identifier. It should match the identifier supplied using the

–technology option.

: The effective channel length in microns.

: The drain voltage used in the technology.

: The clock frequency in MHz.

: For use in dual voltage circuits. This is the lower threshold voltage.

: Higher threshold voltage.

: Leakage current for the lower threshold voltage in nA/μm.

: Leakage current for the higher threshold voltage in nA/μm.
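The eight fields of a technology line can be read with a single sscanf() call. The sketch below is illustrative only: the type tech_entry, its member names and the functions parse_tech_line() and find_tech() are assumptions, and the lines are assumed to be plain whitespace-separated text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one technology line; member names are
 * assumptions, chosen to mirror the field descriptions above. */
typedef struct {
    char   id[16];      /* technology identifier, e.g. "0.8um" */
    double leff_um;     /* effective channel length, microns */
    double vdd;         /* supply (drain) voltage, volts */
    double freq_mhz;    /* clock frequency, MHz */
    double vt_low;      /* lower threshold voltage, volts */
    double vt_high;     /* higher threshold voltage, volts */
    double ileak_low;   /* leakage at the lower threshold, nA/um */
    double ileak_high;  /* leakage at the higher threshold, nA/um */
} tech_entry;

/* Returns 1 if all eight fields were read, 0 otherwise. */
int parse_tech_line(const char *line, tech_entry *t)
{
    assert(line != NULL && t != NULL);
    return sscanf(line, "%15s %lf %lf %lf %lf %lf %lf %lf",
                  t->id, &t->leff_um, &t->vdd, &t->freq_mhz,
                  &t->vt_low, &t->vt_high,
                  &t->ileak_low, &t->ileak_high) == 8;
}

/* Scans n lines for the identifier given with the technology option.
 * Returns 1 and fills *t on the first match, 0 if none is found. */
int find_tech(const char *lines[], int n, const char *id, tech_entry *t)
{
    int i;
    for (i = 0; i < n; i++)
        if (parse_tech_line(lines[i], t) && strcmp(t->id, id) == 0)
            return 1;
    return 0;
}
```

A lookup that fails, as for an identifier missing from the file, corresponds to the requirement that the file contain at least the technology named on the command line.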

3 Output file

This file contains the output power statistics generated after the simulated instructions reach sim_limit or the simulation ends. The file is well formatted and the data is self-explanatory. Sample configuration files and output file are shown below.

| |

|-global |

| |

|0.8um 0.80 5.00 100 0.75 0.75 0.01 0.01 |

|0.6um 0.60 3.30 200 0.65 0.65 0.01 0.01 |

|0.35um 0.35 2.50 300 0.55 0.55 0.1 0.1 |

|0.25um 0.25 1.50 450 0.45 0.45 0.1 0.1 |

|0.18um 0.18 1.05 700 0.35 0.35 1 0.1 |

|0.15um 0.15 1.00 1000 0.30 0.35 1 0.1 |

|0.13um 0.13 1.00 1500 0.28 0.35 1 0.1 |

|0.1um 0.10 0.75 2250 0.25 0.35 1 0.1 |

|0.07um 0.07 0.60 3300 0.25 0.35 10 0.1 |

Figure 20 Technology File.

Sun May 19 17:07:59 2002

Power simulation checkpoint at 200000051 instructions

functional cumulative maximum maximum maximum power maximum didt

block name power power didt power violations violations

npclog 4.354e+06 8.262e+06 7.813e+06 0 0

btblog 6.775e+05 8.097e+06 7.835e+06 0 0

btbcac 1.59e+06 2.135e+07 2.092e+07 0 0

itlbcac 2.293e+05 4.446e+05 4.335e+05 0 0

rsbcac 3.414e+05 1.546e+06 1.245e+06 0 0

dtlbcac 4.024e+06 3.801e+07 3.716e+07 0 0

pmhlog 4.667e+05 3.132e+06 3.132e+06 0 0

il1log 3.548e+07 6.648e+07 6.3e+07 0 0

il1tag 1.071e+08 2.033e+08 1.962e+08 0 0

il1cac 1.062e+07 2.029e+07 1.979e+07 0 0

dl1log 1.338e+07 1.819e+08 1.628e+08 0 0

dl1tag 4.12e+07 5.679e+08 5.091e+08 0 0

dl1cac 1.485e+07 2.117e+08 1.905e+08 876705 0

dispatchq 0 0 0 0 0

decodepla 0 0 0 0 0

decodemisp 0 0 0 0 0

decodestall 0 0 0 0 0

ratarr 8.569e+07 2.715e+08 2.384e+08 0 0

ruuarr 2.734e+07 1.864e+08 1.133e+08 0 0

lsqarr 4.258e+06 2.924e+07 2.741e+07 0 0

ruurdyq 1.041e+06 7.845e+06 6.668e+06 0 0

lsqrdyq 7.525e+06 2.3e+07 1.464e+07 0 0

ruuarb 3.15e+07 2.795e+08 1.242e+08 0 0

ruuwb 7.137e+07 1.775e+08 1.745e+08 0 0

lsqarb 3.267e+07 2.795e+08 1.242e+08 0 0

lsqwb 2.487e+07 1.627e+08 1.597e+08 0 0

fuint 3.489e+06 8.958e+06 8.605e+06 0 0

fufp 4.671e+05 5.928e+06 5.461e+06 0 0

ul2log 1.833e+06 5.953e+07 5.85e+07 0 0

ul2tag 1.653e+07 5.574e+08 5.485e+08 0 0

ul2cac 1.352e+07 8.154e+08 8.102e+08 0 0

biu 8.242e+06 2.582e+08 2.512e+08 0 0

isw 1.625e+06 0 1.311e+06 0 0

fdlatch_0 6.458e+04 9.83e+04 7.782e+04 0 0

fdlatch_1 6.442e+04 9.83e+04 7.782e+04 0 0

fdlatch_2 6.387e+04 9.83e+04 7.782e+04 0 0

fdlatch_3 6.329e+04 9.83e+04 7.782e+04 0 0

dilatch_0 6.24e+04 9.83e+04 7.782e+04 0 0

dilatch_1 6.167e+04 9.83e+04 7.782e+04 0 0

dilatch_2 6.133e+04 9.83e+04 7.782e+04 0 0

dilatch_3 5.725e+04 9.83e+04 7.782e+04 0 0

Global statistics:

Total power = 566797441.827776

Maximum power = 3490027519.397630

Maximum didt power = 3198001037.129858

Power violations = 19489894

Didt power violations = 1204832

Figure 21: Output File.

File Structure

The simulator is based on SimpleScalar [1]. Care has been taken to keep the power simulation functions in separate files, thus minimizing modification of the original code. In some places, however, it was unavoidable, or simply much more convenient, to modify the original SimpleScalar files. The file structure is as follows.

power.c: The main power number generation file. It contains routines for power calculation. Any new power calculation routines, e.g. clock-gated power calculation, should be included in this file.

power.h: This file contains all the declarations for variables, structures and

functions and definitions used in power.c.

anal.c: Contains all the analytical models. Any new models developed should

be placed in this file.

anal.h: Contains declarations and definitions for variables and functions used

in anal.c.

tech.c: Technology processing file. Reads from the technology file and calculates scaling factors for the required technology. The base technology used is 0.8 um, and all simulations are performed by scaling the 0.8 um technology.

tech.h: Contains all the device size definitions for 0.8 um base technology.

sim-outorder.c and main.c have also been modified as described later.

1 power.h

As mentioned earlier, power.c contains routines for power computation and power.h is the supporting header file. The simulator is designed using a FUB-centric approach. All the power numbers specific to a FUB are stored together in one structure, shown below. Not all the elements are used; some of them are present for future expansion.

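typedef struct {
    char name[32];              /* FUB name */
    double active_power;        /* active power (empirical mode) */
    double active_power_rd;     /* active power, read (analytical mode) */
    double active_power_wr;     /* active power, write (analytical mode) */
    double static_power;
    double inactive_power;      /* presently redundant */
    double active_power_lt;     /* latch power: active */
    double stall_power_lt;      /* latch power: stalled */
    double empty_power_lt;      /* latch power: empty */
    double active_power_cg;     /* clock-gated values, presently unused */
    double active_power_wr_cg;
    double active_power_rd_cg;
    double inactive_power_cg;
    double maxpowerth;          /* maximum power threshold */
    double maxdidtth;           /* maximum di/dt power threshold */
    double cum_power;           /* power accumulated over all cycles */
    double prev_power;          /* power in the previous cycle */
    double max_power;           /* maximum power seen */
    double max_didt;            /* maximum di/dt power seen */
    double max_powerx;          /* number of power threshold violations */
    double max_didtx;           /* number of di/dt threshold violations */
} fub_t;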

The element name stores the name of the FUB, which can be at most 31 characters long, since the 32-byte array must also hold the terminating null character. The next four elements store the power numbers their names indicate. Note that active power comes in three flavors. When the empirical method is used, only active_power is used; it is the sum of the (power density)*(area) products for the five circuit styles. When analytical models are used, read and write operations can be separated, and because these consume different amounts of power they are stored separately, hence the rd and wr suffixes. The element inactive_power is presently redundant but could be used in the empirical mode to model standby. The next three elements are power values for latches only. The four elements after that are the clock-gated power numbers, which are presently not used. Clock gating does not affect static power, so there is no static_power_cg. The elements maxpowerth and maxdidtth are the maximum power and maximum di/dt power thresholds for the FUB; these values are defined in the configuration file. cum_power accumulates the power after every cycle and is finally divided by the number of cycles to give the average power dissipated. prev_power, max_power and max_didt are the previous-cycle power, maximum power and maximum di/dt power respectively. Finally, max_powerx and max_didtx keep track of the number of threshold violations.

A similar structure of type glb_power_t is used to track the full chip power numbers. Its elements are essentially the sums of the corresponding elements of the FUB structures. Another important structure is power_t, which is used to exchange power numbers. It has three elements, active_power_rd, active_power_wr and static_power, which are self-explanatory.

The activity counts are tracked using two arrays of counters, one for present-cycle counts and the other for cumulative counts. A specific counter is accessed by using the counter name as the index, e.g. pres_count[Ruuarr]. Ninety-three counters have presently been declared. New counters can be added simply by adding their names to the #define list and updating NUM_POWER_COUNTERS. By convention, only the first character of a counter name is capitalized.
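The counter scheme can be sketched as below. Only three counters are declared, with the numbers they carry in the Table of Counters in the Appendix, so NUM_POWER_COUNTERS is 3 here rather than the real value. The array name cum_count and both helper functions are assumptions, as is the choice to update the cumulative count on every event.

```c
#include <assert.h>
#include <string.h>

/* Three counters only, for illustration. */
#define Brupdate 0
#define Brlookup 1
#define Rsbpop   2
#define NUM_POWER_COUNTERS 3

static unsigned long pres_count[NUM_POWER_COUNTERS]; /* present cycle */
static unsigned long cum_count[NUM_POWER_COUNTERS];  /* whole run (assumed name) */

/* Record one activity event for the given counter. */
void count_activity(int counter)
{
    assert(counter >= 0 && counter < NUM_POWER_COUNTERS);
    pres_count[counter]++;
    cum_count[counter]++;
}

/* Called once per cycle after the power numbers have been updated:
 * present-cycle counts restart from zero, cumulative counts persist. */
void end_of_cycle(void)
{
    memset(pres_count, 0, sizeof pres_count);
}
```

Adding a counter in this scheme means one more #define and a larger NUM_POWER_COUNTERS, which matches the procedure described above.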

As more and more features are added to the simulator, new elements can be added to these structures and new counters can be defined for more detail/functionality. This makes the simulator amenable to future development.

Finally, a structure of type power_db is used to maintain the power parameter database. It stores the following data:

name: Name of a FUB/variable/file.

S: The number of sets in a cache like structure.

OR

The value of a variable, for example: decode width.

A: Associativity.

B: The block size in number of bits.

b: The output size in bits.

nwl, nbl, nsp, logic, rd_mode as defined in section 4.2.1.

The power_db structure is also used to store the various filenames. By convention, the first element of the database has the name “root”. The next element’s name is the configuration filename. The third element’s name is the output filename. The fourth is the technology filename and the fifth is the technology identifier. This convention avoids adding an extra field to the database. All other elements are then added in any order. This concludes the discussion of the important structures used. All other structures are self-explanatory.

2 power.c

power.c contains power estimation routines and option handling routines. These routines are described below.

add_param(), get_param()

These functions are used to add and retrieve parameters from the power simulation database. The former adds a structure of type power_db to the database while the latter retrieves the same from the database.
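The signatures of add_param() and get_param() are not given in this document, so the following is only a sketch of the idea, using the hypothetical names db_add() and db_get() and a fixed-size array. The member types of power_db and the capacity DB_MAX are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DB_MAX 64  /* assumed capacity, for illustration only */

/* Assumed layout of one database element; the member names follow the
 * list given for power_db, the types are guesses. */
typedef struct {
    char name[64];     /* FUB, variable or file name */
    int  S;            /* sets, or the value of a variable */
    int  A;            /* associativity */
    int  B;            /* block size in bits */
    int  b;            /* output size in bits */
    int  nwl, nbl, nsp;
    char logic[8];
    char rd_mode[8];
} power_db;

static power_db db[DB_MAX];
static int db_len = 0;

/* Appends a copy of *e; returns its index, or -1 if the table is full. */
int db_add(const power_db *e)
{
    assert(e != NULL);
    if (db_len >= DB_MAX)
        return -1;
    db[db_len] = *e;
    return db_len++;
}

/* Returns the first element with the given name, or NULL if absent. */
const power_db *db_get(const char *name)
{
    int i;
    assert(name != NULL);
    for (i = 0; i < db_len; i++)
        if (strcmp(db[i].name, name) == 0)
            return &db[i];
    return NULL;
}
```

Under the filename convention described for power_db, the first call to db_add() would store the element named “root”, so it ends up at index 0.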

search_opt(), print_opt()

search_opt() is used to retrieve the physical structure parameters (nwl, nbl, nsp, logic style, read mode) on giving the option name. print_opt() prints all the elements of the power parameter database in a tabular form. It is helpful in debugging.

dump_fub_stats()

This function dumps all the power statistics to the screen or into the specified file. The file dump is selected by mode = 0 and the screen dump by mode ≠ 0.

power_init()

This function allocates memory for all the FUB structures and calls init() on each FUB. It also reads the thresholds specified by the –global option and initializes the global power structure.

init()

This function reads the power densities and areas of the FUBs from the basic configuration file in the pfa mode. If the mode is anal, it just calls calc_anal(). The function initializes all the power variables inside the structure. Finally, it adds the FUB to the FUB database.

calc_anal(), array_power()

These functions calculate the power numbers when in anal mode. calc_anal() calls array_power(), which in turn calls routines from anal.c to generate the power constants.
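One plausible way for array_power() to combine the constants returned by the individual models is to add them member by member into a single power_t. The sketch below assumes exactly that; the function name sum_components() is hypothetical, and the real routine may weight or select components differently.

```c
#include <assert.h>
#include <stddef.h>

/* The three elements of power_t as described in the power.h section. */
typedef struct {
    double active_power_rd;
    double active_power_wr;
    double static_power;
} power_t;

/* Adds the per-component constants (decoder, wordline, bitline, ...)
 * into one total for the array. */
power_t sum_components(const power_t *parts, size_t n)
{
    power_t total = {0.0, 0.0, 0.0};
    size_t i;
    assert(parts != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total.active_power_rd += parts[i].active_power_rd;
        total.active_power_wr += parts[i].active_power_wr;
        total.static_power    += parts[i].static_power;
    }
    return total;
}
```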

power_update()

All the functions mentioned before are called only at the beginning of the simulation. This routine, however, is called every cycle to update the power variables. power_update() multiplies the access counts by the active power constants if the count is non-zero; otherwise it uses the inactive power constants. Presently no clock-gating feature is incorporated, but the infrastructure is already in place. The function also checks for power threshold and di/dt threshold violations. At the end of the function the present-cycle counters are reset, whereas the cumulative counts keep accumulating.

3 anal.h

This is the header file for anal.c. It contains all the function declarations for the functions present in anal.c.

4 anal.c

This file contains all the analytical models. The analytical models are described in more detail in section 3. In this section we describe the interfaces of all the functions in anal.c.

decoder_buffer_power()

This function takes the number of address bits and the number of rows as inputs and generates power constants for the decoder buffer. The decoder buffer is meant to feed all the decoders needed for an array. Presently the size of the buffer is constant; in the future it could be made dependent on the number of decoders that it feeds.

decoder_power()

This function generates the power numbers for the decoder. It takes the number of rows and logic style as inputs.

routing_power()

This function estimates the power dissipated due to the routing in the decoder. It takes rows, columns and cell type as inputs. It needs the number of columns as an input because the decoder buffer is assumed to be at the center of all the partitions, as explained in section 3.

wordline_power()

This function calculates the power for the wordline, including the wordline driver. The wordline driver size depends on the number of columns and on the particular kind of memory cell used (i.e. read mode and cell size), both of which are inputs. The size is then calculated using the WLdriver_size() function [].

bitline_power()

This function calculates the power for the bitlines, including the precharge and isolation transistors. It takes the number of rows, columns, cell type and read mode as inputs. In the single ended read mode, no pre-charging is used. Instead, the bitlines are driven by the cell transistors. Hence, this scheme can be used for relatively small structures like register files.

senseamp_power()

This is used for calculating the sense amplifier power constants. It is assumed that the nodes of the sense amp are charged by a separate pre-charge circuit. The inputs to this function are the number of sense amps and the number of bitlines sharing one senseamp.

outmux_power()

This function calculates the power for the output MUX. The inputs to the function are the number of inputs to the MUX and the number of outputs.

compare_power()

This function calculates the power for the comparator. This model is useful for tag arrays and register update unit type FUBs.

genmux_power()

This calculates constants for a generic MUX. The inputs to the function are number of output bits and number of bits being multiplexed into one bit.

driver_size()

This function calculates the driver size for driving a capacitance with a desired rise time. The capacitance and rise time are inputs. The voltage swing is assumed to be from 0-Vdd.

bldriver_size()

This is similar to driver_size() except that the voltage swing is Vsense-Vprecharge. This function is mainly used to calculate pre-charge transistor sizes for bit lines in low-power cache implementations.

gatecap(), gatecappass()

These functions are used to calculate the gate capacitance for a given transistor width and poly length. The latter is used specifically for pass transistors.

draincapp(), draincapn()

These are used to calculate the drain capacitance for p-type and n-type transistors respectively. They also take the number of stacked transistors as an input to optimize the configuration [].

leakage()

This function calculates the leakage power, or static power, for a given transistor size with a given threshold. At present it is a very rough calculation, and much more work can be done in the future.
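The body of leakage() is not reproduced in this document. As a first-order illustration only, static power can be approximated as the leakage current per unit width from the technology file (nA/μm), multiplied by the transistor width and the supply voltage. The function name and parameters below are hypothetical.

```c
#include <assert.h>

/* First-order static power in watts for one transistor:
 *   P = I_leak [nA/um] * W [um] * Vdd [V] * 1e-9 [A/nA]
 * A rough approximation, not the formula used by the simulator. */
double leakage_power_w(double width_um, double ileak_na_per_um, double vdd)
{
    assert(width_um >= 0.0 && ileak_na_per_um >= 0.0 && vdd >= 0.0);
    return width_um * ileak_na_per_um * 1e-9 * vdd;
}
```

For a dual-threshold design, the same expression would be evaluated with the leakage figure that belongs to the threshold voltage of the transistor in question.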

log2()

This function returns the logarithm to the base two, rounded down to the nearest integer. It is mainly used to calculate the number of address bits for a given number of rows.
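A rounded-down base-two logarithm can be computed with integer shifts alone. The sketch below uses the hypothetical name ilog2_floor() so that it does not clash with the log2() of the C math library; it is not the code from anal.c.

```c
#include <assert.h>

/* floor(log2(n)) for n >= 1, computed by counting right shifts. */
int ilog2_floor(unsigned long n)
{
    int bits = 0;
    assert(n >= 1);
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}
```

For a power-of-two row count such as 256 the result, 8, is exactly the number of address bits. For a row count that is not a power of two, the rounded-down value is one bit short of what is needed to address every row.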

5 sim-outorder.c, main.c

These files have been slightly modified for the power simulator. Following is a list of changes made.

1. In main.c, a power option database called pow_odb has been added. This is used in sim_print_stats() to dump the power statistics. Another change made is the power_init() function call added after sim_init() to initialize the power simulation.

2. In sim-outorder.c, several global variables have been added; these are commented in the source. In sim_reg_options(), the five new options have been registered. The power_update() function call has been added in sim_main(). Finally, power_database() has been added; this function processes options and adds them to the power database for use in the analytical models.

2 Control Flow

The following flowchart depicts the control flow for the power simulation.

This completes the control flow description of the main functions in the power simulator.
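The overall sequence, initialization once, an update every cycle, and a dump when the instruction limit is reached, can be sketched with stub functions. Everything here is illustrative: the stubs only count calls, run_sketch() is a hypothetical driver, and a fixed number of instructions per cycle is assumed to keep the loop simple.

```c
#include <assert.h>

static int init_calls, update_calls, dump_calls;

/* Stubs standing in for power_init(), power_update() and
 * dump_fub_stats(); they only record that they were called. */
static void power_init_stub(void)     { init_calls++; }
static void power_update_stub(void)   { update_calls++; }
static void dump_fub_stats_stub(void) { dump_calls++; }

/* Simulates until insn_limit instructions have been executed and
 * returns the number of cycles taken. */
unsigned long run_sketch(unsigned long insn_limit, unsigned insn_per_cycle)
{
    unsigned long insns = 0, cycles = 0;
    assert(insn_per_cycle > 0);
    power_init_stub();               /* once, before simulation starts */
    while (insns < insn_limit) {
        insns += insn_per_cycle;     /* one simulated cycle */
        power_update_stub();         /* every cycle */
        cycles++;
    }
    dump_fub_stats_stub();           /* once, when the limit is reached */
    return cycles;
}
```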

References

[1] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report, Computer Sciences Department, University of Wisconsin, June 1997.

[2] S.J.E. Wilton and N.P. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches. Western Research Laboratory Report, May 1993.

[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In Proc. International Symposium on Computer Architecture, Jun. 2000.

[4] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. International Symposium on Computer Architecture, Jun. 2000.

[5] D. Liu and C. Svensson. Power Consumption Estimation in CMOS VLSI Chips. IEEE Journal of Solid-State Circuits, 29(6), pp. 663-670, Jun. 1994.

[6] P. Landman and J. Rabaey. Activity-Sensitive Architectural Power Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(6), p. 571, Jun. 1996.

[7] R. Chen, M. Irwin, and R. Bajwa. An architectural level power estimator. In Power-Driven Microarchitecture Workshop at ISCA25, 1998.

Appendix

|Sl. No. |Name of the FUB |Description |Models supported |

|1 |npclog |Next pc generation logic |PFA |

|2 |btblog |BTB logic |PFA |

|3 |btbcac |BTB cache |PFA/Anal |

|4 |itlbcac |Instruction TLB |PFA/Anal |

|5 |rsbcac |Return Stack Buffer |PFA/Anal |

|6 |dtlbcac |Data TLB |PFA/Anal |

|7 |pmhlog |Page miss handler |PFA |

|8 |il1log |L1 instruction cache logic |PFA |

|9 |il1tag |L1 instruction cache tag |PFA/Anal |

|10 |il1cac |L1 instruction cache array |PFA/Anal |

|11 |dl1log |L1 data cache logic |PFA |

|12 |dl1tag |L1 data cache tag |PFA/Anal |

|13 |dl1cac |L1 data cache array |PFA/Anal |

|14 |dispatchq |Dispatch Queue |PFA |

|15 |decodepla |Instruction decoder |PFA |

|16 |decodemisp |Misprediction handling logic |PFA |

|17 |decodestall |Decoder Stall logic |PFA |

|18 |ratarr |Register Aliasing table |PFA/Anal |

|19 |ruuarr |Register update unit / reorder buffer |PFA |

|20 |lsqarr |Load/Store queue |PFA |

|21 |ruurdyq |Reorder ready queue |PFA |

|22 |lsqrdyq |Load/Store ready queue |PFA |

|23 |ruuarb |Reorder arbitration logic |PFA |

|24 |ruuwb |Reorder write back scheduler |PFA |

|25 |lsqarb |Load/store arbitration logic |PFA |

|26 |lsqwb |Load/store write back scheduler |PFA |

|27 |fuint |Integer functional unit |PFA |

|28 |fufp |Floating point functional unit |PFA |

|29 |ul2log |Unified L2 cache logic |PFA |

|30 |ul2tag |Unified L2 cache tag |PFA/Anal |

|31 |ul2cac |Unified L2 cache array |PFA/Anal |

|32 |biu |Bus/IO unit |PFA |

|33 |fdlatch |Fetch Decode latch |PFA |

|34 |dilatch |Decode Issue Latch |PFA |

|35 |isw |Instruction Issue Window |PFA |

Table of FUBs: Shows the various functional unit blocks and the models that exist for each in the simulator. PFA: Power Factor Approximation. Anal: analytical models exist.

|Sl No. |Name of the counter |Associated FUB |Description |

|0 |Brupdate |BTB cache |branch update activity |

|1 |Brlookup |BTB cache |branch lookup activity |

|2 |Rsbpop |Return Stack Buffer |return stack pop activity |

|3 |Rsbpush |Return Stack Buffer |return stack push activity |

|4 |Il1acc |L1 Instruction cache |il1 access activity |

|5 |Il1wbk |L1 Instruction cache |il1 writebacks activity |

|6 |Il1rep |L1 Instruction cache |il1 replacements activity |

|7 |Il1inv |L1 Instruction cache |il1 invalidations activity |

|8 |Dl1acc |L1 Data cache |dl1 access activity |

|9 |Dl1wbk |L1 Data cache |dl1 writebacks activity |

|10 |Dl1rep |L1 Data cache |dl1 replacements activity |

|11 |Dl1inv |L1 Data cache |dl1 invalidations activity |

|12 |Il2acc |L2 Instruction cache |il2 access activity |

|13 |Il2wbk |L2 Instruction cache |il2 writebacks activity |

|14 |Il2rep |L2 Instruction cache |il2 replacements activity |

|15 |Il2inv |L2 Instruction cache |il2 invalidations activity |

|16 |Dl2acc |L2 Data cache |dl2 access activity |

|17 |Dl2wbk |L2 Data cache |dl2 writebacks activity |

|18 |Dl2rep |L2 Data cache |dl2 replacements activity |

|19 |Dl2inv |L2 Data cache |dl2 invalidations activity |

|20 |Ul2acc |L2 Unified cache |ul2 access activity |

|21 |Ul2wbk |L2 Unified cache |ul2 writebacks activity |

|22 |Ul2rep |L2 Unified cache |ul2 replacements activity |

|23 |Ul2inv |L2 Unified cache |ul2 invalidations activity |

|24 |Itlbmis |Instruction TLB |itlb miss activity |

|25 |Dtlbmis |Data TLB |dtlb miss activity |

|26 |Ul2mis |L2 Unified cache |ul2 miss activity |

|27 |Itlbacc |Instruction TLB |itlb access activity |

|28 |Itlbwbk |Instruction TLB |itlb writebacks activity |

|29 |Itlbrep |Instruction TLB |itlb replacements activity |

|30 |Itlbinv |Instruction TLB |itlb invalidations activity |

|31 |Dtlbacc |Data TLB |dtlb access activity |

|32 |Dtlbwbk |Data TLB |dtlb writebacks activity |

|33 |Dtlbrep |Data TLB |dtlb replacements activity |

|34 |Dtlbinv |Data TLB |dtlb invalidations activity |

|35 |Npc |Next pc generation logic |next pc logic activity |

|36 |Dispatchqrd |Dispatch Queue |dispatchq read activity |

|37 |Dispatchqwr |Dispatch Queue |dispatchq write activity |

|38 |Dispatchqrel |Dispatch Queue |dispatchq release activity |

|39 |Dispatchqrec |Dispatch Queue |dispatchq recover activity |

|40 |Decoder |Instruction decoder |decoder activity |

|41 |Decodemispchk |Instruction decoder |decoder mispredict detect activity |

|42 |Decodemisp |Instruction decoder |decoder mispredict correction activity |

|43 |Decodestallchk |Instruction decoder |decoder stall detect activity |

|44 |Decodestall |Instruction decoder |decoder stall block activity |

|45 |Ratidep |Register Aliasing table |rat idep allocation activity |

|46 |Ratodep |Register Aliasing table |rat odep allocation activity |

|47 |Ratstallchk |Register Aliasing table |rat stall detection activity |

|48 |Ratstall |Register Aliasing table |rat stall block activity |

|49 |Ruuarr |Reorder buffer |ruu array activity |

|50 |Ruurdyqsch |Reorder buffer |ruu readyq allocation activity |

|51 |Ruurec |Reorder buffer |ruu recover activity |

|52 |Ruuret |Reorder buffer |ruu retire activity |

|53 |Ruurdyqcam |Reorder buffer |ruu readyq dependence check activity |

|54 |Ruurdyqrel |Reorder buffer |ruu readyq resource release activity |

|55 |Lsqarr |Load/Store queue |lsq array activity |

|56 |Lsqrdyqsch |Load/Store queue |lsq readyq allocation activity |

|57 |Lsqrec |Load/Store queue |lsq recover activity |

|58 |Lsqret |Load/Store queue |lsq retire activity |

|59 |Lsqrdyqcam |Load/Store queue |lsq readyq dependence check activity |

|60 |Lsqrdyqrel |Load/Store queue |lsq readyq resource release activity |

|61 |Ruuarb |Reorder buffer |ruu arbitration activity |

|62 |Ruuwb |Reorder buffer |ruu writeback scheduler activity |

|63 |Ruuwbq |Reorder buffer |ruu writebackq activity |

|64 |Lsqarb |Load/Store queue |lsq arbitration activity |

|65 |Lsqwb |Load/Store queue |lsq writeback scheduler activity |

|66 |Lsqwbq |Load/Store queue |lsq writebackq activity |

|67 |Fuint |Integer functional unit |functional unit integer |

|68 |Fufp |Floating point functional unit |functional unit floating point |

|69 |Fdlatch_active |Fetch Decode latch |Latch after fetch stage active |

|70 |Fdlatch_stall |Fetch Decode latch |Latch after fetch stage stalled |

|71 |Fdlatch_empty |Fetch Decode latch |Latch after fetch stage empty |

|72 |Dilatch_active |Decode Issue Latch |Latch after decode stage active |

|73 |Dilatch_stall |Decode Issue Latch |Latch after decode stage stalled |

|74 |Dilatch_empty |Decode Issue Latch |Latch after decode stage empty |

|75 |Iswact |Instruction Issue Window |Issue window latch active |

|76 |Iswstall |Instruction Issue Window |Issue window latch stalled |

|77 |Iswempty |Instruction Issue Window |Issue window latch empty |

|78 |Iswcolmoved |Instruction Issue Window |Collapsible Issue window latch moved |

Table of Counters: Note that the number of counters varies with the number of latches. If there are three latches after the fetch stage, there are 9 Fdlatch counters (69-77), and likewise for the latches after the decode stage.


[Figure: control flow flowchart. The layout was lost in extraction; the recoverable box labels are listed below.]

main.c

sim-outorder:sim_reg_options(): registers the power options into the options database.

sim-outorder:power_database(): creates the power database using options read from the configuration file and the options database.

power.c:power_init()

power.c:init()

power.c:calc_anal()

power.c:array_power()

anal.c: decoder_buffer_power(), decoder_power(), routing_power(), wordline_power(), bitline_power(), senseamp_power(), outmux_power(), comparator_power()

main.c

Sim-outorder.c

power.c:power_update() (every cycle)

power.c:dump_fub_stats()
