HP OpenVMS Systems

HP Digital Continuous Profiling Infrastructure
dcpicalc(1)

NAME

dcpicalc - Calculates cycles-per-instruction of procedures

SYNOPSIS

dcpicalc [<options>] -procedures procedure-name-list -- image-file

dcpicalc [<options>] procedure-name image-file

DESCRIPTION

The dcpicalc command generates the control flow graph of the specified procedure(s) in the specified image file. Using profiles collected by dcpid(1) and stored in the specified profile files, dcpicalc augments the graph with estimated execution frequencies of basic blocks, cycles-per-instruction for instructions, possible explanations for stalls, and other useful information. The resulting flow graph is printed to standard output.

The output can be converted to PostScript by dcpi2ps(1). In the PostScript output, "larger" basic blocks are generally "more important." Specifically, for each basic block, the font size indicates the block's execution frequency, the physical space occupied by the block on paper indicates the amount of time spent in that block, and the number of lines indicates the average number of cycles required to execute it.

The first command syntax allows you to specify multiple procedures. dcpicalc concatenates the outputs for the individual procedures, starting each with a line of the form

; PROC procedure-name

The procedure-name input parameter can be the ASCII procedure name, an address within a procedure (in C syntax; for example, 0x20000 is hex address 20000), or an explicit address range (useful when analyzing images without debug symbol information; for example, 0x20000:0x200c0).
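The three forms of procedure-name described above can be told apart mechanically. The following Python sketch is illustrative only and is not part of dcpicalc; the function name classify_procedure_name is hypothetical, and the sketch assumes that addresses are always written in hexadecimal with a 0x prefix, as in the examples above.

```python
import re

HEX_ADDRESS = r"0[xX][0-9a-fA-F]+"  # assumption: addresses use the 0x prefix


def classify_procedure_name(arg):
    """Classify a procedure-name argument as a range, an address, or a name."""
    match = re.fullmatch(rf"({HEX_ADDRESS}):({HEX_ADDRESS})", arg)
    if match:
        # Explicit address range, for example 0x20000:0x200c0
        return ("range", int(match.group(1), 16), int(match.group(2), 16))
    if re.fullmatch(HEX_ADDRESS, arg):
        # Single address within a procedure, for example 0x20000
        return ("address", int(arg, 16))
    # Anything else is treated as an ASCII procedure name
    return ("name", arg)


print(classify_procedure_name("0x20000:0x200c0"))
print(classify_procedure_name("0x20000"))
print(classify_procedure_name("foo"))
```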

Analyzing multiple procedures at a time is typically much more efficient than invoking the command once per procedure, although dcpicalc reports exactly the same information in both cases. The -procedures option can be mixed with the other options. The list of procedures is terminated by "--" or another option. The second command syntax can name only one procedure.

Note:  This command can only be used on aggregate (versus ProfileMe) data.

FLAGS

-help
Print information about options.
-print_opcode
Output the machine code, in hex, for each instruction.
-cutoff n
Omit basic blocks taking less than n% of the time spent in the procedure. The instructions of these basic blocks are not printed. When the output is piped through dcpi2ps, these basic blocks appear as tiny boxes with only block names. Note that n is a floating point number between 0 and 100 (inclusive). The default value is 0: no blocks are omitted.
-procedures procedure-name-list
Analyze the specified procedures. The list is terminated by "--" or another option.
-version
Print program version information.
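The effect of -cutoff can be pictured with a short Python sketch. It is not dcpicalc's implementation; the function name apply_cutoff and the block names are hypothetical, and the sketch assumes that the time spent in a block is proportional to the samples attributed to it.

```python
def apply_cutoff(block_samples, cutoff_percent=0.0):
    """Split blocks into those kept and those omitted by a -cutoff value.

    block_samples maps a block name to its sample count (assumed to be
    proportional to the time spent in the block).
    """
    if not 0.0 <= cutoff_percent <= 100.0:
        raise ValueError("cutoff must be between 0 and 100 (inclusive)")
    total = sum(block_samples.values())
    kept, omitted = {}, []
    for name, samples in block_samples.items():
        share = 100.0 * samples / total if total else 0.0
        if share < cutoff_percent:
            omitted.append(name)   # instructions of this block are not printed
        else:
            kept[name] = samples
    return kept, omitted


# With the default of 0, no block is omitted.
print(apply_cutoff({"b1": 900, "b2": 95, "b3": 5}))
# With a cutoff of 1%, the block holding 0.5% of the samples is omitted.
print(apply_cutoff({"b1": 900, "b2": 95, "b3": 5}, 1.0))
```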

FREQUENCY AND STALL ANALYSIS FLAGS

The following options can be used to control the heuristics for estimating execution frequencies and identifying the causes of stalls:

-conf_low
Generate low, medium, and high confidence data.
-conf_med
Generate medium and high confidence data. (default)
-conf_high
Generate only high confidence data.
-cross_procedure [optimistic | pessimistic | selective]
Choose what assumption to make when a procedure call boundary is encountered while looking for reasons to explain dynamic stalls. A procedure call boundary is either a call made by the procedure being analyzed or the beginning or end of that procedure. With pessimistic, assume that whatever happens outside the analyzed procedure can cause a dynamic stall inside it. With optimistic, assume that it cannot. With selective, the assumption is based on standard procedure call convention. (The default is optimistic.)
-do_gp
Use a (nonlinear time) constraint solver to exploit global flow constraints when estimating execution frequencies. The frequency estimates may still violate flow constraints.

PROFILE FILE FLAGS

By default, this command automatically finds all of the relevant profile files. The following options can be used to guide the search for the profile files:

-db <directory name>
Search for profile files in the specified profile database directory. The directory name should be the same name as the one specified when dcpid was started. That is, the named directory should contain a set of epochs. If this option is not specified, the directory name is obtained from the DCPIDB logical name. If neither of these methods succeeds in finding the appropriate directory, and no explicit set of profile files is provided via the -profiles option, then the command fails.
-epoch latest
Search for profile files in the latest epoch. This is the default.
-epoch latest-k
Search for profile files in the "k+1"th most recent epoch. For example, "-epoch latest-2" searches the third most recent epoch.
-epoch all
Search for profile files in all epochs.
-epoch <name>
Search for profile files in the named epoch. The epoch name should be the name of a subdirectory corresponding to a single epoch within the profile database directory. Epoch subdirectory names usually take the form YYYYMMDDHHMM (year-month-day-hours-minutes). For example, an epoch started on February 4, 2002 at 23:34 is named 200202042334. If an epoch is given a symbolic name by creating a symbolic link to the actual epoch directory, then the symbolic name can also be used as an argument to the -epoch option.
-events all
Search for profile files corresponding to all event types such as cycles, icache misses, branch mispredictions, etc. This is the default.
-events type(+type)*
Search for profile files for the specified event types. For example, search for cycles, icache misses, and data cache misses when the option -events cycles+imiss+dmiss is specified.
-events all(-type)*
Search for profile files for all event types except for the specified types. For example, search for all event types except for branch mispredictions when the option -events all-branchmp is specified.
-label <label>
Search for profile files with the specified label (see dcpilabel). If no labels are specified on the command line, profile file labels are ignored entirely. If any labels are specified on the command line (this option can be repeated several times), only profile files that have one of the specified labels are used.
-profiles <file names...> --
Use just the profile files named by the specified file names. The list of profile file names can be terminated either by "--" or by the end of the option list. The command prints an error message and fails if the -profiles option is used in conjunction with any of the earlier automatic profile finding options. (Use either the automatic profile lookup mechanism, or explicitly name the profile files with the -profiles option, but not both.)
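The -epoch and -events selectors described above can be sketched in Python as follows. This is an illustration, not the tool's lookup code: the function names are hypothetical, the sketch handles only epochs with timestamp names (symbolic names are ignored), and it assumes that event type names contain neither "+" nor "-".

```python
from datetime import datetime


def epoch_name(start):
    """Format an epoch start time as YYYYMMDDHHMM."""
    return start.strftime("%Y%m%d%H%M")


def select_epochs(epoch_names, selector="latest"):
    """Resolve an -epoch argument against timestamp-named epochs."""
    ordered = sorted(epoch_names)  # YYYYMMDDHHMM names sort chronologically
    if selector == "all":
        return ordered
    if selector == "latest":
        return ordered[-1:]
    if selector.startswith("latest-"):
        k = int(selector[len("latest-"):])
        if k < 0 or k >= len(ordered):
            return []
        return [ordered[-1 - k]]   # the (k+1)th most recent epoch
    return [selector] if selector in ordered else []


def select_events(available, spec="all"):
    """Resolve an -events argument: all, type(+type)*, or all(-type)*."""
    available = set(available)
    if spec == "all":
        return available
    if spec.startswith("all-"):
        return available - set(spec.split("-")[1:])
    return available & set(spec.split("+"))


print(epoch_name(datetime(2002, 2, 4, 23, 34)))
epochs = ["200202042334", "200201010000", "200203150900"]
print(select_epochs(epochs, "latest-2"))
print(sorted(select_events({"cycles", "imiss", "dmiss", "branchmp"}, "all-branchmp")))
```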

INTERPRETING OUTPUT

The dcpicalc command provides information at the instruction, basic block, and procedure level. The dcpicalc command is sometimes unable to estimate the cycle-to-sample ratio for a block. Such blocks are excluded from all summary information except the instruction count. The dcpicalc command makes no attempt to identify stalls (static or dynamic) in such blocks. Therefore, most of the following discussion pertains only to blocks with known cycle-to-sample ratios.

Instruction Level Information

At the instruction level, dcpicalc inserts "bubbles" into the instruction listings to identify points where the processor stalls because it is unable to issue an instruction. Bubbles are inserted before the stalled instruction. Here is an example:

 588584  318:2e4c0000 ldq_u    a2, 0(s3)          1558  1
 588588  318:a79d2d70 ldq      at, 11632(gp)    191855  0  1.5cy
   a
   a
 58858c  318:4a4c00d2 extbl    a2, s3, a2       164109  2  1.5cy   8584
   s
     d
     d
     d
     d
     d
     d
 588590  318:43920412 addq     at, a2, a2       428395  1  4.0cy   8588
   b
     ?
     ?
 588594  318:2c320000 ldq_u    t0, 0(a2)        227783  1  2.0cy   8590
   s
 588598  318:22520001 lda      a2, 1(a2)        121068  1  1.0cy
   b
     d
     d
     d
     d
 58859c  318:48320f41 extqh    t0, a2, t0       336123  1  3.0cy   8598 8594
   s
 5885a0  318:48271781 sra      t0, 0x38, t0     123408  1  1.0cy
   b
 5885a4  318:41810402 addq     s3, t0, t1       127442  1  1.0cy   85a0
   s
 5885a8  318:2c620000 ldq_u    t2, 0(t1)        123021  1  1.0cy
 5885ac  318:47ff041f bis      zero, zero, zero      0  0  nop
   a
   a
     d
     d
     d
     d
     d
     d
     d
     d
 5885b0  318:486200c4 extbl    t2, t1, t3       658189  2  6.0cy   85a8
 5885b4  318:47ff0403 bis      zero, zero, t2        0  0
 5885b8  318:48807630 zapnot   t3, 0x3, a0      122504  1  1.0cy
 5885bc  318:47ff041f bis      zero, zero, zero      0  0  nop
     i
 5885c0  318:421fd9b1 cmplt    a0, 0xfe, a1     155841  1  1.5cy
 5885c4  318:e6200002 beq      a1, 0x1205885d0       0  0

Each line of assembly code shows, from left to right,

  • the instruction's address (hexadecimal),
  • the source line number (decimal),
  • the instruction's 32-bit machine code in hexadecimal (if -print_opcode is specified),
  • the instruction in mnemonics,
  • the number of PC samples falling at this instruction address (decimal),
  • the minimum number of cycles the instruction is predicted to spend at the head of the issue queue (actual schedule may vary),
  • (optionally) the average number of cycles spent at this instruction address, and
  • (optionally) the other instructions that may have caused this instruction to stall (see details below).
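For readers who post-process the listing, the fields above can be extracted with a regular expression. The Python sketch below is an illustration based only on the sample listing shown earlier; it assumes that -print_opcode was specified, that culprit addresses follow any "cy" value, and that bubble lines are skipped. The function name parse_instruction_line is hypothetical.

```python
import re

# Assumed layout, inferred from the sample listing (with -print_opcode):
# address, line:opcode, mnemonic, operands, samples, minimum cycles,
# then an optional "N.Ncy" value and optional trailing tokens.
INSTRUCTION_RE = re.compile(
    r"^\s*(?P<address>[0-9a-f]+)\s+"
    r"(?P<line>\d+):(?P<opcode>[0-9a-f]{8})\s+"
    r"(?P<mnemonic>\S+)\s+"
    r"(?P<operands>.*?)\s+"
    r"(?P<samples>\d+)\s+"
    r"(?P<min_cycles>\d+)"
    r"(?:\s+(?P<avg_cycles>[0-9.]+)cy)?"
    r"(?P<rest>(?:\s+\S+)*)\s*$"
)


def parse_instruction_line(text):
    """Return the fields of one instruction line, or None for other lines."""
    match = INSTRUCTION_RE.match(text)
    if match is None:
        return None  # bubble line, blank line, or summary line
    rest = match.group("rest").split()
    avg = match.group("avg_cycles")
    return {
        "address": int(match.group("address"), 16),
        "source_line": int(match.group("line")),
        "opcode": match.group("opcode"),
        "mnemonic": match.group("mnemonic"),
        "operands": match.group("operands"),
        "samples": int(match.group("samples")),
        "min_cycles": int(match.group("min_cycles")),
        "avg_cycles": float(avg) if avg is not None else None,
        "is_nop": "nop" in rest,
        "culprits": [token for token in rest if token != "nop"],
    }


sample = " 58859c  318:48320f41 extqh    t0, a2, t0       336123  1  3.0cy   8598 8594"
print(parse_instruction_line(sample))
```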

Each line in the listing represents a half-cycle, which makes it easy to see whether instructions are being dual-issued. To avoid excessively long listings, however, dcpicalc represents a very long stall with a large but limited number of bubbles. The actual number of stall cycles is shown as a number along with the bubbles.

Stall cycles are either static or dynamic. Static stall cycles are those that the processor would suffer even if there were no dynamic stalls (for example, if all memory loads hit in the D-cache and all conditional branches are predicted correctly). The rest are dynamic. The bubbles for the static and dynamic stall cycles are shown in different columns.

In the static column (the leftmost column), bubbles have the following meanings:

  • s refers to stall cycles resulting from static resource conflicts among the instructions within the same "window" (consisting of two instructions for Alpha 21064 and four for 21164) that the processor considers for issue in any given cycle.
  • a/b/c refer to stall cycles caused by register dependencies on previous instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.
  • f refers to stall cycles caused by competition for the function units and other internal resources in the processor.

In the dynamic column(s), there may be multiple possible explanations for the same stall cycles; sometimes there may be none. Each explanation is represented by a column of bubbles. In some cases, dcpicalc can compute the maximum number of stall cycles that a particular reason can account for. If this is less than the number of stall cycles, the column for that reason may not extend all the way down to the stalled instruction.

The bubbles have the meanings below:

  • d - D-cache miss
  • D - DTB miss
  • I - I-cache or ITB miss
  • i - I-cache miss (but not ITB miss)
  • w - write buffer overflow
  • y - synchronization of memory operations (using memory barriers)
  • p - branch misprediction
  • f - busy function unit
  • o - other (currently TRAPB, EXCB, or load-after-store replay trap)
  • ? - unexplained
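The static and dynamic bubble symbols listed above can be tallied from a listing with a few lines of Python. This sketch is an illustration only: it assumes, from the sample listing, that static bubbles sit at character column 3 and dynamic bubbles further to the right, and the function name tally_bubbles is hypothetical. Since each listing line represents a half-cycle, a count of n lines corresponds to n/2 cycles.

```python
from collections import Counter

STATIC_BUBBLES = {
    "s": "slotting (static resource conflict within the issue window)",
    "a": "Ra register dependency",
    "b": "Rb register dependency",
    "c": "Rc register dependency",
    "f": "function unit or other internal resource",
}
DYNAMIC_BUBBLES = {
    "d": "D-cache miss",
    "D": "DTB miss",
    "I": "I-cache or ITB miss",
    "i": "I-cache miss (but not ITB miss)",
    "w": "write buffer overflow",
    "y": "synchronization of memory operations",
    "p": "branch misprediction",
    "f": "busy function unit",
    "o": "other",
    "?": "unexplained",
}


def tally_bubbles(lines, static_column=3):
    """Count bubble lines per (column kind, symbol).

    static_column is an assumption taken from the sample listing.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens or any(len(token) != 1 for token in tokens):
            continue  # blank line or instruction line
        for column, symbol in enumerate(line):
            if symbol.isspace():
                continue
            if column == static_column and symbol in STATIC_BUBBLES:
                counts[("static", symbol)] += 1
            elif column > static_column and symbol in DYNAMIC_BUBBLES:
                counts[("dynamic", symbol)] += 1
    return counts


listing = ["   s"] + ["     d"] * 6
counts = tally_bubbles(listing)
print(counts[("dynamic", "d")] * 0.5, "cycles possibly explained by D-cache misses")
```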

Several points are worth mentioning here. First, notice that there is no symbol for ITB miss alone because an I-cache miss is possible whenever an ITB miss is possible. Second, "other" means miscellaneous other reasons that typically account for only a tiny percentage of stalls. Currently it includes stalls at TRAPB or EXCB instructions, which are not issued until all previous instructions are guaranteed to complete without traps or both traps and exceptions, respectively. Third, the symbol "f" may appear in both the static and dynamic columns because competition for function units may explain both static and dynamic stalls. For example, the stall caused by a floating-point division may be partly static, because part of it can be predicted by scheduling the instructions, and partly dynamic, because part of it is data dependent. An "f" in the dynamic column typically means a busy integer multiply or floating-point divide unit.

For each stalled instruction, dcpicalc also lists instructions that may have caused the stalls. This list appears at the end of the line showing the stalled instruction. A four-digit hexadecimal address indicates an instruction in the same basic block as the stalled instruction; a full block name with a four-digit hexadecimal address indicates an instruction in another basic block; a full block name without an address indicates that the instruction potentially causing the stall is assumed to be in another procedure, which can be a callee or the caller of the current procedure. Note that the lists of instructions and explanations are not always exhaustive, in part because longer stalls may hide shorter ones.

If an instruction is a nop, dcpicalc will indicate it by appending "nop" to the line showing the instruction.

Block Level Information

At the beginning of a block, dcpicalc displays summary information for the block. For example:

 *** One cycle = 714428 samples
 *** Executed 4.83 times/invocation
 *** Best-case 8/9 =  0.89CPI, Actual 22/9 =   2.44CPI
 *** (36% execution without dynamic stalls)

The first line is the cycle-to-sample ratio for the block; this is dcpicalc's estimate of how many PC samples in the profiling data correspond to one cycle. The next line is the average number of times the block is executed relative to the number of times the entry and/or exit blocks are executed. The third line displays the best-case and actual cycles per instruction (CPI) for the block. The best-case scenario includes all stalls statically predictable from the instruction stream (for example, an Alpha 21164 cannot dual-issue consecutive store instructions) but assumes that there are no dynamic stalls (for example, all load instructions hit in the D-cache). The last line displays the best-case cycles per instruction as a percentage of the actual (here, 8/22 = 36%).
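The arithmetic behind the third and fourth summary lines can be reproduced with a small Python sketch. It reads "8/9" and "22/9" in the sample as cycles divided by instructions (8 best-case cycles and 22 actual cycles for a block of 9 instructions); the function name block_cpi is hypothetical.

```python
def block_cpi(best_case_cycles, actual_cycles, instructions):
    """Return best-case CPI, actual CPI, and best-case as a percentage of actual."""
    best = best_case_cycles / instructions
    actual = actual_cycles / instructions
    without_dynamic_stalls = 100.0 * best_case_cycles / actual_cycles
    return best, actual, without_dynamic_stalls


best, actual, percent = block_cpi(8, 22, 9)
print(f"Best-case 8/9 = {best:.2f}CPI, Actual 22/9 = {actual:.2f}CPI")
print(f"({percent:.0f}% execution without dynamic stalls)")
```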

Procedure Level Information

At the procedure level, dcpicalc displays summary information in the entry block. This information includes the number of instructions in the procedure, averages of the best-case and actual cycles per instruction (computed from the per-block values weighted by block execution frequencies), and a sorted list of blocks accounting for 90% of the stalls in the procedure.
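One plausible way to derive these procedure-level figures from per-block values is sketched below in Python. The exact weighting that dcpicalc uses is not documented here, so the sketch assumes that each block's cycles and instruction count are multiplied by its execution frequency; the field names, the sample numbers, and the function name procedure_summary are all hypothetical.

```python
def procedure_summary(blocks, coverage=0.90):
    """Frequency-weighted CPI averages and the blocks covering most stalls.

    Each block is a dict with hypothetical keys: name, freq, instructions,
    best_cycles, actual_cycles, stall_cycles (all per execution of the block).
    """
    executed = sum(b["freq"] * b["instructions"] for b in blocks)
    best_cpi = sum(b["freq"] * b["best_cycles"] for b in blocks) / executed
    actual_cpi = sum(b["freq"] * b["actual_cycles"] for b in blocks) / executed

    # Sort blocks by their weighted stall cycles, largest first, and keep
    # adding blocks until the requested share of all stalls is covered.
    weighted = sorted(
        ((b["freq"] * b["stall_cycles"], b["name"]) for b in blocks),
        reverse=True,
    )
    total_stall = sum(stall for stall, _ in weighted)
    top_blocks, running = [], 0.0
    for stall, name in weighted:
        if running >= coverage * total_stall:
            break
        top_blocks.append(name)
        running += stall
    return best_cpi, actual_cpi, top_blocks


blocks = [
    {"name": "A", "freq": 10, "instructions": 4, "best_cycles": 3,
     "actual_cycles": 10, "stall_cycles": 8},
    {"name": "B", "freq": 1, "instructions": 10, "best_cycles": 6,
     "actual_cycles": 8, "stall_cycles": 3},
    {"name": "C", "freq": 5, "instructions": 2, "best_cycles": 1,
     "actual_cycles": 4, "stall_cycles": 3},
]
print(procedure_summary(blocks))
```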

Moreover, dcpicalc summarizes how the cycles are spent. Here is a sample summary followed by line-by-line explanations:

  Line  1    I-cache (not ITB)   3.5% to  7.4%
  Line  2     ITB/I-cache miss   3.7% to  3.7%
  Line  3         D-cache miss  25.2% to 27.2%
  Line  4             DTB miss   0.0% to  1.7%
  Line  5         Write buffer   0.0% to  0.0%
  Line  6      Synchronization   0.0% to  0.0%

  Line  7    Branch mispredict   0.7% to  2.6%
  Line  8            IMUL busy   0.0% to  0.0%
  Line  9            FDIV busy   0.0% to  0.0%
  Line 10                Other   0.0% to  0.0%

  Line 11    Unexplained stall   1.9% to  1.9%
  Line 12     Unexplained gain  -0.8% to -0.8%
            ----------------------------------------
  Line 13     Subtotal dynamic                 38.4%

  Line 14             Slotting       6.4%
  Line 15        Ra dependency      10.0%
  Line 16        Rb dependency       2.9%
  Line 17        Rc dependency       0.0%
  Line 18        FU dependency       1.9%
            ----------------------------------------
  Line 19      Subtotal static                 21.2%

            ----------------------------------------
  Line 20          Total stall                 59.6%

  Line 21               Useful      39.4%
  Line 22                 Nops       1.2%
            ----------------------------------------
  Line 23            Execution                 40.6%

  Line 24   Net sampling error                 -0.2%
            ----------------------------------------
  Line 25        Total tallied                100.0%
  Line 26   (114504716, 88.8% of all samples)

Lines 1 to 13
show all dynamic stall cycles. See the previous discussion of instruction level information for the meanings of these categories. Unexplained stall (line 11) represents stall cycles for which dcpicalc cannot offer any plausible explanation. Unexplained gain (line 12) occurs when instructions take fewer cycles than even the ideal assumption. For example, since we take dual-issue as the ideal case, if in fact three instructions are issued (two to the integer pipelines and one to a floating point pipeline), half a cycle would be attributed to "unexplained gain." For the difference between "I-cache (not ITB)" and "ITB/I-cache miss," please see the earlier discussion on the corresponding bubbles `i' and `I'.

The dcpicalc command shows a range of stall cycles (as a percentage of total cycles tallied) that could have been caused by each reason listed. Some of the ranges may be wide if major stalls can be explained by more than one reason. Generally, the accuracy of the analysis can be improved by using profiles for events other than cycles. Currently, dcpicalc takes advantage of imiss, itbmiss, and dtbmiss profiles if they are specified on the command line. Although the contributions of individual stall reasons are reported as ranges, the subtotal for all dynamic stalls is not. It represents the cycles attributed to any one or more of the reasons. Therefore, it does not depend on how stall cycles are apportioned among alternative reasons for the same stall.

Lines 14 to 19
show the static stall cycles. These are stall cycles that the processor would suffer even if there were no dynamic stalls. For example, this assumes that a load from memory takes only two cycles, which corresponds to a D-cache hit. Additional stall cycles due to a cache miss are considered dynamic. If an instruction is stalled for multiple reasons, the static stall cycles are attributed to the last reason preventing instruction issue. Thus, shorter stalls are hidden by longer ones.
Slotting (line 14)
refers to stall cycles resulting from static resource conflicts among the instructions within the same "window" that the processor considers for issue in any given cycle.
Ra/Rb/Rc dependencies (lines 15-17)
refer to stall cycles caused by register dependencies on previous instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.
FU dependency (line 18)
refers to stall cycles caused by competition for function units and other internal resources in the processor.
Lines 21 to 23
are the numbers of cycles spent on executing instructions. Line 23 includes all instructions; line 22 includes nops; line 21 includes "useful" instructions (that is, instructions other than nops). Each of them is simply half the number of executed instructions (of the respective type) since we assume dual-issue to be the ideal case. The execution percentage may exceed 100%. One reason is that the Alpha 21164 may issue floating point instructions in addition to two integer instructions per cycle. Since dcpicalc assumes dual issue to be the ideal case (corresponding to 100% execution), the extra instructions would cause this percentage to exceed 100%. Another possible explanation is discrepancies due to sampling error in rarely executed code.

Note that the time spent on "nops" is not necessarily wasted. These operations are often inserted deliberately by the compiler's instruction scheduler to improve instruction execution by the processor's pipeline. If they were removed, fewer instructions would be executed, but it may not take less time.

Line 24
is the net discrepancy due to sampling error and inaccuracy in execution frequency estimates. This can give some indication of how noisy the sample data are, but since it is net discrepancy, two discrepancies of opposite signs may cancel out each other, giving a small error term. However, significant discrepancies are attributed to unexplained stall and gain (lines 11 and 12); they do not cancel out.
Line 25
is simply the sum of the subtotals. It should always be 100%. If not, report a bug!
Line 26
shows the total number of samples tallied for this summary, and its ratio to the number of all samples for this procedure. We tally only the samples falling in basic blocks whose execution frequencies have been determined by dcpicalc. All previous percentages in the summary are computed relative to the number of tallied samples.
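The way the subtotals in the sample summary fit together can be checked with the following Python sketch, which uses only the numbers printed in the sample. It uses decimal arithmetic so that the tenths add up exactly. The observation that the dynamic subtotal lies between the sums of the low and high ends of the ranges holds for this sample; it is not claimed as a general rule.

```python
from decimal import Decimal

# Lines 1 to 12: (low, high) ends of each dynamic stall range, in percent.
dynamic_ranges = [
    ("3.5", "7.4"), ("3.7", "3.7"), ("25.2", "27.2"), ("0.0", "1.7"),
    ("0.0", "0.0"), ("0.0", "0.0"), ("0.7", "2.6"), ("0.0", "0.0"),
    ("0.0", "0.0"), ("0.0", "0.0"), ("1.9", "1.9"), ("-0.8", "-0.8"),
]
dynamic_low = sum(Decimal(low) for low, _ in dynamic_ranges)
dynamic_high = sum(Decimal(high) for _, high in dynamic_ranges)
subtotal_dynamic = Decimal("38.4")                     # line 13

# Lines 14 to 18 add up to the static subtotal on line 19.
subtotal_static = sum(Decimal(x) for x in ("6.4", "10.0", "2.9", "0.0", "1.9"))

total_stall = subtotal_dynamic + subtotal_static       # line 20
execution = Decimal("39.4") + Decimal("1.2")           # lines 21 + 22 = line 23
sampling_error = Decimal("-0.2")                       # line 24
total_tallied = total_stall + execution + sampling_error  # line 25

print("dynamic range sums:", dynamic_low, "to", dynamic_high)
print("static subtotal:", subtotal_static)
print("total stall:", total_stall)
print("execution:", execution)
print("total tallied:", total_tallied)
```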

TYPICAL USAGE

Typically, dcpicalc and dcpi2ps are used together as follows:
 
$PIPE DCPICALC -db db foo program.exe > bar.graph
$DCPI2PS bar.graph output.ps

It is also possible to read the ASCII output of dcpicalc directly.

LIMITATIONS

This command can only be used on aggregate (versus ProfileMe) data.

SEE ALSO

dcpi(1), dcpi2ps(1), dcpicat(1), dcpictl(1), dcpid(1), dcpidiff(1), dcpiformat(4), dcpilist(1), dcpiprof(1), dcpitopstalls(1), dcpiwhatcg(1)  

For more information, see the HP Digital Continuous Profiling Infrastructure project home page (http://h30097.www3.hp.com/dcpi).



Last modified: April 8, 2004