Automatic computation of WCET

Computing accurate WCET on modern complex architectures is a challenging task. This problem has been devoted a lot of attention in the last decade but there are still some open issues. First, the control flow graph (CFG) of a binary program is needed to compute the WCET and this CFG is built using some internal knowledge of the compiler that generated the binary code; moreover once constructed the CFG has to be manually annotated with loop bounds.

Second, the algorithms to compute the WCET (combining Abstract
Interpretation and Integer Linear Programming) are tailored for
specific architectures: changing the architecture (*e.g.,* replacing an
ARM7 by an ARM9) requires the design of a new ad hoc algorithm.
Third, the tightness of the computed results (obtained using the
available tools) are not compared to actual execution times measured
on the real hardware.

In this project we address the above mentioned problems. We first describe a fully automatic method to compute a CFG based solely on the binary program to analyse. Second, we describe the model of the hardware as a product of timed automata, and this model is independent from the program description. The model of a program running on a hardware is obtained by synchronizing (the automaton of) the program with the (timed automata) model of the hardware. Computing the WCET is reduced to a reachability problem on the synchronised model and solved using the model-checker Uppaal.

Our approach is summarised in the figure above.

The architecture of our tool is given in Fig. 1.

The starting point is a binary program (that can be obtained with
`objdump` using the CodeSourcery suite). This file is
transformed into a Uppaal timed automaton by performing the following
steps:

- generate the CFG of the program (Compute CFG)
- compute a reduced program (Compute WCET-equiv)
- generate the Uppaal description of the reduced program.

In the result file we embed models of the pipeline and the caches of the ARM920T. This produces a ready-to-analyse Uppaal file that generates the runs of the program on an ARM920T.

The WCET is computed by computing the largest value a clock (never reset in the program) can take in a final state. This amounts to checking a reachability property.

We have implemented the construction of the CFG (Compute CFG) and the computation of the WCET-equivalent program (Compute WCET-equiv). Together with a parser of ARM binary programs this makes a few thousand C++ lines of code. Our tool produces a bundle of files: a ready-to-analyse file containing the UPPAAL timed automata models of the program a binary P and the hardware models (the layout of the CFG is produced using dot), a dot file with the graph of the program and a ready-to-compile C++ file that contains a simulator of the program. This last file can be compiled and used to compute useful information like the ranges of registers. Notice that during the first phase ComputeCFG we compute the range of the stack pointer and thus the tool can also be used as a stack analyser.

The development board we model and use is an Armadeus APF9328 board which bears a 200MHz Freescale MC9328MXL micro-controller with an ARM920T processor. The processor embeds an ARM9TDMI core that implements the ARM v4T architecture. An overview of the architecture is given in Fig. 2. Gray arrows are address buses/connections. White arrows are data buses/connections. Both are 32 bits wide. Coprocessor 15 hosts control registers for the caches and the MMUs. Actually Register R13 (which should not be confused with the ARM9TDMI r13 register) is not duplicated and is located in Coprocessor 15. It hosts a process ID used for virtual address to physical address translation. Some blocks like the Write back Physical TAG RAM and various debug and/or coprocessor interfaces are not shown.

Original C sources files are available from the Mälardalen WCET Research Group.
We provide the files (first column of Table 1)
as we have slightly modified some of them: the WCET for most of the files
is very small, and measurement errors are relatively large; this is why we have
changed (when necessary and possible) the number of iterations in the C program.
For example, the original C program for *fib* computes the Fibonacci
number *30* and in our example we have computed *fib(300)* so that
the WCET is larger and measurement errors are less than 1%.
The program *P* to analyse is wrapped in a template function
*main*: an example for program *fib(300)* is given in
Listing 1.

Given *P*, we let *t*(*P*) be the wrapped program. Measuring the
execution-time of *P* consists in (1) reading a *hardware* timer
(`timerGetValue`) into a *start* variable, (2) calling the
program *P*, and (3) reading the timer again in a *stop* variable and
(4) printing^{1} the difference
*stop* − *start*. The function `timerGetValue` (assembly code)
has been designed to read a hardware timer (See next paragraph).
The measurement error is plus or minus 12 processor
cycles. The program *t*(*P*) is compiled and linked. Running it on the ARM9
will print out the number of cycles taken by the program *P*: this
figure is given in column “Measured WCET” in
Table 1.

To faithfully compute the WCET of *P* using our method, we take as
input of our tool chain *t*(*P*). *t*(*P*) is transformed (using Compute
CFG and Slice) into a Uppaal automaton. In this automaton a dedicated clock
`GBL_CLK` is reset when the instruction^{2} of *t*(*P*) that
reads the hardware timer flows out of the M stage (reading the timer
in function `timerGetValue` is done using a load instruction).
The final state of the automaton is reached when the second occurrence
of the instruction that reads the timer flows out of the W stage. The
computed WCET is given in column “Computed WCET” in
Table 1. Column “Uppaal”
in Table 1
gives the time Uppaal takes to check the property
`sup {WriteBackStage.DONE} : GBL\_CLK`

clock that gives the
maximal value of a clock in a designated set of states. To do this we
used a soon-to-be-released version of UPPAAL (64-bit, Academic UPPAAL 4.1.12 (rev. 5150), November 2012).
With older versions of UPPAAL, files with large arrays take a very
long time to parse due to some non-optimized library code and the verification times
were badly impacted by this phase.

UPPAAL models for the programs
in Table 1 are available **here**.
The archive contains the models and instructions how to
compute the WCET with UPPAAL (verifyta).
Another archive **sources** contains the C source, compiled ARM binary
and the CFG for each analysed program:
the .arm file is the assembly code we analyse and is obtained using `objdump` on
the compiled C source,
the .dot file gives some graphical representation of the CFG of the
program (you should install GraphViz) to display it.

Note 1
The de-assembled files *file.arm* have been stripped of non useful code.
Thus some functions (`printbin, readstate, timerInit, TimerSetPrescaler`)
are ignored and replaced by a *return* statement (`bx lr`).
This does not impact the WCET measure as these functions operate outside
the scope of the measure (either before we start the timer or after
we record the time elapsed.)
The caches are *ON* (i.e. active) only after function `init_mmu` has completed
in the binary program.
Moreover, the last instruction in the program is the one right after the second call of `timerGetValue`.
The signal `prog_completed!` is sent in the UPPAAL file at this point.
The files also contain an odd instruction: `mov lr, #3` followed by `bx lr`.
This branching corresponds to a return to the caller of `main` and is only used in the UPPAAL models.

Note 2
In the UPPAAL models, the variables that describe the state of
the caches (instruction and data) are declared as `meta` variables.
This reduces the encoding needed for a state and is correct because it is never the case
that two states are equal (even without the cache description). Hence the whole state space
is explored even with this reduced state encoding.

Results in Table 1 are obtained with the latest hardware models and UPPAAL (64-bit, Academic UPPAAL 4.1.12 (rev. 5150), November 2012).

Note 3 The caches are 64-way associative caches i.e., 8 sets of 64 lines
(the tech report 2001 erroneously mentioned 8-way associative caches).

Note 4 Column *Measured WCET (M)* is the most frequently obtained
execution-time out of 10 runs of the programs on our testbed. Slight differences in
the measures may appear because the main memory is a *dynamic
RAM* (DRAM) and is refreshed every few mseconds. During a refreshing phase, the main memory is unavailable and this incurs some
extra delay in the execution-time which accounts for the differences between
two measures.

Program

loc ^{†}

UPPAAL Time/States Explored ^{¶}

Computed BCET/WCET (C)

Measured WCET (M)

( C−M)/M× 100Abs ^{§}fib-O074 1s/74181 8098/8098 8064 0.42%47/131 fib-O174 0.5s/22333 2597/2597 2544 2.0%18/72 fib-O274 0.2s/9711 1209/1209 1164 3.8%22/71 janne-complex-O0^{∗}65 0.8s/38014 4264/4264 4164 2.4%78/173 janne-complex-O1^{∗}65 0.4s/14600 1715/1715 1680 2.0%30/89 janne-complex-O2^{∗}65 0.3s/13004 1557/1557 1536 1.3%32/78 fdct-O1238 1.5s/60418 4245 4092 3.7%100/363 fdct-O2^{}238 22s/55286 19231 18984 1.3%166/3543 Single-Path Programs^{‡}with MUL/MLA/SMULL instructions (instructions durations depend on data)fdct-O0238 2.5s/85008 11242/11800 11448 3.0%253/831 matmult-O0^{∗}162 190s/10531230 502849/529250 [511584,528684] 0.1%158/314 matmult-O1^{∗}162 20s/1122527 129967/156367 [127356,153000] 2.2%71/172 matmult-O2^{∗}162 70s/4450414 122045/148299 [116844,140664] 5.4%75/288 jfdcint-O0374 2.5s/100861 12726/12918 12588 2.6%159/792 jfdcint-O1374 1s/35419 4880/5072 4668 8.6%25/325 jfdcint-O2^{}374 9s/70077 16498/16690 16380 1.8%56/2512 Multiple-Path Programsbs-O0174 24s/1421274 478/1068 1056 1.1%75/151 bs-O1174 19s/1214673 321/738 720 2.5%28/82 bs-O2174 10s/655870 273/628 600 4.6%28/65 cnt-O0^{∗}115 2s/77002 9025/9027 8836 2.1%99/235 cnt-O1^{∗}115 1s/27146 4123/4123 3996 3.1%42/129 cnt-O2^{∗}115 1s/11490 3067/3067 2928 4.6%39/263 insertsort-O0^{∗}91 370s/24250738 865/3133 3108 0.8%79/175 insertsort-O1^{∗}91 220s/11455296 708/1533 1500 2.2%40/115 insertsort-O2^{∗}91 21s/387292 753/1326 1344 0.4%43/108 ns-O0^{∗}497 48s/3064316 940/30968 30732 0.8%132/215 ns-O1^{∗}497 6s/368720 605/11701 11568 1.1%61/124 ns-O2^{∗}497 18s/1030746 441/7280 7236 0.6%566/863

For the *Single-Path* programs the results show that our hardware formal model (and sliced program)
contain enough details to capture the caches and pipeline features.

The second section *Single-Path with MUL/SMULL* requires some more explainations:
when we compute the WCET using the UPPAAL models, we use an interval for durations of the instructions
MUL/MLA and SMULL. Indeed the time it takes to perform one of this instruction depends on
the data in the operands. For `MUL/MLA` the execution-time in the *execution* stage of the pipeline is
within [3,6] cycles and for `SMULL` within [4,7].

The computed WCET is thus interval. Notice that there is no need for assuming that no *timing anomalies* occur
as we do not assume that the duration of `MUL/MLA` or `SMULL` is the upper bound of the interval.
The exploration of the state-space encompasses all the cases for all the possible durations of `MUL/MLA`
and `SMULL` within their respective intervals.
For the `matmult` program, we have measured the execution-time for two different input data: these data
were intended to yield respectively the shortest and largest execution-time.
Of course, we cannot guarantee that they actually yield the WCET. For instance for the *O2* version, the binary program
is optimized and it is rather difficult to guess which input data produces the WCET.

The last section *Multiple-Path* programs contains results for programs with data dependent
decisions during the course of execution. The computed WCET takes into account all the possible choices.
For measurements, we have again tried to choose the input that produces the WCET (for some of the programs
this is difficult to claim that this choice is correct).

Overall, the results are very tight even on large execution-times (`matmult-O0`).

The UPPAAL models we provide can also be used for model-checking functional
(non quantitative) properties
as well. It suffices to synchronize the `prog` template with
a simple loop automaton that accepts/generates the `fetch` action.
The examples zip files below contains this loop automaton (`alwaysFetch`)
together with some properties that can be checked.
It is for instance possible to check that the range of accessed memory cells
is within a particular set, or to compute the max size of the stack, check for deadcode, etc.
In the example file below, we compute the shortest path and longest path
in the program `insertsort-O2` using a clock: two steps
of the program are separated by one time unit, hence the sup and inf
values of the clock gives the longest and shortest paths
in the program.
We can also check that "loops" (or nodes that look like loop boundaries)
can be iterated more than M times and hence compute loop bounds which is required
for some tools to compute WCET. To compute loop bounds we can do a binary
search and check the property `A[] N ≤ K ` where N is a counter
that is incremented every time a "loop" node is entered and K the expected bound.
Notice that our computation of WCET does not rely on these loop
bounds as the slice and variables needed to keep track of the control
flow are included (and discovered) in the slicing phase.

The UPPAAL models of some programs we used for model checking properties are downloadable using the following link:

We provided a few examples to illustrate how we can model-check binary
programs with UPPAAL but adding the `alwaysFetch` automaton
to each file of Table 1 is straighforward and the method is applicable
to the entire set of benchmarks of this table.

The UPPAAL models obtained after the slicing phase can also be used to simulate
traces of a binary program.
This is a very nice feature as it is possible to use the statistical model-checking
(see related references for details)
approach to obtain more insights into the *distribution* of the execution times.
It is also a good alternative to the computation of the WCET using the exhaustive
approach in case it is impractical.
In the sequel we give the transformed models of some programs of Table~1 that are
ready for statistical model-checking.
They slightly differ
from the models of Table~1 as priorities and one-to-one communication can not be used
in the current version of UPPAAL statistical model-checking.
We have transformed the models of Table~1 obtained after slicing into
models that use only 1) broadcast channels and 2) no priority.
This way we can use the statistical model-checking feature in UPPAAL
to obtain the probability distribution of the execution times.

The methodology we have used is the following:

- determine in the CFG the nodes at which non-determinitic choices can occur due to unknown inout data;
- assign a weight to each choice: the probability of firing an edge from the node is then the weight of the edge divided by the summation of the weights of the outgoing edges.

Notice that there are no probabilities on timing (as the system is time deterministic once a path is taken).

The ns-Ox (x=0,1,2) programs all look for a particular value in a MAX_NUM=5*5*5*5 matrix. When the value is found, the program terminates. We have added weights to the edges (in the binary program) where termination can happen as these are the only non-determistic choices in the program.

For example in ns-O0, there is only one such edge, `0x3350 (#13136)`.
We want to encode the following facts regarding the input data for ns-O0:

- the key searched for is in the matrix with probabalility p in [0,1]
- when the key is in the matrix, the probability of finding it after looking up n elements is p/n.

We can encode this using the UPPAAL weights by making them dependent on the number elements already looked up in the matrix. The distribution of the execution times for ns-O0 with p=1 i.e.the key is always found (with default statistical parameters values in UPPAAL) is given by Plot 1.

For this program, according to Table 1 above, the (computed) WCET is 30968.
The statistical property checked with UPPAAL is `Pr[GBL_CLK<=35000](<> TheProg.DONE)`
where `TheProg.DONE` is true only when the last instruction of the program
flows out of the last stage of the pipeline.

Now if we set p=4/5 (key is found with probability 4/5) the distribution of execution times is given by Plot 2:

Accordingly the distribution shows that 1/5 of the 732 simulated runs actually have a duration close to the WCET (as the key is not found and all elements in the matrix are scanned) and the remaining runs durations (whent the key is found) are evenly distributed from the BCET to the WCET. We used UPPAAL 4.1.11 (rev5085) on a MacBook Air 11". 732 runs are simulated and the distribution of execution times is (as expected) evenly distributed from the BCET to the WCET.

Note 5
Precision can be improved with respect to various parameters e.g.
simulating more runs. The "even" distribution is in this case more even
as examplified by Plot 3 below for ns-O1.

In Plot 3, we ran UPPAAL with the probability to find the key set to one and
different settings for statistical parameters: width of *buckets*, probability
of *false negatives/positives* in order to get a higher confidence level.
Basically the blue bars chunk the x-axis into buckets of width 500
and the other bars (red and green) have width 200.
As can be seen the distribution of the execution times is rather even.

To reproduce the results, open UPPAAL 4.1.11 and open one
of the example files listed below. This should load the NTA model together with
the properties to check.
Check the property that looks like `Pr[GBL_CLK<=Something](<> TheProg.DONE)`).
When it terminates, open the Plot Composer (Tool Tab) to display the
simulation results.
It is possible to play with this probability `p` to see the impact on the
distribution of the execution times: it is defined as a constant in the template `prog`
inthe UPPAAL models.

The `insertsort` performs a insert sort.
After each insertion, the list looks like (e1,e2,...ek,l,l1,l2,...lj)
where (e1,e2,ek) is sorted and (l,l1,l2,lj) has to be inserted in
the sorted list.
In the version compiled with option `-O2` we have added
(dynamic) weights to the two non-deterministic choices in the program: 1) when a new element
l to insert is compared with the maximum ek of the sorted sub-list and 2) when the new element
l to insert is compared with other elements ek-1, ... e2, e1 of the sorted sub-list.
We can then encode properties on the input data like: the input data (a list)
is almost sorted by setting the propability that the element l to insert
is larger than or equal ei to k-i.
Results with different assumptions on the input data are given in Plot~4.

It is also easy to obtain a comparison between the shape of the distribution of program steps and program execution times.

We can compute the probability `Pr[#<=Something](<> TheProg.DONE)`) on the NTA
model using the `alwaysFetch` template and display the plot.
The result is given on Plot 5.
As can be seen on Plot 5, the distribution of the program steps and
program execution times have the same shape.

The binary sort analysis using statistical model-checking is given in Plot 6. We set probability to find the key in the set of 800 ordered values and annotate the UPPAAL models to reflect this. If the property to find the key in the set of 800 values is p=a/b, then a good approximtion for the weights of the UPPAAL transitions to find the key at iteration k in the binary search is a and not to find the key is 800/(2^k)*b-a. Plot 6 shows the statistics for the a/b=1/1,1/2 and 1/100.

The UPPAAL models of the programs we used for the statistical model checking computation are downloadable using the following links:

To reproduce the results, open UPPAAL 4.1.11 and open one
of the example files. This should load the NTA model together with
the properties to check.
Check the property that looks like `Pr[GBL_CLK<=Something](<> TheProg.DONE)`).
When it terminates, open the Plot Composer (Tool Tab) to display the
simulation results.