It is useful to measure the execution time computer algorithms take, either to compare two or more algorithms for the same task, or to understand how the time for a single algorithm grows as a function of its input parameters. For example, how does the time for a given sorting algorithm grow with the length N of the input sequence: as O(N) (linearly), O(N^2) (quadratically), or O(N log N)? Beyond the asymptotic, or "big-O," formulas (which might already be known analytically anyway), how do two algorithms that both take (say) O(N log N) time compare in terms of actual times? Such measurements are particularly important in designing or selecting the best algorithms and variants of algorithms to include in a standard software library. While it would seem straightforward to write programs to do such timings, one often finds that the simple programs one writes to set up timing experiments do not give very accurate or repeatable results. The purpose of this note is to examine some of the reasons for this state of affairs and to recommend some techniques that overcome the main difficulties.

This note also documents some utility software the author developed to aid both in timing algorithm execution and in counting various kinds of operations the algorithms execute, including the number of data comparisons and assignments and the number of iterator and other "overhead" operations. These utilities are implemented in C++ and are designed to be used with generic algorithms such as those in the C++ Standard Template Library (STL). The utilities and recommended timing techniques are illustrated by timing some of the STL generic sorting algorithms.
While the timing techniques discussed here would also be useful for traditional, nongeneric library routines, the utilities for counting operations take advantage of the fact that generic algorithms have type parameters (in C++ the algorithms are expressed as template functions). Moreover, measuring and understanding the computing-time behavior of generic algorithms requires a somewhat different approach than the methods and execution models used for traditional, nongeneric library routines, so the latter half of the discussion focuses on this issue.
The C standard library provides a function called clock (in header file time.h) that can sometimes be used in a simple way to time computations:
clock_t start, finish;
start = clock();
sort(x.begin(), x.end());    // Call to STL generic sort algorithm
finish = clock();
cout << "Time for sort (seconds): "
     << ((double)(finish - start))/CLOCKS_PER_SEC;
The type clock_t and the constant CLOCKS_PER_SEC are defined in time.h. On a typical Unix system CLOCKS_PER_SEC is 1,000,000, which might lead one to believe that the clock actually ticks every microsecond. Usually the clock granularity is much larger than that, however, such as ticking every 1/100-th of a second. (But each tick increases the number reported by clock by 10,000, making it appear that microseconds are being measured.) While 1/100-th of a second may sound fairly accurate relative to timing normal human activities, in terms of computation it is a grossly long time. Since it is 10,000 microseconds, on a 100-MHz processor up to one million instructions may be executed between ticks. In the above example, if the algorithm is any good, the sequence being sorted may have to be ten thousand or more elements long before a nonzero time is measured.
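If one is unsure of the actual granularity on a given system, it can be probed directly. The following small program is only an illustration (it is not part of the author's utilities): it busy-waits until the value returned by clock changes and reports the size of the jump, which is the true tick size regardless of what CLOCKS_PER_SEC might suggest.

#include <time.h>
#include <iostream>
using namespace std;

int main()
{
    clock_t t0 = clock();
    clock_t t1;
    while ((t1 = clock()) == t0)
        ;   // spin until the next tick
    cout << "clock advanced by " << (t1 - t0)
         << " units, i.e., about " << double(t1 - t0)/CLOCKS_PER_SEC
         << " seconds per tick" << endl;
    return 0;
}

On a system whose clock ticks every 1/100-th of a second, this typically reports a jump of 10,000 units.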
Even if the input sequence is large enough for the sorting algorithm to take a measurable time, the times reported may not be very accurate. Although the clock function is supposed to report only time used by the user's process, and not system time, there is inevitably some fluctuation in times reported when the processor has to attend to substantial other activity. And even on a single-user workstation, modern operating systems are often running a score or more different processes to handle network activity and other, often invisible, actions. These fluctuations severely limit the usefulness of timing experiments, unless care is taken to reduce their effect. (For example, one might see a time reported for sorting a 20,000-element sequence greater than the time reported for a 30,000-element sequence.)
An example of some techniques for accurately measuring times is the program tsort1.cpp. It measures the time taken by three algorithms: two STL generic sorting algorithms, sort (the Quicksort algorithm, using the median-of-3 method of choosing pivot elements for partitioning) and partial_sort (Heapsort), and a generic sorting algorithm called "introsort." (The latter was invented by the author to overcome the main drawback of quicksort, that its worst-case computing time is O(N^2); see Introspective Sorting and Searching Algorithms. Introsort is a modification of the quicksort algorithm that can notice when it is repeatedly producing bad partitions and switch to heapsort in those cases. Introsort is just as fast as quicksort on the average, and it has a worst-case time of O(N log N). The experiments set up in this program provide some evidence for this claim.)
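For readers unfamiliar with the idea, the following is a minimal sketch of introsort, not the code in intsort.h (which is more refined; for example, it finishes small subranges with insertion sort and uses a more economical partitioning scheme). It recurses as quicksort does, with a median-of-3 pivot, but when the recursion depth exceeds a limit proportional to log N it hands the current range to heapsort via partial_sort.

#include <algorithm>
#include <iterator>

template <class Iter>
void introsort_sketch(Iter first, Iter last, int depth_limit)
{
    typedef typename std::iterator_traits<Iter>::value_type T;
    if (last - first <= 1)
        return;
    if (depth_limit == 0) {                     // partitioning has gone badly:
        std::partial_sort(first, last, last);   // finish this range with heapsort
        return;
    }
    Iter mid = first + (last - first)/2;        // median-of-3 pivot selection
    T a = *first, b = *mid, c = *(last - 1);
    T pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));
    // three-way partition into [ < pivot | == pivot | > pivot ]
    Iter cut1 = std::partition(first, last, [&](const T& x) { return x < pivot; });
    Iter cut2 = std::partition(cut1, last, [&](const T& x) { return !(pivot < x); });
    introsort_sketch(first, cut1, depth_limit - 1);
    introsort_sketch(cut2, last, depth_limit - 1);
}

A typical call is introsort_sketch(x.begin(), x.end(), 2*lg), where lg is the floor of log2 of the sequence length, computed, say, with a simple shift loop.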
Before looking at the program, let's see some output from a typical run.
tsort1
All sequence sizes are in multiples of 1000.
Input the smallest and largest sequence sizes: 1 1024
Input random seed control (a positive integer): 33

____| Introsort | Quicksort | Heapsort
Size|   Time    |   Time    |   Time
   1    0.001       0.001       0.002
   2    0.003       0.002       0.004
   4    0.003       0.005       0.010
   8    0.011       0.011       0.028
  16    0.021       0.021       0.069
  32    0.055       0.055       0.164
  64    0.126       0.110       0.452
 128    0.272       0.242       1.007
 256    0.577       0.609       2.864
 512    1.113       1.154       6.015
1024    2.296       2.388      12.799
The program runs each of the three algorithms first on a random sequence of size 1000 and reports the times in seconds. The times as shown here are actually too small to measure with the clock function; they were obtained by running each algorithm 32 times between readings of the clock, and dividing the resulting times by 32. But this was done not just for a single random input; instead the times were recorded for 7 different random inputs and the median of these times is output. (Why the median instead of the mean? The reason is discussed below.)
The sequence size is then doubled and the three algorithms are timed again, with the doubling repeated until it exceeds the upper bound the user specified. This way of generating the sizes fairly economically covers a large range of sizes, and the results can be plotted effectively on a log-log scale, as can be done with Gnuplot from the output shown above.
See Plotting Hints for further detail on how to produce such plots and convert them to images for use on web pages.
The times in this example were obtained by compiling the program with the IBM Visual Age C++ compiler on Windows 95, with optimization turned on, and running the executable on a 133-MHz Pentium PC. See also the apCC/Unix/Sparcstation 5 timing results, which are similar. Aside from compiler and machine dependencies, the actual times for executing generic algorithms also depend heavily on the kind of data and comparison operations being used. To obtain more machine-independent and general results requires counting operations of various kinds, as will be discussed later.
Among the header files included in the program, three
#include "intsort.h" #include "timer.h" #include "recorder0.h"
are for the author's source code for Introsort and for two files containing utilities for recording times. In timer.h a class called timer is defined that builds on the clock function from time.h in a fairly obvious way. Similarly, recorder0.h defines a class called recorder for saving time readings in an STL vector so that a median value of several readings can later be computed. (It is named recorder0.h because a more elaborate recorder class is available in header recorder.h; its explanation comes later.)
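Although the actual contents of timer.h are not reproduced here, the following sketch shows how such a class can be built on clock (the real class may differ in details):

#include <time.h>

class timer {
    clock_t start_ticks;    // reading of clock at restart
    clock_t total_ticks;    // ticks accumulated between restart and stop
public:
    timer() : start_ticks(0), total_ticks(0) {}
    void restart() { total_ticks = 0; start_ticks = clock(); }
    void stop()    { total_ticks += clock() - start_ticks; }
    double time() const { return double(total_ticks)/CLOCKS_PER_SEC; }
};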
The following lines set up several important parameters of the program:
typedef int U;
const int number_of_algorithms = 3;
const int number_of_trials = 7;
Type U is the type of data in the sequence to be sorted; it can be changed to any other type for which assignment (=) and less-than comparison (<) are defined. The next lines say that the experiment will involve timing of three algorithms and that the number of "trials" will be 7. By a trial is meant timing of the algorithm for inputs of a single length; rather than taking a single such measurement we take seven measurements with different randomly generated inputs and compute their median. (The median computation is done in class recorder.) The median is used here rather than the mean because it is less affected by occasional large fluctuations in the individual times.
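In the same spirit as the timer sketch above, here is a sketch of what the recorder class in recorder0.h might look like (not the actual code): it saves each trial's time in an STL vector, and report outputs the median, found with the nth_element generic algorithm, dividing out the repetitions factor discussed below.

#include <algorithm>
#include <vector>
#include <iostream>

class recorder {
    std::vector<double> times;
public:
    void record(const timer& t) { times.push_back(t.time()); }
    void report(std::ostream& o, int repetitions)
    {
        // median of the saved times (assumes at least one time was recorded)
        std::nth_element(times.begin(), times.begin() + times.size()/2,
                         times.end());
        o << times[times.size()/2] / repetitions << " ";
        times.clear();
    }
};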
To specify which algorithms are used in the timing experiment, one writes a small template function like the one in the example program:
template <class Container>
inline void algorithm(int k, Container& x)
{
    switch (k) {
    case 0: introsort(x.begin(), x.end()); break;
    case 1: sort(x.begin(), x.end()); break;
    case 2: partial_sort(x.begin(), x.end(), x.end()); break;
    }
}
Alternatively, the switch statement could be placed inline down in the main program, but it is more convenient to have it appear earlier where it can be more easily read and modified to vary the experiment. Similarly, some of the output headings that may need to be varied are placed here also.
In the main program a point to note is the way the random number generator is controlled. This is based on the structure of the random number generator in the HP implementation of STL, and it is one aspect of the program that needs to be redone in a more portable manner.
One of the most important points to note in the main program is the way that times are taken for short computations by repeating those computations a number of times (controlled by the variable repetitions), so that they take long enough to register nonzero times.
int repetitions = max(32/N1, 1);
For example, if N1 = 1 (meaning a sequence of length 1,000, or more generally 1 * factor), then the time will be computed for executing the algorithm 32 times:
timer1.restart();
for (q = 0; q < repetitions; ++q) {
    x = y;
    algorithm(n, x);
}
timer1.stop();
Note that we do not start and stop the clock for each execution and take the average. Instead we repeat the computation repetitions times and record the total time. (The repetitions factor is divided out when the time is later reported on the output stream.) Note also that we restore the input sequence before each call of the algorithm; this is crucial when we are running "in-place" algorithms as we are here, since these algorithms typically would perform differently on already-sorted sequences than on random ones. The times reported are of course slightly higher than they should be since we are timing not only the sorting algorithm but also the vector assignment, but the error is small. (This can be easily verified by running the program with the algorithm call commented out; the time for the vector assignment for 1,024,000 elements is only about 0.125 seconds.)
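The check just mentioned amounts to timing the same loop with the algorithm call removed, so that only the vector assignment is measured:

timer1.restart();
for (q = 0; q < repetitions; ++q)
    x = y;      // restore the input only; no sorting
timer1.stop();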
In this program, where we are doubling the length of the input sequence in the outermost loop, we also halve the repetitions factor (until it reaches 1), thus avoiding more computation on longer sequences than is needed for measurable times.
Finally, let us look at the overall structure of the inner loops:
for (p = 0; p < number_of_trials; ++p) {
    random_shuffle(x.begin(), x.end());
    y = x;
    for (n = 0; n < number_of_algorithms; ++n) {
        timer1.restart();
        for (q = 0; q < repetitions; ++q) {
            x = y;
            algorithm(n, x);
        }
        timer1.stop();
        stats[n].record(timer1);
    }
}
for (n = 0; n < number_of_algorithms; ++n) {
    stats[n].report(cout, repetitions);
    stats[n].report(ofs, repetitions);
}
In each trial a new random sequence is produced in vector x and saved in vector y. Then for each algorithm, this vector is used as the input and the algorithm's execution (which is repeated repetitions times) is timed. The time is saved by the call to recorder::record. After all algorithms are executed, we proceed to the next trial. After all trials are done, we call recorder::report for each algorithm, which computes the medians of the saved values and reports them on the output streams (both cout and to a file).
Another possible ordering of these loops would be to do all the trials for one algorithm together. But rotating frequently among the algorithms, together with taking the median of several times, helps to avoid any skewing of the results toward (or away from) one algorithm when there are fluctuations in the times reported by the clock function due to extraneous activity (like network connections being made).
To adapt the tsort1.cpp program to test other algorithms may require a few changes in the main program in addition to the obvious ones in the algorithm function and headings string definitions. For example, if we are testing STL's partial_sort algorithm against two competing algorithms, we have to add a parameter to the algorithm function for computing the iterator marking the boundary between sorted and unsorted elements in the result. In the main program we have to declare and assign this variable and add it as an actual parameter to the call of the algorithm function.
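The following sketch suggests how the modified algorithm function might look; the three cases follow the column headings in the output below, and partial_sort1, whose source is not shown here, is the competing variant being tested. The middle parameter is the added iterator.

template <class Container>
inline void algorithm(int k, Container& x,
                      typename Container::iterator middle)
{
    switch (k) {
    case 0: nth_element(x.begin(), middle, x.end());
            sort(x.begin(), middle); break;
    case 1: partial_sort(x.begin(), middle, x.end()); break;
    case 2: partial_sort1(x.begin(), middle, x.end()); break;
    }
}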
tpsort1.cpp is such a variant of tsort1.cpp. In this case the added parameter is computed as half the sequence length, but other experiments would require assigning it in other ways. The result of a run of this code is
All sequence sizes are in multiples of 1000.
Input the smallest and largest sequence sizes: 1 1024
Input random seed control (a positive integer): 19

____|nth_element,sort| partial_sort | partial_sort1
Size|      Time      |     Time     |     Time
   1       0.001          0.003          0.003
   2       0.002          0.005          0.006
   4       0.005          0.011          0.015
   8       0.010          0.025          0.033
  16       0.025          0.055          0.070
  32       0.050          0.130          0.160
  64       0.100          0.300          0.330
 128       0.200          0.670          0.720
 256       0.430          1.530          1.570
 512       0.900          3.470          3.420
1024       1.890          7.780          7.410
(Code compiled with the Apogee Version 3 compiler on SUN Unix, with -O optimization, and executed on a 70-MHz microSPARC II.)
The value of comparing the actual computing times of nongeneric algorithms is limited by their dependence on the hardware characteristics and compiler configuration under which they are obtained. With generic algorithms, actual times may also vary depending on which actual types are substituted for their type parameters. Operation counts are more portable---except for dependence, in some cases, on compiler optimizations---but they may bear little relation to actual times when only certain operations are counted, such as only the comparison operations done by a sorting algorithm. For example, in STL heapsort is present as a special case of the partial_sort generic algorithm; it does fewer comparisons than the STL median-of-3 quicksort version (sort), but its actual computing time is usually significantly higher because it does more assignments and other operations. And since these algorithms are generic, they access most of the operations they do through type parameters, so the relative cost of different kinds of operations can vary when they are specialized with different types.
With algorithms that are generic in the sequence representation, as in STL, the number of iterator operations and their cost relative to comparisons and assignments are also important. As discussed in more detail at the end of this section, in the deque representation in the HP STL implementation most iterator operations cost several times as much as vector or array iterator operations, and this fact is significant in assessing the relative performance of different variants of introsort.
A fourth kind of operation that appears in these algorithms is distance operations, which are arithmetic operations on the integer results of iterator subtractions, such as the division by 2 in the expression (last - first)/2. The cost of a distance operation is mostly independent of the data or iterator types, but it is still useful to know how many such operations are performed. Heapsort, for example, does many more distance operations than quicksort or introsort, which, along with its higher number of iterator operations, accounts for its poorer performance in most cases.
It is a useful side-benefit of generic algorithms that operation counts can be obtained without modifying the algorithm source code at all, by specializing their type parameters with classes whose operations have counters built into them. This is an obvious technique for counting element type comparison and assignment operations, but iterator operations and even distance operations can be counted in the same way, since STL generic algorithms also access them through type parameters. The most useful way of defining a counting class is as an adaptor, which is a generic component that takes another component and outfits it with a new interface, or provides the same interface but with different or additional "behind-the-scenes" functionality. A counting adaptor is an example of adding both to the interface (for initializing and reporting counts) and internal functionality (incrementing the counts).
Counting adaptor classes developed by the author provide for counting all of the most significant kinds of operations used in STL generic algorithms.
These classes use class recorder, defined in recorder.h, for recording and reporting operation counts and times.
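To make the adaptor idea concrete, here is an illustrative sketch (not the actual classes): counted<T> wraps an element type T and counts less-than comparisons and assignments in static counters. Substituting, say, counted<int> for int in an experiment counts these operations without any change to the algorithm source code; iterator and distance adaptors can be written along the same lines.

template <class T>
class counted {
    T value;
public:
    static long comparisons;    // running count of < operations
    static long assignments;    // running count of = operations
    counted() : value() {}
    counted(const T& v) : value(v) {}
    counted& operator=(const counted& x)
        { ++assignments; value = x.value; return *this; }
    friend bool operator<(const counted& x, const counted& y)
        { ++comparisons; return x.value < y.value; }
};
template <class T> long counted<T>::comparisons = 0;
template <class T> long counted<T>::assignments = 0;

(Copy constructions, which sorting algorithms also perform, could be counted similarly.)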
A simple example of the use of these classes is tsort2.cpp, which is analogous to the pure timing example, tsort1.cpp, discussed in the first part of this note. It produces operation counts for the same three generic algorithms---introsort, quicksort, and heapsort---as does the timing example.
Instantiating the counting classes is simplified by using another class called experiment, defined in experiment.h. In tsort3.cpp, the experiment class is used to compare the same three algorithms as tsort2.cpp. With either of these programs, input is similar to that of tsort1.cpp, for example
All sequence sizes are in multiples of 1000.
Input the smallest and largest sequence sizes: 1 2048
Input random seed control (a positive integer): 99
With this input the beginning of the output of tsort3.cpp on the standard output stream is
Size: 1
Trials: 1:123, 2:123, 3:123, 4:123, 5:123, 6:123, 7:123,
Algorithm     Time   Comps  Assmts   Iters   Dists   Total
Introsort    0.025    11.9     9.4    55.3     1.2    77.9
Quicksort    0.025    11.9     9.4    55.8     1.2    78.2
Heapsort     0.104    10.3    15.5   137.1   162.9   325.8

Size: 2
Trials: 1:123, 2:12
. . .
The program also produces the output in essentially the same form, for human reading, in a file called read.dat, and in another form more convenient for use by Gnuplot to generate plots, in a file called graph.dat. To save these files for later use, rename them. For example, read1.dat and graph1.dat are the files produced by the above run of tsort3.cpp.
Note that computing times are recorded and output in addition to the operation counts. These times are higher than the corresponding times reported by tsort1.cpp, since they include the time required to update the counts.