The C++ Standard Library specification of generic algorithms
places requirements not only on the syntax (interfaces) and semantics
(input/output relations) of the algorithms, but also on the computing
time the algorithms are allowed to take. Most of the fundamental data
processing algorithms specified have O(N) (linear time) or
O(N logN) bounds, but in a few cases
O(N2) (quadratic) worst case time behavior is
permitted. Such bounds were specified wherever there were algorithms
known, such as quicksort, that were significantly faster than other
algorithms in the average case, and the cases that could cause
quadratic behavior were rarely encountered in practice. It would be
much more satisfactory, however, if these bounds could be tightened,
disallowing quadratic behavior for these generic algorithms entirely.
The following papers show how this goal can be achieved.
- David R. Musser, Introspective Sorting and Selection
Algorithms (draft
April 1996, revised December 1996, final version published in
Software--Practice and Experience, (8): 983-993 (1997)),
presents a new, hybrid sorting algorithm, introsort (for
introspective sort), that behaves almost exactly like
median-of-3 quicksort for most inputs (and is just as fast) but which
is capable of detecting when partitioning is tending toward quadratic
behavior. By switching to heapsort in those situations, introsort
achieves the same O(N logN) time bound as heapsort
but is almost always faster than just using heapsort in the first
place. Detailed performance measurements reported in the paper were
conducted with the aid of the new counting components discussed
in a later section.
- The same paper describes a new selection algorithm,
introselect (for selecting the k-th largest element of
a sequence) based on an introspective approach similar to that of
introsort. Its worst case computing time is O(N) instead of
the O(N2) bound of Hoare's find algorithm.
- David R. Musser and Gor V. Nishanov, A Fast Generic
Sequence Matching
Algorithm
(December 1997; also available, without appendices, as TR
97-11), describes a string matching--and
more generally, sequence matching--algorithm that has a linear
worst-case computing time bound, a low worst-case bound on the number
of comparisons (2N), and sublinear average-case behavior that is
better than that of the fastest versions of the Boyer-Moore
algorithm. The algorithm retains its efficiency advantages in a wide
variety of sequence matching problems of practical interest, including
traditional string matching; large-alphabet problems (as in Unicode
strings); and small-alphabet, long-pattern problems (as in DNA
searches). Since it is expressed as a generic algorithm for searching
in sequences over an arbitrary type T, it is well suited for
use in generic software libraries such as the C++ Standard Template
Library. The algorithm was obtained by adding to the
Knuth-Morris-Pratt algorithm one of the pattern-shifting techniques
from the Boyer-Moore algorithm, with provision for use of hashing in
this technique. In situations in which a hash function or random
access to the sequences is not available, the algorithm falls back to
an optimized version of the Knuth-Morris-Pratt algorithm. The source
files for the algorithms, test programs, and test data are
available.
Based on these results, quadratic computing time behavior could now
be banished from the STL components! Unfortunately it is too late to
change the C++ Standard, but perhaps compiler and library vendors will
adopt the new algorithms voluntarily. Silicon Graphics has already
adopted introsort instead of quicksort for the implementation
of sort
in SGI STL,
and it has propagated from there to GNU C++ and to
Boris Fomitchev's
STLport
(which makes SGI STL available on many different platforms).
|