2.1 Introduction
Credit: Tim Peters, PythonLabs
Computer
manufacturers of the 1960s estimated that more than 25 percent of the
running time on their computers was spent on sorting, when all their
customers were taken into account. In fact, there were many
installations in which the task of sorting was responsible for more
than half of the computing time. From these statistics we may
conclude that either (i) there are many important applications of
sorting, or (ii) many people sort when they
shouldn't, or (iii) inefficient sorting algorithms
have been in common use.
—Donald Knuth, "The Art of Computer Programming",
Volume 3, Sorting and Searching, page 3
Professor
Knuth's masterful work on the topics of sorting and
searching spans nearly 800 pages of sophisticated technical text. In
Python practice, we reduce it to two imperatives (we read Knuth so
you don't have to):
When you need to sort, find a way to use the built-in
sort method of Python lists.
When you need to search, find a way to use built-in dictionaries.
Many recipes in this chapter illustrate these principles. The most
common theme is using the
decorate-sort-undecorate (DSU) pattern, a
general approach to transforming a sorting problem by creating an
auxiliary that we can then sort with the default, speedy
sort method. This is the single most useful
technique to take from this chapter. It relies on an unusual feature
of Python's built-in comparisons: sequences are
compared lexicographically. Lexicographical order is a generalization
to tuples and lists of the everyday rules used to compare strings
(i.e., alphabetical order). The built-in cmp(s1,
s2), when s1 and
s2 are sequences, is equivalent to this Python
code:
def lexcmp(s1, s2):
# Find leftmost nonequal pair
i = 0
while i < len(s1) and i < len(s2):
outcome = cmp(s1[i], s2[i])
if outcome:
return outcome
i += 1
# All equal, until at least one sequence was exhausted
return cmp(len(s1), len(s2))
This code looks for the first nonequal corresponding elements. If
such a nonequal pair is found, that pair determines the outcome.
Otherwise, if one sequence is a proper prefix of the other, the
prefix is considered to be the smaller sequence. Finally, if these
cases don't apply, the sequences are identical and
are considered equal. Here are some examples:
>>> cmp((1, 2, 3), (1, 2, 3)) # identical
0
>>> cmp((1, 2, 3), (1, 2)) # first larger because second is a prefix
1
>>> cmp((1, 100), (2, 1)) # first smaller because 1<2
-1
>>> cmp((1, 2), (1, 3)) # first smaller because 1==1, then 2<3
-1
An immediate consequence of lexicographical comparisons is that if
you want to sort a list of objects by a primary key, breaking ties by
comparing a secondary key, you can simply build a list of tuples, in
which each tuple contains the primary key, secondary key, and
original object, in that order. Because tuples are compared
lexicographically, this automatically does the right thing. When
comparing tuples, the primary keys are compared first, and if (and
only if) the primary keys are equal, the secondary keys are compared.
The examples of the DSU pattern in this chapter show many
applications of this idea; perhaps the cutest is Recipe 2.4, which ensures a stable sort by using each
object's list index as a secondary key. Of course,
the DSU technique applies to any number of keys. You can add as many
keys to the tuples as you like, in the order in which you want the
keys compared.
A definitive generalization is provided by Recipe 2.8, which provides a general routine to sort
lists of objects by user-specified index position or attribute name.
This is fine code, and you are free to use it for your own purposes.
It suffers from a problem common to many frameworks in Python,
though: once you get the hang of it, using the DSU pattern is so easy
that you'll find it's quicker to do
it from scratch than to remember how to use the framework. I
haven't yet decided whether this is a strength or
weakness of Python, but it's definitely a real
phenomenon.
2.1.1 Searching and Sorting FAQ
To help you further
understand searching and sorting in Python, I thought
I'd answer some frequently asked questions about the
topic:
- What algorithm does Python's list.sort use?
-
In early releases of Python, list.sort used the
qsort
routine from the underlying platform's C library.
This didn't work out for several reasons, but
primarily because the quality of qsort varied
widely across machines. Some versions were extremely slow when given
a list with many equal values or in reverse-sorted order. Some even
dumped core because they weren't reentrant. A
user-defined _ _cmp_ _ function can also invoke
list.sort, so that one
list.sort can invoke others as a side effect of
comparing. Some platform qsort routines
couldn't handle that. A user-defined _
_cmp_ _ function can also (if it's insane
or malicious) mutate the list while it's being
sorted, and many platform qsort routines dumped
core when that happened.
Python then grew its own implementation of the
quicksort algorithm. This was
rewritten with every release, as real-life cases of unacceptable
slowness were discovered. Quicksort is a delicate algorithm!
In Python 1.5.2 the quicksort algorithm was replaced by a hybrid of
samplesort and
binary
insertion sort, and that hasn't changed again to
date. Samplesort can be viewed as a variant of quicksort that uses a
very large sample size to pick the partitioning element, also known
as the pivot (it recursively samplesorts a large random subset of the
elements and picks the median of those). This variant makes
quadratic-time behavior almost impossible and brings the number of
comparisons in the average case much closer to the theoretical
minimum.
However, because samplesort is a complicated algorithm, it has too
much administrative overhead for small lists. Therefore, small lists
(and small slices resulting from samplesort partitioning) are handled
by a separate binary insertion sort. This is an ordinary insertion
sort, except that it uses binary search to determine where each new
element belongs. Most sorting texts say this isn't
worth the bother, but that's because most texts
assume that comparing two elements is as cheap as or cheaper than
swapping them in memory. This isn't true for
Python's sort! Moving an object is very cheap, since
what is copied is just a reference to the object. Comparing two
objects is expensive, though, because all of the object-oriented
machinery for finding the appropriate code to compare two objects and
for coercion gets reinvoked each time. This makes binary search a
major win for Python's sort.
On top of this hybrid approach, a few common special cases are
exploited for speed. First, already-sorted or reverse-sorted lists
are detected and handled in linear time. For some applications, these
kinds of lists are very common. Second, if an array is mostly sorted,
with just a few out-of-place elements at the end, the binary
insertion sort handles the whole job. This is much faster than
letting samplesort have at it and is common in applications that
repeatedly sort a list, append a few new elements, then sort it
again. Finally, special code in the samplesort looks for stretches of
equal elements, so that the slice they occupy can be marked as done
early.
In the end, all of this yields an in-place sort with excellent
performance in all known real cases and supernaturally good
performance in common special cases. It spans about 500 lines of
complicated C code, which gives special poignancy to Recipe 2.12.
- Is Python's sort stable?
-
No, samplesort is not stable, and it cannot be made stable
efficiently. See Recipe 2.4 if you need
stability.
- But I've tried many examples, and they're all stable!
-
You tried small examples. The binary insertion sort is stable, and,
as I explained earlier, samplesort doesn't kick in
until the list gets larger.
- How large?
-
It's an implementation detail you
can't rely on across releases. Today, the answer is
100 items.
- But Recipe 2.12 shows a stable sort. Why not use that?
-
It's a cute example, but it does twice as many
comparisons as a real quicksort. As I explained earlier, the cost of
comparisons dominates sorting time, so it would take at least twice
as long as Python's sort even if it was coded in C.
Even if it didn't take twice as long, samplesort is
quicker than quicksort.
- Mergesort does few comparisons and can be made stable easily. Why doesn't Python use that?
-
Mergesort requires
extra memory, and it does much more data movement than
Python's current sort. Despite the fact that
comparison time dominates, samplesort does few enough comparisons
that the extra data movement done by mergesort is significant. I
implemented three mergesorts while investigating quicksort
alternatives for Python 1.5.2, and they were all significantly slower
than the approach Python uses.
- Why is passing a comparison function so much slower than using the DSU pattern?
-
In search performance,
comparison time dominates, and an explicit comparison function
written in Python adds the substantial overhead of a Python-level
function call to each comparison. The built-in comparisons are all
coded in C, so they don't incur the overhead of
Python-level function calls.
- So I should never pass an explicit comparison function, right?
-
Speed isn't always everything. If you can afford the
speed hit, and you find it more convenient and clearer in a given
case to pass a comparison function, go ahead. I do.
- Why does Python use the three-outcome cmp for sorting? Why doesn't it use a simple less-than comparison instead?
-
This is a historical consequence of Python initially using the C
qsort function, since qsort
requires a three-outcome comparison function. Like many people, I
wish it used a simple less-than comparison, but it's
too late now. When rich comparisons were introduced,
Python's list.sort was reworked a
little, so that although it uses a three-outcome comparison function
at the user level, internally it also works fine with objects that
implement only _ _lt_ _ comparison.
- What's the best way to sort a list in reverse order?
-
Reversing the sense of the
comparison works fine:
list.sort(lambda a, b: cmp(b, a))
Here's another technique that is faster and more
obvious but that is often avoided by those who mistakenly believe
that writing two lines of code where one might do is somehow sinful:
list.sort( )
list.reverse( )
- What kind of hash table does Python's dictionary type use?
-
The dictionary type uses contiguous
tables with power-of-two sizes. Collisions are handled within the
table, and the table is kept at most two-thirds full, so that
collision handling is very unlikely to degenerate. When a table gets
too full, the table size is doubled and all the elements are
reinserted. For more detail, read the source code in
dictobject.c and
dictobject.h. As of Python 2.2, the source code
is extensively commented.
- I've heard that Python's dictionary implementation is very fast. Is that true?
-
Yes. Because Python uses dictionaries to implement module and object
namespaces, dictionary performance is crucial. For the past decade,
smart people in the Python community have helped optimize
it—the result is open source at its best.
|