2.9 Selecting Random Elements from a List Without Repetition
Credit: Iuri Wickert, Duncan Grisby, Steve Holden, Alex Martelli
2.9.1 Problem
You need to
consume, in random order, the items of a rather long list, and the
most direct approach is painfully slow.
2.9.2 Solution
While it's a common mistake to be overly concerned
with speed, you should not ignore the different performances of
various algorithms. Suppose we must process all of the items in a
long list in random order, without repetition. For example:
import random
# an example of a processing function
def process(datum): print datum
# an example of a set of data to process
data = range(10000)
The simplest version is very slow:
def simple( ):
while data:
# returns element, not index (doesn't help)
elem = random.choice(data)
# needs to search for the element throughout the list!
data.remove(elem)
process(elem)
Here is a faster version:
def faster( ):
while data:
index = random.randrange(len(data))
elem = data[index]
# direct deletion, no search needed
del data[index]
process(elem)
# the same, but preserving the data list
def faster_preserve( ):
aux = range(len(data))
while aux:
posit = random.randrange(len(aux))
index = aux[posit]
elem = data[index]
# alters the auxiliary list only
del aux[posit]
process(elem)
However, the key improvement is to switch to an
O(N) algorithm:
def improved( ):
size = len(data)
while size:
index = random.randrange(size)
elem = data[index]
data[index] = data[size-1]
size = size - 1
process(elem)
Of course, you can also implement a version of this that preserves
the data list.
But the winner is the version that appears to be the simplest:
def best( ):
random.shuffle(data)
for elem in data: process(elem)
# or, if you need to preserve the data list's original ordering:
def best_preserve( ):
aux = list(data)
random.shuffle(aux)
for elem in aux: process(elem)
2.9.3 Discussion
The simplest, most direct way of consuming a list in a random fashion
is painfully slow for lists with a few hundred elements. While it is
tempting to use the simple, clear
choice/remove combination, as
in the simple function, this is a bad choice,
because remove must linearly search through the
list to find the element to delete. In other words, the overall
performance is O(N2), with
a large multiplicative constant.
Fortunately, there are equally simple (or simpler), but much faster,
ways to do this. The faster function, using
randrange/del to generate the
random indexes for the list, can skip the costly search for the
deletion. If it's important to preserve the input
list, you can use a disposable auxiliary list to generate the data
list indexes, as in the faster_preserve function.
However, del anylist[x] for a random x is still
O(N), so overall
performance is still O(N2),
albeit with a much smaller multiplicative constant. Incidentally, the
pseudorandom order in which items are processed is not the same with
the various approaches, even if random is seeded
in the same way. All of the orderings are equally pseudorandom,
though.
Pursuing O(N) performance,
one possibility is not to delete an item from a list at all, but
rather to overwrite it with the last item and decrease at each step
the number of items from which we're choosing. The
improved function takes this tack and benefits
greatly from it, in terms of performance.
The fastest approach, however, is to shuffle the
data (or an auxiliary copy of it) once, at the start, then just
process the items sequentially in the resulting pseudorandom order.
The nice thing about this approach, shown in the
best and best_preserve
functions, is that it's actually the simplest of
all.
On lists of 10,000 items, as shown in this recipe, the overhead
(meaning pure overhead, using a do-nothing processing function) of
simple is about 13 or 14 times more than that of
faster and faster_preserve.
Those functions, in turn, have over twice as much overhead as
improved, best, and
best_preserve. On lists of 100,000 items,
faster and faster_preserve
become about 15 times slower than improved,
best, and best_preserve. The
latter two have, for every list size, about 20%-30% less overhead
than improved—a very minor issue, although
their utter simplicity clearly does make them
deserve their names.
While an improvement of 25%, or even a factor of 2, may be neglected
without substantial cost for the performance of your program as a
whole, the same does not apply to an algorithm that is 10 or more
times as slow as it could be. Such terrible performance is likely to
make that program fragment a bottleneck, all by itself, particularly
when we're talking about
O(N2) versus
O(N) behavior.
2.9.4 See Also
The documentation for the random module in the
Library Reference.
|