2.10 Performing Frequent Membership Tests on a Sequence
Credit: Alex Martelli
2.10.1 Problem
You need to perform frequent tests for
membership in a sequence. The
O(N) behavior of repeated
in operators hurts performance, but you
can't switch to using just a dictionary, as you also
need the sequence's order.
2.10.2 Solution
Say you need to append items to a list only if
they're not already in the list. The simple, naive
solution is excellent but may be slow:
def addUnique1(baseList, otherList):
for item in otherList:
if item not in baseList:
baseList.append(item)
If otherList is large, it may be faster to build
an auxiliary dictionary:
def addUnique2(baseList, otherList):
auxDict = {}
for item in baseList:
auxDict[item] = None
for item in otherList:
if not auxDict.has_key(item):
baseList.append(item)
auxDict[item] = None
For a list on which you must often perform membership tests, it may
be best to wrap the list, together with its auxiliary dictionary,
into a class. You can then define a special _ _contains_
_ method to
speed the in operator. The dictionary must be
carefully maintained to stay in sync with the sequence.
Here's a version that does the syncing just in time,
when a membership test is required and the dictionary is out of sync,
and works with Python 2.1 or later:
from _ _future_ _ import nested_scopes
import UserList
try: list._ _getitem_ _
except: Base = UserList.UserList
else: Base = list
class FunkyList(Base):
def _ _init_ _(self, initlist=None):
Base._ _init_ _(self, initlist)
self._dict_ok = 0
def _ _contains_ _(self, item):
if not self._dict_ok:
self._dict = {}
for item in self:
self._dict[item] = 1
self._dict_ok = 1
return self._dict.has_key(item)
def _wrapMethod(methname):
_method = getattr(Base, methname)
def wrapper(self, *args):
# Reset 'dictionary OK' flag, then delegate
self._dict_ok = 0
return _method(self, *args)
setattr(FunkyList, methname, wrapper)
for meth in 'setitem delitem setslice delslice iadd'.split( ):
_wrapMethod('_ _%s_ _'%meth)
for meth in 'append insert pop remove extend'.split( ):
_wrapMethod(meth)
del _wrapMethod
2.10.3 Discussion
Python's in operator is extremely handy, but
it's O(N)
when applied to an N-item sequence. If a
sequence is subject to frequent in tests, and the
items are hashable, an auxiliary dictionary at the
sequence's side can provide a signficant performance
boost. A membership check (using the in operator)
on a sequence of N items is
O(N); if
M such tests are performed, the overall time is
O(M x
N). Preparing an auxiliary dictionary whose keys
are the sequence's items is also roughly
O(N), but the
M tests are roughly
O(M), so overall we have
roughly
O(N+M).
This is rather less than O(N
x M) and can thus offer a
very substantial performance boost when M and
N are large.
Even better overall performance can often be obtained by permanently
placing the auxiliary dictionary alongside the sequence,
encapsulating both into one object. However, in this case, the
dictionary must be maintained as the sequence is modified, so that it
stays in sync with the actual membership of the sequence.
The FunkyList class in this recipe, for example,
extends list (UserList in
Python 2.1) and delegates every method to it. However, each method
that can modify list membership is wrapped in a closure that resets a
flag asserting that the auxiliary dictionary is in sync. The
in operator calls the _ _contains_
_ method when it is applied to an instance that has such a
method. The _ _contains_ _ method rebuilds the
auxiliary dictionary, unless the flag is set, proving that the
rebuilding is unnecessary.
If our program needs to run only on Python 2.2 and later versions, we
can rewrite the _ _contains_ _ method in a much
better way:
def __contains__(self, item):
if not self.dict_ok:
self._dict = dict(zip(self,self))
self.dict_ok = 1
return item in self._dict
The built-in type dict, new in Python 2.2, lets us
build the auxiliary dictionary faster and more concisely.
Furthermore, the ability to test for membership in a dictionary
directly with the in operator, also new in Python
2.2, has similar advantages in speed, clarity, and conciseness.
Instead of building and installing the wrapping closures for all the
mutating methods of the list into the FunkyList
class with the auxiliary function _wrapMethod, we
could simply write all the needed defs for the
wrapper methods in the body of FunkyList, with the
advantage of extending backward portability to Python versions even
older than 2.1. Indeed, this is how I tackled the problem in the
first version of this recipe that I posted to the online Python
cookbook. However, the current version of the recipe has the
important advantage of minimizing boilerplate (repetitious plumbing
code that is boring and voluminous and thus a likely home for bugs).
Python's advanced abilities for introspection and
dynamic modification give you a choice: you can build method
wrappers, as this recipe does, in a smart and concise way, or you can
choose to use the boilerplate approach anyway, if you
don't mind repetitious code and prefer to avoid what
some would call the "black magic"
of advanced introspection and dynamic modification of class objects.
Performance characteristics depend on the actual pattern of
membership tests versus membership modifications, and some careful
profiling may be required to find the right approach for a given use.
This recipe, however, caters well to a rather common pattern of use,
where sequence-modifying operations tend to happen in bunches,
followed by a period in which no sequence modification is performed,
but several membership tests may be performed.
Rebuilding the dictionary when needed is far simpler than
incrementally maintaining it at each sequence-modifying step.
Incremental maintenance requires careful analysis of what is being
removed and of what is inserted, particularly upon such operations as
slice assignment. If that strategy is desired, the values in the
dictionary should probably be a count of the
number of occurrences of each key's value in the
sequence. A list of the indexes in which the value is present is
another possibility, but that takes even more work to maintain.
Depending on usage patterns, the strategy of incremental maintenance
can be substantially faster or slower.
Of course, all of this is necessary only if the sequence itself is
needed (i.e., if the order of items in the sequence is significant).
Otherwise, keeping just the dictionary is obviously simpler and more
effective. Again, the dictionary can map values to
counts, if you the need the data structure to be,
in mathematical terms, a bag rather than a set.
An important requisite for any of these membership-test optimizations
is that the values in the sequence must be hashable (otherwise, of
course, they cannot be keys in a dictionary). For example, a list of
tuples might be subjected to this recipe's
treatment, but for a list of lists the recipe as it stands is not
applicable. You can sometimes use cPickle.dumps to
create dictionary keys—or, for somewhat different application
needs, the object's id—but
neither workaround is always fully applicable. In the case of
cPickle.dumps, even when it is applicable, the
overhead may negate some or most of the optimization.
2.10.4 See Also
The Library Reference sections on sequences
types and mapping types.
|