I l@ve RuBoard |
17.9 Looking Up Words by Sound SimilarityCredit: Greg Jorgensen, Scott David Daniels 17.9.1 ProblemYou need to look up words (most often people's surnames) by sound, rather than by spelling, so that likely spelling mistakes don't spoil the search. 17.9.2 SolutionThe Soundex algorithm (by Odell and Russell, made popular by Knuth) transforms each surname into a signature that is more representative of how that surname is likely to sound when pronounced than of how it's spelled: def soundex(name, len=4):
""" soundex module conforming to Odell-Russell algorithm """
# digits holds the soundex values for the alphabet
soundex_digits = '01230120022455012623010202'
sndx = ''
fc = ''
# Translate letters in name to soundex digits
for c in name.upper( ):
if c.isalpha( ):
if not fc: fc = c # Remember first letter
d = soundex_digits[ord(c)-ord('A')]
# Duplicate consecutive soundex digits are skipped
if not sndx or (d != sndx[-1]):
sndx += d
# Replace first digit with first letter
sndx = fc + sndx[1:]
# Remove all 0s from the soundex code
sndx = sndx.replace('0', '')
# Return soundex code truncated or 0-padded to len characters
return (sndx + (len * '0'))[:len]
17.9.3 DiscussionThe common approach to avoiding confusion when a name's spelling induces lookup errors is the Soundex algorithm, by Odell and Russell, as reported by Knuth. The algorithm is designed for English-language surnames. If you have a significant number of non-English surnames, you might want to alter the values in digits to improve your matches. For example, to accommodate a large number of Spanish surnames, you might count "J" and "L" ("L" because of how "ll" is used) as vowels, setting their positions in digits to 0. The basic assumptions of Soundex are that the consonants are more important than the vowels, and they are placed in groups of letters that can be confused with each other. Coming up with a set of such groups for a language is not horribly tough if you know that language's typical pronunciation issues. Just remember that each group should contain all letters that can be confused with any of those in the group. For example, a slightly better code for both English and Spanish names has the digits "01230120002055012623010202". In languages such as Italian, which has strong and very distinct vowels, the basic assumptions of Soundex break down. There, vowels should probably play a contrary role, that of anchors that cannot be confused with each other. However, Italian phonetics teaches us that this is true to a varying degree, depending in part on where the phonic accent falls in the surname—semivowels in destressed syllables are not good anchors—and these complications are somewhat difficult to handle in a simple-minded, speedy algorithm. 17.9.4 See AlsoSoundex is described in Donald Knuth's The Art of Computer Programming (Addison-Wesley), which is discussed at http://www-cs-staff.stanford.edu/~knuth/taocp.html. |
I l@ve RuBoard |