Sunday
Mar042012

Using a Trie in Python

The trie is one of my favorite data structures - it's easy to implement, understand and has a lot of practcal applications where its usage is favorable over other traditional dictionary implementations, particularly a hashtable.

Why Use Tries?

  1. Tries, by nature, natually compress their key prefixes as part of their insertion.  This means that if you have a lot of keys that are closely 'clustered' by prefix, the the total space used is less than the sum of the size of all the keys.  (For example, if you have the keys 'prefix', 'preamble', 'prenatal', and 'precursor' in a trie, you don't use 31 characters to store keys.  Instead, your data set would look be stored as 'pre' + {'fix', 'amble', 'natal', 'cursor'}, which is a total of 22 characters, a reduction of 9 characters.)
  2. Tries don't have collisions.  Period.  This means that while hashtables win the best-case lookup time with their sexy O(1), they still lose when you have a chain of unfortunate collisions and your lookup time becomes the worst-case O(n).  Tries always have a lookup time of O(login), where i is the size of the alphabet.
  3. Tries have true dynamic allocation and only grow to fit the exact size of the data stored in them.  Most modern hastable implemtentations have dynamic allocation, growing to fit the amount of keys being inserted, but they have a lot of open space allocated just to avoid collisions.  Worse, when you hit a certain load factor, you need to do a full memory copy of the current hashtable of the old size to grow into a new, larger size.  Tries don't have this inefficiency at all.

When Not To use Tries?

If they're so amazing, why don't we use them all the time?  First, tries are optimal for clustered data and usually take up more room for disperate data.  Secondly, they still underperform hashtables on the best case of performance and, let's face it, that's statistically often in hashtables.

The main point is this: know the strengths and weaknesses of the data structures you're using and apply them judiciously.  Like sorting, there is never a total panacea.

Why Use Tries in Python?

Not only is Python a kick-ass sexy beast of a language, but it has one major feature that makes using tries particularly attractive: duck typing.  Python provides the operators that allow you to treat a trie like any other dictionary structure and you, the programmer, nominally shouldn't care one bit about the underlying implementation.

For example:

>>> import trie
>>> t = trie.Trie()
>>> d = {}
>>> t['foo'] = 1
>>> d['foo'] = 1
>>> t['food'] = 'Chicken'
>>> d['food'] = 'Chicken'

But if you're providing the object to be consumed, you might care a little bit when you see the memory usage:

>>> import sys
>>> sys.getsizeof(t)
72
>>> sys.getsizeof(d)
280

Essentially, the key point is that if you know there is a data structure that is more optimal for the data that is being represented and you can abstract the implementation away from the consumer, do it.  In the end, it's less memory and operations and that's an all-around win.

What Is A Good Trie Implementation For Python?

Before writing this up I looked for one to use, with the following two criteria: it had to behave exactly like a classic Python dict and it had to work on more than just strings.  A cursory Google search found zero candidates that fit the bill, so I wrote my own and posted it to Github.

Please feel free to download it, use it, and love it.  Thanks go out to David Ignacio of litl for helping me keep the pep8 standards in the module up to date and reminding me that dict has a get(key, default) method that I left out of the original implementation.

Building an Apache Thrift Service in Python »