sobota, 9 kwietnia 2011

DAWG as dictionary? Yes!

If you read Wikipedia entry about DAWG, then you find following sentence:
Because the terminal nodes of a DAWG can be reached by multiple paths, a DAWG is not suitable for storing auxiliary information relating to each path, e.g. a word's frequency in the English language. A trie would be more useful in such a case.
This isn't true!

There is a quite simple algorithm, that allow to perform two-way minimal perfect hashing (MPH), i.e. convert any path representing a word to a unique number, or back --- a number to a path (word). Values lie in range 1..n, where n is number of distinct words saved in DAWG.

The algorithm is described in Applications of Finite Automata Representing Large Vocabularies, by Claudio Lucchiesi and Tomasz Kowaltowski (preprint is freely available somewhere online).

The main part of the algorithm is assigning to each node a number of reachable words from a node; this can be easily done in one pass. Then these numbers are used to perform perfect hashing. Hashing algorithm is fast and simple, translation from pseudocode presented in the paper is straightforward.

Algorithm requires additional memory for numbers in each node and a table of size n to implement dictionary lookups.



I've updated pyDAWG to support MPH.

pyDAWG

pyDAWG is a python module implementing Directed Acyclic Word Graph, that allow to store set of words in a compacted way. DAWGs are much smaller then tries, while sharing the main advantage of tries - linear time to check if word is present in a set.

The main module is a C extension, there is also a pure python code.