Tries, dictionaries and directories
Sat Nov 25 09:02:43 EST 2006
A trie, or prefix tree, is a tree structure that maps keys to values stored in the tree. The tree itself consists of a set of labelled edges and nodes containing the values. The key corresponding to the value at a particular node is the concatenation of the path labels leading from the root of the tree to the node.
A standard application is to use a trie to represent a dictionary (see the exercises at the end of chapter 2 of the ORA Ocaml book). Here the edges are labelled by the characters making up a word, and the associated value is a boolean stating whether the node is the end of a word.
A trie can also be used to represent the layout of a website consisting of a root index page followed by a hierarchy of subpages (ie the layout of my website). Here the node values are wrapper functions that wrap content depending on its position in the tree, for example producing a navigation menu that depends on the position of the element in the tree. The wrapper function should take the trie and the particular node as arguments, and the trie should have accessor functions for children, parent, ancestors, descendents of the node in order to construct the navigation menu.
A trie is interesting on a more fundamental level because it allows sorting in O(n) operations. Anyone who has done an introductory computer science course is at this stage going "Ha - wrong! We learnt in CS101 that the best a sort algorithm can do is O(n*log(n)), which is obvious because of the number of comparisons that you need to make to locate and element uniquely in the tree." This is true, but only if you are considering comparison based sorts. And fundamentally this depends on the structure of the thing that you are sorting.
To sort objects you need at minimum some total ordering function for the objects, ie some function that returns -1,0,1 depending on whether the first object is "smaller", "equal" or "greater" than the second object respectively. What "smaller" etc mean depends on the data type - obvious for numbers, less so for strings, complicated for more complicated objects. But as long as you can find some function that allows comparison between two objects you can construct a binary search tree from those objects (and at worst you can create a unique hash value for each object and use that as the key). This is the basis of the O(n*log(n)) limit.
However, for some objects there is a more powerful ordering function that can be used - lexicographical ordering. This requires the objects to be able to be broken up into smaller parts, the obvious example being a string being broken up into individual characters. Then the ordering function can compare objects piece by piece. This ordering is available for anything that can be represented as a string, so is applicable for things like positive integers and directory paths. There are some problems with using this for floats and negative integers but these can be overcome (google for radix sort float). Non-compound objects (ie ones that can't be represented of strings of characters) cannot be ordered with this method.
Using lexicographical ordering it is possible to construct a trie from an input list of objects, taking O(n) operations, and then read out the objects in the tree in order in O(n) operations (claimed, not proved here, check the trie wikipedia article if interested), resulting in an O(n) sort. I haven't proved this here, check the literature if interested. The main point is that by altering one underlying assumption (going from total ordering to lexicographical ordering) it was possible to bypass a "fundamental" limit.
- Sidenote: the radix sort for floats I found on google a while back (hopefully near the top of the first page since I can't remember the link) described the radix sort in terms of creating an offset table and some other tables, multiple passes looking at each byte of the input etc. Here we see the advantage of the functional programming description: "Put each element of the input list into a trie. Read out the trie in order." This relies on us having a trie data structure with Trie.oflist and Trie.tolist functions, which we have abstracted out of the description of the sort. (A slight complication is the need to pad integers with leading zeros if the integers have different numbers of digits.)
Here is the code for an example trie or prefix tree datatype (still needs to be cleaned up a bit, but some example code is better than none):
(* Lexical trees (or tries) are used to store dictionaries *) type 'a lex_node = Letter of char * 'a * 'a lex_tree and 'a lex_tree = 'a lex_node list;; type word = string;; (* node constructor *) let make_nd l v c = Letter(l,v,c) (* accessor functions for the trie *) let label (Letter(c,_,_)) = c let value (Letter(_,v,_)) = v let children (Letter(_,_,lst)) = lst let id x = x (* Explode string to list of chars *) let explode wd = let n = String.length wd in let rec aux k acc = if k=0 then (wd. :: acc) else aux (k-1) (wd.[k] :: acc) in aux (n-1) ;; (* Make a tree from a list of characters. *) let rec make_tr f d v clst = let rec aux lst acc = match lst with |  ->  | x ::  -> [make_nd x (v x acc) ] | x :: xs -> [make_nd x (d x acc) (aux xs ((f x)::acc))] in aux clst  (* Merge two trees of the same type *) let rec merge f t1 t2 = match t1, t2 with | _,  -> t1 | , _ -> t2 | x :: xs, y :: ys -> if (label x) = (label y) then merge f ((make_nd (label x) (f (value x) (value y)) (merge f (children x) (children y)) ) :: xs) ys else if (label x) < (label y) then x :: (merge f xs t2) else merge f (y :: t1) ys (* Insert a character list into a tree by merging the tree with the tree created from the list of characters *) let insert f g d v t clst = merge f t (make_tr g d v clst) (* Create a tree from a list of words *) let of_list f g d v strlst = List.fold_left (fun acc x -> insert f g d v acc (explode x))  strlst (* Create a dictionary ie mapping word -> bool *) let make_dict = of_list (fun x y -> x & y) id (fun x acc -> false) (fun x acc -> true);; (* Create a word counter ie mapping word -> int *) let make_count = of_list (fun x y -> x + y) id (fun x acc-> 0) (fun x acc-> 1);; let make_wrap = let prefix lst = (List.fold_left (fun a c -> a ^ (String.make 1 c)) "" lst ) in of_list (fun x y -> x ^ y) id (fun x acc -> "") (fun x acc -> (prefix (List.rev (x :: acc))) ^ "/");; (* lookup : returns value associated with a key *) let lookup_nd tr wd = let lfun f lst nd = if lst =  then nd else f (children nd) lst in let rec aux ndlst wdlst = match ndlst with |  -> raise Not_found | x :: xs -> match wdlst with |  -> raise Not_found | y :: ys -> if y=(label x) then lfun aux ys x else aux xs wdlst in aux tr (explode wd);; let lookup tr wd = value (lookup_nd tr wd) let child_tree tr wd = children (lookup_nd tr wd) let compose f g x = f(g x) let add_el y lst = y :: lst (* path : return path to word *) let path tr wd = let pfun f lst nd = if lst= then ( if (value nd) then  else raise Not_found ) else (f (children nd) lst) in let rec aux ndlst wdlst = match ndlst with |  -> raise Not_found | x :: xs -> match wdlst with |  -> raise Not_found | y :: ys -> if y=(label x) then y :: (pfun aux ys x) else aux xs wdlst in aux tr (explode wd);; (* verify a list of words against a trie, return those that aren't in the trie *) let verify tr wdlst = List.filter (fun x->not (lookup tr x)) wdlst;; let test = make_dict ["fa";"false";"fare";"fried";"frieze";"table";"ta"];; (* select words of a given length from a trie *) let select tr n = let rec aux nd k prefix = let prefixstr = prefix ^ (String.make 1 (label nd)) in match k with | 0 ->  | 1 -> if (value nd) then [prefixstr] else  | _ -> List.fold_right ( @ ) (List.map (fun x->aux x (k-1) prefixstr) (children nd))  in List.fold_right (fun x acc->(aux x n "") @ acc) tr ;; (* Return list of all words in tree *) let to_list tr = let rec aux nd prefix = let prefixstr = prefix ^ (String.make 1 (label nd)) in List.fold_right ( @ ) (List.map (fun x->aux x prefixstr) (children nd)) (if (value nd) then [prefixstr] else ) in List.fold_right (fun x acc->(aux x "") @ acc) tr ;; (* Radix sort for integers of length <8 *) let radix_sort intlst = List.map int_of_string (to_list (make_dict (List.map (fun x->Printf.sprintf "%08d" x) intlst)));;