Prefix tree (trie) in F#

What is a prefix tree?

A prefix tree - otherwise known as a trie (pronounced “tray”) - is a data structure designed for fast searching over prefixes - for example given the list of words [CAT, CATS, CART, CARTS], which words are prefixed with “CAR”? Performing this search on a hashset would require you to check every value (O(n)) complexity). A sorted list would need to perform a fast search for the first item with the prefix and then scan or search the list until no matching items remain - O(logn) + O(logm) where m is the sublist which has the prefix - which is fast, but a sorted list is slow to build, slower than a hashset to search for single items and slow to tell me if a given string is a valid prefix to any word in the list.

With a prefix tree this search is performed in O(m*p) time (again m is the sublist of prefixed strings, and p is the length of the average result string - typically very small compared to the string list) and search for a single string in the prefix tree is performed in O(p) where p is the length of the string to search for (not as fast a a hash set, but very near for most cases). At a glance this seems worse than a sorted list, but in typical use cases, such as storing dictionary words, the length of the word you’re searching for is a much smaller number than the number of words you’re storing.

And the coolest thing about a prefix tree is that it’s actually really simple. A prefix tree is a node with (optionally) a value, and a map of values to subnodes. To construct a prefix tree, start with a node with no value and add a string by appending that node’s map with the first character, mapping to a new trie with that character as its value, and add the rest of the string to that node recursively in the same way.

Implementation

I have implemented a simple prefix tree in F#. The code is available here and you can find it on nuget. This trie works well in C# and F# (and any other CLR language) and unlike other F# prefix trees I’ve spotted around the internet, mine is completely immutable.

Here’s an example of usage from F#:

open FTrie
[<EntryPoint>]
let main argv = 
    let t = Trie(["abcde";"abba";"ababab"])
    printfn "%A" (t.getWords()) // seq ["ababab"; "abba"; "abcde"]

    let t2 = t.withWord("cdfge");
    printfn "%A" (t2.getWordsPrefixedBy("ab")) // seq ["ababab"; "abba"; "abcde"]
    0

And in C#

using FTrie;
using System;
class Program
{
    static void Main(string[] args)
    {
        Trie t = new Trie(new[] { "cars", "carts", "cats", "cartoons" });
        Console.WriteLine(string.Join(", ", t.getWordsPrefixedBy("car"))); // cars, cartoons, carts

        var t2 = t.withWord("carve");
        Console.WriteLine(t.isPrefix("car")); // True
        Console.WriteLine(t.isPrefix("cra")); // False;
        Console.ReadLine();
    }
}

A better add function

This immutable implementation creates a new trie when you call the withWord() function instead of adding the word to the existing data structure. It does this by just pulling out every existing word and calling the Trie constructor with the plus the new one. With very large tries this could become quite memory and computationally intensive.

You could implement a better add function using the following algorithm. I didn’t bother because I didn’t need this functionality, so I leave it as an exercise for you (feel free to open a pull request into my repo if you find a good solution).

add(word)
1. currentNode = root
2. while currentNode contains first letter of word
3.   currentNode = currentNode.children[first letter of word]
4. clone currentNode except with a new children map containing the rest of this word chain
5. replace the chain of parents with new tries that reference the new nodes

What you’ll end up with is two trees that reference identical nodes in all but one branch and have different roots. This would be much faster than creating whole new prefix trees from scratch, and use less memory. Because all branches are immutable no one with a reference to another trie could mess around with data in your branches. Don’t forget to account for the EOW bit!