auto-complete

I'm currently playing with the Typeahead problem on Talentbuddy. There are already two questions on this subject (Typeahead Talent Buddy and Typeahead autocomplete functionality challenge), but neither of them uses a trie, which is why I created a new one.

Problem Statement

Given a big list of user names and a list of strings representing queries, your task is to:

  • Write a function that prints to the standard output (stdout) for each query the user name that matches the query
  • If there are multiple user names matching the query please select the one that is the smallest lexicographically
  • All string matches must be case insensitive
  • If no match is found for a given query please print "-1"

Note that your function will receive the following arguments:

  • usernames: an array of strings representing the user names

  • queries: an array of strings representing the queries described above

Data constraints

  • The length of the arrays above will not exceed 100,000 entries

  • Each name or query string will not exceed 30 characters

Efficiency constraints

  • Your function is expected to print the requested result and return in less than 2 seconds

Example

names: ["james", "jBlank"]
queries: ["j", "jm", "jbl", "JB"]
Output: 
      james
      -1
      jBlank
      jBlank

I quickly whipped up a brute-force solution, which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie, which improved the speed by about 100x. This got me a bit further in the tests, but they follow up with a few bigger data sets, and with the last one the algorithm is still too slow.

I've optimized the trie specifically for the problem, then profiled and fine-tuned the implementation, which got me another 100x speed improvement. However, my implementation still barely makes the 2 second time limit on my machine (1.1 seconds on average with CPython) and doesn't make it on the Talentbuddy server.

Current Implementation:

My algorithm works by first creating a prefix tree (trie) from the usernames. The trie is built from dicts, where a dict entry for a letter references its corresponding trie node (which is itself a dict). Each node also stores, under a special key ("_word") in its dict, the lexicographically smallest word matching the prefix used to reach that node.

Creating the trie works by traversing it letter by letter for each word and overwriting the "_word" mapping in each encountered node with the current word. Because the word list is sorted in reverse order, this ensures that each node ends up storing the lexicographically smallest word sharing the prefix used to arrive there.

The trie is queried by traversing it using the letters of the query. The node we arrive at after the last letter stores the lexicographically smallest username matching this prefix.

class Trie(object):
    """ Simple trie implementation for autocompletion.

    Build this trie once by instantiating it with a list of words.
    You can then get the smallest lexicographic match by calling the by_prefix
    method.

    Note: While storing text unaltered, the trie ignores case.
    """

    _word = "_word"

    def __init__(self, words):
        self.root = {}
        # presorting ensures every time we traverse a node the current word will 
        # be lexicographically smaller, thus we can always replace it.
        words.sort(key=str.lower, reverse=True)
        for word in words:
            node = self.root
            for char in word.lower():
                node = node.setdefault(char, {}) 
                node[self._word] = word

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        curr = self.root
        for char in prefix.lower():
            if char in curr:
                curr = curr[char]
            else:
                return "-1"
        else:
            return curr[self._word]

def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Trie(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)

Complexity

The worst-case complexity for a lookup in the trie is now O(n), where n is the length of the query (which can't be larger than 30).

Creating the trie is a bit worse, at O(n*m), where n is the size of the username list and m the average length of a username.

Optimization

I have used the line profiler to find out where most time is spent. I have modified the typeahead function to be able to pinpoint the bottlenecks better:

Total time: 1.01042 s
File: typeahead.py
Function: __init__ at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                               @profile
    14                                               def __init__(self, words):
    15         1            1      1.0      0.0          self.root = {}
    16                                                   # presorting ensures every time we traverse a node the current word will
    17                                                   # be lexicographically smaller, thus we can always replace it.
    18         1        38859  38859.0      3.8          words.sort(key=str.lower, reverse=True)
    19     50001        22553      0.5      2.2          for word in words:
    20     50000        20530      0.4      2.0              node = self.root
    21    615946       249450      0.4     24.7              for char in word.lower():
    22    565946       419829      0.7     41.6                  node = node.setdefault(char, {})
    23    565946       259193      0.5     25.7                  node[self._word] = word

Total time: 0.402196 s
File: typeahead.py
Function: by_prefix at line 25

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                               @profile
    26                                               def by_prefix(self, prefix):
    27                                                   """Return lexicographically smallest word that starts with a given
    28                                                   prefix.
    29                                                   """
    30     50000        19609      0.4      4.9          curr = self.root
    31    332685       130631      0.4     32.5          for char in prefix.lower():
    32    288221       114342      0.4     28.4              if char in curr:
    33    282685       112530      0.4     28.0                  curr = curr[char]
    34                                                       else:
    35      5536         1950      0.4      0.5                  return "-1"
    36                                                   else:
    37     44464        23134      0.5      5.8              return curr[self._word]

Total time: 3.46204 s
File: typeahead.py
Function: typeahead at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def typeahead(usernames, queries):
    41                                               """Given a list of users and queries,
    42                                               this function prints all valid users for these queries.
    43                                           
    44                                               Args:
    45                                                   usernames: list of strings representing users.
    46                                                   queries: list of strings representing (incomplete) input
    47                                               """
    48         1      1745292 1745292.0     50.4      users = Trie(usernames)
    49         1            3      3.0      0.0      results = (users.by_prefix(q) for q in queries)
    50         1       844213 844213.0     24.4      output = "\n".join(results)
    51         1       872537 872537.0     25.2      print output

Possible Bottlenecks

  1. Dictionary creation takes almost 40% of the cost of building the trie. Is there any other data structure in Python suited for fast lookup? In C I would use an array of pointers, but there is no such thing in Python and from looking at the alternatives I gather that classes or named tuples also use a dict to perform member dereferencing. Would using a list and indexing it by (letter - ord('a')) be faster?
  2. When performing the lookups in the trie most of the cost (almost 60%) comes from dictionary lookups. Solving Bottleneck 1 would most likely also solve this one.
  3. 25% of overall time is spent printing the output to the terminal. Unfortunately there is not much we can do about that. I already "optimized" this by doing only 1 call to print, because it blocks on every newline... This is somewhat of a problem with the format talentbuddy uses for output validation. However I hope they at least don't use a terminal on the validation machine but redirect stdout.
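To sketch the idea from bottleneck 1: each node becomes a 27-slot list, with slots 0-25 indexed by ord(char) - ord('a') and the last slot holding the smallest word for that prefix. This is only a hypothetical sketch (the names and layout are my own, not tested against the Talentbuddy data), and it assumes usernames contain nothing but ASCII letters, which the problem statement doesn't actually guarantee:

```python
WORD = 26  # last slot stores the lexicographically smallest word for this prefix

class ListTrie(object):
    def __init__(self, words):
        self.root = [None] * 27
        # Reverse-sort so lexicographically smaller words are inserted later
        # and overwrite the WORD slot of every node they pass through.
        for word in sorted(words, key=str.lower, reverse=True):
            node = self.root
            for char in word.lower():
                i = ord(char) - ord('a')
                if node[i] is None:
                    node[i] = [None] * 27
                node = node[i]
                node[WORD] = word

    def by_prefix(self, prefix):
        node = self.root
        for char in prefix.lower():
            node = node[ord(char) - ord('a')]
            if node is None:
                return "-1"
        return node[WORD]
```

Whether this beats dicts in CPython would need profiling: list indexing avoids hashing, but each node now eagerly allocates 27 slots.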

Do you have any idea how I can further speed this up? I expect it needs to run at least 2x faster to pass on the Talentbuddy machine. 


1 Answer

Accepted answer (4 votes)

Do it the simple way: turn the usernames into a list of tuples (username.lower(), username), then sort this list; the usernames are now ordered lexicographically. In your example you'd get

index = [("james", "james"),
         ("jblank", "jBlank")]

Now, use bisect_left from the bisect module (in the spirit of the search recipes on its documentation page) to find the leftmost entry whose lowercase username is greater than or equal to the lowercase prefix:

i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
    found = index[i]
    if found[0].startswith(prefix):
        return found[1]

return '-1'

The returned value is a tuple ('username', 'UserName'); now just check whether found[0].startswith(prefix): if it does, the answer is found[1]; if it does not, print -1.
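To illustrate, these are the insertion points bisect_left returns for the example index (a minimal sketch of the lookup's first step):

```python
from bisect import bisect_left

# Sorted list of (lowercased, original) username tuples from the example.
index = [("james", "james"), ("jblank", "jBlank")]

# bisect_left returns the insertion point: the position of the first entry
# >= the probe. Pairing the prefix with '' makes the probe sort before any
# real entry whose lowercase name equals the prefix.
print(bisect_left(index, ("j", '')))   # 0 -> index[0] is ("james", "james")
print(bisect_left(index, ("jb", '')))  # 1 -> index[1] is ("jblank", "jBlank")
print(bisect_left(index, ("jm", '')))  # 2 -> past the end, no match
```

The startswith check then decides whether the entry at that position really matches or the answer is "-1".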

Adapting this into a class one gets:

from bisect import bisect_left

names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]


class Index(object):
    def __init__(self, words):
        self.index = [ (w.lower(), w) for w in words ]
        self.index.sort()

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """

        prefix = prefix.lower()
        i = bisect_left(self.index, (prefix, ''))
        if 0 <= i < len(self.index):
            found = self.index[i]
            if found[0].startswith(prefix):
                return found[1]

        return '-1'


def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Index(usernames)
    print("\n".join(users.by_prefix(q) for q in queries))

typeahead(names, queries)

This should be faster than your Trie method: the initial overhead comes from creating one tuple per entry, which is still less burdensome than your solution having to create O(n) dictionaries, and the lookups are faster too.


In case this is not fast enough: instead of a trie, you can store all prefixes in a single dictionary!

class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w

        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), '-1')
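For instance, on the example data this builds a map from every prefix to its answer, with lexicographically smaller names overwriting larger ones (a minimal standalone sketch of the same loop):

```python
# Build the prefix -> smallest-username map for the example data.
index = {}
for w in sorted(["james", "jBlank"], key=str.lower, reverse=True):
    lw = w.lower()
    # Register every prefix of the lowercased name; because names are
    # visited largest-first, smaller names overwrite shared prefixes.
    for i in range(1, len(lw) + 1):
        index[lw[:i]] = w

print(index["j"])              # james
print(index["jbl"])            # jBlank
print(index.get("jm", "-1"))   # -1
```

Each query is then a single dict lookup, at the cost of storing up to 30 entries per username.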
This is a very cool solution to the problem, I like the simplicity. However, it is still not fast enough. When removing the print statement and running it 100 times, it takes an average of 168ms on my machine. My trie takes 238ms under the same conditions. So while it is about 30% faster, it still doesn't pass on the Talentbuddy machine. I wonder if it's actually possible to solve this problem much faster in Python. I think the trie could still be much faster if it weren't using dicts as nodes. However, I can't think of a better alternative. –  driest  Sep 17 '14 at 13:12
 
You did run all the queries on one already-generated data structure, yes? And no, a trie cannot be much faster; bisect has a roughly constant cost of about 17 comparisons for 100,000 usernames, while your trie depends on the number of characters... –  Antti Haapala  Sep 18 '14 at 4:37
 
I cannot think of anything that would be faster for lookups than my dictionary approach, yet it is easier to generate than your trie. Even when I tried to micro-optimize the creation loop, it did not make any significant difference. –  Antti Haapala  Sep 18 '14 at 5:06
 
I called the typeahead function 100 times from ipython with the %timeit function. Since typeahead generates the data structure on each call, I think the benchmark is accurate. I'm not saying I think a trie is easier than your approach; actually I really like its simplicity. However, I think an efficient trie implementation could be faster. As you stated before, lookups in a trie depend on the size of the query, which has an average length of 6 in the test data set. Bisecting the list takes about 17 operations for each query, so it should be slower. –  driest  Sep 18 '14 at 16:06
 
Oops, I just realized you added a second solution to your answer. It uses a lot of memory (but so does the trie) and is actually minimally faster than the trie (225ms compared to 238ms). However, the bisection algorithm is still faster (168ms). I'm going to accept your answer regardless of whether it is fast enough for Talentbuddy, because it is a significant improvement in speed and the performance issues might be Python-specific for this problem. –  driest  Sep 18 '14 at 16:23