auto-complete

I'm currently playing with the Typeahead problem on Talentbuddy. There are already two questions on this subject (Typeahead Talent Buddy and Typeahead autocomplete functionality challenge), but neither of them uses a trie, which is why I created a new one.

Problem Statement

Given a big list of user names and a list of strings representing queries, your task is to:

  • Write a function that prints to the standard output (stdout) for each query the user name that matches the query
  • If there are multiple user names matching the query please select the one that is the smallest lexicographically
  • All string matches must be case insensitive
  • If no match is found for a given query please print "-1"

Note that your function will receive the following arguments:

  • usernames: an array of strings representing the user names

  • queries: an array of strings representing the queries described above

Data constraints

  • The length of the arrays above will not exceed 100,000 entries

  • Each name or query string will not exceed 30 characters

Efficiency constraints

  • Your function is expected to print the requested result and return in less than 2 seconds

Example

names: ["james", "jBlank"]
queries: ["j", "jm", "jbl", "JB"]
Output: 
      james
      -1
      jBlank
      jBlank

I quickly whipped up a brute-force solution, which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie, which improved the speed by about 100x. This got me a bit further in the tests, but they follow up with a few bigger data sets, and with the last one the algorithm is still too slow.

I've optimized the trie specifically for the problem, then profiled and fine-tuned the implementation, which got me another 100x speed improvement. However, my implementation still barely makes the 2 second time limit on my machine (1.1 seconds on average with CPython) and doesn't make it on the Talentbuddy server.

Current Implementation:

My algorithm works by first creating a prefix tree (trie) from the usernames. The trie is built from dicts, where a dict entry for a letter references its corresponding trie node (which is itself a dict). Each node also stores, under a special key ("_word") in its dict, the lexicographically smallest word matching the prefix used to reach that node.

Creating the trie works by traversing it letter by letter for each word and overwriting the "_word" mapping in each encountered node with the current word. Because the word list is sorted in reverse order, this ensures that each node ends up storing the lexicographically smallest word sharing the prefix used to arrive there.

The trie is queried by traversing it using the letters of the query. The node we arrive at after the last letter stores the lexicographically smallest username matching this prefix.

class Trie(object):
    """ Simple trie implementation for autocompletion.

    Build this trie once by instantiating it with a list of words.
    You can then get the smallest lexicographic match by calling the by_prefix
    method.

    Note: While storing text unaltered, the trie ignores case.
    """

    _word = "_word"

    def __init__(self, words):
        self.root = {}
        # presorting ensures every time we traverse a node the current word will 
        # be lexicographically smaller, thus we can always replace it.
        words.sort(key=str.lower, reverse=True)
        for word in words:
            node = self.root
            for char in word.lower():
                node = node.setdefault(char, {}) 
                node[self._word] = word

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        curr = self.root
        for char in prefix.lower():
            if char in curr:
                curr = curr[char]
            else:
                return "-1"
        else:
            return curr[self._word]

def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Trie(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)

Complexity

The worst-case complexity for a lookup in the trie is now O(n), where n is the length of the query (which can't be larger than 30).

Creating the trie is a bit worse, at O(n*m), where n is the size of the username list and m the average length of a username.

Optimization

I have used the line profiler to find out where most time is spent. I have modified the typeahead function to be able to pinpoint the bottlenecks better:

Total time: 1.01042 s
File: typeahead.py
Function: __init__ at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                               @profile
    14                                               def __init__(self, words):
    15         1            1      1.0      0.0          self.root = {}
    16                                                   # presorting ensures every time we traverse a node the current word will
    17                                                   # be lexicographically smaller, thus we can always replace it.
    18         1        38859  38859.0      3.8          words.sort(key=str.lower, reverse=True)
    19     50001        22553      0.5      2.2          for word in words:
    20     50000        20530      0.4      2.0              node = self.root
    21    615946       249450      0.4     24.7              for char in word.lower():
    22    565946       419829      0.7     41.6                  node = node.setdefault(char, {})
    23    565946       259193      0.5     25.7                  node[self._word] = word

Total time: 0.402196 s
File: typeahead.py
Function: by_prefix at line 25

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                               @profile
    26                                               def by_prefix(self, prefix):
    27                                                   """Return lexicographically smallest word that starts with a given
    28                                                   prefix.
    29                                                   """
    30     50000        19609      0.4      4.9          curr = self.root
    31    332685       130631      0.4     32.5          for char in prefix.lower():
    32    288221       114342      0.4     28.4              if char in curr:
    33    282685       112530      0.4     28.0                  curr = curr[char]
    34                                                       else:
    35      5536         1950      0.4      0.5                  return "-1"
    36                                                   else:
    37     44464        23134      0.5      5.8              return curr[self._word]

Total time: 3.46204 s
File: typeahead.py
Function: typeahead at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def typeahead(usernames, queries):
    41                                               """Given a list of users and queries,
    42                                               this function prints all valid users for these queries.
    43                                           
    44                                               Args:
    45                                                   usernames: list of strings representing users.
    46                                                   queries: list of strings representing (incomplete) input
    47                                               """
    48         1      1745292 1745292.0     50.4      users = Trie(usernames)
    49         1            3      3.0      0.0      results = (users.by_prefix(q) for q in queries)
    50         1       844213 844213.0     24.4      output = "\n".join(results)
    51         1       872537 872537.0     25.2      print output

Possible Bottlenecks

  1. Dictionary creation takes almost 40% of the cost of building the trie. Is there any other data structure in Python suited for fast lookup? In C I would use an array of pointers, but there is no such thing in Python and from looking at the alternatives I gather that classes or named tuples also use a dict to perform member dereferencing. Would using a list and indexing it by (letter - ord('a')) be faster?
  2. When performing the lookups in the trie most of the cost (almost 60%) comes from dictionary lookups. Solving Bottleneck 1 would most likely also solve this one.
  3. 25% of overall time is spent printing the output to the terminal. Unfortunately there is not much we can do about that. I already "optimized" this by doing only 1 call to print, because it blocks on every newline... This is somewhat of a problem with the format talentbuddy uses for output validation. However I hope they at least don't use a terminal on the validation machine but redirect stdout.
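To sketch the idea from bottleneck 1: each node becomes a 27-slot list, with slots 0-25 indexed by ord(char) - ord('a') and the last slot holding the smallest word for that prefix. This is only a hypothetical sketch (the names and layout are my own, not tested against the Talentbuddy data), and it assumes usernames contain nothing but ASCII letters, which the problem statement doesn't actually guarantee:

```python
WORD = 26  # last slot stores the lexicographically smallest word for this prefix

class ListTrie(object):
    def __init__(self, words):
        self.root = [None] * 27
        # Reverse-sort so lexicographically smaller words are inserted later
        # and overwrite the WORD slot of every node they pass through.
        for word in sorted(words, key=str.lower, reverse=True):
            node = self.root
            for char in word.lower():
                i = ord(char) - ord('a')
                if node[i] is None:
                    node[i] = [None] * 27
                node = node[i]
                node[WORD] = word

    def by_prefix(self, prefix):
        node = self.root
        for char in prefix.lower():
            node = node[ord(char) - ord('a')]
            if node is None:
                return "-1"
        return node[WORD]
```

Whether this beats dicts in CPython would need profiling: list indexing avoids hashing, but each node now eagerly allocates 27 slots.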

Do you have any idea how I can further speed this up? I expect it needs to run at least 2x faster to pass on the Talentbuddy machine. 


1 Answer

Accepted answer (4 votes)

Do it the simple way: turn the usernames into a list of tuples (username.lower(), username), then sort this list; the usernames are now ordered lexicographically. In your example you'd get

index = [("james", "james"),
         ("jblank", "jBlank")]

Now, use bisect_left from the bisect module (in the spirit of the search recipes on its documentation page) to find the leftmost entry whose lowercase username is greater than or equal to the lowercase prefix:

i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
    found = index[i]
    if found[0].startswith(prefix):
        return found[1]

return '-1'

The returned value is a tuple ('username', 'UserName'); now just check whether found[0].startswith(prefix): if it does, the answer is found[1]; if it does not, print -1.
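To illustrate, these are the insertion points bisect_left returns for the example index (a minimal sketch of the lookup's first step):

```python
from bisect import bisect_left

# Sorted list of (lowercased, original) username tuples from the example.
index = [("james", "james"), ("jblank", "jBlank")]

# bisect_left returns the insertion point: the position of the first entry
# >= the probe. Pairing the prefix with '' makes the probe sort before any
# real entry whose lowercase name equals the prefix.
print(bisect_left(index, ("j", '')))   # 0 -> index[0] is ("james", "james")
print(bisect_left(index, ("jb", '')))  # 1 -> index[1] is ("jblank", "jBlank")
print(bisect_left(index, ("jm", '')))  # 2 -> past the end, no match
```

The startswith check then decides whether the entry at that position really matches or the answer is "-1".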

Adapting this into a class one gets:

from bisect import bisect_left

names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]


class Index(object):
    def __init__(self, words):
        self.index = [ (w.lower(), w) for w in words ]
        self.index.sort()

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """

        prefix = prefix.lower()
        i = bisect_left(self.index, (prefix, ''))
        if 0 <= i < len(self.index):
            found = self.index[i]
            if found[0].startswith(prefix):
                return found[1]

        return '-1'


def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Index(usernames)
    print("\n".join(users.by_prefix(q) for q in queries))

typeahead(names, queries)

This should be faster than your Trie method: the initial overhead comes from creating one tuple per entry, which is still less burdensome than your solution having to create O(n) dictionaries, and the lookups are faster too.


In case this is not fast enough: instead of a trie, you can store all prefixes in a single dictionary!

class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w

        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), '-1')
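For instance, on the example data this builds a map from every prefix to its answer, with lexicographically smaller names overwriting larger ones (a minimal standalone sketch of the same loop):

```python
# Build the prefix -> smallest-username map for the example data.
index = {}
for w in sorted(["james", "jBlank"], key=str.lower, reverse=True):
    lw = w.lower()
    # Register every prefix of the lowercased name; because names are
    # visited largest-first, smaller names overwrite shared prefixes.
    for i in range(1, len(lw) + 1):
        index[lw[:i]] = w

print(index["j"])              # james
print(index["jbl"])            # jBlank
print(index.get("jm", "-1"))   # -1
```

Each query is then a single dict lookup, at the cost of storing up to 30 entries per username.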
This is a very cool solution to the problem, I like the simplicity. However, it is still not fast enough. When removing the print statement and running it 100 times, it takes an average of 168ms on my machine. My trie takes 238ms under the same conditions. So while it is about 30% faster, it still doesn't pass on the Talentbuddy machine. I wonder if it's actually possible to solve this problem much faster in Python. I think the trie could still be much faster if it weren't using dicts as nodes. However, I can't think of a better alternative. –  driest  Sep 17 '14 at 13:12
 
You did run all the queries on one already-generated data structure, yes? And no, a trie cannot be much faster; bisect has a roughly constant cost of about 17 comparisons for 100,000 usernames, while your trie depends on the number of characters... –  Antti Haapala  Sep 18 '14 at 4:37
 
I cannot think of anything that would be faster for lookups than my dictionary approach, yet it is easier to generate than your trie. Even when I tried to micro-optimize the creation loop, it did not make any significant difference. –  Antti Haapala  Sep 18 '14 at 5:06
 
I called the typeahead function 100 times from ipython with the %timeit function. Since typeahead generates the data structure on each call, I think the benchmark is accurate. I'm not saying I think a trie is easier than your approach; actually I really like its simplicity. However, I think an efficient trie implementation could be faster. As you stated before, lookups in a trie depend on the size of the query, which has an average length of 6 in the test data set. Bisecting the list takes about 17 operations for each query, so it should be slower. –  driest  Sep 18 '14 at 16:06
 
Oops, I just realized you added a second solution to your answer. It uses a lot of memory (but so does the trie) and is actually minimally faster than the trie (225ms compared to 238ms). However, the bisection algorithm is still faster (168ms). I'm going to accept your answer regardless of whether it is fast enough for Talentbuddy, because it is a significant improvement in speed and the performance issues might be Python-specific for this problem. –  driest  Sep 18 '14 at 16:23