Revisions to Longest word in dictionary that is a subsequence of a given string

Post Undeleted by Rainer P.

occurred Sep 14, 2018 at 18:08

added 1452 characters in body

Source Link

edited Sep 14, 2018 at 18:08

3.1k
12
18

If this wasThe code can be made quite a real interview questionbit more readable by moving parts of the logic into a separate class. Also, they would have expectedsorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to come up with somethingtest all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:

def longest_substringlongest_subsequence(string, words):
    treesequences = SuffixTreeSequences(string)
    best = ""
    for word in words:
        if len(word) > len(best) and word in treesequences:
            best = word
    return best

You certainly hadPreprocessing has moved into the right idea when you started to preprocessSequences constructor and look-ups are now performed by the string to allowin operator. The inner workings of the sequences-object don't matter for faster searches, but your implementation is not only slow but also produces wrong resultsnow, as others have already commented. The correct methodlong as preprocessing is to construct a full suffix tree, which can be done in linear time and look-ups are fast, i. Thee. do not require an exhaustive search.

You could just move your preprocessing code (s_dict_first = {}; for i,s in ...) into the constructor of Sequences and your implementation thereof is out of scope for an interview questionisSubstringPresent into the in-operator and be done, it is sufficient to show that you know whatbut we can do better. Instead of locating only the first character using a suffix tree doesdictionary and why it's applicable here. There are also library implementationsthen falling back to an exhaustive search, we can create dictionaries for all characters.

Next, you sorted the candidate wordsThis can be done efficiently by lengthwalking the string backwards, starting with an empty dictionary. This is unhelpfulAs we walk, even harmfulwe record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in certain situationsa list.

class Sequences:
    """Fast lookups for subsequences in a fixed string"""

    def __init__(self, string):
        """Constructor, preprocessing in O(n) time"""
        n = len(string)
        self.next = [None] * (n+1)
        self.next[n] = {}
        for i in range(n,0,-1):
            self.next[i-1] = { **self.next[i], string[i-1]: i }

    def __contains__(self, string):
        """Fast lookup"""
        i = 0
        for char in string:
            if char not in self.next[i]: return False
            i = self.next[i][char]
        return True

If we later want to look up the shortest candidate word is the only valid substringapple, you still needwe go to test all othersthe first dictionary and query it for the first occurrence of the character a, which might be at position 10. The only thing you 'gain'We then grab the 10th dictionary from sorting is an increase in time complexitythe list of dictionaries and query it for a p, so we get the position of the first p that comes after position 10. We continue until either a query fails or the entire word was found.

If this was a real interview question, they would have expected you to come up with something like this:

def longest_substring(string, words):
    tree = SuffixTree(string)
    best = ""
    for word in words:
        if len(word) > len(best) and word in tree:
            best = word
    return best

You certainly had the right idea when you started to preprocess the string to allow for faster searches, but your implementation is not only slow but also produces wrong results, as others have already commented. The correct method is to construct a full suffix tree, which can be done in linear time. The implementation thereof is out of scope for an interview question, it is sufficient to show that you know what a suffix tree does and why it's applicable here. There are also library implementations.

Next, you sorted the candidate words by length. This is unhelpful, even harmful in certain situations. If the shortest candidate word is the only valid substring, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity.

The code can be made quite a bit more readable by moving parts of the logic into a separate class. Also, sorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:

def longest_subsequence(string, words):
    sequences = Sequences(string)
    best = ""
    for word in words:
        if len(word) > len(best) and word in sequences:
            best = word
    return best

Preprocessing has moved into the Sequences constructor and look-ups are now performed by the in operator. The inner workings of the sequences-object don't matter for now, as long as preprocessing is done in linear time and look-ups are fast, i.e. do not require an exhaustive search.

You could just move your preprocessing code (s_dict_first = {}; for i,s in ...) into the constructor of Sequences and your implementation of isSubstringPresent into the in-operator and be done, but we can do better. Instead of locating only the first character using a dictionary and then falling back to an exhaustive search, we can create dictionaries for all characters.

This can be done efficiently by walking the string backwards, starting with an empty dictionary. As we walk, we record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in a list.

class Sequences:
    """Fast lookups for subsequences in a fixed string"""

    def __init__(self, string):
        """Constructor, preprocessing in O(n) time"""
        n = len(string)
        self.next = [None] * (n+1)
        self.next[n] = {}
        for i in range(n,0,-1):
            self.next[i-1] = { **self.next[i], string[i-1]: i }

    def __contains__(self, string):
        """Fast lookup"""
        i = 0
        for char in string:
            if char not in self.next[i]: return False
            i = self.next[i][char]
        return True

If we later want to look up the word apple, we go to the first dictionary and query it for the first occurrence of the character a, which might be at position 10. We then grab the 10th dictionary from the list of dictionaries and query it for a p, so we get the position of the first p that comes after position 10. We continue until either a query fails or the entire word was found.

Post Deleted by Rainer P.

occurred Sep 14, 2018 at 13:28

Source Link

answered Sep 13, 2018 at 21:03

Rainer P.

3.1k
12
18

If this was a real interview question, they would have expected you to come up with something like this:

def longest_substring(string, words):
    tree = SuffixTree(string)
    best = ""
    for word in words:
        if len(word) > len(best) and word in tree:
            best = word
    return best

You certainly had the right idea when you started to preprocess the string to allow for faster searches, but your implementation is not only slow but also produces wrong results, as others have already commented. The correct method is to construct a full suffix tree, which can be done in linear time. The implementation thereof is out of scope for an interview question, it is sufficient to show that you know what a suffix tree does and why it's applicable here. There are also library implementations.

Next, you sorted the candidate words by length. This is unhelpful, even harmful in certain situations. If the shortest candidate word is the only valid substring, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity.

Stack Exchange Network

Return to Answer