Longest word in dictionary that is a subsequence of a given string

Question

I am trying out the following former Google interview question.

The Challenge Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

Note: D can appear in any format (list, hash table, prefix tree, etc.

For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"

The words "able" and "ale" are both subsequences of S, but they are shorter than "apple". The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order. The word "kangaroo" is the longest word in D, but it isn't a subsequence of S. Learning objectives This question gives you the chance to practice with algorithms and data structures. It’s also a good example of why careful analysis for Big-O performance is often worthwhile, as is careful exploration of common and worst-case input conditions

My approach uses a greedy algorithm.

Sort D in descending order(longest word first)
Find the index of first character in a word
Scan S from indexOfFirstCharacter to find other characters in the word
If we reach to the end of string and there are still characters remaining to be seen in a word then the word is not found
Repeat 2-4 until we find a word

from collections import deque
D = ["able","ale", "apple", "bale", "kangaroo"]
"""
TC : DlogD 
"""
if len(D) > 1:
  D.sort(key=len,reverse=True)

S = "abppplee"
s_dict_first = {}
"""
TC : O(S) 
"""
for i,s in enumerate(S):
  if s not in s_dict_first: s_dict_first[s] = i

"""
TC : O(1)
"""
#return first index of char
def getindex(char):
  if char in s_dict_first:
    return s_dict_first[char]
  return -1

"""
TC : O(w + w +S) = O(w + s)
"""
def isSubstringPresent(startIndex,word):
  #word is too big to be found in S
  if len(word) - 2 > len(S) - 1 - startIndex:
    return False
  remaining_chars = list(word[1:])
  char_queue = deque()
  for char in remaining_chars:
    char_queue.append(char)
  for i in range(startIndex,len(S)):
    if len(char_queue) == 0:return True 
    if S[i] == char_queue[0]:
      char_queue.popleft()
  return len(char_queue) == 0

"""
TC : O(DlogD + D(w + s)) 
"""
for word in D:
  first_char = word[0]
  indexStart = getindex(first_char) + 1
  if indexStart == 0:continue
  if isSubstringPresent(indexStart,word):
     print(word)
     break;

I am open to suggestion in improving the code / any bug in my time/space complexity analysis or any other useful comments.

If I set D = ['abcd', 'bcd'] and S = 'abcd' then this code prints bcd. But the problem says that it should print the longest word that is a substring of the target, so it should print abcd. — Gareth Rees
– Gareth Rees, Commented Sep 13, 2018 at 20:19
@GarethRees you are right , i forgot to set the key as len for sorting , that's why it fails.D.sort(key=len,reverse=True) — user2650277
– user2650277, Commented Sep 14, 2018 at 3:29
@GarethRees could you please remove the hold as i had fixed the bug (check my edit) — user2650277
– user2650277, Commented Sep 14, 2018 at 11:32

jaksz · Accepted Answer · 2018-09-14 20:25:46Z

2

How about using a regex for S, where all characters are optional by appending a ? to each character:

import re

D = ["able","ale", "apple", "bale", "kangaroo"]
S = "abppplee"

regex = '^' + '?'.join(list(S)) + '?' + '$'

words = sorted(D, key=lambda word: len(word), reverse=True)

for word in words:
    if re.match(regex, word):
        print(word)
        break

edited Sep 14, 2018 at 20:25

answered Sep 14, 2018 at 14:07

jaksz

212 bronze badges

\$\begingroup\$ Did you test this code? \$\endgroup\$

Gareth Rees
– Gareth Rees

2018-09-14 20:13:18 +00:00
Commented Sep 14, 2018 at 20:13
\$\begingroup\$ the new code should be working @GarethRees \$\endgroup\$

jaksz
– jaksz

2018-09-14 20:30:23 +00:00
Commented Sep 14, 2018 at 20:30
\$\begingroup\$ How do you define the time and space complexity of a regex solution ? \$\endgroup\$

user2650277
– user2650277

2018-09-15 06:43:29 +00:00
Commented Sep 15, 2018 at 6:43
\$\begingroup\$ Looks better now, but what if S=''? \$\endgroup\$

Gareth Rees
– Gareth Rees

2018-09-15 12:30:40 +00:00
Commented Sep 15, 2018 at 12:30
\$\begingroup\$ @GarethRees this is a valid question but it concerns edge cases, the OP was asking about implementation strategies in general. to polish the code for an actual implementation, one could use an if clause. \$\endgroup\$

jaksz
– jaksz

2018-09-15 14:31:57 +00:00
Commented Sep 15, 2018 at 14:31

| Show 1 more comment

Rainer P. · Accepted Answer · 2018-09-14 18:08:16Z

The code can be made quite a bit more readable by moving parts of the logic into a separate class. Also, sorting the candidates by length is not helpful, even harmful in certain situations. If the shortest candidate word is the only valid subsequence, you still need to test all others. The only thing you 'gain' from sorting is an increase in time complexity. The main method could look like this:

def longest_subsequence(string, words):
    sequences = Sequences(string)
    best = ""
    for word in words:
        if len(word) > len(best) and word in sequences:
            best = word
    return best

Preprocessing has moved into the Sequences constructor and look-ups are now performed by the in operator. The inner workings of the sequences-object don't matter for now, as long as preprocessing is done in linear time and look-ups are fast, i.e. do not require an exhaustive search.

You could just move your preprocessing code (s_dict_first = {}; for i,s in ...) into the constructor of Sequences and your implementation of isSubstringPresent into the in-operator and be done, but we can do better. Instead of locating only the first character using a dictionary and then falling back to an exhaustive search, we can create dictionaries for all characters.

This can be done efficiently by walking the string backwards, starting with an empty dictionary. As we walk, we record the last (actually the next, since we're walking backwards) occurrence of each character. We then store all these dictionaries in a list.

class Sequences:
    """Fast lookups for subsequences in a fixed string"""

    def __init__(self, string):
        """Constructor, preprocessing in O(n) time"""
        n = len(string)
        self.next = [None] * (n+1)
        self.next[n] = {}
        for i in range(n,0,-1):
            self.next[i-1] = { **self.next[i], string[i-1]: i }

    def __contains__(self, string):
        """Fast lookup"""
        i = 0
        for char in string:
            if char not in self.next[i]: return False
            i = self.next[i][char]
        return True

If we later want to look up the word apple, we go to the first dictionary and query it for the first occurrence of the character a, which might be at position 10. We then grab the 10th dictionary from the list of dictionaries and query it for a p, so we get the position of the first p that comes after position 10. We continue until either a query fails or the entire word was found.

I forgot to set the key as len , that's why its gave wrong result.I didn't know about suffix trees.Can you elaborate more on how a suffix tree is applicable here since we can skip characters.Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters. — user2650277
– user2650277, Commented Sep 14, 2018 at 3:57
for S = "abppplee" will a suffix tree recognise "apple" even if there is a b after the "a" and an additional "p". ? — user2650277
– user2650277, Commented Sep 14, 2018 at 13:11
Read the question again as this is exactly what is needed here.For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple" — user2650277
– user2650277, Commented Sep 14, 2018 at 13:21
The comment says, "preprocessing in \$O(n)\$ time" but I make it \$O(n^2)\$ since at each step the code copies out the previously constructed dictionary and adds a new entry if string[i-1] is different from all previously encountered characters. — Gareth Rees
– Gareth Rees, Commented Sep 15, 2018 at 16:40

Stack Exchange Network

Longest word in dictionary that is a subsequence of a given string

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Longest word in dictionary that is a subsequence of a given string

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions