4

I need to find a set of substrings (each about 32 characters) in a very large string ( about 100k) as fast as possible. I need the search to be fuzzy.

What is the best algorithm? I tried scanning whole big string for small strings and checking Levenshtein Distance for each step, but it takes lots of time.

7
  • @NeerajJain String.contains() is not a fuzzy method. It searches for exact matches. Commented Apr 16, 2015 at 5:56
  • some discussion over at stackoverflow.com/questions/2891514/… stackoverflow.com/questions/327513/fuzzy-string-search-in-java stackoverflow.com/questions/16351641/… Commented Apr 16, 2015 at 5:56
  • how fast do you need it to be ? Give us something to aim at. Commented Apr 16, 2015 at 16:39
  • do you need to determine the exact positions where the substrings occur ? Or is it enough to just know that it's in there somewhere ? Commented Apr 16, 2015 at 16:53
  • is there a big number of substrings to search for ? Commented Apr 16, 2015 at 16:54

2 Answers 2

3

Take a look at BLAST algorithm (http://en.wikipedia.org/wiki/BLAST). It is used for sequence search (eg DNA search). The basic problem is very similar to yours.

Essentially what you do is index short strings and find areas where matches are abundant, and do more computationally expensive search in that region.

Sign up to request clarification or add additional context in comments.

Comments

1

If I understand what you want right (you want to find a subsequences of a large string that are equal to a given set of strings of length 32), and your alphabet has a reasonable size (letters, digits and punctuation, for instance), then you can do the following:

  1. Find the first occurrence of each letter.

  2. For each position in the string, find the next occurrence of every letter after this position (you can do it in O(l * n) where l is the length of the string and n is the size of your alphabet by scanning from the end for each letter)

  3. For each string in your set of strings, find the first occurrence of the first letter of that string, then from that position find the first occurrence of the second letter in your string etc.

This way you spend O(l * n) time to preprocess, but then for each small string in your set you only do O(m) work where m is the length of that string.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.