fuzzy fast string matching and indexing algorithm

Question

I need to find a set of substrings (each about 32 characters long) in a very large string (about 100k characters) as fast as possible. The search needs to be fuzzy.

What is the best algorithm? I tried scanning the whole large string for each small string and computing the Levenshtein distance at every position, but that takes a lot of time.

@NeerajJain String.contains() is not a fuzzy method. It searches for exact matches.
– AVEbrahimi, Commented Apr 16, 2015 at 5:56
Some discussion over at stackoverflow.com/questions/2891514/… stackoverflow.com/questions/327513/fuzzy-string-search-in-java stackoverflow.com/questions/16351641/…
– Thilo, Commented Apr 16, 2015 at 5:56
How fast do you need it to be? Give us something to aim at.
– bvdb, Commented Apr 16, 2015 at 16:39
Do you need to determine the exact positions where the substrings occur? Or is it enough to just know that they are in there somewhere?
– bvdb, Commented Apr 16, 2015 at 16:53

ElKamina · Accepted Answer · 2015-04-16 06:33:33Z

3

Take a look at the BLAST algorithm (http://en.wikipedia.org/wiki/BLAST). It is used for sequence search (e.g. DNA search). The basic problem is very similar to yours.

Essentially, you index short substrings, find the regions where matches are abundant, and then run a more computationally expensive search only in those regions.
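The idea above can be sketched roughly as follows. This is not BLAST itself (which uses scoring matrices and significance statistics), only a minimal seed-and-verify sketch in the same spirit. The function names `build_kmer_index` and `fuzzy_find`, and the parameters `k`, `max_dist` and `min_seeds`, are made up for illustration. It also assumes a match has roughly the same length as the pattern, so it only checks fixed-length windows.

```python
from collections import defaultdict


def build_kmer_index(text, k):
    """Map every length-k substring of text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_find(text, index, pattern, k, max_dist, min_seeds=2):
    """Return sorted (start, distance) pairs for windows within max_dist edits.

    Hypothetical helper: windows have the same length as the pattern, which is
    a simplification of true approximate substring matching.
    """
    # Each exact k-mer hit votes for the text offset where the pattern would start.
    votes = defaultdict(int)
    for j in range(len(pattern) - k + 1):
        for i in index.get(pattern[j:j + k], ()):
            votes[i - j] += 1

    hits = {}
    for start, count in votes.items():
        if count < min_seeds:
            continue  # too few seeds: skip the expensive check
        # Insertions and deletions shift the alignment, so try nearby starts too.
        for s in range(max(0, start - max_dist), start + max_dist + 1):
            if s in hits:
                continue
            window = text[s:s + len(pattern)]
            if not window:
                continue
            d = levenshtein(window, pattern)
            if d <= max_dist:
                hits[s] = d
    return sorted(hits.items())
```

The index is built once for the large string and reused for every pattern, so the Levenshtein check only runs on the few windows that share at least `min_seeds` exact k-mers with the pattern. Neighbouring start positions can both be reported for the same underlying match; a real implementation would probably merge them.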


Ishamael · Accepted Answer · 2015-04-16 08:05:20Z

If I understand correctly what you want (to find subsequences of a large string that are equal to a given set of strings of length 32), and your alphabet has a reasonable size (letters, digits and punctuation, for instance), then you can do the following:

1. Find the first occurrence of each letter.
2. For each position in the string, find the next occurrence of every letter after this position. You can do it in O(l * n), where l is the length of the string and n is the size of your alphabet, by scanning from the end for each letter.
3. For each string in your set of strings, find the first occurrence of the first letter of that string, then from that position find the first occurrence of the second letter of that string, and so on.

This way you spend O(l * n) time on preprocessing, but then for each small string in your set you only do O(m) work, where m is the length of that string.
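A minimal sketch of these steps, under some assumptions: `build_next_table` and `find_subsequence` are hypothetical names, and each table row is a dict rather than a fixed-size array, which is simple but memory-hungry for a 100k-character string. Note that this checks whether the pattern occurs as a subsequence (characters in order, gaps allowed), as described above, and not whether it is within some edit distance.

```python
def build_next_table(text):
    """nxt[i][c] is the smallest index j >= i with text[j] == c.

    Characters that do not occur at or after position i are simply absent
    from nxt[i]. The table is filled by scanning from the end of the string.
    """
    nxt = [None] * (len(text) + 1)
    nxt[len(text)] = {}
    for i in range(len(text) - 1, -1, -1):
        row = dict(nxt[i + 1])  # copying costs O(alphabet size) per position
        row[text[i]] = i
        nxt[i] = row
    return nxt


def find_subsequence(nxt, pattern):
    """Return the matched positions of pattern as a subsequence, or None."""
    pos = 0
    positions = []
    for ch in pattern:
        j = nxt[pos].get(ch)
        if j is None:
            return None  # ch does not occur at or after pos
        positions.append(j)
        pos = j + 1      # continue strictly after the matched character
    return positions
```

For example, with `nxt = build_next_table("abracadabra")`, calling `find_subsequence(nxt, "acd")` walks the table in three lookups, one per pattern character, regardless of how long the large string is.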
