Return to Revisions

2 of 4

edited tags

edited Nov 17, 2014 at 10:59

145.6k
22
191
481

Java Karp-Rabin String Matching, Part 2

Part 1 is here. I wrote a Java implementation of the Karp-Rabin seminumerical string matching algorithm. I incorporated the suggestions of @fge and @maaartinus from the other question into my new code: the match method now returns a boolean so you can write if (KarpRabin.match(pattern, text)), and I changed the underlying algorithm to use arithmetic mod \$2^{32}\$ with an arbitrary base, so that overflow isn't a problem and any character supported by Java's Strings (which are UTF-16) can be understood.

I also added another method, allMatches, which finds all positions in the string which match the given pattern and returns them in a list. It wasn't obvious to me how to split up allMatches and match so I could reuse the code, so most of it is repeated; that's one major issue I'd like suggestions for.

The current implementation of allMatches is not consistently typed because it returns Collections.EMPTY_LIST (with no generic type parameter) when no matches are found, but a List<Integer> when matches are found. I did this because I like using it as in this code:

List<Integer> result = KarpRabin.allMatches(pattern, text);
if (result == Collections.EMPTY_LIST) {
    // do whatever you do when nothing matches
}

better than as in this code:

List<Integer> result = KarpRabin.allMatches(pattern, text);
if (result.size() == 0) {
     // do whatever you do when nothing matches
}

But this breaks generic type consistency and generates a compiler warning, so I'm open to other ways of doing it.

Here's the code:

public class KarpRabin {
    private static final int BASE = 103;  // Arbitrary base for hash.

    /**
       Finds first match, returns true, false if no match found.
    */
    public static boolean match(String pattern, String text) {
        if (pattern.length() > text.length()) {
            return false;
        }
        int out, phash, thash;
        out = 1;

        for (int expt = pattern.length(); expt > 1; --expt) {
            out *= BASE;
        }
    
        // Calculate fingerprint of pattern and of first
        // pattern.length-length group in text.
        phash = thash = 0;
        for (int i = 0; i < pattern.length(); ++i) {
            phash = phash*BASE + pattern.charAt(i);
            thash = thash*BASE + text.charAt(i);
        }
    
        for (int s = 0; s < text.length() - pattern.length(); ++s) {
            if (phash == thash) {
                if (pattern.equals(text.substring(s, s+pattern.length()))) {
                    return true;
                }
            }
            assert s < text.length() &&
                s + pattern.length() < text.length() :
            "s is " + s + " and s+pattern.length() is " +
                (s + pattern.length());
            thash = BASE*(thash - out*text.charAt(s)) +
                text.charAt(s+pattern.length());
        }

        // See note [4].
        if (phash == thash) {
            if (pattern.equals(text.substring(text.length() - pattern.length(),
                                              text.length())))
                return true;
        }
        return false;
    }

    /**
       Finds all matches of pattern in text, returns a list of
       matching positions.

       Currently, the code is mostly repeated from match.
    */
    public static List<Integer> allMatches(String pattern, String text) {
        if (pattern.length() > text.length()) {
            return Collections.EMPTY_LIST;
        }
        int out, phash, thash;
        out = 1;

        for (int expt = pattern.length(); expt > 1; --expt) {
            out *= BASE;
        }
    
        // Calculate fingerprint of pattern and of first
        // pattern.length-length group in text.
        phash = thash = 0;
        for (int i = 0; i < pattern.length(); ++i) {
            phash = phash*BASE + pattern.charAt(i);
            thash = thash*BASE + text.charAt(i);
        }

        List<Integer> matches = new ArrayList<>();
        for (int s = 0; s < text.length() - pattern.length(); ++s) {
            if (phash == thash) {
                if (pattern.equals(text.substring(s, s+pattern.length()))) {
                    matches.add(s);
                }
            }
            assert s < text.length() &&
                s + pattern.length() < text.length() :
            "s is " + s + " and s+pattern.length() is " +
                (s + pattern.length());
            thash = BASE*(thash - out*text.charAt(s)) +
                text.charAt(s+pattern.length());
        }

        // See note [4].
        if (phash == thash) {
            if (pattern.equals(text.substring(text.length() - pattern.length(),
                                              text.length())))
                matches.add(text.length() - pattern.length());
        }
        if (matches.size() == 0) {
            return Collections.EMPTY_LIST;
        }
        return matches;
    }
}

Suggestions for improving readability, robustness, performance, and type safety, as well as for refactoring these methods to get rid of the repeated code, are all welcome and very much appreciated.

asked Nov 17, 2014 at 8:40

tsleyson