Skip to main content
Tweeted twitter.com/#!/StackCodeReview/status/534495298077200384
typos
Source Link
rolfl
  • 98.2k
  • 17
  • 220
  • 419

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search termsterm's hash, does it double-check that the search actually matches. Because the haskhash is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the hash matches the search terms, does it double-check that the search actually matches. Because the hask is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search term's hash, does it double-check that the search actually matches. Because the hash is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").

Source Link
rolfl
  • 98.2k
  • 17
  • 220
  • 419

Karp-Rabin with bitwise hash

As a Rags-to-Riches re-implementation of the Karp-Rabin question here, from @tsleyson, I thought it would be interesting to understand the algorithm better, and to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the hash matches the search terms, does it double-check that the search actually matches. Because the hask is computed in \$O(1)\$ time for each advance through the text, it is essentially an \$O(n)\$ search algorithm with an additional \$O(m)\$ confirmation check for each potential match that is found. The best case is \$O(n)\$ for no matches, and worst case is \$O(nm)\$ if everything matches (for example, searching for "aa" in "aaaaaaaaaaaaa").

The description for the algorithm suggests using a 'base' prime number as the foundation for a hashing algorithm that allows you to 'rotate' an out-of-scope character off of the hash, and then add a new member to it. As you 'roll' the hash down the search text, if the hash matches the hash of your pattern, you can then double-check to see if it is just a collision, or a real match.

Instead of using a system requiring repeated multiplications by prime numbers, I decided to use a computationally simpler bit-shifting algorithm (though the logic may be more complicated, the computations should be simpler). In essence, the algorithm I use uses a long value (64-bit) as a hash, and it rotates characters on to the hash, and uses XOR bitwise logic to 'add' or 'remove' the characters.

I am looking for a review of the usability, and any performance suggestions.

import java.util.Arrays;

public class KarpRabin {

    private static final int[] EMPTY = new int[0];

    // Use a prime number as the shift, which means it takes a while to reuse
    // bit positions in the hash. Creates a good hash distribution in the
    // long bit mask.
    private static final int SHIFT = 31;
    // bits in the long hash value
    private static final int BITS = 64; 

    private final char[] patternChars, textChars;
    private final long patternHash;
    private final int size;
    private final int endPosition;
    private final int tailLeftShift, tailRightShift;

    private long textHash = 0L;
    private int pos = 0;

    /*
     * Private constructor, only accessible from public static entry methods.
     */
    private KarpRabin(String pattern, String text) {
        // inputs pre-validated.
        // inputs not null, not empty, and pattern is shorter than text.
        patternChars = pattern.toCharArray();
        textChars = text.toCharArray();
        size = this.patternChars.length;
        endPosition = this.textChars.length - size;

        // calculate the bit-shifts needed to remove the tail of the search
        // since the hash values are 'rotated' each time, we can count the
        // rotations that will be needed (size - 1) before a char is at the
        // tail, and that will be the relative left/right shifts needed.
        // note that left and right shifts perform a 'rotation' when combined,
        // essentially rotating the character by tailLeft bits to the left.
        tailLeftShift = ((size - 1) * SHIFT) % BITS;
        tailRightShift = BITS - tailLeftShift;

        // Compute the hash of the pattern
        // use a temp variable ph because patternHash is final
        long ph = 0L;
        for (char c : this.patternChars) {
            ph = addHash(ph, c);
        }
        patternHash = ph;

        // seed the hash for the first X chars in the text to search.
        textHash = 0L;
        for (int i = 0; i < size; i++) {
            textHash = addHash(textHash, this.textChars[i]);
        }

        // advance the search to the first 'hit' (if there is one).
        // check to make sure we are not already at a match first.
        if (textHash != patternHash || !confirmed()) {
            advance();
        }
    }

    /*
     * Shift the existing hash, and XOR in the new char.
     */
    private final long addHash(long base, char c) {
        return (base << SHIFT | base >>> (BITS - SHIFT)) ^ c;
    }

    /*
     * Shift the char to remove to the place it would be at the tail, and XOR it
     * off.
     */
    private final long removeHash(long base, char c) {
        long ch = c;
        // this is essentially a rotation in a 64-bit space.
        ch = ch << tailLeftShift | ch >>> tailRightShift;
        return base ^ ch;
    }

    /*
     * Return the next match, and advance the search if needed.
     */
    private int next() {
        if (pos > endPosition) {
            return -1;
        }
        int ret = pos;
        advance();
        return ret;
    }

    /*
     * move the position to the next 'hit', or the end of the search text.
     */
    private void advance() {
        while (++pos <= endPosition) {
            // remove the tail char from the hash
            textHash = removeHash(textHash, textChars[pos - 1]);
            // add in the new head char
            textHash = addHash(textHash, textChars[pos + size - 1]);
            // check to see if we have a match.
            if (textHash == patternHash && confirmed()) {
                return;
            }
        }

    }

    /*
     * Double-check a hash-hit. Assumes the hash computes are equal.
     */
    private boolean confirmed() {
        for (int i = 0; i < size; i++) {
            if (patternChars[i] != textChars[pos + i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Identify whether the pattern appears in the given text.
     * <p>
     * Null input values will return false, and an empty-string pattern will
     * return true (the empty string is a match in any other non-null string)
     * 
     * @param pattern
     *            the pattern to search for
     * @param text
     *            the text to search in
     * @return true if the pattern exists in the text.
     */
    public static boolean match(final String pattern, final String text) {
        if (pattern == null || text == null || pattern.length() > text.length()) {
            return false;
        }
        if (pattern.isEmpty()) {
            return true;
        }
        
        KarpRabin kr = new KarpRabin(pattern, text);
        return kr.next() >= 0;
    }

    /**
     * Identify all locations where the pattern exists in the search text.
     * <p>
     * Null input values will result in no matches, and the empty search pattern
     * will be located at all positions in the search text.
     * 
     * @param pattern
     *            the pattern to search for
     * @param text
     *            the text to search in
     * @return the integer positions of each found occurrence. An empty array
     *         will be returned if there are no matches.
     */
    public static int[] allMatches(String pattern, String text) {
        if (pattern == null || text == null || pattern.length() > text.length()) {
            return EMPTY;
        }
        if (pattern.isEmpty()) {
            // empty string is found everywhere
            int[] ret = new int[text.length()];
            for (int i = 0; i < ret.length; i++) {
                ret[i] = i;
            }
            return ret;
        }
        
        KarpRabin kr = new KarpRabin(pattern, text);
        // guess the size first, and expand/trim it as needed.
        int[] possibles = new int[text.length() / pattern.length()];
        int count = 0;
        int pos = 0;
        while ((pos = kr.next()) >= 0) {
            if (count >= possibles.length) {
                // expand the matches (a lot). This indicates overlapping results.
                // which is unusual.
                possibles = Arrays.copyOf(possibles, possibles.length * 2 + 2);
            }
            possibles[count] = pos;
            count++;
        }
        // trim the matches.
        return Arrays.copyOf(possibles, count);
    }
}

I have been testing it against some input data similar to @tsleyson's, and then added in a performance test too, that checks the scalability of a non-matching case.

private static final void scalabilityTests() {
    String input = "abcdefghijklmnopquestuvwxyz";
    long[] times = new long[9];
    for (int i = 0; i < times.length; i++) {
        int sum = 0;
        long nanos = System.nanoTime();
        for (int j = 0; j < 5000; j++) {
            sum += allMatches("abcdefghx", input).length;
        }
        times[i] = sum + System.nanoTime() - nanos;
        // double the times....
        input += input;
    }
    System.out.println("Scalability (should double each time): " + Arrays.toString(times));
}

public static void main(String[] args) {
    String[] inputs = { "", null, "abr", "abra", "abrabrabra", 
                        "abracadabrabracadabra", "cabrac", "cabracadabrabracadabrac",
                        "abbbbrrraaabbbccrrabbba" };
    for (String inp : inputs) {
        int[] matches = allMatches("abra", inp);
        System.out.printf("Matches %s %s at %s%n", inp, match("abra", inp), Arrays.toString(matches));
    }
    // run three times for warmups....
    scalabilityTests();
    scalabilityTests();
    scalabilityTests();
}