3

I want to count the existence of words in textToBeTested array from expList.

Note that both expList and textToBeTested arrays can very large.

I can simply traverse both lists and use ".matches" method to count, but it is in O(n^2).

Is there any faster algorithm or implementation I can use?

    String[] expList = {"i", "i'd", "i'll", "i'm", "i'm", "bet[a-zA-Z]*", "my[a-zA-Z]*"};
    String[] textToBeTested = {"this", "is", "better", "than", "my", "method"};

e.g. in the above textToBeTested array, "better" and "my" matches with a string in expList arrays, so it will return 2.

Thank you very much for any help.

1 Answer 1

4

What about compiling all your patterns into a larger pattern that uses alternation? Alternation can be fast (like Aho Corasick or KMP) if it's compiled into a state machine correctly.

boolean first = true;
StringBuilder sb = new StringBuilder();
for (String s : expList) {
    sp.append("(?:").append(Pattern.quote(s)).append(')');
    if (!first) {
        sb.append('|');
    }
    first = false;
}

Pattern pattern = Pattern.compile(sb.toString());

// Possibly make this a ForkJoinTask
int count = 0;
for (String s : textToBeTested) {
    if (pattern.matcher(s).matches()) {
        count++;
    }
}
Sign up to request clarification or add additional context in comments.

2 Comments

@user3286982 Yup. What you're basically asking for is Aho Corasick generalized to a regex. This is basically how you do that. Now, I'm curious how the performance compares.
The performance improved magically. Your method reduced the problem from O(n^2) to O(n) which contributes to the burst in performance. Thank you very much for the help.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.