7

Let's say I have this list of words:

 String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};

Then I have some text:

 String text = "I would like to do a nice novel about nature AND people";

Is there a method somewhere out there that matches the stop words and removes them while ignoring case, like this?

 String noStopWordsText = remove(text, stopWords);

Result:

 " would like do nice novel nature people"

If you know of a regex that would work, great, but I would really prefer something like a Commons solution that is a bit more performance-oriented.

BTW, right now I'm using this Commons method, which lacks proper case-insensitive handling:

 private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
 private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};

 noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);     
2
  • Do you have punctuation in your strings? Commented Jan 22, 2011 at 17:23
  • Do you have some hard numbers that point to a regexp solution not being performant enough, or is that just premature optimization? I mean, it's definitely not the most performant solution, but unless this is all you do and you need to do it 10K times a second, I would bet it's not an issue. Commented Jan 22, 2011 at 18:54

4 Answers

17

Create a regular expression from your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string:

import java.util.regex.*;

Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");

The ... in the pattern is just me being lazy; continue the list of stop words there.

Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on each call, so it's not very efficient to use in loops. Also, String's replaceAll takes no flags argument, so the only way to make it case insensitive is to embed (?i) in the pattern itself.

Edit: I added \b around the pattern to make it match whole words only. I also added \s* so that it swallows any whitespace after the word; that may not be necessary.
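
Rather than typing the alternation by hand, the pattern can be assembled from the stop-word array once and then reused. The following is only a sketch under some assumptions: the class name StopWordRegex and the helpers compile and remove are made up for illustration, the stop-word list is assumed to be non-empty, and only a shortened list is used in main.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class StopWordRegex {

    // Builds one case-insensitive pattern of the form \b(?:w1|w2|...)\b\s*
    // Assumes stopWords is non-empty; an empty list would yield a pattern
    // that matches at word boundaries and strips whitespace.
    static Pattern compile(String... stopWords) {
        List<String> quoted = new ArrayList<>();
        for (String word : stopWords) {
            quoted.add(Pattern.quote(word)); // escape regex metacharacters, if any
        }
        String alternation = String.join("|", quoted);
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word match, together with the whitespace after it.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Compile once, reuse for every text.
        Pattern p = compile("i", "a", "and", "about", "to");
        System.out.println(
                remove("I would like to do a nice novel about nature AND people", p));
        // prints: would like do nice novel nature people
    }
}
```

Compiling the pattern once and keeping it in a static final field avoids the per-call compilation cost that String's replaceAll incurs.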


1 Comment

Yes, it should. I had an error in the regexp, \b needs to be \\b in Java, I forgot that. But now it should work.
5

You can build a regular expression that matches all the stop words (for example " a ", note the surrounding spaces) and end up with

str.replaceAll(regexpression,"");

OR

 String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
 String text = " I would like to do a nice novel about nature AND people ";

 for (String stopword : stopWords) {
     // (?i) makes each one-off pattern case insensitive
     text = text.replaceAll("(?i)" + stopword, " ");
 }
 System.out.println(text);

output:

 would like do nice novel nature people 

There might be a better way.

2 Comments

1) Doesn't handle the requirement that the method should be case insensitive. 2) doesn't remove stop words -- it would remove "no" in "novel".
Clever trick, didn't know that was possible. The only criticism I have is that replaceAll is really inefficient, it compiles a one-off regexp pattern, so using it in a loop is not great.
4

This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n) where n is the length of the text.

import java.util.HashSet;
import java.util.Set;

// the stop words must be stored in lower case, see toLowerCase() below
Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...

String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;

while (index < sampleText.length()) {
  // the only word delimiter supported is space, if you want other
  // delimiters you have to do a series of indexOf calls and see which
  // one gives the smallest index, or use regex
  int nextIndex = sampleText.indexOf(" ", index);
  if (nextIndex == -1) {
    // no more spaces: the last word runs to the end of the text
    nextIndex = sampleText.length();
  }
  String word = sampleText.substring(index, nextIndex);
  if (!stopWords.contains(word.toLowerCase())) {
    clean.append(word);
    if (nextIndex < sampleText.length()) {
      // this adds the word delimiter, e.g. the following space
      clean.append(sampleText.substring(nextIndex, nextIndex + 1));
    }
  }
  index = nextIndex + 1;
}

System.out.println("Stop words removed: " + clean.toString());

2 Comments

Very true, I changed the break into nextIndex = sampleText.length, which should solve that.
Ooops, that's actually what I tested, but I was sloppy when I changed the code. Thanks for pointing that out.
1

Split the text on whitespace. Then loop through the resulting array and append each word to a StringBuilder only if it is not one of the stop words.
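
A minimal sketch of that idea follows. The class name StopWordSplit and its remove method are hypothetical, the stop words are assumed to be stored in lower case, and punctuation is not handled (a token such as "and," would not count as a stop word). Runs of whitespace are also collapsed to a single space in the result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any library.
final class StopWordSplit {

    // Keeps only the words that are not in stopWords (compared in lower case).
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // split() yields one empty token for an empty input string
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (clean.length() > 0) {
                clean.append(' '); // single space between kept words
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords =
                new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        System.out.println(
                remove("I would like to do a nice novel about nature AND people", stopWords));
        // prints: would like do nice novel nature people
    }
}
```

The HashSet lookup is constant time on average, so the cost is dominated by the split itself.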
