13

I have to write some sort of parser that takes a String and replaces certain sets of characters with others. The code looks like this:

noHTMLString = noHTMLString.replaceAll("</p>", "\n");
noHTMLString = noHTMLString.replaceAll("<br/>", "\n\n");
noHTMLString = noHTMLString.replaceAll("<br />", "\n\n");
//here goes A LOT of lines like these ones

The function is very long and performs a lot of string replacements. The issue is that it takes a long time because the method is called many times, which slows down the application.

I have read some threads here about using StringBuilder as an alternative, but it lacks a replaceAll method. Also, as noted in "Does string.replaceAll() performance suffer from string immutability?", the replaceAll method in the String class works with Pattern and Matcher, and Matcher.replaceAll() already uses a StringBuilder to store the eventually returned value, so I don't know whether switching to StringBuilder would really reduce the time needed to perform the substitutions.

Do you know a fast way to perform a lot of String replacements? Do you have any advice for this problem?

Thanks.

EDIT: I have to create a report that has a few fields with HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. The problem is that I have to invoke the method very often.

4
  • What slows you down? The length of your noHTMLString text, or do you invoke these three statements very often? Commented Nov 26, 2010 at 16:42
  • I have to create a report that has a few fields with HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. So the problem is that I have to invoke the method very often. Commented Nov 26, 2010 at 21:48
  • See also: stackoverflow.com/a/1765616/59087 Commented Nov 26, 2016 at 23:47
  • Does this answer your question? Java Replacing multiple different substring in a string at once (or in the most efficient way) Commented Mar 7, 2023 at 17:29

4 Answers 4

14

I found that org.apache.commons.lang.StringUtils is the fastest, if you don't want to bother with the StringBuffer.

You can use it like this:
noHTMLString = StringUtils.replace(noHTMLString, "</p>", "\n");

I did performance testing, and found this to be faster than my custom StringBuffer solution (similar to the one @extraneon proposed).
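One likely reason a literal replace beats String.replaceAll is that replaceAll treats its first argument as a regular expression and compiles it on every call, while a literal replace only needs indexOf. The following is a minimal sketch of that idea using only the JDK. The class name LiteralReplace is hypothetical, and this is not the Commons Lang implementation, just an illustration of the approach.

```java
// Hypothetical helper for illustration; NOT the Commons Lang source.
final class LiteralReplace {
    private LiteralReplace() {}

    /** Replaces every occurrence of target with replacement, treating both as plain text. */
    static String replace(String text, String target, String replacement) {
        if (target.isEmpty()) {
            return text; // an empty target would never advance the search
        }
        int found = text.indexOf(target);
        if (found == -1) {
            return text; // nothing to replace, so no new String is built
        }
        StringBuilder out = new StringBuilder(text.length());
        int start = 0;
        while (found != -1) {
            // copy the text before the match, then the replacement
            out.append(text, start, found).append(replacement);
            start = found + target.length();
            found = text.indexOf(target, start);
        }
        out.append(text, start, text.length()); // copy the remainder
        return out.toString();
    }
}
```

It would be called the same way as in the answer, for example noHTMLString = LiteralReplace.replace(noHTMLString, "</p>", "\n"). The JDK's own String.replace(CharSequence, CharSequence) also treats its arguments as literal text, so it is worth measuring as well.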


3 Comments

That was indeed faster than the replaceAll from String.class. Thanks.
For multiple strings, it's probably faster to use StringUtils.replaceEach, not that parsing HTML this way is a good idea.
7

It looks like you're parsing HTML there. Have you thought about using a third-party library instead of reinventing the wheel?

Comments

4

I agree with Martijn in using a ready-built solution instead of parsing it yourself - there's loads of stuff built into Java in the javax.xml package. A neat solution would be to use XSLT transformation to replace, this looks like an ideal use case for it. However, it is complicated.

To answer the question, have you considered using the regular expression libraries? It looks like you have many different things you want to match and replace with the same thing (\n or an empty string). Using regular expressions you could build an expression like "<br>|<br/>|<br />", or even something more clever like "<br.*?>", to create a matcher object on which you can call replaceAll.
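As a rough illustration of that suggestion, here is a minimal sketch that folds the three <br> variants into one precompiled pattern. The class and method names are hypothetical. The pattern "<br\\s*/?>" is an assumption on my part: it is slightly stricter than "<br.*?>", which would also match any other tag whose name starts with "br".

```java
import java.util.regex.Pattern;

// Hypothetical class for illustration only.
final class BreakTags {
    private BreakTags() {}

    // Matches <br>, <br/> and <br /> in any letter case; compiled once and reused.
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    /** Replaces every line-break tag with two newlines, as in the question. */
    static String toNewlines(String html) {
        return BR.matcher(html).replaceAll("\n\n");
    }
}
```

One call to BreakTags.toNewlines(noHTMLString) then stands in for the separate replaceAll lines that handle <br/> and <br />.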

2 Comments

You cannot parse HTML with regular expressions: stackoverflow.com/questions/1732348/…
Adriaan, you are right, HTML is a context free language, not a regular language. But you can do text-replacements with regular expressions, and that was what was asked about.
3

I fully agree with Martijn here. Pick the right tool for the job.

If your file however is not HTML, but only contains some HTML tokens there are a few ways you can speed things up.

First, if some amount of the input does not contain replaceable elements, consider starting with something like:

if (input.indexOf('<') == -1) {
    return input;
}

Second, consider a regex:

Pattern p = Pattern.compile( your_regex );

Don't make a pattern for every single replaceAll line, but try to combine them (regex has an OR operator) and let Pattern optimize the regex. Reuse the compiled pattern instead of compiling it on every call, because compilation is fairly expensive.
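A minimal sketch of that combination, assuming the three replacements from the question: a single alternation is compiled once, and each match is looked up in a map to find its replacement. The names TagReplacer, REPLACEMENTS and TOKENS are hypothetical, and the map would have to be extended with the remaining tokens.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration only.
final class TagReplacer {
    private TagReplacer() {}

    // Token -> replacement; every alternative of TOKENS must have an entry here.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
    }

    // One alternation covering all tokens, compiled a single time.
    private static final Pattern TOKENS = Pattern.compile("</p>|<br ?/>");

    /** Replaces all known tokens in a single pass over the input. */
    static String replaceTokens(String input) {
        Matcher m = TOKENS.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            String replacement = REPLACEMENTS.get(m.group());
            // quoteReplacement keeps '$' and '\' in a replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

StringBuffer is used because the appendReplacement overload that accepts a StringBuilder only exists from Java 9 onward.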

If regexes are a bit too complex, you can also implement a faster (but potentially less readable) replacement engine yourself:

StringBuilder result = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
  char c = input.charAt(i);

  if (c != '<') {
    result.append(c); // ordinary character: copy it through
    continue;
  }

  int closePos = input.indexOf('>', i);
  if (closePos == -1) { // no closing '>': copy the rest verbatim
    result.append(input, i, input.length());
    return result.toString();
  }
  String token = input.substring(i + 1, closePos); // text between '<' and '>'
  i = closePos; // the loop increment then moves past the '>'
  if (token.equals("/p")) {
    result.append("\n");
  } else if (token.equals(...)) {
  } else if (...) {
  }
}
return result.toString();

This may have some errors :)

The advantage is you have to iterate through the input only once. The big disadvantage is that it is not all that easy to understand. You could also write a state machine, analyzing per character what the new state should be, and that would probably be faster and even more work.
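For the state machine variant, a minimal sketch could look like the following. It uses two states, one for plain text and one for the inside of a tag. The class name, the set of recognized tags, and the choice to drop unknown tags are all assumptions made for illustration. No claim is made here about its speed relative to the loop above; that would have to be measured.

```java
// Hypothetical two-state scanner for illustration only.
final class TagStateMachine {
    private TagStateMachine() {}

    private enum State { TEXT, IN_TAG }

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        StringBuilder tag = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        tag.setLength(0); // start collecting a new tag
                        state = State.IN_TAG;
                    } else {
                        out.append(c);
                    }
                    break;
                case IN_TAG:
                    if (c == '>') {
                        out.append(replacementFor(tag.toString()));
                        state = State.TEXT;
                    } else {
                        tag.append(c);
                    }
                    break;
            }
        }
        if (state == State.IN_TAG) {
            out.append('<').append(tag); // unterminated tag: keep it verbatim
        }
        return out.toString();
    }

    // Assumption: tags that are not listed here are simply dropped.
    private static String replacementFor(String tag) {
        switch (tag) {
            case "/p":
                return "\n";
            case "br/":
            case "br /":
                return "\n\n";
            default:
                return "";
        }
    }
}
```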

4 Comments

You cannot parse HTML with regular expressions: stackoverflow.com/questions/1732348/…
@Adriaan Koster : That's not what I said. I said, if you have HTML use an HTML parser. If it's plain text with HTML tags in it (which isn't parseable by an HTML parser) try it the hard way.
@Adriaan: WRONG! Yes you can parse HTML with regex. However, you probably don’t want to unless you have constrained and limited HTML to work with, such as you yourself have generated. Otherwise although it is entirely possible to parse HTML with regexes, you really and truly do not want to.
A late nitpick: you cannot parse arbitrary HTML with a single regex, because regexes cannot recognize arbitrary depth recursive nesting. You can certainly perform lexical analysis (i.e. tokenize) of arbitrary HTML with one or more regexes, just as you may be able to recognize interesting parts of an HTML file.
