Fastest way to perform a lot of strings replace in Java

Question

I have to write some sort of parser that takes a String and replaces certain sets of characters with others. The code looks like this:

noHTMLString = noHTMLString.replaceAll("</p>", "\n");
noHTMLString = noHTMLString.replaceAll("<br/>", "\n\n");
noHTMLString = noHTMLString.replaceAll("<br />", "\n\n");
//here goes A LOT of lines like these ones

The function is very long and performs many string replacements. The problem is that it takes a long time because the method is called very often, which slows down the application.

I have read some threads here about using StringBuilder as an alternative, but it lacks a replaceAll method. Also, as noted in "Does string.replaceAll() performance suffer from string immutability?", the replaceAll method in the String class works with Pattern and Matcher, and Matcher.replaceAll() uses a StringBuilder to store the eventually returned value, so I don't know if switching to StringBuilder will really reduce the time needed to perform the substitutions.

Do you know a fast way to perform a lot of String replacements? Do you have any advice for this problem?

Thanks.

EDIT: I have to create a report that has a few fields with HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. The problem is that I have to invoke the method very often.

What slows you down? The length of your noHTMLString text, or do you invoke these three statements very often?
– Ralph, Commented Nov 26, 2010 at 16:42
I have to create a report that has a few fields with HTML text. For each row I'm calling the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. So the problem is that I have to invoke the method very often.
– Averroes, Commented Nov 26, 2010 at 21:48
Does this answer your question? Java Replacing multiple different substring in a string at once (or in the most efficient way)
– Dave Jarvis, Commented Mar 7, 2023 at 17:29

Alin Gabriel Arhip · Accepted Answer · 2022-09-08 02:45:08Z

14

I found that org.apache.commons.lang.StringUtils is the fastest, if you don't want to bother with the StringBuffer.

You can use it like this:
noHTMLString = StringUtils.replace(noHTMLString, "</p>", "\n");

I did performance testing, and found this to be faster than my custom StringBuffer solution (similar to the one @extraneon proposed).
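To illustrate why a literal replace can beat String.replaceAll, here is a minimal plain-Java sketch of a non-regex replace built on indexOf and a StringBuilder. LiteralReplace is a hypothetical helper written for this example; it is not the Commons Lang source, only an illustration of the general idea of skipping the regex engine.

```java
// Hypothetical helper for illustration; not part of the JDK or Commons Lang.
final class LiteralReplace {

    /** Replaces every literal occurrence of target in text, without any regex. */
    static String replace(String text, String target, String replacement) {
        if (text == null || text.isEmpty() || target == null || target.isEmpty()) {
            return text;
        }
        int start = 0;
        int end = text.indexOf(target, start);
        if (end == -1) {
            return text; // nothing to replace, so nothing is allocated
        }
        StringBuilder out = new StringBuilder(text.length());
        while (end != -1) {
            // copy the text before the match, then the replacement
            out.append(text, start, end).append(replacement);
            start = end + target.length();
            end = text.indexOf(target, start);
        }
        out.append(text, start, text.length()); // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        String s = replace("one</p>two<br/>three", "</p>", "\n");
        s = replace(s, "<br/>", "\n\n");
        System.out.println(s);
    }
}
```

Note that String.replace(CharSequence, CharSequence) in the JDK also treats its target as a literal rather than a regex, so it is worth measuring as well before adding a dependency.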

edited Sep 8, 2022 at 2:45

Alin Gabriel Arhip

answered Nov 27, 2010 at 0:13

MatBanik

3 Comments

Averroes Over a year ago

That was indeed faster than the replaceAll from String.class. Thanks.

Vadzim Over a year ago

See Commons Lang StringUtils.replace performance vs String.replace with benchmark.

Dave Jarvis Over a year ago

For multiple strings, it's probably faster to use StringUtils.replaceEach, not that parsing HTML this way is a good idea.

Martijn Verburg · Accepted Answer · 2010-11-26 12:00:36Z

7

It looks like you're parsing HTML there. Have you thought about using a third-party library instead of reinventing the wheel?

answered Nov 26, 2010 at 12:00

Martijn Verburg

Allanrbo · Accepted Answer · 2010-11-26 12:31:03Z

4

I agree with Martijn about using a ready-built solution instead of parsing it yourself. There's loads of stuff built into Java in the javax.xml package. A neat solution would be to use an XSLT transformation for the replacements; this looks like an ideal use case for it. However, it is complicated.

To answer the question, have you considered using the regular expression libraries? It looks like you have many different things you want to match and replace with the same thing (\n or an empty string). Using regular expressions, you could use an expression like "<br>|<br/>|<br />", or an even more clever one like "<br.*?>", to create a Matcher object, on which you can call replaceAll.
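As a minimal sketch of that idea, the class below compiles one pattern up front and reuses it for every call. The class name BreakTags and the pattern "<br\s*/?>" are assumptions made for this example; the pattern is deliberately narrower than "<br.*?>", which would also match unrelated tags that merely start with "br".

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class BreakTags {

    // Compiled once and reused; matches <br>, <br/> and <br /> (assumed set of variants).
    private static final Pattern BR = Pattern.compile("<br\\s*/?>");

    /** Replaces every matched line-break tag with two newlines in a single pass. */
    static String replaceBreaks(String input) {
        return BR.matcher(input).replaceAll("\n\n");
    }

    public static void main(String[] args) {
        System.out.println(replaceBreaks("first<br>second<br/>third<br />fourth"));
    }
}
```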

edited Nov 26, 2010 at 12:31

answered Nov 26, 2010 at 12:25

Allanrbo

2 Comments

Adriaan Koster Over a year ago

You cannot parse HTML with regular expressions: stackoverflow.com/questions/1732348/…

Allanrbo Over a year ago

Adriaan, you are right, HTML is a context free language, not a regular language. But you can do text-replacements with regular expressions, and that was what was asked about.

extraneon · Accepted Answer · 2010-11-26 12:26:33Z

I fully agree with Martijn here. Pick the right tool for the job.

If your file however is not HTML, but only contains some HTML tokens there are a few ways you can speed things up.

First, if some amount of the input does not contain replaceable elements, consider starting with something like:

if (input.indexOf('<') == -1) {
    return input;
}

Second, consider a regex:

Pattern p = Pattern.compile( your_regex );

Don't make a pattern for every single replaceAll line; instead, try to combine them (regex has an OR operator) and let Pattern optimize the regex. Use the compiled pattern and don't recompile it on every call, because compilation is fairly expensive.
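One way this could look, as a sketch rather than a definitive implementation: build a single alternation from a table of tokens, compile it once, and look up the replacement for each match. The class name MultiReplace and its table are assumptions for this example; the three entries are the ones shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class MultiReplace {

    // Token -> replacement table; entries taken from the question.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
    }

    // One combined pattern, compiled once when the class is loaded.
    private static final Pattern TOKENS = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder alternation = new StringBuilder();
        for (String token : REPLACEMENTS.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(token)); // treat each token literally
        }
        return Pattern.compile(alternation.toString());
    }

    /** Scans the input once and substitutes each matched token. */
    static String replaceAllTokens(String input) {
        Matcher m = TOKENS.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            String replacement = REPLACEMENTS.get(m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The input is scanned once no matter how many tokens are in the table, instead of once per replaceAll line. StringBuffer is used because the appendReplacement overload that accepts it is available on all Java versions.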

If regexes are a bit too complex, you can also implement a faster (but potentially less readable) replacement engine yourself:

StringBuilder result = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
  char c = input.charAt(i);

  if (c != '<') {
    result.append(c); // ordinary character: copy it through
    continue;
  }

  int closePos = input.indexOf('>', i);
  if (closePos == -1) { // no closing '>': copy the rest unchanged
    result.append(input, i, input.length());
    return result.toString();
  }
  String token = input.substring(i + 1, closePos); // text between '<' and '>'
  i = closePos; // continue after the '>'
  if (token.equals("/p")) {
    result.append("\n");
  } else if (token.equals(...)) {
  } else if (...) {
  }
}
return result.toString();

This may have some errors :)

The advantage is you have to iterate through the input only once. The big disadvantage is that it is not all that easy to understand. You could also write a state machine, analyzing per character what the new state should be, and that would probably be faster and even more work.
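A rough sketch of the state-machine variant mentioned above, with two states (plain text, inside a tag). The class name, the replacement table, and the handling of edge cases are assumptions for this example: unknown tags are dropped, and an unterminated '<' is copied through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration.
final class TagStateMachine {

    private enum State { TEXT, TAG }

    // Tag content (without the angle brackets) -> replacement; assumed table.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("/p", "\n");
        REPLACEMENTS.put("br/", "\n\n");
        REPLACEMENTS.put("br /", "\n\n");
    }

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        StringBuilder tag = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        tag.setLength(0); // start collecting a new tag
                        state = State.TAG;
                    } else {
                        out.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        String replacement = REPLACEMENTS.get(tag.toString());
                        if (replacement != null) {
                            out.append(replacement);
                        } // unknown tags are dropped (assumption)
                        state = State.TEXT;
                    } else {
                        tag.append(c);
                    }
                    break;
            }
        }
        if (state == State.TAG) {
            out.append('<').append(tag); // unterminated tag: keep the text as is
        }
        return out.toString();
    }
}
```

Each character is examined exactly once and the lookup per tag is a single hash access, so adding more tags to the table does not add more passes over the input.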

You cannot parse HTML with regular expressions: stackoverflow.com/questions/1732348/…
@Adriaan Koster : That's not what I said. I said, if you have HTML use an HTML parser. If it's plain text with HTML tags in it (which isn't parseable by an HTML parser) try it the hard way.
@Adriaan: WRONG! Yes you can parse HTML with regex. However, you probably don’t want to unless you have constrained and limited HTML to work with, such as you yourself have generated. Otherwise although it is entirely possible to parse HTML with regexes, you really and truly do not want to.
A late nitpick: you cannot parse arbitrary HTML with a single regex, because regexes cannot recognize arbitrary depth recursive nesting. You can certainly perform lexical analysis (i.e. tokenize) of arbitrary HTML with one or more regexes, just as you may be able to recognize interesting parts of an HTML file.
