How to Remove or Replace 4-Byte UTF-8 Characters from a String in Java?

Question

How can I replace or remove characters in a UTF-8 string that are represented as 4 bytes in Java?

String input = "example string with some 🥳 emojis";

Answer

UTF-8 encodes each Unicode code point in 1 to 4 bytes. The characters that need 4 bytes are the ones outside the Basic Multilingual Plane (U+10000 to U+10FFFF), which covers most emoji along with less common symbols and scripts. The snippet below uses a regular expression over that code point range to either remove these characters from a Java String or replace them with a placeholder.

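// Removing 4-byte characters (code points U+10000 to U+10FFFF).
// \x{...} is java.util.regex syntax for a code point; the backslash is
// doubled so the Java compiler passes it through to the regex engine.
String input = "example string with some 🥳 emojis";
String cleaned = input.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
System.out.println(cleaned); // Output: example string with some  emojis

// Replacing 4-byte characters with a placeholder
String replaced = input.replaceAll("[\\x{10000}-\\x{10FFFF}]", "?");
System.out.println(replaced); // Output: example string with some ? emojis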

Causes

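Characters that take 4 bytes in UTF-8 are the Unicode code points above U+FFFF (U+10000 to U+10FFFF), also called supplementary characters.
Java strings use UTF-16 internally, where each of these code points is always stored as a surrogate pair, that is, two char values.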
Common situations for needing to remove or replace these characters include sanitizing user input or cleaning up string data in applications.
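
The relationship between UTF-8 bytes, UTF-16 char values, and code points described above can be checked directly. The following is a minimal sketch; the class name SupplementaryCharDemo and the helper utf8Length are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class SupplementaryCharDemo {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1F973 (the emoji from the question), built from its code point
        String emoji = new String(Character.toChars(0x1F973));

        System.out.println(utf8Length(emoji));                       // 4 bytes in UTF-8
        System.out.println(emoji.length());                          // 2 char values (surrogate pair)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
        System.out.println(Character.isSupplementaryCodePoint(emoji.codePointAt(0))); // true

        // For comparison, a BMP character: U+20AC (euro sign) needs only 3 bytes
        System.out.println(utf8Length("\u20AC"));                    // 3
    }
}
```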

Solutions

Use a regular expression that matches every code point outside the Basic Multilingual Plane (BMP), that is, U+10000 to U+10FFFF.
Replace each match with an empty string to remove it, or with a placeholder such as a question mark to keep a visible marker. The snippet at the top of this answer shows both.
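
If the same cleanup runs in many places, the two solutions above could be wrapped in a small helper with a precompiled pattern. This is only a sketch: the class FourByteCharFilter and its method names are made up for illustration. Matcher.quoteReplacement is used so that a placeholder containing $ or a backslash is inserted literally instead of being read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative only.
final class FourByteCharFilter {

    // Matches any single code point outside the BMP (U+10000 to U+10FFFF).
    private static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    private FourByteCharFilter() {}

    // Drops every supplementary character.
    static String remove(String input) {
        return SUPPLEMENTARY.matcher(input).replaceAll("");
    }

    // Substitutes every supplementary character with the given placeholder.
    static String replace(String input, String placeholder) {
        // quoteReplacement keeps '$' and '\' in the placeholder literal.
        return SUPPLEMENTARY.matcher(input)
                .replaceAll(Matcher.quoteReplacement(placeholder));
    }
}
```

For example, FourByteCharFilter.replace(input, "?") gives the same result as the replaceAll call shown earlier, without recompiling the pattern on every call.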

Common Mistakes

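Mistake: Using a regex pattern that does not compile or does not cover every 4-byte character.

Solution: Match the full supplementary range, written in a Java string literal as "[\\x{10000}-\\x{10FFFF}]". The form \u{10000} is not valid in Java source, because a \u escape must be followed by exactly four hex digits.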
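
One way to confirm that a pattern really caught everything is to encode the result as UTF-8 and look for the lead byte of a 4-byte sequence, which always has the bit pattern 11110xxx. The sketch below assumes that check is acceptable for your data; Utf8FourByteCheck and hasFourByteSequence are hypothetical names. Note that getBytes substitutes unpaired surrogates with '?', so malformed input is not reported here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical checker; class and method names are illustrative only.
final class Utf8FourByteCheck {

    private Utf8FourByteCheck() {}

    // Returns true if the UTF-8 encoding of s contains a 4-byte sequence.
    static boolean hasFourByteSequence(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Lead byte of a 4-byte sequence: top five bits are 11110.
            if ((b & 0xF8) == 0xF0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String input = "some " + new String(Character.toChars(0x1F973)) + " text";
        String cleaned = input.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
        System.out.println(hasFourByteSequence(input));   // true
        System.out.println(hasFourByteSequence(cleaned)); // false
    }
}
```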

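Mistake: Processing the string one char at a time, which can split a surrogate pair and leave half of it behind as an invalid lone surrogate.

Solution: Work with code points rather than char values, using built-in methods such as String.codePointAt, String.codePoints, Character.isSupplementaryCodePoint and Character.charCount, so that both halves of a surrogate pair are always handled together.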
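
As a regex-free alternative, the string can be walked by code point so that a surrogate pair is never split. This is a minimal sketch, not the only way to do it; CodePointFilter and replaceSupplementary are hypothetical names. Unpaired surrogates are BMP values on their own, so this version copies them through unchanged.

```java
// Hypothetical helper; class and method names are illustrative only.
final class CodePointFilter {

    private CodePointFilter() {}

    // Replaces every supplementary code point with the placeholder.
    // Pass an empty placeholder to remove the characters instead.
    static String replaceSupplementary(String input, String placeholder) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);      // reads a full surrogate pair if present
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(placeholder);        // 4 bytes in UTF-8: substitute
            } else {
                out.appendCodePoint(cp);        // 1 to 3 bytes in UTF-8: keep
            }
            i += Character.charCount(cp);       // advance by 1 or 2 char values
        }
        return out.toString();
    }
}
```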
