Question
How can I use Jsoup.clean without automatically adding HTML entities to my output?
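// Sample Jsoup.clean usage with an HTML string.
// Jsoup.clean has no overload that takes a NodeVisitor; escaping is configured
// through Document.OutputSettings (org.jsoup.nodes.Document, org.jsoup.nodes.Entities).
Document.OutputSettings settings = new Document.OutputSettings()
        .escapeMode(Entities.EscapeMode.xhtml) // named entities limited to lt, gt, amp, quot
        .charset("UTF-8")                      // non-ASCII characters are written as-is
        .prettyPrint(false);                   // no added indentation or line breaks
String cleanedHtml = Jsoup.clean(htmlContent, "", Whitelist.basic(), settings);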
Answer
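jsoup is a Java library for working with real-world HTML. Jsoup.clean sanitizes the input against a whitelist and then serializes the result back to HTML, and it is this serialization step that writes some characters as entities. How much gets escaped is governed by Document.OutputSettings (the escape mode and the output charset), not by the whitelist, so passing suitable output settings keeps most text exactly as written. The characters &, < and > in text are still escaped, because the output would otherwise not be valid HTML. The example below shows the approach.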
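import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

public class JsoupExample {
    public static void main(String[] args) {
        String htmlContent = "<p>This is a sample text with special characters: ©, ™, ∞</p>";
        // Output settings that keep non-ASCII characters as literal text
        Document.OutputSettings settings = new Document.OutputSettings()
                .escapeMode(Entities.EscapeMode.xhtml) // named entities limited to lt, gt, amp, quot
                .charset("UTF-8")                      // every Unicode character is encodable
                .prettyPrint(false);                   // no added indentation or line breaks
        // Clean the HTML; simpleText() keeps only b, em, i, strong and u, so <p> is dropped
        String cleanedHTML = Jsoup.clean(htmlContent, "", Whitelist.simpleText(), settings);
        System.out.println(cleanedHTML);
    }
}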
Causes
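- Jsoup.clean returns an HTML fragment, so its serializer always escapes &, < and > in text.
- Other characters are escaped depending on the escape mode and the output charset: any character the output charset cannot encode is written as an entity, and some older jsoup releases may also emit named entities such as &copy; under the default escape mode.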
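The influence of the output charset can be observed directly. The following is a minimal sketch, not the only way to inspect this: it serializes a parsed body fragment, which is the same serialization step Jsoup.clean performs after sanitizing. The class name CharsetEscapeDemo and the helper serialize are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class CharsetEscapeDemo {
    // Hypothetical helper: parse a fragment and serialize its body
    // using the given output charset.
    static String serialize(String html, String charsetName) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings()
           .escapeMode(Entities.EscapeMode.xhtml) // named entities limited to lt, gt, amp, quot
           .charset(charsetName)                  // decides which characters can be written as-is
           .prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p>© 2024 Tom & Jerry</p>";
        // US-ASCII cannot encode ©, so it is written as a numeric entity
        System.out.println(serialize(html, "US-ASCII"));
        // UTF-8 can encode ©, so it stays literal; & is escaped in both cases
        System.out.println(serialize(html, "UTF-8"));
    }
}
```

With a UTF-8 output charset only the characters that HTML itself requires to be escaped should remain as entities.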
Solutions
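- Pass a Document.OutputSettings to Jsoup.clean with the escape mode set to Entities.EscapeMode.xhtml and the charset set to UTF-8. The Whitelist still decides which tags and attributes are kept, but it has no effect on escaping.
- If you need plain text rather than an HTML fragment, decode the cleaned result with Parser.unescapeEntities, or take the text of the parsed document. Decoded text is no longer safe to insert into an HTML page without escaping it again.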
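When the goal is plain text with no entities at all, rather than a sanitized HTML fragment, decoding is an alternative to tuning the serializer. The sketch below shows two options using jsoup's Parser.unescapeEntities and Element.text(). The class and method names (PlainTextDemo, toPlainText, decodeEntities) are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

public class PlainTextDemo {
    // Drop all markup and return the decoded text content.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    // Decode entities in an already-cleaned string; 'false' means the
    // string is treated as text content, not as an attribute value.
    static String decodeEntities(String escaped) {
        return Parser.unescapeEntities(escaped, false);
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Tom &amp; Jerry &copy; 2024</p>"));
        System.out.println(decodeEntities("Tom &amp; Jerry &copy; 2024"));
    }
}
```

Both calls produce text containing a literal & and ©. Such text must be escaped again before it is placed back into an HTML page.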
Common Mistakes
Mistake: Expecting the Whitelist to control entity conversion.
Solution: Choose the Whitelist for the tags and attributes your content needs, and configure escaping separately through Document.OutputSettings.
Mistake: Ignoring the effects of encoding on special characters.
Solution: Use UTF-8 consistently: when reading the input, as the jsoup output charset, and when writing or serving the result.
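The effect of an inconsistent encoding can be seen without jsoup at all. The following sketch uses only the JDK: it encodes a string to bytes and decodes it again, once with UTF-8 and once with US-ASCII. The class name Utf8RoundTrip and the helper roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode the text to bytes and decode it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Copyright sign, trademark sign and infinity, written as Unicode
        // escapes so the source file encoding does not matter.
        String text = "\u00A9, \u2122, \u221E";

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));

        // US-ASCII cannot, so each symbol is replaced with '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));
    }
}
```

A lossy step like the second one anywhere in the pipeline damages special characters regardless of how Jsoup.clean is configured.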