How to Use Jsoup.clean Without Adding HTML Entities

Question

How can I use Jsoup.clean without automatically adding HTML entities to my output?

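// Sample Jsoup.clean usage with an HTML string.
// Jsoup.clean has no overload that accepts a NodeVisitor; output escaping is
// configured through Document.OutputSettings instead.
// Whitelist is named Safelist in newer jsoup releases.
Document.OutputSettings settings = new Document.OutputSettings()
        .escapeMode(Entities.EscapeMode.xhtml) // escape only a minimal set such as &, < and >
        .charset("UTF-8");                     // lets non-ASCII characters be written literally
String cleanedHtml = Jsoup.clean(htmlContent, "", Whitelist.basic(), settings);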

Answer

Jsoup is a Java library for working with real-world HTML. Jsoup.clean serializes its result as HTML, so it can write some characters as entities (for example, © as &copy;), depending on the escape mode, the output charset, and the jsoup version. To keep such characters as they are, pass a Document.OutputSettings object to the four-argument overload Jsoup.clean(bodyHtml, baseUri, whitelist, outputSettings).

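import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Whitelist; // named Safelist in newer jsoup releases

public class JsoupExample {
    public static void main(String[] args) {
        String htmlContent = "<p>This is a sample text with special characters: ©, ™, ∞</p>";
        // xhtml escape mode keeps escaping to a minimal set (such as &, < and >);
        // a UTF-8 charset lets ©, ™ and ∞ be written as literal characters.
        Document.OutputSettings settings = new Document.OutputSettings()
                .escapeMode(Entities.EscapeMode.xhtml)
                .charset("UTF-8")
                .prettyPrint(false);
        // Whitelist.simpleText() drops <p> and keeps only simple text formatting tags
        String cleanedHTML = Jsoup.clean(htmlContent, "", Whitelist.simpleText(), settings);
        System.out.println(cleanedHTML);
    }
}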

Causes

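Jsoup.clean serializes its result as HTML, and the serializer escapes characters according to the Document.OutputSettings in effect.
Characters such as &, < and > are always escaped. Other symbols and non-ASCII characters may be written as entities depending on the escape mode, the output charset, and the jsoup version.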
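
The role of the output charset can be illustrated with a small sketch that uses only the Java standard library. It is not jsoup's actual implementation; the class EntitySketch and the method escapeUnencodable are hypothetical names chosen for this illustration. The idea it shows is that a character only needs an entity when the target charset cannot encode it.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of charset-dependent escaping; not jsoup's code.
class EntitySketch {

    // Keeps every code point the charset can encode and writes the rest
    // as hexadecimal numeric entities.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for the copyright, trademark and infinity signs
        String text = "\u00A9 \u2122 \u221E";
        // UTF-8 can encode all three, so the text is returned unchanged
        System.out.println(escapeUnencodable(text, StandardCharsets.UTF_8));
        // US-ASCII cannot, so this prints: &#xa9; &#x2122; &#x221e;
        System.out.println(escapeUnencodable(text, StandardCharsets.US_ASCII));
    }
}
```

This is why choosing UTF-8 as the output charset matters: with a narrower charset, entities are the only way to represent such characters at all.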

Solutions

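Pass a Document.OutputSettings with Entities.EscapeMode.xhtml and a UTF-8 charset to Jsoup.clean so that only a minimal set of characters is escaped. The Whitelist decides which tags and attributes survive; it does not control entity conversion.
If you need plain text rather than HTML, decode the remaining entities after cleaning, for example with Parser.unescapeEntities(cleaned, false) or Jsoup.parse(cleaned).text().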

Common Mistakes

Mistake: Expecting the Whitelist to prevent entity conversion.

Solution: Choose the Whitelist for the tags and attributes your content needs, and control escaping separately through Document.OutputSettings.

Mistake: Ignoring the effects of encoding on special characters.

Solution: Use UTF-8 consistently throughout your application: for the jsoup output charset, for files and HTTP responses, and for console or log output.
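
As a rough illustration of why consistency matters, the following standard-library sketch encodes a string as UTF-8 and then decodes it once with the same charset and once with a different one. The class Utf8Sketch and its methods are hypothetical names used only for this example; no jsoup API is involved.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of matched versus mismatched charsets.
class Utf8Sketch {

    // Encodes to UTF-8 bytes and decodes with the same charset,
    // so the text survives intact.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes as ISO-8859-1, which garbles non-ASCII characters.
    static String mismatched(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes for the copyright, trademark and infinity signs
        String cleaned = "\u00A9 \u2122 \u221E";
        // Give the console stream an explicit charset instead of the platform default
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(roundTrip(cleaned));
        out.println(mismatched(cleaned));
    }
}
```

Garbled symbols in cleaned output are often an encoding mismatch like the second call rather than an escaping problem, so it is worth ruling this out before changing the jsoup settings.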
