I've built a method that extracts data from an HTML document using the XPath components of Saxon-HE. I'm using the W3C DOM object model for this.

I have already created a method that returns the text value of a node, similar to the text method in jsoup (jsoupElement.text()):

    protected String getNodeValue(Node node) {
        NodeList childNodes = node.getChildNodes();
        for (int x = 0; x < childNodes.getLength(); x++) {
            Node data = childNodes.item(x);
            if (data.getNodeType() == Node.TEXT_NODE)
                return data.getNodeValue();
        }
        return "";
    }

This works fine, but now I need the underlying HTML of a selected node (with jsoup it would be jsoupElement.html()). With the W3C DOM object model I have an org.w3c.dom.Node. How can I get the HTML of an org.w3c.dom.Node as a String? I couldn't find anything about this in the documentation.

Just for clarification: I need the inner HTML (with or without the node's own element/tag) as a String, similar to http://api.jquery.com/html/ or http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#html--

  • In Java, you can serialize the child nodes, or the node itself, using LSSerializer or a default Transformer. However, either will give you an XML serialization of the DOM tree, not the original XML or HTML. Commented Nov 10, 2015 at 17:00
  • Thanks for your answer. Is it possible to get the original HTML using another document object model? I can choose between these models: saxonica.com/documentation/index.html#!xpath-api/jaxp-xpath/… Commented Nov 10, 2015 at 17:18
  • I don't think there is any way to get the original HTML from any tree model: it stores nodes, not markup. I am not familiar with jsoup, but it most likely serializes its tree as well when you call that method for the inner HTML, only to HTML rather than XML. Saxon, as an XSLT 2 processor, supports HTML and XHTML serialization with an XSLT stylesheet that has the right output method (i.e. <xsl:output method="html"/> or <xsl:output method="xhtml"/>), so you could use a Transformer with a stylesheet setting the method as needed. Perhaps the API offers some way as well. Commented Nov 10, 2015 at 17:34
  • I can't use serialized HTML output because it may differ from the original HTML, and further extraction steps (regex) may not match the serialized HTML. So I have to look for another solution or change the way I do the extraction. Thank you very much! Commented Nov 10, 2015 at 17:54
  • The "original HTML" simply isn't there to be extracted: it can only be reconstructed, and there will always be minor variations, e.g. re-ordering of attributes. Commented Nov 10, 2015 at 23:40

1 Answer

To serialize a W3C DOM Node's child nodes to HTML with Saxon you can use a default Transformer where you set the output method to html:

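import java.io.StringWriter;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static String getInnerHTML(Node node) throws TransformerConfigurationException, TransformerException
{
    StringWriter sw = new StringWriter();
    Result result = new StreamResult(sw);
    TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
    // A transformer created without a stylesheet performs an identity transformation
    Transformer proc = factory.newTransformer();
    proc.setOutputProperty(OutputKeys.METHOD, "html");
    // Serialize each child in document order into the same writer
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++)
    {
        proc.transform(new DOMSource(children.item(i)), result);
    }
    return sw.toString();
}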

But, as said in the comments, this is a serialization of the tree. The original XML or HTML is not stored in a DOM tree or in Saxon's tree model, so there is no way to access it.
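To make that point concrete, here is a minimal, self-contained sketch of the same idea. It is an illustration under assumptions, not part of the original answer: it uses whatever TransformerFactory the JDK provides via TransformerFactory.newInstance() instead of Saxon's factory, and the class name InnerHtmlDemo, the helper parse and the sample markup are made up for the example. The input uses single-quoted attributes; the re-serialized output will generally not be character-for-character identical to it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class InnerHtmlDemo {

    // Parse a well-formed markup string into a W3C DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serialize the children of 'node' with the "html" output method.
    // Assumption: uses the default JAXP factory rather than Saxon's.
    static String innerHtml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter sw = new StringWriter();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each child is serialized in turn and appended to the same writer
            t.transform(new DOMSource(children.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "<div><p class='a'>Hi</p></div>";
        Node div = parse(original).getDocumentElement();
        System.out.println("original:      " + original);
        System.out.println("re-serialized: " + innerHtml(div));
    }
}
```

Because the tree keeps only the attribute name and value, details such as the quote style of the source markup are not preserved, which is why a regex written against the original text may fail on the serialized string.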

1 Comment

needs more likes, helped me a lot
