Processing CDATA from XML via DOM parser

Question

I've never processed XML before, so I'm not sure how to handle CDATA within an XML file. I'm getting lost in nodes, parents, child nodes, nList, etc.

Can anyone tell me what my problem is from these snippets of code?

My getTagValue() method works on all tags except "Details", which is the one that contains CDATA.

.....
NodeList nList = doc.getElementsByTagName("Assignment");
for (int temp = 0; temp < nList.getLength(); temp++) {
    Node nNode = nList.item(temp);
    if (nNode.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) nNode;
        results = ("Class : " + getTagValue("ClassName", eElement)) + 
                  ("Period : " + getTagValue("Period", eElement)) +
                  ("Assignment : " + getTagValue("Details", eElement));
        myAssignments.add(results);
    }
}
.....
private String getTagValue(String sTag, Element eElement) {
    NodeList nlList = eElement.getElementsByTagName(sTag).item(0).getChildNodes();

    Node nValue = (Node) nlList.item(0);
    if((CharacterData)nValue instanceof CharacterData)
    {
        return ((CharacterData) nValue).getData();
    }
    return nValue.getNodeValue();
}

Aside from Bogdan's excellent explanation, if you can use XOM, dom4j, etc., you'll probably be better off.
– Spencer Kormos, Commented Apr 7, 2012 at 20:01

Bogdan · Accepted Answer · 2012-06-29 07:29:42Z

I suspect your problem is in the following line of the getTagValue method:

Node nValue = (Node) nlList.item(0);

You always take only the first child, but the element may have more than one.

The following example has 3 children: text node "detail ", CDATA node "with cdata" and text node " here":

<Details>detail <![CDATA[with cdata]]> here</Details>

If you run your code, you get only "detail "; you lose the rest.

The following example has 1 child: a CDATA node "detail with cdata here":

<Details><![CDATA[detail with cdata here]]></Details>

If you run your code, you get everything.

But the same example as above written this way:

<Details>
   <![CDATA[detail with cdata here]]>
</Details>

now has 3 children, because the whitespace and line feeds around the CDATA section are picked up as text nodes. If you run your code, you get only the first text node, which contains nothing but a line feed and spaces; you lose the rest.
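To check these child counts yourself, here is a minimal sketch that parses the three layouts above with a default (non-coalescing) factory and lists the kind of each child. The class and helper names (ChildNodeDemo, childrenOf, kind) are made up for illustration; only the javax.xml.parsers and org.w3c.dom calls are standard JDK API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question.
public class ChildNodeDemo {

    // Parses the XML string with a default factory (coalescing is off)
    // and returns the child nodes of the root element.
    static NodeList childrenOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getChildNodes();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    // Maps a node to a short label based on its DOM node type.
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.CDATA_SECTION_NODE: return "CDATA";
            case Node.TEXT_NODE:          return "text";
            default:                      return "other";
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<Details>detail <![CDATA[with cdata]]> here</Details>",
            "<Details><![CDATA[detail with cdata here]]></Details>",
            "<Details>\n   <![CDATA[detail with cdata here]]>\n</Details>"
        };
        for (String xml : samples) {
            NodeList children = childrenOf(xml);
            StringBuilder kinds = new StringBuilder();
            for (int i = 0; i < children.getLength(); i++) {
                kinds.append(kind(children.item(i))).append(' ');
            }
            System.out.println(children.getLength() + " children: " + kinds.toString().trim());
        }
    }
}
```

With the JDK's built-in parser, the first and third samples should report a text, CDATA, text sequence and the second a single CDATA child, which is why item(0) only works for the second layout.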

You have two options. Either loop through all the children (however many there are) and concatenate the value of each to get the full result, or, if you don't need to distinguish plain text from text inside CDATA, set the coalescing property on the document builder factory first:

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setCoalescing(true);
...

Coalescing specifies that the parser produced by this factory will convert CDATA nodes to Text nodes and append them to the adjacent text node, if any. By default it is set to false.
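Both options can be sketched side by side. This is a rough illustration rather than a drop-in replacement for getTagValue: the class name CdataTextDemo and the helpers parseRoot and joinCharacterData are invented for this example, and the XML is parsed from an in-memory string instead of the asker's file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; helper names are assumptions for illustration.
public class CdataTextDemo {

    // Parses the XML string and returns its root element.
    // The coalescing flag is passed straight to the factory.
    static Element parseRoot(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    // Option 1: visit every direct child and concatenate the data of
    // text and CDATA nodes; other node types (comments, elements) are skipped.
    static String joinCharacterData(Element element) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<Details>detail <![CDATA[with cdata]]> here</Details>";

        // Option 1: default parser, loop over all children.
        System.out.println(joinCharacterData(parseRoot(xml, false)));

        // Option 2: coalescing parser, so reading the first child is enough.
        System.out.println(parseRoot(xml, true).getFirstChild().getNodeValue());
    }
}
```

When the CDATA section sits on its own indented line, the joined string also contains the surrounding line feeds and spaces, so calling trim() on the result is usually what you want.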

I was just looking for the same thing in JS, so element.childNodes[0].nodeValue instead of element.nodeValue did the trick for me, thanks!
