Replace some char within a string (XML format)

Question

I was given a String variable with the following content:

<main>
<Title title="Hello World" />
<Content content="bla bla bla... by <1% to ??? on other bla bla...." />
</main>

This string will eventually be passed to a stored procedure for XQuery.

As you can see, the value of the "content" attribute contains the character "<", which causes an error when I try to parse it in the stored procedure.

My question is how to convert the "<" into &lt; (in this case <1% to &lt;1%) in an efficient way.

I want to keep the other "<" characters as they are.

Thanks

<Content="foo" /> is not valid XML at all, as there are no anonymous attributes. Did you mean <Content>foo</Content> instead?
– Dai, Commented Oct 4, 2016 at 5:40
Hi there, I have updated the original code. It should be <Content content="bla bla bla... by <1% to ??? on other bla bla...." />
– Trowa, Commented Oct 4, 2016 at 5:46

Dai · Accepted Answer · 2016-10-04 05:55:11Z

Since you updated your question to point out that the unencoded values are in attribute values, not #text nodes, the problem is somewhat simpler: extract the attribute value using an approach similar to my other answer, use a library function to entitize it, then write it back into the document.

Note that CDATA only applies to #text, not attributes.

String doc =
@"<main>
<Title title=""Hello World"" />
<Content content=""bla bla bla... by <1% to ??? on other bla bla...."" />
</main>";

// Locate the <Content element, then the start and end of its content="..." value.
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentAttribContentValueStart = doc.IndexOf("content=\"", contentOpenStart) + "content=\"".Length;
Int32 contentAttribContentValueEnd   = doc.IndexOf("\"", contentAttribContentValueStart);

// Substring takes (startIndex, length), not (startIndex, endIndex).
String attributeValueOld = doc.Substring( contentAttribContentValueStart, contentAttribContentValueEnd - contentAttribContentValueStart );
String attributeValueNew = System.Net.WebUtility.HtmlEncode( attributeValueOld );

String doc2 = String.Concat(
    doc.Substring( 0, contentAttribContentValueStart ),
    attributeValueNew,
    doc.Substring( contentAttribContentValueEnd )
);

doc2 then contains the fixed attribute value.

Note that using HtmlEncode to perform HTML-encoding of entities is not strictly correct in XML, as the set of XML entities is much smaller than HTML's. Indeed, XML is only concerned with &, >, <, " and '; all other values should be in the document as raw/native characters.
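As a hedged alternative to HtmlEncode, the sketch below uses System.Security.SecurityElement.Escape, which replaces only the five characters XML cares about. The class name AttributeEscapeSketch and the helper EscapeAttributeValue are hypothetical names introduced for illustration, and the code assumes the attribute value is double-quoted, contains no raw double quote, and has not already been escaped (otherwise an existing &amp; would be escaped twice).

```csharp
using System;
using System.Security;
using System.Xml.Linq;

public static class AttributeEscapeSketch
{
    // Hypothetical helper: escapes the value of the first attributeName="..."
    // found after the first occurrence of "<elementName".
    // Assumes a double-quoted value with no raw double quote inside it.
    public static string EscapeAttributeValue(string doc, string elementName, string attributeName)
    {
        int elementStart = doc.IndexOf("<" + elementName, StringComparison.Ordinal);
        if (elementStart < 0) return doc;

        string marker = attributeName + "=\"";
        int markerStart = doc.IndexOf(marker, elementStart, StringComparison.Ordinal);
        if (markerStart < 0) return doc;

        int valueStart = markerStart + marker.Length;
        int valueEnd = doc.IndexOf('"', valueStart);
        if (valueEnd < 0) return doc;

        // Substring takes (startIndex, length).
        string rawValue = doc.Substring(valueStart, valueEnd - valueStart);

        // SecurityElement.Escape replaces only <, >, &, " and '.
        string escaped = SecurityElement.Escape(rawValue);

        return doc.Substring(0, valueStart) + escaped + doc.Substring(valueEnd);
    }

    public static void Main()
    {
        string doc =
            "<main>\n" +
            "<Title title=\"Hello World\" />\n" +
            "<Content content=\"bla bla bla... by <1% to ??? on other bla bla....\" />\n" +
            "</main>";

        string fixedDoc = EscapeAttributeValue(doc, "Content", "content");
        Console.WriteLine(fixedDoc);

        // Sanity check: the repaired string should now load in a real XML parser,
        // which decodes &lt; back to < when the attribute value is read.
        XDocument parsed = XDocument.Parse(fixedDoc);
        Console.WriteLine(parsed.Root?.Element("Content")?.Attribute("content")?.Value);
    }
}
```

Only the first matching attribute is touched, so the "<" characters that open the tags are left as they are.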

Good idea. I managed to build a working solution by making minor adjustments to your code.

Dai · Accepted Answer · 2016-10-04 05:48:22Z

(This answer is based on the assumption you're dealing with structurally correct XML, just with unencoded entities in #text nodes - this answer does not apply if your input data really does look like <Title="foo" /> - which isn't XML at all)

If I understand your problem correctly, you have an XML document in a String instance which contains improperly escaped/entitized special characters, which prevents you from using a normal XML parser to read the document.

If you're dealing with an XML-compliant system, then you can wrap the content of the <Content> element in a CDATA section (<![CDATA[ ... ]]>) and then not need to attempt to process that content. The trick then becomes inserting the CDATA delimiters.

While it's often said one cannot use a regular-expression to parse XML (as XML is not a Regular Language), you can take advantage of the grammatical rules of XML to extract and identify tags.

So if you have this:

<Content someAttribute="someValue">
reduce sales by <1% in order to ensure that profit > loss
</Content>

Then you can do this:

String doc = @"<main><Title...";
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentOpenEnd   = doc.IndexOf(">", contentOpenStart);

Int32 contentCloseStart = doc.IndexOf("</Content>", contentOpenEnd);

This code then tells us the locations of the angle-brackets of the <Content> element's two tags, with which we can insert the CDATA delimiters:

String newDocument = String.Concat(
    doc.Substring( 0, contentOpenEnd + 1 ), // "<main>...<Content...>"
    "<![CDATA[",
    // Substring takes (startIndex, length), so subtract the start from the end index.
    doc.Substring( contentOpenEnd + 1, contentCloseStart - ( contentOpenEnd + 1 ) ),
    "]]>",
    doc.Substring( contentCloseStart ) // "</Content>..."
);

newDocument will then be this:

<Content someAttribute="someValue"><![CDATA[
reduce sales by <1% in order to ensure that profit > loss
]]></Content>

...which is valid XML.
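For reference, the CDATA approach can be sketched as a self-contained program. The class name CDataWrapSketch and the helper WrapElementTextInCData are hypothetical names, not part of any library. The sketch assumes that the element's opening tag contains no ">" inside an attribute value, that the text does not itself contain "]]>", and that only the first matching element needs fixing.

```csharp
using System;
using System.Xml.Linq;

public static class CDataWrapSketch
{
    // Hypothetical helper: wraps the text of the first <elementName ...>...</elementName>
    // in a CDATA section so that raw < and > inside it no longer break parsing.
    public static string WrapElementTextInCData(string doc, string elementName)
    {
        int openStart = doc.IndexOf("<" + elementName, StringComparison.Ordinal);
        if (openStart < 0) return doc;

        // End of the opening tag (assumes no '>' inside its attribute values).
        int openEnd = doc.IndexOf('>', openStart);
        if (openEnd < 0) return doc;

        string closeTag = "</" + elementName + ">";
        int closeStart = doc.IndexOf(closeTag, openEnd, StringComparison.Ordinal);
        if (closeStart < 0) return doc;

        int textStart = openEnd + 1;
        // Substring takes (startIndex, length).
        string text = doc.Substring(textStart, closeStart - textStart);

        return doc.Substring(0, textStart) + "<![CDATA[" + text + "]]>" + doc.Substring(closeStart);
    }

    public static void Main()
    {
        string doc =
            "<main><Content someAttribute=\"someValue\">\n" +
            "reduce sales by <1% in order to ensure that profit > loss\n" +
            "</Content></main>";

        string wrapped = WrapElementTextInCData(doc, "Content");
        Console.WriteLine(wrapped);

        // Sanity check: the wrapped document should load in a real XML parser.
        XDocument parsed = XDocument.Parse(wrapped);
        Console.WriteLine(parsed.Root?.Element("Content")?.Value);
    }
}
```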

Hi @Dai, thanks for the comment. The issue is that the value is within an attribute of the tag, not in its content. So instead of having <Content someAttribute="someValue">reduce sales by <1% in order to ensure that profit > loss</Content>, I have <Content someAttribute="reduce sales by <1% in order to ensure that profit > loss" />. How can I adjust the code accordingly? Can we use CDATA for attributes as well?
