I was given a String variable with the following content:

<main>
<Title title="Hello World" />
<Content content="bla bla bla... by <1% to ??? on other bla bla...." />
</main>

This string will eventually be passed to a stored procedure for XQuery.

As you can see, the value of the "content" attribute contains the character "<", which causes an error when I try to parse the string in the stored procedure.

My question is how to convert that "<" into &lt; (in this case, <1% into &lt;1%) in an efficient way.

I want to keep the other "<" characters as they are.

Thanks

3 Comments
  • This doesn't seem to be valid XML. Commented Oct 4, 2016 at 5:39
  • <Content="foo" /> is not valid XML at all, as there are no anonymous attributes. Did you mean <Content>foo</Content> instead? Commented Oct 4, 2016 at 5:40
  • Hi there, I have updated the original code. It should be <Content content="bla bla bla... by <1% to ??? on other bla bla...." /> Commented Oct 4, 2016 at 5:46

2 Answers

Since you updated your question to point out that the unencoded values are in attribute values, not #text nodes, the problem is somewhat simpler: extract the attribute value using an approach similar to my other answer, use a library function to entitize it, then write it back into the document.

Note that CDATA only applies to #text, not attributes.

String doc =
@"<main>
<Title title=""Hello World"" />
<Content content=""bla bla bla... by <1% to ??? on other bla bla...."" />
</main>";

// Locate the start and end of the value of the content="..." attribute.
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentAttribContentValueStart = doc.IndexOf("content=\"", contentOpenStart) + "content=\"".Length;
Int32 contentAttribContentValueEnd   = doc.IndexOf("\"", contentAttribContentValueStart);

// Substring takes (startIndex, length), not (startIndex, endIndex).
String attributeValueOld = doc.Substring( contentAttribContentValueStart, contentAttribContentValueEnd - contentAttribContentValueStart );
String attributeValueNew = System.Net.WebUtility.HtmlEncode( attributeValueOld );

// Reassemble: everything before the value, the encoded value, everything after.
String doc2 = String.Concat(
    doc.Substring( 0, contentAttribContentValueStart ),
    attributeValueNew,
    doc.Substring( contentAttribContentValueEnd )
);

doc2 then contains the fixed attribute value.

Note that using HtmlEncode to perform HTML-encoding of entities is not strictly correct in XML, as the set of XML entities is much smaller than HTML's. Indeed, XML is only concerned with &amp;, &gt;, &lt;, &quot; and &apos;; all other values should be in the document as raw/native characters.
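As a hedged alternative that emits only those five XML entities, the same extraction can be paired with System.Security.SecurityElement.Escape instead of HtmlEncode. The sketch below is not the original code: the class name AttributeEscaper and the method EscapeAttributeValue are hypothetical, and it assumes the attribute is preceded by a single space, is double-quoted, contains no literal double quote, and has not already been escaped (an existing &lt; would become &amp;lt;).

```csharp
using System;
using System.Security;

public static class AttributeEscaper
{
    // Hypothetical helper: escapes the value of the first attributeName="..."
    // found after elementOpen (for example "<Content").
    // Assumes a single space before the attribute name and a double-quoted
    // value that contains no literal double quote.
    public static string EscapeAttributeValue(string doc, string elementOpen, string attributeName)
    {
        int elementStart = doc.IndexOf(elementOpen, StringComparison.Ordinal);
        if (elementStart < 0) return doc;

        string marker = " " + attributeName + "=\"";
        int markerStart = doc.IndexOf(marker, elementStart, StringComparison.Ordinal);
        if (markerStart < 0) return doc;

        int valueStart = markerStart + marker.Length;
        int valueEnd = doc.IndexOf('"', valueStart);
        if (valueEnd < 0) return doc;

        // Substring takes (startIndex, length).
        string oldValue = doc.Substring(valueStart, valueEnd - valueStart);

        // Replaces <, >, &, " and ' with their XML entities.
        string newValue = SecurityElement.Escape(oldValue);

        return doc.Substring(0, valueStart) + newValue + doc.Substring(valueEnd);
    }
}
```

Calling AttributeEscaper.EscapeAttributeValue(doc, "<Content", "content") on the document above should leave the <main>, <Title> and <Content> tags untouched and change only the "<" inside the attribute value.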

1 Comment

Good idea. I managed to construct a solution by making minor adjustments to your code.

(This answer is based on the assumption you're dealing with structurally correct XML, just with unencoded entities in #text nodes - this answer does not apply if your input data really does look like <Title="foo" /> - which isn't XML at all)

If I understand your problem correctly, you have an XML document in a String instance which contains improperly escaped/entitized special characters, which prevents you from using a normal XML parser to read the document.

If you're dealing with an XML-compliant system, then you can use <![CDATA[ ... ]]> and avoid processing the content of the <Content> element at all. The trick then becomes inserting the CDATA delimiters.

While it's often said that one cannot use a regular expression to parse XML (as XML is not a regular language), you can take advantage of the grammatical rules of XML to locate and identify tags.

So if you have this:

<Content someAttribute="someValue">
reduce sales by <1% in order to ensure that profit > loss
</Content>

Then you can do this:

String doc = @"<main><Title...";
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentOpenEnd   = doc.IndexOf(">", contentOpenStart);

Int32 contentCloseStart = doc.IndexOf("</Content>", contentOpenEnd);

This code gives us the locations of the angle brackets of the <Content> element's two tags, with which we can insert the CDATA delimiters:

String newDocument = String.Concat(
    doc.Substring( 0, contentOpenEnd + 1 ), // "<main>...<Content...>"
    "<![CDATA[",
    // Substring takes (startIndex, length), not (startIndex, endIndex).
    doc.Substring( contentOpenEnd + 1, contentCloseStart - ( contentOpenEnd + 1 ) ),
    "]]>",
    doc.Substring( contentCloseStart ) // "</Content>..."
);

newDocument will then be this:

<Content someAttribute="someValue"><![CDATA[
reduce sales by <1% in order to ensure that profit > loss
]]></Content>

...which is valid XML.
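The steps above can be collected into one helper and checked with a real parser. This is only a sketch under stated assumptions: the class name CDataWrapper and the method WrapElementText are hypothetical, the element is assumed to occur once with no other element name starting with the same prefix, its opening tag is assumed to contain no ">" inside attribute values, and its text is assumed not to contain "]]>" (which would end the CDATA section early).

```csharp
using System;
using System.Xml.Linq;

public static class CDataWrapper
{
    // Hypothetical helper: wraps the text between <elementName ...> and
    // </elementName> in a CDATA section using plain string searches.
    public static string WrapElementText(string doc, string elementName)
    {
        int openStart = doc.IndexOf("<" + elementName, StringComparison.Ordinal);
        if (openStart < 0) return doc;

        // End of the opening tag (assumes no '>' inside its attribute values).
        int openEnd = doc.IndexOf('>', openStart);
        if (openEnd < 0) return doc;

        string closeTag = "</" + elementName + ">";
        int closeStart = doc.IndexOf(closeTag, openEnd, StringComparison.Ordinal);
        if (closeStart < 0) return doc;

        int textStart = openEnd + 1;
        return string.Concat(
            doc.Substring(0, textStart),                       // up to and including the opening tag
            "<![CDATA[",
            doc.Substring(textStart, closeStart - textStart),  // the raw text, unchanged
            "]]>",
            doc.Substring(closeStart));                        // closing tag and the rest
    }
}
```

Passing the result to XDocument.Parse is one way to confirm that the repaired string is now well-formed; the element's Value should then contain the original text, including the raw "<" and ">" characters.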

1 Comment

Hi @Dai, thanks for the comment. The issue is that the value is within an attribute of the tag, not in its content. So instead of having <Content someAttribute="someValue">reduce sales by <1% in order to ensure that profit > loss</Content>, I have <Content someAttribute="reduce sales by <1% in order to ensure that profit > loss" />. How can I adjust the code accordingly? Can we use CDATA for attributes as well?
