\$\begingroup\$

I'm working on an application that allows users to edit and fix XML. Part of this is formatting the XML for better readability.

As the XML might be invalid, the existing methods I found for formatting (like XmlWriter or XDocument) don't work for me.
There might be all sorts of problems with the XML, although the most common is unescaped special characters.

public static string FormatXml(string xml)
{
    var tags = xml
        .Split('<')
        .Select(tag => tag.TrimEnd().EndsWith(">") ? tag.TrimEnd() : tag); //Trim whitespace between tags, but not at the end of values

    var previousTag = tags.First(); //Preserve content before the first tag, e.g. if the initial < is missing
    var formattedXml = new StringBuilder(previousTag);
    var indention = 0;
    
    foreach (var tag in tags.Skip(1))
    {
        if (previousTag.EndsWith(">"))
        {
            formattedXml.AppendLine();
            if (tag.StartsWith("/"))
            {
                indention = Math.Max(indention - 1, 0);
                formattedXml.Append(new string('\t', indention));
            }
            else
            {
                formattedXml.Append(new string('\t', indention));
                if (!tag.EndsWith("/>"))
                {
                    indention++;
                }
            }
        }
        else
        {
            indention = Math.Max(indention - 1, 0);
        }

        formattedXml.Append("<");
        formattedXml.Append(tag);
        previousTag = tag;
    }

    return formattedXml.ToString();
}

So far, the method produces reasonable output for all the cases I came up with.

I'm mostly worried that I missed some special cases of valid XML that would get messed up.

\$\endgroup\$
  • \$\begingroup\$ Is the XML passed to the method before or after the user edits it? \$\endgroup\$ Commented Dec 2, 2020 at 15:01
  • \$\begingroup\$ @Heslacher: The method is invoked by the user through a 'Format XML' button. \$\endgroup\$ Commented Dec 2, 2020 at 16:13

1 Answer
\$\begingroup\$

There's a test suite of 2000 test cases available at https://www.w3.org/XML/Test/ - try it out.

From a quick glance, it's not clear to me how you're handling content within comments or CDATA sections - which might be well-formed XML, or it might be something approximating to well-formed XML.
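
To illustrate the concern, here is a minimal sketch of one way to keep comments and CDATA sections intact: tokenize the input so that each of them is a single opaque token before any indentation logic runs. The class name `XmlTokenSketch`, the method `Tokenize`, and the regular expression are assumptions made for this example and are not part of the code under review. The pattern is a heuristic for possibly broken input, not a conforming XML lexer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XmlTokenSketch
{
    // Alternatives are tried left to right:
    //   1. a complete comment            <!-- ... -->
    //   2. a complete CDATA section      <![CDATA[ ... ]]>
    //   3. something that looks like a tag  < ... >
    //   4. a run of text without '<'
    //   5. a stray '<' (for example an unescaped "a < b")
    private static readonly Regex Token = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<[^<>]*>|[^<]+|<",
        RegexOptions.Singleline);

    // Hypothetical helper: splits the input into tokens without
    // looking inside comments or CDATA sections.
    public static List<string> Tokenize(string xml)
    {
        var tokens = new List<string>();
        foreach (Match match in Token.Matches(xml))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

For the input `<a><!-- <b> --><![CDATA[x < y]]></a>` this yields four tokens: `<a>`, the whole comment, the whole CDATA section, and `</a>`. Splitting on `<` alone would instead cut the comment and the CDATA section into pieces and treat the `<b>` inside the comment as a tag.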

Another comment is that messing with whitespace is dangerous in mixed content. With inline markup (bold, italic, etc.), preserving whitespace exactly as written may be important.
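
A small sketch of why this matters, using well-formed input so that `XElement` can parse it. The class `MixedContentSketch` and the method `TextOf` are hypothetical names for this example. It only shows that the whitespace between two inline elements is part of the text content, so a formatter that trims or replaces it changes what the document says.

```csharp
using System;
using System.Xml.Linq;

public static class MixedContentSketch
{
    // Returns the concatenated text content of the root element.
    // LoadOptions.PreserveWhitespace keeps whitespace-only text nodes
    // that sit between child elements.
    public static string TextOf(string xml)
    {
        return XElement.Parse(xml, LoadOptions.PreserveWhitespace).Value;
    }
}
```

With this helper, `TextOf("<p><b>very</b> <i>important</i></p>")` gives `very important`, while `TextOf("<p><b>very</b><i>important</i></p>")` gives `veryimportant`. Replacing the single space with a line break and tabs would likewise alter the text.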

\$\endgroup\$
  • \$\begingroup\$ +1 I'll have a look at the test cases. Mixed content might be problematic. In my specific use case it's not a concern, but in general my code would need some major changes to handle this. \$\endgroup\$ Commented Dec 3, 2020 at 10:23
