I have about 10,000 XML files that I need to convert into a SQL table.

The problem is that each XML file varies slightly from the others, so it is almost impossible for me to specify the element names up front. For example:

//XML #1
<color>Blue</color>
<height>14.5</height>
<weight>150</weight>
<price>56.78</price>

//XML #2
<color>Red</color>
<distance>98.7</distance>
<height>15.5</height>
<price>56.78</price>

//XML #3: Some of the elements have no value
<color />
<height>14.5</height>
<price>78.11</price>

//XML #4: Elements have parent/child relationships
<color>
    <bodyColor>Blue</bodyColor>
    <frontColor>Yellow</frontColor>
    <backColor>White</backColor>
</color>
<height>14.5</height>
<weight>150</weight>
<price>56.78</price>

Given the examples above, I expect a table with these column names: color, height, weight, price, distance (because XML #2 has distance), bodyColor, frontColor, backColor.

Expected output:

XML#    color    height    weight    price    distance    bodyColor    frontColor    backColor
1       Blue     14.5      150       56.78    NULL        NULL         NULL          NULL
2       Red      15.5      NULL      56.78    98.7        NULL         NULL          NULL
3       NULL     14.5      NULL      78.11    NULL        NULL         NULL          NULL
4       NULL     14.5      150       56.78    NULL        Blue         Yellow        White

In this case, NULL or empty values are acceptable.

These are just examples; each XML file contains at least 500 elements. Also, even though I mentioned C# here, if anyone can suggest a better way of doing this, please let me know.

  • Get the files to follow a certain Xml Schema Definition. It is ok for tags to be empty or NULL. It is important for the tag to be present. Then it will be easier to work with the file. Commented Apr 24, 2014 at 18:16
  • @abhi What do you mean by XML Schema Definition? (Sorry, I'm quite new to XML) Commented Apr 24, 2014 at 18:17
  • One possible solution would be to iterate over every XML file and extract all unique fields and create a table in the database with all of the extracted unique fields. This way you would know what fields you have so that you could perhaps consider to normalize the table(s) later. Commented Apr 24, 2014 at 18:18
  • @Hituptony I was about to post the same link. Commented Apr 24, 2014 at 18:19
  • You could get a sample file that has all the possible tags. Then Visual Studio will create the XSD for you. There are other tools in the market that are better at this game. Altova XML SPY and Liquid come to mind. In 2011, I was working on a very similar activity. Commented Apr 24, 2014 at 18:30

2 Answers

2

One way to iterate over all the XML files and collect every unique tag is to use LINQ to XML together with the HashSet class. It could look like this:

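// namespaces this snippet relies on
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

try
{
    // a HashSet stores each tag name only once, however often it is added
    HashSet<String> uniqueTags = new HashSet<String>();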
    // recursive helper delegate
    Action<XElement> addSubElements = null;
    addSubElements = (xmlElement) =>
    {
        // add the element name (the HashSet ignores duplicates)
        uniqueTags.Add(xmlElement.Name.ToString());
        // if the given element has some subelements
        foreach (var element in xmlElement.Elements())
        {
            // add them too
            addSubElements(element);
        }
    };

    // load all xml files
    var xmls = Directory.GetFiles("d:\\temp\\xml\\", "*.xml");
    foreach (var xml in xmls)
    {
        var xmlDocument = XDocument.Load(xml);
        // and take their tags
        addSubElements(xmlDocument.Root);
    }
    // list tags
    foreach (var tag in uniqueTags)
    {
        Console.WriteLine(tag);
    }
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}

Now you have the columns for the basic SQL table. With a little enhancement, you could also record which nodes are parents and which are sub nodes. That could help you with normalization.
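A minimal sketch of that enhancement, assuming it is enough to record each element's path below the root (for example color/bodyColor) instead of its bare name. The class name TagPaths, the method Collect and the <item> wrapper element in the sample are made up for illustration; they are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TagPaths
{
    // Returns every distinct element path below the root, e.g. "color/bodyColor".
    public static SortedSet<string> Collect(XElement root)
    {
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        foreach (var element in root.Descendants())
        {
            // AncestorsAndSelf walks upwards from the element; stop at the root,
            // then reverse so the path reads parent-first.
            var names = element.AncestorsAndSelf()
                               .TakeWhile(e => e != root)
                               .Select(e => e.Name.LocalName)
                               .Reverse();
            paths.Add(string.Join("/", names));
        }
        return paths;
    }

    public static void Main()
    {
        // Hypothetical sample: XML #4 from the question, shortened and wrapped in <item>.
        var root = XElement.Parse(
            "<item><color><bodyColor>Blue</bodyColor><frontColor>Yellow</frontColor></color>" +
            "<height>14.5</height></item>");
        foreach (var path in Collect(root))
            Console.WriteLine(path);
    }
}
```

A path that is a prefix of another path (color and color/bodyColor here) marks a parent node, which is a reasonable candidate for its own table when normalizing.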

4 Comments

You gave me a head start, however, I have just discovered even more issues in the XML. I have issues where in just one XML, it has elements like: <color>Red</color> <color>Blue</color> <color>Green</color> These are 3 different colors, but because of the unique filter, it only returns color 1 time instead of 3.
That is why you probably also need to consider the nesting level, so that you can add subelements with the same name too. You need to analyse the whole input data anyway. For example, it could be fine to rename nested color tags to bodyColor, frontColor and so on, or to something more appropriate like blueComponent.
thank you anyway, at least your code gave me a head start, I will modify it to fit the condition I need
Glad that the answer helped.
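
For the repeated <color> elements discussed in these comments, one possible approach is to number the repeats so that each occurrence gets its own column. This is only a sketch: the class RepeatedTags, the method ToRow and the _2, _3 suffix scheme are assumptions for illustration, and the scheme would collide with a real element that happens to be named color_2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RepeatedTags
{
    // Maps each leaf element of one file to a column name and value,
    // suffixing repeated names: color, color_2, color_3, ...
    public static Dictionary<string, string> ToRow(XElement root)
    {
        var seen = new Dictionary<string, int>();
        var row = new Dictionary<string, string>();
        foreach (var leaf in root.Descendants().Where(e => !e.HasElements))
        {
            var name = leaf.Name.LocalName;
            seen.TryGetValue(name, out var count);   // count is 0 for a new name
            seen[name] = ++count;
            var column = count == 1 ? name : name + "_" + count;
            // <color /> has no content at all, so store it as null
            row[column] = leaf.IsEmpty ? null : leaf.Value;
        }
        return row;
    }

    public static void Main()
    {
        // Hypothetical sample with a made-up <item> root element.
        var root = XElement.Parse(
            "<item><color>Red</color><color>Blue</color><color>Green</color></item>");
        foreach (var pair in ToRow(root))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```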
1

You can do this in T-SQL using XQuery, a staging table, and a dynamic pivot.

Staging table:

create table dbo.XMLStage
(
  ID uniqueidentifier not null,
  Name nvarchar(128) not null,
  Value nvarchar(max) not null,
  primary key (Name, ID)
);

ID is unique per file, Name holds the node name, and Value holds the node value.

Stored procedure to populate the staging table:

create procedure dbo.LoadXML
  @XML xml
as

declare @ID uniqueidentifier;
set @ID = newid();

insert into dbo.XMLStage(ID, Name, Value)
select @ID,
       T.X.value('local-name(.)', 'nvarchar(128)'),
       T.X.value('text()[1]', 'nvarchar(max)')
from @XML.nodes('//*[text()]') as T(X);

//*[text()] will give you all nodes that have a text value.
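
The procedure still has to be called once per file. Since the question mentions C#, here is a rough sketch of a loader, assuming the classic System.Data.SqlClient provider (the newer Microsoft.Data.SqlClient package uses the same type names), a connection string you supply yourself, and files that are well-formed documents with a single root element. XmlLoader and LoadFolder are made-up names.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;
using System.Xml;

public static class XmlLoader
{
    // Calls dbo.LoadXML once for every *.xml file in the given folder.
    public static void LoadFolder(string connectionString, string folder)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            foreach (var path in Directory.GetFiles(folder, "*.xml"))
            {
                using (var reader = XmlReader.Create(path))
                using (var command = new SqlCommand("dbo.LoadXML", connection))
                {
                    command.CommandType = CommandType.StoredProcedure;
                    // SqlXml streams the document through the reader, so the
                    // encoding declared in the file does not have to match
                    // the encoding of a .NET string.
                    command.Parameters.Add("@XML", SqlDbType.Xml).Value = new SqlXml(reader);
                    command.ExecuteNonQuery();
                }
            }
        }
    }
}
```

Note that the staging table's primary key is (Name, ID), so a file that contains the same element name twice would make the insert fail with a key violation; such files would need a different key or renamed elements first.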

Dynamic query to pivot the data in the staging table:

declare @Cols nvarchar(max);
declare @SQL nvarchar(max);

set @Cols = (
            select distinct ',' + quotename(X.Name)
            from dbo.XMLStage as X
            for xml path(''), type
            ).value('substring(text()[1], 2)', 'nvarchar(max)');

set @SQL = '
select '+@Cols+'
from dbo.XMLStage
pivot (max(Value) for Name in ('+@Cols+')) as P';

exec sp_executesql @SQL;

Try it out in this SQL Fiddle
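
If you would rather do the pivot on the client, the same idea can be sketched in C# with a DataTable: one column per distinct element name, one row per file, and DBNull wherever a file lacks that element. This is an illustration under assumptions, not a drop-in replacement for the query above: ClientPivot, Pivot and the XmlNumber column are made-up names, and each file is assumed to be already flattened into a name/value dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class ClientPivot
{
    // Builds one row per file and one column per distinct element name.
    public static DataTable Pivot(IEnumerable<Dictionary<string, string>> files)
    {
        var rows = files.ToList();
        var table = new DataTable("XmlData");
        table.Columns.Add("XmlNumber", typeof(int));   // hypothetical row label

        // First pass: add a string column for every name not seen before.
        foreach (var name in rows.SelectMany(r => r.Keys))
            if (!table.Columns.Contains(name))
                table.Columns.Add(name, typeof(string));

        // Second pass: fill the rows; cells that are never assigned stay DBNull.
        for (var i = 0; i < rows.Count; i++)
        {
            var record = table.NewRow();
            record["XmlNumber"] = i + 1;
            foreach (var pair in rows[i])
                record[pair.Key] = (object)pair.Value ?? DBNull.Value;
            table.Rows.Add(record);
        }
        return table;
    }

    public static void Main()
    {
        // XML #1 and XML #2 from the question, already flattened.
        var files = new[]
        {
            new Dictionary<string, string> { ["color"] = "Blue", ["height"] = "14.5", ["weight"] = "150", ["price"] = "56.78" },
            new Dictionary<string, string> { ["color"] = "Red", ["distance"] = "98.7", ["height"] = "15.5", ["price"] = "56.78" },
        };
        foreach (DataRow row in Pivot(files).Rows)
            Console.WriteLine(string.Join("\t",
                row.ItemArray.Select(v => v is DBNull ? "NULL" : v.ToString())));
    }
}
```

DataTable looks up column names case-insensitively, so elements whose names differ only in case would end up in the same column here.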
