
I know this question has been asked before, but I can't seem to get it working with the answers I've read. I've got a CSV file of about 1.2 GB. If I run the process as 32-bit I get an OutOfMemoryException; it works as a 64-bit process, but it still takes 3.4 GB of memory. I know I'm storing a lot of data in my CustomData class, but still, 3.4 GB of RAM? Am I doing something wrong when reading the file? dict is a dictionary that just maps a column index to the property the value should be stored in. Am I doing the reading the right way?

StreamReader reader = new StreamReader(File.OpenRead(path));
while (!reader.EndOfStream)
{
    String line = reader.ReadLine();
    String[] values = line.Split(';');
    CustomData data = new CustomData();
    string value;
    for (int i = 0; i < values.Length; i++)
    {
        dict.TryGetValue(i, out value);
        Type targetType = data.GetType();
        PropertyInfo prop = targetType.GetProperty(value);
        if (values[i] == null)
        {
            prop.SetValue(data, "NULL", null);
        }
        else
        {
            prop.SetValue(data, values[i], null);
        }
    }
    dataList.Add(data);
}
  • First of all, you don't get to use the whole memory for a C# process. I would recommend you use the LumenWorks CSV parser. And don't use reflection; why do you need reflection? It's a CSV file, you are torturing yourself. (See the sketch after these comments.) Commented Jul 13, 2012 at 7:39
  • See this for another explanation that you don't have all the memory to yourself: stackoverflow.com/questions/1109558/… Commented Jul 13, 2012 at 7:40
  • Do you really have to keep the whole parsed data in memory? Maybe you should consider storing it somewhere else (a database? a file with binary serialization? ...). Could you give us an insight into your CustomData class definition? Commented Jul 13, 2012 at 7:41
  • Agreed. You can aggregate the dataList. Commented Jul 13, 2012 at 7:42
  • Thanks, but I'm aware of this. I'm just wondering if I'm doing the reading wrong; a lot of people on Stack Overflow recommend using StreamReader, so I'm just thinking I might be using it the wrong way. Commented Jul 13, 2012 at 7:42
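
As an illustration of the suggestion above to drop reflection, here is a minimal sketch that assigns columns to properties directly; the column order and the property names (Name, City, Phone) are made-up assumptions, so adjust them to your actual file and class:

    // Sketch: map columns onto CustomData directly instead of via reflection.
    // Column indices and property names are assumptions for illustration only.
    string[] values = line.Split(';');
    CustomData data = new CustomData();
    data.Name  = values.Length > 0 ? values[0] : "NULL";
    data.City  = values.Length > 1 ? values[1] : "NULL";
    data.Phone = values.Length > 2 ? values[2] : "NULL";
    dataList.Add(data);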

2 Answers

3

There doesn't seem to be anything wrong with your usage of the stream reader: you read a line into memory, then forget it.

However, in C# a string is encoded in memory as UTF-16, so on average a character consumes 2 bytes in memory.

If your CSV also contains a lot of empty fields that you convert to "NULL", you add up to 7 bytes for each empty field.

So on the whole, since you basically store all the data from your file in memory, it's not really surprising that you require almost 3 times the size of the file in memory.

The actual solution is to parse your data in chunks of N lines, process them, and free them from memory.
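
As a rough sketch of that approach (the processChunk callback and the chunk size are hypothetical placeholders; plug in your own processing, e.g. aggregating or writing to a database):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Sketch: read and process the file in chunks of N lines instead of
    // keeping every row in memory at once.
    static void ReadInChunks(string path, int chunkSize, Action<List<string[]>> processChunk)
    {
        var chunk = new List<string[]>(chunkSize);
        using (var reader = new StreamReader(File.OpenRead(path)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line.Split(';'));
                if (chunk.Count == chunkSize)
                {
                    processChunk(chunk);   // treat the rows (aggregate, write out, ...)
                    chunk.Clear();         // let the processed rows be garbage collected
                }
            }
            if (chunk.Count > 0)
                processChunk(chunk);       // flush the last partial chunk
        }
    }

    // e.g. ReadInChunks(path, 10000, rows => { /* treat 10000 rows, then forget them */ });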

Note: Consider using a CSV parser; there is more to CSV than just commas or semicolons. What if one of your fields contains a semicolon, a newline, a quote...?
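
For instance, here is a minimal sketch with the TextFieldParser that ships in the Microsoft.VisualBasic assembly, which already copes with quoted fields containing delimiters or newlines (assuming a semicolon delimiter as in your file):

    using Microsoft.VisualBasic.FileIO;   // reference the Microsoft.VisualBasic assembly

    using (var parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(";");
        parser.HasFieldsEnclosedInQuotes = true;

        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();   // one properly unescaped row
            // map fields onto your CustomData here
        }
    }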

Edit

Actually, each string takes up about 20 + (N/2)*4 bytes in memory; see C# in Depth.
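
For instance, plugging into that formula, a 10-character field costs roughly 20 + (10/2)*4 = 40 bytes as a string in memory, versus about 10 bytes in the file on disk (assuming a single-byte encoding), which is another reason the in-memory size dwarfs the file size.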


Comments

3

Ok a couple of points here.

  • As pointed out in the comments, .NET under x86 can only consume about 1.5 GB per process, so consider that your maximum memory in 32-bit.

  • The StreamReader itself will have an overhead. I don't know if it caches the entire file in memory, or not (maybe someone can clarify?). If so, reading and processing the file in chunks might be a better solution

  • The CustomData class: how many fields does it have, and how many instances are created? Note that you will need 32 bits for each reference in x86 and 64 bits for each reference in x64. So if your CustomData class has 10 fields of type System.Object, each CustomData instance requires 88 bytes before storing any data.

  • The dataList.Add at the end. I assume you are adding to a generic List? If so, note that List employs a doubling algorithm to resize. If you have 1 GB in a List and it needs 1 more byte of space, it will create a 2 GB array and copy the 1 GB into it on resize. So all of a sudden the 1 GB + 1 byte actually requires 3 GB to manipulate. Another alternative is to use a pre-sized array; see the sketch below.
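
A minimal sketch of that last suggestion (the row count here is a made-up estimate; use whatever you know about your file):

    // Sketch: pre-size the storage so List<T> never has to double and copy.
    const int estimatedRows = 10000000;                    // hypothetical estimate

    var dataList = new List<CustomData>(estimatedRows);    // pre-sized List
    // ...or a plain pre-sized array:
    var dataArray = new CustomData[estimatedRows];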

2 Comments

Thanks mate! Now that you mention it, I've heard about the doubling in generic lists before.
Yes, it's easy to see: download ILSpy and open up the code for List<T> and take a look. When a new item is appended, it checks the size of the backing array, and if it's not big enough, it creates a new array of size 2*size and performs a copy. So at one instant in time you have 3*size elements in memory just to add a single item!
