I am trying to process a very large amount of data (~1000 separate files, each ~30 MB) to use as input to the training phase of a machine learning algorithm. The raw data files are formatted as JSON, and I deserialize them using the JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throws an 'OutOfMemoryException'. Is there a way to reduce the data in memory, or do I have to change my whole approach (such as switching to a big data framework like Spark) to handle this problem?

public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;

    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();

                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;

                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);

                    var reducedObject = ApplyFiltering(jsonObject); // returns null if the filtering conditions are not met
                    if (reducedObject == null)
                        continue;

                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }    
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}");
        return null;
    }

    return jsonObjects;
}

Thanks.

  • Uhm, you're not really using JsonTextReader. You're using JsonSerializer and keep everything read in memory as well. You can switch to 64 bit to avoid the OOM, but your working set will be quite large in any case. Your problem isn't so much JSON but rather that you keep a lot of giant objects around (which the GC also doesn't particularly like). Commented Mar 20, 2017 at 8:49
  • As Joey is saying, you're better off streaming the data through. Rather than giving it all the ML in a big bang - you need to trickle feed it through. Commented Mar 20, 2017 at 8:51
  • @Joey, it's already 64-bit. I will try Tim's approach. By the way, I corrected the misleading part, thank you. Commented Mar 20, 2017 at 10:41
  • I agree with the others; you should be processing each item individually as you get them from the stream, not adding them all to a list, which keeps everything in memory. If you must add things to a list, try using something like a linked list which does not allocate memory in one contiguous block and does not need to reallocate and copy everything when it needs to expand its capacity. Commented Mar 20, 2017 at 17:33
  • I used yield return as Tim suggested. Also, at the beginning of the ApplyFilter() method, I was creating a new instance of a custom model before filtering each deserialized item. I have changed this to process the deserialized data first, and then map it to my custom model only if all filtering conditions are met (rather than creating an instance up front and returning null when they are not). The heap size is greatly reduced. I never thought of using a linked list, I will try it, thank you. Commented Mar 21, 2017 at 6:36

1 Answer

It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. It gets to a point where you ask the JsonSerializer to create another object and it fails.

You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing to disk or wherever they need to end up.
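A minimal sketch of that streaming version, adapted from the question's code (ApplyFiltering is the asker's helper, assumed here to keep the same contract of returning null when the filtering conditions are not met). Note that a `yield return` cannot appear inside a try block with a catch clause, so the catch-all from the original method is dropped and errors surface to the caller:

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static IEnumerable<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        yield break;

    foreach (var filename in Directory.GetFiles(path))
    {
        using (var streamReader = new StreamReader(filename))
        using (var jsonReader = new JsonTextReader(streamReader))
        {
            jsonReader.SupportMultipleContent = true;
            var serializer = new JsonSerializer();

            while (jsonReader.Read())
            {
                if (jsonReader.TokenType != JsonToken.StartObject)
                    continue;

                var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                var reducedObject = ApplyFiltering(jsonObject);
                if (reducedObject != null)
                    yield return reducedObject; // only the current object stays alive
            }
        }
    }
}
```

The calling code then iterates lazily instead of materializing a List<T>, so each deserialized object becomes eligible for collection as soon as it has been processed:

```csharp
foreach (var item in DeserializeJsonFiles<MyModel>(path))
    Process(item); // e.g. write features to disk or feed the trainer incrementally
```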
