1

Using this article from MSDN, I'm trying to search through files in a directory. The problem is, every time I execute the program, I get:

"An unhandled exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll".

I have tried some other options, such as StreamReader, but I can't get them to work. These files are HUGE. Some of them are 1.5-2 GB each, and there could be 5 or more files per day.

This code fails:

private static string GetFileText(string name)
{
    var fileContents = string.Empty;
    // If the file has been deleted since we took  
    // the snapshot, ignore it and return the empty string. 
    if (File.Exists(name))
    {
        fileContents = File.ReadAllText(name);
    }
    return fileContents;
}

Any ideas what could be happening or how to make it read without memory errors?

Entire code (in case you don't want to open the MSDN article)

using System;
using System.Collections.Generic;
using System.Linq;

class QueryContents {
public static void Main()
{
    // Modify this path as necessary. 
    string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\";

    // Take a snapshot of the file system.
    System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);

    // This method assumes that the application has discovery permissions 
    // for all folders under the specified path.
    IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories);

    string searchTerm = @"Visual Studio";

    // Search the contents of each file. 
    // A regular expression created with the RegEx class 
    // could be used instead of the Contains method. 
    // queryMatchingFiles is an IEnumerable<string>. 
    var queryMatchingFiles =
        from file in fileList
        where file.Extension == ".htm" 
        let fileText = GetFileText(file.FullName)
        where fileText.Contains(searchTerm)
        select file.FullName;

    // Execute the query.
    Console.WriteLine("The term \"{0}\" was found in:", searchTerm);
    foreach (string filename in queryMatchingFiles)
    {
        Console.WriteLine(filename);
    }

    // Keep the console window open in debug mode.
    Console.WriteLine("Press any key to exit");
    Console.ReadKey();
}

// Read the contents of the file. 
static string GetFileText(string name)
{
    string fileContents = String.Empty;

    // If the file has been deleted since we took  
    // the snapshot, ignore it and return the empty string. 
    if (System.IO.File.Exists(name))
    {
        fileContents = System.IO.File.ReadAllText(name);
    }
    return fileContents;
}

}

2 Answers

3

The problem is caused by trying to load multiple gigabytes of text into memory at once. If they're text files, you can stream them and check one line at a time.

var queryMatchingFiles =
    from file in fileList
    where file.Extension == ".htm" 
    let fileLines = File.ReadLines(file.FullName) // lazy IEnumerable<string>
    where fileLines.Any(line => line.Contains(searchTerm))
    select file.FullName;

2 Comments

Exactly what I needed.. Thank you very much!
Just make sure your search term doesn't include a line break ;)
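
If the search term can span a line break, a line-by-line check will miss it. One possible workaround, sketched below, is to read fixed-size chunks and carry the tail of each chunk over to the next read. The `ChunkedSearch` class, the `ContainsAcrossLines` method and the `chunkSize` default are hypothetical names and values chosen for illustration; they are not from the MSDN article or the answer above.

```csharp
using System;
using System.IO;

static class ChunkedSearch
{
    // Hypothetical helper: scans the file in fixed-size character chunks and
    // carries the last (term.Length - 1) characters into the next chunk, so a
    // match that straddles a chunk boundary or a line break is still found.
    public static bool ContainsAcrossLines(string path, string term, int chunkSize = 16 * 1024)
    {
        if (string.IsNullOrEmpty(term))
            throw new ArgumentException("Search term must not be empty.", "term");
        if (chunkSize < 1)
            throw new ArgumentOutOfRangeException("chunkSize");

        int overlap = term.Length - 1;
        var buffer = new char[overlap + chunkSize];
        int carried = 0; // characters kept from the previous chunk

        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(buffer, carried, chunkSize)) > 0)
            {
                int filled = carried + read;
                var window = new string(buffer, 0, filled);
                if (window.IndexOf(term, StringComparison.Ordinal) >= 0)
                    return true; // stop reading as soon as the term is found

                // Keep the tail of the window as the start of the next one.
                carried = Math.Min(overlap, filled);
                Array.Copy(buffer, filled - carried, buffer, 0, carried);
            }
        }
        return false;
    }
}
```

Note that the line break in the term has to match what is actually in the file ("\r\n" on most Windows-generated files), since nothing is normalized here.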
0

I suspect you are getting the out-of-memory error because, the way the query is written, the entire text of every file has to be loaded into memory, and I believe none of those objects can be released until the entire file set has been loaded. Could you check for the search term inside the GetFileText function and just return true or false?

If you did that, the file text would at least fall out of scope at the end of the function, and the GC could recover the memory. It would be better still to rewrite it as a streaming function if you are dealing with large files or large amounts of data: you could stop reading as soon as you come across the search term, and you would never need the entire file in memory.
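
A minimal sketch of what such a streaming function might look like, assuming the files are line-oriented text and the search term does not span a line break. The `FileSearch` class and `FileContainsTerm` method are hypothetical names, not part of the MSDN sample:

```csharp
using System;
using System.IO;

static class FileSearch
{
    // Returns true as soon as any line of the file contains searchTerm.
    // Only one line is held in memory at a time.
    public static bool FileContainsTerm(string path, string searchTerm)
    {
        try
        {
            using (var reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (line.Contains(searchTerm))
                    {
                        return true; // exit early, the rest of the file is never read
                    }
                }
            }
        }
        catch (FileNotFoundException)
        {
            // The file was deleted after the snapshot was taken: treat as no match.
        }
        catch (DirectoryNotFoundException)
        {
            // The containing folder disappeared: treat as no match.
        }
        return false;
    }
}
```

In the original query, the `let fileText = ...` and `where fileText.Contains(searchTerm)` clauses could then be replaced by a single `where FileSearch.FileContainsTerm(file.FullName, searchTerm)`. A file that consists of one enormous line would still be read into memory in full, so this only helps when the lines themselves are reasonably short.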

Previous question on finding a term in an HTML file using a stream

1 Comment

Objects used in previous iterations of LINQ queries are eligible for GC. But the stream approach is certainly reasonable.
