7
\$\begingroup\$

I've got a method that parses a file. I take all the words and add them to a SortedSet. Every word contains a list of Lines that contain said word. Words are not strings but a class I created:

class Word : IComparable<Word>
{
    public Word()
    {
        Lines = new List<Line>();
    }
    public string WordStr { get; set; }
    public List<Line> Lines { get; set; }

    public int CompareTo(Word other)
    {
        return string.Compare(this.WordStr, other.WordStr, StringComparison.OrdinalIgnoreCase);
    }
}

And the method that parses the file (I suspect I am not doing something properly here):

private void AddFile(string path)
    {
        Regex regex = new Regex("[^A-Za-z0-9\\s\\']");
        FileInfo fi = new FileInfo(path);
        if (!fi.Exists || Files.Contains(path.ToLower())) //File does not exist or file already indexed
        {
            return;
        }
        Files.Add(path.ToLower());
        StreamReader sr = new StreamReader(path);
        string file = sr.ReadToEnd();
        string saniFile = regex.Replace(file, "");
        string[] saniLines = saniFile.Split(new char[]{'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
        int lineNo = 1;
        foreach (var l in saniLines)
        {
            Line line = new Line(l, path, lineNo);
            string[] words = l.Split(' ');
            foreach (var word in words)
            {
                Word w = new Word();
                w.WordStr = word;

                if (Words.Contains(w, new WordComparer())) //Set already contains the word
                {
                    Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
                    if (!wordToAdd.Lines.Contains(line))
                        wordToAdd.Lines.Add(line);
                }
                else
                {
                    w.Lines.Add(line);
                    Words.Add(w);
                }
            }
            lineNo++;
        }
    }

I have the exact same functionality working in C++ and it's orders of magnitudes faster. So is there something I am doing incorrectly? What if I used a SortedDictionary instead of a SortedSet for the words? Then the key could be the string that is the word and the value would be the list of lines that contain that word.

For reference, a 618KB text file takes a few seconds to parse and index in C++. It's taking me minutes to do it in C#.

\$\endgroup\$
1
  • \$\begingroup\$ I would change the foreach to a for if you can. I was trying to do a lot of small things really fast (few thousand in less than a second) and foreach added a great deal of overhead (causing it to run more than 1 second). I don't know if that would apply in a longer running situation. But I do know in general, for will be faster than foreach. I've seen benchmarks on SO where it's upwards of 5x faster in some situations. \$\endgroup\$ Commented May 6, 2011 at 2:10

2 Answers 2

14
\$\begingroup\$

Your instincts are basically right that maybe a dictionary of some sort could help. Let's look at some key lines of your code.

foreach (var l in saniLines)
{
    Line line = new Line(l, path, lineNo);
    string[] words = l.Split(' ');
    foreach (var word in words)
    {
        Word w = new Word();
        w.WordStr = word;

        if (Words.Contains(w, new WordComparer())) //Set already contains the word
        {
            Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
            if (!wordToAdd.Lines.Contains(line))
                wordToAdd.Lines.Add(line);

I see 5 nested loops here. If you step through the code, you'll see them, too. The outer 2 are obvious, you've coded them explicitly. But stepping through the code will reveal the looping being done at Words.Contains. The Linq query is also an abstracted loop. And wordToAdd.Lines.Contains is also going to be a loop. You're going to pay a high price for these.

Profiling will always help. But I would suggest perhaps changing Words to some sort of IDictionary<string, Word>, where you can replace

if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
    Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();

With something more like

if (Words.ContainsKey(w.WordStr))
{
    Word wordToAdd = Words[w.WordStr];
    // ...
}

And you could also change Lines into a HashSet<Line> instead of a list. That should improve the performance of

if (!wordToAdd.Lines.Contains(line))
    wordToAdd.Lines.Add(line);

In fact, HashSet<T>.Add(T val) could be used by itself, without the call to Contains, as Add returns a bool indicating if the addition could be performed. If a matching value is already in the set, it simply returns false without modifying its contents.

Try these changes, measure the performance before and after and see where you are.

Unrelated, but I also recommend that you wrap your StreamReader in a using () { } block so that the resource is properly disposed when you're through with it.

\$\endgroup\$
2
  • \$\begingroup\$ Wow, thank you so much. The same file actually indexes almost instantly now (after implementing your two changes). \$\endgroup\$ Commented May 6, 2011 at 3:18
  • \$\begingroup\$ @Pete, that's a pretty good boost! Glad to help. \$\endgroup\$ Commented May 6, 2011 at 3:21
3
\$\begingroup\$

Your best bet is to not speculate or make assumptions when it comes to performance. Grab a profiler and get some real data so you can make the correct decisions based on that.

\$\endgroup\$
3
  • \$\begingroup\$ Unfortunately, I'm not sure of how to "grab a profiler". I was really hoping someone could see something that could be improved. \$\endgroup\$ Commented May 6, 2011 at 1:09
  • 1
    \$\begingroup\$ @Pete: Google is your friend. \$\endgroup\$ Commented May 6, 2011 at 5:41
  • \$\begingroup\$ JetBrains (ReSharper guys) make one here: jetbrains.com/profiler Also - Visual Studio 2010 Premium has one built in. \$\endgroup\$ Commented May 6, 2011 at 16:50

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.