
I have been reading quite a lot about the parallel programming support in .NET 4, and I have to say that I am a bit confused about when to use it.

Here is my common scenario: I have been given the task of migrating a large number of XML files to a database.

Typically I have to:

  1. Read the XML files (100,000 or more) and order them numerically (each file is named 1.xml, 2.xml, etc.; see the ordering sketch after this list).
  2. Save to a database.
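
A note on the numeric ordering in step 1: a plain alphabetical sort puts 10.xml before 2.xml, so one way is to order on the parsed file name (a sketch only, assuming every file really is named <number>.xml and using the same myDirectory as in the code below):

using System.IO;
using System.Linq;

// Order by the numeric value of the name, not the string,
// so that 2.xml comes before 10.xml.
FileInfo[] ordered = myDirectory.EnumerateFiles("*.xml")
    .OrderBy(f => int.Parse(Path.GetFileNameWithoutExtension(f.Name)))
    .ToArray();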

I thought the above was a perfect candidate for parallel programming.

Conceptually, I would like to process many files at a time.

I am currently doing this:

private ResultEventArgs progressResults = new ResultEventArgs();

public void ExecuteInParallelTest()
{
    var sw = new Stopwatch();
    sw.Start();
    int index = 0;
    cancelToken = new CancellationTokenSource();
    var parOpts = new ParallelOptions();
    parOpts.CancellationToken = cancelToken.Token;
    parOpts.MaxDegreeOfParallelism = Environment.ProcessorCount; //Is this correct?

    FileInfo[] files = myDirectory.EnumerateFiles("*.xml").ToArray(); //Is this faster?
    TotalFiles = files.Count();
    try
    {
        Task t1 = Task.Factory.StartNew(() =>
        {
            try
            {
                Parallel.ForEach(files, parOpts, (file, loopState) =>
                {
                    // ThrowIfCancellationRequested already performs the
                    // IsCancellationRequested check internally.
                    cancelToken.Token.ThrowIfCancellationRequested();

                    // Capture the incremented value locally; writing it back
                    // to index would race with the other worker threads.
                    int processed = Interlocked.Increment(ref index);

                    ProcessFile(file, processed);

                    progressResults.Status = InProgress; // shared across threads

                    OnItemProcessed(TotalFiles, processed /*, etc.. */);
                });
            }
            catch (OperationCanceledException)
            {
                OnOperationCancelled(new ResultEventArgs
                {
                    Status = InProgress,
                    TotalCount = TotalFiles,
                    FileProcessed = index
                    //etc..
                });
            }

            //ContinueWith is used to sync with the UI when the task completes.
        }, cancelToken.Token).ContinueWith(result => OnOperationCompleted(new ProcessResultEventArgs
        {
            Status = InProgress,
            TotalCount = TotalFiles,
            FileProcessed = index
            //etc..
        }), CancellationToken.None, TaskContinuationOptions.None, TaskScheduler.FromCurrentSynchronizationContext());
    }
    catch (AggregateException ae)
    {
        //TODO:
    }
}

My questions: I am using .NET 4.0. Is using Parallel the best/simplest way to speed up the processing of these files? Is the above pseudocode good enough, or am I missing vital stuff, locking, etc.?

The most important question is: setting aside ProcessFile, which I cannot optimize as I have no control over it, is there room for optimization?

Should I partition the files into chunks, e.g. 1-1000, 1001-2000, 2001-3000? Would that improve performance, and how do you do that?
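
(For reference, the TPL expresses exactly this kind of chunking through a range partitioner; a minimal sketch, reusing the files array, parOpts and ProcessFile from the code above, and requiring System.Collections.Concurrent:)

// Each worker receives a contiguous [from, to) range of indices,
// e.g. 0-999, 1000-1999, ..., instead of individual files.
var ranges = Partitioner.Create(0, files.Length, 1000);

Parallel.ForEach(ranges, parOpts, range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        ProcessFile(files[i], i + 1);
    }
});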

Many thanks for any replies or links/code snippets that can help me better understand how I can improve the above code.

  • I would suggest pipelining this process; see this SO post. (A minimal sketch follows these comments.) Commented Jan 30, 2013 at 11:27
  • I would also consider not using threading when you have IO operations. Instead, use the Async CTP and await, which frees you from unnecessary threads. Have a look at this great webcast channel9.msdn.com/Shows/AppFabric-tv/… Commented Feb 4, 2013 at 12:33
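
For what the pipelining suggestion might look like on .NET 4, here is a minimal producer/consumer sketch with BlockingCollection; saveToDatabase stands in for the real per-file work:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class PipelineSketch
{
    public static void Run(FileInfo[] files, Action<FileInfo> saveToDatabase)
    {
        // Bounded queue so the reader cannot run arbitrarily far
        // ahead of the database writers.
        using (var queue = new BlockingCollection<FileInfo>(100))
        {
            var producer = Task.Factory.StartNew(() =>
            {
                foreach (var file in files)
                    queue.Add(file);
                queue.CompleteAdding(); // unblocks consumers when done
            });

            var consumers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    // GetConsumingEnumerable blocks until items arrive and
                    // completes once CompleteAdding has been called.
                    foreach (var file in queue.GetConsumingEnumerable())
                        saveToDatabase(file);
                });
            }

            producer.Wait();
            Task.WaitAll(consumers);
        }
    }
}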

2 Answers


The reason you are not receiving responses is that your code is so horribly wrong:

  • AsParallel() does not do anything for GetFiles().
  • files.Count() actually iterates the enumerable, so you read the files (or at least the directory) twice; and calling Count() first and then iterating later could also produce inconsistent counts if the directory is modified in between.
  • It does not look necessary to use Task.Factory.StartNew, since it is your only task (which spawns the parallel processing inside it).
  • Parallel.ForEach will encapsulate all OperationCanceledExceptions into a single AggregateException, and it will only do that after all parallel threads finish their work.
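
A simpler shape would be something like the following (a sketch only, keeping your ProcessFile and leaving the progress reporting out):

var cts = new CancellationTokenSource();
var options = new ParallelOptions
{
    CancellationToken = cts.Token,
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

// Enumerate once, keep the array, and use Length for the total.
FileInfo[] files = myDirectory.EnumerateFiles("*.xml").ToArray();
int totalFiles = files.Length;
int processed = 0;

try
{
    // No wrapping task: Parallel.ForEach blocks until all partitions
    // finish and honors the token supplied in options.
    Parallel.ForEach(files, options, file =>
    {
        options.CancellationToken.ThrowIfCancellationRequested();
        ProcessFile(file, Interlocked.Increment(ref processed));
    });
}
catch (OperationCanceledException)
{
    // Raised by Parallel.ForEach once the token in options is canceled.
}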


6 Comments

@Andrej tanas Hi, thanks for your comment!! Very valuable; that is why I posted the question, to get feedback. Could you provide a code snippet of how you would refactor the code? I am a bit confused about some of your comments and how I would address the issues. For starters, I need the total count for reporting. As for the parallel code, how would you improve it? Thanks
@Andrej Also, what I find intriguing in your answer is this: you are saying that Count() iterates again. So how do I avoid iterating? Also, you mention that GetFiles().AsParallel() does nothing. Why? In my GetFiles there is "directoryInfo.EnumerateFiles(pattern).ToArray();"
See this: link, regarding the IEnumerable.Count() extension method. If you are using Directory.GetFiles(), just don't use the Count() method; use the Length property of the returned string array.
A good explanation of how AsParallel() should be used can be found here: link
@Andrej Thanks for the link! It's not really something I did not know, but just to keep things in perspective: the iteration of the files when I get them, and the count after that, happen in a flash; the cost is negligible, and that is on 100,000 files. I have edited my code so that you can see it in full. I use EnumerateFiles, not GetFiles, and I was using Count(), not Length; I can change to Length. I have also reflected on the MS code: if there is a count it will return it, otherwise it iterates.
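
(To make the Count()/Length distinction above concrete, a short sketch assuming Directory.GetFiles and using System.Linq; someDir is a placeholder path:)

string[] paths = Directory.GetFiles(someDir, "*.xml");
int viaLength = paths.Length; // O(1): reads the array's length field
int viaCount = paths.Count(); // LINQ extension; cheap here only because it
                              // detects ICollection<T> and returns its Count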

I left the code as it is, as nobody provided me with a suitable answer.
