I've written code to process a large binary file (more than 2 GB) by reading it in chunks of 1024 bytes. The file contains blocks of data, and the blocks are separated by the two-byte sequence 0x5D 0x5B (5D5B in hex).

The code works, but for large files it takes more than an hour and a half to run, whereas a roughly equivalent Ruby script finishes in less than 15 minutes.

You can test the code with the file "input.txt" below, and you'll see that it prints each block correctly. You can create "input.txt" with the File.WriteAllBytes() call at the top of the code, or create it in Notepad with the following content (without the double quotes):

"][How][many][words][we][have][here?][6][or][more?]"

I'm using the BinaryReader class and the Seek method to read the file in chunks. In this example the chunk size is 20 bytes, since the file contains only 50 bytes (with a large file I use 1024 bytes). Because the last block in a chunk may be incomplete, I look for the position where the last block begins within each chunk and store it in the variable lastPos, so that the next read can start from there.

Is there a way to improve my code to get a faster execution time?

I'm not sure whether the issue is BinaryReader itself or the thousands of seek operations. The end goal is to extract each block and parse it, but most of the time seems to be spent just separating the blocks.

static void Main(string[] args)
{
    File.WriteAllBytes("C:/input.txt", new byte[] { 0x5d, 0x5b, 0x48, 0x6f, 0x77, 0x5d, 0x5b, 0x6d, 0x61, 0x6e,
                                                    0x79, 0x5d, 0x5b, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5d, 0x5b,
                                                    0x77, 0x65, 0x5d, 0x5b, 0x68, 0x61, 0x76, 0x65, 0x5d, 0x5b,
                                                    0x68, 0x65, 0x72, 0x65, 0x3f, 0x5d, 0x5b, 0x36, 0x5d, 0x5b,
                                                    0x6f, 0x72, 0x5d, 0x5b, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5d } );

    using (BinaryReader br = new BinaryReader(File.Open("C:/input.txt", FileMode.Open)))
    {
        int lastPos = 0;
        int EachChunk = 20;
        long ReadFrom = 0;
        int c = 0;
        int count = 0;
        while(lastPos != -1 ) {
            lastPos = -1;
            br.BaseStream.Seek(ReadFrom, SeekOrigin.Begin);
            byte[] data = br.ReadBytes(EachChunk);

            //Loop to look for position of last block in current chunk
            int k = data.Length - 1;
            while(k > 0 && lastPos == -1) {
                lastPos = (data[k] == 91 && data[k-1] == 93 ? (k - 1) : (-1) );
                k--;
            }

            if (lastPos != -1) {
                Array.Resize(ref data, lastPos);
            } // Resizing array up to the last block position

            // Storing position of pointer where will begin next chunk
            ReadFrom += lastPos + 2;

            //Converting Binary data to string of hex numbers.
            SoapHexBinary shb = new SoapHexBinary(data);

            //Replace separator by Newline
            string str = shb.ToString().Replace("5D5B", Environment.NewLine);

            //Use StringReader to process each block as a line, using the newline as separator
            using (StringReader reader = new StringReader(str))
            {
                // Loop over the lines(blocks) in the string.
                string Block;
                count = c;
                while ((Block = reader.ReadLine()) != null)
                {
                    if ((String.IsNullOrWhiteSpace(Block) ||
                         String.IsNullOrEmpty(Block)) == false) {

                        // +++++ Further process for each block +++++++++++++++++++++++++
                        count++;
                        Console.WriteLine("Block # {0}: {1}", count, Block);
                        // ++++++++++++++++++++++++++++++++++++++++++++++++++
                    }
                }
            }
            c = count;
        }
    }
    Console.ReadLine();
}
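
One thing worth checking in the code above, separate from speed: the separator is searched for in the hex string, not in the bytes, so "5D5B" can also match across a byte boundary. The sketch below illustrates this with three bytes that contain no separator. It uses BitConverter.ToString instead of SoapHexBinary, on the assumption that both yield the same uppercase, unseparated hex text that the Replace("5D5B", ...) call relies on; the class and method names are made up for the example.

```csharp
using System;

static class HexSeparatorPitfall
{
    // Hypothetical helper: uppercase hex without separators, e.g. { 0x5D, 0x5B } -> "5D5B".
    static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "");
    }

    static void Main()
    {
        // No 0x5D 0x5B pair in these bytes...
        byte[] noSeparator = { 0x35, 0xD5, 0xB0 };

        // ...but the hex text is "35D5B0", which contains "5D5B"
        // starting in the middle of the first byte.
        string hex = ToHex(noSeparator);
        Console.WriteLine(hex);                  // prints 35D5B0
        Console.WriteLine(hex.Contains("5D5B")); // prints True
    }
}
```

Searching for the separator on the raw bytes, and converting to hex only after a block has been cut out, avoids this kind of false match.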

Update:

I found an issue. In Mike Burdick's code the buffer starts to grow when a 5B is found and is printed when a 5D is found. But the blocks are separated by the pair 0x5D 0x5B, so if a lone 5D or a lone 5B appears inside a block, the code starts loading or clearing the buffer at the wrong place. The buffer should only be started when the full sequence 5D5B is found, not when a 5B alone is found; otherwise the result is different.

You can test with this input, where I added a 5D or a 5B inside the blocks. To summarize: a new block starts only when 5D5B is found, since 5D5B acts like a "newline" separator.

    File.WriteAllBytes("C:/input1.txt", new byte[] {
                                        0x5D, 0x5B, 0x48, 0x5D, 0x77, 0x5D, 0x5B, 0x6d, 0x5B, 0x6e,
                                        0x5D, 0x5D, 0x5B, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5D, 0x5B,
                                        0x77, 0x65, 0x5D, 0x5B, 0x68, 0x61, 0x76, 0x65, 0x5D, 0x5B,
                                        0x68, 0x65, 0x72, 0x65, 0x3f, 0x5D, 0x5B, 0x36, 0x5D, 0x5B,
                                        0x6f, 0x72, 0x5D, 0x5B, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5D });

Update 2:

I've tried Mike Burdick's code, but it does not give the correct output. For example, if you change the content of the input file to this:

82-F][How]]][ma[ny]][words%][we][[have][here?]]

The output should be the following (shown in ASCII rather than hex so it is easier to read):

    82-F
    How]]
    ma[ny]
    words%
    we
    [have
    here?]]

Besides that, do you think BinaryReader is slow? When I test with a bigger file, the execution is still very slow.
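
The splitting rule described in this update can be sketched on an in-memory array. This is only an illustration of the rule (a block ends at every 0x5D immediately followed by 0x5B, and empty blocks are skipped), not a replacement for the file-reading code; SplitBlocks and the class name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class BlockSplitDemo
{
    // Splits data on the two-byte separator 0x5D 0x5B ("][").
    // A lone 0x5D or 0x5B is treated as ordinary data.
    static List<byte[]> SplitBlocks(byte[] data)
    {
        var blocks = new List<byte[]>();
        int start = 0; // index of the first byte of the current block

        for (int i = 0; i + 1 < data.Length; i++)
        {
            if (data[i] == 0x5D && data[i + 1] == 0x5B)
            {
                AddIfNotEmpty(blocks, data, start, i - start);
                start = i + 2; // next block begins after the separator
                i++;           // skip the 0x5B that was just consumed
            }
        }

        // Whatever follows the last separator is the final block.
        AddIfNotEmpty(blocks, data, start, data.Length - start);
        return blocks;
    }

    static void AddIfNotEmpty(List<byte[]> blocks, byte[] data, int offset, int length)
    {
        if (length <= 0) return;
        var block = new byte[length];
        Array.Copy(data, offset, block, 0, length);
        blocks.Add(block);
    }

    static void Main()
    {
        byte[] input = Encoding.ASCII.GetBytes("82-F][How]]][ma[ny]][words%][we][[have][here?]]");
        foreach (byte[] block in SplitBlocks(input))
        {
            Console.WriteLine(Encoding.ASCII.GetString(block));
        }
    }
}
```

For the sample input this prints the seven blocks listed above, from 82-F through here?]].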

Update #3:

I've been testing Mike Burdick's code, which I modified to handle a ] or [ that appears in the middle of a block. It may not be the best modification, but it seems to work. The only failure I've seen is that the last "]" is not printed when the file ends with "]".

For example, the same content as before: "][How][many][words][we][have][here?][6][or][more?]"

My modification of Mike Burdick's code is:

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();
            List<string> buffer1 = new List<string>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
                        if (data == startDelimiter && foundStartDelimiter == false)
                        {
                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else if ((data == startDelimiter && foundStartDelimiter) ||
                                 (data == endDelimiter && foundStartDelimiter == false))
                        {
                            buffer.Add(data);
                        }
                        else if (captureBytes)
                        {
                            buffer.Add(data);
                        }
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }
            DisplayWord(buffer, wordCount, hex);
        }
    }

  • Why do you do this: br.BaseStream.Seek(ReadFrom, SeekOrigin.Begin)? That could be where you are losing perf. Do you really need to reposition the stream? Commented Dec 17, 2014 at 1:29
  • Tell me what your requirements are when the input is malformed. Are there any other rules for the input stream I need to be aware of? Commented Dec 17, 2014 at 17:43
  • Hello Mike. The only rule to get the blocks is: a new block begins after each two-byte sequence "5D5B". Your code works when there is no 5D or 5B in the middle of a block. I mean, if the file contains 12F35D-5D5B-5D725DAB-5D5B-985B12DE, the blocks should be 5D725DAB and 985B12DE, but with your current code I'm getting 72, AB, 98, 12DE. I hope that makes sense. Thanks again. Commented Dec 18, 2014 at 4:51
  • Updated answer with slightly different parsing logic to meet your requirements... Commented Dec 19, 2014 at 22:24
  • Hi Mike, thanks for your help. Please see my original post, I've updated it below "Update 2". Thanks again Commented Dec 20, 2014 at 0:55

1 Answer

I think this is faster and much simpler in terms of code:

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
                        if (data == startDelimiter)
                        {
                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else if (captureBytes)
                        {
                            if (foundStartDelimiter)
                            {
                                buffer.Add(startDelimiter);
                            }

                            buffer.Add(data);
                        }
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }

            DisplayWord(buffer, wordCount, hex);
        }
    }
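
As the comments point out, the rule the asker wants is that only the pair 0x5D 0x5B separates blocks, and a lone 0x5D or 0x5B is data. A minimal sketch of a scanner for that rule follows. It is not the code above: it reads the stream sequentially with no Seek, remembers a trailing 0x5D across chunk boundaries, and also emits the bytes before the first separator as a block (as in the asker's "Update 2" example). The names StreamingBlockScanner, Scan and Flush, the 64 KB buffer size, and the callback shape are all assumptions made for the illustration, and no timing has been measured.

```csharp
using System;
using System.IO;

static class StreamingBlockScanner
{
    // Reads 'input' once, front to back, and calls onBlock for every
    // non-empty block delimited by 0x5D 0x5B. Returns the block count.
    public static int Scan(Stream input, Action<byte[]> onBlock, int bufferSize = 64 * 1024)
    {
        var block = new MemoryStream();
        var buffer = new byte[bufferSize];
        bool pending5D = false; // last byte seen was 0x5D, not yet classified
        int count = 0;
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];

                if (pending5D)
                {
                    pending5D = false;
                    if (b == 0x5B)
                    {
                        // 0x5D 0x5B: separator complete, emit the current block.
                        count += Flush(block, onBlock);
                        continue;
                    }
                    // The earlier 0x5D was not part of a separator: it is data.
                    block.WriteByte(0x5D);
                }

                if (b == 0x5D)
                    pending5D = true; // decide once the next byte is known
                else
                    block.WriteByte(b);
            }
        }

        // A 0x5D at the very end of the file belongs to the last block.
        if (pending5D) block.WriteByte(0x5D);
        count += Flush(block, onBlock);
        return count;
    }

    static int Flush(MemoryStream block, Action<byte[]> onBlock)
    {
        if (block.Length == 0) return 0;
        onBlock(block.ToArray());
        block.SetLength(0); // reuse the same stream for the next block
        return 1;
    }
}
```

A possible way to call it, converting each block to a hex string only after it has been separated:

```csharp
using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read,
                               64 * 1024, FileOptions.SequentialScan))
{
    int total = StreamingBlockScanner.Scan(fs,
        block => Console.WriteLine(BitConverter.ToString(block).Replace("-", "")));
    Console.WriteLine("Blocks: {0}", total);
}
```

Because the pending 0x5D is carried in a flag rather than looked up in the buffer, a separator split across two reads is still recognized.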

3 Comments

Hello Mike, many thanks for your help. I don't need to use SeekOrigin.Begin; it's only the way I found to make my algorithm work. Your code seems to work. I see you process one byte at a time; I'll test it with a big file to check performance. Does your code also convert from binary to a hex string one byte at a time with SoapHexBinary? I ask only because I don't see a direct reference to each byte. Thanks again.
When I see the 0x5b byte I start putting bytes into my buffer List<byte> instance until I see the 0x5d byte. Then I convert the contents of buffer to the hex string.
Hi Mike, thanks for the help. Could you please look at the update in my original post? I show a different input for which the code doesn't work. This happens when there is a 5D or 5B in the middle of a block and they are not in sequence, I mean when 5D is not followed by 5B. Thanks again.
