Revisions to Simple compression reloaded++

replaced http://stackoverflow.com/ with https://stackoverflow.com/

Source Link

edited May 23, 2017 at 12:40

Community Bot

1

Simplify write_count() – Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.

However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
```
  const int BIT_SHIFT = 7;
  const int LOWER_BITS = 0x7f;
  const int HIGH_BIT = 0x80;

  void write_count(uint64_t count)
  {
      int lower_bits = count & LOWER_BITS;

      while (lower_bits != count || count > LOWER_BITS) {
          write_char(lower_bits | HIGH_BIT);
          count >>= BIT_SHIFT;
          lower_bits = count & LOWER_BITS;
      }

      write_char(lower_bits);
  }
```
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.

Here is an example run showing the transition from 127 to 128 repetitions:
```
  echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 
  0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1 
```
That is 1 e, 127 (0x7f) a's, 128 b's (0x80 0x1), and 1 c.
Avoid magic numbers – In your version of write_count() you use the magic numbers of 5 and 8, whilst you've defined the NUM_LENGTH_BITS. Be consistent, and try to avoid most use of magic numbers. If you changed your NUM_LENGTH_BITS the rest of your code would not follow that change.
Avoid repeating your self – In write_char() the perror() & exit() part is repeated. If you change to using only printf() this can be avoided, and simplified into:
```
  void write_char(int c)
  {
      if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
          perror("Error printing to stdout.");
          exit(EXIT_FAILURE);
      }
  }
```
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the print_hex is awkward to pass as parameters everywhere. I tend to either uppercase them, or add G_ in front of them so that it becomes G_print_hex. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.
Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
- stricmp is limited to Windows – stricmp is Windows specific (see here here), so please use strcasecmp instead which is compliant with C99
- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
- If you add more parameters the argc == 2 fails – For test purposes I tried adding a -v parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).
```
A proper parameter handling would require a `for` loop, and should most likely at least include the `-x` / `--hex` parameters, as well as a `-h` / `--help` parameters. In addition to allowing no parameters.
```
Consider simplifying main() – A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.

I.e. if you renamed your current main() to encode(), you could easily add the decode() to the mix based on parameters. Having both of these within main() would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.
Try to reduce nesting of if's – This a little based on taste and personal opinions, but I tend to avoid using while (true) and break if possible, and try to keep the nesting of if's to a minimum.

In your case, this can be done using the following:
```
  while ((current_char = getchar()) != EOF) {

      if (current_char == previous_char) {
          current_char_count += 1;

      } else {

          if (current_char_count > 0) {
               write_count(current_char_count);
          }

          write_char(current_char);
          current_char_count = 1;
      }

      previous_char = current_char;
  }

  if (current_char_count > 0) {
      write_count(current_char_count);
  } 
```
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the if's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between the if and the else so that it is easier to see why we entered the else clause.

I've removed the reference related to MAX_COUNT as I consider it rather esoteric if you run this code on something with 2^64 repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additional write_count() at the end. Still I think this reads somewhat easier than your implementation.
Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in C, as I feel it make the code somewhat easier to read and slightly more compact.

Do however note that I still keep braces around one-line blocks, as you do.

Simplify write_count() – Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.

However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
```
  const int BIT_SHIFT = 7;
  const int LOWER_BITS = 0x7f;
  const int HIGH_BIT = 0x80;

  void write_count(uint64_t count)
  {
      int lower_bits = count & LOWER_BITS;

      while (lower_bits != count || count > LOWER_BITS) {
          write_char(lower_bits | HIGH_BIT);
          count >>= BIT_SHIFT;
          lower_bits = count & LOWER_BITS;
      }

      write_char(lower_bits);
  }
```
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.

Here is an example run showing the transition from 127 to 128 repetitions:
```
  echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 
  0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1 
```
That is 1 e, 127 (0x7f) a's, 128 b's (0x80 0x1), and 1 c.
Avoid magic numbers – In your version of write_count() you use the magic numbers of 5 and 8, whilst you've defined the NUM_LENGTH_BITS. Be consistent, and try to avoid most use of magic numbers. If you changed your NUM_LENGTH_BITS the rest of your code would not follow that change.
Avoid repeating your self – In write_char() the perror() & exit() part is repeated. If you change to using only printf() this can be avoided, and simplified into:
```
  void write_char(int c)
  {
      if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
          perror("Error printing to stdout.");
          exit(EXIT_FAILURE);
      }
  }
```
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the print_hex is awkward to pass as parameters everywhere. I tend to either uppercase them, or add G_ in front of them so that it becomes G_print_hex. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.
Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
- stricmp is limited to Windows – stricmp is Windows specific (see here), so please use strcasecmp instead which is compliant with C99
- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
- If you add more parameters the argc == 2 fails – For test purposes I tried adding a -v parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).
```
A proper parameter handling would require a `for` loop, and should most likely at least include the `-x` / `--hex` parameters, as well as a `-h` / `--help` parameters. In addition to allowing no parameters.
```
Consider simplifying main() – A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.

I.e. if you renamed your current main() to encode(), you could easily add the decode() to the mix based on parameters. Having both of these within main() would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.
Try to reduce nesting of if's – This a little based on taste and personal opinions, but I tend to avoid using while (true) and break if possible, and try to keep the nesting of if's to a minimum.

In your case, this can be done using the following:
```
  while ((current_char = getchar()) != EOF) {

      if (current_char == previous_char) {
          current_char_count += 1;

      } else {

          if (current_char_count > 0) {
               write_count(current_char_count);
          }

          write_char(current_char);
          current_char_count = 1;
      }

      previous_char = current_char;
  }

  if (current_char_count > 0) {
      write_count(current_char_count);
  } 
```
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the if's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between the if and the else so that it is easier to see why we entered the else clause.

I've removed the reference related to MAX_COUNT as I consider it rather esoteric if you run this code on something with 2^64 repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additional write_count() at the end. Still I think this reads somewhat easier than your implementation.
Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in C, as I feel it make the code somewhat easier to read and slightly more compact.

Do however note that I still keep braces around one-line blocks, as you do.

Simplify write_count() – Instead of counting all bits used, and splitting out 5 bits and then 8 bits at a time, I would opt for a simpler method to write the count, whilst still maintaining a variable byte count to write larger numbers. Before presenting that, I'm not sure if you really need to count the actual bits, as you always write bytes, so your implementation could most probably be simplified as well.

However if you use the 8th bit to indicate that more bits follows, there is no need to do all the precalculations, and you can do the following:
```
  const int BIT_SHIFT = 7;
  const int LOWER_BITS = 0x7f;
  const int HIGH_BIT = 0x80;

  void write_count(uint64_t count)
  {
      int lower_bits = count & LOWER_BITS;

      while (lower_bits != count || count > LOWER_BITS) {
          write_char(lower_bits | HIGH_BIT);
          count >>= BIT_SHIFT;
          lower_bits = count & LOWER_BITS;
      }

      write_char(lower_bits);
  }
```
This change will allow up to 127 repetitions to be represented using 2 bytes, instead of only 31 repetitions as in your original code. When using 3 bytes this implementation allows for 16384 repetitions (14 bits), whilst your allows 8192 repetitions (13 bits). After that your implementation is slightly more space efficient, but still slightly harder as you need to precalculate it.

Here is an example run showing the transition from 127 to 128 repetitions:
```
  echo -n "eaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc" | ./cr115765_rle -x 
  0x65 0x1 0x61 0x7f 0x62 0x80 0x1 0x63 0x1 
```
That is 1 e, 127 (0x7f) a's, 128 b's (0x80 0x1), and 1 c.
Avoid magic numbers – In your version of write_count() you use the magic numbers of 5 and 8, whilst you've defined the NUM_LENGTH_BITS. Be consistent, and try to avoid most use of magic numbers. If you changed your NUM_LENGTH_BITS the rest of your code would not follow that change.
Avoid repeating your self – In write_char() the perror() & exit() part is repeated. If you change to using only printf() this can be avoided, and simplified into:
```
  void write_char(int c)
  {
      if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
          perror("Error printing to stdout.");
          exit(EXIT_FAILURE);
      }
  }
```
Here I use a ternary to choose the proper format string, and I've also added a space to the end of the hex output to make it a little easier to read.
Be vary about globals – Global variables are in general to be avoided, but stuff like the print_hex is awkward to pass as parameters everywhere. I tend to either uppercase them, or add G_ in front of them so that it becomes G_print_hex. The general advice is though to make some sort of distinction so that you easily detect where you are using globals.
Improve argument handling – I would look into alternate ways of handling your parameters, as the current implementation is somewhat handicapped due to the following reasons:
- stricmp is limited to Windows – stricmp is Windows specific (see here), so please use strcasecmp instead which is compliant with C99
- If used without parameters it prints error message – As it stands it prints an error message if you don't use any parameters.
- If you add more parameters the argc == 2 fails – For test purposes I tried adding a -v parameter, but this turned out to be cumbersome as this required some rewriting so I ended up hard-coding it (and removing it at the end).
```
A proper parameter handling would require a `for` loop, and should most likely at least include the `-x` / `--hex` parameters, as well as a `-h` / `--help` parameters. In addition to allowing no parameters.
```
Consider simplifying main() – A common pattern is to have as simple main method as possible, that is to do parameter parsing and then calling the appopriate function to be executed. This allows for a simple entry point, and also easier reuse of functions if you extend your program.

I.e. if you renamed your current main() to encode(), you could easily add the decode() to the mix based on parameters. Having both of these within main() would not look tidy. This would also be a clearer segregation of duty, and better single responsibility design.
Try to reduce nesting of if's – This a little based on taste and personal opinions, but I tend to avoid using while (true) and break if possible, and try to keep the nesting of if's to a minimum.

In your case, this can be done using the following:
```
  while ((current_char = getchar()) != EOF) {

      if (current_char == previous_char) {
          current_char_count += 1;

      } else {

          if (current_char_count > 0) {
               write_count(current_char_count);
          }

          write_char(current_char);
          current_char_count = 1;
      }

      previous_char = current_char;
  }

  if (current_char_count > 0) {
      write_count(current_char_count);
  } 
```
To me this this clearer indicates the main purpose of reading characters until end of file, and by switching the if's around, I also clearer indicates that the main thingy is counting equal characters. And it is a clearer connection between the if and the else so that it is easier to see why we entered the else clause.

I've removed the reference related to MAX_COUNT as I consider it rather esoteric if you run this code on something with 2^64 repetitions of a single character. Another change is that I need to finish of the writing of the last count in an additional write_count() at the end. Still I think this reads somewhat easier than your implementation.
Optionally change brace style – This is totally a personal preference, and the main point is to keep bracing consistent, which you do! But I prefer having the opening brace on the same line in C, as I feel it make the code somewhat easier to read and slightly more compact.

Do however note that I still keep braces around one-line blocks, as you do.

Added note on using `-1` for previous_char

Source Link

edited Jan 5, 2016 at 0:13

holroy

11.8k
1
27
59

_{PS! Using previous_char=-1 one avoids a bug when testing this with dd if=/dev/zero bs=1 count=65, as getchar() only returns unsigned char's or EOF. But previous_char is declared as an int, and as such using -1 as a start value, is safer than the original 0.}

Changed previous_char=-1, to avoid test case related to if=/dev/zero

Source Link

edited Jan 5, 2016 at 0:06

holroy

11.8k
1
27
59

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;

bool G_print_hex = false;

void write_char(int c)
{
    if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
        perror("Error printing to stdout.");
        exit(EXIT_FAILURE);
    }
}


void write_count(uint64_t count)
{
    int lower_bits = count & LOWER_BITS;
   
    while (lower_bits != count || count > LOWER_BITS) {
        write_char(lower_bits | HIGH_BIT);
        count >>= BIT_SHIFT;
        lower_bits = count & LOWER_BITS;
    }

    write_char(lower_bits);
}
 

void parse_args(int argc, char **argv)
{
    if (argc == 2) {
        G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);

    } else if (argc != 1) {
        printf("Invalid number of arguments passed: %d\n", argc - 1);
        exit(EXIT_FAILURE);
    }
}


int main(int argc, char** argv)
{
    int current_char = 0;
    int previous_char = 0;-1; // A non-legal char, so it triggers a change on first character begin chr(0)
    uint64_t current_char_count = 0;

    parse_args(argc, argv);

    while ((current_char = getchar()) != EOF) {

        if (current_char == previous_char) {
            current_char_count += 1;

        } else {

            if (current_char_count > 0) {
                 write_count(current_char_count);
            }
           
            write_char(current_char);
            current_char_count = 1;
        }

        previous_char = current_char;
    }
  
    if (current_char_count > 0) {
        write_count(current_char_count);
    } 
}

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;

bool G_print_hex = false;

void write_char(int c)
{
    if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
        perror("Error printing to stdout.");
        exit(EXIT_FAILURE);
    }
}


void write_count(uint64_t count)
{
    int lower_bits = count & LOWER_BITS;
   
    while (lower_bits != count || count > LOWER_BITS) {
        write_char(lower_bits | HIGH_BIT);
        count >>= BIT_SHIFT;
        lower_bits = count & LOWER_BITS;
    }

    write_char(lower_bits);
}
 

void parse_args(int argc, char **argv)
{
    if (argc == 2) {
        G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);

    } else if (argc != 1) {
        printf("Invalid number of arguments passed: %d\n", argc - 1);
        exit(EXIT_FAILURE);
    }
}


int main(int argc, char** argv)
{
    int current_char = 0;
    int previous_char = 0;
    uint64_t current_char_count = 0;

    parse_args(argc, argv);

    while ((current_char = getchar()) != EOF) {

        if (current_char == previous_char) {
            current_char_count += 1;

        } else {

            if (current_char_count > 0) {
                 write_count(current_char_count);
            }
           
            write_char(current_char);
            current_char_count = 1;
        }

        previous_char = current_char;
    }
  
    if (current_char_count > 0) {
        write_count(current_char_count);
    } 
}

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

const int BIT_SHIFT = 7;
const int LOWER_BITS = 0x7f;
const int HIGH_BIT = 0x80;

bool G_print_hex = false;

void write_char(int c)
{
    if (printf(G_print_hex ? "0x%x " : "%c", c) < 0) {
        perror("Error printing to stdout.");
        exit(EXIT_FAILURE);
    }
}


void write_count(uint64_t count)
{
    int lower_bits = count & LOWER_BITS;
   
    while (lower_bits != count || count > LOWER_BITS) {
        write_char(lower_bits | HIGH_BIT);
        count >>= BIT_SHIFT;
        lower_bits = count & LOWER_BITS;
    }

    write_char(lower_bits);
}
 

void parse_args(int argc, char **argv)
{
    if (argc == 2) {
        G_print_hex = (strcasecmp(argv[1], "-x") == 0 || strcasecmp(argv[1], "--hex") == 0);

    } else if (argc != 1) {
        printf("Invalid number of arguments passed: %d\n", argc - 1);
        exit(EXIT_FAILURE);
    }
}


int main(int argc, char** argv)
{
    int current_char = 0;
    int previous_char = -1; // A non-legal char, so it triggers a change on first character begin chr(0)
    uint64_t current_char_count = 0;

    parse_args(argc, argv);

    while ((current_char = getchar()) != EOF) {

        if (current_char == previous_char) {
            current_char_count += 1;

        } else {

            if (current_char_count > 0) {
                 write_count(current_char_count);
            }
           
            write_char(current_char);
            current_char_count = 1;
        }

        previous_char = current_char;
    }
  
    if (current_char_count > 0) {
        write_count(current_char_count);
    } 
}

Source Link

answered Jan 4, 2016 at 12:31

holroy

11.8k
1
27
59

Loading

Stack Exchange Network

Return to Answer