
I have a file that looks like this:

2017-07-30 A
2017-07-30 B
2017-07-30 B
2017-07-30 A
2017-07-30 A
2017-07-30 C
2017-07-31 A
2017-07-31 B
2017-07-31 C
2017-07-31 B
2017-07-31 C

Each line represents an event (A, B, or C) and the day it occurred on. I want to count the number of events per type for each day. This can be done with sort file | uniq -c, giving output like this:

      3 2017-07-30 A
      2 2017-07-30 B
      1 2017-07-30 C
      1 2017-07-31 A
      2 2017-07-31 B
      2 2017-07-31 C

However, I would like to have each event type as a column:

              A    B    C
2017-07-30    3    2    1
2017-07-31    1    2    2

Is there a reasonably common command line tool that can do this? If necessary, it can be assumed that all event types (A, B, C) are known in advance, but it's better if it isn't necessary. Likewise it can be assumed that each event occurs at least once per day (meaning no zeros in the output), but here too it's better if it isn't necessary.

  • If you do a lot of data processing and aggregation, you might consider setting up a real database. Postgres is free, mature, and very fully featured. (I do such processing in a sandbox Postgres instance, which is fairly easy to set up.) Commented Aug 9, 2017 at 22:49

5 Answers


If "reasonably common" includes GNU datamash, then

datamash -Ws crosstab 1,2 < file

ex.

$ datamash -Ws crosstab 1,2 < file
    A   B   C
2017-07-30  3   2   1
2017-07-31  1   2   2

(unfortunately the formatting of this site doesn't preserve tabs; the actual output is tab-aligned).

  • it gives me datamash: invalid operation 'crosstab' Commented Aug 9, 2017 at 15:17
  • @Roman Needs version 1.1.0. Looks like Ubuntu has been on 1.0.7 for almost two years. Commented Aug 9, 2017 at 15:26
  • @ViktorDahl, yes, the version is datamash (GNU datamash) 1.0.7. It's worth adding a note, I suppose. Commented Aug 9, 2017 at 15:27
  • Note that the site does actually preserve formatting, it looks exactly like this in the terminal as well. However, this is, despite appearances, correctly aligned. It's just that 2017-07-30 is longer than a tab so it looks wrong. If you use something like awk or cut to process it, however, everything will be fine. Commented Aug 9, 2017 at 15:40
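Since the output is tab-separated, one way to get visually aligned columns is to pipe it through expand(1) with a tab stop wider than the date field. A sketch, assuming GNU datamash >= 1.1.0 is installed (the sample input is recreated inline):

```shell
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file

# crosstab output is tab-separated; expand to 12-column tab stops
# so the 10-character dates line up under a uniform grid
datamash -Ws crosstab 1,2 < file | expand -t 12
```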

awk solution:

awk '{ d[$1]; k[$2]; a[$2,$1]++ }END{ 
       printf("%10s"," ");
       for(i in k) printf("\t%s",i); print ""; 
       for(j in d) { 
           printf("%-10s",j); 
           for(i in k) printf("\t%d",a[i,j]); print "" 
       } }' file

The output:

            A   B   C
2017-07-30  3   2   1
2017-07-31  1   2   2
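One caveat: for (i in k) visits keys in an unspecified order in POSIX awk, so the columns and rows may come out unsorted. With GNU awk, PROCINFO["sorted_in"] forces sorted traversal. A sketch of the same program with that tweak (gawk-specific; sample input recreated inline):

```shell
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file

gawk '
{ d[$1]; k[$2]; a[$2,$1]++ }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: sorted key traversal
    printf("%10s", " ")
    for (i in k) printf("\t%s", i)
    print ""
    for (j in d) {
        printf("%-10s", j)
        for (i in k) printf("\t%d", a[i,j])
        print ""
    }
}' file
```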

Perl solution. First, a shorter version that leaves missing counts blank rather than filling them with zeros:

perl -lane '
   ++$h{$i[!$h{$F[0]} ? @i : -1]=$F[0]}{$F[1]}}{
   print join "\t", "\t", @h = sort keys %{ +{ map { map { $_ => 1 } keys %$_ } values %h } };
   print join "\t", $_, @{$h{$_}}{@h} for @i;
' yourfile

And a longer version that prints an explicit 0 for missing counts:

perl -lane '
   $i[@i]=$F[0] unless $h{$F[0]};
   ++$h{$F[0]}{$F[1]}}{
   @h = sort keys %{ +{ map { map { $_ => 1 } keys %$_ } values %h } };
   print join "\t", "\t", @h;
   for my $date ( @i ) {
      my $href = $h{$date};
      print join "\t", $date, map { $href->{$_} || 0 } @h;
   }
' yourfile

Results

                A       B       C
2017-07-30      3       2       1
2017-07-31      1       2       2

Data Structures:

  • Hash %h, whose keys are the dates and whose values are sub-hashes keyed by the event names (A, B, C, etc.), with the respective counts for that date as values.

  %h = (
       2017-07-30 => {
           A => 3,
           B => 2,
           C => 1,
       },
       ...
  );

  • Array @i, which stores the dates in the order they were first encountered. A date is pushed onto @i only the first time it is seen, so the array position preserves input order.
  • Array @h, which holds the sorted, de-duplicated event names collected from all the sub-hashes of %h.

Plain old bash version, using arrays.

#!/bin/bash
declare -A values letters dates
while read -r date letter; do
 values[$date$letter]=$(( ${values[$date$letter]} + 1 ))
 letters[$letter]=1
 dates["$date"]=1
done <file.txt
echo ' ' ${!letters[@]} | sed 's/ /\t/g'
for date in ${!dates[@]}; do
 printf "%-8s\t" $date
 for letter in ${!letters[@]}; do
  printf "%s\t" ${values[$date$letter]}
 done
 echo
done
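One caveat with the script above: when a date/letter pair never occurs, the unquoted empty expansion makes printf drop the argument entirely, so later columns shift left. A sketch of the same approach that keeps the grid rectangular by defaulting missing counts to 0 with ${var:-0} (note that bash iterates associative arrays in an unspecified order, so rows and columns may come out unsorted):

```shell
#!/bin/bash
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file.txt

declare -A values letters dates
while read -r date letter; do
  # a comma in the key avoids ambiguous concatenated keys;
  # ${var:-0} makes the very first increment start from 0
  values[$date,$letter]=$(( ${values[$date,$letter]:-0} + 1 ))
  letters[$letter]=1
  dates[$date]=1
done < file.txt

printf '\t%s' "${!letters[@]}"; echo
for date in "${!dates[@]}"; do
  printf '%s' "$date"
  for letter in "${!letters[@]}"; do
    # unseen combinations print an explicit 0 instead of vanishing
    printf '\t%s' "${values[$date,$letter]:-0}"
  done
  echo
done
```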

Usage: ./count.awk input.txt | column -t -n

#!/usr/bin/gawk -f

{
    dates[$1] = $1;
    events[$2] = $2;
    numbers[$1][$2]++;
}

END {
    num_dates=asort(dates);
    num_events=asort(events);

    for (i = 1; i <= num_events; i++) {
        printf " %s", events[i];
    }
    print "";

    for (i = 1; i <= num_dates; i++ ) {
        printf "%s ", dates[i];
        for (j = 1; j <= num_events; j++) {
            printf "%s ", numbers[dates[i]][events[j]];
        }
        print "";
    }
}

Testing:

Input (deliberately complicated for testing)

2017-07-30 A
2017-07-30 D
2017-07-29 D
2017-07-30 B
2017-07-28 E
2017-07-30 B
2017-07-30 A
2017-07-30 A
2017-07-30 C
2017-07-31 A
2017-07-31 B
2017-07-31 C
2017-07-31 B 
2017-07-31 C

Output

            A  B  C  D  E
2017-07-28              1  
2017-07-29           1     
2017-07-30  3  2  1  1     
2017-07-31  1  2  2        
