
I have a file that looks like this:

2017-07-30 A
2017-07-30 B
2017-07-30 B
2017-07-30 A
2017-07-30 A
2017-07-30 C
2017-07-31 A
2017-07-31 B
2017-07-31 C
2017-07-31 B
2017-07-31 C

Each line represents an event (A, B, or C) and the day it occurred on. I want to count the number of events per type for each day. This can be done with sort file | uniq -c, giving output like this:

      3 2017-07-30 A
      2 2017-07-30 B
      1 2017-07-30 C
      1 2017-07-31 A
      2 2017-07-31 B
      2 2017-07-31 C

However, I would like to have each event type as a column:

              A    B    C
2017-07-30    3    2    1
2017-07-31    1    2    2

Is there a reasonably common command line tool that can do this? If necessary, it can be assumed that all event types (A, B, C) are known in advance, but it's better if it isn't necessary. Likewise it can be assumed that each event occurs at least once per day (meaning no zeros in the output), but here too it's better if it isn't necessary.

  • If you do a lot of data processing and aggregation, you might consider setting up a real database. Postgres is free, mature, and very fully featured. (I do such processing in a sandbox Postgres instance, which is fairly easy to set up.) Commented Aug 9, 2017 at 22:49

5 Answers


If "reasonably common" includes GNU datamash, then

datamash -Ws crosstab 1,2 < file

ex.

$ datamash -Ws crosstab 1,2 < file
    A   B   C
2017-07-30  3   2   1
2017-07-31  1   2   2

(unfortunately the formatting of this site doesn't preserve tabs; the actual output is tab-aligned).

  • it gives me datamash: invalid operation 'crosstab' Commented Aug 9, 2017 at 15:17
  • @Roman Needs version 1.1.0. Looks like Ubuntu has been on 1.0.7 for almost two years. Commented Aug 9, 2017 at 15:26
  • @ViktorDahl, yes, the version is datamash (GNU datamash) 1.0.7. It's worth adding a note, I suppose. Commented Aug 9, 2017 at 15:27
  • Note that the site does actually preserve formatting, it looks exactly like this in the terminal as well. However, this is, despite appearances, correctly aligned. It's just that 2017-07-30 is longer than a tab so it looks wrong. If you use something like awk or cut to process it, however, everything will be fine. Commented Aug 9, 2017 at 15:40
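Since the output is tab-separated, one way to get visually aligned columns is to pipe it through expand(1) with a tab stop wider than the date field. A sketch, assuming GNU datamash >= 1.1.0 is installed (the sample input is recreated inline):

```shell
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file

# crosstab output is tab-separated; expand to 12-column tab stops
# so the 10-character dates line up under a uniform grid
datamash -Ws crosstab 1,2 < file | expand -t 12
```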

awk solution:

awk '{ d[$1]; k[$2]; a[$2,$1]++ }END{ 
       printf("%10s"," ");
       for(i in k) printf("\t%s",i); print ""; 
       for(j in d) { 
           printf("%-10s",j); 
           for(i in k) printf("\t%d",a[i,j]); print "" 
       } }' file

The output:

            A   B   C
2017-07-30  3   2   1
2017-07-31  1   2   2
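One caveat: for (i in k) visits keys in an unspecified order in POSIX awk, so the columns and rows may come out unsorted. With GNU awk, PROCINFO["sorted_in"] forces sorted traversal. A sketch of the same program with that tweak (gawk-specific; sample input recreated inline):

```shell
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file

gawk '
{ d[$1]; k[$2]; a[$2,$1]++ }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: sorted key traversal
    printf("%10s", " ")
    for (i in k) printf("\t%s", i)
    print ""
    for (j in d) {
        printf("%-10s", j)
        for (i in k) printf("\t%d", a[i,j])
        print ""
    }
}' file
```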

Perl solution. First, a shorter version that leaves missing counts blank rather than filling them with zeros:

perl -lane '
   ++$h{$i[!$h{$F[0]} ? @i : -1]=$F[0]}{$F[1]}}{
   print join "\t", "\t", @h = sort keys %{ +{ map { map { $_ => 1 } keys %$_ } values %h } };
   print join "\t", $_, @{$h{$_}}{@h} for @i;
' yourfile

And a longer version that prints an explicit 0 for missing counts:

perl -lane '
   $i[@i]=$F[0] unless $h{$F[0]};
   ++$h{$F[0]}{$F[1]}}{
   @h = sort keys %{ +{ map { map { $_ => 1 } keys %$_ } values %h } };
   print join "\t", "\t", @h;
   for my $date ( @i ) {
      my $href = $h{$date};
      print join "\t", $date, map { $href->{$_} || 0 } @h;
   }
' yourfile

Results

                A       B       C
2017-07-30      3       2       1
2017-07-31      1       2       2

Data Structures:

  • Hash %h, whose keys are the dates and whose values are sub-hashes keyed by the event names (A, B, C, etc.), with the respective counts for that date as values.

  %h = (
       2017-07-30 => {
           A => 3,
           B => 2,
           C => 1,
       },
       ...
  );

  • Array @i, which stores the dates in the order they were first encountered. A date is pushed onto @i only the first time it is seen, so the array position preserves input order.
  • Array @h, which holds the sorted, de-duplicated event names collected from all the sub-hashes of %h.

Plain old bash version, using arrays.

#!/bin/bash
declare -A values letters dates
while read -r date letter; do
 values[$date$letter]=$(( ${values[$date$letter]} + 1 ))
 letters[$letter]=1
 dates["$date"]=1
done <file.txt
echo ' ' ${!letters[@]} | sed 's/ /\t/g'
for date in ${!dates[@]}; do
 printf "%-8s\t" $date
 for letter in ${!letters[@]}; do
  printf "%s\t" ${values[$date$letter]}
 done
 echo
done
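One caveat with the script above: when a date/letter pair never occurs, the unquoted empty expansion makes printf drop the argument entirely, so later columns shift left. A sketch of the same approach that keeps the grid rectangular by defaulting missing counts to 0 with ${var:-0} (note that bash iterates associative arrays in an unspecified order, so rows and columns may come out unsorted):

```shell
#!/bin/bash
# Recreate the question's sample input
printf '%s\n' '2017-07-30 A' '2017-07-30 B' '2017-07-30 B' \
  '2017-07-30 A' '2017-07-30 A' '2017-07-30 C' '2017-07-31 A' \
  '2017-07-31 B' '2017-07-31 C' '2017-07-31 B' '2017-07-31 C' > file.txt

declare -A values letters dates
while read -r date letter; do
  # a comma in the key avoids ambiguous concatenated keys;
  # ${var:-0} makes the very first increment start from 0
  values[$date,$letter]=$(( ${values[$date,$letter]:-0} + 1 ))
  letters[$letter]=1
  dates[$date]=1
done < file.txt

printf '\t%s' "${!letters[@]}"; echo
for date in "${!dates[@]}"; do
  printf '%s' "$date"
  for letter in "${!letters[@]}"; do
    # unseen combinations print an explicit 0 instead of vanishing
    printf '\t%s' "${values[$date,$letter]:-0}"
  done
  echo
done
```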

Usage: ./count.awk input.txt | column -t -n

#!/usr/bin/gawk -f

{
    dates[$1] = $1;
    events[$2] = $2;
    numbers[$1][$2]++;
}

END {
    num_dates=asort(dates);
    num_events=asort(events);

    for (i = 1; i <= num_events; i++) {
        printf " %s", events[i];
    }
    print "";

    for (i = 1; i <= num_dates; i++ ) {
        printf "%s ", dates[i];
        for (j = 1; j <= num_events; j++) {
            printf "%s ", numbers[dates[i]][events[j]];
        }
        print "";
    }
}

Testing:

Input (deliberately complicated for testing)

2017-07-30 A
2017-07-30 D
2017-07-29 D
2017-07-30 B
2017-07-28 E
2017-07-30 B
2017-07-30 A
2017-07-30 A
2017-07-30 C
2017-07-31 A
2017-07-31 B
2017-07-31 C
2017-07-31 B 
2017-07-31 C

Output

            A  B  C  D  E
2017-07-28              1  
2017-07-29           1     
2017-07-30  3  2  1  1     
2017-07-31  1  2  2        
