
Given a file with two columns:

Id  ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20

I need a way to coalesce all rows with the same Id into one row with the average height. In this case, (69 + 67 + 65 + 62 + 59) / 5 ≈ 64 and (29 + 26 + 21 + 20) / 4 = 24, so the output should be:

Id  Avg.ht
 510 64
 601 24

How can I do that using sed/awk/perl?

  • Are the same ids grouped together as shown in the sample? Commented Oct 1, 2012 at 22:18

4 Answers


Using awk:

The input file

$ cat FILE
Id  ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20

Awk in a shell:

$ awk '
    NR>1{                  # skip the header line
        arr[$1]   += $2    # running sum of heights per id
        count[$1] += 1     # number of rows seen per id
    }
    END{
        for (a in arr) {
            print "id avg " a " = " arr[a] / count[a]
        }
    }
' FILE

Or with Perl in a shell (-a autosplits each line into @F, -n loops over the input and -l handles line endings):

$ perl -lane '
    END {
        # report the average for every id seen
        foreach my $key (keys(%hash)) {
            print "id avg $key = " . $hash{$key} / $count{$key};
        }
    }
    # $. is the current line number, so this skips the header
    if ($. > 1) {
        $hash{$F[0]}  += $F[1];
        $count{$F[0]} += 1;
    }
' FILE

The output is:

id avg 601 = 24
id avg 510 = 64.4

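Note that for (a in arr) visits the keys in an unspecified order (the same goes for keys in the Perl version), which is why 601 comes out before 510 here. If you want the ids in order, one option, as a sketch, is to pipe the result through sort and order it numerically on the id field:

$ awk '
    NR>1 { arr[$1] += $2; count[$1] += 1 }
    END  { for (a in arr) print "id avg " a " = " arr[a] / count[a] }
' FILE | sort -k3,3n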
And finally, just for fun, an obfuscated Perl one-liner =)

perl -lane'END{for(keys(%h)){print"$_:".$h{$_}/$c{$_}}}($.>1)&&do{$h{$F[0]}+=$F[1];$c{$F[0]}++}' FILE

Another answer spells the same idea out as a standalone Perl script:

#!/usr/bin/perl
use strict;
use warnings;

my %sum_so_far;
my %count_so_far;
while ( <> ) {
    # Skip lines that don't start with a digit
    next if m/^[^\d]/;

    # Accumulate the sum and the count
    my @line = split();
    $sum_so_far{$line[0]}   += $line[1];
    $count_so_far{$line[0]} += 1;
}

# Dump the output
print "Id Avg.ht\n";
foreach my $id ( keys %count_so_far ) {
    my $avg = $sum_so_far{$id}/$count_so_far{$id};
    print " $id $avg\n";
}

Output:

ire@localhost$ perl make_average.pl input.txt 
Id Avg.ht
 510 64.4
 601 24

Note that your expected output is slightly off: the average for id 510 works out to 322 / 5 = 64.4, not 64.

Also, you have a letter l in one of your columns masquerading as the digit 1...


With GNU datamash:

datamash -H -s -g 1 mean 2 <file
GroupBy(Id) mean(ht)
510 64.4
601 24

This sorts the input (-s), groups it by the 1st field (-g 1) and calculates the mean of the 2nd field, preserving the header line (-H). By default datamash expects fields separated by a single tab; use -W (--whitespace) if they are separated by runs of spaces or tabs, or -t (--field-separator=) to set another separator (space, comma, etc.). Since datamash requires its input to be sorted by the group column, which is what -s does here, the output also comes out sorted by the grouped column.

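For the space-separated sample shown in the question you would need the whitespace variant, e.g. (a sketch; the grouping and mean are the same, only the separator handling changes):

datamash -W -H -s -g 1 mean 2 <file

which should report 64.4 for id 510 and 24 for id 601.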

Take a look at what is done here: http://www.sugihartono.com/programming/group-by-count-and-sorting-using-perl-script/

The essential, difficult part is the 'group by' operation; the linked script does that with a hash.

The linked example only calculates a sum, but getting the average is not much different: keep a per-id count alongside the per-id sum and divide at the end, as sketched below.

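A minimal sketch of that approach in the same perl -lane style used above (not the linked script verbatim): one hash accumulates the per-id sums, a second one counts the rows, and the END block divides the two.

perl -lane 'next if $. == 1; $sum{$F[0]} += $F[1]; $n{$F[0]}++;
    END { printf "%s %s\n", $_, $sum{$_} / $n{$_} for sort keys %sum }' FILE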