
Given a file with two columns:

Id  ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20

I need a way to coalesce all rows with the same Id into one row with the average height. In this case, (69 + 67 + 65 + 62 + 59) / 5 ≈ 64 and (29 + 26 + 21 + 20) / 4 = 24, so the output should be:

Id  Avg.ht
 510 64
 601 24

How can I do that using sed/awk/perl?

  • Are the same ids grouped together as shown in the sample? Commented Oct 1, 2012 at 22:18

4 Answers


Using awk:

The input file

$ cat FILE
Id  ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20

Awk in a shell:

$ awk '
    NR>1{                  # skip the header line
        arr[$1]   += $2    # running sum of heights per id
        count[$1] += 1     # number of rows seen per id
    }
    END{
        for (a in arr) {
            print "id avg " a " = " arr[a] / count[a]
        }
    }
' FILE

Or with Perl in a shell (-a autosplits each line into @F, -n loops over the input and -l handles line endings):

$ perl -lane '
    END {
        # report the average for every id seen
        foreach my $key (keys(%hash)) {
            print "id avg $key = " . $hash{$key} / $count{$key};
        }
    }
    # $. is the current line number, so this skips the header
    if ($. > 1) {
        $hash{$F[0]}  += $F[1];
        $count{$F[0]} += 1;
    }
' FILE

The output is:

id avg 601 = 24
id avg 510 = 64.4

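Note that for (a in arr) visits the keys in an unspecified order (the same goes for keys in the Perl version), which is why 601 comes out before 510 here. If you want the ids in order, one option, as a sketch, is to pipe the result through sort and order it numerically on the id field:

$ awk '
    NR>1 { arr[$1] += $2; count[$1] += 1 }
    END  { for (a in arr) print "id avg " a " = " arr[a] / count[a] }
' FILE | sort -k3,3n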
And finally, just for fun, an obfuscated Perl one-liner =)

perl -lane'END{for(keys(%h)){print"$_:".$h{$_}/$c{$_}}}($.>1)&&do{$h{$F[0]}+=$F[1];$c{$F[0]}++}' FILE

Another answer spells the same idea out as a standalone Perl script:

#!/usr/bin/perl
use strict;
use warnings;

my %sum_so_far;
my %count_so_far;
while ( <> ) {
    # Skip lines that don't start with a digit
    next if m/^[^\d]/;

    # Accumulate the sum and the count
    my @line = split();
    $sum_so_far{$line[0]}   += $line[1];
    $count_so_far{$line[0]} += 1;
}

# Dump the output
print "Id Avg.ht\n";
foreach my $id ( keys %count_so_far ) {
    my $avg = $sum_so_far{$id}/$count_so_far{$id};
    print " $id $avg\n";
}

Output:

ire@localhost$ perl make_average.pl input.txt 
Id Avg.ht
 510 64.4
 601 24

Note that your expected output is slightly off: the average for id 510 works out to 322 / 5 = 64.4, not 64.

Also, you have a letter l in one of your columns masquerading as the digit 1...


With GNU datamash:

datamash -H -s -g 1 mean 2 <file
GroupBy(Id) mean(ht)
510 64.4
601 24

This sorts the input (-s), groups it by the 1st field (-g 1) and calculates the mean of the 2nd field, preserving the header line (-H). By default datamash expects fields separated by a single tab; use -W (--whitespace) if they are separated by runs of spaces or tabs, or -t (--field-separator=) to set another separator (space, comma, etc.). Since datamash requires its input to be sorted by the group column, which is what -s does here, the output also comes out sorted by the grouped column.

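For the space-separated sample shown in the question you would need the whitespace variant, e.g. (a sketch; the grouping and mean are the same, only the separator handling changes):

datamash -W -H -s -g 1 mean 2 <file

which should report 64.4 for id 510 and 24 for id 601.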

Take a look at what is done here: http://www.sugihartono.com/programming/group-by-count-and-sorting-using-perl-script/

The essential, difficult part is the 'group by' operation; the linked script does that with a hash.

The linked example only calculates a sum, but getting the average is not much different: keep a per-id count alongside the per-id sum and divide at the end, as sketched below.

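A minimal sketch of that approach in the same perl -lane style used above (not the linked script verbatim): one hash accumulates the per-id sums, a second one counts the rows, and the END block divides the two.

perl -lane 'next if $. == 1; $sum{$F[0]} += $F[1]; $n{$F[0]}++;
    END { printf "%s %s\n", $_, $sum{$_} / $n{$_} for sort keys %sum }' FILE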