
I have a CSV file with some database benchmark results. Here is an example:

Date;dbms;type;description;W;D;S;results;time;id
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;570;265;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;420;215;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;500;365;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;530;255;50

Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;870;265;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;620;215;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;700;365;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;530;255;99

I need to process all rows that share the same id (the value of the last column) and get this:

Date;dbms;type;description;W;D;S;time;results;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;sum column 8;sum column 9;(sum column 8 / sum column 9)

For now I can only sum column 8 with this awk command:

awk -F";" '{print;sum+=$8 }END{print "sum " sum}' ./file.CSV

Edit:
I need help modifying a script I am already using. Here is the real input data:

Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;id
Mon Jun 15 14:53:41 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;272270;117508;50
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;280080;110063;50
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;144170;31815;60
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;10;137570;33910;60
Mon Jun 15 15:24:04 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;226660;97734;70
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;210420;95113;70
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;288360;119328;80
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;10;270360;124328;80

I need to sum the values in the time, TotalTransactions and NOTransactions columns, and then add a column with the value (sum of NOTransactions / sum of time).

I am using this script:

awk 'BEGIN {FS=OFS=";"}
(NR==1) {$10="results/time"; print $0} 
(NR>1 && NF) {sum7[$10]+=$7; sum8[$10]+=$8; sum9[$10]+=$9;  other[$10]=$0} 
 END {for (i in sum8) 
        {$0=other[i]; $7=sum7[i];$8=sum8[i]; $9=sum9[i]; $10=sprintf("%.0f", sum9[i]/sum7[i]); print}}' ./logsFinal.csv

which gives me this output:

;;;;;;;;;results/time
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;20;552350;227571;11379
Mon Jun 15 15:13:53 CEST 2015;sqlite;in-memory;TPC-C test results;5;1;20;281740;65725;3286
Mon Jun 15 15:34:08 CEST 2015;hsql;in-memory;TPC-C test results;2;1;20;437080;192847;9642
Mon Jun 15 15:44:16 CEST 2015;hsql;in-memory;TPC-C test results;5;1;20;558720;243656;12183
Date;dbms;type;description;W;D;0;0;0;-nan

The values look good (except the header row). But I need these results without the id column (I want to delete it). So instead of identifying the rows to process by equal values in the id column, the rows must be grouped by equal values in the dbms AND W AND D columns.

Hint: awk -F';' '{col8[$NF]+=$8; col9[$NF]+=$9} END {for (i in col8) print i, col8[i]}' file. It is also best to post the exact expected output, so that it is clear we all understand the same thing. Commented Jun 15, 2015 at 15:29

2 Answers


You can use this awk:

awk 'BEGIN{ FS=OFS=";" }
NR>1 && NF {
        s=""
        for(i=1; i<=7; i++)
           s=s $i OFS;
        a[$NF]=s;
        sum8[$NF]+=$8
        sum9[$NF]+=$9
} END{
   for (i in a)
       print a[i] sum8[i], sum9[i], (sum9[i]?sum8[i]/sum9[i]:"NaN")
}' file

which outputs:

Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273

1 Comment

Thanks. It was stupid of me not to use , as the field separator. However, I didn't notice that the average was also needed in the question.

This awk program will print the modified header and modify the output to contain the sums and their division:

awk 'BEGIN {FS=OFS=";"} 
     (NR==1) {$10="results/time"; print $0} 
     (NR>1 && NF) {sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0} 
     END {for (i in sum8) 
            {$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'

which gives:

Date;dbms;type;description;W;D;S;results;time;results/time
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;2020;1100;1.83636
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;2720;1100;2.47273

You don't seem to care for the ID in the result, but if you do, just replace $10= with $11=.
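As a sketch of that variant (bench.csv and its rows here are just a cut-down copy of the first sample in the question, used for illustration): keep the id in field 10 and append the ratio as an eleventh field.

```shell
# Cut-down copy of the question's first sample file (assumed layout).
cat > bench.csv <<'EOF'
Date;dbms;type;description;W;D;S;results;time;id
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;570;265;50
Mon Jun 15 14:22:20 CEST 2015;sqlite;on-disk;text;2;1;1;420;215;50
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;870;265;99
Mon Jun 15 14:22:20 CEST 2015;hsql;on-disk;text;2;1;1;620;215;99
EOF

# Same program as above, but writing into $11 so the id in $10 survives.
awk 'BEGIN {FS=OFS=";"}
     (NR==1) {$11="results/time"; print}
     (NR>1 && NF) {sum8[$10]+=$8; sum9[$10]+=$9; other[$10]=$0}
     END {for (i in sum8)
            {$0=other[i]; $8=sum8[i]; $9=sum9[i]
             $11=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}' bench.csv
```

Each output row then ends in ...;id;results/time, e.g. ...;990;480;50;2.0625 for the sqlite group.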

Also, if you need to sum things based on values of more than one column, you can create a temporary variable (a in the example below) which is a concatenation of two columns and use it as an index in the arrays, like this:

awk 'BEGIN {FS=OFS=";"}
     (NR==1) {$10="results/time"; print $0}
     (NR>1 && NF) {a=$5$6; sum8[a]+=$8; sum9[a]+=$9; other[a]=$0}
     END {for (i in sum8)
            {$0=other[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}'
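One caveat with plain concatenation like $5$6: distinct groups can collide, since (W=1, D=23) and (W=12, D=3) both become the key "123". awk's built-in SUBSEP variable exists precisely for composite array subscripts and avoids this. A minimal illustration (the two-line input is made up):

```shell
# Two rows whose fields concatenate to the same string "123".
# Plain concatenation merges them into one key; SUBSEP keeps two.
printf '1;23\n12;3\n' |
awk -F';' '{plain[$1 $2]++; safe[$1 SUBSEP $2]++}
           END {for (k in plain) p++; for (k in safe) s++; print p, s}'
# prints "1 2": one collided key vs. two distinct keys
```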

4 Comments

Your awk code seems to work fine for me. I don't get the problem with the header you mentioned.
What I get is: Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;results/time followed by Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;20;552350;227571;11379
See my edit, and please clean up your question then.
It's almost what I need, but the sum is keyed on the W and D columns only; I also need the dbms column. I changed your script to sum the time column as well and tried to key the sums on the dbms value too: awk 'BEGIN {FS=OFS=";"} (NR==1) {$10="results/time"; print $0} (NR>1 && NF) {a=$2; b=$5$6; c=$b$a; sum7[c]+=$7; sum8[c]+=$8; sum9[c]+=$9; other[c]=$0} END {for (i in sum8) {$0=other[i]; $7=sum7[i]; $8=sum8[i]; $9=sum9[i]; $10=(sum9[i]?sum8[i]/sum9[i]:"NaN"); print}}' ./input.csv but this gives me bad output.
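If it helps, one way to sketch the requested change against the real-format data: group on dbms, W and D (fields 2, 5 and 6) with SUBSEP, sum fields 7 through 9, and overwrite the id field with the ratio so no id appears in the output. The sample.csv rows are copied from the question; beyond this sample the script is untested:

```shell
# Subset of the real input from the question (assumed layout).
cat > sample.csv <<'EOF'
Date;dbms;type;description;W;D;time;TotalTransactions;NOTransactions;id
Mon Jun 15 14:53:41 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;272270;117508;50
Mon Jun 15 15:03:46 CEST 2015;sqlite;in-memory;TPC-C test results;2;1;10;280080;110063;50
Mon Jun 15 15:24:04 CEST 2015;hsql;in-memory;TPC-C test results;2;1;10;226660;97734;70
EOF

awk 'BEGIN {FS=OFS=";"}
     NR==1 {$10="results/time"; print; next}
     NF {
        k = $2 SUBSEP $5 SUBSEP $6        # composite key: dbms, W, D
        sum7[k]+=$7; sum8[k]+=$8; sum9[k]+=$9
        other[k]=$0                       # keep one representative row per group
     }
     END {
        for (k in sum7) {
            $0=other[k]
            $7=sum7[k]; $8=sum8[k]; $9=sum9[k]
            $10=(sum7[k] ? sprintf("%.0f", sum9[k]/sum7[k]) : "NaN")
            print
        }
     }' sample.csv
```

On this sample the two sqlite rows collapse into one row ending in ;20;552350;227571;11379, matching the values in the output shown in the question; the group order in the END loop is unspecified, so rows may come out in any order.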
