using AWK to edit a csv file [duplicate]

Question

I have a CSV file with two columns (the first is a number and the second is a sequence of characters), like this small example:

45373,VAREAKAVVLRDRKSTRLN
1678,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP
20,SCTFAEGMLFEDCCGP

I want to find similar sequences of characters in the 2nd field and, when rows have similar sequences, sum the corresponding 1st-column values. For instance, in the example above the 2nd and 4th rows have similar sequences (2nd field), so in the expected output below I summed the 1st field of those rows and put the result in a single row:

Expected output:

45373,VAREAKAVVLRDRKSTRLN
1698,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP

To get the expected output I wrote the following AWK command:

awk -F "," ({for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF)) { print } }' infile.csv > outfile.csv

But it does not return the expected output. Do you know how to fix it?

Please say "identical" instead of "similar" if that's what you mean. Of course identical strings are trivially similar, too, but the word suggests that minor differences would be tolerated, and begs the question how much they are allowed to differ.
– tripleee, Commented Nov 3, 2021 at 13:06

Daweo · Accepted Answer · 2021-11-03 12:47:11Z

I would use GNU AWK for this task as follows. Let the content of file.txt be

45373,VAREAKAVVLRDRKSTRLN
1678,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP
20,SCTFAEGMLFEDCCGP

then

awk 'BEGIN{FS=OFS=","}{arr[$2]+=$1}END{for(i in arr){print arr[i],i}}' file.txt

output

1524,SCPHFWLAECGP
1698,SCTFAEGMLFEDCCGP
45373,VAREAKAVVLRDRKSTRLN

Explanation: I set both the field separator (FS) and the output field separator (OFS) to a comma. I use the array arr to store the total for each $2, computed by adding the value of $1 on every line. In END I print every total and $2 pair using a for loop. Disclaimer: my solution assumes you are happy with any order of rows in the output.

(gawk 4.2.1)
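
If the input order does matter, one possible variant is sketched below. It is not part of the answer above: it keeps a second array (here called order, a name chosen only for illustration) that records each sequence the first time it appears, and the END block walks that array instead of using for (i in arr). The temporary directory and file names are assumptions made so the snippet runs on its own.

```shell
# Hypothetical scratch directory so the sketch is self-contained.
workdir=$(mktemp -d)

# Sample data copied from the question.
printf '%s\n' \
  '45373,VAREAKAVVLRDRKSTRLN' \
  '1678,SCTFAEGMLFEDCCGP' \
  '1524,SCPHFWLAECGP' \
  '20,SCTFAEGMLFEDCCGP' > "$workdir/infile.csv"

# Sum column 1 per distinct column 2, printing in first-seen order.
awk 'BEGIN { FS = OFS = "," }
     !($2 in sum) { order[++n] = $2 }   # remember each new sequence once
     { sum[$2] += $1 }                  # accumulate the number for that sequence
     END { for (i = 1; i <= n; i++) print sum[order[i]], order[i] }' \
    "$workdir/infile.csv" > "$workdir/outfile.csv"

cat "$workdir/outfile.csv"
```

On the sample data this prints the three rows in the same order as the expected output in the question, with 1698 as the total for SCTFAEGMLFEDCCGP.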

Renaud Pacalet · Accepted Answer · 2021-11-03 19:34:15Z

If you don't care about the output order:

awk -F, -v OFS=, '{str[$2]+=$1} END {for(s in str) print str[s], s}' infile.csv
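
If you would rather have a repeatable order than an arbitrary one, a possible approach (a sketch, not something the answer itself proposes) is to pipe the result through sort. The example below orders rows by the summed number, largest first; the scratch directory and the variable name sorted are assumptions for illustration.

```shell
# Hypothetical scratch directory so the sketch is self-contained.
workdir=$(mktemp -d)

# Sample data copied from the question.
printf '%s\n' \
  '45373,VAREAKAVVLRDRKSTRLN' \
  '1678,SCTFAEGMLFEDCCGP' \
  '1524,SCPHFWLAECGP' \
  '20,SCTFAEGMLFEDCCGP' > "$workdir/infile.csv"

# Same aggregation as above, then a numeric, descending sort on field 1.
sorted=$(awk -F, -v OFS=, '{str[$2]+=$1} END {for(s in str) print str[s], s}' \
           "$workdir/infile.csv" | sort -t, -k1,1nr)

printf '%s\n' "$sorted"
```

Sorting on the second field instead (sort -t, -k2,2) would work the same way if you prefer alphabetical order by sequence.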

edited Nov 3, 2021 at 19:34

answered Nov 3, 2021 at 12:46


1 Comment

Ed Morton Over a year ago

If you change -vOFS=, to -v OFS=, then that'll work in any awk, not just in GNU awk.
