I have a CSV file with two columns (the first is a number, the second is a sequence of characters), like this small example:

45373,VAREAKAVVLRDRKSTRLN
1678,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP
20,SCTFAEGMLFEDCCGP

I want to look for similar sequences of characters in the 2nd field and, if they are similar, sum up the corresponding 1st column for those rows. For instance, in the example above the 2nd and 4th rows have similar sequences of characters (2nd field), so in the expected output below I summed the 1st field of those rows and put the result in a single row:

Expected output:

45373,VAREAKAVVLRDRKSTRLN
1698,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP

To get the expected output I wrote the following AWK command:

awk -F "," ({for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF)) { print } }' infile.csv > outfile.csv

But it does not return the expected output. Do you know how to fix it?

    Please say "identical" instead of "similar" if that's what you mean. Of course identical strings are trivially similar, too, but the word suggests that minor differences would be tolerated, and begs the question how much they are allowed to differ. Commented Nov 3, 2021 at 13:06

2 Answers

0

I would use GNU AWK for this task as follows. Let the content of file.txt be

45373,VAREAKAVVLRDRKSTRLN
1678,SCTFAEGMLFEDCCGP
1524,SCPHFWLAECGP
20,SCTFAEGMLFEDCCGP

then

awk 'BEGIN{FS=OFS=","}{arr[$2]+=$1}END{for(i in arr){print arr[i],i}}' file.txt

output

1524,SCPHFWLAECGP
1698,SCTFAEGMLFEDCCGP
45373,VAREAKAVVLRDRKSTRLN

Explanation: I set both the field separator (FS) and the output field separator (OFS) to ,. I use the array arr to store the total for each distinct $2, computed by adding the value of $1 on every line. In END I print each total together with its $2 using a for loop. Disclaimer: my solution assumes you are happy with any order of rows in the output, because for (i in arr) visits the keys in an unspecified order.

(gawk 4.2.1)
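If the rows should come out in the order in which each sequence first appears (as in the expected output in the question), one possible variant is to record every new $2 in a second, numerically indexed array. This is only a sketch: the function name sum_in_input_order and the array names sum and order are made up for illustration, and it assumes simple comma-separated input without quoted fields.

```shell
# Hypothetical helper: sum column 1 per distinct column 2,
# printing groups in the order they first appear in the input.
sum_in_input_order() {
  awk 'BEGIN { FS = OFS = "," }
       !($2 in sum) { order[++n] = $2 }   # first time this sequence is seen
       { sum[$2] += $1 }                  # running total per sequence
       END { for (i = 1; i <= n; i++) print sum[order[i]], order[i] }' "$@"
}

# Usage with the sample rows from the question.
printf '%s\n' \
  '45373,VAREAKAVVLRDRKSTRLN' \
  '1678,SCTFAEGMLFEDCCGP' \
  '1524,SCPHFWLAECGP' \
  '20,SCTFAEGMLFEDCCGP' | sum_in_input_order
```

The test ($2 in sum) does not create an array element, so it can safely run before the addition on the same line.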


0

If you don't care about the output order:

awk -F, -v OFS=, '{str[$2]+=$1} END {for(s in str) print str[s], s}' infile.csv
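If a reproducible order is wanted after all, one option is to pipe the result through sort. The sketch below orders the rows by the summed value, largest first; the function name sum_sorted_desc is hypothetical, and sorting by the total is just one possible choice, not something the question asks for.

```shell
# Hypothetical helper: same aggregation as above, then sort by the
# numeric total in column 1, descending, for a deterministic order.
sum_sorted_desc() {
  awk -F, -v OFS=, '{ str[$2] += $1 } END { for (s in str) print str[s], s }' "$@" |
    LC_ALL=C sort -t, -k1,1nr
}

# Usage with the sample rows from the question.
printf '%s\n' \
  '45373,VAREAKAVVLRDRKSTRLN' \
  '1678,SCTFAEGMLFEDCCGP' \
  '1524,SCPHFWLAECGP' \
  '20,SCTFAEGMLFEDCCGP' | sum_sorted_desc
```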

1 Comment

If you change -vOFS=, to -v OFS=, then that'll work in any awk, not just in GNU awk.
