I have a file from which I would like to remove duplicate rows based on columns 1 and 6:

ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

The final output should look like this:

ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

So far, this is what I have tried:

awk '!a[$1 $6]++ { print ;}' input.csv > output.csv

I end up with:

ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

Any suggestion would be helpful. Thank you

  • If you use !a[$1, $6]++ (which is the better choice) instead of !a[$1 $6]++, do you still get the same wrong output? What is the field delimiter of your input file? What does file input.csv report as the file type? Commented Oct 14, 2022 at 16:11
  • I can't get the same failure as you with your code and data, although that command may be shortened to awk '!a[$1,$6]++' file. Commented Oct 14, 2022 at 16:12
  • It's a CSV file. That's strange: I still get 3 records with awk '!a[$1,$6]++' file. Commented Oct 14, 2022 at 16:22
  • @nbn It's not a CSV file; there does not seem to be a fixed field delimiter. By default, awk uses any sequence of spaces and/or tabs as the delimiter. If one of your fields contains spaces or tabs, awk will treat it as more than one field. You have no such data in the example in your text as far as I can see, unless ABC XYZ DOP or 2022-08-18 13:31:09Z is supposed to be a single field. Commented Oct 14, 2022 at 16:25
  • 2022-08-18 13:31:09Z is one field, but ABC, XYZ, and DOP are three different columns. Commented Oct 17, 2022 at 7:51
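
The points raised in the comments can be put together in a small sketch. It assumes the file really is comma-separated, as the sample in the question is shown here; if the real file uses tabs or another delimiter, the -F value would need to change accordingly. The file names and the array name seen are placeholders, and the sample data is a shortened copy of the rows from the question. This does not explain the third record the asker reports, which could not be reproduced from the posted data.

```shell
# Build a small comma-separated sample mirroring the question's data.
cat > input.csv <<'EOF'
ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
EOF

# -F, makes awk split on commas, so the space inside the timestamp
# stays within field 5 instead of starting a new field.
# seen[$1,$6] joins the two fields with SUBSEP, so key pairs such as
# ("1","23") and ("12","3") stay distinct, unlike the plain
# concatenation in seen[$1 $6].
# A line is printed only the first time its (field 1, field 6) pair appears.
awk -F, '!seen[$1,$6]++' input.csv > output.csv

cat output.csv
```

With an explicit delimiter, $1 and $6 refer to the ID and appession_id columns. Without -F, awk splits on whitespace only, so $6 would not be the column the question has in mind.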
