0

I am trying to remove duplicate comma-separated entries from each column of every row.

Input:
rs10993127  9:94266397-94266397,9:94266397-94266397 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169
rs11533012  9:94267817-94267817,9:94267817-94267817 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169

Desired output:
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

My code:
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'

Thank you!

6
  • Have you tried anything on your own? Commented Jul 3, 2020 at 9:53
  • Sorry, I forgot to post my code. awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}' Commented Jul 3, 2020 at 9:56
  • Hello, and welcome to SO. Instead of posting relevant code in comments, try to edit your original post instead. Commented Jul 3, 2020 at 9:58
  • Can we always assume that each column will have a comma, or, if a column has a comma, that would be followed by a duplicate of what is before the comma? Commented Jul 3, 2020 at 9:59
  • Hi @DaemonPainter, yes, every column will have commas, and if the exact string comes after the comma, it is considered duplicate. Commented Jul 3, 2020 at 10:04

3 Answers

2

The method below does not assume that duplicates are consecutive:

awk '{ for(i=1;i<=NF;++i) { 
         n=split($i,a,",");
         for(j=1;j<=n;++j) {
            s = s (a[j] in b ? "" : (s ? "," : "")  a[j])
            b[a[j]]
         }
         $i=s; s=""; delete b
     }}1' file

Which returns the output:

rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

The idea is to rebuild each field. Each field is split on commas with split, and the resulting entries are stored in the array a. While rebuilding the field, we check whether an entry a[j] has already been added to the new value s of the field. The check tests whether the current entry already exists as a key of the associative array b (a[j] in b). After each field, s is reset and b is deleted, so duplicates are only detected within a single field.
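To make the "not necessarily consecutive" point concrete, here is a minimal sketch of the same technique with the per-field logic moved into an awk function. The names dedup_fields, dedup, parts and seen are hypothetical and chosen only for illustration; the function-local seen array plays the role of b above, so no explicit delete is needed.

```shell
# Hypothetical helper: reads whitespace-separated records on stdin and removes
# repeated comma-separated entries within each field (first occurrence kept).
dedup_fields() {
  awk '
    # parts, seen, n, j and out are function-local, so seen starts empty on every call.
    function dedup(str,    parts, seen, n, j, out) {
      n = split(str, parts, ",")
      out = ""
      for (j = 1; j <= n; j++) {
        if (!(parts[j] in seen)) {
          out = out (out == "" ? "" : ",") parts[j]
          seen[parts[j]]        # referencing the element records the entry as seen
        }
      }
      return out
    }
    { for (i = 1; i <= NF; i++) $i = dedup($i) } 1
  '
}

# The duplicates "a" and "x" are not adjacent, yet they are still removed.
printf 'id1 a,b,a x,x,y,x\n' | dedup_fields
# prints: id1 a,b x,y
```

As with the original command, assigning to $i makes awk rebuild the record with single spaces between fields.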


1 Comment

@austin7923, if this solves your question, consider accepting it.
0

With GNU sed and other implementations that support \b:

$ sed -E 's/\b([^,]+),\1\b/\1/g' ip.txt
rs10993127  9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012  9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169
  • ([^,]+) matches one or more non-comma characters and captures them
  • ,\1 matches a comma followed by the text captured by ([^,]+)
  • \1 is also used in the replacement, so only one copy is kept

Word boundaries are needed to avoid partial matches, for example:

$ echo 'a bc,bcd 123,23' | sed -E 's/([^,]+),\1/\1/g'
a bcd 123
$ echo 'a bc,bcd 123,23' | sed -E 's/\b([^,]+),\1\b/\1/g'
a bc,bcd 123,23

If the column content can start or end with non-word characters such as :, the word boundaries no longer guard against partial matches, so the above solution can give wrong results.
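A small sketch of that limitation, assuming GNU sed. The wrapper name dedup_adjacent and the sample strings are made up for illustration; the sed command itself is the one shown above.

```shell
# Hypothetical wrapper around the sed command from this answer (GNU sed assumed).
dedup_adjacent() {
  sed -E 's/\b([^,]+),\1\b/\1/g'
}

# Entries made only of word characters: "bc" and "bcd" differ and are left alone.
printf 'bc,bcd\n' | dedup_adjacent
# prints: bc,bcd

# Entries ending in ":" : "a:" and "a:b" differ as well, but \b also matches
# between ":" and "b", so the pair is wrongly collapsed.
printf 'a:,a:b\n' | dedup_adjacent
# prints: a:b
```

For data like that, splitting each field on commas and comparing whole entries avoids the problem.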

3 Comments

Is it not possible to say something like: sed -E 's/\b([^,]+)(,\1){1,}\b/\1/g' ?
Yes, that'll help if there can be more than one consecutive duplicate.
sed is not in the tags of the question
-1

A one-liner alternative, based on the assumption from the comments that whatever follows the comma is a duplicate:

awk '{output="";for(f=1;f<=NF;f++){split($f,a,",");output=output" "a[1]}print output}'

Output:

 rs10993127 9:94266397-94266397 intron_variant ZNF169
 rs11533012 9:94267817-94267817 intron_variant ZNF169

Known issue: the output has a leading space before the first field.

2 Comments

Not sure about the downvote, probably due to the whitespace. Care to explain?
Sorry, yes I can: The output is not the expected one. Each field can contain several different entries which are comma separated. Only the duplicates need to be removed. The solution presented here just prints the first entry.
