Remove repeated string from every column

Question

I am trying to remove duplicate comma-separated entries within each column of every row.

Input:
rs10993127  9:94266397-94266397,9:94266397-94266397 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169
rs11533012  9:94267817-94267817,9:94267817-94267817 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169

Desired output:
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

My code:
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'

Thank you!

Sorry, I forgot to post my code. awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
– austin7923, Commented Jul 3, 2020 at 9:56
Hello, and welcome to SO. Instead of posting relevant code in comments, please edit your original post.
– Daemon Painter, Commented Jul 3, 2020 at 9:58
Can we always assume that each column will have a comma, or that, if a column has a comma, it is followed by a duplicate of what comes before the comma?
– Daemon Painter, Commented Jul 3, 2020 at 9:59
Hi @DaemonPainter, yes, every column will have commas, and if the exact same string comes after the comma, it is considered a duplicate.
– austin7923, Commented Jul 3, 2020 at 10:04

kvantour · Accepted Answer · 2020-07-03 10:10:33Z

The method below does not assume that duplicates are consecutive:

awk '{ for(i=1;i<=NF;++i) { 
         n=split($i,a,",");
         for(j=1;j<=n;++j) {
            s = s (a[j] in b ? "" : (s ? "," : "")  a[j])
            b[a[j]]
         }
         $i=s; s=""; delete b
     }}1' file

Which returns the output:

rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

The idea is to rebuild each field. split breaks the field into its comma-separated entries and stores them in the array a. While rebuilding, each entry a[j] is appended to the new field value s only if it has not been added already. This is tracked with the associative array b, whose keys are the entries seen so far (the test is a[j] in b). After a field is rebuilt, s is reset and b is deleted so that the next field starts fresh.
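To see the non-consecutive case in action, here is a minimal sketch that wraps the same logic in a shell function. The function name dedup_fields, the array name seen, and the sample line are made up for illustration and are not part of the original answer.

```shell
# dedup_fields: hypothetical wrapper name, not part of the original answer.
# Reads lines on stdin; within each whitespace-separated field, keeps only
# the first occurrence of every comma-separated entry.
dedup_fields() {
  awk '{
    for (i = 1; i <= NF; ++i) {
      n = split($i, a, ",")
      s = ""
      for (j = 1; j <= n; ++j) {
        if (!(a[j] in seen)) {              # entry not yet added to this field
          s = s (s == "" ? "" : ",") a[j]
          seen[a[j]] = 1
        }
      }
      $i = s                                # assigning a field rebuilds $0 with OFS
      split("", seen)                       # portable way to empty the array
    }
    print
  }'
}

# Made-up sample: "x" repeats with "y" in between.
printf '%s\n' 'id1 x,y,x g1,g1' | dedup_fields
```

This should print id1 x,y g1. Because assigning to a field makes awk rejoin the record with the output field separator, runs of spaces between columns in the input come out as single spaces.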

@austin7923, if this solves your question, consider accepting it.

Sundeep · Accepted Answer · 2020-07-03 10:09:35Z

With GNU sed and other implementations that support \b

$ sed -E 's/\b([^,]+),\1\b/\1/g' ip.txt
rs10993127  9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012  9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

([^,]+) matches one or more non-comma characters and captures them
,\1 matches a comma followed by the text captured by ([^,]+)
\1 in the replacement puts back a single copy of the captured text

Word boundaries are needed to avoid partial matches, for example:

$ echo 'a bc,bcd 123,23' | sed -E 's/([^,]+),\1/\1/g'
a bcd 123
$ echo 'a bc,bcd 123,23' | sed -E 's/\b([^,]+),\1\b/\1/g'
a bc,bcd 123,23

If the column content can start or end with non-word characters such as : then the word boundaries no longer prevent partial matches, and the above solution can give wrong results.
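A made-up input illustrates that failure mode. When an entry ends in a non-word character such as :, \b still finds a boundary right after the repeated text, so a longer, different entry gets merged by mistake. This is only a sketch and assumes GNU sed; the variable name merged is arbitrary.

```shell
# Made-up input: "x:" and "x:y" are different entries, so the line should stay as it is.
# \b matches between ":" and "y", so the regex treats "x:,x:" as a duplicated entry.
merged=$(echo 'x:,x:y' | sed -E 's/\b([^,]+),\1\b/\1/g')
echo "$merged"
```

With GNU sed this should print x:y instead of leaving x:,x:y untouched. An approach that splits each field on commas and compares whole entries does not have this problem.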

Is it not possible to say something like: sed -E 's/\b([^,]+)(,\1){1,}\b/\1/g' ?
Yes, that will help if there can be more than one consecutive duplicate.
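The difference between the two patterns can be checked on a made-up line where the same entry appears three times in a row. Again this is a sketch that assumes GNU sed; the variable names are arbitrary.

```shell
# Made-up input with the same entry three times in a row.
line='a,a,a b'

# Original pattern: each match removes one duplicate, so one repeat survives.
once=$(echo "$line" | sed -E 's/\b([^,]+),\1\b/\1/g')
echo "$once"

# Pattern from the comment: (,\1){1,} consumes the whole run of repeats.
run=$(echo "$line" | sed -E 's/\b([^,]+)(,\1){1,}\b/\1/g')
echo "$run"
```

The first command should print a,a b and the second a b. Both patterns still only catch duplicates that sit next to each other.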

Daemon Painter · Accepted Answer · 2020-07-03 10:16:04Z

-1

One-liner alternative, based on the assumption that whatever follows the first comma in a column is a duplicate of what precedes it:

awk '{output="";for(f=1;f<=NF;f++){split($f,a,",");output=output" "a[1]}print output}'

output:

 rs10993127 9:94266397-94266397 intron_variant ZNF169
 rs11533012 9:94267817-94267817 intron_variant ZNF169

Known issue: the output has a leading space before the first field.

Daemon Painter Over a year ago

Not sure about the downvote, probably due to the whitespace. Care to explain?

kvantour Over a year ago

Sorry, yes I can: The output is not the expected one. Each field can contain several different entries which are comma separated. Only the duplicates need to be removed. The solution presented here just prints the first entry.
