Remove duplicate entries in one column and linearize the values in multiple rows to a single row

Question

I have a table that looks like this:

DAPPUDRAFT_194440   Phosphorous     
DAPPUDRAFT_194440   Temperature     
DAPPUDRAFT_194472   Phosphorous Fishkairomones  
DAPPUDRAFT_194472   Temperature     
DAPPUDRAFT_194512   Fishkairomones      
DAPPUDRAFT_194512   Cadmium Zinc    Quantumdots
DAPPUDRAFT_195644   Salinity        
DAPPUDRAFT_195644   Phosphorous     
DAPPUDRAFT_196131   Salinity        
DAPPUDRAFT_196131   Phosphorous     
DAPPUDRAFT_196131   hypoxia     
DAPPUDRAFT_196694   Salinity

As you can see, each row can have data in a variable number of tab-separated columns.

The duplicate entries in the first column (those starting with "DAPPUDRAFT_") should be removed, and all the values that are spread over multiple rows should be placed in a single row.

For example, in my input table "DAPPUDRAFT_194440" occurs twice, with the value "Phosphorous" in one row and "Temperature" in the other, as seen in this subset of the data:

 DAPPUDRAFT_194440   Phosphorous     
 DAPPUDRAFT_194440   Temperature

What I would like is for "DAPPUDRAFT_194440" to occur only once, with the two entries "Phosphorous" and "Temperature" in the same row, separated by a tab, as seen here:

 DAPPUDRAFT_194440   Phosphorous   Temperature

Expected output:

DAPPUDRAFT_194440   Phosphorous Temperature     
DAPPUDRAFT_194472   Phosphorous Fishkairomones  Temperature 
DAPPUDRAFT_194512   Fishkairomones  Cadmium Zinc    Quantumdots
DAPPUDRAFT_195644   Salinity    Phosphorous     
DAPPUDRAFT_196694   Salinity            
DAPPUDRAFT_196131   Salinity    Phosphorous hypoxia

I tried the dcast function from the "reshape2" package in R, but it does something totally different from what I wanted. Is there a way to do this on the command line, in R, or in Perl?

RomanPerekhrest · Accepted Answer · 2017-09-15 12:29:59Z

2

Simply with awk:

awk '{ r=$0; sub($1,"",r); a[$1]=(a[$1])? a[$1]"\t"r : r }
     END{ for(i in a) { gsub(/[[:space:]]{2,}/," ",a[i]); print i,a[i] } }' file

r=$0 - save a copy of the current record
sub($1,"",r) - remove the 1st field from the copy, leaving the remaining fields in the variable r
a[$1]=(a[$1])? a[$1]"\t"r : r - accumulate the values for each group (keyed by the 1st field)
for(i in a) - iterate over all groups (in no particular order)
gsub(/[[:space:]]{2,}/," ",a[i]) - collapse each run of two or more whitespace characters into a single space
print i,a[i] - print the group name and its values

The output:

DAPPUDRAFT_194440  Phosphorous Temperature 
DAPPUDRAFT_196694  Salinity
DAPPUDRAFT_194512  Fishkairomones Cadmium Zinc Quantumdots
DAPPUDRAFT_194472  Phosphorous Fishkairomones Temperature 
DAPPUDRAFT_196131  Salinity Phosphorous hypoxia 
DAPPUDRAFT_195644  Salinity Phosphorous
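
If the original row order and tab-separated output matter, a variant along the same lines might split on tabs and skip the empty fields. This is only a minimal sketch, assuming POSIX awk and strictly tab-delimited input; the function name merge_rows is illustrative and not part of the answer above.

```shell
#!/bin/sh
# merge_rows: hypothetical helper name, chosen for this sketch only.
# Reads tab-separated rows (from files or stdin) and prints one row per key
# (first field), keeping keys in order of first appearance and joining the
# non-empty values with single tabs.
merge_rows() {
    awk -F'\t' '
        {
            # Remember the order in which each key is first seen.
            if (!($1 in vals)) { order[++n] = $1; vals[$1] = "" }
            # Append every non-empty field after the key.
            for (i = 2; i <= NF; i++)
                if ($i != "")
                    vals[$1] = vals[$1] "\t" $i
        }
        END {
            for (k = 1; k <= n; k++)
                print order[k] vals[order[k]]
        }
    ' "$@"
}

# Small demonstration on three rows taken from the sample data.
printf 'DAPPUDRAFT_194440\tPhosphorous\t\t\nDAPPUDRAFT_194440\tTemperature\t\t\nDAPPUDRAFT_196694\tSalinity\n' | merge_rows
```

Called as merge_rows file, it reads the file instead of standard input. Setting the field separator to a tab keeps empty columns as empty fields, which is why the loop has to skip them explicitly.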

edited Sep 15, 2017 at 12:29

answered Sep 15, 2017 at 12:16

RomanPerekhrest

That's awesome! Could you please explain the command?
– biobudhan, Sep 15, 2017 at 12:24

@biobudhan, sure, see my explanation
– RomanPerekhrest, Sep 15, 2017 at 12:30

Essex Boy · Answer · 2017-09-15 12:24:08Z

0

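Or, with perl (the values are concatenated without a separator, so this relies on the trailing whitespace present in the input lines):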

$ perl -e 'while(<ARGV>){chomp;($x,$y)=split(/\s+/,$_,2);$hash{$x}.=$y;}for(keys %hash){print "$_ $hash{$_}\n";}' test1
DAPPUDRAFT_196694 Salinity
DAPPUDRAFT_194440 Phosphorous     Temperature
DAPPUDRAFT_195644 Salinity        Phosphorous
DAPPUDRAFT_194472 Phosphorous Fishkairomones  Temperature
DAPPUDRAFT_194512 Fishkairomones      Cadmium Zinc    Quantumdots
DAPPUDRAFT_196131 Salinity        Phosphorous     hypoxia
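
Because Perl does not guarantee any particular hash key order, the line order above can differ between runs. If sorted output is acceptable, a streaming approach gives a deterministic order and does not need to hold all groups in memory. The following is a minimal sketch, assuming tab-delimited input and a sort that supports the -s (stable) option, as GNU and BSD sort do; merge_sorted is a hypothetical name.

```shell
#!/bin/sh
# merge_sorted: hypothetical helper name, chosen for this sketch only.
# Sorts rows by the first tab-separated field, then merges adjacent rows
# that share a key into one tab-separated output row.
merge_sorted() {
    # Stable sort on the key only, so values keep their original relative order.
    sort -s -t "$(printf '\t')" -k1,1 "$@" |
    awk -F'\t' '
        NR == 1 || $1 != key {
            # A new key starts a new output line; flush the previous one first.
            if (NR > 1) print line
            key = $1
            line = $1
        }
        {
            # Append every non-empty value of the current row.
            for (i = 2; i <= NF; i++)
                if ($i != "")
                    line = line "\t" $i
        }
        END { if (NR > 0) print line }
    '
}
```

It can be called as merge_sorted test1, or fed from a pipe. If the input is already grouped by key, as in the sample, the awk part alone is enough and the sort can be dropped.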

answered Sep 15, 2017 at 12:24

Essex Boy


Philippos · Answer · 2017-09-15 12:27:49Z

0

If you don't care how lines and elements are ordered:

sed 'G;s/^\(.*\)\(\t.*\)\n\(.*\)\1/\3\1\2/;h;$!d;s/\n$//' file

For non-GNU sed, replace the \t with a literal TAB.

answered Sep 15, 2017 at 12:27

Philippos
