I have a table that looks like this:
DAPPUDRAFT_194440   Phosphorous     
DAPPUDRAFT_194440   Temperature     
DAPPUDRAFT_194472   Phosphorous Fishkairomones  
DAPPUDRAFT_194472   Temperature     
DAPPUDRAFT_194512   Fishkairomones      
DAPPUDRAFT_194512   Cadmium Zinc    Quantumdots
DAPPUDRAFT_195644   Salinity        
DAPPUDRAFT_195644   Phosphorous     
DAPPUDRAFT_196131   Salinity        
DAPPUDRAFT_196131   Phosphorous     
DAPPUDRAFT_196131   hypoxia     
DAPPUDRAFT_196694   Salinity        
As you can see, it can have data in variable number of columns (separated by a tab).
The duplicate entries in the first column starting "DAPPUDRAFT_" should be removed and all the other values that occur in multiple rows should occur in a single row.
For Example in my input table, "DAPPUDRAFT_194440" occurs 2 times in the table and it has two values "temperature" in one row, "phosphorus" in second row as seen in the subset of data here :
 DAPPUDRAFT_194440   Phosphorous     
 DAPPUDRAFT_194440   Temperature  
What I would like to see is: "DAPPUDRAFT_" should occur only once, and the two entries "temperature" and "phosphorus" should occur in the same row separated by a tab as seen here:
 DAPPUDRAFT_194440   Phosphorous   Temperature  
Expected output:
DAPPUDRAFT_194440   Phosphorous Temperature     
DAPPUDRAFT_194472   Phosphorous Fishkairomones  Temperature 
DAPPUDRAFT_194512   Fishkairomones  Cadmium Zinc    Quantumdots
DAPPUDRAFT_195644   Salinity    Phosphorous     
DAPPUDRAFT_196694   Salinity            
DAPPUDRAFT_196131   Salinity    Phosphorous hypoxia 
I tried with the "reshape2" package in R with the dcast function. But it does something totally different than what i wanted. Is there a way on the command line or R or perl that can help solving this?


