I have a table that looks like this:
DAPPUDRAFT_194440 Phosphorous
DAPPUDRAFT_194440 Temperature
DAPPUDRAFT_194472 Phosphorous Fishkairomones
DAPPUDRAFT_194472 Temperature
DAPPUDRAFT_194512 Fishkairomones
DAPPUDRAFT_194512 Cadmium Zinc Quantumdots
DAPPUDRAFT_195644 Salinity
DAPPUDRAFT_195644 Phosphorous
DAPPUDRAFT_196131 Salinity
DAPPUDRAFT_196131 Phosphorous
DAPPUDRAFT_196131 hypoxia
DAPPUDRAFT_196694 Salinity
As you can see, it can have data in variable number of columns (separated by a tab).
The duplicate entries in the first column starting "DAPPUDRAFT_" should be removed and all the other values that occur in multiple rows should occur in a single row.
For Example in my input table, "DAPPUDRAFT_194440" occurs 2 times in the table and it has two values "temperature" in one row, "phosphorus" in second row as seen in the subset of data here :
DAPPUDRAFT_194440 Phosphorous
DAPPUDRAFT_194440 Temperature
What I would like to see is: "DAPPUDRAFT_" should occur only once, and the two entries "temperature" and "phosphorus" should occur in the same row separated by a tab as seen here:
DAPPUDRAFT_194440 Phosphorous Temperature
Expected output:
DAPPUDRAFT_194440 Phosphorous Temperature
DAPPUDRAFT_194472 Phosphorous Fishkairomones Temperature
DAPPUDRAFT_194512 Fishkairomones Cadmium Zinc Quantumdots
DAPPUDRAFT_195644 Salinity Phosphorous
DAPPUDRAFT_196694 Salinity
DAPPUDRAFT_196131 Salinity Phosphorous hypoxia
I tried with the "reshape2" package in R with the dcast function. But it does something totally different than what i wanted. Is there a way on the command line or R or perl that can help solving this?