Remove everything in a third column but only keep specific text

Question

I have a data set with three columns:

https://drive.google.com/file/d/1gtCssfAXHxRjGfX8uTAaimGPWCA2cnci/view?usp=sharing

Here are the first few lines:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515-protein binding
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010    
MA_10001g0010   MA_10001g0010   
MA_10002030g0010    MA_10002030g0010    GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
MA_10002157g0010    MA_10002157g0010    GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_10002549g0010    MA_10002549g0010    
MA_10002583g0010    MA_10002583g0010    GO:0008168-methyltransferase activity
MA_10002614g0010    MA_10002614g0010    
MA_10002643g0010    MA_10002643g0010    GO:0055114-oxidation-reduction process

In the third column, I would like to remove all the text and only keep the GO:xxxxxxx where each of these terms should be separated by a comma. For example:

GO:0006468, GO:0030246

The first two columns should remain unchanged. How can I do this?

Then please download it as tab separated since you cannot use space as a separator when your fields also contain spaces (e.g. GO:0030246-carbohydrate binding;). You can do that directly from Google Sheets: File=> Download => Tab Separated Values (tsv). — terdon
– terdon ♦, Commented Dec 2, 2023 at 17:34

terdon · Accepted Answer · 2024-02-04 15:33:45Z

Assuming your data are tab separated, you can do this:

perl -F'\t' -lane '
  $F[2] = join(",", $F[2]=~/GO:\d+/g); 
  print join "\t",@F
' gene_table_Go\ -\ gene_table_Go.tsv > fixed.tsv

The -a makes perl act like awk nd split each input line into the array @F on the character given by -F. Here, I am passing a tab (you can always save your file as tab separated if it isn't already). Next, I am setting the third field, $F[2], to be the result of joining all occurrences of GO: followed by digits present in the existing 3rd field, in other words, keeping only the GO terms and nothing else. Then, I print the @F array joined by tabs.

Here's a more condensed version of the same basic idea:

perl -F'\t' -lane '
 $,="\t"; 
 print @F[0..1], join(",",$F[2]=~/GO:\d+/g);
' gene_table_Go\ -\ gene_table_Go.tsv

Both examples produce the same output:

$ perl -F'\t' -lane '$,="\t"; print @F[0,1], join(",",$F[2]=~/GO:\d+/g);' gene_table_Go\ -\ gene_table_Go.tsv | head
ID  transcript_id   
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010

To also keep the header intact, since it doesn't actually match the GO:\d+ pattern, you can do:

$ perl -F'\t' -lane '$,="\t"; print $. == 1 ? @F : @F[0,1], join(",",$F[2]=~/GO:\d+/g);' gene_table_Go\ -\ gene_table_Go.tsv | head
ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010

Prabhjot Singh · Accepted Answer · 2023-12-03 06:28:38Z

1

AWK:

$ awk '
     BEGIN{FS=OFS="\t"}
     {
  while(match($3,/GO:[0-9]+/)) 
     {
  col= sprintf("%s",((col) ? col ",": "")substr($3,RSTART,RLENGTH));  
  $3= substr($3,RSTART+RLENGTH)
     }
     }
  col{ $3=col; col=""}
  1'

edited Dec 3, 2023 at 6:28

answered Dec 3, 2023 at 6:03

Prabhjot Singh

2,3751 gold badge6 silver badges20 bronze badges

Add a comment |

jubilatious1 · Accepted Answer · 2023-12-16 05:04:08Z

Using Raku (formerly known as Perl_6)

Assuming \t tab as column separator:

~$ raku -ne 'BEGIN put get;   \
             my @a = .split(:skip-empty, / \t /, 3);   \
             @a[2] = (@a[2] // "").comb(/ GO\: \d+ /).join(",");   \
             @a.join("\t").trim-trailing.put;'   file

Here's an answer coded in Raku, a member of the Perl-family of programming languages. Going line by line:

The BEGIN statement outputs the header line (can be omitted if the header line is \t tab separated like the body rows).
The body lines (rows) can be split on \t and saved into @a array. Note, it might be possible (but risky) to split on \s**4 i.e. four-consecutive whitespace characters, or even \h**4 (four-consecutive horizontal whitespace characters), if the third column does not contain that pattern. But again, this is risky.
The third column (i.e. @a[2]) is replaced by @a[2] column text that has been combed (i.e. positively-selected) for one-or-more matches to the Regex GO\: \d+. Think of comb as the converse of split (which is destructive). The selected GO-IDs are then joined with commas.
Finally, the split columns are joined back together on \t tabs, and output.

Sample Input:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010
MA_10000405g0010    MA_10000405g0010    GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010    MA_10000516g0010    GO:0005515-protein binding
MA_10001015g0010    MA_10001015g0010
MA_10001337g0010    MA_10001337g0010
MA_10001425g0010    MA_10001425g0010
MA_10001478g0010    MA_10001478g0010
MA_10001558g0010    MA_10001558g0010
MA_10001g0010   MA_10001g0010
MA_10002030g0010    MA_10002030g0010    GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
MA_10002157g0010    MA_10002157g0010    GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_10002549g0010    MA_10002549g0010
MA_10002583g0010    MA_10002583g0010    GO:0008168-methyltransferase activity
MA_10002614g0010    MA_10002614g0010
MA_10002643g0010    MA_10002643g0010    GO:0055114-oxidation-reduction process

Sample Output:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010
MA_10001337g0010    MA_10001337g0010
MA_10001425g0010    MA_10001425g0010
MA_10001478g0010    MA_10001478g0010
MA_10001558g0010    MA_10001558g0010
MA_10001g0010   MA_10001g0010
MA_10002030g0010    MA_10002030g0010    GO:0005737,GO:0000184,GO:0004386,GO:0008270,GO:0003677,GO:0005524
MA_10002157g0010    MA_10002157g0010    GO:0006468,GO:0005524,GO:0004672
MA_10002549g0010    MA_10002549g0010
MA_10002583g0010    MA_10002583g0010    GO:0008168
MA_10002614g0010    MA_10002614g0010
MA_10002643g0010    MA_10002643g0010    GO:0055114

https://docs.raku.org
https://raku.org

steeldriver · Accepted Answer · 2023-12-02 22:12:54Z

0

You could consider using Miller, which has built in support for TSV files - for example

split the $go_description field on ; into an array
apply a regular expression to extract the GO:nnnnnn substring from each element of the array
join the result back into a , separated string
assign the result back to $go_description

So

mlr --tsv put '
  $go_description = joinv(
    apply(splitax($go_description,";"),func(e) {return regextract(e,"GO:[[:digit:]]+")}),
    ","
  )
' gene_table_Go.tsv

answered Dec 2, 2023 at 22:12

steeldriver

83.9k12 gold badges124 silver badges175 bronze badges

Add a comment |

Stack Exchange Network

Remove everything in a third column but only keep specific text

4 Answers 4

You must log in to answer this question.

Hot Network Questions

Remove everything in a third column but only keep specific text

4 Answers 4

You must log in to answer this question.

Related

Hot Network Questions