1

I have a data set with three columns:

https://drive.google.com/file/d/1gtCssfAXHxRjGfX8uTAaimGPWCA2cnci/view?usp=sharing

Here are the first few lines:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515-protein binding
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010    
MA_10001g0010   MA_10001g0010   
MA_10002030g0010    MA_10002030g0010    GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
MA_10002157g0010    MA_10002157g0010    GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_10002549g0010    MA_10002549g0010    
MA_10002583g0010    MA_10002583g0010    GO:0008168-methyltransferase activity
MA_10002614g0010    MA_10002614g0010    
MA_10002643g0010    MA_10002643g0010    GO:0055114-oxidation-reduction process

In the third column, I would like to remove all the text and only keep the GO:xxxxxxx where each of these terms should be separated by a comma. For example:

GO:0006468, GO:0030246

The first two columns should remain unchanged. How can I do this?

3
  • 1
    What is the field separator? Can we assume tabs? Commented Dec 2, 2023 at 17:24
  • They are four spaces as it seems. Commented Dec 2, 2023 at 17:29
  • 1
    Then please download it as tab separated since you cannot use space as a separator when your fields also contain spaces (e.g. GO:0030246-carbohydrate binding;). You can do that directly from Google Sheets: File=> Download => Tab Separated Values (tsv). Commented Dec 2, 2023 at 17:34

4 Answers 4

2

Assuming your data are tab separated, you can do this:

perl -F'\t' -lane '
  $F[2] = join(",", $F[2]=~/GO:\d+/g); 
  print join "\t",@F
' gene_table_Go\ -\ gene_table_Go.tsv > fixed.tsv

The -a makes perl act like awk nd split each input line into the array @F on the character given by -F. Here, I am passing a tab (you can always save your file as tab separated if it isn't already). Next, I am setting the third field, $F[2], to be the result of joining all occurrences of GO: followed by digits present in the existing 3rd field, in other words, keeping only the GO terms and nothing else. Then, I print the @F array joined by tabs.

Here's a more condensed version of the same basic idea:

perl -F'\t' -lane '
 $,="\t"; 
 print @F[0..1], join(",",$F[2]=~/GO:\d+/g);
' gene_table_Go\ -\ gene_table_Go.tsv 

Both examples produce the same output:

$ perl -F'\t' -lane '$,="\t"; print @F[0,1], join(",",$F[2]=~/GO:\d+/g);' gene_table_Go\ -\ gene_table_Go.tsv | head
ID  transcript_id   
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010    

To also keep the header intact, since it doesn't actually match the GO:\d+ pattern, you can do:

$ perl -F'\t' -lane '$,="\t"; print $. == 1 ? @F : @F[0,1], join(",",$F[2]=~/GO:\d+/g);' gene_table_Go\ -\ gene_table_Go.tsv | head
ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010    
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010 
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010    
MA_10001337g0010    MA_10001337g0010    
MA_10001425g0010    MA_10001425g0010    
MA_10001478g0010    MA_10001478g0010    
MA_10001558g0010    MA_10001558g0010    
1

AWK:

$ awk '
     BEGIN{FS=OFS="\t"}
     {
  while(match($3,/GO:[0-9]+/)) 
     {
  col= sprintf("%s",((col) ? col ",": "")substr($3,RSTART,RLENGTH));  
  $3= substr($3,RSTART+RLENGTH)
     }
     }
  col{ $3=col; col=""}
  1'
1

Using Raku (formerly known as Perl_6)

Assuming \t tab as column separator:

~$ raku -ne 'BEGIN put get;   \
             my @a = .split(:skip-empty, / \t /, 3);   \
             @a[2] = (@a[2] // "").comb(/ GO\: \d+ /).join(",");   \
             @a.join("\t").trim-trailing.put;'   file

Here's an answer coded in Raku, a member of the Perl-family of programming languages. Going line by line:

  1. The BEGIN statement outputs the header line (can be omitted if the header line is \t tab separated like the body rows).

  2. The body lines (rows) can be split on \t and saved into @a array. Note, it might be possible (but risky) to split on \s**4 i.e. four-consecutive whitespace characters, or even \h**4 (four-consecutive horizontal whitespace characters), if the third column does not contain that pattern. But again, this is risky.

  3. The third column (i.e. @a[2]) is replaced by @a[2] column text that has been combed (i.e. positively-selected) for one-or-more matches to the Regex GO\: \d+. Think of comb as the converse of split (which is destructive). The selected GO-IDs are then joined with commas.

  4. Finally, the split columns are joined back together on \t tabs, and output.

Sample Input:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010
MA_10000405g0010    MA_10000405g0010    GO:0006468-protein phosphorylation;GO:0030246-carbohydrate binding;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010    MA_10000516g0010    GO:0005515-protein binding
MA_10001015g0010    MA_10001015g0010
MA_10001337g0010    MA_10001337g0010
MA_10001425g0010    MA_10001425g0010
MA_10001478g0010    MA_10001478g0010
MA_10001558g0010    MA_10001558g0010
MA_10001g0010   MA_10001g0010
MA_10002030g0010    MA_10002030g0010    GO:0005737-cytoplasm;GO:0000184-nuclear-transcribed mRNA catabolic process, nonsense-mediated decay;GO:0004386-helicase activity;GO:0008270-zinc ion binding;GO:0003677-DNA binding;GO:0005524-ATP binding
MA_10002157g0010    MA_10002157g0010    GO:0006468-protein phosphorylation;GO:0005524-ATP binding;GO:0004672-protein kinase activity
MA_10002549g0010    MA_10002549g0010
MA_10002583g0010    MA_10002583g0010    GO:0008168-methyltransferase activity
MA_10002614g0010    MA_10002614g0010
MA_10002643g0010    MA_10002643g0010    GO:0055114-oxidation-reduction process

Sample Output:

ID  transcript_id   go_description
MA_10000213g0010    MA_10000213g0010
MA_10000405g0010    MA_10000405g0010    GO:0006468,GO:0030246,GO:0005524,GO:0004672
MA_1000049g0010 MA_1000049g0010
MA_10000516g0010    MA_10000516g0010    GO:0005515
MA_10001015g0010    MA_10001015g0010
MA_10001337g0010    MA_10001337g0010
MA_10001425g0010    MA_10001425g0010
MA_10001478g0010    MA_10001478g0010
MA_10001558g0010    MA_10001558g0010
MA_10001g0010   MA_10001g0010
MA_10002030g0010    MA_10002030g0010    GO:0005737,GO:0000184,GO:0004386,GO:0008270,GO:0003677,GO:0005524
MA_10002157g0010    MA_10002157g0010    GO:0006468,GO:0005524,GO:0004672
MA_10002549g0010    MA_10002549g0010
MA_10002583g0010    MA_10002583g0010    GO:0008168
MA_10002614g0010    MA_10002614g0010
MA_10002643g0010    MA_10002643g0010    GO:0055114

https://docs.raku.org
https://raku.org

0

You could consider using Miller, which has built in support for TSV files - for example

  • split the $go_description field on ; into an array
  • apply a regular expression to extract the GO:nnnnnn substring from each element of the array
  • join the result back into a , separated string
  • assign the result back to $go_description

So

mlr --tsv put '
  $go_description = joinv(
    apply(splitax($go_description,";"),func(e) {return regextract(e,"GO:[[:digit:]]+")}),
    ","
  )
' gene_table_Go.tsv

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.