Extract part of a string from each row of a column

Question

I have a text file with more than 20,000 lines, like this:

7   128550681   128550681   Intron:1:36:RETAINED-RETAINED;Transcript:NM_001135914.1;Gene:KCP:protein_coding 1   1   0   0
1   17718672    17718672    Intron:9:16:RETAINED-RETAINED;Transcript:NM_207421.4;Gene:PADI6:protein_coding  1   1   0   0
1   17718672    17718672    Intron:9:16:RETAINED-RETAINED;Transcript:NM_207421.4;Gene:PADI6:protein_coding  1   1   0   0
4   86035   86035   Exon:4:5:RETAINED;Transcript:NM_001286052.1;Gene:ZNF595:protein_coding  1   1   0   0
3   12942851    12942851    Intron:14:14:SKIPPED-ALTTENATIVE_3SS;Transcript:NM_001134382.2;Gene:IQSEC1:protein_coding   1   1   0   0

I need the 4th column to contain just Gene:genename, so that the output looks like this:

7   128550681   128550681   Gene:KCP    1   1   0   0
1   17718672    17718672    Gene:PADI6  1   1   0   0
1   17718672    17718672    Gene:PADI6  1   1   0   0
4   86035   86035   Gene:ZNF595 1   1   0   0
3   12942851    12942851    Gene:IQSEC1 1   1   0   0

* The problem is that Gene:genename is not always in the same position when I try to split on : or ;

I know only very basic awk/sed, such as how to select a specific column or how to grep rows that contain some pattern.

Mostly but not always in the same location :( @steeldriver

– LamaMo, commented Jul 13, 2018 at 19:03

jesse_b · Accepted Answer · 2018-07-13 18:56:33Z

2

I was able to accomplish this with the following awk command:

awk '{sub(/^.*;/,"",$4); print}' input

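This removes everything in column 4 up to and including the last ;. On the sample data that leaves Gene:KCP:protein_coding, so the trailing biotype is still attached, and it only works when the Gene entry is the last one in the field (see steeldriver's comment). If that does not work for you, please update your question with clarification.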
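As a hedged sketch building on the command above: on the sample rows, stripping up to the last ; still leaves the trailing :protein_coding, so a second sub() can trim it. This assumes the Gene entry is the last one in the field. The helper name strip_to_gene and the inline sample variable are illustrative, not part of the answer.

```shell
# Two sample rows from the question (single spaces used as separators here).
sample='7 128550681 128550681 Intron:1:36:RETAINED-RETAINED;Transcript:NM_001135914.1;Gene:KCP:protein_coding 1 1 0 0
4 86035 86035 Exon:4:5:RETAINED;Transcript:NM_001286052.1;Gene:ZNF595:protein_coding 1 1 0 0'

# strip_to_gene is a hypothetical helper name; it reads rows on stdin.
strip_to_gene() {
  awk '{
    sub(/^.*;/, "", $4)     # drop everything up to the last ";"
    sub(/:[^:]*$/, "", $4)  # drop the trailing ":biotype"
    print
  }'
}

printf '%s\n' "$sample" | strip_to_gene
```

Because the field is modified, awk rebuilds each line with single spaces between fields.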

answered Jul 13, 2018 at 18:56

jesse_b


Inian · Accepted Answer · 2018-07-13 19:26:32Z

2

Using awk with only POSIX defined constructs,

awk 'match($4, /Gene:[^:]+/){ $4=substr($4, RSTART, RLENGTH) }1' file

The bracket expression [^:]+ stops the match at the first colon after the gene name, so this also works when Gene:genename is not the last entry in the field. Assigning to $4 makes awk rebuild the line with single spaces between fields; to align the output more neatly, pipe it to | column -t, which pads the columns with spaces. If you are unsure of the position of Gene:genename in your line, change the awk to look for the pattern anywhere within the line and still write the result to the 4th column. Changing $4 to $0 (the whole line) in the match should work, provided a colon follows the gene name.

awk 'match($0, /Gene:[^:]+/){ $4=substr($0, RSTART, RLENGTH) }1' file
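To illustrate the "not always in the same location" case, here is a small sketch that matches Gene: followed by characters other than : or ;, so the match ends at the gene name wherever the entry sits in the field. The helper name gene_anywhere and the reordered input row are made up for illustration.

```shell
# gene_anywhere is a hypothetical helper name; it reads rows on stdin.
gene_anywhere() {
  awk 'match($4, /Gene:[^:;]+/) { $4 = substr($4, RSTART, RLENGTH) } 1'
}

# Made-up row with the Gene entry first, to exercise the "not always last" case.
printf '%s\n' '7 128550681 128550681 Gene:KCP:protein_coding;Transcript:NM_001135914.1 1 1 0 0' | gene_anywhere
```

Excluding ; as well as : also covers a Gene entry that has no biotype after it but is followed by another entry.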

edited Jul 13, 2018 at 19:26

answered Jul 13, 2018 at 18:58

Inian


Rakesh Sharma · Accepted Answer · 2018-07-15 05:38:33Z

0

perl -pale 's#(?:\H+\h+){3}\K\H+#($F[3] =~ /(?:^|;)(Gene:[^:]+)/)[0]#e' input-file.txt

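° If the gene has no fixed location within the fourth field, the command above still handles it.

° The regex (?:\H+\h+){3}\K\H+ zeroes in on the fourth field: it skips three fields and their trailing whitespace, \K drops them from the match, and \H+ matches the fourth field itself. The replacement part of the s///e command is then evaluated as Perl code, which runs a second regex against $F[3] and returns the captured Gene:name.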

edited Jul 15, 2018 at 5:38

answered Jul 13, 2018 at 19:30

Rakesh Sharma


Kusalananda · Accepted Answer · 2018-07-16 06:57:36Z

0

Treat column four as a list of strings delimited by either ; or :. Split the field on those delimiters, find the string Gene, and replace the whole column with Gene and the string that follows it (the gene name):

$ awk -vOFS='\t' '{ split($4,a,"[;:]"); for (i in a) if (a[i]=="Gene") { $4 = a[i] ":" a[i+1]; break } } 1' file
7       128550681       128550681       Gene:KCP        1       1       0       0
1       17718672        17718672        Gene:PADI6      1       1       0       0
1       17718672        17718672        Gene:PADI6      1       1       0       0
4       86035   86035   Gene:ZNF595     1       1       0       0
3       12942851        12942851        Gene:IQSEC1     1       1       0       0
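The order in which for (i in a) visits array elements is unspecified in awk, which only matters here if the field held more than one Gene entry. A sketch with a counted loop, using the element count that split() returns, always picks the first one. The helper name gene_by_split is hypothetical.

```shell
# gene_by_split is a hypothetical helper name; it reads rows on stdin.
gene_by_split() {
  awk -v OFS='\t' '{
    n = split($4, a, "[;:]")            # n is the number of pieces
    for (i = 1; i < n; i++)             # visit pieces in order, left to right
      if (a[i] == "Gene") { $4 = a[i] ":" a[i+1]; break }
    print
  }'
}

printf '%s\n' '7 128550681 128550681 Intron:1:36:RETAINED-RETAINED;Transcript:NM_001135914.1;Gene:KCP:protein_coding 1 1 0 0' | gene_by_split
```

As in the original command, the output is tab-separated only for lines where the fourth field is actually reassigned.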

edited Jul 16, 2018 at 6:57

answered Jul 15, 2018 at 6:11

Kusalananda♦

Make sure that the output matches with what is wanted. The ":protein_coding"s are extras.

– Rakesh Sharma, commented Jul 16, 2018 at 6:52

@RakeshSharma Thanks, I didn't notice that the biotype wasn't needed.

– Kusalananda ♦, commented Jul 16, 2018 at 6:54

Rakesh Sharma · Accepted Answer · 2018-07-16 08:12:44Z

Perl:

perl -F'\h+' -lane '
    for ( $F[3] ) {
        my $a = index(";$_", ";Gene:"     );
        my $b = index(";$_", ":",    $a+6 );
        $_ = substr(";$_", $a+1, $b-$a-1);
    }
    print join "\t", @F;
' input-file.txt

Output:

7   128550681   128550681   Gene:KCP    1   1   0   0
1   17718672    17718672    Gene:PADI6  1   1   0   0
1   17718672    17718672    Gene:PADI6  1   1   0   0
4   86035   86035   Gene:ZNF595 1   1   0   0
3   12942851    12942851    Gene:IQSEC1 1   1   0   0
$   128550681   128550681   Gene:$$$    1   1   0   0

Explanation:

perl options:
- -n => read the input line by line.
- -F => set the field separator (FS) to horizontal whitespace.
- -a => split each line into fields (using the separator set by -F, or whitespace by default) and store them in the array @F.
- -l => strip the trailing newline from each input line and append "\n" to each print, i.e. RS = ORS = "\n".
- -e => what follows is Perl code, applied to each line, a.k.a. record.
data structures involved:
- @F => the array populated with the fields obtained by splitting the record. Indexed from 0, so $F[3] is the fourth field in the record.
- $a => holds the position of the substring ;Gene: in the padded 4th field.
- $b => holds the position of the first : found when searching from 6 characters after the position of ;Gene:, i.e. the colon that ends the gene name. Note: we prepend a semicolon to the string being searched, that is, $F[3], since Gene: can be anywhere in the fourth field, including at its very beginning, where no semicolon precedes it.
- $_ => aliased to $F[3] inside the for loop, so assigning the result of substr (the Gene:... text) to $_ stores it back in $F[3].
- Note: the my qualifier before $a and $b makes them lexical variables whose scope is limited to the for loop.
- Note: $_ inside the for loop does NOT refer to the current record/line. For the duration of the for loop it is an alias for $F[3].
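The same index/substr idea can be sketched in awk. This is a rough port under stated assumptions, not the answer's code: awk's index() is 1-based and takes no start offset, so the text after ;Gene: is cut out first and the end of the gene name is then located in that remainder. The helper name gene_by_index is hypothetical.

```shell
# gene_by_index is a hypothetical helper name; it reads rows on stdin.
gene_by_index() {
  awk '{
    s = ";" $4                   # prepend ";" so a leading Gene: entry is found too
    a = index(s, ";Gene:")       # 1-based position, 0 if absent
    if (a) {
      rest = substr(s, a + 6)    # text right after ";Gene:"
      # the name ends at the next ":" or ";", or at the end of the field
      name = match(rest, /[:;]/) ? substr(rest, 1, RSTART - 1) : rest
      $4 = "Gene:" name
    }
    print
  }'
}

printf '%s\n' '3 12942851 12942851 Intron:14:14:SKIPPED-ALTTENATIVE_3SS;Transcript:NM_001134382.2;Gene:IQSEC1:protein_coding 1 1 0 0' | gene_by_index
```

Unlike the Perl version, this sketch also copes with a gene name that is not followed by a colon; lines without any Gene entry are printed unchanged.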

GNU Sed:

sed -Ee '
    s/\S+/\n&\n/4
    s/\n(.*;)?(Gene:[^:]+):.*\n/\2/
' input-file.txt

Explanation:

We mark out the fourth field with newlines.
Having staked out the region in the current line, we then fish out the required data: Gene: followed by as many non-colon characters as we meet before hitting the next colon.
This method does not disturb the existing spacing between the fields. This may or may not be important.
Note: this assumes a single gene in the fourth field. With multiple genes it won't error out or warn; it silently picks the last gene in the 4th field of that record.
