3

I have an input file with these fields:

ENST00000456328.2   1657    1350.015    0   0

I am trying to use awk to remove the number after the decimal and print the rest as it is:

awk -F[.] '{print $1"\t"$2"\t"$3}{next;}'

But it doesn't work, as it gives an output like this:

ENST00000456328 2   1657    1350    015 0   0

Can someone help?

regards.

3
  • When you use [.] as the field separator, $2 is the decimal part that you want to remove - so just print $1"\t"$3 Commented Jan 3, 2020 at 18:37
  • Does every line have decimals in the exact same place? Commented Jan 3, 2020 at 18:43
  • @Jesse_b If they are all human transcript IDs, yes. If the list contains data from another species, then the identifier (before the dot) may be longer depending on the species' name. Commented Jan 3, 2020 at 19:03

3 Answers

6

Assuming the input is tab-delimited and that you'd like to keep it that way, you can remove the version numbers from the Ensembl stable IDs with

$ awk 'BEGIN { OFS=FS="\t" } { sub("\\..*", "", $1); print }' file
ENST00000456328 1657    1350.015        0       0

This applies a substitution to the first tab-delimited field (only) that removes the first dot and everything after it.

Similarly with sed:

$ sed 's/\.[^[:blank:]]*//' file
ENST00000456328 1657    1350.015        0       0

This removes the first dot on each line together with any non-blank characters that follow it. You could also use \.[[:digit:]]* as the pattern, which would explicitly match digits instead of non-blanks.
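As a small illustration of the digit-based pattern, here is a sketch that runs it on a single tab-delimited line shaped like the one in the question. The variable names `line` and `result` are only there for the example.

```shell
# Sample line, tab-delimited like the data in the question.
line=$(printf 'ENST00000456328.2\t1657\t1350.015\t0\t0')

# Delete the first dot on the line together with the digits that follow it.
# Without the g flag, later dots (as in 1350.015) are left alone.
result=$(printf '%s\n' "$line" | sed 's/\.[[:digit:]]*//')

printf '%s\n' "$result"
```

Only the version suffix of the ID is removed here because it happens to contain the first dot on the line.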

If you have non-versioned Ensembl IDs, or IDs from another database, in your data, then you may want to make sure that you match a versioned Ensembl ID before modifying the line. With awk, this may be done with

$ awk 'BEGIN { OFS=FS="\t" } /^ENS[^[:blank:]]*\./ { sub("\\..*", "", $1) } { print }' file
ENST00000456328 1657    1350.015        0       0

The print is now in a separate block from the block that modifies the first field. This ensures that all lines, modified or not, are printed. The whole { print } block may be replaced by the shorter 1 if you are short on time or space for typing.
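To make the shorthand concrete, the following sketch runs both spellings on the same sample line and stores the results in two illustrative variables, `with_print` and `with_one`. A bare 1 is a pattern that is always true and has no action, so awk falls back to its default action of printing the line.

```shell
# Sample line, tab-delimited like the data in the question.
line=$(printf 'ENST00000456328.2\t1657\t1350.015\t0\t0')

# Explicit { print } block.
with_print=$(printf '%s\n' "$line" |
    awk 'BEGIN { OFS=FS="\t" } /^ENS[^[:blank:]]*\./ { sub("\\..*", "", $1) } { print }')

# Same program with the always-true pattern 1 in place of { print }.
with_one=$(printf '%s\n' "$line" |
    awk 'BEGIN { OFS=FS="\t" } /^ENS[^[:blank:]]*\./ { sub("\\..*", "", $1) } 1')

printf '%s\n' "$with_print" "$with_one"
```

Both commands should print the same line.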

And with sed:

$ sed '/^ENS[^[:blank:]]*\./s/\.[^[:blank:]]*//' file
ENST00000456328 1657    1350.015        0       0

The sed code already prints all lines, whether modified or not, so no other modification has to be made (whereas in the awk code, the outputting of the result had to be slightly adjusted compared with the first awk variation).

In these last two variants, we match a versioned Ensembl ID at the start of a line with the regular expression ^ENS[^[:blank:]]*\. before attempting to do any modifications.
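To see what the guard buys you, here is a sketch that compares the unguarded and guarded sed commands on three lines: a versioned Ensembl ID, a non-versioned one, and a made-up ID from a hypothetical other database (XYZ_0001.5 is invented for this example).

```shell
# Three tab-delimited lines; the third ID is hypothetical.
input=$(printf 'ENST00000456328.2\t1657\t1350.015\t0\t0\nENST00000456329\t1657\t1350.015\t0\t0\nXYZ_0001.5\t10\t2.5\t0\t0')

# Unguarded: acts on the first dot of every line, wherever it is.
unguarded=$(printf '%s\n' "$input" | sed 's/\.[^[:blank:]]*//')

# Guarded: only acts on lines that start with a versioned Ensembl ID.
guarded=$(printf '%s\n' "$input" | sed '/^ENS[^[:blank:]]*\./s/\.[^[:blank:]]*//')

printf 'unguarded:\n%s\n\nguarded:\n%s\n' "$unguarded" "$guarded"
```

In the unguarded run, the non-versioned line loses the decimals of 1350.015 (its first dot is in the third field) and the foreign ID is truncated. The guarded run leaves both of those lines as they were.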

None of the variations above cares or needs to care about the rest of the data on the line. Each line may contain additional fields, and these will be passed on unmodified.


Using a dot as the field delimiter is inventive, but it leads to issues because other data on the line also contains dots.
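A quick way to see the problem is to ask awk how it splits the sample line when the dot is the delimiter. This sketch stores the field count and the second field in the illustrative variables `nf` and `second`.

```shell
# Sample line, tab-delimited like the data in the question.
line=$(printf 'ENST00000456328.2\t1657\t1350.015\t0\t0')

# Number of fields awk sees when splitting on dots.
nf=$(printf '%s\n' "$line" | awk -F'[.]' '{ print NF }')

# The second dot-delimited field spans several of the original columns.
second=$(printf '%s\n' "$line" | awk -F'[.]' '{ print $2 }')

printf 'fields: %s\nsecond field: %s\n' "$nf" "$second"
```

The line is cut into three pieces at its two dots, and the second piece contains the version number together with the next two columns up to the decimal point of 1350.015. This is why printing the pieces joined by tabs produces the scrambled output shown in the question.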

1

If you want to remove all decimals regardless of field and be able to handle the potential for decimals being in different fields you can use the gsub function:

awk '{gsub(/\.[0-9]+ /, " ")}1'

This finds every dot that is followed by one or more digits and a space, and replaces each match with a single space.
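Because the pattern ends in a literal space, it does not match a decimal that is followed by a tab or that sits at the end of the line. One possible variation, assuming you really do want every dot-plus-digits run removed, is to drop the trailing space from the pattern. This is a sketch rather than the command from this answer, and the trailing 0.5 in the sample line is added only to show the end-of-line case.

```shell
# Tab-delimited sample line with an extra decimal at the end of the line.
line=$(printf 'ENST00000456328.2\t1657\t1350.015\t0\t0.5')

# Remove every dot that is followed by one or more digits, wherever it occurs.
result=$(printf '%s\n' "$line" | awk '{ gsub(/\.[0-9]+/, "") } 1')

printf '%s\n' "$result"
```

Note that this also strips dot-digit runs inside identifiers, so it is only appropriate when no field needs to keep its decimals.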

0

Using Raku (formerly known as Perl_6)

~$ raku -ne '.words andthen put join "\t", .[0].subst(/\.\d+/), .[1..*];'  file  

Raku is a programming language in the Perl family. While it still has a small ecosystem, its text-processing capabilities (like Perl's) make it a good choice for bioinformatics.

Above, Raku is called at the command line with the -ne non-autoprinting linewise flags (i.e. awk-like behavior). Lines are split into whitespace-separated words, and the first word (.[0]) uses subst to recognize and delete the trailing dot-number(s). [Using subst without a replacement instructs Raku to delete the recognized pattern]. Then the modified first word along with .[1..*] (the remainder of the line) are joined on tabs and output.

Sample Input:

ENST00000456328.2   1657    1350.015    0   0
ENST00000456329 1657    1350.015    0   0

Sample Output:

ENST00000456328 1657    1350.015    0   0
ENST00000456329 1657    1350.015    0   0

Note that only the first column is modified.

https://raku.org
