5

I'm working with sequence data and I stupidly cannot find the correct way to replace "." by "X" in lines not starting with ">" using awk. I really need to use awk and not sed.

I got this far, but simply all "." are replaced in this way:

awk '/^>/ {next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep

Example subdata:

>sequence.1
GTCAGTCAGTCA.GTCAGTCA

Result I want to get:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

5 Answers

11

It seems more natural to do this with sed:

sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

This would match ^> against the current line ("does this line start with a > character?"). If that expression does not match, the y command is used to change each dot in that line to X.

Testing:

$ cat Sfr.pep
>sequence.1
GTCAGTCAGTCA.GTCAGTCA
$ sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep
$ cat Sfr2.pep
>sequence.1
GTCAGTCAGTCAXGTCAGTCA

The main issue with your awk code is that next is executed whenever you come across a FASTA header line. This means that your code only outputs the sequence data, without the headers. The sequence data itself should look fine, but without the headers it is not much use.

Simply negating the test and dropping the next block (or preceding next with print) would fix your awk code. In my personal opinion, though, using the y command in sed is more elegant than using gsub() (or s///g in sed) for transliterating single characters.
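To make the difference concrete, here is a minimal sketch that runs the original awk program and the negated-test variant on the example data from the question. The function names original and fix_negated are hypothetical and exist only for this demonstration; the input is piped in rather than read from Sfr.pep.

```shell
# Example data from the question: one header line and one sequence line.
sample='>sequence.1
GTCAGTCAGTCA.GTCAGTCA'

# The program from the question: "next" skips header lines before the
# trailing 1 (an always-true pattern that prints the line) is reached,
# so headers never appear in the output.
original() {
    awk '/^>/ {next} {gsub(/\./,"X")}1'
}

# Negated test, no "next": gsub() runs only on non-header lines,
# and the trailing 1 prints every line, headers included.
fix_negated() {
    awk '!/^>/ { gsub(/\./, "X") } 1'
}

printf '%s\n' "$sample" | original      # prints the sequence line only
printf '%s\n' "$sample" | fix_negated   # prints header and fixed sequence
```

The first call prints only GTCAGTCAGTCAXGTCAGTCA, while the second also keeps the >sequence.1 header, dot intact.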

  • The reason I wanted to try awk specifically is that I had tried sed before: sed -e '/^>/!s/\./X/'. Although this worked for most of the files, two did not get edited. However, with your sed code it worked! Thanks! Commented Apr 24, 2020 at 7:19
  • @TUnix That's because you lack the /g at the end of the s command. Your sed approach would only replace the first dot on each line, as if you had used sub() in place of gsub() in awk. Commented Apr 24, 2020 at 7:26
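The effect of the missing g flag can be checked with a small sketch. The input line with two dots is made up for illustration, and the function names first_only and all_dots are hypothetical.

```shell
# Hypothetical sequence line containing two dots.
line='GTCA.GTCA.GTCA'

# Without the g flag, s replaces only the first dot on each line.
first_only() {
    sed '/^>/!s/\./X/'
}

# With the g flag, s replaces every dot on each line.
all_dots() {
    sed '/^>/!s/\./X/g'
}

printf '%s\n' "$line" | first_only   # GTCAXGTCA.GTCA
printf '%s\n' "$line" | all_dots     # GTCAXGTCAXGTCA
```

Files whose sequence lines contain at most one dot come out the same either way, which would explain why only some files appeared to be left unedited.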
9

You can try with:

awk '!/^>/ { gsub(/\./, "X") }1' Sfr.pep > Sfr2.pep

Output:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA
2

You're not printing the lines that begin with >; you're only printing the lines in which you perform the substitution. Use the print command to print the header line before skipping to the next line.

awk '/^>/ {print;next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep
0

Using Raku (née Perl_6), and/or Perl:

raku -pe 's:g/^^ <-[>]>  <-[.]>*?  <(\.)> /X/;'

OR (maybe more readable):

raku -pe 's:g{ ^^ <-[>]>  <-[.]>*?  <(\.)> } = Q{X};'

Raku is called at the shell command line with the -pe autoprint flags. The s/// in-place substitution operator is used here, in two guises. The first is the classic form, while the second is Raku's update to the Perl(5) s{...}{...}; idiom.

Briefly, reading the atoms left to right: ^^ anchors the match at the start of the line; <-[>]> requires a single character that is not ">"; <-[.]>*? non-greedily skips zero or more characters that are not "."; and \. matches a literal ".". The <( ... )> capture markers around \. drop everything else from the match, so only the "." itself is replaced with "X". The :g adverb asks for global matching, and -p autoprints every line, changed or not. Note, however, that because the pattern is anchored at the start of the line it can match only once per line, so only the first "." on a sequence line is replaced.

Speaking of the Perl5 lineage, here's the P5 cognate of the second Raku example above (but the -pE command line flag might be better on older installs). Because of the ^ anchor, it too replaces at most one "." per line:

perl -pe 's{^ [^>]  [^.]*?  \K\. }{X}gx;'

(Special thanks to @Sinan Ünür for P5 guidance, link below).

https://stackoverflow.com/a/15578028/7270649
https://stackoverflow.com/a/24542792/7270649
https://docs.raku.org/language/operators#s///_in-place_substitution
https://raku.org/

0
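#!/usr/bin/env python3
import re

header = re.compile(r'^>')  # FASTA header lines start with ">"
dot = re.compile(r'\.')     # a literal dot

with open('file') as fh:
    for line in fh:
        line = line.rstrip('\n')
        if header.search(line):
            print(line)                # header: print unchanged
        else:
            print(dot.sub('X', line))  # sequence: replace every dot

Output: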

>sequence.1
GTCAGTCAGTCAXGTCAGTCA
