AWK to replace character for lines not starting with ">"

Question

I'm working with sequence data and I stupidly cannot find the correct way to replace "." by "X" in lines not starting with ">" using awk. I really need to use awk and not sed.

I got this far, but simply all "." are replaced in this way:

awk '/^>/ {next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep

Example subdata:

>sequence.1
GTCAGTCAGTCA.GTCAGTCA

Result I want to get:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

Kusalananda · Accepted Answer · 2020-04-23 19:34:52Z

11

It seems more natural to do this with sed:

sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

This would match ^> against the current line ("does this line start with a > character?"). If that expression does not match, the y command is used to change each dot in that line to X.

Testing:

$ cat Sfr.pep
>sequence.1
GTCAGTCAGTCA.GTCAGTCA

$ sed '/^>/!y/./X/' Sfr.pep >Sfr2.pep

$ cat Sfr2.pep
>sequence.1
GTCAGTCAGTCAXGTCAGTCA

The main issue with your awk code is that next is executed whenever you come across a fasta header line. This means that your code only produces sequence data, without headers. The sequence data itself would look fine, but that is not much help without the headers.
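To see the effect, here is a small sketch that runs the command from the question on a two-record sample. The temporary file and its records are made up for illustration; they are not from the original data.

```shell
# Hypothetical sample input (records invented for this sketch).
sample=$(mktemp)
printf '%s\n' '>sequence.1' 'GTCA.GTCA' '>sequence.2' 'GT.CA.GT' > "$sample"

# The command from the question: `next` skips header lines before the
# trailing `1` (the print rule) is reached, so the headers are dropped.
broken=$(awk '/^>/ {next} {gsub(/\./,"X")}1' "$sample")

# Only the two sequence lines are printed, with their dots replaced.
printf '%s\n' "$broken"
rm -f "$sample"
```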

Simply negating the test and dropping the next block (or preceding the next with print) would solve it in awk. That said, and this is my personal opinion, using the y command in sed is more elegant than using gsub() (or s///g in sed) for transliterating single characters.
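The two awk fixes mentioned here can be sketched as follows and compared with the sed y command. The sample record is made up for illustration, and the variable names are only there for this sketch.

```shell
# Hypothetical sample input (record invented for this sketch).
sample=$(mktemp)
printf '%s\n' '>sequence.1' 'GTCA.GTCA' > "$sample"

# Fix 1: negate the test and drop the next block, so gsub() runs only
# on non-header lines and the trailing 1 prints every line.
negated=$(awk '!/^>/ { gsub(/\./, "X") } 1' "$sample")

# Fix 2: keep next, but print the header before skipping ahead.
printed=$(awk '/^>/ { print; next } { gsub(/\./, "X") } 1' "$sample")

# For comparison: sed's y command on every line that is not a header.
via_sed=$(sed '/^>/!y/./X/' "$sample")

printf '%s\n' "$negated"
rm -f "$sample"
```

All three commands should leave the header untouched and replace the dot in the sequence line.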

edited Apr 23, 2020 at 19:34

answered Apr 23, 2020 at 17:25

Kusalananda♦

The reason why I would like to try with AWK specifically is because I tried it with sed before: { sed -e '/^>/!s/\./X/' } And although this worked for most of the files, two did not get edited. However, with your sed code it worked! Thanks!

– TUnix, commented Apr 24, 2020 at 7:19
3

@TUnix That's because you lack the /g at the end of the s command. Your sed approach would only replace the first dot on each line, as if you had used sub() in place of gsub() in awk.

– Kusalananda ♦, commented Apr 24, 2020 at 7:26
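The difference described in this comment shows up as soon as a line contains more than one dot. A minimal sketch, using a made-up sequence line:

```shell
# Hypothetical sequence line with two dots (invented for this sketch).
line='GT.CA.GT'

# Without the g flag, s replaces only the first dot on the line.
first_only=$(printf '%s\n' "$line" | sed '/^>/!s/\./X/')

# With the g flag, every dot is replaced.
all_dots=$(printf '%s\n' "$line" | sed '/^>/!s/\./X/g')

# The awk counterparts: sub() replaces once, gsub() replaces all.
awk_first=$(printf '%s\n' "$line" | awk '{ sub(/\./, "X") } 1')
awk_all=$(printf '%s\n' "$line" | awk '{ gsub(/\./, "X") } 1')

printf '%s\n' "$first_only" "$all_dots" "$awk_first" "$awk_all"
```

A file whose sequence lines never contain more than one dot would look correct either way, which may be why most of the files appeared to be edited properly.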

schrodingerscatcuriosity · Accepted Answer · 2020-04-23 17:16:42Z

9

You can try with:

awk '!/^>/ { gsub(/\./, "X") }1' Sfr.pep > Sfr2.pep

Output:

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

answered Apr 23, 2020 at 17:16

schrodingerscatcuriosity

Barmar · Accepted Answer · 2020-04-24 16:39:46Z

2

You're not printing the lines that begin with >; you're only printing the lines on which you perform the substitution. Use print to output the header line before skipping to the next line.

awk '/^>/ {print;next} {gsub(/\./,"X")}1' Sfr.pep > Sfr2.pep

answered Apr 24, 2020 at 16:39

Barmar

jubilatious1 · Accepted Answer · 2021-09-20 19:00:18Z

Using Raku (née Perl_6), and/or Perl:

raku -pe 's:g/^^ <-[>]>  <-[.]>*?  <(\.)> /X/;'

OR (maybe more readable):

raku -pe 's:g{ ^^ <-[>]>  <-[.]>*?  <(\.)> } = Q{X};'

Raku is called at the shell command line with the -pe flags (-p autoprints each line, -e runs the given code). The s/// in-place substitution operator is used here in two guises: the first is the classic form, while the second is Raku's update to the Perl 5 s{...}{...}; idiom.

Briefly, reading the atoms left to right: ^^ anchors at the start of the line; <-[>]> matches one character that is not ">"; <-[.]>*? non-greedily matches zero or more characters that are not "."; and <(\.)> matches a literal ".", with the <( and )> capture markers restricting the reported match to that dot, so only it is replaced by "X". The :g adverb requests global replacement, and -p autoprints every line, changed or not. Note that because the pattern is anchored at the start of the line, it can match at most once per line, so only the first "." after the first character is replaced.

Speaking of the Perl5 lineage, here's how you would do the P5 cognate to the second Raku example above (but the -pE command line flag might be better on older installs):

perl -pe 's{^ [^>]  [^.]*?  \K\. }{X}gx;'

(Special thanks to @Sinan Ünür for P5 guidance, link below).

https://stackoverflow.com/a/15578028/7270649
https://stackoverflow.com/a/24542792/7270649
https://docs.raku.org/language/operators#s///_in-place_substitution
https://raku.org/

Praveen Kumar BS · Accepted Answer · 2021-09-20 19:32:38Z

0

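#!/usr/bin/env python3
# Replace "." with "X" on every line that is not a FASTA header (">").
with open('file') as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Header line: print unchanged.
            print(line)
        else:
            # Sequence line: replace every dot.
            print(line.replace('.', 'X'))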

output

>sequence.1
GTCAGTCAGTCAXGTCAGTCA

answered Sep 20, 2021 at 19:32

Praveen Kumar BS
